(MM '23) POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World
Abstract
We humans are good at translating third-person observations of hand-object interactions (HOI) into an egocentric view. However, current methods struggle to replicate this ability of view adaptation from third-person to first-person. Although some approaches attempt to learn view-agnostic representations from large-scale video datasets, they ignore the relationships among multiple third-person views. To this end, we propose a Prompt-Oriented View-agnostic learning (POV) framework in this paper, which enables this view adaptation with few egocentric videos. Specifically, we introduce interactive masking prompts at the frame level to capture fine-grained action information, and view-aware prompts at the token level to learn view-agnostic representations. To verify our method, we establish two benchmarks for transferring from multiple third-person views to the egocentric view. Our extensive experiments on these benchmarks demonstrate the efficiency and effectiveness of our POV framework and prompt tuning techniques in terms of view adaptation and view generalization.
Method
POV Framework
Our prompt-oriented view-agnostic learning framework trains a model through two optimization sub-tasks and one optional sub-task: (1) prompt-based action understanding, which incorporates interactive masking prompts into frames to pre-train the entire model on third-person videos; (2) view-agnostic prompt tuning, where only the view-aware prompts are fine-tuned through a cross-view alignment loss and a cross-entropy loss; and (3) egocentric fine-tuning, where the model is optionally fine-tuned on limited egocentric videos.
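The two sketches below illustrate how sub-tasks (1) and (2) could be realized in code. They are minimal PyTorch sketches under assumed shapes and a toy transformer backbone, not our released implementation; the names (apply_interactive_mask, PromptedEncoder, prompt_tuning_step), the patch size, the feature pooling, and the loss weighting are all illustrative assumptions.

First, a sketch of the frame-level interactive masking prompt: patches overlapping an assumed hand-object interaction box are masked before the frames enter the video backbone.

```python
# Minimal sketch of a frame-level interactive masking prompt (assumed inputs:
# per-frame hand-object interaction boxes; box source and patch size are assumptions).
import torch

def apply_interactive_mask(frames, hoi_boxes, patch=16, mask_value=0.0):
    """frames: (T, C, H, W); hoi_boxes: (T, 4) as x1, y1, x2, y2 in pixels."""
    t, c, h, w = frames.shape
    masked = frames.clone()
    for i in range(t):
        x1, y1, x2, y2 = hoi_boxes[i].tolist()
        # snap the interaction box to the patch grid so whole patches are masked
        x1, y1 = int(x1 // patch) * patch, int(y1 // patch) * patch
        x2 = min(w, (int(x2 // patch) + 1) * patch)
        y2 = min(h, (int(y2 // patch) + 1) * patch)
        masked[i, :, y1:y2, x1:x2] = mask_value
    return masked

frames = torch.rand(8, 3, 224, 224)                       # 8 frames of a third-person clip
boxes = torch.tensor([[80, 96, 160, 176]] * 8, dtype=torch.float)
masked_frames = apply_interactive_mask(frames, boxes)
```

Second, a sketch of view-agnostic prompt tuning: learnable view-aware prompt tokens are prepended to the token sequence, the backbone and classifier stay frozen, and the prompts are optimized with a cross-entropy loss plus a cross-view alignment loss that pulls together features of the same clip observed from two third-person views.

```python
# Minimal sketch of view-agnostic prompt tuning (toy backbone, assumed pooling
# strategy and equal loss weighting; only the prompt tokens receive gradient updates).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedEncoder(nn.Module):
    """Toy stand-in for a pre-trained video transformer with view-aware prompt tokens."""
    def __init__(self, dim=256, num_prompts=8, num_classes=100, depth=4, heads=4):
        super().__init__()
        # view-aware prompts at the token level (the only trainable parameters here)
        self.view_prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # frozen backbone stand-in
        self.head = nn.Linear(dim, num_classes)                        # frozen action classifier

    def forward(self, tokens):                         # tokens: (B, N, dim) video tokens
        prompts = self.view_prompts.unsqueeze(0).expand(tokens.size(0), -1, -1)
        x = self.encoder(torch.cat([prompts, tokens], dim=1))  # prepend prompts to the sequence
        feat = x[:, 0]                                 # pool from the first prompt token (assumption)
        return feat, self.head(feat)

def prompt_tuning_step(model, opt, tokens_v1, tokens_v2, labels):
    """One tuning step on a clip observed from two synchronized third-person views."""
    f1, logits1 = model(tokens_v1)
    f2, logits2 = model(tokens_v2)
    # cross-view alignment: pull features of the same clip from different views together
    align_loss = 1.0 - F.cosine_similarity(f1, f2).mean()
    ce_loss = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)
    loss = ce_loss + align_loss                        # equal weighting is an assumption
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

model = PromptedEncoder()
for p in list(model.encoder.parameters()) + list(model.head.parameters()):
    p.requires_grad_(False)                            # freeze everything except the prompts
opt = torch.optim.AdamW([model.view_prompts], lr=1e-3)

tokens_v1 = torch.rand(4, 32, 256)                     # toy tokens of the same clips from two views
tokens_v2 = torch.rand(4, 32, 256)
labels = torch.randint(0, 100, (4,))
prompt_tuning_step(model, opt, tokens_v1, tokens_v2, labels)
```

Tuning only the prompt tokens keeps the number of trainable parameters small, which matches the efficiency motivation of the framework and makes the later adaptation with few egocentric videos practical.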
Specific Pipeline
Figure: the training and inference pipeline of our POV on Assembly101 and its view split.
Figure: the training and inference pipeline of our POV on H2O and its view split.
Experiment
Quantitative Results
We evaluate our POV framework on two benchmarks built on Assembly101 and H2O, covering both view adaptation and view generalization. Table 1 and Table 2 show the view adaptation results of our POV framework on Assembly101 and H2O, respectively. Our POV framework achieves the best performance on both benchmarks.
The results of our POV on the Assembly101 and H2O datasets after fine-tuning on unlabeled egocentric videos:
The results of our POV on the Assembly101 and H2O datasets after fine-tuning on labeled egocentric videos:
Table 3 and Table 4 show the view generalization results on the egocentric view and the third-person view, respectively. Our POV framework achieves promising performance on both benchmarks.