EgoMimic: Scaling Imitation Learning via Egocentric Video

Georgia Tech, Stanford University

Abstract

The scale and diversity of demonstration data required for imitation learning pose a significant challenge. We present EgoMimic, a full-stack framework that scales manipulation through egocentric-view human demonstrations. EgoMimic achieves this through: (1) an ergonomic human data collection system using the Project Aria glasses, (2) a low-cost bimanual manipulator that minimizes the kinematic gap to human data, (3) cross-domain data alignment techniques, and (4) an imitation learning architecture that co-trains on hand and robot data. Compared to prior works that only extract high-level intent from human videos, our approach treats human and robot data equally as embodied demonstration data and learns a unified policy from both data sources. EgoMimic achieves significant improvements over state-of-the-art imitation learning methods on a diverse set of long-horizon, single-arm and bimanual manipulation tasks, and enables generalization to entirely new scenes. Finally, we show that adding 1 hour of hand data is significantly more valuable than adding 1 hour of robot data.

Method



EgoMimic. (a) A stack to collect human data for manipulation using the Project Aria glasses. (b) A low-cost and capable humanoid that can readily leverage egocentric manipulation data. Our approach lets us collect large-scale human data whose observations closely match those seen by the robot.

Unified Policy from Human and Robot Data. The model processes normalized hand and robot data through a shared vision encoder and policy network, producing two separate predictions: pose predictions for both human and robot data, and joint actions for robot data only. The framework uses masked images to mitigate the human-robot appearance gap and incorporates wrist camera views for the robot.
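To make the co-training setup concrete, below is a minimal PyTorch-style sketch of the shared-trunk, two-head idea. It is our own illustration, not the released implementation: the module names, feature sizes, L1 losses, and the boolean is_robot mask are all assumptions, and the real policy additionally uses masked images and wrist camera views as described above.

import torch
import torch.nn as nn
import torchvision.models as models

class UnifiedPolicy(nn.Module):
    """Shared vision encoder + policy trunk with two heads (illustrative sketch)."""
    def __init__(self, pose_dim=12, joint_dim=14, hidden=512):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()          # shared vision encoder (512-d features)
        self.encoder = backbone
        self.trunk = nn.Sequential(          # shared policy network
            nn.Linear(512, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.pose_head = nn.Linear(hidden, pose_dim)    # supervised on human AND robot data
        self.joint_head = nn.Linear(hidden, joint_dim)  # supervised on robot data only

    def forward(self, images):
        feats = self.trunk(self.encoder(images))
        return self.pose_head(feats), self.joint_head(feats)

def cotrain_loss(model, images, pose_target, joint_target, is_robot):
    # is_robot: (B,) boolean mask marking which samples come from the robot.
    pose_pred, joint_pred = model(images)
    loss = nn.functional.l1_loss(pose_pred, pose_target)   # applied to both domains
    if is_robot.any():                                      # joint-action loss: robot only
        loss = loss + nn.functional.l1_loss(joint_pred[is_robot], joint_target[is_robot])
    return loss

With this structure, human and robot samples can be mixed in the same batch; the joint-action loss simply drops out for human samples.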

Results

(A) In-domain Performance


Replanning and robustness to object and robot perturbations


(B) Generalization

EgoMimic generalizes to new scenes by co-training on robot data from the original scene and human data from the new scene.


EgoMimic shows impressive zero-shot generalization to new scenes.


(C) Scaling

Scaling hand data

Scaling hand data vs. scaling robot data


EgoMimic trained on 2 hours of robot data + 1 hour of hand data (blue) strongly outperforms ACT trained on 3 hours of robot data (orange) in the Continuous Object-in-Bowl task. This shows that scaling hand data with our method outperforms scaling an equivalent amount of robot data.
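One practical way to realize this kind of co-training mix (e.g., 2 hours of robot data plus 1 hour of hand data) is to balance sampling between the two sources in the data loader. The sketch below is a generic PyTorch recipe under our own assumptions; the hand_frac ratio and the dataset objects are placeholders, not the paper's exact training recipe.

import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def make_cotrain_loader(hand_ds, robot_ds, batch_size=64, hand_frac=0.5):
    # Sample so that roughly `hand_frac` of each batch comes from hand data,
    # independent of how many hours of each source were collected.
    mixed = ConcatDataset([hand_ds, robot_ds])
    weights = torch.cat([
        torch.full((len(hand_ds),), hand_frac / len(hand_ds)),
        torch.full((len(robot_ds),), (1.0 - hand_frac) / len(robot_ds)),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
    return DataLoader(mixed, batch_size=batch_size, sampler=sampler)

Re-weighting in the sampler keeps the cheaper, more abundant hand data from drowning out the robot data (or vice versa) as either source is scaled up.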

Conclusion

We present EgoMimic, a framework to co-train manipulation policies from human egocentric videos and teleoperated robot data. By leveraging Project Aria glasses, a low-cost bimanual robot setup, cross-domain alignment techniques, and a unified policy learning architecture, EgoMimic improves over state-of-the-art baselines on three challenging real-world tasks and shows generalization to new scenes as well as favorable scaling properties. For future work, we plan to explore generalizing to new robot embodiments and to entirely new behaviors demonstrated only in human data. Overall, we believe our work opens up exciting new avenues of research on scaling robot data via passive data collection.

BibTeX

@misc{kareer2024egomimicscalingimitationlearning,
      title={EgoMimic: Scaling Imitation Learning via Egocentric Video}, 
      author={Simar Kareer and Dhruv Patel and Ryan Punamiya and Pranay Mathur and Shuo Cheng and Chen Wang and Judy Hoffman and Danfei Xu},
      year={2024},
      eprint={2410.24221},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2410.24221},
    }