EgoMimic: Scaling Imitation Learning via Egocentric Video

Georgia Tech, Stanford University

Abstract

The scale and diversity of demonstration data required for imitation learning pose a significant challenge. We present EgoMimic, a full-stack framework that scales manipulation through egocentric-view human demonstrations. EgoMimic achieves this through: (1) an ergonomic human data collection system using the Project Aria glasses, (2) a low-cost bimanual manipulator that minimizes the kinematic gap to human data, (3) cross-domain data alignment techniques, and (4) an imitation learning architecture that co-trains on hand and robot data. Compared to prior works that only extract high-level intent from human videos, our approach treats human and robot data equally as embodied demonstration data and learns a unified policy from both data sources. EgoMimic achieves significant improvements over state-of-the-art imitation learning methods on a diverse set of long-horizon, single-arm and bimanual manipulation tasks, and enables generalization to entirely new scenes. Finally, we show that adding 1 hour of hand data is significantly more valuable than adding 1 hour of robot data.

Method



EgoMimic. (a) A stack to collect human data for manipulation using the Project Aria glasses. (b) A low-cost and capable humanoid that can readily leverage egocentric manipulation data. Our approach lets us collect large-scale human data whose observations closely match those seen by the robot.

Unified Policy from Human and Robot Data. The model processes normalized hand and robot data through a shared vision encoder and policy network, producing two separate predictions: pose predictions for both human and robot data, and joint actions for robot data only. The framework uses masked images to mitigate the human-robot appearance gap and incorporates wrist camera views for the robot.
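To make the co-training setup concrete, below is a minimal PyTorch-style sketch of the shared-trunk, two-head idea. It is our own illustration, not the released implementation: the module names, feature sizes, L1 losses, and the boolean is_robot mask are all assumptions, and the real policy additionally uses masked images and wrist camera views as described above.

import torch
import torch.nn as nn
import torchvision.models as models

class UnifiedPolicy(nn.Module):
    """Shared vision encoder + policy trunk with two heads (illustrative sketch)."""
    def __init__(self, pose_dim=12, joint_dim=14, hidden=512):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()          # shared vision encoder (512-d features)
        self.encoder = backbone
        self.trunk = nn.Sequential(          # shared policy network
            nn.Linear(512, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.pose_head = nn.Linear(hidden, pose_dim)    # supervised on human AND robot data
        self.joint_head = nn.Linear(hidden, joint_dim)  # supervised on robot data only

    def forward(self, images):
        feats = self.trunk(self.encoder(images))
        return self.pose_head(feats), self.joint_head(feats)

def cotrain_loss(model, images, pose_target, joint_target, is_robot):
    # is_robot: (B,) boolean mask marking which samples come from the robot.
    pose_pred, joint_pred = model(images)
    loss = nn.functional.l1_loss(pose_pred, pose_target)   # applied to both domains
    if is_robot.any():                                      # joint-action loss: robot only
        loss = loss + nn.functional.l1_loss(joint_pred[is_robot], joint_target[is_robot])
    return loss

With this structure, human and robot samples can be mixed in the same batch; the joint-action loss simply drops out for human samples.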

Results

(A) In-domain Performance


Replanning and robustness to object and robot perturbations


(B) Generalization

EgoMimic generalizes to new scenes by co-training on robot data from the original scene and human data from the new scene.


EgoMimic shows impressive zero-shot generalization to new scenes.


(C) Scaling

Scaling hand data

Scaling hand data vs. scaling robot data


EgoMimic trained on 2 hours of robot data + 1 hour of hand data (blue) strongly outperforms ACT trained on 3 hours of robot data (orange) in the Continuous Object-in-Bowl task. This shows that scaling hand data with our method outperforms scaling an equivalent amount of robot data.
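One practical way to realize this kind of co-training mix (e.g., 2 hours of robot data plus 1 hour of hand data) is to balance sampling between the two sources in the data loader. The sketch below is a generic PyTorch recipe under our own assumptions; the hand_frac ratio and the dataset objects are placeholders, not the paper's exact training recipe.

import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def make_cotrain_loader(hand_ds, robot_ds, batch_size=64, hand_frac=0.5):
    # Sample so that roughly `hand_frac` of each batch comes from hand data,
    # independent of how many hours of each source were collected.
    mixed = ConcatDataset([hand_ds, robot_ds])
    weights = torch.cat([
        torch.full((len(hand_ds),), hand_frac / len(hand_ds)),
        torch.full((len(robot_ds),), (1.0 - hand_frac) / len(robot_ds)),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
    return DataLoader(mixed, batch_size=batch_size, sampler=sampler)

Re-weighting in the sampler keeps the cheaper, more abundant hand data from drowning out the robot data (or vice versa) as either source is scaled up.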

Conclusion

We present EgoMimic, a framework to co-train manipulation policies from human egocentric videos and teleoperated robot data. By leveraging Project Aria glasses, a low-cost bimanual robot setup, cross-domain alignment techniques, and a unified policy learning architecture, EgoMimic improves over state-of-the-art baselines on three challenging real-world tasks and shows generalization to new scenes as well as favorable scaling properties. For future work, we plan to explore generalizing to new robot embodiments and to entirely new behaviors demonstrated only in human data. Overall, we believe our work opens up exciting new avenues of research on scaling robot data via passive data collection.

BibTeX

@misc{kareer2024egomimicscalingimitationlearning,
      title={EgoMimic: Scaling Imitation Learning via Egocentric Video}, 
      author={Simar Kareer and Dhruv Patel and Ryan Punamiya and Pranay Mathur and Shuo Cheng and Chen Wang and Judy Hoffman and Danfei Xu},
      year={2024},
      eprint={2410.24221},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2410.24221},
    }