RESEARCH
The field has converged on human video as the foundation for robot learning. The remaining question is quality.
THE CONVERGENCE ON HUMAN VIDEO
Every major robotics lab publishing in 2025-2026 uses egocentric human video as the pretraining foundation. NVIDIA's GEAR Lab reported a log-linear scaling law relating human-video volume to policy performance. The question shifted from “should we use human video?” to “how much, and how to combine it with robot data?”
But vision alone misses half the story. On contact-rich tasks, vision-only policies average 21% success; adding tactile sensing raises that to 71%, a gain of 50 percentage points. For force-sensitive, bimanual manipulation, the data must include what cameras cannot see.
SCALING EGOCENTRIC VIDEO FOR ROBOT LEARNING
How large-scale human video trains better robot policies
EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data
NVIDIA GEAR Lab
First scaling law for human video in robotics. Across 20,854 hours of video, policy performance follows a log-linear relationship with data volume (R²=0.9983). Massive human pretraining plus lightweight robot mid-training.
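To make "log-linear" concrete, here is a minimal sketch of how such a fit and its R² could be computed with ordinary least squares on the logarithm of data volume. The function name `fit_log_linear` and the data points are illustrative assumptions; the numbers are synthetic placeholders, not results from the paper, and the paper's own fitting procedure may differ.

```python
import math

def fit_log_linear(hours, scores):
    """Least-squares fit of score = a + b * log10(hours).

    Returns (a, b, r2), where r2 is the coefficient of determination.
    """
    xs = [math.log10(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx                      # slope per decade of data
    a = mean_y - b * mean_x            # intercept
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, scores))
    ss_tot = sum((y - mean_y) ** 2 for y in scores)
    r2 = 1.0 - ss_res / ss_tot
    return a, b, r2

# Synthetic placeholder data (assumption, NOT taken from EgoScale).
hours = [100, 1000, 10000]
scores = [0.2, 0.4, 0.6]
a, b, r2 = fit_log_linear(hours, scores)
print(f"score = {a:.3f} + {b:.3f} * log10(hours), R^2 = {r2:.4f}")
```

Under a fit like this, each tenfold increase in hours adds a constant `b` to the score, which is why returns look diminishing on a linear axis and straight on a log axis.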
In-N-On: Scaling Egocentric Manipulation with In-the-Wild and On-Task Data
1,000+ hours curated into in-the-wild and on-task datasets. Trains Human0, a language-conditioned flow matching policy with few-shot humanoid transfer.
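For readers unfamiliar with flow matching, the sketch below builds one training example under the commonly used straight-line (linear interpolation) path between noise and a demonstrated action. This is an assumption about the general technique, not Human0's actual formulation: the names `flow_matching_pair` and `squared_error` are hypothetical, and the image and language conditioning a real policy would use is omitted.

```python
import random

def flow_matching_pair(action, rng):
    """Build one training example for a linear-path flow matching objective.

    Returns (x_t, t, target_velocity): x_t lies on the straight line between
    a Gaussian noise sample (t = 0) and the demonstrated action (t = 1), and
    target_velocity is the constant velocity along that line.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target_velocity = [a - n for n, a in zip(noise, action)]
    return x_t, t, target_velocity

def squared_error(predicted, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

rng = random.Random(0)
demo_action = [0.1, -0.3, 0.5]  # hypothetical 3-D action, for illustration only
x_t, t, v = flow_matching_pair(demo_action, rng)

# A network v_theta(x_t, t, observation, instruction) would be trained to
# minimize squared_error(v_theta(...), v); here we only check the geometry.
recovered = [x + (1.0 - t) * vel for x, vel in zip(x_t, v)]
```

Following the target velocity from `x_t` for the remaining time `1 - t` lands exactly on the demonstrated action, which is the property sampling relies on at inference time.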
Masquerade: Learning from In-the-Wild Human Videos using Data-Editing
Inpaints human arms with rendered robot overlays to close the visual embodiment gap. 5-6x better zero-shot generalization with only 50 robot demos per task.
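The overlay step can be pictured as per-pixel alpha compositing of a rendered robot image onto the video frame. The sketch below is an illustrative assumption rather than Masquerade's pipeline: `composite_overlay` is a hypothetical helper, the images are tiny grayscale grids, and the inpainting model that removes the human arm beforehand is not shown.

```python
def composite_overlay(frame, robot_render, alpha_mask):
    """Blend a rendered robot image over a frame, pixel by pixel.

    All three inputs are 2-D lists of equal shape holding grayscale values.
    alpha_mask is 1.0 where the render fully replaces the frame and 0.0
    where the original frame is kept.
    """
    return [
        [a * r + (1.0 - a) * f for f, r, a in zip(f_row, r_row, a_row)]
        for f_row, r_row, a_row in zip(frame, robot_render, alpha_mask)
    ]

# Toy 2x2 example with made-up pixel values.
frame = [[0.2, 0.2], [0.2, 0.2]]
robot_render = [[0.9, 0.9], [0.9, 0.9]]
alpha_mask = [[1.0, 0.0], [0.5, 0.0]]
edited = composite_overlay(frame, robot_render, alpha_mask)
```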
EgoMimic: Scaling Imitation Learning via Egocentric Video
Pairs egocentric human video with 3D hand tracking via Project Aria. Co-trains a unified human-robot policy that improves long-horizon task performance.
EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
VLA model that maps human demonstrations to a unified action space via the MANO hand model. Zero-shot generalization and humanoid transfer with minimal fine-tuning.
CROSSING THE EMBODIMENT GAP
Transferring human manipulation skills to robot hardware
Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration
Trains dexterous manipulation policies from a single human RGB-D video. Extracts 6D pose trajectory to guide robot configuration in a digital twin. Zero-shot real-world deployment.
HERMES: Human-to-Robot Embodied Learning from Multi-Source Motion Data
Four-stage pipeline integrating teleoperation, mocap, and raw video for mobile robots with dexterous hands. Vision-based sim-to-real transfer.
UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations
Learns embodiment-agnostic skill representations from unaligned human and robot videos. Robots imitate complex behaviors without paired demonstrations.
VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos
Extracts 3D hand trajectories from monocular RGB using depth models and SfM. Coarse-to-fine affordance learning enables zero-shot cross-embodiment deployment.
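The core geometric step behind lifting 2D hand detections to 3D is pinhole back-projection: given a pixel, a depth estimate, and camera intrinsics, recover a point in the camera frame. The sketch below assumes that standard model. The helper names and intrinsics values are hypothetical, and it ignores what a full pipeline must also handle, such as the scale ambiguity of monocular depth and the camera poses that SfM provides.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3-D point in the camera frame.

    Uses the pinhole model: fx, fy are focal lengths in pixels and
    (cx, cy) is the principal point.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def lift_trajectory(pixels, depths, intrinsics):
    """Back-project a sequence of 2-D hand positions into a 3-D trajectory."""
    fx, fy, cx, cy = intrinsics
    return [backproject(u, v, d, fx, fy, cx, cy)
            for (u, v), d in zip(pixels, depths)]

# Hypothetical intrinsics and detections, for illustration only.
intrinsics = (500.0, 500.0, 320.0, 240.0)        # fx, fy, cx, cy
pixels = [(320.0, 240.0), (420.0, 240.0)]        # hand keypoint per frame
depths = [2.0, 1.0]                              # estimated depth in meters
trajectory = lift_trajectory(pixels, depths, intrinsics)
```

A pixel at the principal point maps to a point straight ahead of the camera, and horizontal offsets scale with depth, so the same pixel displacement means a larger metric displacement for a farther hand.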
Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids
Practical sim-to-real RL recipe for multi-fingered bimanual manipulation. Autotuned robot modeling and generalizable contact-goal reward designs.
THE MISSING MODALITY: TOUCH
Why vision alone is insufficient for contact-rich manipulation
OpenTouch: Bringing Full-Hand Touch to Real-World Interaction
First in-the-wild egocentric full-hand tactile dataset. 5.1 hours of synchronized video, 3D hand pose, and dense touch via an open-source sensing glove.
MAPLE: Encoding Dexterous Robotic Manipulation Priors from Egocentric Videos
Learns manipulation priors from egocentric video to predict contact points and hand poses at contact. Improves sample efficiency for dexterous manipulation in sim and real.