RESEARCH
The field has converged on human video as the foundation for robot learning. The remaining question is quality.
THE CONVERGENCE ON HUMAN VIDEO
Every major robotics lab that published results in 2025-2026 is using egocentric human video as the pretraining foundation layer before touching robot demonstrations. NVIDIA discovered a log-linear scaling law. Physical Intelligence found that at sufficient scale, human hands and robot grippers converge to the same internal representation. Multiple companies have demonstrated that human video alone can bootstrap manipulation policies.
The question has shifted from “should we use human video?” to “how much, how to bridge the embodiment gap, and how to combine it with targeted robot data?” That shift is a consensus, not a trend.
But vision-only capture misses half the story. In contact-rich benchmarks, vision-only systems average a 21% success rate on tasks like insertion and assembly. Adding tactile feedback pushes that to 71%. For the manipulation tasks that matter most — force-sensitive, bimanual, contact-rich — the data must include what cameras cannot see.
SCALING EGOCENTRIC VIDEO FOR ROBOT LEARNING
How large-scale human video trains better robot policies
EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data
NVIDIA GEAR Lab
First published scaling law for human video in robotics. 20,854 hours of egocentric video. Log-linear relationship between data volume and policy performance (R²=0.9983). Two-stage transfer recipe: massive human pretraining followed by lightweight robot mid-training.
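A log-linear scaling law means policy performance grows linearly with the logarithm of pretraining hours. The sketch below fits such a trend with NumPy; the hours and success rates are illustrative placeholders, not figures from EgoScale, and only the R² computation mirrors how such a fit would be evaluated.

```python
import numpy as np

# Hypothetical (pretraining hours, success rate) pairs illustrating a
# log-linear trend; these numbers are placeholders, not EgoScale results.
hours = np.array([100, 500, 2_000, 8_000, 20_000], dtype=float)
success = np.array([0.22, 0.35, 0.46, 0.57, 0.64])

# Fit: success = a * log10(hours) + b
a, b = np.polyfit(np.log10(hours), success, deg=1)

# Goodness of fit (R^2) on the same points
pred = a * np.log10(hours) + b
ss_res = np.sum((success - pred) ** 2)
ss_tot = np.sum((success - success.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"slope={a:.3f}, intercept={b:.3f}, R^2={r2:.4f}")
```

A high R² on log-transformed data is what "log-linear scaling" cashes out to: each multiplicative increase in data buys a roughly constant additive gain in performance.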
In-N-On: Scaling Egocentric Manipulation with In-the-Wild and On-Task Data
Scales egocentric manipulation by curating 1,000+ hours into in-the-wild and on-task datasets. Trains Human0 — a language-conditioned flow matching policy with domain adaptation for few-shot learning and robust humanoid transfer.
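A flow matching policy learns a velocity field that transports noise samples to demonstrated actions. The sketch below shows only the conditional flow-matching regression target in NumPy; the dimensions and the random linear "network" are hypothetical stand-ins, not Human0's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
action_dim = 7   # illustrative action size, e.g. end-effector pose + gripper
batch = 32

# Toy stand-ins: demonstrated actions (x1) and Gaussian noise sources (x0)
x1 = rng.normal(size=(batch, action_dim))
x0 = rng.normal(size=(batch, action_dim))
t = rng.uniform(size=(batch, 1))             # interpolation times in [0, 1]

# Linear interpolation path between noise and action,
# whose velocity along the path is constant: x1 - x0.
x_t = (1 - t) * x0 + t * x1
v_target = x1 - x0

# Hypothetical velocity model: a random linear map over (x_t, t),
# standing in for a language-conditioned network.
W = rng.normal(scale=0.1, size=(action_dim + 1, action_dim))
v_pred = np.concatenate([x_t, t], axis=1) @ W

# Flow-matching loss: regress predicted velocity onto the target velocity.
loss = np.mean((v_pred - v_target) ** 2)
print(f"flow-matching loss: {loss:.3f}")
```

Training minimizes this loss over many (noise, action, time) samples; at inference, integrating the learned velocity field from noise yields an action.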
Masquerade: Learning from In-the-Wild Human Videos using Data-Editing
Closes the visual embodiment gap by inpainting human arms with rendered robot overlays. Co-training on edited human video achieves 5-6x better zero-shot generalization in novel environments with only 50 robot demos per task.
EgoMimic: Scaling Imitation Learning via Egocentric Video
Full-stack framework pairing egocentric human video with 3D hand tracking via Project Aria. Treats human and robot data equally, co-training a unified policy that substantially improves long-horizon task performance.
EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos
VLA model leveraging massive egocentric human video to address data scarcity. Maps human demonstrations to a unified action space using the MANO hand model. Enables zero-shot generalization and humanoid transfer with minimal fine-tuning.
CROSSING THE EMBODIMENT GAP
Transferring human manipulation skills to robot hardware
Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration
HUMAN2SIM2ROBOT framework trains robust dexterous manipulation policies from a single human RGB-D video. Extracts object 6D pose trajectory and pre-manipulation hand pose to guide robot configuration in a digital twin, enabling zero-shot real-world deployment.
HERMES: Human-to-Robot Embodied Learning from Multi-Source Motion Data
Four-stage pipeline for mobile robots with dual dexterous hands. Integrates learning from teleoperation, mocap, and unstructured raw video with vision-based sim-to-real transfer and closed-loop navigation.
UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations
Learns universal, embodiment-agnostic skill representations from massive datasets of unaligned human and robot videos. Inverse and forward skill dynamics models allow robots to imitate complex compositional behaviors without paired demonstrations.

VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos
Extracts temporally consistent 3D hand trajectories from monocular RGB-only video using depth models and structure-from-motion. Coarse-to-fine affordance learning enables zero-shot deployment across novel scenes and robot embodiments.
Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids
Practical recipe for multi-fingered bimanual manipulation using sim-to-real RL. Overcomes environment modeling gaps through autotuned robot modeling, hybrid object representations, and generalizable contact-goal reward designs.
THE MISSING MODALITY: TOUCH
Why vision alone is insufficient for contact-rich manipulation
OpenTouch: Bringing Full-Hand Touch to Real-World Interaction
First in-the-wild egocentric full-hand tactile dataset. 5.1 hours of synchronized video, 3D hand pose, and dense touch signals captured with an open-source tactile sensing glove. Establishes benchmarks for cross-sensory retrieval and demonstrates how multimodal touch data grounds robotic perception.
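Cross-sensory retrieval asks: given a video clip, find the matching tactile segment (or vice versa) in a shared embedding space. The sketch below uses random vectors as stand-ins for trained encoder outputs and scores retrieval by cosine similarity; the sizes and the noisy-copy construction are illustrative assumptions, not OpenTouch's protocol.

```python
import numpy as np

rng = np.random.default_rng(1)
n_clips, dim = 100, 64   # illustrative dataset size and embedding width

# Stand-in embeddings: in practice these come from trained video and
# tactile encoders projected into a shared space.
video_emb = rng.normal(size=(n_clips, dim))
# Make each tactile embedding a noisy copy of its paired video embedding.
touch_emb = video_emb + 0.1 * rng.normal(size=(n_clips, dim))

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Cosine similarity between every video clip and every tactile segment
sims = normalize(video_emb) @ normalize(touch_emb).T

# Retrieval: for each video clip, take the best-matching tactile segment
nearest = sims.argmax(axis=1)
recall_at_1 = np.mean(nearest == np.arange(n_clips))
print(f"recall@1: {recall_at_1:.2f}")
```

Recall@1, the fraction of queries whose top match is the true pair, is a standard metric for this kind of benchmark.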
MAPLE: Encoding Dexterous Robotic Manipulation Priors from Egocentric Videos
Learns manipulation priors from large-scale egocentric video to predict object contact points and hand poses at the moment of contact. Significantly improves sample efficiency for downstream dexterous manipulation in simulation and real-world settings.