RESEARCH

The field has converged on human video as the foundation for robot learning. The remaining question is quality.

THE CONVERGENCE ON HUMAN VIDEO

Every major robotics lab that published results in 2025-2026 is using egocentric human video as the pretraining foundation layer before touching robot demonstrations. NVIDIA discovered a log-linear scaling law. Physical Intelligence found that at sufficient scale, human hands and robot grippers converge to the same internal representation. Multiple companies have demonstrated that human video alone can bootstrap manipulation policies.

The question has shifted from “should we use human video?” to “how much, how to bridge the embodiment gap, and how to combine it with targeted robot data?” That shift is a consensus, not a trend.

But vision-only capture misses half the story. In contact-rich benchmarks, vision-only systems average a 21% success rate on tasks like insertion and assembly. Adding tactile feedback pushes that to 71%. For the manipulation tasks that matter most — force-sensitive, bimanual, contact-rich — the data must include what cameras cannot see.

SCALING EGOCENTRIC VIDEO FOR ROBOT LEARNING

How large-scale human video trains better robot policies

EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data

NVIDIA GEAR Lab

The first published scaling law for human video in robotics. Across 20,854 hours of egocentric video, policy performance scales log-linearly with data volume (R²=0.9983). Proposes a two-stage transfer recipe: massive human pretraining followed by lightweight robot mid-training.
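A log-linear law of this kind has the form success ≈ a·ln(hours) + b, and the reported R² measures how tightly the data hugs that line. A minimal numpy sketch of fitting and extrapolating such a law; the hour counts and success rates below are made-up illustrative numbers, not EgoScale's data:

```python
import numpy as np

# Illustrative (NOT the paper's) data: pretraining hours vs. task success.
hours = np.array([100, 500, 2000, 8000, 20000], dtype=float)
success = np.array([0.22, 0.35, 0.46, 0.57, 0.64])

# Fit the log-linear law: success ≈ a * ln(hours) + b.
a, b = np.polyfit(np.log(hours), success, deg=1)

# Goodness of fit (R²), analogous to the figure the paper reports for real data.
pred = a * np.log(hours) + b
ss_res = np.sum((success - pred) ** 2)
ss_tot = np.sum((success - success.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"success ≈ {a:.3f} * ln(hours) + {b:.3f}, R² = {r2:.4f}")

# The law's practical use: extrapolating expected gains from more data.
print(f"predicted success at 50,000 h: {a * np.log(50_000) + b:.2f}")
```

The point of such a fit is less the exact coefficients than the shape: each doubling of data buys a roughly constant increment in performance, which makes data-collection budgets plannable.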

In-N-On: Scaling Egocentric Manipulation with In-the-Wild and On-Task Data

Scales egocentric manipulation by curating 1,000+ hours into in-the-wild and on-task datasets. Trains Human0 — a language-conditioned flow matching policy with domain adaptation for few-shot learning and robust humanoid transfer.

Masquerade: Learning from In-the-Wild Human Videos using Data-Editing

Closes the visual embodiment gap by inpainting human arms with rendered robot overlays. Co-training on edited human video achieves 5-6x better zero-shot generalization in novel environments with only 50 robot demos per task.
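At its simplest, the data-editing step reduces to a masked composite: wherever a segmentation model marks human-arm pixels, the rendered robot overlay replaces them. A minimal sketch with synthetic arrays standing in for the real frame, render, and arm mask (the segmentation and rendering models themselves are upstream and out of scope here):

```python
import numpy as np

def composite(frame, robot_render, arm_mask):
    """Replace masked (human-arm) pixels with the rendered robot overlay."""
    mask = arm_mask[..., None].astype(bool)  # broadcast the mask over RGB channels
    return np.where(mask, robot_render, frame)

# Tiny synthetic example: a 4x4 RGB frame, a flat-gray "robot render",
# and a mask covering the left half (stand-ins for real model outputs).
frame = np.random.randint(0, 256, (4, 4, 3), dtype=np.uint8)
robot = np.full((4, 4, 3), 128, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=np.uint8)
mask[:, :2] = 1

edited = composite(frame, robot, mask)
```

Run over every frame of an in-the-wild clip, this yields video that looks robot-embodied to the policy while keeping the human's scene and object diversity.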

EgoMimic: Scaling Imitation Learning via Egocentric Video

Full-stack framework pairing egocentric human video with 3D hand tracking via Project Aria. Treats human and robot data equally, co-training a unified policy that substantially improves long-horizon task performance.

EgoVLA: Learning Vision-Language-Action Models from Egocentric Human Videos

VLA model leveraging massive egocentric human video to address data scarcity. Maps human demonstrations to a unified action space using the MANO hand model. Enables zero-shot generalization and humanoid transfer with minimal fine-tuning.

CROSSING THE EMBODIMENT GAP

Transferring human manipulation skills to robot hardware

Crossing the Human-Robot Embodiment Gap with Sim-to-Real RL using One Human Demonstration

HUMAN2SIM2ROBOT framework trains robust dexterous manipulation policies from a single human RGB-D video. Extracts object 6D pose trajectory and pre-manipulation hand pose to guide robot configuration in a digital twin, enabling zero-shot real-world deployment.
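One building block such pipelines rely on is re-expressing the extracted object pose trajectory relative to its first frame, so the motion can be replayed in a digital twin regardless of where the object starts. A minimal sketch using 4x4 homogeneous transforms; the poses below are synthetic illustrative values, not the paper's extraction output:

```python
import numpy as np

def pose(rot_z_deg, t):
    """Build a 4x4 homogeneous transform: rotation about z plus translation."""
    a = np.deg2rad(rot_z_deg)
    T = np.eye(4)
    T[:2, :2] = [[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]]
    T[:3, 3] = t
    return T

# Synthetic object 6D pose trajectory in the camera frame (illustrative values).
traj = [pose(0, [0.10, 0.00, 0.50]),
        pose(15, [0.12, 0.02, 0.50]),
        pose(30, [0.15, 0.05, 0.48])]

# Re-express each pose relative to the first frame: the relative trajectory
# is what a digital twin can replay from any initial object placement.
T0_inv = np.linalg.inv(traj[0])
rel = [T0_inv @ T for T in traj]
```

The first relative pose is the identity by construction; the rest encode how the object moved, independent of where the camera saw it.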

HERMES: Human-to-Robot Embodied Learning from Multi-Source Motion Data

Four-stage pipeline for mobile robots with dual dexterous hands. Integrates learning from teleoperation, mocap, and unstructured raw video with vision-based sim-to-real transfer and closed-loop navigation.

UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations

Learns universal, embodiment-agnostic skill representations from massive datasets of unaligned human and robot videos. Inverse and forward skill dynamics models allow robots to imitate complex compositional behaviors without paired demonstrations.

VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos

Extracts temporally consistent 3D hand trajectories from monocular RGB-only video using depth models and structure-from-motion. Coarse-to-fine affordance learning enables zero-shot deployment across novel scenes and robot embodiments.
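Lifting a tracked 2D hand keypoint to a 3D camera-frame point, given an estimated depth and camera intrinsics, is standard pinhole back-projection. A minimal sketch with hypothetical intrinsics and a made-up three-frame track (real systems would get depth from a monocular depth model and refine across frames):

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift a 2D pixel (u, v) with metric depth z to a 3D camera-frame point."""
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Hypothetical intrinsics and a short 2D hand-keypoint track with estimated depth.
fx = fy = 600.0
cx, cy = 320.0, 240.0
track_2d = [(350, 260, 0.80), (360, 255, 0.78), (372, 251, 0.75)]  # (u, v, depth in m)

# The 3D trajectory is the per-frame back-projection of the tracked keypoint.
traj_3d = np.stack([backproject(u, v, d, fx, fy, cx, cy) for u, v, d in track_2d])
```

Temporal consistency, as the paper targets, then means smoothing and aligning these per-frame lifts so depth-estimation noise does not show up as jittery 3D motion.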

Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids

Practical recipe for multi-fingered bimanual manipulation using sim-to-real RL. Overcomes environment modeling gaps through autotuned robot modeling, hybrid object representations, and generalizable contact-goal reward designs.

THE DATA THESE MODELS NEED

Egocentric video is the pretraining foundation. Tactile sensing is the missing modality. We deliver both.