PHYSICAL AI DATA QUALITY — PART 2

Volume Is a Curse Without Metadata

More egocentric hours, dumped on a policy without annotation, make the model worse. The 2026 version of “at scale” is engineered, annotated, and contextual, not just more footage.

June 3, 2026 · 8 MIN READ · SYNGRAPH

For two years the implicit plan for robot data was simple: collect enough first-person human video — headcams, bodycams, AR glasses — dump it into a foundation model, and let scale do the work. The plan was half right. Scale matters. But the 2026 evidence is blunt about the other half: raw volume, without annotation, does not just fail to help. It actively makes policies worse.

The averaging trap

The mechanism is intuitive once you see it. Ten people pour a glass of water ten different ways. Some are fast, some careful, some brace the glass, some don't. Feed all ten demonstrations to a policy with no signal about which strategy is which, and the model does the only thing it can: it averages them. The result is a blurred behavior that matches no single competent strategy. Add more unlabeled demonstrations and the blur gets worse, not better.
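
The averaging effect can be shown with a toy calculation. The sketch below assumes a policy trained with a squared-error loss on a single action value (a pour tilt angle), which is a deliberate simplification of real policy architectures. The demonstrations, labels, and angles are invented for illustration.

```python
from statistics import mean

# Hypothetical demonstrations: (strategy label, pour tilt angle in degrees).
# These numbers are invented for illustration only.
demos = [
    ("fast", 58.0), ("fast", 60.0), ("fast", 62.0),
    ("careful", 18.0), ("careful", 20.0), ("careful", 22.0),
]


def fit_unconditioned(demos):
    """Best constant action under squared error when labels are ignored."""
    return mean(angle for _, angle in demos)


def fit_conditioned(demos):
    """Best constant action per strategy label: one mean per label."""
    by_label = {}
    for label, angle in demos:
        by_label.setdefault(label, []).append(angle)
    return {label: mean(angles) for label, angles in by_label.items()}


blurred = fit_unconditioned(demos)
per_strategy = fit_conditioned(demos)

# Distance from the blurred action to the closest demonstrated action.
gap = min(abs(blurred - angle) for _, angle in demos)

print(blurred)       # 40.0
print(per_strategy)  # {'fast': 60.0, 'careful': 20.0}
print(gap)           # 18.0
```

Without labels, the fitted action is 40 degrees, which nobody demonstrated. With a strategy label to condition on, the same data yields two actions that each match a real strategy.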

Physical Intelligence's π0.7 (arXiv 2604.15483) is the cleanest demonstration of the alternative. Its gains over prior generalist policies came not from more data but from how the data was annotated and how the model was conditioned on those annotations during training. The mix spans nearly everything (teleoperation, autonomous rollouts, reinforcement-learning trajectories, outright failures, egocentric human video, and web data), and the leverage is in the labels, not the hours. Even EgoVerse, a 1,362-hour, 2,087-demonstrator egocentric corpus, reports that performance scales with human data only when that data is aligned with the robot learning objective. Volume is necessary. It is not sufficient.

What “annotation density” actually means

The word annotation undersells it. The signals that make a dataset trainable in this regime are specific:

  • Quality and strategy labels — was this the fast way or the safe way? A grade per episode lets the model condition on intent instead of averaging across it.
  • Subtask breakdowns — long demonstrations decomposed into the steps a policy can actually learn and recombine.
  • Subgoal images — a single frame of “what success looks like” a few seconds out. π0.7 includes these in a quarter of its training batches. They turn open-loop action prediction into an inverse-dynamics question — what action gets me to that state? — and they let instruction-following work even when language is rare in the data.
  • Episode context built for dropout — π0.7 randomly drops metadata during training so the model learns to operate with partial information at test time. That only works if the metadata existed in the first place, captured in a consistent schema.
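
The last two items can be sketched as a sampling step in a data loader. This is a minimal illustration, not π0.7's implementation: the function name, the field names, and the 0.3 drop probability are assumptions, and the subgoal is attached per sample here rather than per batch for simplicity. Only the one-quarter subgoal rate comes from the text above.

```python
import random

SUBGOAL_RATE = 0.25    # "a quarter of its training batches", per the text
METADATA_DROP_P = 0.3  # assumed value, not taken from any paper


def build_conditioning(episode, rng,
                       drop_p=METADATA_DROP_P, subgoal_rate=SUBGOAL_RATE):
    """Return the conditioning signals kept for one training sample.

    Each metadata field is dropped independently with probability drop_p.
    The subgoal frame is attached with probability subgoal_rate.
    """
    cond = {}
    for key, value in episode["metadata"].items():
        if value is not None and rng.random() >= drop_p:
            cond[key] = value
    subgoal = episode.get("subgoal_frame")
    if subgoal is not None and rng.random() < subgoal_rate:
        cond["subgoal_frame"] = subgoal
    return cond


# Hypothetical episode record; keys and values are placeholders.
episode = {
    "metadata": {
        "quality": "high",
        "strategy": "careful",
        "subtask": "tilt_pitcher",
    },
    "subgoal_frame": "frame_000123.png",
}

rng = random.Random(0)  # seeded so the run is repeatable
samples = [build_conditioning(episode, rng) for _ in range(10_000)]

subgoal_fraction = sum("subgoal_frame" in s for s in samples) / len(samples)
quality_fraction = sum("quality" in s for s in samples) / len(samples)
print(round(subgoal_fraction, 2), round(quality_fraction, 2))
```

A field that was never captured cannot be dropped or kept, so the sampling step has nothing to work with.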

None of this can be reconstructed after the fact from a folder of MP4s. It has to be captured at collection time, in a schema that downstream trainers can rely on.
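
One way to picture such a schema is a typed per-episode record that is filled in at collection time and can report what is missing. The class and field names below are hypothetical and are not drawn from any published format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Subtask:
    name: str
    start_s: float  # seconds from episode start
    end_s: float


@dataclass
class EpisodeRecord:
    # All field names are hypothetical, chosen for illustration.
    episode_id: str
    source: str                     # e.g. "human_ego" or "robot_teleop"
    video_path: str
    quality: Optional[str] = None   # per-episode grade
    strategy: Optional[str] = None  # e.g. "fast" or "careful"
    subtasks: List[Subtask] = field(default_factory=list)
    subgoal_times_s: List[float] = field(default_factory=list)

    def missing_annotations(self) -> List[str]:
        """Names of annotation fields that were not filled in at capture."""
        missing = []
        if self.quality is None:
            missing.append("quality")
        if self.strategy is None:
            missing.append("strategy")
        if not self.subtasks:
            missing.append("subtasks")
        if not self.subgoal_times_s:
            missing.append("subgoal_times_s")
        return missing


# A bare clip versus an episode annotated at capture time.
bare = EpisodeRecord("ep-001", "human_ego", "ep-001.mp4")
annotated = EpisodeRecord(
    "ep-002", "human_ego", "ep-002.mp4",
    quality="high", strategy="careful",
    subtasks=[Subtask("grasp_pitcher", 0.0, 1.8), Subtask("pour", 1.8, 6.5)],
    subgoal_times_s=[1.8, 6.5],
)

print(bare.missing_annotations())
# ['quality', 'strategy', 'subtasks', 'subgoal_times_s']
print(annotated.missing_annotations())
# []
```

The bare record is the "folder of MP4s" case: the video exists, but every signal a trainer would condition on is absent.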

Three consequences for anyone collecting human data

Treat ego-data as one ingredient, not the recipe

π0.7 succeeded by blending human egocentric video with robot teleop, autonomous rollouts, RL trajectories, failures, and web data. Pure human video does not carry embodiment-specific signal. The collection has to be interoperable — timestamped, synchronized, and labeled in the same schema as the robot data it will be co-trained with — or it sits in a silo no trainer can mix in cleanly.

Synchronize at capture time

Ego video plus tactile plus multiple angles is only valuable if the streams are frame-aligned. Tactile is especially high-leverage right now because vision alone is ambiguous at contact — but a force signal that is even tens of milliseconds out of register with the video teaches the wrong cause and effect. Synchronization is not a post-processing convenience; it is a precondition for the data being usable at all.
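
A simple register check is to match each video frame to its nearest tactile sample and reject matches outside a tolerance. The sketch below assumes both streams are stamped by the same clock, in integer milliseconds, sorted ascending. The function name, the timestamps, and the 5 ms tolerance are illustrative assumptions.

```python
from bisect import bisect_left


def align_to_frames(frame_ts_ms, sensor_ts_ms, tolerance_ms):
    """Match each frame to the index of its nearest sensor sample.

    Returns one entry per frame: the sensor index, or None when the
    nearest sample is farther away than tolerance_ms.
    Both timestamp lists must be sorted ascending.
    """
    aligned = []
    for t in frame_ts_ms:
        i = bisect_left(sensor_ts_ms, t)
        # The nearest sample is either just before or just after position i.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(sensor_ts_ms)]
        if not candidates:
            aligned.append(None)
            continue
        best = min(candidates, key=lambda j: abs(sensor_ts_ms[j] - t))
        in_register = abs(sensor_ts_ms[best] - t) <= tolerance_ms
        aligned.append(best if in_register else None)
    return aligned


frame_ts_ms = [0, 33, 67, 100]    # roughly 30 fps video
tactile_ts_ms = [1, 30, 90, 101]  # tactile stream with a gap near 67 ms

matches = align_to_frames(frame_ts_ms, tactile_ts_ms, tolerance_ms=5)
print(matches)  # [0, 1, None, 3]
```

The third frame has no tactile sample within tolerance, so it is flagged rather than silently paired with a reading 23 ms away.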

Capture for the missing-information case

Because robust policies train with dropout on affordances, instructions, and metadata, the collection should produce modular, separable signals — subgoal frames, language, quality flags, contact events — that a downstream trainer can selectively withhold. Monolithic clips can't be decomposed later.

The moat moved

The data moat has shifted from quantity to quality and context. “At scale” in 2026 means engineered, annotated, multi-modal, and aligned with how the model will be trained — not more headcam footage. That is the standard we build to: every episode carries its metadata, its contact ground truth, and its synchronized modalities, in a schema that drops into a training pipeline without a human in the loop. Volume is table stakes. The leverage is everything wrapped around it.

Annotation · Co-training · Data strategy
