PHYSICAL AI DATA QUALITY — PART 1
Synchronization: The Unbuilt Foundation of Physical AI
The field is racing toward multi-modal sensorimotor human data. The temporal layer underneath it has not been built — and nothing in the pipeline warns you when it is broken.
Before you scale multi-modal, get this right.
The bet on human data
Physical AI is shifting fast toward training on human behavioral data. UMI (Stanford / Columbia / TRI, RSS 2024) showed that manipulation policies can be trained entirely from human demonstrations, no robot required. EgoScale (NVIDIA GEAR Lab) scaled this to more than 20,000 hours and found a log-linear relationship between human data volume and downstream policy performance. Humans are general-purpose robots deployed at a scale of eight billion. Just as language models learned from humanity's accumulated text, Physical AI will learn from humanity's accumulated behavior. The difference: text already existed on the internet. Behavioral data has to be collected from scratch.
World models raise the bar further. Learning physical causality requires multiple viewpoints and modalities capturing the same scene simultaneously. Video prediction is emerging as the core paradigm — models like Meta's V-JEPA 2 show that video-scale human data can be distilled into genuine physical understanding. For all of these, temporal quality directly determines the accuracy of the physics a model can learn.
The problem most teams haven't hit yet
Right now, most teams collecting human data for Physical AI are capturing single-stream egocentric video. One camera, one file, no sync needed.
But the field is moving toward multi-modal sensorimotor data: ego camera plus wrist camera plus IMU plus tactile sensors plus force feedback. This is where the real value lives, because manipulation is not just about what you see — it is about the relationship between vision, proprioception, and contact.
The moment you add a second sensor, synchronization becomes your problem. Most teams discover this the hard way, because nothing in their pipeline warns them that it is broken. The sync errors below are not edge cases that surface occasionally. They are structural. They exist in every multi-device recording by default, and they will silently corrupt your training data unless explicitly solved.
Why human data has this problem and robot data doesn't
In robot data collection, synchronization is barely a concern. A single controller governs every sensor. Joint encoders, force/torque sensors, and cameras are all wired to one host, sharing one system clock. ROS provides a unified time reference. Synchronization is built into robot systems by design.
Human data is structurally different, and the reason is not a technical limitation — it is a prerequisite for good data. Human data is valuable precisely because people behave naturally: using tools, opening drawers, moving objects without constraint. To capture that, the rig must not interfere with movement. Cables, heavy rigs, and tethered sensors destroy the naturalism that makes the data worth collecting.
So human data collection takes this form:
- Lightweight egocentric camera on the head (GoPro, Project Aria)
- Wrist-mounted camera
- IMU sensors on body or hands (wireless BLE)
- External tripod cameras (third-person view)
- Optionally: tactile sensors, eye trackers
Each device has its own hardware, its own clock, its own storage. They cannot be wired to a single host without destroying the freedom of movement that makes the data valuable. The paradox: the properties that make human data worth collecting are exactly what create the synchronization problem.
Why most teams underestimate sync
Most teams assume synchronization is straightforward. Here is why every common shortcut falls short.
“NTP will handle it”
NTP accuracy is roughly ±1–10 ms on LAN and ±10–50 ms over WiFi. At 30 fps, one frame is 33.3 ms, so WiFi NTP error alone can reach about one and a half frames. More fundamentally, NTP synchronizes host wall-clocks, and only on hosts that run it. GoPros do not use NTP. BLE IMUs are not network-connected. Devices that never join the network stay outside the shared reference, so NTP has nothing to align them against.
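The frame-level cost of a clock error is simple arithmetic. As a minimal sketch (the helper name `error_in_frames` is hypothetical, and the error values are the ranges quoted above, not measurements), the conversion at 30 fps looks like this:

```python
def error_in_frames(clock_error_ms: float, fps: float = 30.0) -> float:
    """Express a clock error in milliseconds as a multiple of one frame period."""
    frame_period_ms = 1000.0 / fps
    return clock_error_ms / frame_period_ms


# Error ranges taken from the text above (illustrative, not measured).
cases = [
    ("LAN, low end", 1.0),
    ("LAN, high end", 10.0),
    ("WiFi, low end", 10.0),
    ("WiFi, high end", 50.0),
]
for label, err_ms in cases:
    print(f"{label}: {error_in_frames(err_ms):.2f} frames at 30 fps")
```

The same helper applies to any frame rate: at higher fps the frame period shrinks, so an identical clock error spans more frames.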
“Just compare timestamps”
“Timestamp” means different things on different devices. A GoPro stamps frames against its own internal clock, never exposed to any host. iPhone CMSampleBuffer presentation timestamps live on the iOS media clock. BLE IMU samples carry their own sensor clock, unrelated to host receive time. Even if every device reports a timestamp, those timestamps live in different clock domains and cannot be compared directly. Resolution isn't the problem; the absence of a shared reference is.
“Use PTP or hardware sync”
PTP achieves sub-microsecond precision — but it requires dedicated hardware, doesn't work over WiFi, and costs thousands per setup. Hardware genlock works between industrial cameras but not on consumer wearables. Precise sync technology exists. It is incompatible with the way human data has to be collected.
“30 fps means 33.3 ms intervals”
It doesn't. OS scheduling, USB contention, and sensor exposure variance make frame intervals vary. Many consumer cameras are effectively variable frame rate. Computing time as frame_number × (1 / fps) introduces per-frame errors that accumulate differently on every device.
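To see how far the nominal formula can stray, compare it against recorded per-frame timestamps. The sketch below is illustrative only: the function names are hypothetical, the recorded timestamps are synthetic (a 2 ms standard deviation on each interval is an assumed figure, not a measured one), and the nominal axis is anchored at the first recorded frame.

```python
import numpy as np


def nominal_timestamps(n_frames: int, fps: float) -> np.ndarray:
    """Timestamps implied by frame_number * (1 / fps), starting at zero."""
    return np.arange(n_frames) / fps


def nominal_error_ms(recorded_ts_s, fps: float) -> np.ndarray:
    """Per-frame difference (ms) between recorded and nominal timestamps."""
    recorded = np.asarray(recorded_ts_s, dtype=float)
    nominal = recorded[0] + nominal_timestamps(len(recorded), fps)
    return (recorded - nominal) * 1000.0


# Synthetic recording: 300 frames whose intervals deviate slightly from 1/30 s.
rng = np.random.default_rng(0)
intervals = 1.0 / 30.0 + rng.normal(0.0, 0.002, size=299)  # assumed jitter
recorded = np.concatenate([[0.0], np.cumsum(intervals)])

err = nominal_error_ms(recorded, fps=30.0)
print(f"max |nominal - recorded| over 10 s: {np.abs(err).max():.1f} ms")
```

Because each interval deviation adds to the running sum, the error behaves like a random walk rather than averaging out, and two devices with different deviations walk in different directions.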
Four core problems
Multi-modal synchronization decomposes into four distinct problems. Each has a different physical cause and needs a different solution.
1. Offset — the time axes start at different points
Each device defines “time zero” independently. To compare them you need the offset between their axes. On the same host, a shared system clock makes this possible. Across hosts — body-worn cameras versus external tripods on separate machines — no shared reference exists. Asking “what time is it?” over a network doesn't help, because the network itself injects timing uncertainty on every packet.
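One common workaround is to make both devices observe the same physical events, such as a handclap or an LED flash, and compare when each one saw them. The sketch below assumes the events have already been detected and paired in both streams, which is the hard part and is not shown. The function name and the sample values are hypothetical.

```python
import numpy as np


def estimate_offset_s(events_ref_s, events_dev_s) -> float:
    """Estimate the constant offset that maps device time onto reference time.

    Both inputs hold the times at which the SAME events were observed,
    each expressed on that device's own time axis, in matching order.
    The median is used so a single badly detected event does not dominate.
    """
    ref = np.asarray(events_ref_s, dtype=float)
    dev = np.asarray(events_dev_s, dtype=float)
    if ref.shape != dev.shape or ref.size == 0:
        raise ValueError("need the same non-zero number of paired events")
    return float(np.median(ref - dev))


# Three claps as seen by a reference camera and by a second device (made-up values).
claps_ref = [12.000, 15.500, 20.250]
claps_dev = [2.004, 5.498, 10.252]

offset = estimate_offset_s(claps_ref, claps_dev)
print(f"t_ref is approximately t_dev + {offset:.3f} s")
```

A single constant like this only shifts the axis. It says nothing about whether the two clocks run at the same speed, and its quality is bounded by how precisely each event can be located in each stream.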
2. Drift — the time axes run at different speeds
Even after offset is corrected, the axes keep diverging, because each device's crystal oscillator runs at a slightly different frequency (±10–100 ppm for consumer-grade parts). Accumulated error is elapsed time multiplied by the relative drift rate. At 40 ppm relative drift: about 24 ms after 10 minutes, 72 ms after 30 minutes, 144 ms after an hour. Drift is an error that grows with time. It requires rescaling the time axis, not just shifting it.
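The drift figures follow directly from elapsed time multiplied by the drift rate, and "rescaling the time axis" can be pictured as fitting a line between the two clocks. The sketch below is one simple way to do that, assuming paired reference points spread across the session are available and that drift is constant over the recording. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def drift_ms(elapsed_s: float, ppm: float) -> float:
    """Accumulated drift in milliseconds after elapsed_s at a relative rate in ppm."""
    return elapsed_s * ppm * 1e-6 * 1000.0


def fit_time_map(dev_s, ref_s):
    """Least-squares fit of ref = scale * dev + offset from paired time points."""
    dev = np.asarray(dev_s, dtype=float)
    ref = np.asarray(ref_s, dtype=float)
    scale, offset = np.polyfit(dev, ref, 1)  # degree 1: slope, then intercept
    return float(scale), float(offset)


for minutes in (10, 30, 60):
    print(f"{minutes} min at 40 ppm: {drift_ms(minutes * 60, 40):.0f} ms")

# Synthetic session: the device clock runs 40 ppm slow and starts 5 s late.
dev = np.array([0.0, 900.0, 1800.0, 2700.0, 3600.0])
ref = dev * (1 + 40e-6) + 5.0

scale, offset = fit_time_map(dev, ref)
print(f"estimated drift: {(scale - 1) * 1e6:.1f} ppm, offset: {offset:.3f} s")
```

The slope captures drift and the intercept captures offset, so one fit corrects both. A straight line is an assumption: oscillator frequency also moves with temperature, so long sessions may need a piecewise fit.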
3. Jitter — frame intervals are not uniform
Even after correcting offset and drift, frame_number × (1 / fps) is wrong. Actual capture intervals vary per frame due to OS scheduling, USB contention, and exposure variance. You need real per-frame timestamps, not nominal calculations.
4. Rate mismatch — streams run at different frequencies
A 30 fps camera, a 200 Hz IMU, a 1000 Hz tactile sensor. By what standard do you even define “synchronized”? A single video frame overlaps roughly seven IMU samples and thirty-three tactile readings. Which samples belong to that frame, and how do you map one to the other? This is the frame-mapping problem.
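One way to make the frame-mapping question concrete is to assign each sensor sample to the frame whose interval contains it. The sketch below assumes both streams already share a time base and that timestamps are sorted. Defining a frame's window as the span from its timestamp to the next frame's timestamp is a choice made for illustration (an exposure-centered window is another), and the function name is hypothetical.

```python
import numpy as np


def frame_sample_ranges(frame_ts_s, sample_ts_s) -> np.ndarray:
    """Map each frame interval [t_i, t_(i+1)) to a half-open range of sample indices.

    Returns an array of shape (n_frames - 1, 2) holding (start, stop) indices
    into sample_ts_s. The last frame has no following frame, so it gets no window.
    """
    frame_ts = np.asarray(frame_ts_s, dtype=float)
    sample_ts = np.asarray(sample_ts_s, dtype=float)
    edges = np.searchsorted(sample_ts, frame_ts, side="left")
    return np.stack([edges[:-1], edges[1:]], axis=1)


# One second of idealized streams: 30 fps video, 200 Hz IMU, 1000 Hz tactile.
frames = np.arange(31) / 30.0
imu = np.arange(200) / 200.0
tactile = np.arange(1000) / 1000.0

imu_ranges = frame_sample_ranges(frames, imu)
tactile_ranges = frame_sample_ranges(frames, tactile)
imu_counts = imu_ranges[:, 1] - imu_ranges[:, 0]
tactile_counts = tactile_ranges[:, 1] - tactile_ranges[:, 0]

print("IMU samples per frame:", sorted(set(imu_counts.tolist())))
print("tactile samples per frame:", sorted(set(tactile_counts.tolist())))
```

Because the function works from timestamps rather than nominal rates, it keeps working when frame intervals are irregular: a long frame simply collects more samples.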
The errors compound
These four problems do not occur in isolation. They stack. In a typical setup (ego cam, wrist cam, two external cameras, a BLE IMU), every stream pair faces a different combination:
- Same-host cameras: drift + jitter + rate mismatch
- Cross-host cameras: add offset
- Camera-to-sensor: all four
At scale, variance explodes. A collection effort spanning thousands of demonstrators means different devices, environments, and firmware, so no single calibration works everywhere. Foundation models ingest this data wholesale, unable to distinguish temporally corrupted samples from clean ones. World models need precise temporal correspondence between visual contact and force/IMU readings to learn correct causality. Contact events in manipulation happen on the order of tens of milliseconds, so a misalignment of the same size shifts the apparent cause and effect a model sees during training, and contact-rich tasks suffer most.
All of this is invisible. Human perception tolerates tens of milliseconds of audio-visual asynchrony without noticing, so playback looks fine. But models learn pixel- and sample-level correspondence, and they encode the misalignment as a pattern. The feedback loop from collection to training failure is long enough that sync is rarely identified as the cause.
The case for infrastructure
Today, most teams handle sync through ad-hoc methods: handclaps, LED flashes, per-project scripts. These work at small scale. They do not survive production. The cost is real. Undetected misalignment causes policy failures that take weeks to diagnose. Scripts break when device setups change. Quality is assumed, never verified. When multiple sites contribute to a shared dataset, there is no standard for temporal consistency.
Some groups have invested seriously. UMI built a pipeline around GoPro GPMF telemetry. Meta built Project Aria with custom timing hardware. But UMI works only with GoPros, and Aria depends on purpose-built hardware most teams cannot replicate. Everyone else is left with consumer devices and no general-purpose synchronization layer.
What is missing is not another script. It is an infrastructure layer: device-agnostic, producing verifiable alignment with per-stream confidence metrics, handling mixed sampling rates, and scaling from one session to thousands of hours without manual intervention. Everything built on human data — foundation models, world models, manipulation policies — depends on the reliability of that layer.
That is the layer we build. Frame-accurate alignment, integrity-checked at the episode boundary, before a single sample reaches your training corpus.