General-purpose AI 'knows' how to move robots better than the models that were specifically trained to move robots.
April 15, 2026
Original Paper
LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment
arXiv · 2604.11689
The Takeaway
The LARY benchmark shows that general visual models, trained on nothing but internet videos, outperform specialized robotic 'embodied' models in control tasks. This suggests that the 'rules of the physical world' are already implicitly encoded in our general datasets. We don't need specialized, expensive robotic data to teach AI how to act; we just need better ways to 'align' its existing visual knowledge to a robot's arms. This is a massive paradigm shift for robotics: the path to humanoid robots might go through YouTube, not through specialized lab training. Generalization is winning.
From the abstract
While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representation to derive robust control from visual observations has yet to be rigorously evaluated. We introduce the Latent Action Representation Yielding (LARY) benchmark...
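To make the idea of "latent actions" concrete, here is a minimal sketch of how such a representation is typically learned from unlabeled video, in the spirit of LAPO/Genie-style latent action models. This is not the paper's architecture; the class name, layer sizes, and the vector-quantization step are illustrative assumptions. The key point is that the model infers an action-like code purely from how one frame changes into the next, with no robot action labels involved.

```python
# Illustrative latent action model: all names, shapes, and the VQ step
# are assumptions for the sketch, not taken from the LARY paper.
import torch
import torch.nn as nn

class LatentActionModel(nn.Module):
    def __init__(self, frame_dim=512, latent_dim=32, codebook_size=64):
        super().__init__()
        # Inverse dynamics: infer a latent action from a pair of frames.
        self.encoder = nn.Sequential(
            nn.Linear(2 * frame_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim)
        )
        # Discrete codebook keeps the action representation embodiment-agnostic.
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        # Forward dynamics: predict the next frame from frame + latent action.
        self.decoder = nn.Sequential(
            nn.Linear(frame_dim + latent_dim, 256), nn.ReLU(), nn.Linear(256, frame_dim)
        )

    def forward(self, frame_t, frame_t1):
        # Encode "what changed" between consecutive frames.
        z = self.encoder(torch.cat([frame_t, frame_t1], dim=-1))
        # Quantize to the nearest codebook entry (straight-through estimator).
        dists = torch.cdist(z, self.codebook.weight)   # (batch, codebook_size)
        idx = dists.argmin(dim=-1)                     # discrete latent action id
        z_q = self.codebook(idx)
        z_q = z + (z_q - z).detach()
        # Reconstruct the next frame conditioned on the latent action.
        pred_t1 = self.decoder(torch.cat([frame_t, z_q], dim=-1))
        return pred_t1, idx
```

Training the model to reconstruct the next frame forces the discrete code to capture the action that occurred, without any labeled robot data. The "alignment" step the takeaway refers to would then, in broad terms, map these codes onto a specific robot's action space with a comparatively small amount of embodied data.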