Synthetic videos of simple geometric shapes are more effective than massive real-world datasets for teaching video-language models fundamental temporal reasoning.
March 19, 2026
Original Paper
Learning Transferable Temporal Primitives for Video Reasoning via Synthetic Videos
arXiv · 2603.17693
The Takeaway
The paper demonstrates that 'temporal primitives' (speed, direction, and state) learned from just 7.7K synthetic samples let a model outperform counterparts trained on much larger real-world datasets such as Video-R1. This shifts the emphasis from data scale to curriculum structure in video reasoning, showing that abstract primitives transfer to real scenarios better than noisy real data does.
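To make the idea concrete, here is a minimal sketch of what a "speed" primitive could look like as a synthetic clip: a dot crossing the frame that accelerates halfway through. This is not the paper's generator; the rendering function, resolution, and velocity values are all illustrative assumptions.

```python
import numpy as np

def render_frame(cx, cy, size=64, radius=5):
    """Rasterize one frame: a white circle on a black background."""
    yy, xx = np.mgrid[0:size, 0:size]
    return ((xx - cx) ** 2 + (yy - cy) ** 2 <= radius ** 2).astype(np.uint8) * 255

def make_speed_change_clip(n_frames=32, v_slow=0.8, v_fast=2.4, seed=0):
    """Hypothetical 'speed' primitive: the dot moves left-to-right and
    accelerates at the midpoint. Answering "does the dot speed up, slow
    down, or stay constant?" requires comparing displacement across
    frames -- no single keyframe reveals the answer.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(10, 54)          # random vertical position
    x, frames = 5.0, []
    for t in range(n_frames):
        frames.append(render_frame(x, y))
        x += v_slow if t < n_frames // 2 else v_fast
    return np.stack(frames), "speeds up"

clip, answer = make_speed_change_clip()
print(clip.shape, answer)  # (32, 64, 64) speeds up
```

Because the ground-truth label falls out of the generation parameters, such clips are cheap to label at scale and, by construction, cannot be solved from a single frame.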
From the abstract
The transition from image to video understanding requires vision-language models (VLMs) to shift from recognizing static patterns to reasoning over temporal dynamics such as motion trajectories, speed changes, and state transitions. Yet current post-training methods fall short due to two critical limitations: (1) existing datasets often lack temporal-centricity, where answers can be inferred from isolated keyframes rather than requiring holistic temporal integration; and (2) training data generation …
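The temporal-centricity critique suggests a simple probe: if a model answers just as well when a clip's frames are shuffled, the question never required temporal reasoning. Below is a rough sketch of such a probe, assuming a hypothetical `model_fn(frames, question) -> answer` interface; neither the function nor the metric comes from the paper.

```python
import numpy as np

def temporal_centricity_gap(model_fn, clips, questions, answers, seed=0):
    """Score a QA model on ordered vs. frame-shuffled clips.

    A large gap suggests the questions genuinely require temporal
    integration; a near-zero gap means answers leak from isolated
    keyframes -- the failure mode the abstract describes.
    Assumes each clip is a (T, H, W) array indexable by a permutation.
    """
    rng = np.random.default_rng(seed)
    ordered = shuffled = 0
    for clip, q, a in zip(clips, questions, answers):
        ordered += model_fn(clip, q) == a
        perm = rng.permutation(len(clip))        # destroy temporal order
        shuffled += model_fn(clip[perm], q) == a
    n = len(clips)
    return ordered / n - shuffled / n
```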