VAMPO optimizes visual dynamics in video models using policy gradients to fix precision-critical errors in robotic manipulation.
March 23, 2026
Original Paper
VAMPO: Policy Optimization for Improving Visual Dynamics in Video Action Models
arXiv · 2603.19370
The Takeaway
Standard diffusion models are trained with likelihood objectives, which often miss subtle contact physics. By treating denoising as a sequential decision process and optimizing for latent expert rewards, this framework produces video predictions that are physically grounded enough for robot control.
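The core idea can be illustrated with a toy sketch: treat each denoising step as an action from a stochastic policy, score the final prediction with a terminal reward, and update the policy with a REINFORCE-style gradient instead of a likelihood objective. Everything below (the 1-D state, the single drift parameter `theta`, the quadratic reward) is an illustrative assumption, not VAMPO's actual architecture or reward model.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 8          # denoising steps per trajectory
SIGMA = 0.1    # std of the stochastic denoising policy
TARGET = 1.0   # toy stand-in for the physically correct final state


def rollout(theta):
    """One denoising trajectory. Each step's update is an 'action' drawn from
    a Gaussian policy; we accumulate the score-function term for REINFORCE."""
    x = rng.normal()          # start from noise, as in diffusion sampling
    score = 0.0
    for _ in range(T):
        mu = theta * (TARGET - x)                     # policy mean this step
        a = mu + SIGMA * rng.normal()                 # sampled denoising action
        score += (a - mu) / SIGMA**2 * (TARGET - x)   # d log pi(a|x) / d theta
        x += a
    reward = -(x - TARGET) ** 2   # terminal reward: plausibility proxy
    return reward, score


theta = 0.0   # policy parameter: per-step drift strength toward the target
for _ in range(200):
    batch = [rollout(theta) for _ in range(64)]
    rewards = np.array([r for r, _ in batch])
    scores = np.array([s for _, s in batch])
    baseline = rewards.mean()                      # variance-reduction baseline
    grad = np.mean((rewards - baseline) * scores)  # REINFORCE estimator
    theta = float(np.clip(theta + 0.05 * grad, 0.0, 1.0))

print(f"learned theta = {theta:.3f}")
```

After training, the learned drift pulls trajectories toward the target state, improving the terminal reward over the untrained policy, which is the mechanism the paper applies (at scale, in latent space) to sharpen precision-critical dynamics that likelihood training alone leaves imprecise.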
From the abstract
Video action models are an appealing foundation for Vision–Language–Action systems because they can learn visual dynamics from large-scale video data and transfer this knowledge to downstream robot control. Yet current diffusion-based video predictors are trained with likelihood-surrogate objectives, which encourage globally plausible predictions without explicitly optimizing the precision-critical visual dynamics needed for manipulation. This objective mismatch often leads to subtle errors in