Fixes physically impossible video generation by disentangling semantic prompts from physical dynamics during training.
March 30, 2026
Original Paper
DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation
arXiv · 2603.25931
The Takeaway
Standard flow-matching video models often ignore physics because text prompts conflate what an object 'is' with how it 'moves'. This method uses a dual-scale contrastive loss to disentangle those two signals, letting models generate temporally consistent video that actually obeys kinematics and forces.
From the abstract
Flow-matching video generators produce temporally coherent, high-fidelity outputs yet routinely violate elementary physics because their reconstruction objectives penalize per-frame deviations without distinguishing physically consistent dynamics from impossible ones. Contrastive flow matching offers a principled remedy by pushing apart velocity-field trajectories of differing conditions, but we identify a fundamental obstacle in the text-conditioned video setting: semantic-physics entanglement.
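To make the contrastive flow-matching idea concrete, here is a minimal numpy sketch of one plausible instantiation: the model's predicted velocity is regressed toward the positive condition's flow target while being pushed away from the velocity implied by a mismatched (negative) condition. The function name, the shifted-target formulation, and the weight `lam` are illustrative assumptions, not the paper's actual objective.

```python
import numpy as np

def contrastive_fm_loss(v_pred, x0, x1, x1_neg, lam=0.05):
    """Sketch of a contrastive flow-matching objective (assumed form).

    v_pred : model-predicted velocity field, shape (batch, dim)
    x0     : noise samples the flow starts from
    x1     : data samples matching the conditioning text (positive)
    x1_neg : data samples drawn under a different condition (negative)
    lam    : contrastive weight (hypothetical value)
    """
    v_target = x1 - x0       # straight-line flow-matching velocity target
    v_neg = x1_neg - x0      # velocity toward the mismatched condition
    # Shift the regression target away from the negative velocity, so
    # matching it both fits the positive flow and repels the negative one.
    target = v_target - lam * v_neg
    return float(np.mean((v_pred - target) ** 2))
```

In this form the repulsion is folded into the regression target itself, so the loss stays a simple mean-squared error; the paper's dual-scale variant would additionally separate semantic from physical components, which this sketch does not attempt.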