AI & ML Paradigm Shift

Video models reason primarily along the diffusion denoising steps rather than sequentially across video frames.

March 18, 2026

Original Paper

Demystifying Video Reasoning

Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang

arXiv · 2603.16870

The Takeaway

The paper uncovers a 'Chain-of-Steps' mechanism: video models explore solutions during the early denoising steps and converge on one in the later steps. This insight enables training-free strategies that improve reasoning by ensembling or manipulating the denoising process.
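To make the idea of ensembling across denoising steps concrete, here is a minimal toy sketch. It is not the paper's method: `toy_denoise_step`, `ensemble_early_steps`, the `merge_step` parameter, and the averaging rule are all hypothetical stand-ins chosen for illustration, and the "latent" is a plain list of floats rather than a video latent. The sketch only shows the general shape of such a strategy: run several seeds independently through the early (exploratory) steps, merge them, then let a single trajectory converge.

```python
import random


def toy_denoise_step(latent, step, total_steps, rng):
    """Hypothetical stand-in for one denoising step of a real model.

    Shrinks the latent toward zero and adds noise that decays to
    zero by the final step.
    """
    noise_scale = 1.0 - (step + 1) / total_steps
    return [0.8 * x + noise_scale * rng.gauss(0.0, 0.1) for x in latent]


def ensemble_early_steps(seeds, dim=4, total_steps=10, merge_step=3):
    """Run each seed through the early steps, average, then finish once."""
    # Explore: one independent trajectory per seed for the early steps.
    candidates = []
    for seed in seeds:
        rng = random.Random(seed)
        latent = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for step in range(merge_step):
            latent = toy_denoise_step(latent, step, total_steps, rng)
        candidates.append(latent)

    # Merge: element-wise mean of the early latents (an assumed rule).
    merged = [sum(values) / len(values) for values in zip(*candidates)]

    # Converge: continue a single trajectory through the remaining steps.
    rng = random.Random(0)
    for step in range(merge_step, total_steps):
        merged = toy_denoise_step(merged, step, total_steps, rng)
    return merged


if __name__ == "__main__":
    print(ensemble_early_steps(seeds=[1, 2, 3]))
```

In a real diffusion pipeline the merge rule and the choice of `merge_step` would need to follow the paper's actual procedure, which this summary does not specify.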

From the abstract

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative a