Reduces token consumption in interleaved multimodal reasoning by over 72% using dynamic visual thoughts.
March 24, 2026
Original Paper
Let's Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts
arXiv · 2603.21754
The Takeaway
Most interleaved-modal models insert images at fixed intervals, which is computationally wasteful. This framework adaptively inserts visual information only when reasoning requires it, cutting token consumption by over 72% while achieving state-of-the-art performance.
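To make the contrast concrete, here is a minimal Python sketch of dynamic visual-thought insertion under stated assumptions: the function names, the `visual_need_score` signal, and the threshold are illustrative placeholders, not the paper's actual mechanism.

```python
import random

# Hypothetical sketch: insert a visual thought only when a per-step
# "visual need" score crosses a threshold, instead of every k steps.
VISUAL_NEED_THRESHOLD = 0.7  # assumed hyperparameter


def visual_need_score(state):
    """Stand-in for a learned signal (e.g., attention entropy over image
    regions). Random here purely so the sketch runs end to end."""
    return random.random()


def generate_text_step(state):
    """Stand-in for one textual reasoning step from the base model."""
    return f"text step {state['t']}"


def extract_visual_thought(image, state):
    """Stand-in for producing a precise visual thought (e.g., a crop or
    region features) conditioned on the current reasoning state."""
    return f"<visual thought from {image} at step {state['t']}>"


def interleaved_reasoning(image, max_steps=10):
    trace, state = [], {"t": 0}
    for t in range(max_steps):
        state["t"] = t
        trace.append(generate_text_step(state))
        # Dynamic positioning: attend to the image only when the need
        # signal fires, rather than at fixed intervals.
        if visual_need_score(state) > VISUAL_NEED_THRESHOLD:
            trace.append(extract_visual_thought(image, state))
    return trace


if __name__ == "__main__":
    for step in interleaved_reasoning(image="scene.png"):
        print(step)
```

The savings come from the gating condition: a fixed-interval baseline would append a visual thought unconditionally every k iterations, whereas here most steps emit text only.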
From the abstract
Recently, Interleaved-modal Chain-of-Thought (ICoT) reasoning has achieved remarkable success by leveraging both multimodal inputs and outputs, attracting increasing attention. While achieving promising performance, current ICoT methods still suffer from two major limitations: (1) Static Visual Thought Positioning, which statically inserts visual information at fixed steps, resulting in inefficient and inflexible reasoning; and (2) Broken Visual Thought Representation, which involves discontinuous…