Reduces token consumption in interleaved multimodal reasoning by over 72% using dynamic visual thoughts.
March 24, 2026
Original Paper
Let's Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts
arXiv · 2603.21754
The Takeaway
Most interleaved-modal models insert images at fixed intervals, which is computationally wasteful. This framework adaptively inserts visual information only when reasoning requires it, cutting token consumption by over 72% while achieving state-of-the-art performance.
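To make the contrast concrete, here is a minimal Python sketch of dynamic visual-thought insertion under stated assumptions: the function names, the `visual_need_score` signal, and the threshold are illustrative placeholders, not the paper's actual mechanism.

```python
import random

# Hypothetical sketch: insert a visual thought only when a per-step
# "visual need" score crosses a threshold, instead of every k steps.
VISUAL_NEED_THRESHOLD = 0.7  # assumed hyperparameter


def visual_need_score(state):
    """Stand-in for a learned signal (e.g., attention entropy over image
    regions). Random here purely so the sketch runs end to end."""
    return random.random()


def generate_text_step(state):
    """Stand-in for one textual reasoning step from the base model."""
    return f"text step {state['t']}"


def extract_visual_thought(image, state):
    """Stand-in for producing a precise visual thought (e.g., a crop or
    region features) conditioned on the current reasoning state."""
    return f"<visual thought from {image} at step {state['t']}>"


def interleaved_reasoning(image, max_steps=10):
    trace, state = [], {"t": 0}
    for t in range(max_steps):
        state["t"] = t
        trace.append(generate_text_step(state))
        # Dynamic positioning: attend to the image only when the need
        # signal fires, rather than at fixed intervals.
        if visual_need_score(state) > VISUAL_NEED_THRESHOLD:
            trace.append(extract_visual_thought(image, state))
    return trace


if __name__ == "__main__":
    for step in interleaved_reasoning(image="scene.png"):
        print(step)
```

The savings come from the gating condition: a fixed-interval baseline would append a visual thought unconditionally every k iterations, whereas here most steps emit text only.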
From the abstract
Recently, Interleaved-modal Chain-of-Thought (ICoT) reasoning has achieved remarkable success by leveraging both multimodal inputs and outputs, attracting increasing attention. While achieving promising performance, current ICoT methods still suffer from two major limitations: (1) Static Visual Thought Positioning, which statically inserts visual information at fixed steps, resulting in inefficient and inflexible reasoning; and (2) Broken Visual Thought Representation, which involves discontinuous…