Shifts multimodal LLMs from static image prefixes to an active, sequential 'Visual Chain-of-Thought' that explores images based on saliency.
March 31, 2026
Original Paper
Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning
arXiv · 2603.26737
The Takeaway
Current VLMs process an image as a fixed grid of static tokens; SSV-CoT instead trains the model to attend sequentially, first to the primary question-relevant regions and then to secondary cues. The method is end-to-end and significantly improves complex visual reasoning without requiring expensive region-level annotations.
From the abstract
Current multimodal LLMs encode images as static visual prefixes and rely on text-based reasoning, lacking goal-driven and adaptive visual access. Inspired by human visual perception, where attention is selectively and sequentially shifted from the most informative regions to secondary cues, we propose Structural Sequential Visual CoT (SSV-CoT). First, a question-relevant saliency map identifies and organizes key visual regions, explicitly modeling the spatial distribution of visual importance. Secon
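The paper does not publish an implementation, but the first step it describes, ranking image regions by question-conditioned saliency so the model can visit the primary cue before secondary ones, can be sketched roughly. The function name, grid size, and toy saliency map below are all illustrative assumptions, not the authors' code:

```python
import numpy as np

def rank_regions(saliency: np.ndarray, grid: int = 4, k: int = 3):
    """Hypothetical sketch of SSV-CoT's region-ordering step.

    Splits a question-conditioned saliency map into a grid x grid
    layout of cells and returns the k cell coordinates (row, col) in
    descending order of mean saliency, i.e. the order in which the
    model would sequentially attend to them.
    """
    H, W = saliency.shape
    h, w = H // grid, W // grid
    # Mean saliency per cell via a block-reshape.
    cells = saliency[: grid * h, : grid * w].reshape(grid, h, grid, w)
    scores = cells.mean(axis=(1, 3)).ravel()
    order = np.argsort(scores)[::-1][:k]  # most salient first
    return [(int(i) // grid, int(i) % grid) for i in order]

# Toy saliency map: a strong primary cue and a weaker secondary cue.
sal = np.zeros((64, 64))
sal[0:16, 48:64] = 1.0   # primary cue in the top-right cell
sal[48:64, 0:16] = 0.5   # secondary cue in the bottom-left cell
print(rank_regions(sal, grid=4, k=2))  # → [(0, 3), (3, 0)]
```

The model would then crop or re-encode these regions in order, interleaving the resulting visual tokens with its textual reasoning steps rather than consuming the whole image as one static prefix.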