Step-by-step thinking makes an AI worse at figuring out where objects are located in a photo.
April 20, 2026
Original Paper
Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
arXiv · 2604.16060
The Takeaway
Chain-of-Thought prompting causes multimodal models to hallucinate visual details that match their text-based expectations. While the technique helps with math and logic problems, it actively degrades spatial reasoning: the model begins to trust its own written description more than the actual pixels in the image. Thinking out loud, in other words, is not a universal recipe for better AI performance. Engineers will need prompting methods for vision tasks that keep text logic from overriding visual evidence.
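To make the contrast concrete, here is a minimal sketch of the two prompting styles being compared. The question text and instruction wording are illustrative assumptions, not the paper's exact prompts; the only difference between the two is whether the model is asked to verbalize its reasoning before answering.

```python
# Contrast a direct-answer prompt with a Chain-of-Thought (CoT) prompt
# for a visual spatial question. The paper's finding is that the CoT
# style, which helps on math, hurts on questions like this one.

QUESTION = "Is the mug to the left of the laptop in this image?"

def direct_prompt(question: str) -> str:
    # Ask for the answer only; no intermediate written reasoning.
    return f"{question}\nAnswer 'yes' or 'no' only."

def cot_prompt(question: str) -> str:
    # Ask the model to reason in text first. Per the paper, the written
    # description can end up overriding what is actually in the pixels.
    return f"{question}\nLet's think step by step, then give a final answer."

print(direct_prompt(QUESTION))
print(cot_prompt(QUESTION))
```

Both strings would be sent alongside the same image; the benchmark comparison in the paper holds everything constant except this instruction.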
From the abstract
Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that