Advanced AI vision models give the right answer when a photo is missing but fail when they actually look at the picture.
April 20, 2026
Original Paper
Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap
arXiv · 2604.16256
The Takeaway
Multimodal AI models frequently perform worse when given an image than when given only the text of a problem. This modality gap suggests that the models are mostly reasoning in textual space rather than using visual evidence: they fall back on textual priors and guesses learned from training data instead of actually looking at the image. The assumption that adding a camera to an AI makes it understand physical reality is often an illusion, and this finding forces a rethink of how we build and test vision systems for real-world robots.
From the abstract
Reasoning in vision-language models (VLMs) has recently attracted significant attention due to its broad applicability across diverse downstream tasks. However, it remains unclear whether the superior performance of VLMs stems from genuine vision-grounded reasoning or relies predominantly on the reasoning capabilities of their textual backbones. To systematically measure this, we introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons. […]
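To make the comparison concrete, here is a minimal sketch of how such a controlled cross-modal evaluation could be scored. Everything in it is an assumption for illustration: the `query_model` callable, the problem dictionaries, and the exact-match scoring are hypothetical stand-ins, not the paper's actual CrossMath harness.

```python
# Hypothetical sketch: score the same problems in two conditions
# (text-only vs. text + image) and report the modality gap.
from typing import Callable, Optional


def modality_gap(
    problems: list[dict],  # each: {"text": str, "image": bytes, "answer": str}
    query_model: Callable[[str, Optional[bytes]], str],  # returns the model's answer
) -> float:
    """Accuracy on text-only prompts minus accuracy with the image attached.

    A positive value means the model does worse when it sees the picture,
    i.e. the modality gap described in the takeaway above.
    """
    text_correct = image_correct = 0
    for p in problems:
        # Condition 1: the problem statement alone, no image.
        if query_model(p["text"], None) == p["answer"]:
            text_correct += 1
        # Condition 2: the identical statement with the image attached.
        if query_model(p["text"], p["image"]) == p["answer"]:
            image_correct += 1
    n = len(problems)
    return text_correct / n - image_correct / n
```

The key design point is that the two conditions share the exact same problem text, so any accuracy drop can be attributed to the image rather than to a change in the question.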