Vision-Language Models suffer from 'Digital Agnosia': they can 'see' the data perfectly but are unable to say what it is.
April 15, 2026
Original Paper
Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models
arXiv · 2604.09687
The Takeaway
The Grid2Matrix study reveals a striking disconnect: VLMs capture all the information in their visual encoders yet fail to report it through their language layers. Scaling doesn't fix this; even the largest models can't accurately report simple color grids they clearly 'perceive.' This shows that the bottleneck in multimodal AI isn't the 'eyes' (vision) but the bridge to 'speech' (language). For practitioners, it means that adding more visual data or bigger encoders won't solve basic descriptive failures; we need to rethink how visual features are actually integrated into the reasoning process.
From the abstract
Vision-Language Models (VLMs) excel on many multimodal reasoning benchmarks, but these evaluations often do not require an exhaustive readout of the image and can therefore obscure failures in faithfully capturing all visual details. We introduce Grid2Matrix (G2M), a controlled benchmark in which a model is shown a color grid and a color-to-number mapping, and must output the corresponding matrix. By varying grid size and the number of colors, G2M provides a simple way to increase visual complexity…
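To make the task concrete, here is a minimal Python sketch of how a Grid2Matrix-style item could be constructed: a color grid, a color-to-number legend, and the ground-truth matrix the model is expected to output. The function name `make_g2m_item`, the color pool, and the generation details are illustrative assumptions, not the authors' released code; in the actual benchmark the grid would also be rendered as an image before being shown to the VLM.

```python
import random

# Hypothetical sketch of one Grid2Matrix-style item (not the authors' code).
COLOR_POOL = ["red", "green", "blue", "yellow", "purple", "orange", "cyan", "magenta"]

def make_g2m_item(grid_size: int, num_colors: int, seed: int = 0):
    """Build a color grid, a color->number legend, and the target matrix."""
    rng = random.Random(seed)
    colors = rng.sample(COLOR_POOL, num_colors)
    legend = {color: i for i, color in enumerate(colors)}     # e.g. {"red": 0, "blue": 1, ...}
    grid = [[rng.choice(colors) for _ in range(grid_size)]     # cell colors (rendered as an image)
            for _ in range(grid_size)]
    target = [[legend[c] for c in row] for row in grid]        # exhaustive readout the model must produce
    return grid, legend, target

grid, legend, target = make_g2m_item(grid_size=4, num_colors=3)
print(legend)   # mapping shown to the model alongside the grid
print(target)   # matrix the model should output
```

Increasing `grid_size` or `num_colors` raises the visual complexity of the item while the language-side task (emit a small integer matrix) stays the same, which is what lets the benchmark isolate the readout bottleneck.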