Giving an AI a picture of a puzzle actually makes it 73% worse at solving it.
April 15, 2026
Original Paper
Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym
arXiv · 2604.09338
The Takeaway
We expect multimodal models to be better at spatial tasks, but this study found the opposite. State-of-the-art LLMs already struggle with 2D puzzles (a 16% solve rate vs. 98% for humans), and when you provide the actual image, performance drops by another 73%. This points to a 'reasoning-vision' disconnect: the visual input confuses the model's internal logic rather than informing it. For developers, the warning is clear: for spatial logic, text-only descriptions are currently far more reliable than image inputs, and multimodality hurts more than it helps.
From the abstract
Spatial reasoning is central to navigation and robotics, yet measuring model capabilities on these tasks remains difficult. Existing benchmarks evaluate models in a one-shot setting, requiring full solution generation in a single response, unlike humans, who work in interactive environments step-by-step. We introduce Spatial-Gym, a Gymnasium environment that isolates spatial constraint reasoning by testing pathfinding in 2D-grid puzzles as a sequential decision task with optional backtracking.
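To make the abstract's setup concrete, here is a minimal sketch of the kind of step-by-step grid-pathfinding interface it describes: the agent moves one cell at a time and may optionally backtrack. All class names, actions, and reward values here are hypothetical illustrations, not taken from the actual Spatial-Gym code.

```python
# Hypothetical sketch of a Gymnasium-style sequential grid-pathfinding
# environment with optional backtracking, mirroring the abstract's
# description. Not the authors' implementation.

class GridPathEnv:
    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self, grid, start, goal):
        self.grid = grid              # 0 = free cell, 1 = wall
        self.start, self.goal = start, goal

    def reset(self):
        self.pos = self.start
        self.history = [self.start]   # visited path, enables backtracking
        return self.pos

    def step(self, action):
        """Returns (observation, reward, done) after one decision."""
        if action == "backtrack" and len(self.history) > 1:
            self.history.pop()
            self.pos = self.history[-1]
            return self.pos, 0.0, False
        dr, dc = self.MOVES[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        in_bounds = 0 <= r < len(self.grid) and 0 <= c < len(self.grid[0])
        if in_bounds and self.grid[r][c] == 0:   # ignore moves into walls
            self.pos = (r, c)
            self.history.append(self.pos)
        done = self.pos == self.goal
        return self.pos, (1.0 if done else 0.0), done
```

The point of the sequential framing is that the model receives feedback after every move, so a wrong turn can be observed and undone, unlike the one-shot benchmarks where the full path must be emitted blind.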