Argues that high scores on visual-spatial benchmarks are achieved through token-level search (BFS carried out in prose) rather than genuine visual planning.
March 31, 2026
Original Paper
From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning
arXiv · 2603.26839
The Takeaway
The research shows that current MLLMs fail at spatial logic when stripped of their ability to convert images into text-based grids for exhaustive search. This challenges the assumption that multimodal models possess human-like spatial understanding and suggests that architectural changes, rather than more data, are needed for native visual reasoning.
From the abstract
How do multimodal models solve visual spatial tasks -- through genuine planning, or through brute-force search in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consum…
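The token-space strategy the abstract describes (transcribe the image into a character grid, then enumerate paths step by step) is essentially classical breadth-first search. A minimal sketch, assuming a simple `#`-for-wall text-grid encoding (the grid format and function names here are illustrative, not the paper's):

```python
from collections import deque

def bfs_maze(grid, start, goal):
    """Breadth-first search over a text grid ('#' = wall, '.' = open).

    Once an image has been transcribed into characters, finding a path
    is ordinary graph search over tokens, not visual planning.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, [start])])   # (cell, path so far)
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path                 # shortest path, by BFS order
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [(nr, nc)]))
    return None                         # maze has no solution

maze = [
    "#####",
    "#...#",
    "#.#.#",
    "#...#",
    "#####",
]
path = bfs_maze(maze, (1, 1), (3, 3))
```

That this roughly thirty-line routine reproduces the behavior behind the headline accuracy numbers is the point: high maze scores can come from text-side search rather than from any spatial representation.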