AI & ML Breaks Assumption

Shows that high scores on visual-spatial benchmarks are achieved through token-level search (breadth-first search carried out in prose) rather than genuine visual planning.

March 31, 2026

Original Paper

From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning

Alberto G. Rodriguez Salgado

arXiv · 2603.26839

The Takeaway

The research shows that current MLLMs fail at spatial logic when stripped of their ability to convert images into text-based grids for exhaustive search. This challenges the assumption that multimodal models possess human-like spatial understanding and suggests that architectural changes, rather than more data, are needed for native visual reasoning.

From the abstract

How do multimodal models solve visual spatial tasks: through genuine planning, or through brute-force search in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images across nine controlled groups, and evaluate 16 model configurations from OpenAI, Anthropic, Google, and Alibaba. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but these scores are misleading: models typically translate images into text grids and then enumerate paths step by step, consum…
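To make the distinction concrete, here is a minimal sketch of the kind of token-level search the abstract describes: once the maze image has been transcribed into a text grid, solving it is ordinary breadth-first search over characters, requiring no visual reasoning at all. The grid encoding (`#` for walls, `S`/`G` for start and goal) is a hypothetical illustration, not the paper's actual format.

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Breadth-first search over a text-grid maze.

    grid: list of equal-length strings; '#' is a wall, anything else is open.
    Returns a shortest list of (row, col) cells from start to goal, or None.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, [start])])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        # Explore the four orthogonal neighbors.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), path + [(nr, nc)]))
    return None

# Toy maze, stand-in for a model's transcription of a maze image.
maze = ["S..#",
        ".#.#",
        "...G"]
start = next((r, row.index('S')) for r, row in enumerate(maze) if 'S' in row)
goal = next((r, row.index('G')) for r, row in enumerate(maze) if 'G' in row)
path = bfs_path(maze, start, goal)
```

The point of the paper is that this procedure is what the models are effectively doing in their chain of thought, so a high maze score measures transcription plus search, not spatial planning.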