AI models 'hit a wall' when trying to solve maze puzzles, and scaling them to larger sizes doesn't seem to help.
April 15, 2026
Original Paper
CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms
arXiv · 2604.10825
The Takeaway
CheeseBench treats LLMs like lab rats, testing them on classical rodent spatial-navigation tasks. Surprisingly, 7B models perform reasonably well, but scaling to larger models yields diminishing returns. This suggests that spatial navigation may be a distinct cognitive constraint that current Transformer scaling does not address, pointing to a potential 'scaling limit' for digital intelligence in physical or geometric reasoning. For robotics and navigation, the implication is that we cannot simply rely on bigger LLMs to solve pathfinding; entirely different architectures for spatial awareness may be needed.
From the abstract
We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified system prompt with no task-specific instructions…
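The abstract describes an evaluation protocol: a text-based agent, a single unified system prompt with no task-specific instructions, and trial-based maze paradigms scored against behavioral baselines. The paper does not publish its harness here, but a minimal sketch of what such a setup could look like follows — the prompt text, the `run_tmaze_episode`/`evaluate` helpers, and the stand-in agent are all hypothetical illustrations, not CheeseBench's actual code.

```python
import random

# Hypothetical unified system prompt -- intentionally task-agnostic,
# mirroring the "no task-specific instructions" condition.
SYSTEM_PROMPT = "You are in an environment. Respond with a single word: LEFT or RIGHT."


def run_tmaze_episode(agent, reward_arm: str) -> bool:
    """One T-maze trial: the agent sees a junction observation and is
    scored on whether it chose the rewarded arm."""
    observation = "You stand at a T-junction. Choose an arm."
    choice = agent(SYSTEM_PROMPT, observation)
    return choice.strip().upper() == reward_arm


def evaluate(agent, n_trials: int = 100, seed: int = 0) -> float:
    """Fraction of trials where the agent found the reward; the reward
    arm is randomized per trial, so a memoryless agent sits near 0.5."""
    rng = random.Random(seed)
    hits = sum(
        run_tmaze_episode(agent, rng.choice(["LEFT", "RIGHT"]))
        for _ in range(n_trials)
    )
    return hits / n_trials


# Stand-in agent for testing the harness; a real run would call an LLM here.
def always_left(system_prompt: str, observation: str) -> str:
    return "LEFT"


score = evaluate(always_left)  # near chance level for this perseverating policy
```

With randomized reward arms and no trial-to-trial feedback, any fixed policy is chance-level; richer paradigms (delayed non-match to sample, conditioned place preference) would thread prior-trial outcomes back into the observation so that memory and spatial strategy can actually be measured.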