AI & ML Collision

AI models 'hit a wall' when trying to solve maze puzzles, and scaling them to larger sizes doesn't seem to help.

April 15, 2026

Original Paper

CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

arXiv · 2604.10825

The Takeaway

CheeseBench treats LLMs like lab rats, testing them on classical rodent spatial-navigation tasks. Surprisingly, 7B models perform well, but scaling to larger models yields diminishing returns. This suggests that spatial navigation may be a distinct cognitive capability that current Transformer scaling does not address, pointing to a potential scaling limit for digital intelligence in physical or geometric reasoning. For robotics and navigation, the implication is that we can't rely on "bigger LLMs" alone to solve pathfinding; spatial awareness may require entirely different architectures.

From the abstract

We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified system prompt with no task-specific instructions.
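To make the setup concrete, here is a minimal sketch of what an episode loop for one such paradigm might look like. This is a hypothetical illustration, not the paper's actual harness: the function names, the text-maze framing, and the alternation-reward rule (loosely modeled on rodent spontaneous-alternation T-maze protocols) are all assumptions, and a stub policy stands in for a real LLM call.

```python
import random

def run_t_maze(agent_fn, trials=20, seed=0):
    """Run `trials` text-based T-maze trials.

    Reward rule (hypothetical, alternation-style): the agent scores
    when it chooses the arm it did NOT visit on the previous trial.
    """
    rng = random.Random(seed)
    last_arm = rng.choice(["left", "right"])  # arm visited on a forced first run
    rewards = 0
    for _ in range(trials):
        prompt = (
            f"You are at a T-junction. You last visited the {last_arm} arm. "
            "Choose: left or right."
        )
        choice = agent_fn(prompt)  # a real harness would query an LLM here
        if choice != last_arm:     # alternation is rewarded
            rewards += 1
        last_arm = choice
    return rewards / trials

def alternating_agent(prompt):
    # Stand-in for an LLM: parses the prompt and perfectly alternates.
    return "right" if "last visited the left" in prompt else "left"

print(run_t_maze(alternating_agent))  # perfect alternation -> 1.0
```

A scripted alternating agent scores 1.0 here; the interesting question the benchmark asks is where real models land between chance (0.5) and the approximate animal baselines.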