DeepSeek-R1 solves complex math problems even when every number and operation name is replaced with alien gibberish.
Skeptics often argue that AI only solves math by memorizing patterns from its training data. When researchers renamed every term in the Natural Number Game to nonsense, general-purpose models like GPT-4 failed immediately, while reasoning-focused models like DeepSeek-R1 maintained their accuracy because they processed the underlying logic rather than the labels. The result suggests that a new class of AI is developing genuine structural reasoning capabilities. It marks the transition from models that know math to models that can actually do math.
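The renaming step described above can be sketched in a few lines. The mapping and the example theorem below are invented for illustration (they are not taken from the paper's benchmark); the point is that every familiar identifier is systematically replaced with a meaningless token, so a model must rely on structure rather than memorized names.

```python
import re

# Hypothetical obfuscation table: familiar names mapped to gibberish.
# The aliases are chosen so they never collide with the original names.
OBFUSCATION_MAP = {
    "nat": "zorp",
    "zero": "blee",
    "succ": "quux",
    "add": "frob",
}

def obfuscate(statement: str, mapping: dict[str, str]) -> str:
    """Replace each whole-word identifier with its obfuscated alias."""
    for original, alias in mapping.items():
        statement = re.sub(rf"\b{re.escape(original)}\b", alias, statement)
    return statement

theorem = "add (succ zero) zero = succ zero"
print(obfuscate(theorem, OBFUSCATION_MAP))
# frob (quux blee) blee = quux blee
```

The statement is now unrecognizable as arithmetic, yet its logical structure (and therefore its proof) is unchanged.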
Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game
arXiv · 2605.00677
While Large Language Models have achieved notable success on formal mathematics benchmarks such as MiniF2F, it remains unclear whether these results stem from genuine logical reasoning or from semantic pattern matching against pre-training data. This paper identifies Architectural Reasoning, the ability to synthesize formal proofs using exclusively local axioms and definitions within an alien math domain, as the capability required for future automated theorem discovery. We use the Obfuscated Natural Number Game…
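An "alien math domain" in this sense might look like the following Lean sketch. The names are invented for illustration and do not come from the paper's benchmark; the point is that a prover given only these local definitions must reason structurally, since it cannot pattern-match against familiar identifiers like `nat`, `zero`, or `succ`.

```lean
-- Illustrative sketch: Peano-style natural numbers with obfuscated names.
inductive Zorp : Type
  | blee : Zorp               -- plays the role of zero
  | quux : Zorp → Zorp        -- plays the role of successor

-- Plays the role of addition, defined by recursion on the second argument.
def frob : Zorp → Zorp → Zorp
  | m, Zorp.blee   => m
  | m, Zorp.quux n => Zorp.quux (frob m n)

-- Analogue of `n + 0 = n`; provable only from the local definition above.
theorem frob_blee (m : Zorp) : frob m Zorp.blee = m := rfl
```

A model that has merely memorized proofs about `nat` gains nothing here; one that reasons from the local axioms can still close the goal by definitional unfolding.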