AI & ML Practical Magic

A 7B-parameter model proved more formal math theorems than a 671B-parameter giant by using a small Guide model to police its own reasoning.

April 23, 2026

Original Paper

Scaling Self-Play with Self-Guidance

Luke Bailey, Kaiyue Wen, Kefan Dong, Tatsunori Hashimoto, Tengyu Ma

arXiv · 2604.20209

The Takeaway

Scaling self-play with a dedicated guidance architecture yields a roughly 100x efficiency gain in theorem proving. The dual-model setup prevents the main model from hacking its own reward signal during training, so the 7B model eventually outperforms models nearly a hundred times its size by concentrating training on high-quality logical paths. This result challenges the assumption that only trillion-parameter models can handle the most complex formal reasoning, and the architecture provides a blueprint for running world-class reasoning engines on consumer-grade hardware.
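To make the dual-model idea concrete, here is a minimal toy sketch of guided self-play. The role names (Conjecturer, Guide, Solver) follow the paper, but every internal detail below — difficulty scores, the acceptance band, the success probabilities — is an invented stand-in for illustration, not the authors' implementation.

```python
import random

random.seed(0)

def conjecturer_propose(difficulty_bias):
    """Propose a problem. An unchecked Conjecturer can drift toward
    artificially hard problems (reward hacking), modeled here as a bias."""
    return {"difficulty": random.random() + difficulty_bias}

def guide_accept(problem, target=0.5, band=0.3):
    """Guide stand-in: keep only problems near the Solver's current
    frontier, rejecting degenerate, artificially complex conjectures."""
    return abs(problem["difficulty"] - target) <= band

def solver_attempt(problem):
    """Solver stand-in: succeeds more often on problems near its level."""
    return random.random() > problem["difficulty"] - 0.2

def self_play_round(n=100, difficulty_bias=0.0):
    """One round: propose n problems, filter with the Guide, attempt the rest.
    Returns (problems kept by the Guide, problems the Solver cracked)."""
    proposed = [conjecturer_propose(difficulty_bias) for _ in range(n)]
    kept = [p for p in proposed if guide_accept(p)]
    solved = sum(solver_attempt(p) for p in kept)
    return len(kept), solved

# A reward-hacking Conjecturer (high bias) gets most of its over-hard
# conjectures filtered out, so the Solver only trains on useful problems.
kept, solved = self_play_round(difficulty_bias=0.8)
print(kept, solved)
```

In this toy version the Guide is just a difficulty filter; in the paper it is itself a learned model, which is what keeps the loop from plateauing as compute scales.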

From the abstract

LLM self-play algorithms are notable in that, in principle, nothing bounds their learning: a Conjecturer model creates problems for a Solver, and both improve together. However, in practice, existing LLM self-play methods do not scale well with large amounts of compute, instead hitting learning plateaus. We argue this is because over long training runs, the Conjecturer learns to hack its reward, collapsing to artificially complex problems that do not help the Solver improve. To overcome this, we