POISE demonstrates the first autonomous, evidence-driven discovery of improved policy optimization algorithms for LLMs.
March 26, 2026
Original Paper
From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents
arXiv · 2603.23951
The Takeaway
POISE moves LLMs from the role of assistant toward that of 'scientist': an agent that independently iterates on policy optimization algorithms using evidence from training dynamics. The system discovered new variants of GRPO that significantly improved performance on math benchmarks (AIME25), suggesting a shift toward automated algorithm development.
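For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage at the core of the standard algorithm, the kind of mechanism a discovered variant would modify. It is not taken from the paper and does not reproduce any POISE-discovered variant; the function name, the `eps` constant, and the use of the population standard deviation are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    Baseline GRPO-style advantage: (reward - group mean) / group std.
    `eps` (an assumed value) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four sampled answers to one math prompt, scored 1 if correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative, and the advantages sum to roughly zero within a group. When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.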
From the abstract
Discovering improved policy optimization algorithms for language models remains a costly manual process requiring repeated mechanism-level modification and validation. Unlike simple combinatorial code search, this problem requires searching over algorithmic mechanisms tightly coupled with training dynamics while reusing empirical evidence across iterations. We propose POISE, a closed-loop framework for automated discovery of policy optimization algorithms for language models. POISE maintains a s