AI & ML Paradigm Shift

Combines prompt optimization with policy optimization to overcome the 'sparse reward' problem in complex reasoning tasks.

March 24, 2026

Original Paper

P^2O: Joint Policy and Prompt Optimization

Xinyu Lu, Kaiqi Zhang, Jinglin Yang, Boxi Cao, Yaojie Lu, Hongyu Lin, Min He, Xianpei Han, Le Sun

arXiv · 2603.21877

The Takeaway

Instead of relying solely on sparse outcome rewards for hard samples, P^2O uses prompt evolution to discover successful reasoning paths and then distills those gains directly into the model parameters. This allows models to learn from 'hard samples' that would normally provide zero supervision signal, significantly improving out-of-distribution performance.
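
To make the mechanism concrete, here is a hedged sketch of such a prompt-evolve-then-distill loop. Everything below is illustrative, not the paper's implementation: the helpers (mutate_prompt, rollout, verify) are toy stand-ins for an LLM sampler and an outcome verifier, and the mutation strategy is a placeholder.

```python
# A schematic sketch of a prompt-evolve-then-distill loop, assuming a
# simple evolutionary search over prompts. Helper names are hypothetical.
import random

def mutate_prompt(prompt: str) -> str:
    """Toy prompt mutation: append a randomly chosen hint (placeholder)."""
    hints = ["Think step by step.", "Decompose the problem.", "Check each step."]
    return prompt + " " + random.choice(hints)

def rollout(prompt: str, question: str) -> str:
    """Toy stand-in for sampling a reasoning path from the policy."""
    return f"[path under '{prompt}'] answer={random.choice(['42', '41'])}"

def verify(path: str, answer: str) -> bool:
    """Outcome verifier: did the path reach the reference answer?"""
    return path.endswith(f"answer={answer}")

def learn_from_hard_sample(question, answer, base_prompt,
                           n_rounds=8, n_candidates=4, n_rollouts=4):
    """Evolve prompts until some rollout verifies, then return the
    successful reasoning path for distillation into the model weights."""
    prompt = base_prompt
    for _ in range(n_rounds):
        candidates = [prompt] + [mutate_prompt(prompt)
                                 for _ in range(n_candidates - 1)]
        for cand in candidates:
            paths = [rollout(cand, question) for _ in range(n_rollouts)]
            successes = [p for p in paths if verify(p, answer)]
            if successes:
                # The sparse outcome reward finally fired: this path is
                # what gets distilled into the parameters, so the gain no
                # longer depends on the evolved prompt at test time.
                return random.choice(successes)
        prompt = random.choice(candidates)  # keep exploring prompt space
    return None  # still unsolved: no supervision signal this round

print(learn_from_hard_sample("What is 6 * 7?", "42", "Solve:"))
```

The design point is that the evolved prompt serves only as exploration scaffolding: once a verified path exists, it is distilled directly into the weights, so inference requires no special prompt.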

From the abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). However, vanilla RLVR suffers from inefficient exploration, particularly when confronting "hard samples" that yield near-zero success rates. In such scenarios, the reliance on sparse outcome rewards typically results in zero-advantage estimates, effectively starving the model of supervision signals despite the high informational value of these samples.
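
The zero-advantage failure mode is easy to see in code. Below is a minimal sketch, assuming a GRPO-style group-normalized advantage (the paper may use a different estimator): when every rollout on a hard sample fails, all rewards are identical, each reward equals the group mean, and every advantage collapses to zero.

```python
# A minimal sketch (not the paper's code) of why hard samples starve
# group-relative advantage estimators of signal, assuming GRPO-style
# normalization over a group of sampled rollouts.
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: (r_i - mean) / std, guarded against std=0."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:  # all rollouts received the same reward
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Easy sample: mixed outcomes -> nonzero advantages -> learning signal.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]

# Hard sample: every rollout fails -> all-zero advantages -> no gradient.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))  # [0.0, 0.0, 0.0, 0.0]
```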