FIPO overcomes reasoning length stagnation in LLMs by using Future-KL divergence to create dense rewards, extending Chain-of-Thought lengths to over 10,000 tokens.
March 23, 2026
Original Paper
FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization
arXiv · 2603.19835
The Takeaway
Standard RL for reasoning (like GRPO) suffers from coarse credit assignment. FIPO's dense advantage formulation allows models to distinguish critical logical pivots, enabling a 32B model to outperform o1-mini on AIME 2024 benchmarks.
From the abstract
We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO
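To make the contrast concrete, here is a minimal sketch of the credit-assignment difference the abstract describes. The `grpo_uniform_advantages` and `dense_advantages` functions, and the toy per-token influence scores, are illustrative assumptions, not the paper's actual implementation: GRPO broadcasts one group-normalized advantage over every token, while a dense scheme reweights that advantage token by token using an influence signal (here standing in for the Future-KL term).

```python
import numpy as np

def grpo_uniform_advantages(rewards):
    # GRPO-style: group-normalize outcome rewards; every token in a
    # trajectory then receives this same scalar advantage.
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def dense_advantages(traj_adv, token_influence):
    # Hypothetical dense scheme: scale the trajectory-level advantage
    # by a per-token influence score (a stand-in for the Future-KL
    # signal), normalized so the mean token weight is 1.
    w = np.asarray(token_influence, dtype=float)
    w = w / (w.mean() + 1e-8)
    return traj_adv * w

rewards = [1.0, 0.0, 0.0, 1.0]         # outcome rewards for 4 rollouts
adv = grpo_uniform_advantages(rewards)  # one scalar per rollout

# Toy per-token influence scores for rollout 0: token 2 is a
# "logical pivot", the rest are routine tokens.
influence = [0.1, 0.1, 2.0, 0.1, 0.2]
dense = dense_advantages(adv[0], influence)
```

Under the uniform scheme every token of rollout 0 gets the identical advantage `adv[0]`; the dense version keeps the same average signal but concentrates credit on the high-influence token, which is the property the abstract argues breaks the coarse-credit ceiling.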