AI & ML Paradigm Shift

FIPO overcomes reasoning length stagnation in LLMs by using Future-KL divergence to create dense rewards, extending Chain-of-Thought lengths to over 10,000 tokens.

March 23, 2026

Original Paper

FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, Jingren Zhou

arXiv · 2603.19835

The Takeaway

Standard RL for reasoning (like GRPO) suffers from coarse credit assignment. FIPO's dense advantage formulation allows models to distinguish critical logical pivots, enabling a 32B model to outperform o1-mini on AIME 2024 benchmarks.

From the abstract

We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO