Eliminates the need for expensive process reward models by propagating terminal rewards across state-space graphs to generate dense, state-level rewards for agentic RL.
March 20, 2026
Original Paper
RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models
arXiv · 2603.18859
The Takeaway
RewardFlow addresses the credit-assignment problem in LLM agentic reasoning by exploiting the topological structure of trajectory state graphs. This enables fine-grained, state-level optimization in RL without the heavy overhead of training or human-labeling dedicated process reward models.
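To make the core idea concrete, here is a minimal sketch of terminal-reward propagation over a state graph. Everything here is an illustrative assumption: the function name, the decayed-maximum update rule, and the graph representation are hypothetical, not the paper's actual propagation scheme.

```python
from collections import defaultdict, deque

def propagate_terminal_reward(edges, terminal_state, terminal_reward, decay=0.9):
    """Hypothetical sketch: push a sparse terminal reward backward over a
    state graph so every visited state gets a dense reward signal.

    Each state's reward is the decayed maximum over rewards reachable
    through its successors; RewardFlow's actual rule may differ.
    """
    # Reverse adjacency: successor -> list of predecessor states.
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)

    rewards = {terminal_state: terminal_reward}
    queue = deque([terminal_state])
    while queue:
        state = queue.popleft()
        for p in preds[state]:
            candidate = decay * rewards[state]
            # Keep the best (largest) propagated reward seen so far.
            if candidate > rewards.get(p, float("-inf")):
                rewards[p] = candidate
                queue.append(p)
    return rewards

# Toy trajectory graph: s0 -> s1 -> s2, terminal reward 1.0 at s2.
dense = propagate_terminal_reward(
    [("s0", "s1"), ("s1", "s2")], "s2", 1.0, decay=0.5
)
# → {'s2': 1.0, 's1': 0.5, 's0': 0.25}
```

In this sketch, states closer to the rewarded terminal state receive larger dense rewards, which is the kind of state-level signal that would otherwise require a learned process reward model.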
From the abstract
Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) with external environments. However, the inherent sparsity of terminal rewards hinders fine-grained, state-level optimization. Although process reward modeling offers a promising alternative, training dedicated reward models often entails substantial computational costs and scaling difficulties. To address these challenges, we introduce RewardFlow, a lightweight […]