PivotRL identifies 'pivot' turns in agent trajectories where actions matter most, enabling compute-efficient reinforcement learning that matches end-to-end RL at 4x lower cost.
March 24, 2026
Original Paper
PivotRL: High Accuracy Agentic Post-Training at Low Compute Cost
arXiv · 2603.21383
The Takeaway
By focusing rollouts on high-variance decision points and rewarding functional equivalence, PivotRL closes the efficiency gap in agentic post-training. It is already being used in production-scale models such as NVIDIA's Nemotron-3-Super.
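The core idea of locating high-variance decision points can be sketched simply: branch several rollouts from each turn and rank turns by how much their outcomes disagree. The function and data below are illustrative assumptions, not the paper's implementation.

```python
import statistics

def find_pivot_turns(rollout_rewards, top_k=2):
    """Rank turns by the variance of rewards across rollouts branched
    from that turn; high-variance turns are treated as pivots.

    rollout_rewards: dict mapping turn index -> list of rewards, one
    per sampled rollout (hypothetical data format for this sketch).
    """
    variances = {
        turn: statistics.pvariance(rewards)
        for turn, rewards in rollout_rewards.items()
        if len(rewards) > 1
    }
    return sorted(variances, key=variances.get, reverse=True)[:top_k]

# Toy trajectory: rollouts from turn 3 disagree sharply, so the action
# taken there largely decides success -- it is the pivot turn.
rewards = {
    1: [1.0, 1.0, 1.0],  # zero variance: outcome already fixed
    2: [0.9, 1.0, 0.9],  # low variance
    3: [0.0, 1.0, 0.0],  # high variance: pivotal decision point
}
print(find_pivot_turns(rewards, top_k=1))  # -> [3]
```

Concentrating on-policy rollout budget on such turns, rather than every turn, is what lets the method approach end-to-end RL quality at a fraction of the compute.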
From the abstract
Post-training for long-horizon agentic tasks faces a tension between compute efficiency and generalization. While supervised fine-tuning (SFT) is compute efficient, it often suffers from out-of-domain (OOD) degradation. Conversely, end-to-end reinforcement learning (E2E RL) preserves OOD capabilities, but incurs high compute costs due to many turns of on-policy rollout. We introduce PivotRL, a novel framework that operates on existing SFT trajectories to combine the compute efficiency of SFT with …