Dynamic constraints using an 'online refiner' resolve the conflict between stability and performance in Reinforcement Learning Fine-Tuning (RFT).
March 20, 2026
Original Paper
Enhancing Reinforcement Learning Fine-Tuning with an Online Refiner
arXiv · 2603.18088
The Takeaway
Traditional RFT relies on a fixed KL-regularization term, which caps how much the fine-tuned model can improve. This method instead uses the reference model as a real-time refiner that intervenes only when the model produces errors. This 'dynamic' approach allows much higher task rewards without the risk of model collapse, providing a superior alternative to the fixed constraints of standard PPO/DPO fine-tuning.
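The gating idea can be illustrated with a minimal sketch. Note this is an assumption-laden toy, not the paper's implementation: the function names (`dynamic_constraint_loss`), the per-sample `is_degenerate` flag, and the simple logit-level KL are all hypothetical stand-ins for whatever detection and refinement mechanism the paper actually uses. The point is only that the KL penalty toward the reference model is applied conditionally rather than uniformly:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    # KL(p || q) over a discrete distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dynamic_constraint_loss(samples, beta=0.1):
    """Toy RFT objective with a *dynamic* constraint (sketch only).

    Each sample is (reward, policy_logits, ref_logits, is_degenerate).
    Unlike a fixed KL penalty applied to every sample, the penalty
    here is gated: it contributes only when the output is flagged
    degenerate, leaving non-degenerate samples free to chase reward.
    """
    total = 0.0
    for reward, logits, ref_logits, degenerate in samples:
        p = softmax(logits)
        q = softmax(ref_logits)
        penalty = beta * kl_divergence(p, q) if degenerate else 0.0
        total += -reward + penalty  # minimize: negative reward + gated KL
    return total / len(samples)
```

With the flag off, a sample that diverges from the reference pays no penalty; with the flag on, the same divergence is penalized, pulling the policy back toward the reference only where stability is actually at risk.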
From the abstract
Constraints are essential for stabilizing reinforcement learning fine-tuning (RFT) and preventing degenerate outputs, yet they inherently conflict with the optimization objective because stronger constraints limit the ability of a fine-tuned model to discover better solutions. We propose 'dynamic constraints' that resolve this tension by adapting to the evolving capabilities of the fine-tuned model based on the insight that constraints should only intervene when degenerate outputs occur.