Exponential age decay prevents old data from poisoning the training of rapidly evolving language models.
April 23, 2026
Original Paper
Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
arXiv · 2604.16918
The Takeaway
Standard experience-reuse tricks from reinforcement learning usually fail when applied to large language models: the policy changes so quickly during training that old trajectories become irrelevant or actively misleading. This method applies an exponential decay to each experience's replay priority based on its age, gradually phasing out stale data. Models can therefore learn from past successes without getting stuck on outdated strategies, making training far more sample-efficient by reusing the right data at the right time. Memory-management techniques previously reserved for simpler RL systems can finally benefit LLM training.
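The core idea of freshness-weighted sampling can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class name, the `decay_rate` constant, and the flat-list buffer are all assumptions made for clarity.

```python
import math
import random


class FreshnessAwareReplayBuffer:
    """Sketch of prioritized replay with exponential age decay.

    Hypothetical illustration: each stored trajectory carries a base
    priority, which is multiplied by exp(-decay_rate * age) at sampling
    time so that stale experiences are gradually phased out.
    """

    def __init__(self, capacity=1000, decay_rate=0.1):
        self.capacity = capacity
        self.decay_rate = decay_rate  # lambda in exp(-lambda * age)
        self.buffer = []              # (trajectory, base_priority, step_added)
        self.step = 0                 # global training-step counter

    def add(self, trajectory, base_priority):
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)        # evict the oldest entry
        self.buffer.append((trajectory, base_priority, self.step))

    def tick(self):
        """Advance the global step after each gradient update."""
        self.step += 1

    def sample(self, k):
        """Draw k trajectories, weighted by freshness-decayed priority."""
        weights = [
            p * math.exp(-self.decay_rate * (self.step - t))
            for (_, p, t) in self.buffer
        ]
        return random.choices(
            [traj for (traj, _, _) in self.buffer], weights=weights, k=k
        )
```

With an aggressive decay rate, even a high-priority trajectory collected many steps ago is almost never sampled, while a modest-priority fresh one dominates: exactly the "timer" behavior described above.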
From the abstract
Reinforcement Learning (RL) has achieved impressive success in post-training Large Language Models (LLMs) and Vision-Language Models (VLMs), with on-policy algorithms such as PPO, GRPO, and REINFORCE++ serving as the dominant paradigm. However, these methods discard all collected trajectories after a single gradient update, resulting in poor sample efficiency, particularly wasteful for agentic tasks where multi-turn environment interactions are expensive. While Experience Replay drives sample efficiency …