Challenges the standard practice of deep PPO training by showing that consensus aggregation of 'wider' parallel runs is 8x more sample-efficient than running multiple epochs.
March 16, 2026
Original Paper
Optimize Wider, Not Deeper: Consensus Aggregation for Policy Optimization
arXiv · 2603.12596
The Takeaway
The paper shows that deep PPO epochs spend trust-region budget on 'wasteful' Fisher-orthogonal residuals. By optimizing parallel replicates instead and aggregating them in natural-parameter space, practitioners can achieve significantly higher stability and performance without extra environment interactions.
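The paper's aggregation step is described only as operating "in natural parameter space." As a minimal sketch of what that could look like for an exponential-family policy head, the snippet below averages K categorical policies in logit space (logits are the natural parameters of a categorical distribution) rather than in probability space; the replicate logits here are invented for illustration and the function name `consensus_natural` is an assumption, not the paper's API.

```python
import numpy as np

def consensus_natural(logit_list):
    """Aggregate categorical policies by averaging their logits.

    For a categorical distribution the logits are the natural
    parameters, so a consensus update in natural-parameter space
    is a plain arithmetic mean in logit space.
    """
    return np.mean(np.stack(logit_list), axis=0)

# Two hypothetical replicate policies over 3 actions.
replicates = [
    np.log(np.array([0.7, 0.2, 0.1])),
    np.log(np.array([0.5, 0.4, 0.1])),
]

agg_logits = consensus_natural(replicates)
agg_probs = np.exp(agg_logits) / np.exp(agg_logits).sum()
print(agg_probs)  # a valid distribution summing to 1
```

Averaging in natural-parameter space (rather than averaging probabilities) corresponds to a geometric mean of the replicate distributions, which is the aggregation that Fisher-geometry arguments typically favor.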
From the abstract
Proximal policy optimization (PPO) approximates the trust region update using multiple epochs of clipped SGD. Each epoch may drift further from the natural gradient direction, creating path-dependent noise. To understand this drift, we use Fisher information geometry to decompose policy updates into signal (the natural gradient projection) and waste (the Fisher-orthogonal residual that consumes trust region budget without first-order surrogate improvement). Empirically, signal saturates but […]
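The decomposition in the abstract can be made concrete with a toy numeric check: given a gradient g and a Fisher matrix F, project an update onto the natural-gradient direction F⁻¹g under the Fisher inner product ⟨u, v⟩_F = uᵀF v. The residual is then Fisher-orthogonal and, as the abstract states, contributes nothing to the first-order surrogate improvement g·Δθ. The matrices and update below are random stand-ins, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Hypothetical symmetric positive-definite Fisher matrix and policy gradient.
A = rng.standard_normal((d, d))
F = A @ A.T + d * np.eye(d)
g = rng.standard_normal(d)

nat = np.linalg.solve(F, g)  # natural gradient direction F^{-1} g


def fisher_dot(u, v):
    """Fisher inner product <u, v>_F = u^T F v."""
    return u @ F @ v


delta = rng.standard_normal(d)  # stand-in for an update after several epochs

# Signal: projection of delta onto nat under the Fisher metric.
coef = fisher_dot(delta, nat) / fisher_dot(nat, nat)
signal = coef * nat
waste = delta - signal  # Fisher-orthogonal residual

# waste is Fisher-orthogonal to nat, so <waste, nat>_F = nat^T F waste
# = (F nat)^T waste = g^T waste = 0: the residual adds no first-order
# surrogate improvement, yet still enlarges the KL ~ 0.5 d^T F d budget.
print(fisher_dot(waste, nat))   # ~0
print(g @ delta - g @ signal)   # ~0: all surrogate gain comes from signal
```

Note the identity used in the final comment: because g = F·nat, Fisher-orthogonality to the natural gradient is exactly Euclidean orthogonality to the gradient, which is why the residual is "waste" at first order.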