Challenges a core constraint in statistical learning theory by proving that optimal $\sqrt{N}$ convergence is achievable for offline policy learning even with model classes that exceed the standard Donsker complexity limit.
March 31, 2026
Original Paper
Functional Natural Policy Gradients
arXiv · 2603.28681
The Takeaway
The paper provides a theoretical breakthrough for offline reinforcement learning by showing that the complexity of the policy model (e.g., a massive neural network) does not inherently dictate the statistical rate of convergence. By decoupling policy complexity from environment dynamics via a novel debiasing device, it justifies the use of high-capacity models in data-constrained settings where conventional wisdom suggested they would fail to converge efficiently.
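Read loosely, and with notation of my own (only the abstract is quoted below, so the exact form is an assumption), the factorization described has the shape

$$\text{Regret}(\hat{\pi}_N) \;\lesssim\; \underbrace{\varepsilon_{\Pi}(N)}_{\text{plug-in policy error}} \times \underbrace{\varepsilon_{\text{env}}(N)}_{\text{environment nuisance}} \;+\; \underbrace{\|\hat{\mu}-\mu\|\cdot\|\hat{e}-e\|}_{\text{product-of-errors remainder}},$$

so that $\sqrt{N}$ regret can survive a non-Donsker policy class as long as the nuisance product, not the policy class itself, is driven to $O(N^{-1/2})$ (e.g., two nuisance models each estimated at an $o(N^{-1/4})$ rate, as in the double machine learning literature).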
From the abstract
We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is $\sqrt{N}$ regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$. The regret bound factors into a plug-in policy error factor governed by policy-class complexity and an environment nuisance factor governed by the complexity of the environment dynamics, making explicit how one may […]
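To make the "cross-fitted debiasing" pattern concrete, here is a minimal sketch of the generic construction from the double machine learning literature, specialized to offline contextual-bandit data with binary actions. This is not the paper's algorithm (nothing beyond the abstract is quoted above); every name and modeling choice here (gradient-boosted nuisance models, the AIPW score, a `target_policy` callable returning actions) is an illustrative assumption.

```python
# Sketch: cross-fitted, doubly robust (AIPW) policy value estimation from
# logged data (X, A, R). Generic double-ML pattern, NOT the paper's method.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

def cross_fitted_policy_value(X, A, R, target_policy, n_folds=5, seed=0):
    """Estimate V(pi) = E[R under pi] from offline data.

    Cross-fitting: the nuisances (outcome model mu-hat and behavior
    propensity e-hat) are fit on K-1 folds and only evaluated on the
    held-out fold, so their estimation error enters the score roughly
    through the product ||mu-hat - mu|| * ||e-hat - e|| -- the
    'product-of-errors' remainder the abstract requires to be O(N^{-1/2}).
    Assumes actions A are coded as 0/1.
    """
    n = len(X)
    psi = np.zeros(n)  # per-sample debiased scores
    for train_idx, eval_idx in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        # Nuisance 1: outcome regression mu(x, a) ~ E[R | X=x, A=a]
        mu = GradientBoostingRegressor().fit(
            np.column_stack([X[train_idx], A[train_idx]]), R[train_idx])
        # Nuisance 2: behavior propensity e(a | x)
        e = GradientBoostingClassifier().fit(X[train_idx], A[train_idx])

        Xe, Ae, Re = X[eval_idx], A[eval_idx], R[eval_idx]
        pi_a = target_policy(Xe)  # action the target policy picks at each x
        mu_pi = mu.predict(np.column_stack([Xe, pi_a]))
        mu_obs = mu.predict(np.column_stack([Xe, Ae]))
        prop = e.predict_proba(Xe)[np.arange(len(Xe)), Ae.astype(int)]
        match = (Ae == pi_a).astype(float)
        # AIPW score: plug-in term + inverse-propensity correction term.
        psi[eval_idx] = mu_pi + match / np.clip(prop, 1e-3, None) * (Re - mu_obs)
    return psi.mean()
```

A policy learner would then maximize `cross_fitted_policy_value` over the policy class; the point of the debiasing is that the class's complexity shows up only through the plug-in factor, while the environment dynamics show up only through the nuisance factor.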