Challenges a core constraint in statistical learning theory by proving that optimal $\sqrt{N}$ convergence is achievable for offline policy learning even with model classes that exceed the standard Donsker complexity limit.
March 31, 2026
Original Paper
Functional Natural Policy Gradients
arXiv · 2603.28681
The Takeaway
The paper provides a theoretical breakthrough for offline reinforcement learning by showing that the complexity of the policy model (e.g., a massive neural network) does not inherently dictate the statistical rate of convergence. By decoupling policy complexity from environment dynamics via a novel debiasing device, it justifies the use of high-capacity models in data-constrained settings where conventional wisdom suggested they would fail to converge efficiently.
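Read loosely, and with notation of my own (only the abstract is quoted below, so the exact form is an assumption), the factorization described has the shape

$$\text{Regret}(\hat{\pi}_N) \;\lesssim\; \underbrace{\varepsilon_{\Pi}(N)}_{\text{plug-in policy error}} \times \underbrace{\varepsilon_{\text{env}}(N)}_{\text{environment nuisance}} \;+\; \underbrace{\|\hat{\mu}-\mu\|\cdot\|\hat{e}-e\|}_{\text{product-of-errors remainder}},$$

so that $\sqrt{N}$ regret can survive a non-Donsker policy class as long as the nuisance product, not the policy class itself, is driven to $O(N^{-1/2})$ (e.g., two nuisance models each estimated at an $o(N^{-1/4})$ rate, as in the double machine learning literature).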
From the abstract
We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is $\sqrt{N}$ regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is $O(N^{-1/2})$. The regret bound factors into a plug-in policy error factor governed by policy-class complexity and an environment nuisance factor governed by the complexity of the environment dynamics, making explicit how one may […]
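To make the "cross-fitted debiasing" pattern concrete, here is a minimal sketch of the generic construction from the double machine learning literature, specialized to offline contextual-bandit data with binary actions. This is not the paper's algorithm (nothing beyond the abstract is quoted above); every name and modeling choice here (gradient-boosted nuisance models, the AIPW score, a `target_policy` callable returning actions) is an illustrative assumption.

```python
# Sketch: cross-fitted, doubly robust (AIPW) policy value estimation from
# logged data (X, A, R). Generic double-ML pattern, NOT the paper's method.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

def cross_fitted_policy_value(X, A, R, target_policy, n_folds=5, seed=0):
    """Estimate V(pi) = E[R under pi] from offline data.

    Cross-fitting: the nuisances (outcome model mu-hat and behavior
    propensity e-hat) are fit on K-1 folds and only evaluated on the
    held-out fold, so their estimation error enters the score roughly
    through the product ||mu-hat - mu|| * ||e-hat - e|| -- the
    'product-of-errors' remainder the abstract requires to be O(N^{-1/2}).
    Assumes actions A are coded as 0/1.
    """
    n = len(X)
    psi = np.zeros(n)  # per-sample debiased scores
    for train_idx, eval_idx in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        # Nuisance 1: outcome regression mu(x, a) ~ E[R | X=x, A=a]
        mu = GradientBoostingRegressor().fit(
            np.column_stack([X[train_idx], A[train_idx]]), R[train_idx])
        # Nuisance 2: behavior propensity e(a | x)
        e = GradientBoostingClassifier().fit(X[train_idx], A[train_idx])

        Xe, Ae, Re = X[eval_idx], A[eval_idx], R[eval_idx]
        pi_a = target_policy(Xe)  # action the target policy picks at each x
        mu_pi = mu.predict(np.column_stack([Xe, pi_a]))
        mu_obs = mu.predict(np.column_stack([Xe, Ae]))
        prop = e.predict_proba(Xe)[np.arange(len(Xe)), Ae.astype(int)]
        match = (Ae == pi_a).astype(float)
        # AIPW score: plug-in term + inverse-propensity correction term.
        psi[eval_idx] = mu_pi + match / np.clip(prop, 1e-3, None) * (Re - mu_obs)
    return psi.mean()
```

A policy learner would then maximize `cross_fitted_policy_value` over the policy class; the point of the debiasing is that the class's complexity shows up only through the plug-in factor, while the environment dynamics show up only through the nuisance factor.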