AI & ML Paradigm Shift

Demonstrates that the stochasticity inherent in standard regularized model training (e.g., random cross-validation splits) can serve as a 'free' and effective exploration strategy for contextual bandits.

March 13, 2026

Original Paper

RIE-Greedy: Regularization-Induced Exploration for Contextual Bandits

Tong Li, Thiago de Queiroz Casanova, Eric M. Schwartz, Victor Kostyuk, Dehan Kong, Joseph J. Williams

arXiv · 2603.11276

The Takeaway

Practitioners often struggle to apply Thompson Sampling or UCB to complex black-box estimators like Gradient Boosted Trees. This work shows that pure-greedy selection using regularized models naturally induces exploration, providing a simpler and more scalable alternative for large-scale production bandit systems.
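To make the idea concrete, here is a minimal sketch (not the authors' implementation) of how regularization-induced exploration can arise: a pure-greedy linear bandit where each arm's ridge penalty is re-selected every round by cross-validation over random fold splits. The random splits make the selected penalty, and hence the fitted model and the greedy action, stochastic, with no explicit exploration bonus. All names, the 2-arm toy environment, and the warm-start schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ridge(X, y, lam):
    """Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_select_lambda(X, y, lams, n_splits=2, rng=rng):
    """Pick a ridge penalty by cross-validation over RANDOM fold splits.
    The randomness of the splits is the (assumed) source of 'free'
    exploration: refitting on new splits can change the chosen penalty
    and therefore the greedy action."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_splits)
    errs = []
    for lam in lams:
        err = 0.0
        for k in range(n_splits):
            val = folds[k]
            tr = np.concatenate([folds[j] for j in range(n_splits) if j != k])
            w = fit_ridge(X[tr], y[tr], lam)
            err += np.mean((X[val] @ w - y[val]) ** 2)
        errs.append(err)
    return lams[int(np.argmin(errs))]

# Toy 2-arm contextual bandit with linear rewards (illustrative only).
d, T = 3, 300
true_w = [np.array([1.0, 0.5, -0.5]), np.array([-0.5, 1.0, 0.5])]
data = {a: ([], []) for a in range(2)}      # per-arm (contexts, rewards)
lams = [0.01, 0.1, 1.0, 10.0]

for t in range(T):
    x = rng.normal(size=d)
    if t < 10:
        a = t % 2                            # warm start: a few pulls per arm
    else:
        scores = []
        for arm in range(2):
            Xa = np.array(data[arm][0])
            ya = np.array(data[arm][1])
            lam = cv_select_lambda(Xa, ya, lams)   # stochastic via random splits
            scores.append(x @ fit_ridge(Xa, ya, lam))
        a = int(np.argmax(scores))           # pure-greedy: no exploration bonus
    r = x @ true_w[a] + rng.normal(scale=0.1)
    data[a][0].append(x)
    data[a][1].append(r)

print(sum(len(v[1]) for v in data.values()))  # total pulls == T
```

The point of the sketch is that nothing in the action-selection step is explicitly randomized; the only stochasticity comes from the model-fitting pipeline itself, which is what the paper argues suffices for exploration.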

From the abstract

Real-world contextual bandit problems with complex reward models are often tackled with iteratively trained models, such as boosting trees. However, it is difficult to directly apply simple and effective exploration strategies, such as Thompson Sampling or UCB, on top of those black-box estimators. Existing approaches rely on sophisticated assumptions or intractable procedures that are hard to verify and implement in practice. In this work, we explore the use of an exploration-free (pure-greedy) …