AI & ML Efficiency Breakthrough

Discounted Beta-Bernoulli (DBB) reward estimation solves the variance collapse and sample inefficiency inherent in point-estimation RLVR methods for LLM reasoning.

March 20, 2026

Original Paper

Discounted Beta-Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards

Haechan Kim, Soohyun Ryu, Gyouk Chu, Doohyuk Jang, Eunho Yang

arXiv · 2603.18444

The Takeaway

As LLM post-training shifts toward RL with verifiable rewards (RLVR), standard point estimation of rewards from a handful of rollouts is reaching its limits. DBB achieves significant accuracy gains (+12 points OOD) without additional compute by leveraging historical reward statistics to stabilize training.

From the abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often suffer from severe sample inefficiency. This inefficiency stems from reliance on point estimation of rewards from a small number of rollouts, leading to high estimation variance, variance collapse, and ineffective utilization of generated responses. In this work, we reformulate …
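To make the contrast concrete, here is a minimal sketch of a discounted Beta-Bernoulli reward estimator. It assumes DBB tracks per-prompt Beta pseudo-counts over binary verifiable rewards and exponentially discounts old evidence toward the prior; the class name, decay rule, and hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
class DiscountedBetaBernoulli:
    """Illustrative discounted Beta-Bernoulli estimator for binary rewards.

    Maintains Beta(alpha, beta) pseudo-counts per prompt. Before each new
    batch of rollouts, past evidence decays by `gamma` toward the prior,
    so stale history fades while the estimator never collapses to a
    degenerate point estimate.
    """

    def __init__(self, alpha0=1.0, beta0=1.0, gamma=0.9):
        self.alpha0, self.beta0 = alpha0, beta0  # prior pseudo-counts
        self.gamma = gamma                       # discount on past evidence
        self.alpha, self.beta = alpha0, beta0

    def update(self, rewards):
        """Fold in a batch of binary rewards (1 = verified correct)."""
        successes = sum(rewards)
        failures = len(rewards) - successes
        # Decay accumulated evidence toward the prior, then add new counts.
        self.alpha = self.gamma * self.alpha + (1 - self.gamma) * self.alpha0 + successes
        self.beta = self.gamma * self.beta + (1 - self.gamma) * self.beta0 + failures

    def mean(self):
        """Posterior mean success probability."""
        return self.alpha / (self.alpha + self.beta)

    def variance(self):
        """Posterior variance; stays strictly positive even when all
        rollouts in a small group agree (unlike the empirical variance
        of a point estimate, which collapses to zero)."""
        a, b = self.alpha, self.beta
        return a * b / ((a + b) ** 2 * (a + b + 1))
```

For example, a group of four rollouts that all succeed gives an empirical point estimate of exactly 1.0 with zero variance (so group-normalized advantages vanish), whereas the Beta posterior above yields a mean below 1.0 and a nonzero variance, keeping the gradient signal alive.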