AI & ML Breaks Assumption

Reveals that RL from verifiable rewards (RLVR) fails to improve general QA because models learn reward 'shortcuts', and proposes START to fix it.

March 24, 2026

Original Paper

RLVR Training of LLMs Does Not Improve Thinking Ability for General QA: Evaluation Method and a Simple Solution

Kaiyuan Li, Jing-Cheng Pang, Yang Yu

arXiv · 2603.20799

The Takeaway

The paper identifies a critical limitation of 'O1-style' reasoning RL: models learn to game the verifiable reward without developing high-quality thinking that transfers to non-verifiable tasks. The proposed START method decouples thinking training from response training, offering a blueprint for improving reasoning in general-purpose LLMs.
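To make the idea of decoupling concrete, here is a minimal, purely illustrative sketch of what a decoupled reward could look like. This is an assumption-laden toy, not the paper's actual START implementation: the `<think>` tag format, the scorer functions, and the `alpha` weighting are all hypothetical stand-ins for whatever signals the authors use.

```python
import re

def split_thinking_and_answer(generation: str) -> tuple[str, str]:
    """Split an O1-style generation into its thinking trace and final answer.

    Assumes a hypothetical <think>...</think> delimiter; real formats vary.
    """
    match = re.search(r"<think>(.*?)</think>(.*)", generation, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return "", generation.strip()

def decoupled_reward(generation: str,
                     thinking_scorer,
                     answer_scorer,
                     alpha: float = 0.5) -> float:
    """Combine separate rewards for the thinking trace and the response.

    Plain RLVR rewards only the verifiable final answer, so a model can
    'shortcut' to it with low-quality reasoning. Scoring the trace
    separately (the decoupling idea) removes that loophole.
    """
    thinking, answer = split_thinking_and_answer(generation)
    return alpha * thinking_scorer(thinking) + (1 - alpha) * answer_scorer(answer)

# Toy scorers for illustration only; a real system would use a learned
# reward model for the trace and a verifier for the answer.
gen = "<think>Paris is the capital of France.</think>Paris"
reward = decoupled_reward(
    gen,
    thinking_scorer=lambda t: 1.0 if len(t) > 10 else 0.0,
    answer_scorer=lambda a: 1.0 if a == "Paris" else 0.0,
)
```

The key design point is that the thinking trace now carries its own gradient signal, so a correct answer reached via a degenerate trace no longer earns full reward.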

From the abstract

Reinforcement learning from verifiable rewards (RLVR) stimulates the thinking processes of large language models (LLMs), substantially enhancing their reasoning abilities on verifiable tasks. It is often assumed that similar gains should transfer to general question answering (GQA), but this assumption has not been thoroughly validated. To assess whether RLVR automatically improves LLM performance on GQA, we propose a Cross-Generation evaluation framework that measures the quality of intermediat