Introduces HiLL, a framework that jointly trains a 'hinter' and 'reasoner' to prevent advantage collapse in reinforcement learning for hard tasks.
April 2, 2026
Original Paper
Learning to Hint for Reinforcement Learning
arXiv · 2604.00698
The Takeaway
Instead of relying on static hints for LLM reasoning, HiLL adaptively generates hints based on the reasoner's specific errors. This provides a consistent learning signal even when the base model would otherwise receive zero reward on difficult problems.
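A minimal sketch of the failure mode being addressed, assuming GRPO's usual group-normalized advantage (reward standardized within each group of rollouts); the function name and exact normalization details are illustrative, not taken from the paper:

```python
def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantage: standardize rewards within a rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed outcomes: nonzero advantages, so there is a learning signal.
print(group_advantages([1.0, 0.0, 1.0, 0.0]))

# Hard question, all rollouts incorrect: every advantage is exactly zero,
# so the policy gradient for this group vanishes ("advantage collapse").
print(group_advantages([0.0, 0.0, 0.0, 0.0]))
```

If a hint lifts even one rollout to a correct answer, the rewards within the group differ again and the advantages become nonzero, which is the signal-restoration effect the Takeaway describes.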
From the abstract
Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same reward, the group yields zero relative advantage and thus no learning signal. For example, if a question is too hard for the reasoner, all sampled rollouts can be incorrect and receive zero reward. Recent work addresses this issue by adding hints or auxiliary scaffolds to such hard questions so tha