Matches the performance of a complex SFT+GRPO reasoning pipeline for vision-language models in one-seventh of the training time.
March 20, 2026
Original Paper
Balanced Thinking: Improving Chain of Thought Training in Vision Language Models
arXiv · 2603.18656
The Takeaway
The paper introduces SCALe (Scheduled Curriculum Adaptive Loss), which addresses the token imbalance in long reasoning traces. This lets practitioners train reasoning-capable VLMs with significantly less compute while avoiding the stability issues of reinforcement learning.
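The excerpt does not give SCALe's exact formulation, but the core idea it describes, separating supervision over reasoning and answer tokens so short, task-critical segments are not drowned out by long traces, can be illustrated with a minimal sketch. Everything below (the function names, the cosine schedule, the specific start/end weights) is a hypothetical illustration, not the paper's actual loss:

```python
import math

def answer_weight(step, total_steps, w_start=1.0, w_end=4.0):
    """Hypothetical curriculum: the weight on answer tokens ramps from
    w_start to w_end over training following a cosine schedule."""
    t = min(step / total_steps, 1.0)
    return w_end + (w_start - w_end) * 0.5 * (1 + math.cos(math.pi * t))

def scale_style_loss(token_losses, is_answer, step, total_steps):
    """Weighted mean of per-token losses: reasoning tokens keep weight 1,
    answer tokens get the scheduled weight, so the loss no longer treats
    every token in a long trace equally."""
    w = answer_weight(step, total_steps)
    weights = [w if a else 1.0 for a in is_answer]
    total = sum(wi * li for wi, li in zip(weights, token_losses))
    return total / sum(weights)
```

Early in training the answer weight is 1.0, so the loss matches standard SFT; by the end, answer tokens count several times more than reasoning tokens, which is one plausible way to realize the "scheduled curriculum" the name suggests.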
From the abstract
Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long traces overshadow short but task-critical segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and