Moving beyond coarse reward signals, this paper introduces token-level policy optimization for multimodal reasoning.
March 25, 2026
Original Paper
Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought
arXiv · 2603.22847
The Takeaway
By identifying distinct 'perceptual grounding' and 'exploratory' token dynamics, the authors enable fine-grained RL that optimizes the reasoning trajectory itself rather than just the final output. This significantly improves performance on complex visual puzzles and geometry tasks where standard RL often fails due to sparse rewards.
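The fine-grained idea can be sketched as a token-level policy-gradient loss in which each token's contribution is scaled by a weight tied to its role. This is an illustrative sketch only: the function name, the "grounding"/"exploratory" labels, and the weight values are assumptions, not the paper's actual formulation.

```python
def token_level_pg_loss(logprobs, sequence_advantage, token_types, weights=None):
    """REINFORCE-style loss with per-token weighting.

    Instead of applying the (verifiable) sequence-level advantage uniformly,
    each token's log-probability is scaled by a weight depending on whether
    it was tagged as a perceptual-grounding or an exploratory token.
    Labels and weight values here are hypothetical.
    """
    if weights is None:
        weights = {"grounding": 1.5, "exploratory": 1.0}
    loss = 0.0
    for lp, ttype in zip(logprobs, token_types):
        # Weighted negative log-likelihood scaled by the trajectory advantage.
        loss -= weights[ttype] * sequence_advantage * lp
    return loss
```

With uniform weights this reduces to the coarse-grained RLVR objective; differentiated weights let the update emphasize grounding tokens within an otherwise identical trajectory-level reward.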
From the abstract
Multimodal Chain-of-Thought (CoT) reasoning requires large vision-language models to construct reasoning trajectories that interleave perceptual grounding with multi-step inference. However, existing Reinforcement Learning with Verifiable Rewards (RLVR) methods typically optimize reasoning at a coarse granularity, treating CoT tokens uniformly without distinguishing their varying degrees of visual grounding. In this work, we conduct a token-level analysis of multimodal reasoning trajectories and show t