AI & ML Scaling Insight

MSRL scales multimodal reward modeling by transferring reasoning capabilities from text to vision-language tasks without requiring new multimodal preference data.

March 27, 2026

Original Paper

MSRL: Scaling Generative Multimodal Reward Modeling via Multi-Stage Reinforcement Learning

Chenglong Wang, Yifu Huo, Yang Gan, Qiaozhi He, Qi Meng, Bei Li, Yan Wang, Junfu Liu, Tianhua Zhou, Jingbo Zhu, Tong Xiao

arXiv · 2603.25108

The Takeaway

Multimodal preference data is notoriously expensive to label. This multi-stage reinforcement learning approach lets developers build high-performance generative reward models (a core ingredient of RLHF pipelines) from much cheaper, large-scale text-only preference data, transferring the learned reasoning to vision-language reward judgments.
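To make the verifiable-reward idea concrete, here is a minimal sketch of the training signal used in RLVR-style reward-model training: the generative reward model emits a judgment, and the reward is simply whether that judgment matches a cheap text-only preference label. All names here are illustrative assumptions, not the paper's actual pipeline, which is multi-stage and more involved.

```python
def verifiable_reward(model_judgment: str, gold_label: str) -> float:
    """Binary verifiable reward: 1.0 if the generated judgment picks
    the response that the preference label prefers, else 0.0."""
    return 1.0 if model_judgment.strip().upper() == gold_label.upper() else 0.0

# A textual preference pair -- no multimodal annotation required.
# (Hypothetical example for illustration only.)
example = {
    "prompt": "Which reply is more helpful?",
    "response_a": "Sure, here is a step-by-step answer...",
    "response_b": "I don't know.",
    "gold": "A",  # label from large-scale text-only preference data
}

# Suppose the generative reward model outputs the judgment "A":
reward = verifiable_reward("A", example["gold"])
print(reward)  # 1.0
```

Because the reward is computed against an existing label rather than a human annotator, this signal scales with whatever labeled text data is already available, which is the cost advantage the takeaway refers to.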

From the abstract

Recent advances in multimodal reward modeling have been largely driven by a paradigm shift from discriminative to generative approaches. Building on this progress, recent studies have further employed reinforcement learning from verifiable rewards (RLVR) to enhance multimodal reward models (MRMs). Despite their success, RLVR-based training typically relies on labeled multimodal preference data, which are costly and labor-intensive to obtain, making it difficult to scale MRM training. To overcome […]