AI & ML New Capability

Enables multimodal models to self-evolve their reasoning without human labels or external reward models.

March 24, 2026

Original Paper

When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning

Zhengxian Wu, Kai Shi, Chuanrui Zhang, Zirui Liao, Jun Yang, Ni Yang, Qiuying Peng, Luyuan Zhang, Hangrui Xu, Tianhuang Su, Zhenyu Yang, Haonan Lu, Haoqian Wang

arXiv · 2603.21289

The Takeaway

This framework uses an internal 'Actor-Judge' consistency signal to reweight and update reasoning policies. It points to a scalable path toward self-improving multimodal agents that learn from their own internal signals, trained with group-relative policy optimization rather than external supervision.
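
To make the idea concrete, here is a minimal sketch of how a self-consistency signal, reweighted by a judge score, could be turned into group-relative advantages. It is not the paper's implementation: the function name `group_relative_advantages`, the agreement-times-judge-score reward, and the example values are all assumptions made for illustration. Only the final step, normalizing rewards by the group mean and standard deviation, follows the usual group-relative formulation.

```python
from collections import Counter
from statistics import mean, pstdev


def group_relative_advantages(answers, judge_scores, eps=1e-6):
    """Hypothetical sketch: label-free advantages for one group of samples.

    answers      -- final answers sampled from the actor for a single input
    judge_scores -- scores in [0, 1] the same model assigns to each sample
                    when acting as judge (assumed scoring scheme)
    """
    if len(answers) != len(judge_scores):
        raise ValueError("answers and judge_scores must have the same length")
    if not answers:
        raise ValueError("need at least one sampled answer")

    # Actor-side signal: share of samples that agree with each answer.
    counts = Counter(answers)
    n = len(answers)
    consistency = [counts[a] / n for a in answers]

    # Assumed reweighting: scale agreement by the judge's score.
    rewards = [c * j for c, j in zip(consistency, judge_scores)]

    # Group-relative step: center and scale rewards within the group.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    sampled = ["A", "A", "B", "A"]       # illustrative answers, not real data
    scores = [0.9, 0.8, 0.4, 0.7]        # illustrative judge scores
    for ans, adv in zip(sampled, group_relative_advantages(sampled, scores)):
        print(ans, round(adv, 3))
```

In this sketch, samples that agree with the majority and are rated well by the judge receive positive advantages, while the outlier receives a negative one. No ground-truth answer or external reward model appears anywhere in the computation.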

From the abstract

Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data or teacher-model distillation, both of which are costly and difficult to scale. To address this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without using human-annotated answers or external reward models. For each input, we sample multiple