Enables multimodal models to self-evolve their reasoning without human labels or external reward models.
March 24, 2026
Original Paper
When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning
arXiv · 2603.21289
The Takeaway
This framework uses an internal 'Actor-Judge' consistency signal to reweight and update reasoning policies. It demonstrates a scalable path toward self-improving multimodal agents that can learn purely from their own internal logic and group-relative policy optimization.
From the abstract
Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data or teacher-model distillation, both of which are costly and difficult tothis http URLaddress this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without using human-annotated answers or external reward models. For each input, we sample multiple