A unified reinforcement learning framework that jointly optimizes reasoning (text) and synthesis (image) for interleaved multimodal generation.
March 25, 2026
Original Paper
UniGRPO: Unified Policy Optimization for Reasoning-Driven Visual Generation
arXiv · 2603.23500
The Takeaway
As models move toward 'o1-style' reasoning for visual tasks, training them with reinforcement learning becomes difficult to stabilize. UniGRPO integrates Flow Matching with GRPO, providing a stable training signal (an MSE on velocity fields) so that reasoning-driven image generation can scale without reward hacking.
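To make the idea concrete, here is a minimal sketch of how a velocity-field MSE can plug into a GRPO-style group-relative update. This is my own illustrative reading, not the paper's implementation: the function names are placeholders, the rewards are assumed to come from some external scorer, and the exact role the paper assigns to the MSE (likelihood surrogate vs. reward term) may differ.

```python
# Hypothetical sketch (not the paper's code): a flow-matching velocity-field
# MSE inside a GRPO-style group-relative update. All names are placeholders.
import torch

def velocity_mse(v_pred: torch.Tensor, v_target: torch.Tensor) -> torch.Tensor:
    """Per-sample MSE between predicted and target velocity fields; shape [G]."""
    return ((v_pred - v_target) ** 2).flatten(1).mean(dim=1)

def group_advantages(scores: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style normalization within one prompt's group of rollouts."""
    return (scores - scores.mean()) / (scores.std() + eps)

def image_branch_loss(v_pred, v_target, rewards):
    """Advantage-weighted flow-matching objective for one rollout group.

    Minimizing the velocity MSE more strongly on high-advantage samples
    (and pushing it up on low-advantage ones) steers the generator toward
    rollouts the reward prefers, without estimating diffusion
    log-likelihoods directly.
    """
    adv = group_advantages(rewards).detach()      # [G], no gradient through rewards
    per_sample = velocity_mse(v_pred, v_target)   # [G], differentiable w.r.t. v_pred
    return (adv * per_sample).mean()
```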
From the abstract
Unified models capable of interleaved generation have emerged as a promising paradigm, with the community increasingly converging on autoregressive modeling for text and flow matching for image generation. To advance this direction, we propose a unified reinforcement learning framework tailored for interleaved generation. We validate our approach on its fundamental unit: a single round of reasoning-driven image generation, where the model first expands the user prompt through reasoning, followed …
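The "fundamental unit" the abstract describes can be sketched as a two-step loop: autoregressive reasoning first, then flow-matching synthesis conditioned on that reasoning. The code below is an assumed illustration of that flow; the two callables and all names stand in for the unified model's text and image heads and are not the paper's API.

```python
# Hypothetical sketch of one round of reasoning-driven image generation.
from typing import Callable, Tuple
import torch

def single_round(
    expand_prompt: Callable[[str], str],            # autoregressive text head
    predict_velocity: Callable[..., torch.Tensor],  # flow-matching image head
    user_prompt: str,
    shape: Tuple[int, ...] = (3, 64, 64),
    num_steps: int = 50,
) -> Tuple[str, torch.Tensor]:
    # 1. Reasoning first: expand the user prompt into a richer description
    #    before any pixels are produced.
    reasoning = expand_prompt(user_prompt)

    # 2. Synthesis second: Euler-integrate the learned velocity field from
    #    noise (t=0) toward an image (t=1), conditioned on prompt + reasoning.
    x = torch.randn(shape)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * predict_velocity(x, t, cond=(user_prompt, reasoning))
    return reasoning, x
```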