Tuning autoregressive image models with Group Relative Policy Optimization (GRPO) matches the quality of Classifier-Free Guidance without its 2x inference cost.
March 25, 2026
Original Paper
Policy-based Tuning of Autoregressive Image Models with Instance- and Distribution-Level Rewards
arXiv · 2603.23086
The Takeaway
The paper's GRPO-based tuning enables RL alignment for image generation that improves both quality and diversity without mode collapse. By removing the need for Classifier-Free Guidance, it avoids CFG's doubled forward passes and delivers a significant inference speedup for production-grade image synthesis.
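As a minimal sketch of the idea behind GRPO (not the paper's implementation): for each prompt, a group of samples is drawn and each sample's reward is normalized against the group's mean and standard deviation, yielding a relative advantage without a learned value model. The reward values below are illustrative placeholders.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sample's reward
    against the mean and std of its own group (the core of GRPO)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # eps avoids divide-by-zero

# e.g. rewards for four images sampled from the same prompt
adv = grpo_advantages([0.9, 0.4, 0.6, 0.5])
```

Samples above the group mean receive positive advantages and are reinforced; those below are suppressed, all relative to the policy's own current samples.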
From the abstract
Autoregressive (AR) models are highly effective for image generation, yet their standard maximum-likelihood estimation training lacks direct optimization for sample quality and diversity. While reinforcement learning (RL) has been used to align diffusion models, these methods typically suffer from output diversity collapse. Similarly, concurrent RL methods for AR models rely strictly on instance-level rewards, often trading off distributional coverage for quality. To address these limitations, w