Distribution-Conditioned Diffusion Decoding enables high-fidelity image generation from pre-trained VLMs without expensive full-model retraining.
March 17, 2026
Original Paper
High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding
arXiv · 2603.13389
The Takeaway
Current VLMs generate images through discrete tokens, which introduces quantization artifacts and caps visual fidelity. This method instead trains a lightweight diffusion decoder conditioned on the output distributions of an existing, frozen VLM, producing high-quality continuous images without touching the VLM's weights. The authors report state-of-the-art visual quality at only ImageNet-1K-level training cost.
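The core idea can be sketched in a few lines: rather than committing to the argmax discrete token at each position, keep the VLM's full predicted distribution over the image-token codebook, turn it into a continuous conditioning vector, and let a small diffusion decoder denoise latents under that condition. The sketch below is a toy NumPy illustration under assumed shapes and a made-up linear denoiser and noise schedule; none of the names (`codebook`, `denoiser`, the DDIM-like update) come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): a codebook of 16 image tokens,
# 8-dim continuous latents, 4 image-token positions.
V, D, N = 16, 8, 4

# Frozen VLM output: logits over the image-token codebook at each position.
vlm_logits = rng.normal(size=(N, V))

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Distribution conditioning: instead of hard-selecting argmax tokens,
# mix codebook embeddings by the full predicted distribution, giving a
# continuous condition vector per position.
codebook = rng.normal(size=(V, D))
cond = softmax(vlm_logits) @ codebook          # shape (N, D)

# Toy denoiser: a fixed linear map of (noisy latent, condition) -> noise
# estimate. The real decoder would be a small learned diffusion network.
W = rng.normal(size=(2 * D, D)) * 0.1

def denoiser(x_t, cond):
    return np.concatenate([x_t, cond], axis=-1) @ W

# A few deterministic denoising steps from pure noise toward clean latents,
# using a toy noise schedule (illustration only, not the paper's sampler).
T = 10
alphas = np.linspace(0.99, 0.6, T)
x = rng.normal(size=(N, D))
for t in range(T):
    eps = denoiser(x, cond)
    a = alphas[t]
    x = (x - np.sqrt(1 - a) * eps) / np.sqrt(a)

print(x.shape)  # one decoded continuous latent per token position
```

The design point this illustrates is why retraining is avoided: the VLM only ever supplies `vlm_logits`, so all learning cost lives in the small decoder that maps distributions to continuous latents.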
From the abstract
Recent large-scale vision-language models (VLMs) have shown remarkable text-to-image generation capabilities, yet their visual fidelity remains constrained by discrete image tokenization, which poses a major challenge. Although several studies have explored continuous representation modeling to enhance visual quality, adapting pre-trained VLMs to such representations requires data and training costs comparable to the original pre-training. To circumvent this limitation, we