AI & ML Paradigm Shift

UNITE enables single-stage joint training of the tokenizer and the diffusion model from scratch, removing the need for frozen VAEs.

March 24, 2026

Original Paper

End-to-End Training for Unified Tokenization and Latent Denoising

Shivam Duggal, Xingjian Bai, Zongze Wu, Richard Zhang, Eli Shechtman, Antonio Torralba, Phillip Isola, William T. Freeman

arXiv · 2603.22283

The Takeaway

UNITE simplifies the complex staging of Latent Diffusion Model (LDM) training. By treating tokenization and generation as the same latent inference problem, it creates a 'common latent language' that reaches state-of-the-art FID scores without relying on pretrained encoders such as DINO.

From the abstract

Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE - an autoencoder architecture for unified tokenization and latent diffusion. UNITE consists of a Generative Encoder that serves as both image tokenizer and latent generator via weight sharing. Our key insight is …
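To make the weight-sharing idea concrete, here is a minimal toy sketch of a module whose single weight matrix is used both to tokenize an image and to refine a noisy latent. All names, shapes, and the update rule are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

class GenerativeEncoder:
    """Toy weight-shared module (hypothetical, for illustration only):
    the same weights W map pixels to latents (tokenization) and steer
    a noisy latent during denoising."""

    def __init__(self, pixel_dim=16, latent_dim=4):
        self.W = rng.normal(scale=0.1, size=(latent_dim, pixel_dim))

    def tokenize(self, image):
        # Encode an image into the shared latent space.
        return self.W @ image

    def denoise_step(self, noisy_latent, image):
        # One denoising step pulls the noisy latent toward the
        # tokenized target, reusing the *same* weights W.
        target = self.tokenize(image)
        return noisy_latent + 0.5 * (target - noisy_latent)

enc = GenerativeEncoder()
img = rng.normal(size=16)
z = enc.tokenize(img)                       # tokenizer role
noisy = z + rng.normal(scale=1.0, size=4)   # corrupted latent
refined = enc.denoise_step(noisy, img)      # generator role, shared weights
```

The point of the sketch is only the coupling: because `tokenize` and `denoise_step` share `W`, training either role updates a single latent space, rather than freezing a tokenizer first as in staged LDM training.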