Hypothesizes and demonstrates a unified Gaussian latent geometry connecting vision encoders and generative models.
March 24, 2026
Original Paper
The Universal Normal Embedding
arXiv · 2603.21786
The Takeaway
This paper provides empirical evidence that vision embeddings (CLIP/DINO) and diffusion noise (DDIM) are projections of the same underlying 'Universal Normal Embedding.' This allows for semantic editing and attribute prediction directly in noise space without specialized architectures.
From the abstract
Generative models and vision encoders have largely advanced on separate tracks, optimized for different goals and grounded in different mathematical principles. Yet, they share a fundamental property: latent space Gaussianity. Generative models map Gaussian noise to images, while encoders map images to semantic embeddings whose coordinates empirically behave as Gaussian. We hypothesize that both are views of a shared latent source, the Universal Normal Embedding (UNE): an approximately Gaussian …
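The abstract's claim that embedding coordinates "empirically behave as Gaussian" can be illustrated with a simple marginal-normality check. The sketch below is not the paper's method; it uses synthetic stand-in vectors (a Gaussian batch versus a deliberately heavy-tailed one) where real CLIP/DINO embeddings would go, and measures per-coordinate excess kurtosis, which is near zero for Gaussian marginals.

```python
# Minimal sketch, assuming synthetic stand-ins for encoder embeddings:
# per-coordinate excess kurtosis as a crude Gaussianity diagnostic.
import numpy as np

rng = np.random.default_rng(0)

def coordinate_excess_kurtosis(emb):
    """Excess kurtosis of each coordinate after standardization.
    Values near 0 are consistent with Gaussian marginals."""
    z = (emb - emb.mean(axis=0)) / emb.std(axis=0)
    return (z ** 4).mean(axis=0) - 3.0

# Stand-ins: a Gaussian-like batch vs. a clearly non-Gaussian one.
gaussian_like = rng.standard_normal((4096, 64))
heavy_tailed = rng.standard_t(df=3, size=(4096, 64))

print(np.abs(coordinate_excess_kurtosis(gaussian_like)).mean())  # small
print(np.abs(coordinate_excess_kurtosis(heavy_tailed)).mean())   # much larger
```

On real embeddings, the same diagnostic (applied per coordinate, ideally after whitening) would indicate how closely the encoder's latent space matches the approximately-Gaussian picture the UNE hypothesis assumes.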