AI & ML · Breaks Assumption

Demonstrates that the 'modality gap' in CLIP-style models is a feature that can be exploited to increase robustness without retraining.

April 1, 2026

Original Paper

Is the Modality Gap a Bug or a Feature? A Robustness Perspective

Rhea Chowers, Oshri Naparstek, Udi Barzelay, Yair Weiss

arXiv · 2603.29080

The Takeaway

Instead of treating the gap between image and text embeddings as a bug, the authors show it is mathematically linked to robustness. They provide a simple post-processing vector shift that improves model stability under perturbations for free.
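The paper itself is not quoted in detail here, so as a hedged illustration only: one plausible form of such a post-processing shift is to move image embeddings along the direction of the gap between the two modality centroids. The function name `modality_gap_shift` and the step size `alpha` are assumptions for this sketch, not the authors' notation.

```python
import numpy as np

def modality_gap_shift(image_embs, text_embs, alpha=0.5):
    """Shift image embeddings along the image->text gap direction.

    Hypothetical sketch of a 'post-processing vector shift'; the
    paper's exact procedure and the parameter `alpha` are assumptions.
    """
    # Unit-normalize, as CLIP-style embeddings typically are.
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    # The modality gap: difference between the two modality centroids.
    gap = text_embs.mean(axis=0) - image_embs.mean(axis=0)

    # Move image embeddings a fraction alpha along the gap direction,
    # then renormalize back onto the unit sphere.
    shifted = image_embs + alpha * gap
    return shifted / np.linalg.norm(shifted, axis=1, keepdims=True)

# Toy usage: random embeddings with an artificial offset between modalities.
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16)) + 1.0
txt = rng.normal(size=(8, 16)) - 1.0
out = modality_gap_shift(img, txt, alpha=0.5)
```

Because the shift is a single vector addition followed by renormalization, it requires no retraining and can be applied to any frozen encoder's outputs.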

From the abstract

Many modern multi-modal models (e.g. CLIP) seek an embedding space in which the two modalities are aligned. Somewhat surprisingly, almost all existing models show a strong modality gap: the distribution of images is well-separated from the distribution of texts in the shared embedding space. Despite a series of recent papers on this topic, it is still not clear why this gap exists nor whether closing the gap in post-processing will lead to better performance on downstream tasks. In this paper we…