AI & ML Efficiency Breakthrough

A training-free method that fixes intra-modal misalignment in CLIP by decomposing its projectors into an isotropic, aligned subspace.

March 23, 2026

Original Paper

IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment

Simone Magistri, Dipam Goswami, Marco Mistretta, Bartłomiej Twardowski, Joost van de Weijer, Andrew D. Bagdanov

arXiv · 2603.19862

The Takeaway

CLIP is notoriously weak at intra-modal tasks such as image-to-image retrieval. This method offers a purely mathematical way to strip anisotropic directions from the projector weights without any additional training, significantly improving retrieval accuracy while reducing latency.
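To make the idea of a training-free, weights-only intervention concrete, here is a minimal sketch of one way dominant directions could be stripped from a projector matrix via SVD. The function name and the choice of simply zeroing the top-k singular values are illustrative assumptions, not the paper's actual IsoCLIP decomposition:

```python
import numpy as np

def strip_dominant_directions(W, k=1):
    # Hypothetical sketch: remove the k most dominant (anisotropic)
    # singular directions from a projector matrix W via SVD.
    # This is NOT the paper's actual algorithm, only an illustration
    # of a training-free edit applied directly to the weights.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S = S.copy()
    S[:k] = 0.0  # suppress the dominant directions; no retraining
    return U @ np.diag(S) @ Vt

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 96))          # stand-in for a CLIP projector
W_iso = strip_dominant_directions(W, k=4)  # same shape, flatter spectrum
```

Because the edit is a closed-form operation on the weight matrix itself, it adds no training cost and can even reduce inference cost if the projector is replaced by a lower-rank factorization.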

From the abstract

Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing …
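The abstract's point about projectors can be illustrated with a deterministic toy example (not drawn from the paper): a strongly anisotropic projection can make two orthogonal pre-projection embeddings look highly similar in the shared space, which is exactly the kind of intra-modal distortion described above.

```python
import numpy as np

def cos(a, b):
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two pre-projection embeddings that are orthogonal (dissimilar images).
x_a = np.array([1.0, 1.0])
x_b = np.array([1.0, -1.0])

# A toy anisotropic projector that heavily stretches the first direction.
W = np.diag([10.0, 1.0])

p_a, p_b = W @ x_a, W @ x_b

print(cos(x_a, x_b))  # 0.0: dissimilar before projection
print(cos(p_a, p_b))  # ~0.98: spuriously similar after projection
```

After the anisotropic map, the dominant direction swamps the similarity computation, so retrieval in the projected space no longer reflects the pre-projection geometry.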