One specific string of text acts as a skeleton key, making a vision-language model think it matches almost every image in existence.
Cross-modal encoders like CLIP are supposed to map images and text into a shared embedding space, where matching pairs land close together. This study identifies a hubness failure in which a single piece of text becomes a universal nearest neighbor for thousands of unrelated pictures. A malicious actor could exploit such a text to hide or misclassify images across an entire retrieval system, exposing a fundamental instability in how these models bridge sight and language. Designers will need ways to detect and prune these hubs if visual search is to remain accurate: one phrase can effectively blind a system to the difference between a dog and a car.
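To see why one point can dominate nearest-neighbor search, here is a minimal, self-contained sketch of the hubness geometry (not the paper's attack; the random vectors and the shared offset mimicking CLIP's modality gap are our own assumptions). A vector placed near the image centroid out-ranks every genuine caption for nearly all images:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 512

# Random unit vectors stand in for CLIP embeddings; the shared +0.5 offset
# mimics the narrow-cone / modality-gap structure of real CLIP spaces (assumption).
images = rng.normal(size=(5000, d)) + 0.5
images /= np.linalg.norm(images, axis=1, keepdims=True)
texts = rng.normal(size=(1000, d)) + 0.5
texts /= np.linalg.norm(texts, axis=1, keepdims=True)

# A hypothetical "hub text": a unit vector pointing at the image centroid.
hub = images.mean(axis=0)
hub /= np.linalg.norm(hub)

# Retrieval: each image picks its most similar text; the hub is index 1000.
candidates = np.vstack([texts, hub])
nearest = (images @ candidates.T).argmax(axis=1)
print("fraction of images retrieving the hub:", (nearest == 1000).mean())
```

In high dimensions the centroid direction has above-average cosine similarity to almost every point, which is exactly what lets a single embedding become a universal neighbor.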
One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness
arXiv · 2604.27674
The hubness problem, in which hub embeddings are close to many unrelated examples, occurs often in high-dimensional embedding spaces and may pose a practical threat to tasks such as information retrieval and automatic evaluation metrics. In particular, since cross-modal similarity between text and images cannot be calculated by direct comparisons, such as string matching, cross-modal encoders that project different modalities into a shared space are helpful for various cross-modal applications.
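Hubness is commonly quantified by the k-occurrence N_k(x): how many other points count x among their k nearest neighbors. Below is a short sketch (the function name and the NumPy arrays standing in for CLIP text and image embeddings are our own illustration, not the paper's code) that measures, for each text, how many images rank it in their top-k by cosine similarity; a hub scores far above the expected mean of k · num_images / num_texts:

```python
import numpy as np

def k_occurrence(text_embs: np.ndarray, image_embs: np.ndarray, k: int = 10) -> np.ndarray:
    """For each text, count how many images rank it among their top-k
    nearest texts by cosine similarity (the k-occurrence N_k)."""
    # Normalize rows so the dot product equals cosine similarity, as in CLIP.
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    v = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = v @ t.T                             # (num_images, num_texts)
    topk = np.argsort(-sims, axis=1)[:, :k]    # each image's top-k text ids
    return np.bincount(topk.ravel(), minlength=t.shape[0])

# Toy demo with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
n_k = k_occurrence(rng.normal(size=(1000, 512)), rng.normal(size=(5000, 512)))
print("expected mean:", 10 * 5000 / 1000, "| max observed:", n_k.max())
```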