Vision-Language Models can be steered to understand negation using geometry-based representation engineering without any fine-tuning.
March 24, 2026
Original Paper
When Negation Is a Geometry Problem in Vision-Language Models
arXiv · 2603.20554
The Takeaway
Negation is a classic failure mode for models like CLIP. This paper shows that a 'negation direction' exists in embedding space and can be manipulated at test time, providing a zero-cost fix for a major multimodal limitation.
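To make the idea concrete, here is a minimal sketch of this kind of geometry-based steering. It assumes you already have text embeddings (e.g. from a CLIP text encoder); the direction is estimated as the normalized mean difference between negated and affirmative caption embeddings, and applied at test time by shifting a query embedding along it. The function names, the `alpha` strength parameter, and the toy data are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def negation_direction(neg_embs: np.ndarray, pos_embs: np.ndarray) -> np.ndarray:
    """Unit vector from affirmative to negated caption embeddings (mean difference)."""
    d = neg_embs.mean(axis=0) - pos_embs.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(embedding: np.ndarray, direction: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Shift a query embedding along the negation direction, then renormalize."""
    steered = embedding + alpha * direction
    return steered / np.linalg.norm(steered)

# Toy stand-in embeddings; in practice these would come from a CLIP text encoder.
rng = np.random.default_rng(0)
pos = rng.normal(size=(8, 512))          # affirmative captions ("a shirt with logos")
neg = pos + 0.5                          # pretend negated captions are offset in space
d = negation_direction(neg, pos)
q = steer(rng.normal(size=512), d, alpha=0.8)
print(q.shape)                           # steered query embedding, same dimensionality
```

Because the direction is estimated once from a small set of caption pairs and applied as a vector addition, the intervention requires no gradient updates, which is what makes it a zero-cost, test-time fix.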
From the abstract
Joint Vision-Language Embedding models such as CLIP typically fail to understand negation in text queries; for example, they fail to distinguish the "no" in the query "a plain blue shirt with no logos". Prior work has largely addressed this limitation through data-centric approaches, fine-tuning CLIP on large-scale synthetic negation datasets. However, these efforts are commonly evaluated using retrieval-based metrics that cannot reliably reflect whether negation is actually understood. [...]