Swapping the word 'person' for 'human' causes AI vision models to look at a completely different part of an image.
April 23, 2026
Original Paper
Prompt Sensitivity in Vision-Language Grounding: How Small Changes in Wording Affect Object Detection
arXiv · 2604.17126
The Takeaway
Vision-language models exhibit a striking degree of instability when processing synonyms. One might expect that identical meanings would lead to identical visual grounding. Instead, the internal selection mechanism is so sensitive that tiny wording changes break the model's perception. This is not a failure of language understanding but a flaw in how the AI links words to pixels. It means that a self-driving car or a medical robot could fail based on a single word choice in its instructions. Reliability in AI vision requires a fundamental fix to this grounding mechanism.
From the abstract
Vision-language models enable open-vocabulary object grounding through natural language queries, under the implicit assumption that semantically equivalent descriptions yield consistent outputs. We examine this assumption using a controlled pipeline combining DETR for object proposals with CLIP for language-conditioned selection on 263 COCO val2017 images. We find that overlapping prompts such as "a person," "a human," and "a pedestrian" frequently select different instances, with mean instability …
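To make the setup concrete, here is a minimal sketch of a DETR-plus-CLIP grounding pipeline in the spirit the abstract describes: DETR proposes boxes, CLIP scores each cropped proposal against a text prompt, and we check whether synonymous prompts pick the same box. This is not the authors' released code; the Hugging Face checkpoints, the 0.7 confidence threshold, the crop-and-score selection rule, and the file name coco_example.jpg are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import (DetrForObjectDetection, DetrImageProcessor,
                          CLIPModel, CLIPProcessor)

# Illustrative checkpoints; the paper's exact models/config are assumptions here.
detr_proc = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
detr = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50").eval()
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()

def propose_boxes(image, score_thresh=0.7):
    """Return DETR box proposals (xyxy pixel coords) above a confidence threshold."""
    inputs = detr_proc(images=image, return_tensors="pt")
    with torch.no_grad():
        out = detr(**inputs)
    target_size = torch.tensor([image.size[::-1]])  # (height, width)
    results = detr_proc.post_process_object_detection(
        out, threshold=score_thresh, target_sizes=target_size)[0]
    return [box.tolist() for box in results["boxes"]]

def select_box(image, boxes, prompt):
    """Pick the proposal whose crop CLIP scores highest for the given prompt."""
    crops = [image.crop(tuple(map(int, b))) for b in boxes]
    inputs = clip_proc(text=[prompt], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    scores = out.logits_per_image.squeeze(-1)  # one image-text score per crop
    return boxes[int(scores.argmax())]

def iou(a, b):
    """Intersection-over-union of two xyxy boxes, to compare selections."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

image = Image.open("coco_example.jpg").convert("RGB")  # any COCO val2017 image
boxes = propose_boxes(image)
picks = {p: select_box(image, boxes, p)
         for p in ["a person", "a human", "a pedestrian"]}
# IoU near 1.0 means the synonyms grounded to the same instance; a low value
# means the wording change redirected the selection to a different object.
print("IoU(person, human) =", iou(picks["a person"], picks["a human"]))
```

Under this reading, "instability" is simply how often (or how far, by IoU) the selected box moves when the prompt is swapped for a synonym, aggregated over many images.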