Discovers that object-centric information in Vision Transformers is distributed across all attention components (q, k, v) and layers, not just the final layer.
March 30, 2026
Original Paper
Finding Distributed Object-Centric Properties in Self-Supervised Transformers
arXiv · 2603.26127
The Takeaway
Contrary to the standard practice of reading object locations off the final-layer [CLS] token, this paper shows that distributed inter-patch similarity carries much richer object-grounding information. Their training-free 'Object-DINO' method significantly improves unsupervised object discovery and reduces hallucinations in Multimodal LLMs, all without retraining the backbone.
From the abstract
Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in the [CLS] token attention maps of the final layer. However, these maps often contain spurious activations that result in poor object localization. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information present in the local, patch-level interactions.
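To make the patch-level idea concrete, here is a minimal sketch of object discovery from inter-patch similarity rather than [CLS] attention. It is illustrative only: the function name, the seed-selection heuristic (a LOST-style "least-correlated patch" rule), and the threshold are assumptions, not the paper's actual Object-DINO procedure, and the toy features stand in for real ViT patch embeddings (e.g. keys from an attention layer).

```python
import numpy as np

def object_mask_from_patches(feats, tau=0.2):
    """Illustrative sketch: segment an object from patch-level similarities.

    feats: (N, D) array of patch features (stand-in for ViT patch embeddings).
    Heuristic (LOST-style, assumed here, not the paper's exact method):
    objects cover fewer patches than background, so the patch correlated
    with the fewest others is a likely object seed.
    """
    # Cosine-normalize patch features.
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T                        # (N, N) inter-patch similarity
    # Seed: patch with the fewest positive correlations.
    degree = (sim > tau).sum(axis=1)
    seed = int(np.argmin(degree))
    # Expand the seed into a mask of similar patches.
    return sim[seed] > tau

# Toy features: 2 "object" patches and 6 "background" patches built from
# orthogonal prototypes plus small noise.
rng = np.random.default_rng(0)
obj_proto = np.eye(8)[0]
bg_proto = np.eye(8)[1]
feats = np.vstack([obj_proto + 0.01 * rng.normal(size=(2, 8)),
                   bg_proto + 0.01 * rng.normal(size=(6, 8))])
mask = object_mask_from_patches(feats)   # True exactly for the 2 object patches
```

With real DINO features the same similarity matrix would be computed over hundreds of patches, and the paper's point is that this signal lives across q, k, v and all layers, not just the final one.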