AI & ML Breaks Assumption

Discovers that object-centric information in Vision Transformers is distributed across all attention components (q, k, v) and layers, not just the final layer.

March 30, 2026

Original Paper

Finding Distributed Object-Centric Properties in Self-Supervised Transformers

Samyak Rawlekar, Amitabh Swain, Yujun Cai, Yiwei Wang, Ming-Hsuan Yang, Narendra Ahuja

arXiv · 2603.26127

The Takeaway

Contrary to the standard practice of relying on the final-layer [CLS] token for object discovery, this paper shows that distributed inter-patch similarity carries much richer object-grounding information. The authors' training-free 'Object-DINO' method significantly improves unsupervised object discovery and reduces hallucinations in Multimodal LLMs, all without retraining.

From the abstract

Self-supervised Vision Transformers (ViTs) like DINO show an emergent ability to discover objects, typically observed in the [CLS] token attention maps of the final layer. However, these maps often contain spurious activations, resulting in poor localization of objects. This is because the [CLS] token, trained on an image-level objective, summarizes the entire image instead of focusing on objects. This aggregation dilutes the object-centric information present in the local, patch-level interactions.