AI & ML New Capability

A training-free decoding framework that mitigates multimodal hallucinations by re-ranking tokens based on spatial attention entropy.

March 27, 2026

Original Paper

Seeing to Ground: Visual Attention for Hallucination-Resilient MDLLMs

Vishal Narnaware, Animesh Gupta, Kevin Zhai, Zhenyi Wang, Mubarak Shah

arXiv · 2603.25711

The Takeaway

The framework addresses the "objective mismatch" in which multimodal LLMs rank candidate tokens by linguistic likelihood rather than visual grounding. By enforcing localization consensus across attention heads at inference time, it significantly improves the reliability of generated content in vision-language tasks, with no retraining required.
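A minimal sketch of the core idea in PyTorch. This is our illustration, not the paper's released implementation: the function names, the single penalty weight `alpha`, and the use of mean per-head entropy as the consensus signal are all assumptions. Candidate tokens whose cross-attention spreads near-uniformly over image patches (high spatial entropy) are penalized, so tokens with localized, head-consistent visual support rank higher.

```python
import torch

def spatial_attention_entropy(attn: torch.Tensor) -> torch.Tensor:
    # attn: (..., num_heads, num_patches) cross-attention weights,
    # where each head's row sums to 1 over image patches.
    # Returns Shannon entropy per head: (..., num_heads).
    p = attn.clamp_min(1e-9)  # avoid log(0)
    return -(p * p.log()).sum(dim=-1)

def rerank_candidates(logits: torch.Tensor,
                      cross_attn: torch.Tensor,
                      alpha: float = 1.0) -> torch.Tensor:
    # logits: (num_candidates,) language-model scores for candidate tokens.
    # cross_attn: (num_candidates, num_heads, num_patches) attention maps.
    # Heads that agree on a localized region yield low mean entropy
    # (a simple consensus proxy), so subtracting an entropy penalty
    # pushes visually grounded candidates up the ranking.
    consensus = spatial_attention_entropy(cross_attn).mean(dim=-1)
    return logits - alpha * consensus

# Toy usage: 4 candidate tokens, 8 heads, 196 image patches.
torch.manual_seed(0)
logits = torch.randn(4)
attn = torch.softmax(torch.randn(4, 8, 196), dim=-1)
scores = rerank_candidates(logits, attn, alpha=0.5)
print(scores.argmax().item())  # index of the best-grounded candidate
```

In a real MDLLM decoder, this re-ranking would run at each parallel masked-decoding step using the model's actual cross-attention maps rather than the random toy tensors above; one might also normalize the entropy by log(num_patches) to keep `alpha` comparable across image resolutions.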

From the abstract

Multimodal Diffusion Large Language Models (MDLLMs) achieve high-concurrency generation through parallel masked decoding, yet these architectures remain prone to multimodal hallucinations. This structural vulnerability stems from an algorithmic flaw: the decoder ranks candidate tokens based on textual likelihood without verifying localized visual support. We establish that this language-only ranking induces an objective mismatch, where language probability mass acts as a misspecified proxy for the […]
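Schematically, and in our own notation (the symbols below, including the penalty weight $\alpha$, are illustrative and not taken from the paper), the mismatch is between decoding by language likelihood alone and decoding with a visual-grounding term:

$$
\hat{y}_t = \arg\max_{y}\, \log p_\theta(y \mid c)
\qquad \text{vs.} \qquad
\hat{y}_t = \arg\max_{y}\, \Big[ \log p_\theta(y \mid c) - \alpha\, \bar{H}_{\mathrm{spatial}}(y) \Big],
$$

where $c$ is the multimodal context and $\bar{H}_{\mathrm{spatial}}(y)$ is the mean over attention heads of the entropy of candidate $y$'s cross-attention distribution over image patches. A low $\bar{H}_{\mathrm{spatial}}$ means the heads agree on a localized region, i.e., the token has concrete visual support.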