AI & ML Paradigm Shift

Enables LMMs to 'think' using compact latent visual representations rather than verbalizing everything into text.

March 27, 2026

Original Paper

LanteRn: Latent Visual Structured Reasoning

André G. Viveiros, Nuno Gonçalves, Matthias Lindemann, André Martins

arXiv · 2603.25629

The Takeaway

Current models fail at spatial reasoning because they force visual data through a textual bottleneck. LanteRn allows models to generate and attend to 'visual thoughts,' significantly improving performance on fine-grained grounding and reasoning tasks.
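The core idea — feeding continuous latent representations back into the model's context rather than decoding every reasoning step into text tokens — can be illustrated with a toy sketch. This is a minimal illustration under assumed details (random stand-in weights, single-head attention, a tiny hidden size), not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # hidden size (illustrative only)

# Toy projection matrices standing in for a trained LMM's weights.
W_latent = rng.standard_normal((d, d)) / np.sqrt(d)  # hidden -> latent 'visual thought'
W_out = rng.standard_normal((d, 4)) / np.sqrt(d)     # hidden -> answer logits

def attend(query, context):
    """Single-head dot-product attention of one query vector over a context matrix."""
    scores = context @ query / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ context

# Start from fused text/image embeddings (random stand-ins here).
context = rng.standard_normal((5, d))

# Latent reasoning loop: generate continuous 'visual thoughts' and append
# them to the context, instead of decoding each step into a text token.
for _ in range(3):
    h = attend(context[-1], context)
    thought = np.tanh(h @ W_latent)          # compact latent visual representation
    context = np.vstack([context, thought])  # the model attends to its own thought

# Only the final answer is verbalized into a discrete output.
logits = attend(context[-1], context) @ W_out
answer = int(np.argmax(logits))
```

The point of the sketch is the loop: each `thought` stays a dense vector that later steps attend to, so no perceptual detail is forced through a discrete textual bottleneck until the final answer.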

From the abstract

While language reasoning models excel in many tasks, visual reasoning remains challenging for current large multimodal models (LMMs). As a result, most LMMs default to verbalizing perceptual content into text, a strong limitation for tasks requiring fine-grained spatial and visual understanding. While recent approaches take steps toward thinking with images by invoking tools or generating intermediate images, they either rely on external modules, or incur unnecessary computation by reasoning dir…