AI & ML · Breaks Assumption

Reveals that state-of-the-art MLLMs fail to maintain stable spatial representations under simple counterfactual viewpoint changes.

March 24, 2026

Original Paper

CVT-Bench: Counterfactual Viewpoint Transformations Reveal Unstable Spatial Representations in Multimodal LLMs

Shanmukha Vellamcheti, Uday Kiran Kothapalli, Disharee Bhowmick, Sathyanarayanan N. Aakur

arXiv · 2603.21114

The Takeaway

The paper demonstrates that high single-view spatial accuracy in MLLMs is misleading: models that answer correctly from one viewpoint frequently violate 360° cycle consistency under hypothetical camera orbits. This is a critical insight for anyone applying MLLMs to spatial reasoning or robotics, where viewpoint stability is essential.
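To make the violated invariant concrete, here is a minimal sketch (not the paper's code) of what 360° cycle consistency requires: four successive 90° camera orbits compose to the identity, so a model's relational answers at 0°, 90°, 180°, and 270° must be mutually consistent. The four-relation vocabulary and the orbit-direction convention below are illustrative assumptions, not CVT-Bench's actual label set.

```python
# Minimal sketch of the 360° cycle-consistency invariant.
# The relation labels and the orbit-step mapping are assumed conventions.

# How each camera-relative relation transforms under one hypothetical
# 90° orbit step of the camera around the scene.
ORBIT_90 = {
    "left of": "behind",
    "behind": "right of",
    "right of": "in front of",
    "in front of": "left of",
}

def orbit(relation: str, steps: int) -> str:
    """Transport a relation label through `steps` successive 90° orbits."""
    for _ in range(steps % 4):
        relation = ORBIT_90[relation]
    return relation

def is_cycle_consistent(answers: list[str]) -> bool:
    """True if a model's answers at 0°/90°/180°/270° agree with transporting
    its 0° answer around the orbit; a full 360° cycle returns the original."""
    return all(orbit(answers[0], k) == answers[k] for k in range(4))

# A model answering "left of" at both 0° and 90° (instead of "behind")
# has an unstable spatial representation under the counterfactual orbit.
print(is_cycle_consistent(["left of", "behind", "right of", "in front of"]))  # True
print(is_cycle_consistent(["left of", "left of", "right of", "in front of"]))  # False
```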

From the abstract

Multimodal large language models (MLLMs) achieve strong performance on single-view spatial reasoning tasks, yet it remains unclear whether they maintain stable spatial state representations under counterfactual viewpoint changes. We introduce a controlled diagnostic benchmark that evaluates relational consistency under hypothetical camera orbit transformations without re-rendering images. Across 100 synthetic scenes and 6,000 relational queries, we measure viewpoint consistency, 360° cycle agreement…
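As a rough illustration of how such metrics might be scored (the exact definitions in CVT-Bench may differ), the sketch below computes a viewpoint-consistency rate against ground truth and a cycle-agreement rate that only checks the model's answers against each other. The `QueryResult` structure and both metric names are assumptions introduced here for illustration.

```python
# Hedged sketch of aggregate scoring; CVT-Bench's exact metric definitions
# may differ. Each QueryResult holds ground-truth relations and model
# answers at the four counterfactual orbit angles (0°, 90°, 180°, 270°).
from dataclasses import dataclass

@dataclass
class QueryResult:
    truth: list[str]    # ground-truth relation at each angle
    answers: list[str]  # model answer at each angle

def viewpoint_consistency(results: list[QueryResult]) -> float:
    """Fraction of queries answered correctly at every counterfactual angle."""
    ok = sum(all(a == t for a, t in zip(r.answers, r.truth)) for r in results)
    return ok / len(results)

def cycle_agreement(results: list[QueryResult], orbit) -> float:
    """Fraction of queries whose answers are internally consistent under an
    orbit map (e.g. the `orbit` function sketched above), regardless of
    whether any individual answer matches the ground truth."""
    ok = sum(
        all(orbit(r.answers[0], k) == r.answers[k] for k in range(4))
        for r in results
    )
    return ok / len(results)
```

Keeping the two rates separate matters: a model can be internally cycle-consistent yet wrong at every angle, or correct at 0° yet inconsistent under the counterfactual orbits, and only the second failure mode is the instability the paper highlights.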