AI & ML Paradigm Challenge

Multimodal AIs aren't 'blind' to object orientation; they just lack the reasoning to use the visual data they already have.

April 16, 2026

Original Paper

Why MLLMs Struggle to Determine Object Orientations

Anju Gopinath, Nikhil Krishnaswamy, Bruce Draper

arXiv · 2604.13321

The Takeaway

The prevailing assumption was that current Vision-Language Models fail at orientation tasks (like knowing whether a chair is upside down) because of the visual encoder. This paper argues the opposite: the orientation information is preserved in the encoder embeddings, and simple linear probes can extract it reliably. The 'blindness' is instead a failure of the higher-level reasoning or integration layers to access that information. This shifts the focus for researchers from 'building better eyes' to 'building better brains' for AI: we may not need a new CLIP, but rather a better way to fuse existing visual features into the LLM's reasoning. That changes the roadmap for building spatially aware AI agents.
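The core evidence here is a linear probe: if a simple linear classifier trained on frozen encoder embeddings can predict an object's rotation, the information was never lost by the encoder. A minimal sketch of that idea, using synthetic stand-in "embeddings" rather than real CLIP/SigLIP features (the embedding construction and the nearest-class-mean probe are illustrative assumptions, not the paper's exact protocol):

```python
import numpy as np

# Synthetic stand-in for frozen encoder embeddings (an assumption for
# illustration): each of 4 rotation classes (0/90/180/270 degrees) shifts
# the embedding along its own latent direction, plus noise.
rng = np.random.default_rng(0)
n, d = 400, 64
angles = rng.integers(0, 4, size=n)            # rotation class labels
directions = rng.normal(size=(4, d))           # one latent direction per class
X = directions[angles] + 0.5 * rng.normal(size=(n, d))  # "embeddings"

# Linear probe: nearest class-mean classifier fit on a train split.
X_tr, y_tr, X_te, y_te = X[:300], angles[:300], X[300:], angles[300:]
centroids = np.stack([X_tr[y_tr == k].mean(axis=0) for k in range(4)])
dists = np.linalg.norm(X_te[:, None, :] - centroids[None, :, :], axis=2)
preds = dists.argmin(axis=1)

acc = (preds == y_te).mean()
print(f"linear-probe accuracy: {acc:.2f}")
```

High probe accuracy under this setup mirrors the paper's finding: when the geometric signal is linearly decodable from the embeddings, any downstream failure to use it lies in the layers above the encoder.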

From the abstract

Multimodal Large Language Models (MLLMs) struggle with tasks that require reasoning about 2D object orientation in images, as documented in prior work. Tong et al. and Nichols et al. hypothesize that these failures originate in the visual encoder, since commonly used encoders such as CLIP and SigLIP are trained for image-text semantic alignment rather than geometric reasoning. We design a controlled empirical protocol to test this claim by measuring whether rotations can be recovered from encoder embeddings.