Multimodal AIs aren't 'blind' to object orientation; they just lack the reasoning to use the visual data they already have.
April 16, 2026
Original Paper
Why MLLMs Struggle to Determine Object Orientations
arXiv · 2604.13321
The Takeaway
The prevailing assumption was that current Vision-Language Models fail at orientation tasks (e.g., knowing whether a chair is upside down) because of the visual encoder. This paper shows the opposite: the orientation information is preserved in the encoder embeddings, and simple linear probes can extract it reliably. The 'blindness' is actually a failure of the higher-level reasoning and integration layers to access that information. This shifts the focus for researchers from 'building better eyes' to 'building better brains': we don't need a new CLIP; we need a better way to fuse existing visual features into the LLM's reasoning. That changes the roadmap for building spatially aware AI agents.
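The "simple linear models can extract it" claim refers to linear probing: fitting a linear classifier on frozen encoder embeddings to see whether a property (here, rotation) is decodable. A minimal sketch of that idea follows; it uses synthetic stand-in features rather than real CLIP/SigLIP embeddings, and all names and numbers are illustrative, not from the paper.

```python
# Linear-probe sketch: can a linear classifier recover rotation class
# from embeddings? Here we fabricate "embeddings" where a few dimensions
# carry a linear rotation signal, standing in for frozen encoder outputs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, dim = 2000, 128
angles = rng.integers(0, 4, size=n)   # rotation class: 0/90/180/270 degrees
X = rng.normal(size=(n, dim))         # background noise in the "embedding"
# Inject a linear rotation cue into 4 dimensions (one per class).
for k in range(4):
    X[angles == k, k] += 2.0

X_tr, X_te, y_tr, y_te = train_test_split(X, angles, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = probe.score(X_te, y_te)
print(f"linear probe accuracy: {acc:.2f}")  # well above chance (0.25)
```

In a real replication you would replace the synthetic `X` with frozen encoder embeddings of rotated images; high probe accuracy there is what supports the paper's claim that the information survives encoding.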
From the abstract
Multimodal Large Language Models (MLLMs) struggle with tasks that require reasoning about 2D object orientation in images, as documented in prior work. Tong et al. and Nichols et al. hypothesize that these failures originate in the visual encoder, since commonly used encoders such as CLIP and SigLIP are trained for image-text semantic alignment rather than geometric reasoning. We design a controlled empirical protocol to test this claim by measuring whether rotations can be recovered from encoder embeddings.