AI & ML Nature Is Weird

AI spends far too much energy staring at pictures: it figures out what it's looking at almost instantly, and most of the remaining computation is wasted effort.

April 13, 2026

Original Paper

Do Vision Language Models Need to Process Image Tokens?

Sambit Ghosh, R. Venkatesh Babu, Chirag Agarwal

arXiv · 2604.09425

The Takeaway

The study shows that image representations become interchangeable very early in the network, unlike text, which keeps evolving through every layer. This could massively reduce the cost of AI vision by letting us "short-circuit" most of the deep layers for image tokens.
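To make the short-circuiting idea concrete, here is a minimal toy sketch (not the paper's actual method; all names and the `freeze_after` cutoff are illustrative): image tokens stop being transformed after an early layer and are simply reused, while text tokens continue through the full stack.

```python
# Toy sketch: a layer loop that stops updating image tokens after an
# early "freeze" point, on the premise that image representations
# stabilize early while text representations keep evolving.

def run_layers(image_tokens, text_tokens, layers, freeze_after=2):
    """Apply each layer to all tokens, but after `freeze_after` layers
    only text tokens keep being transformed; image tokens are reused."""
    for i, layer in enumerate(layers):
        if i < freeze_after:
            image_tokens = [layer(t) for t in image_tokens]
        text_tokens = [layer(t) for t in text_tokens]
    return image_tokens, text_tokens

# Tiny demo with scalar "tokens" and a doubling "layer":
layers = [lambda t: 2 * t] * 4
img, txt = run_layers([1.0], [1.0], layers, freeze_after=2)
print(img, txt)  # image token passes 2 layers (4.0), text passes 4 (16.0)
```

In a real VLM the savings come from dropping the image tokens' share of attention and feed-forward compute in the skipped layers, which dominates the cost when image tokens outnumber text tokens.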

From the abstract

Vision Language Models (VLMs) have achieved remarkable success by integrating visual encoders with large language models (LLMs). While VLMs process dense image tokens across deep transformer stacks (incurring substantial computational overhead), it remains fundamentally unclear whether sustained image-token processing is necessary for their performance or visual representations meaningfully evolve from early to later layers. In this work, we systematically investigate the functional role of image […]