Replaces visual token compression with sparse, dynamically selected vision-language interactions in VLLMs.
March 25, 2026
Original Paper
VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions
arXiv · 2603.23495
The Takeaway
Most efficiency methods drop visual information, which hurts fine-grained reasoning. VISOR keeps all pixels but sparsifies the attention layers, matching SOTA performance while drastically reducing FLOPs for high-resolution vision-language tasks.
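The core idea — keep every visual token available but let each text query interact with only a small, dynamically chosen subset — can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the scoring pass, the per-query top-k selection, and the `top_k` parameter are all assumptions made for clarity (a real implementation would use a cheaper router than full score computation to actually save FLOPs).

```python
import numpy as np

def sparse_cross_attention(q, vis_k, vis_v, top_k):
    """Illustrative sketch of sparse, dynamically selected vision-language
    attention: each text query attends only to its top-k highest-scoring
    visual tokens. All visual tokens remain in memory (nothing is
    discarded), but the value aggregation touches only top_k of them."""
    d = q.shape[-1]
    # Score every visual token per text query (a real method would
    # replace this with a lightweight router to save compute here too).
    scores = q @ vis_k.T / np.sqrt(d)                # (n_text, n_vis)
    # Dynamically select the top-k visual tokens for each query.
    idx = np.argsort(-scores, axis=-1)[:, :top_k]    # (n_text, top_k)
    out = np.empty_like(q)
    for i in range(q.shape[0]):
        s = scores[i, idx[i]]
        w = np.exp(s - s.max())                      # softmax over the
        w /= w.sum()                                 # selected subset only
        out[i] = w @ vis_v[idx[i]]
    return out, idx
```

Because the selection is recomputed per query (and, in a full model, per layer), no visual information is permanently lost the way it is under token-reduction schemes — a later layer can attend to tokens an earlier layer ignored.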
From the abstract
Existing approaches for improving the efficiency of Large Vision-Language Models (LVLMs) are largely based on the concept of visual token reduction. This approach, however, creates an information bottleneck that impairs performance, especially on challenging tasks that require fine-grained understanding and reasoning. In this work, we challenge this paradigm by introducing VISion On Request (VISOR), a method that reduces inference cost without discarding visual information. Instead of compressing…