AwaRes enables low-resolution Vision-Language Models to retrieve only the high-resolution image crops needed for a specific query via tool-calling.
March 19, 2026
Original Paper
Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs
arXiv · 2603.16932
The Takeaway
AwaRes resolves the standard accuracy-efficiency trade-off in VLMs by processing high-detail visual regions only on demand. GRPO training teaches the model when and where to look, letting small models handle high-resolution tasks (such as reading small text) at a fraction of the compute cost.
From the abstract
Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational cost, while low-resolution inputs are efficient but can miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view…
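The core mechanism can be pictured as a crop-retrieval tool the VLM invokes during generation: the model reasons over a cheap low-resolution global view, and when it needs fine detail, it emits a tool call with box coordinates and receives the corresponding high-resolution crop. The sketch below illustrates this idea only; the function name, box format, and coordinate convention are assumptions for illustration, not the paper's actual API.

```python
def crop_tool(image, box):
    """Hypothetical tool: return the high-resolution crop for a
    normalized (x0, y0, x1, y1) box requested by the model.
    `image` is a row-major 2D grid of pixels; coordinates are in [0, 1]."""
    h, w = len(image), len(image[0])
    x0, y0, x1, y1 = box
    r0, r1 = int(y0 * h), int(y1 * h)
    c0, c1 = int(x0 * w), int(x1 * w)
    return [row[c0:c1] for row in image[r0:r1]]

# Toy 8x8 "high-resolution" image; the model requests the top-left quadrant,
# e.g. because the low-resolution view suggests small text there.
hi_res = [[(r, c) for c in range(8)] for r in range(8)]
crop = crop_tool(hi_res, (0.0, 0.0, 0.5, 0.5))  # 4x4 crop
```

Only the retrieved crop, not the full high-resolution image, is re-encoded by the vision tower, which is where the compute savings come from.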