Introduces ZoomUI, a trainless method for GUI grounding that uses inference-time scaling to anchor natural language instructions to interface elements.
March 17, 2026
Original Paper
Zoom to Essence: Trainless GUI Grounding by Inferring upon Interface Elements
arXiv · 2603.14448
The Takeaway
ZoomUI achieves state-of-the-art results on GUI agent benchmarks without any task-specific fine-tuning or data annotation. This significantly lowers the barrier to building reliable multimodal agents by replacing expensive training with smart, visual-attention-driven inference.
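The digest doesn't include implementation details, but the "zoom to essence" idea, iteratively narrowing the view to the region the model attends to most before grounding the instruction, can be sketched. Everything below (the attention map, the crop policy, the stopping rule) is a hypothetical illustration under assumed mechanics, not the authors' method:

```python
import numpy as np

def zoom_step(attention, bbox, keep=0.5):
    """One hypothetical zoom step: shrink bbox to the sub-window of the
    attention map carrying the most total attention mass."""
    x0, y0, x1, y1 = bbox
    h, w = y1 - y0, x1 - x0
    nh, nw = max(1, int(h * keep)), max(1, int(w * keep))
    best, best_box = -1.0, bbox
    for yy in range(y0, y1 - nh + 1):        # slide the smaller window
        for xx in range(x0, x1 - nw + 1):
            mass = attention[yy:yy + nh, xx:xx + nw].sum()
            if mass > best:
                best, best_box = mass, (xx, yy, xx + nw, yy + nh)
    return best_box

def ground(attention, steps=3):
    """Iteratively zoom toward the attention peak; return the final
    bbox center as the predicted click point (assumed stopping rule:
    a fixed number of zoom steps)."""
    h, w = attention.shape
    bbox = (0, 0, w, h)
    for _ in range(steps):
        bbox = zoom_step(attention, bbox)
    x0, y0, x1, y1 = bbox
    return ((x0 + x1) // 2, (y0 + y1) // 2)

# Toy 16x16 attention map with a single peak at column 12, row 5;
# in the real setting this map would come from the MLLM's visual attention.
attn = np.zeros((16, 16))
attn[5, 12] = 1.0
print(ground(attn))  # → (12, 5)
```

The appeal of an inference-time loop like this is that it uses only a frozen pretrained model's signals, so no grounding dataset or fine-tuning run is needed.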
From the abstract
Multimodal Large Language Model (MLLM)-based Graphical User Interface (GUI) agents are developing rapidly, with visual grounding, which maps natural language instructions to target UI elements, serving as the core capability. Existing GUI agents typically fine-tune MLLMs on massive datasets to handle challenges in understanding instructions and UI interfaces, which not only incurs high data annotation costs but also makes performance dependent on data quality and distribution. To avoid such cumbersome yet