Challenges the assumption that 'background' pixels are useless in GUI agents and identifies a 'recency effect' for optimal token pruning.
March 30, 2026
Original Paper
Rethinking Token Pruning for Historical Screenshots in GUI Visual Agents: Semantic, Spatial, and Temporal Perspectives
arXiv · 2603.26041
The Takeaway
The paper finds that background regions in GUI screenshots are critical for detecting state transitions (e.g., buttons being pressed). It offers a blueprint for managing historical screenshots, showing that agents perform better when the token budget is allocated heavily toward recent frames while highly compressed background cues are still retained.
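To make the recency-weighted budgeting idea concrete, here is a minimal Python sketch. It is not the paper's method: the function name `allocate_history_budget`, the geometric `decay` weighting, and the per-frame `floor` (standing in for the small set of compressed tokens kept for every older frame) are all assumptions chosen for illustration.

```python
from typing import List


def allocate_history_budget(num_frames: int, total_budget: int,
                            decay: float = 0.5, floor: int = 16) -> List[int]:
    """Split a visual-token budget across historical screenshots.

    Index 0 is the oldest frame and index -1 the most recent one.
    `decay` and `floor` are hypothetical knobs, not values from the paper.
    """
    if num_frames <= 0:
        return []
    if not 0.0 < decay <= 1.0:
        raise ValueError("decay must be in (0, 1]")
    if total_budget < floor * num_frames:
        raise ValueError("total_budget is too small for the per-frame floor")

    # Every frame keeps at least `floor` tokens, so older frames are
    # compressed heavily but never dropped entirely.
    spare = total_budget - floor * num_frames

    # Geometric recency weights: the newest frame has weight 1, and each
    # step back in time multiplies the weight by `decay`.
    weights = [decay ** (num_frames - 1 - i) for i in range(num_frames)]
    total_weight = sum(weights)

    budgets = [floor + int(spare * w / total_weight) for w in weights]

    # Tokens lost to integer truncation go to the most recent frame.
    budgets[-1] += total_budget - sum(budgets)
    return budgets


if __name__ == "__main__":
    # Four historical screenshots sharing 1024 visual tokens.
    print(allocate_history_budget(4, 1024))
```

How the tokens inside each frame are then chosen (for example, which background patches survive) is a separate decision that this sketch does not model.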
From the abstract
In recent years, GUI visual agents built upon Multimodal Large Language Models (MLLMs) have demonstrated strong potential in navigation tasks. However, high-resolution GUI screenshots produce a large number of visual tokens, making the direct preservation of complete historical information computationally expensive. In this paper, we conduct an empirical study on token pruning for historical screenshots in GUI scenarios and distill three practical insights that are crucial for designing effectiv