AI & ML Efficiency Breakthrough

ResAdapt learns a per-frame visual budget allocator that optimizes input resolution before encoding.

March 31, 2026

Original Paper

ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning

Huanxuan Liao, Zhongtao Jiang, Yupu Hao, Yuqiao Tan, Shizhu He, Jun Zhao, Kun Xu, Kang Liu

arXiv · 2603.28610

The Takeaway

By dynamically allocating pixels to the most informative frames, ResAdapt lets MLLMs process 16x more frames within the same token budget, achieving a 15% performance gain on reasoning-intensive video tasks.

From the abstract

Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an input-side adaptation framework that learns how much visual budget each frame should receive before encoding.
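The core idea, splitting a fixed visual-token budget across frames according to their informativeness, can be sketched as follows. This is a minimal illustration, not the paper's method: the frame scores, the per-frame token bounds, and the softmax-proportional split are all assumptions (the paper learns the allocation rather than computing it with a fixed rule).

```python
import math

def allocate_budget(scores, total_tokens, min_tokens=16, max_tokens=1024):
    """Split a fixed visual-token budget across frames in proportion to
    hypothetical per-frame informativeness scores. ResAdapt learns this
    allocation; here a softmax-proportional split stands in for it."""
    # Softmax over scores -> each frame's fractional share of the budget.
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    shares = [e / z for e in exps]
    # Provisional allocation, clipped to per-frame bounds so no frame
    # is starved below min_tokens or inflated past max_tokens.
    alloc = [min(max_tokens, max(min_tokens, round(sh * total_tokens)))
             for sh in shares]
    # Rounding/clipping can overshoot the budget; trim the excess,
    # taking tokens from the largest allocations first.
    excess = sum(alloc) - total_tokens
    for i in sorted(range(len(alloc)), key=lambda i: -alloc[i]):
        if excess <= 0:
            break
        cut = min(excess, alloc[i] - min_tokens)
        alloc[i] -= cut
        excess -= cut
    return alloc
```

Under this scheme a highly informative frame receives proportionally more tokens (i.e., a higher input resolution before encoding), while uninformative frames are kept near the floor, which is what lets many more frames fit in the same overall budget.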