AttentionPack achieves up to an 8x improvement in memory efficiency during decoding for large vision-language models (VLMs).
March 26, 2026
Original Paper
Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding
arXiv · 2603.23914
The Takeaway
Memory overhead is the primary bottleneck in long-context multi-modal inference. By combining multi-head attention compaction with low-rank structures, the framework enables much larger batch sizes and faster inference in resource-constrained environments without sacrificing output quality.
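To make the low-rank idea concrete, here is a minimal sketch of how factorizing a per-head key/value cache can shrink its footprint. This is an illustrative truncated-SVD reconstruction, not the paper's actual compaction algorithm; the tensor shapes, the rank, and the `compress_kv_lowrank` helper are assumptions for the example.

```python
import torch

def compress_kv_lowrank(kv: torch.Tensor, rank: int):
    """Factor a per-head cache [seq_len, head_dim] into two low-rank
    matrices via truncated SVD. Illustrative sketch only."""
    U, S, Vh = torch.linalg.svd(kv, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # [seq_len, rank], singular values folded in
    B = Vh[:rank, :]             # [rank, head_dim]
    return A, B

def reconstruct(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # Approximate the original cache from its factors.
    return A @ B

if __name__ == "__main__":
    seq_len, head_dim, rank = 4096, 128, 16  # assumed sizes
    kv = torch.randn(seq_len, head_dim)
    A, B = compress_kv_lowrank(kv, rank)
    full_elems = seq_len * head_dim
    packed_elems = A.numel() + B.numel()
    print(f"compression: {full_elems / packed_elems:.1f}x")  # ~7.8x here
```

With these (assumed) sizes, storing the two factors instead of the full cache cuts element count by roughly 7.8x, which is the kind of arithmetic behind headline figures in this range; how much rank reduction a real model tolerates without quality loss is exactly what attention-aware methods must determine.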
From the abstract
Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference-time efficiency remains a significant challenge due to the memory overhead during decoding, especially when a VLM's query and answer consist of long sequences of visual and text tokens. This paper presents AttentionPack, an adaptive and attention-aware optimization framework tailored to large vision-language models that improves memory efficiency during decoding, focusing on a…
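For intuition on why decoding memory dominates in this setting, a quick back-of-envelope KV-cache calculation for a hypothetical long-context VLM; every model dimension below is an assumption for illustration, not a number from the paper.

```python
# Hypothetical decoder configuration (assumed, not from the paper).
layers, heads, head_dim = 32, 32, 128
seq_len = 4096        # long visual + text token sequence
batch = 16
bytes_fp16 = 2        # fp16 storage

# Keys and values (hence the factor of 2) cached for every layer,
# head, position, and batch element.
kv_bytes = 2 * layers * heads * head_dim * seq_len * batch * bytes_fp16
print(f"KV cache: {kv_bytes / 2**30:.1f} GiB")  # 32.0 GiB
```

At these sizes the cache alone consumes tens of gigabytes, dwarfing per-step compute and capping batch size, which is why compacting it directly translates into larger batches and faster decoding.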