Systematically profiles VLM inference bottlenecks and releases optimization 'recipes' that cut time-to-first-token by up to 93%.
March 19, 2026
Original Paper
Empirical Recipes for Efficient and Compact Vision-Language Models
arXiv · 2603.16987
The Takeaway
The paper finds that compact VLMs are often bottlenecked by memory traffic rather than compute. Its released optimization recipes and the accompanying ArgusVLM model family provide a blueprint for deploying high-performance vision-language agents on edge devices with minimal latency.
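The memory-traffic-vs-compute distinction can be illustrated with a roofline-style back-of-the-envelope check: compare a kernel's arithmetic intensity (FLOPs per byte moved) against the hardware's balance point. The hardware numbers and matrix shapes below are illustrative placeholders, not figures from the paper.

```python
# Roofline-style sketch: is a matmul memory-bound or compute-bound?
# All hardware specs here are assumed example values, not from the paper.

def arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte moved for an (m,k) x (k,n) matmul in fp16."""
    flops = 2 * m * n * k                                  # multiply-adds
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n) # read A, B; write C
    return flops / bytes_moved

def bound(intensity, peak_flops=1e12, mem_bw=100e9):
    """Compare intensity to the machine balance point (FLOP/s ÷ bytes/s)."""
    balance = peak_flops / mem_bw  # FLOPs per byte the hardware can feed
    return "compute-bound" if intensity > balance else "memory-bound"

# Autoregressive decode step: batch of 1 token against a 2048x2048 weight.
# Intensity is ~1 FLOP/byte, far below a ~10 FLOP/byte balance point,
# so the step is dominated by weight traffic, not arithmetic.
ai = arithmetic_intensity(1, 2048, 2048)
print(bound(ai))  # → memory-bound
```

Under these assumed specs, single-token decode sits deep in the memory-bound regime, which is why shrinking parameter counts alone does not deliver proportional speedups.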
From the abstract
Deploying vision-language models (VLMs) in resource-constrained settings demands low latency and high throughput, yet existing compact VLMs often fall short of the inference speedups their smaller parameter counts suggest. To explain this discrepancy, we conduct an empirical end-to-end efficiency analysis and systematically profile inference to identify the dominant bottlenecks. Based on these findings, we develop optimization recipes tailored to compact VLMs that substantially reduce latency […]
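The headline metric here, time-to-first-token (TTFT), is simple to measure: wall-clock time from issuing a request until the first token arrives from a streaming generator. The streaming interface below is a hypothetical stand-in, not the paper's harness.

```python
import time

def time_to_first_token(token_stream):
    """Wall-clock latency until the first token arrives.
    `token_stream` is any iterator yielding tokens (hypothetical interface)."""
    start = time.perf_counter()
    first = next(token_stream)          # blocks through prefill + first decode
    return first, time.perf_counter() - start

# Stand-in for a real VLM decode stream:
def fake_stream():
    time.sleep(0.01)  # simulate vision encoding + prefill
    yield "hello"
    yield "world"

tok, ttft = time_to_first_token(fake_stream())
```

In a real benchmark the generator would be the model's streaming decode loop; the measured TTFT then captures exactly the prefill-dominated cost the paper's recipes target.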