Shows that State Space Models (SSMs) like Mamba can match or beat Vision Transformers as vision encoders in VLMs while being more stable.
March 20, 2026
Original Paper
Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
arXiv · 2603.19209
The Takeaway
Challenges the dominance of ViTs in multimodal models by showing that SSM backbones are competitive on VQA and grounding tasks at smaller scales. It also provides stabilization strategies for visual backbones, which matter for practitioners building robust localization and detection systems.
From the abstract
Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance.
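
The wiring the abstract describes is simple to picture in code: a frozen vision backbone produces patch features, and a lightweight connector projects them into the language model's embedding space. Here is a minimal PyTorch sketch of that setup; all names and dimensions are illustrative assumptions, and the two-layer MLP connector is one common choice rather than the paper's exact design. The stub encoder stands in for whichever backbone is under test, a ViT or an SSM (Mamba-style) encoder.

```python
import torch
import torch.nn as nn

class StubEncoder(nn.Module):
    """Patchify-and-project stand-in for the vision backbone.

    In the paper's setting this slot holds a ViT or an SSM (Mamba-style)
    encoder initialized on ImageNet-1K; the interface is the same either way.
    """
    def __init__(self, dim: int = 768):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.patch_embed(images)         # (B, dim, 14, 14) for 224x224 input
        return x.flatten(2).transpose(1, 2)  # (B, 196, dim) patch tokens

class Connector(nn.Module):
    """Lightweight connector: a two-layer MLP from vision width to LLM width."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

class ToyVLM(nn.Module):
    def __init__(self, encoder: nn.Module, vision_dim: int, llm_dim: int):
        super().__init__()
        self.encoder = encoder
        # Freeze the backbone: only the connector (and the LLM) receive gradients.
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.connector = Connector(vision_dim, llm_dim)

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():               # frozen backbone, no gradients
            vis = self.encoder(images)      # (B, N, vision_dim)
        vis_tokens = self.connector(vis)    # (B, N, llm_dim)
        # Visual tokens are prepended to the text embeddings; the language
        # model that consumes the combined sequence is omitted here.
        return torch.cat([vis_tokens, text_embeds], dim=1)

vlm = ToyVLM(StubEncoder(dim=768), vision_dim=768, llm_dim=2048)
images = torch.randn(2, 3, 224, 224)
text_embeds = torch.randn(2, 32, 2048)
out = vlm(images, text_embeds)              # (2, 196 + 32, 2048)
```

Swapping the backbone then only means replacing `StubEncoder` while the connector and LLM side stay fixed, which is what makes the paper's matched-initialization comparison a controlled one.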