Shows that State Space Models (SSMs) like Mamba can match or beat Vision Transformers as vision encoders in VLMs while being more stable.
March 20, 2026
Original Paper
Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
arXiv · 2603.19209
The Takeaway
Challenges the dominance of ViTs in multimodal models by showing that SSM backbones are competitive on VQA and grounding tasks at smaller scales. It also provides stabilization strategies for visual backbones, which matter for practitioners building robust localization and detection systems.
From the abstract
Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance.
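
The wiring the abstract describes is simple to picture in code: a frozen vision backbone produces patch features, and a lightweight connector projects them into the language model's embedding space. Here is a minimal PyTorch sketch of that setup; all names and dimensions are illustrative assumptions, and the two-layer MLP connector is one common choice rather than the paper's exact design. The stub encoder stands in for whichever backbone is under test, a ViT or an SSM (Mamba-style) encoder.

```python
import torch
import torch.nn as nn

class StubEncoder(nn.Module):
    """Patchify-and-project stand-in for the vision backbone.

    In the paper's setting this slot holds a ViT or an SSM (Mamba-style)
    encoder initialized on ImageNet-1K; the interface is the same either way.
    """
    def __init__(self, dim: int = 768):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.patch_embed(images)         # (B, dim, 14, 14) for 224x224 input
        return x.flatten(2).transpose(1, 2)  # (B, 196, dim) patch tokens

class Connector(nn.Module):
    """Lightweight connector: a two-layer MLP from vision width to LLM width."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

class ToyVLM(nn.Module):
    def __init__(self, encoder: nn.Module, vision_dim: int, llm_dim: int):
        super().__init__()
        self.encoder = encoder
        # Freeze the backbone: only the connector (and the LLM) receive gradients.
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.connector = Connector(vision_dim, llm_dim)

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():               # frozen backbone, no gradients
            vis = self.encoder(images)      # (B, N, vision_dim)
        vis_tokens = self.connector(vis)    # (B, N, llm_dim)
        # Visual tokens are prepended to the text embeddings; the language
        # model that consumes the combined sequence is omitted here.
        return torch.cat([vis_tokens, text_embeds], dim=1)

vlm = ToyVLM(StubEncoder(dim=768), vision_dim=768, llm_dim=2048)
images = torch.randn(2, 3, 224, 224)
text_embeds = torch.randn(2, 32, 2048)
out = vlm(images, text_embeds)              # (2, 196 + 32, 2048)
```

Swapping the backbone then only means replacing `StubEncoder` while the connector and LLM side stay fixed, which is what makes the paper's matched-initialization comparison a controlled one.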