AI & ML Efficiency Breakthrough

AnoleVLA replaces the standard Transformer backbone in robotic Vision-Language-Action models with Deep State Space Models for a 3x speedup.

March 17, 2026

Original Paper

AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation

Yusuke Takagi, Motonari Kambara, Daichi Yashima, Koki Seno, Kento Tokura, Komei Sugiura

arXiv · 2603.15046

The Takeaway

The paper demonstrates that SSM-based architectures (such as Mamba) can process the long multimodal sequences required for robotics far more efficiently than Transformers. This enables higher-frequency control and better performance on resource-constrained mobile hardware.
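
To make the efficiency argument concrete, here is a minimal sketch (not the AnoleVLA architecture) contrasting a diagonal linear state space recurrence, whose cost grows linearly with sequence length and whose inference state is fixed-size, with vanilla self-attention, whose score matrix grows quadratically. All names, shapes, and the diagonal parameterization are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Run a diagonal linear SSM over a length-L input sequence.

    u: (L, d_in) input tokens; A: (d_state,) diagonal transition;
    B: (d_state, d_in); C: (d_out, d_state).
    Cost is O(L) in sequence length, and inference carries only the
    fixed-size state x, not a growing key/value cache.
    """
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:                      # one cheap recurrence step per token
        x = A * x + B @ u_t            # state update: x_t = A x_{t-1} + B u_t
        ys.append(C @ x)               # readout:      y_t = C x_t
    return np.stack(ys)

def attention(q, k, v):
    """Vanilla self-attention for comparison: the (L, L) score matrix
    makes the cost O(L^2) in sequence length."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    L, d = 512, 64                     # e.g. image patches + language tokens
    u = np.random.randn(L, d)
    A = np.full(16, 0.9)               # stable diagonal dynamics
    B = np.random.randn(16, d) * 0.1
    C = np.random.randn(d, 16) * 0.1
    print(ssm_scan(u, A, B, C).shape)  # (512, 64), built in linear time
    q = k = v = np.random.randn(L, d)
    print(attention(q, k, v).shape)    # (512, 64), but via a 512x512 matrix
```

The linear-time scan and constant-size state are what make SSM backbones attractive for long vision-language-action sequences on mobile hardware; the sketch above only illustrates the asymptotic difference, not the selective gating or hardware-aware kernels used in Mamba-style models.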

From the abstract

In this study, we address the problem of language-guided robotic manipulation, where a robot is required to manipulate a wide range of objects based on visual observations and natural language instructions. This task is essential for service robots that operate in human environments, and requires safety, efficiency, and task-level generality. Although Vision-Language-Action models (VLAs) have demonstrated strong performance for this task, their deployment in resource-constrained environments remains […]