Integrates tactile perception into video-action models to enable high-fidelity force modulation in contact-rich robotic tasks.
March 25, 2026
Original Paper
VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs
arXiv · 2603.23481
The Takeaway
Current Vision-Language-Action (VLA) models fail on tasks where the critical state is not visually observable (e.g., handling fragile objects). VTAM adds tactile streams to video-action models and outperforms standard baselines by 80% on high-precision tasks such as picking up potato chips.
From the abstract
Video-Action Models (VAMs) have emerged as a promising framework for embodied intelligence, learning implicit world dynamics from raw video streams to produce temporally consistent action predictions. Although such models demonstrate strong performance on long-horizon tasks through visual reasoning, they remain limited in contact-rich scenarios where critical interaction states are only partially observable from vision alone. In particular, fine-grained force modulation and contact transitions …
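The excerpt describes the high-level idea (a video-action backbone augmented with a tactile stream) but not the architecture. The sketch below is one minimal way to condition an action head on fused video and tactile embeddings; every class name, tensor dimension, and the late-fusion-by-concatenation choice are assumptions made for illustration, not details from the paper.

```python
# Minimal sketch (not the authors' code): fusing a tactile stream with a
# video-action policy. All module names, dimensions, and the fusion scheme
# are illustrative assumptions.
import torch
import torch.nn as nn


class VideoEncoder(nn.Module):
    """Toy video encoder: averages frame features over time and projects them."""
    def __init__(self, frame_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(frame_dim, embed_dim)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, time, frame_dim) -> (batch, embed_dim)
        return self.proj(video.mean(dim=1))


class TactileEncoder(nn.Module):
    """Toy tactile encoder for synchronized pressure/shear readings."""
    def __init__(self, tactile_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(tactile_dim, embed_dim)

    def forward(self, tactile: torch.Tensor) -> torch.Tensor:
        # tactile: (batch, time, tactile_dim) -> (batch, embed_dim)
        return self.proj(tactile.mean(dim=1))


class VideoTactileActionPolicy(nn.Module):
    """Concatenates video and tactile embeddings and predicts an action chunk."""
    def __init__(self, frame_dim=512, tactile_dim=32, embed_dim=128,
                 action_dim=7, horizon=8):
        super().__init__()
        self.video_enc = VideoEncoder(frame_dim, embed_dim)
        self.tactile_enc = TactileEncoder(tactile_dim, embed_dim)
        self.action_head = nn.Sequential(
            nn.Linear(2 * embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim * horizon),
        )
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, video: torch.Tensor, tactile: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.video_enc(video), self.tactile_enc(tactile)], dim=-1)
        # Emit a temporally consistent chunk of future actions from the fused state.
        return self.action_head(fused).view(-1, self.horizon, self.action_dim)


if __name__ == "__main__":
    policy = VideoTactileActionPolicy()
    video = torch.randn(2, 16, 512)    # 16-step video feature sequences
    tactile = torch.randn(2, 16, 32)   # synchronized tactile readings
    print(policy(video, tactile).shape)  # torch.Size([2, 8, 7])
```

Late fusion by concatenation is only the simplest possibility; the actual model may integrate tactile tokens earlier, for example via cross-attention inside the video-action backbone.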