AI & ML Paradigm Shift

Decouples high-level reasoning from low-level motor control in robotics using a visual prompting interface.

March 24, 2026

Original Paper

VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

Zixuan Wang, Yuxin Chen, Yuqi Liu, Jinhui Ye, Pengguang Chen, Changsheng Lu, Shu Liu, Jiaya Jia

arXiv · 2603.22003

The Takeaway

Instead of treating Vision-Language-Action (VLA) models as black boxes, this method uses "visual prompts" (such as bounding boxes and crosshairs) as a standardized interface between high-level planning and low-level execution. This improves spatial precision and robustness on out-of-distribution robotic tasks.
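To make the interface idea concrete, here is a minimal sketch of how a dual-system pipeline might exchange visual prompts. All names (`VisualPrompt`, `plan`, `execute`) and the mock detector output are illustrative assumptions, not the paper's actual implementation:

```python
from dataclasses import dataclass

# Hypothetical sketch of the dual-system interface: a high-level planner
# emits visual prompts (bounding boxes, crosshairs), and a low-level
# executor consumes those prompts instead of raw language.

@dataclass
class VisualPrompt:
    """A standardized visual annotation passed from planner to executor."""
    kind: str       # "bbox" or "crosshair" (assumed prompt types)
    coords: tuple   # (x1, y1, x2, y2) for bbox; (x, y) for crosshair

def plan(instruction: str, detections: dict) -> list:
    """High-level reasoner: grounds the instruction into visual prompts."""
    prompts = []
    for name, box in detections.items():
        if name in instruction:
            x1, y1, x2, y2 = box
            prompts.append(VisualPrompt("bbox", box))
            # A crosshair at the box center marks the target point.
            prompts.append(VisualPrompt("crosshair",
                                        ((x1 + x2) / 2, (y1 + y2) / 2)))
    return prompts

def execute(prompts: list) -> list:
    """Low-level controller: sees only spatial prompts, not the instruction."""
    return [p.coords for p in prompts if p.kind == "crosshair"]

# Example: ground "pick up the mug" against mock detector output.
detections = {"mug": (10, 20, 30, 40), "plate": (50, 50, 90, 90)}
targets = execute(plan("pick up the mug", detections))
print(targets)  # [(20.0, 30.0)]
```

The point of the sketch is the decoupling: the executor never parses language, so swapping in a different planner (or debugging its grounding visually) leaves the control side untouched.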

From the abstract

Vision-Language-Action (VLA) models typically map visual observations and linguistic instructions directly to robotic control signals. This "black-box" mapping forces a single forward pass to simultaneously handle instruction interpretation, spatial grounding, and low-level control, often leading to poor spatial precision and limited robustness in out-of-distribution scenarios. To address these limitations, we propose VP-VLA, a dual-system framework that decouples high-level reasoning and low-level control […]