A massive AI model can learn how to plug in an Ethernet cable after just two hours of real-world practice.
April 29, 2026
Original Paper
RL Token: Bootstrapping Online RL with Vision-Language-Action Models
arXiv · 2604.23073
The Takeaway
Training robots to perform high-precision tasks usually takes weeks of human guidance or millions of simulated trials. The RL Token interface lets large Vision-Language-Action models be fine-tuned via reinforcement learning in just a few hours of real-world practice. The model treats physical actions as just another type of token, so it can learn directly from its own successes and failures in the real world. This approach learned complex physical skills faster than traditional human-led training, offering a practical path to deploying highly capable robots in homes and factories with minimal setup.
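One common way to realize "actions as tokens" is to discretize each continuous control dimension into a fixed number of bins, so the policy can emit motor commands with the same categorical output head it uses for text. The sketch below illustrates only that binning idea; the bin count, action range, and function names are assumptions for illustration, not details from the paper.

```python
import numpy as np

# Hypothetical discretization: each continuous action dimension is mapped
# to one of N_BINS token ids, so a language-model-style categorical head
# can predict actions. Bin count and range are assumptions.
N_BINS = 256           # tokens per action dimension (assumed)
LOW, HIGH = -1.0, 1.0  # normalized action range (assumed)

def action_to_tokens(action):
    """Map each action dimension to a discrete token id in [0, N_BINS)."""
    scaled = (np.clip(action, LOW, HIGH) - LOW) / (HIGH - LOW)
    return np.minimum((scaled * N_BINS).astype(int), N_BINS - 1)

def tokens_to_action(tokens):
    """Invert the binning: token id -> bin-center action value."""
    return LOW + (tokens + 0.5) / N_BINS * (HIGH - LOW)

a = np.array([0.37, -0.82, 0.0])
t = action_to_tokens(a)
a_rec = tokens_to_action(t)
# Round-trip error is bounded by half a bin width.
assert np.all(np.abs(a - a_rec) <= (HIGH - LOW) / (2 * N_BINS))
```

Once actions are token ids, an RL fine-tuning step can reweight the model's log-probabilities on those ids by observed success or failure, just as with text tokens.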
From the abstract
Vision-language-action (VLA) models can learn to perform diverse manipulation skills "out of the box," but achieving the precision and speed that real-world tasks demand requires further fine-tuning -- for example, via reinforcement learning (RL). We introduce a lightweight method that enables sample-efficient online RL fine-tuning of pretrained VLAs using just a few hours of real-world practice. We (1) adapt the VLA to expose an "RL token," a compact readout representation that preserves task-r
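One plausible reading of the "RL token" readout is: append a single learned token to the observation sequence, run the (frozen) pretrained backbone, and take the hidden state at that slot as a compact representation feeding a small RL head. Everything below is a hypothetical toy under that reading: the dimensions, the stand-in linear "backbone," and the value head are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32    # hidden size (assumed)
SEQ = 8   # number of observation tokens (assumed)

# Frozen "backbone": a stand-in for the pretrained VLA transformer layers.
W_backbone = rng.standard_normal((D, D)) / np.sqrt(D)

# Learned pieces for RL fine-tuning: the RL-token embedding and a tiny head.
rl_token = rng.standard_normal(D) * 0.02  # appended readout token (trainable)
W_value = rng.standard_normal(D) * 0.02   # small critic head on the readout

def forward(obs_tokens):
    """Append the RL token, run the frozen backbone, read out its slot."""
    x = np.vstack([obs_tokens, rl_token])  # (SEQ + 1, D)
    h = np.tanh(x @ W_backbone)            # stand-in for transformer layers
    readout = h[-1]                        # hidden state at the RL-token slot
    return readout, readout @ W_value      # compact representation + value

obs = rng.standard_normal((SEQ, D))
readout, value = forward(obs)
assert readout.shape == (D,)
```

The design point this toy makes is that only the RL token and the head need gradients, which keeps online RL updates cheap relative to fine-tuning the whole backbone.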