DILLO enables 14x faster safety-critical agent steering by predicting action consequences from latent states instead of heavy visual simulations.
March 25, 2026
Original Paper
Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models
arXiv · 2603.23149
The Takeaway
DILLO breaks the assumption that visual world models are necessary for proactive foresight in robotics. By distilling a Vision-Language Model teacher into a text-only student, it provides a high-speed inference path for failure prevention in real-time control loops.
From the abstract
Deploying safety-critical agents requires anticipating the consequences of actions before they are executed. While world models offer a paradigm for this proactive foresight, current approaches relying on visual simulation incur prohibitive latencies, often exceeding several seconds per step. In this work, we challenge the assumption that visual processing is necessary for failure prevention. We show that a trained policy's latent state, combined with its planned actions, already encodes sufficient …
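To make the idea concrete, here is a minimal sketch of the pattern the abstract describes: a small text/latent-only student scores a planned action sequence for failure risk directly from the policy's latent state, trained against soft labels from a slow VLM teacher. All dimensions, names, and the MLP architecture below are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions -- assumptions for illustration, not from the paper.
LATENT_DIM, ACTION_DIM, HORIZON, HIDDEN = 32, 4, 8, 64

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tiny student "world model": predicts failure probability for a planned
# action sequence directly from the policy's latent state -- no pixels,
# no visual simulation, so inference is a few matrix multiplies.
W1 = rng.normal(0.0, 0.1, (LATENT_DIM + ACTION_DIM * HORIZON, HIDDEN))
W2 = rng.normal(0.0, 0.1, (HIDDEN, 1))

def student_failure_prob(latent, actions):
    x = np.concatenate([latent, actions.ravel()])
    h = np.tanh(x @ W1)
    return float(sigmoid(h @ W2))

def distill_loss(p_student, p_teacher):
    # Binary cross-entropy against the teacher's soft failure label,
    # i.e. the student is trained to match the (slow) VLM teacher's judgment.
    eps = 1e-7
    p = np.clip(p_student, eps, 1.0 - eps)
    return float(-(p_teacher * np.log(p) + (1.0 - p_teacher) * np.log(1.0 - p)))

# One steering step: score the plan, veto it if predicted risk is too high.
latent = rng.normal(size=LATENT_DIM)
plan = rng.normal(size=(HORIZON, ACTION_DIM))
risk = student_failure_prob(latent, plan)
loss = distill_loss(risk, p_teacher=0.9)  # teacher flagged this plan as risky
execute_plan = risk < 0.5                 # hypothetical safety gate
```

Because the student never touches images, the per-step cost is dominated by two small matrix multiplies, which is where a speedup like the reported 14x over visual simulation would come from.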