AI & ML Efficiency Breakthrough

DILLO enables 14x faster safety-critical agent steering by predicting action consequences from latent states instead of heavy visual simulations.

March 25, 2026

Original Paper

Describe-Then-Act: Proactive Agent Steering via Distilled Language-Action World Models

Massimiliano Pappa, Luca Romani, Valentino Sacco, Alessio Palma, Stéphane Lathuilière, Fabio Galasso, Xavier Alameda-Pineda, Indro Spinelli

arXiv · 2603.23149

The Takeaway

DILLO challenges the assumption that visual world models are necessary for proactive foresight in robotics. By distilling a Vision-Language Model teacher into a text-only student, it offers a fast inference path for failure prevention in real-time control loops.
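To make the describe-then-act idea concrete, here is a minimal sketch of a steering step that first obtains a text description of the predicted outcome from the policy's latent state and planned actions, then decides whether to execute. It is an illustration under stated assumptions, not the paper's implementation: the function names (`describe_then_act`, `toy_describe`, `toy_is_unsafe`), the fallback action, and the toy rule inside `toy_describe` are all hypothetical stand-ins for the distilled student model.

```python
from typing import Callable, List, Sequence, Tuple

Latent = Sequence[float]
Action = Sequence[float]


def describe_then_act(
    latent: Latent,
    planned: List[Action],
    describe: Callable[[Latent, List[Action]], str],
    is_unsafe: Callable[[str], bool],
    fallback: List[Action],
) -> Tuple[List[Action], str]:
    """Return the actions to execute and the predicted description.

    `describe` stands in for a text-only predictor that maps the policy's
    latent state and planned actions to a description of the consequence.
    No image is rendered or simulated at any point.
    """
    description = describe(latent, planned)
    if is_unsafe(description):
        # Steer away before anything is executed.
        return fallback, description
    return planned, description


# Toy stand-ins below are hypothetical and only make the sketch runnable.
def toy_describe(latent: Latent, planned: List[Action]) -> str:
    # Assumption: latent[0] is distance to an obstacle and each action's
    # first component is forward motion.
    travelled = sum(a[0] for a in planned)
    if travelled >= latent[0]:
        return "the gripper collides with the obstacle"
    return "the gripper reaches the target"


def toy_is_unsafe(description: str) -> bool:
    return "collides" in description


STOP: List[Action] = [[0.0]]  # hypothetical safe fallback action

if __name__ == "__main__":
    actions, text = describe_then_act(
        [1.0], [[0.6], [0.6]], toy_describe, toy_is_unsafe, STOP
    )
    print(text, actions)
```

In a real system, `describe` would be the distilled student and `is_unsafe` could be any check on its text output; the point of the sketch is only that the gate runs before execution and never touches pixels.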

From the abstract

Deploying safety-critical agents requires anticipating the consequences of actions before they are executed. While world models offer a paradigm for this proactive foresight, current approaches relying on visual simulation incur prohibitive latencies, often exceeding several seconds per step. In this work, we challenge the assumption that visual processing is necessary for failure prevention. We show that a trained policy's latent state, combined with its planned actions, already encodes suffici