DreamPlan fine-tunes Vision-Language planners entirely within the 'imagination' of a video world model, bypassing costly physical robot trials.
March 18, 2026
Original Paper
DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models
arXiv · 2603.16860
The Takeaway
DreamPlan uses sub-optimal zero-shot rollout data to train a video world model that captures complex physics, then uses that world model as a safe, fast sandbox for reinforcement fine-tuning. This significantly lowers the barrier to grounding high-level VLM reasoning in physical task dynamics.
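The core loop can be sketched in miniature: fine-tune a planner against a learned dynamics model rather than a physical robot. Everything below is an illustrative toy, not the paper's implementation; the class names, the one-dimensional "dynamics," the reward, and the hill-climbing update are all stand-in assumptions for the actual video world model and RL algorithm.

```python
import random

class VideoWorldModel:
    """Toy stand-in for a learned video world model: given the current
    observation and an action, it 'imagines' the next observation and a
    task-progress reward (here: reach state 10 in a 1-D world)."""
    def step(self, obs, action):
        next_obs = obs + action
        reward = -abs(next_obs - 10.0)
        return next_obs, reward

class Planner:
    """Toy stand-in for a VLM planner with one tunable parameter."""
    def __init__(self, bias=0.0):
        self.bias = bias
    def act(self, obs):
        return self.bias

def imagined_return(planner, world_model, obs=0.0, horizon=8):
    """Roll the planner out entirely inside the world model's
    'imagination' and sum the predicted rewards."""
    total = 0.0
    for _ in range(horizon):
        action = planner.act(obs)
        obs, reward = world_model.step(obs, action)
        total += reward
    return total

def finetune(planner, world_model, iters=200, seed=0):
    """Crude RL-style update (random-search hill climbing) on imagined
    returns: perturb the planner, keep changes that improve the return.
    No physical robot trial is ever executed."""
    rng = random.Random(seed)
    best = imagined_return(planner, world_model)
    for _ in range(iters):
        old_bias = planner.bias
        planner.bias += rng.uniform(-0.5, 0.5)
        ret = imagined_return(planner, world_model)
        if ret > best:
            best = ret
        else:
            planner.bias = old_bias  # revert a non-improving perturbation
    return best
```

In this sketch the world model is hand-coded, whereas DreamPlan's point is that it is *trained* from sub-optimal zero-shot data; the sketch only shows why such a model makes fine-tuning cheap and safe, since every trial is imagined.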
From the abstract
Robotic manipulation requires sophisticated commonsense reasoning, a capability naturally possessed by large-scale Vision-Language Models (VLMs). While VLMs show promise as zero-shot planners, their lack of grounded physical understanding often leads to compounding errors and low success rates when deployed in complex real-world environments, particularly for challenging tasks like deformable object manipulation. Although Reinforcement Learning (RL) can adapt these planners to specific task dynamics …