A 270K-sample multi-view video corpus built for embodied AI agents operating in complex retail environments.
April 1, 2026
Original Paper
PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models
arXiv · 2603.29281
The Takeaway
PRISM targets a core failure mode of physical AI: models that recognize objects well but do not understand space and physical dynamics. By pairing 11.8M frames of exocentric and egocentric video with chain-of-thought (CoT) supervision, it enables a 66% reduction in error on spatial and physical reasoning tasks in real-world robotics.
From the abstract
A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation: physical AI systems fail not because of poor visual recognition, but because they do not understand space and physical dynamics.
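As a concrete illustration, here is a minimal sketch of what one CoT-supervised, multi-view SFT record could look like. The schema, field names (`exo_frames`, `ego_frames`, `cot`), and the `<think>`-style target format are assumptions made for illustration; the excerpt above does not specify PRISM's actual record layout.

```python
# Hypothetical PRISM-style SFT sample: paired exocentric/egocentric frames
# plus chain-of-thought supervision. Field names are assumed, not from the paper.
from dataclasses import dataclass
from typing import List

@dataclass
class PrismSample:
    sample_id: str
    exo_frames: List[str]   # exocentric (third-person) view frame paths
    ego_frames: List[str]   # egocentric (first-person) view frame paths
    question: str           # spatial/physical reasoning query
    cot: str                # chain-of-thought rationale (supervision target)
    answer: str             # final answer (supervision target)

def to_sft_target(sample: PrismSample) -> str:
    """Concatenate the CoT rationale and final answer into one target string,
    the common recipe for CoT-supervised fine-tuning."""
    return f"<think>{sample.cot}</think>\n{sample.answer}"

sample = PrismSample(
    sample_id="prism-000001",
    exo_frames=["exo/000001/f000.jpg", "exo/000001/f001.jpg"],
    ego_frames=["ego/000001/f000.jpg", "ego/000001/f001.jpg"],
    question="Is the shelf on the robot's left reachable without moving the cart?",
    cot="The cart blocks the left aisle; the shelf is roughly 1.2 m away behind it.",
    answer="No, the robot must first reposition the cart.",
)
print(to_sft_target(sample))
```

Training the model to emit the rationale before the answer is what lets the corpus supervise the reasoning itself, not just the final label.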