A 270K-sample multi-view video corpus built for embodied AI agents operating in complex retail environments.
April 1, 2026
Original Paper
PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models
arXiv · 2603.29281
The Takeaway
PRISM targets a core failure mode of physical AI: models that recognize objects well but do not understand space and physical dynamics. By pairing 11.8M frames of exocentric and egocentric video with chain-of-thought (CoT) supervision, it enables a 66% reduction in error on spatial and physical reasoning tasks in real-world robotics.
From the abstract
A critical gap exists between the general-purpose visual understanding of state-of-the-art physical AI models and the specialized perceptual demands of structured real-world deployment environments. We present PRISM, a 270K-sample multi-view video supervised fine-tuning (SFT) corpus for embodied vision-language models (VLMs) in real-world retail environments. PRISM is motivated by a simple observation: physical AI systems fail not because of poor visual recognition, but because they do not understand space and physical dynamics.
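As a concrete illustration, here is a minimal sketch of what one CoT-supervised, multi-view SFT record could look like. The schema, field names (`exo_frames`, `ego_frames`, `cot`), and the `<think>`-style target format are assumptions made for illustration; the excerpt above does not specify PRISM's actual record layout.

```python
# Hypothetical PRISM-style SFT sample: paired exocentric/egocentric frames
# plus chain-of-thought supervision. Field names are assumed, not from the paper.
from dataclasses import dataclass
from typing import List

@dataclass
class PrismSample:
    sample_id: str
    exo_frames: List[str]   # exocentric (third-person) view frame paths
    ego_frames: List[str]   # egocentric (first-person) view frame paths
    question: str           # spatial/physical reasoning query
    cot: str                # chain-of-thought rationale (supervision target)
    answer: str             # final answer (supervision target)

def to_sft_target(sample: PrismSample) -> str:
    """Concatenate the CoT rationale and final answer into one target string,
    the common recipe for CoT-supervised fine-tuning."""
    return f"<think>{sample.cot}</think>\n{sample.answer}"

sample = PrismSample(
    sample_id="prism-000001",
    exo_frames=["exo/000001/f000.jpg", "exo/000001/f001.jpg"],
    ego_frames=["ego/000001/f000.jpg", "ego/000001/f001.jpg"],
    question="Is the shelf on the robot's left reachable without moving the cart?",
    cot="The cart blocks the left aisle; the shelf is roughly 1.2 m away behind it.",
    answer="No, the robot must first reposition the cart.",
)
print(to_sft_target(sample))
```

Training the model to emit the rationale before the answer is what lets the corpus supervise the reasoning itself, not just the final label.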