Big video AI models aren't actually "watching" your clips; they're mostly just guessing what happens based on the overall vibe.
April 3, 2026
Original Paper
VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification
arXiv · 2604.01569
The Takeaway
These models can often answer questions about a video, but they fail almost every time when asked to point to the specific frames and regions that support the answer. This reveals a massive gap in today's video AI between looking smart and actually understanding visual data.
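The post doesn't spell out how "showing the evidence" gets scored, but a common way to check a claimed temporal location is intersection-over-union between the model's predicted time span and the annotated one. The sketch below is a minimal, hypothetical version of such a check, assuming spans are `[start, end]` pairs in seconds and a 0.5 threshold (both are illustrative, not from the paper):

```python
def temporal_iou(pred, gold):
    """Intersection-over-union of two [start, end] time spans in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

# A predicted span counts as correct evidence only if it overlaps the
# annotated span enough; the 0.5 threshold here is illustrative.
pred, gold = [12.0, 18.0], [14.0, 20.0]
print(temporal_iou(pred, gold))        # 0.5
print(temporal_iou(pred, gold) >= 0.5)  # True: overlap is just sufficient
```

Under a metric like this, a model can get the question right while scoring zero on evidence, which is exactly the gap the benchmark is probing.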
From the abstract
Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address this, we present VideoZeroBench, a hierarchical benchmark designed for cha