Big video AI models aren't actually "watching" your clips; they're mostly just guessing what happens based on the overall vibe.
April 3, 2026
Original Paper
VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification
arXiv · 2604.01569
The Takeaway
These models can often answer questions about a video, but they fail almost every time when asked to point to the specific frames and regions that support the answer. This reveals a massive gap in today's video AI between looking smart and actually understanding visual data.
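The post doesn't spell out how "showing the evidence" gets scored, but a common way to check a claimed temporal location is intersection-over-union between the model's predicted time span and the annotated one. The sketch below is a minimal, hypothetical version of such a check, assuming spans are `[start, end]` pairs in seconds and a 0.5 threshold (both are illustrative, not from the paper):

```python
def temporal_iou(pred, gold):
    """Intersection-over-union of two [start, end] time spans in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

# A predicted span counts as correct evidence only if it overlaps the
# annotated span enough; the 0.5 threshold here is illustrative.
pred, gold = [12.0, 18.0], [14.0, 20.0]
print(temporal_iou(pred, gold))        # 0.5
print(temporal_iou(pred, gold) >= 0.5)  # True: overlap is just sufficient
```

Under a metric like this, a model can get the question right while scoring zero on evidence, which is exactly the gap the benchmark is probing.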
From the abstract
Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address this, we present VideoZeroBench, a hierarchical benchmark designed for cha