Introduces a training-free pipeline for pixel-level video anomaly detection that achieves a 5x improvement in object-level accuracy.
March 27, 2026
Original Paper
GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids
arXiv · 2603.25467
The Takeaway
GridVAD enables high-precision surveillance monitoring without domain-specific fine-tuning by using VLMs as anomaly proposers and SAM2 for mask propagation. This shifts the bottleneck from model training to intelligent proposal consolidation.
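The propose-then-consolidate step might look like the sketch below. Everything here is illustrative: `Proposal`, `propose_anomalies`, `consolidate`, and the 0.5 score threshold are assumptions, not the paper's API, and the VLM and SAM2 calls are stubbed out with canned data.

```python
from dataclasses import dataclass

@dataclass
class Proposal:
    label: str       # open-set anomaly description from the VLM
    frame_idx: int   # frame where the proposal was made
    score: float     # VLM confidence in [0, 1]

def propose_anomalies(frames):
    """Stand-in for the VLM proposer: in the real pipeline this would
    prompt a vision-language model over stratified frame grids."""
    # Canned output for illustration only.
    return [
        Proposal("person climbing fence", 2, 0.91),
        Proposal("person climbing fence", 3, 0.84),
        Proposal("abandoned bag", 7, 0.62),
        Proposal("shadow on wall", 5, 0.21),  # likely hallucination
    ]

def consolidate(proposals, score_thresh=0.5):
    """Keep the best-scoring exemplar per open-set label and drop
    low-confidence candidates (crude hallucination filtering)."""
    best = {}
    for p in proposals:
        if p.score < score_thresh:
            continue
        if p.label not in best or p.score > best[p.label].score:
            best[p.label] = p
    return sorted(best.values(), key=lambda p: -p.score)

seeds = consolidate(propose_anomalies(frames=None))
for s in seeds:
    # Each surviving proposal would seed SAM2 mask propagation
    # starting at s.frame_idx and tracked through the clip.
    print(s.label, s.frame_idx, round(s.score, 2))
```

In this toy run the duplicate "person climbing fence" proposals collapse to one seed and the low-scoring "shadow on wall" candidate is dropped, leaving two consolidated seeds for mask propagation.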
From the abstract
Vision-Language Models (VLMs) are powerful open-set reasoners, yet their direct use as anomaly detectors in video surveillance is fragile: without calibrated anomaly priors, they alternate between missed detections and hallucinated false alarms. We argue the problem is not the VLM itself but how it is used. VLMs should function as anomaly proposers, generating open-set candidate descriptions that are then grounded and tracked by purpose-built spatial and temporal modules. We instantiate this pro…