Introduces a long-horizon video agent that uses 93% fewer frames than GPT-5/standalone LMMs while achieving higher accuracy.
March 23, 2026
Original Paper
VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
arXiv · 2603.20185
The Takeaway
Moves from dense video parsing to a 'think-act-observe' loop that selectively seeks relevant frames by following the video's logical flow. This dramatically reduces the memory and compute bottleneck of processing ultra-long video sequences.
From the abstract
Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still rely heavily on greedy parsing over densely sampled video frames, resulting in high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This insight allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability.
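To make the 'think-act-observe' idea concrete, here is a minimal toy sketch of such a seeking loop. All names (`seek_evidence`, `observe`, `decide`) are hypothetical, and the "think" step, which in VideoSeek is an LMM reasoning over observations, is stubbed with a simple binary-search heuristic that localizes an event boundary while touching only a handful of frames:

```python
def seek_evidence(video_len, observe, decide, max_steps=20):
    """Iteratively seek frames instead of densely parsing all of them.

    observe(t)       -> observation at timestamp t (stands in for frame parsing)
    decide(history)  -> next timestamp to inspect, or None when done ("think")
    Returns the list of (timestamp, observation) pairs actually gathered.
    """
    history = []
    t = video_len // 2                # start seeking from the middle
    for _ in range(max_steps):
        obs = observe(t)              # "act": fetch and parse one frame
        history.append((t, obs))      # "observe": record the evidence
        t = decide(history)           # "think": choose the next seek target
        if t is None:
            break
    return history

# Toy "video": 1000 frames, with an answer-critical event starting at frame 700.
video_len, event_at = 1000, 700

def observe(t):
    return "after" if t >= event_at else "before"

def make_decider():
    # Binary-search stand-in for the agent's reasoning: narrow down the
    # boundary between "before" and "after" observations.
    bounds = {"lo": 0, "hi": video_len}
    def decide(history):
        t, obs = history[-1]
        if obs == "before":
            bounds["lo"] = t + 1
        else:
            bounds["hi"] = t
        if bounds["lo"] >= bounds["hi"]:
            return None               # boundary localized; stop seeking
        return (bounds["lo"] + bounds["hi"]) // 2
    return decide

history = seek_evidence(video_len, observe, make_decider())
print(f"frames inspected: {len(history)} of {video_len}")
```

The point of the sketch is the frame budget: the loop localizes the event after inspecting roughly log2(1000) ≈ 10 frames rather than all 1000, which is the kind of saving behind the paper's "93% fewer frames" headline (the real system replaces the heuristic with learned, tool-guided reasoning).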