Introduces a long-horizon video agent that uses 93% fewer frames than GPT-5/standalone LMMs while achieving higher accuracy.
March 23, 2026
Original Paper
VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking
arXiv · 2603.20185
The Takeaway
Moves from dense video parsing to a 'think-act-observe' loop that selectively seeks relevant frames by following the video's logical flow. This dramatically reduces the memory and compute bottleneck of processing ultra-long video sequences.
From the abstract
Video agentic models have advanced challenging video-language tasks. However, most agentic approaches still rely heavily on greedy parsing over densely sampled video frames, resulting in high computational cost. We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video. This insight allows the model to use far fewer frames while maintaining, or even improving, its video understanding capability.
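To make the 'think-act-observe' idea concrete, here is a minimal toy sketch of such a seeking loop. All names (`seek_evidence`, `observe`, `decide`) are hypothetical, and the "think" step, which in VideoSeek is an LMM reasoning over observations, is stubbed with a simple binary-search heuristic that localizes an event boundary while touching only a handful of frames:

```python
def seek_evidence(video_len, observe, decide, max_steps=20):
    """Iteratively seek frames instead of densely parsing all of them.

    observe(t)       -> observation at timestamp t (stands in for frame parsing)
    decide(history)  -> next timestamp to inspect, or None when done ("think")
    Returns the list of (timestamp, observation) pairs actually gathered.
    """
    history = []
    t = video_len // 2                # start seeking from the middle
    for _ in range(max_steps):
        obs = observe(t)              # "act": fetch and parse one frame
        history.append((t, obs))      # "observe": record the evidence
        t = decide(history)           # "think": choose the next seek target
        if t is None:
            break
    return history

# Toy "video": 1000 frames, with an answer-critical event starting at frame 700.
video_len, event_at = 1000, 700

def observe(t):
    return "after" if t >= event_at else "before"

def make_decider():
    # Binary-search stand-in for the agent's reasoning: narrow down the
    # boundary between "before" and "after" observations.
    bounds = {"lo": 0, "hi": video_len}
    def decide(history):
        t, obs = history[-1]
        if obs == "before":
            bounds["lo"] = t + 1
        else:
            bounds["hi"] = t
        if bounds["lo"] >= bounds["hi"]:
            return None               # boundary localized; stop seeking
        return (bounds["lo"] + bounds["hi"]) // 2
    return decide

history = seek_evidence(video_len, observe, make_decider())
print(f"frames inspected: {len(history)} of {video_len}")
```

The point of the sketch is the frame budget: the loop localizes the event after inspecting roughly log2(1000) ≈ 10 frames rather than all 1000, which is the kind of saving behind the paper's "93% fewer frames" headline (the real system replaces the heuristic with learned, tool-guided reasoning).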