VideoAtlas enables navigation and reasoning over long-form video using compute that scales only logarithmically with video length.
March 19, 2026
Original Paper
VideoAtlas: Navigating Long-Form Video in Logarithmic Compute
arXiv · 2603.17948
The Takeaway
VideoAtlas replaces lossy text-based video summarization with a lossless hierarchical visual grid that lets agents "zoom in" on evidence. This structural change sidesteps the quadratic context-window cost of standard video models, enabling genuinely long-context visual understanding.
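To make the logarithmic-compute claim concrete, here is a minimal sketch of the zoom-in idea: a video of N frames is viewed as a coarse grid of k cells, and an agent repeatedly descends into the cell containing its target, reaching any single frame in O(log_k N) steps. The function name, the fan-out k, and the span bookkeeping are all illustrative assumptions, not the paper's actual interface.

```python
def zoom_path(n_frames: int, target: int, k: int = 9) -> list[range]:
    """Return the sequence of frame spans an agent inspects while
    'zooming in' from the whole video to a single target frame.
    Hypothetical sketch: each step splits the current span into at
    most k grid cells and descends into the one holding the target."""
    lo, hi = 0, n_frames            # current span [lo, hi)
    path = [range(lo, hi)]
    while hi - lo > 1:
        cell = (hi - lo + k - 1) // k   # frames per grid cell (ceil)
        idx = (target - lo) // cell     # cell containing the target
        lo = lo + idx * cell
        hi = min(lo + cell, hi)
        path.append(range(lo, hi))
    return path

# For a million-frame video with a 3x3 grid, the descent depth is
# roughly log_9(1e6) ~ 6.3, so only a handful of zoom steps are needed.
path = zoom_path(n_frames=1_000_000, target=123_456)
print(len(path), path[-1])
```

Each span in the path corresponds to one grid-overview glance, so the total compute per query grows with the depth of the hierarchy rather than with the raw frame count.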
From the abstract
Extending language models to video introduces two challenges: representation, where existing methods rely on lossy approximations, and long-context, where caption- or agent-based pipelines collapse video into text and lose visual fidelity. To overcome this, we introduce VideoAtlas, a task-agnostic environment to represent video as a hierarchical grid that is simultaneously lossless, navigable, scalable, caption- and preprocessing-free. An overview of the video is available at a glance,