Achieves 16x prefill speedup for video models by using reinforcement learning to dynamically compress visual tokens based on temporal 'surprise'.
March 30, 2026
Original Paper
Dynamic Token Compression for Efficient Video Understanding through Reinforcement Learning
arXiv · 2603.26365
The Takeaway
Context redundancy is the primary bottleneck for long-form video understanding; this framework allows models to retain 99.5% of baseline performance while discarding 90% of visual tokens, drastically reducing the compute needed for long-context multimodal tasks.
From the abstract
Multimodal Large Language Models have demonstrated remarkable capabilities in video understanding, yet face prohibitive computational costs and performance degradation from "context rot" due to massive visual token redundancy. Existing compression strategies typically rely on heuristics or fixed transformations that are often decoupled from the downstream task objectives, limiting their adaptability and effectiveness. To address this, we propose SCORE (Surprise-augmented token COmpression via …
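To make the core idea concrete, here is a minimal sketch of surprise-driven token compression. Note the assumptions: SCORE learns its compression policy with reinforcement learning, whereas this sketch substitutes a fixed heuristic (frame-to-frame cosine distance as the "surprise" signal, with the token budget allocated proportionally to it); the function names, shapes, and selection rule are all hypothetical and only illustrate the general mechanism, not the paper's method.

```python
import torch
import torch.nn.functional as F

def temporal_surprise(frame_feats: torch.Tensor) -> torch.Tensor:
    """Heuristic per-frame 'surprise': 1 - cosine similarity between
    consecutive frames' mean token features. frame_feats: (T, N, D)
    visual tokens for T frames, N tokens per frame, feature dim D."""
    means = frame_feats.mean(dim=1)                      # (T, D)
    sim = F.cosine_similarity(means[1:], means[:-1], dim=-1)  # (T-1,)
    # The first frame has no predecessor, so treat it as maximally novel.
    return torch.cat([torch.ones(1), 1.0 - sim])         # (T,)

def compress_tokens(frame_feats: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    """Keep ~keep_ratio of all tokens, spending more of the budget on
    high-surprise frames and less on temporally redundant ones."""
    T, N, _ = frame_feats.shape
    surprise = temporal_surprise(frame_feats)
    budget = max(1, int(keep_ratio * T * N))
    # Allocate each frame's token count proportional to its surprise.
    per_frame = (surprise / surprise.sum() * budget).long().clamp(min=1, max=N)
    kept = []
    for t in range(T):
        k = int(per_frame[t])
        # Within a frame, keep the tokens farthest from the frame mean,
        # a crude stand-in for "most informative" token selection.
        dist = (frame_feats[t] - frame_feats[t].mean(dim=0)).norm(dim=-1)
        kept.append(frame_feats[t, dist.topk(k).indices])
    return torch.cat(kept, dim=0)

# Example: 64 frames x 256 tokens x 1024-dim features -> ~10% of tokens kept.
tokens = torch.randn(64, 256, 1024)
print(compress_tokens(tokens, keep_ratio=0.1).shape)
```

In the paper's framing, the fixed surprise heuristic above is exactly what the RL policy replaces: because the compression decisions are optimized against downstream task reward rather than a hand-crafted score, the method avoids the task-decoupling problem the abstract attributes to existing approaches.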