Uses a lightweight GRPO-trained policy to select optimal video frames, reducing processing time by 93% while actually improving Video QA accuracy.
March 20, 2026
Original Paper
HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models
arXiv · 2603.18850
The Takeaway
Instead of uniform sampling, HORNet learns which frames are task-relevant, reducing input data by up to 99%. This enables the processing of long-form video content at a fraction of the usual compute cost without sacrificing (and often improving) reasoning quality.
From the abstract
Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce **HORNet**, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces …
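To make the core idea concrete, here is a minimal sketch of the two ingredients the abstract describes: a tiny learned policy that scores per-frame relevance, and GRPO's group-relative advantage, which normalizes each rollout's reward against the other rollouts in its group instead of using a learned value function. The feature dimensions, group size, and rewards below are made up for illustration; the paper's actual policy architecture and reward are not specified in this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_frames(frame_feats, w):
    """Tiny linear policy: one relevance logit per frame (stand-in for
    the <1M-parameter selector described in the abstract)."""
    return frame_feats @ w

def select_top_k(logits, k):
    """Keep only the k highest-scoring frames for the frozen VLM."""
    return np.argsort(logits)[-k:]

def group_relative_advantages(rewards):
    """GRPO-style advantage: each rollout's reward, normalized by the
    mean and std of its own group of rollouts (no critic network)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Toy example: 100 frames with 16-dim features, select 8.
frame_feats = rng.normal(size=(100, 16))
w = rng.normal(size=16)                 # the policy's trainable weights
chosen = select_top_k(score_frames(frame_feats, w), k=8)

# Hypothetical group of G=4 frame selections for one question,
# rewarded 1.0 when the VLM answers correctly, else 0.0.
rewards = [1.0, 0.0, 1.0, 1.0]
adv = group_relative_advantages(rewards)
# Selections that led to a correct answer get positive advantage,
# the failed one gets negative; these would weight the policy-gradient
# update on the selector's logits.
```

In a full GRPO loop, `adv` would scale the log-probability gradients of the sampled frame selections; the VLM itself stays frozen, which is what keeps the trainable footprint so small.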