Leum-VL-8B introduces a structural 'grammar' for video parsing, decomposing content into six film-production-style dimensions such as camera language and editing.
March 24, 2026
Original Paper
Leum-VL Technical Report
arXiv · 2603.20354
The Takeaway
Current video models treat frames as a sequence of events; Leum-VL treats video as a structured production. That structure supports identifying timeline-grounded cinematic elements such as hooks, shot-induced tension, and cut rationales, which matter for high-end content generation and professional editing tools.
From the abstract
A short video succeeds not simply because of what it shows, but because of how it schedules attention -- yet current multimodal models lack the structural grammar to parse or produce this organization. Existing models can describe scenes, answer event-centric questions, and read on-screen text, but they are far less reliable at identifying timeline-grounded units such as hooks, cut rationales, shot-induced tension, and platform-facing packaging cues. We propose SV6D (Structured Video in Six Dimen