HumanOmni-Speaker achieves end-to-end speaker diarization and lip-reading by compressing high-frequency motion residuals into just 6 tokens per frame.
March 24, 2026
Original Paper
HumanOmni-Speaker: Identifying Who said What and When
arXiv · 2603.21664
The Takeaway
Existing multimodal LLMs face a trade-off: sampling video at low FPS fails to capture high-frequency dynamics, while sampling at high FPS causes token explosion. This architecture enables "Who said what and when" reasoning natively, with no face cropping or external diarization tools.
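The paper summary only states the budget of 6 tokens per frame; it does not spell out the compression mechanism. Below is a minimal sketch of one plausible reading, assuming a Perceiver-style resampler in which learned queries cross-attend over per-frame motion residuals (high-FPS frame features minus a sparse keyframe's features). The class name, residual computation, and all dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MotionResidualCompressor(nn.Module):
    """Hypothetical sketch: compress each frame's motion residual into a
    fixed budget of 6 tokens via learned-query cross-attention.
    Only the 6-token budget comes from the paper; everything else here
    is an assumption."""

    def __init__(self, dim: int = 768, num_tokens: int = 6, num_heads: int = 8):
        super().__init__()
        # One learned query per output token (6 per frame).
        self.queries = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor,
                keyframe_feats: torch.Tensor) -> torch.Tensor:
        """
        frame_feats:    (B, T, N, D) patch features for T high-FPS frames
        keyframe_feats: (B, 1, N, D) patch features of a sparse keyframe
        returns:        (B, T, 6, D) compressed motion tokens
        """
        B, T, N, D = frame_feats.shape
        # Motion residual: what changed relative to the keyframe.
        residual = (frame_feats - keyframe_feats).reshape(B * T, N, D)
        # Learned queries attend over the residual patches of each frame.
        q = self.queries.unsqueeze(0).expand(B * T, -1, -1)  # (B*T, 6, D)
        tokens, _ = self.attn(q, residual, residual)
        return self.norm(tokens).reshape(B, T, -1, D)

if __name__ == "__main__":
    comp = MotionResidualCompressor()
    frames = torch.randn(2, 25, 196, 768)   # one second at 25 FPS, 196 ViT patches
    keyframe = torch.randn(2, 1, 196, 768)  # low-FPS keyframe
    print(comp(frames, keyframe).shape)     # (2, 25, 6, 768): 6 tokens per frame
```

Under these assumptions, a one-second 25 FPS clip costs 150 motion tokens instead of 25 × 196 full patch tokens, which is the kind of budget that would let high-frame-rate video fit in an LLM context without the token explosion the takeaway describes.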
From the abstract
While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "Who said what and when." Current models suffer from an "illusion of competence": they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency […]