HumanOmni-Speaker achieves end-to-end speaker diarization and lip-reading by compressing high-frequency motion residuals into just 6 tokens per frame.
March 24, 2026
Original Paper
HumanOmni-Speaker: Identifying Who said What and When
arXiv · 2603.21664
The Takeaway
Existing multimodal LLMs face a trade-off: sampling video at low FPS fails to capture high-frequency dynamics, while sampling at high FPS causes token explosion. This architecture enables "Who said what and when" reasoning natively, with no face cropping or external diarization tools.
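The paper summary only states the budget of 6 tokens per frame; it does not spell out the compression mechanism. Below is a minimal sketch of one plausible reading, assuming a Perceiver-style resampler in which learned queries cross-attend over per-frame motion residuals (high-FPS frame features minus a sparse keyframe's features). The class name, residual computation, and all dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MotionResidualCompressor(nn.Module):
    """Hypothetical sketch: compress each frame's motion residual into a
    fixed budget of 6 tokens via learned-query cross-attention.
    Only the 6-token budget comes from the paper; everything else here
    is an assumption."""

    def __init__(self, dim: int = 768, num_tokens: int = 6, num_heads: int = 8):
        super().__init__()
        # One learned query per output token (6 per frame).
        self.queries = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor,
                keyframe_feats: torch.Tensor) -> torch.Tensor:
        """
        frame_feats:    (B, T, N, D) patch features for T high-FPS frames
        keyframe_feats: (B, 1, N, D) patch features of a sparse keyframe
        returns:        (B, T, 6, D) compressed motion tokens
        """
        B, T, N, D = frame_feats.shape
        # Motion residual: what changed relative to the keyframe.
        residual = (frame_feats - keyframe_feats).reshape(B * T, N, D)
        # Learned queries attend over the residual patches of each frame.
        q = self.queries.unsqueeze(0).expand(B * T, -1, -1)  # (B*T, 6, D)
        tokens, _ = self.attn(q, residual, residual)
        return self.norm(tokens).reshape(B, T, -1, D)

if __name__ == "__main__":
    comp = MotionResidualCompressor()
    frames = torch.randn(2, 25, 196, 768)   # one second at 25 FPS, 196 ViT patches
    keyframe = torch.randn(2, 1, 196, 768)  # low-FPS keyframe
    print(comp(frames, keyframe).shape)     # (2, 25, 6, 768): 6 tokens per frame
```

Under these assumptions, a one-second 25 FPS clip costs 150 motion tokens instead of 25 × 196 full patch tokens, which is the kind of budget that would let high-frame-rate video fit in an LLM context without the token explosion the takeaway describes.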
From the abstract
While Omni-modal Large Language Models have made strides in joint sensory processing, they fundamentally struggle with a cornerstone of human interaction: deciphering complex, multi-person conversational dynamics to accurately answer "Who said what and when." Current models suffer from an "illusion of competence": they exploit visual biases in conventional benchmarks to bypass genuine cross-modal alignment, while relying on sparse, low-frame-rate visual sampling that destroys crucial high-frequency […]