AI & ML Efficiency Breakthrough

Distills high-fidelity joint audio-visual generation into a real-time streaming model capable of 25 FPS on a single GPU.

March 13, 2026

Original Paper

OmniForcing: Unleashing Real-time Joint Audio-Visual Generation

Yaofeng Su, Yuming Li, Zeyue Xue, Jie Huang, Siming Fu, Haoran Li, Ying Li, Zezhong Qian, Haoyang Huang, Nan Duan

arXiv · 2603.11647

The Takeaway

Most multimodal generators suffer from high latency due to bidirectional attention and modality asymmetry. This framework enables real-time, synchronized audio-video generation, which is a prerequisite for truly responsive interactive AI avatars and live digital content creation.
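As a quick back-of-the-envelope check (not from the paper), the per-frame time budget implied by real-time 25 FPS generation:

```python
fps = 25
budget_ms = 1000 / fps  # milliseconds available per frame
# Video denoising, audio generation, and sync must all fit in this window.
print(budget_ms)  # 40.0
```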

From the abstract

Recent joint audio-visual diffusion models achieve remarkable generation quality but suffer from high latency due to their bidirectional attention dependencies, hindering real-time applications. We propose OmniForcing, the first framework to distill an offline, dual-stream bidirectional diffusion model into a high-fidelity streaming autoregressive generator. However, naively applying causal distillation to such dual-stream architectures triggers severe training instability, due to the extreme te…
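The latency gap the abstract describes can be illustrated with a toy attention-mask sketch (my own illustration, not code from the paper): under a bidirectional mask, frame 0 attends to the final frame, so nothing can be emitted until the whole clip is processed; under a causal mask, each frame depends only on earlier frames and can stream out immediately.

```python
def causal_mask(T):
    # Frame i may attend only to frames j <= i (autoregressive).
    return [[j <= i for j in range(T)] for i in range(T)]

def first_output_step(mask):
    # Earliest step at which frame 0 can be emitted: the index of the
    # last input frame that frame 0's attention row depends on.
    return max(j for j, allowed in enumerate(mask[0]) if allowed)

T = 4
bidir = [[True] * T for _ in range(T)]  # offline: every frame sees every frame
causal = causal_mask(T)

print(first_output_step(bidir))   # 3 -> must wait for the entire clip
print(first_output_step(causal))  # 0 -> frame 0 streams immediately
```

Distillation, as described, transfers the quality of the bidirectional (offline) teacher into a generator constrained to the causal mask, which is what makes streaming output possible.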