AI & ML Efficiency Breakthrough

AvatarForcing achieves real-time, low-latency talking avatar generation at 34 ms per frame (roughly 29 fps) using a one-step streaming diffusion framework.

March 17, 2026

Original Paper

AvatarForcing: One-Step Streaming Talking Avatars via Local-Future Sliding-Window Denoising

Liyuan Cui, Wentao Hu, Wenyuan Zhang, Zesong Yang, Fan Shi, Xiaoqiang Liu

arXiv · 2603.14331

The Takeaway

AvatarForcing tackles exposure bias in streaming avatar generation through dual-anchor temporal forcing and collapses the diffusion process into a single denoising step via two-stage distillation. Together, these offer a production-ready path to high-fidelity, interactive digital humans on consumer-grade GPUs.
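
The "dual-anchor" and "one-step" ideas can be pictured with a small sketch. The following is a minimal illustration, assuming a distilled generator that maps a noisy latent chunk to clean output in a single forward pass while conditioning on two anchors: a fixed reference (identity) frame and the most recently generated frames. The class and argument names (OneStepDenoiser, ref_anchor, history_anchor) are hypothetical, not the paper's API.

```python
# Minimal sketch: one-step denoising conditioned on two anchors.
# All names here are illustrative placeholders, not the paper's interface.
import torch
import torch.nn as nn

class OneStepDenoiser(nn.Module):
    """Hypothetical distilled generator: noisy latents + two anchors -> clean latents."""
    def __init__(self, dim: int = 64):
        super().__init__()
        # Concatenated input: noisy chunk, reference anchor, history anchor.
        self.net = nn.Sequential(nn.Linear(dim * 3, 256), nn.GELU(), nn.Linear(256, dim))

    def forward(self, noisy, ref_anchor, history_anchor):
        # Single forward pass; no iterative sampling loop, which is what keeps
        # per-frame latency low enough for real-time use.
        x = torch.cat([noisy, ref_anchor, history_anchor], dim=-1)
        return self.net(x)

# Usage sketch: one latent frame of size 64 with both anchors.
# out = OneStepDenoiser()(torch.randn(1, 64), torch.randn(1, 64), torch.randn(1, 64))
```

One way to read the dual-anchor idea is that the model is continually re-grounded in both the speaker's identity and its own recent outputs, which is the intuition behind keeping errors from compounding over long rollouts.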

From the abstract

Real-time talking avatar generation requires low latency and minute-level temporal stability. Autoregressive (AR) forcing enables streaming inference but suffers from exposure bias, which causes errors to accumulate and become irreversible over long rollouts. In contrast, full-sequence diffusion transformers mitigate drift but remain computationally prohibitive for real-time long-form synthesis. We present AvatarForcing, a one-step streaming diffusion framework that denoises a fixed local-future …
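
A toy version of the streaming loop described above, under one reading of the abstract: keep a fixed-length window of latent frames (the current frame plus a few "local future" frames), denoise the whole window in one step per tick, emit the oldest frame, then slide the window forward by appending fresh noise. The window length, the audio-feature interface, and one_step_denoise are assumptions for illustration only.

```python
# Toy streaming sketch of local-future sliding-window denoising
# (assumptions, not the paper's implementation).
import torch

FRAME_DIM, WINDOW = 64, 4            # latent size and local-future window length

def one_step_denoise(window_latents, audio_feat):
    # Stand-in for a distilled one-step denoiser conditioned on driving audio;
    # a real model would be a learned network, not this toy update.
    return window_latents - 0.1 * (window_latents - audio_feat)

def stream_frames(audio_stream):
    # Noisy buffer holding the current frame plus (WINDOW - 1) local-future frames.
    window = torch.randn(WINDOW, FRAME_DIM)
    for audio_feat in audio_stream:   # one driving-audio feature per output frame
        window = one_step_denoise(window, audio_feat)
        yield window[0]               # emit the oldest (current) frame
        # Slide forward: drop the emitted frame and append fresh noise for the
        # newest future slot so the window length stays fixed.
        window = torch.cat([window[1:], torch.randn(1, FRAME_DIM)], dim=0)

# Usage: generate eight frames from eight per-frame audio features.
# frames = list(stream_frames(torch.randn(8, FRAME_DIM)))
```

Because each tick touches only a small fixed window in a single step, per-frame cost stays constant no matter how long the avatar has been running, which is what makes minute-level streaming plausible on consumer hardware.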