Open-sources a high-fidelity foundation model that jointly generates synchronized video and audio using a unified single-stream Transformer.
March 24, 2026
Original Paper
Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
arXiv · 2603.21986
The Takeaway
The release democratizes access to state-of-the-art human-centric generative AI (expressive faces, speech, and motion) by open-sourcing the complete stack, including distilled and super-resolution models. Its single-stream architecture is significantly easier to deploy and optimize than traditional multi-stream or cross-attention models.
From the abstract
We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The mode
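To make the single-stream idea concrete, here is a minimal numpy sketch of the core mechanism the abstract describes: tokens from all modalities are concatenated into one sequence, and a single self-attention pass lets every token attend to every other, with no cross-attention modules. The token counts, embedding size, and random embeddings are illustrative placeholders, not values from the paper.

```python
import numpy as np

# Hypothetical token counts and embedding size, for illustration only.
T_TEXT, T_VIDEO, T_AUDIO, D = 4, 8, 6, 16

rng = np.random.default_rng(0)

# Per-modality token embeddings. In the real model these would come from
# modality-specific tokenizers; here they are random placeholders.
text = rng.standard_normal((T_TEXT, D))
video = rng.standard_normal((T_VIDEO, D))
audio = rng.standard_normal((T_AUDIO, D))

# Single-stream design: all modalities live in one unified token sequence.
x = np.concatenate([text, video, audio], axis=0)  # shape (18, D)

def self_attention(x, wq, wk, wv):
    """Plain softmax self-attention over the unified sequence. Every token
    (text, video, or audio) attends to every other token, so no separate
    cross-attention streams are needed."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

wq, wk, wv = (rng.standard_normal((D, D)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # one updated embedding per token in the unified stream
```

A full model would stack many such layers (with feed-forward blocks, normalization, and modality/position embeddings), but the deployment simplicity claimed above follows from this structure: one sequence, one attention type, standard Transformer infrastructure.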