Open-sources a high-fidelity foundation model that jointly generates synchronized video and audio using a unified single-stream Transformer.
March 24, 2026
Original Paper
Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
arXiv · 2603.21986
The Takeaway
The release democratizes access to state-of-the-art human-centric generative AI (expressive faces, speech, and motion) by open-sourcing the complete stack, including distilled and super-resolution models. Its single-stream architecture is significantly easier to deploy and optimize than traditional multi-stream or cross-attention models.
From the abstract
We present daVinci-MagiHuman, an open-source audio-video generative foundation model for human-centric generation. daVinci-MagiHuman jointly generates synchronized video and audio using a single-stream Transformer that processes text, video, and audio within a unified token sequence via self-attention only. This single-stream design avoids the complexity of multi-stream or cross-attention architectures while remaining easy to optimize with standard training and inference infrastructure. The mode
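To make the single-stream idea concrete, here is a minimal numpy sketch of the core mechanism the abstract describes: tokens from all modalities are concatenated into one sequence, and a single self-attention pass lets every token attend to every other, with no cross-attention modules. The token counts, embedding size, and random embeddings are illustrative placeholders, not values from the paper.

```python
import numpy as np

# Hypothetical token counts and embedding size, for illustration only.
T_TEXT, T_VIDEO, T_AUDIO, D = 4, 8, 6, 16

rng = np.random.default_rng(0)

# Per-modality token embeddings. In the real model these would come from
# modality-specific tokenizers; here they are random placeholders.
text = rng.standard_normal((T_TEXT, D))
video = rng.standard_normal((T_VIDEO, D))
audio = rng.standard_normal((T_AUDIO, D))

# Single-stream design: all modalities live in one unified token sequence.
x = np.concatenate([text, video, audio], axis=0)  # shape (18, D)

def self_attention(x, wq, wk, wv):
    """Plain softmax self-attention over the unified sequence. Every token
    (text, video, or audio) attends to every other token, so no separate
    cross-attention streams are needed."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[1])
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

wq, wk, wv = (rng.standard_normal((D, D)) for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # one updated embedding per token in the unified stream
```

A full model would stack many such layers (with feed-forward blocks, normalization, and modality/position embeddings), but the deployment simplicity claimed above follows from this structure: one sequence, one attention type, standard Transformer infrastructure.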