AI & ML Efficiency Breakthrough

Enables merging independently trained specialist models (e.g., Vision-LLM and Audio-LLM) into a single multimodal model without any paired training data.

March 24, 2026

Original Paper

SSAM: Singular Subspace Alignment for Merging Multimodal Large Language Models

Md Kaykobad Reza, Ameya Patil, Edward Ayrapetian, M. Salman Asif

arXiv · 2603.21584

The Takeaway

SSAM aligns the singular subspaces of the specialist models' weights in parameter space, combining them into one model while minimizing interference. The merged model outperforms even jointly trained counterparts, offering a major shortcut for building versatile multimodal systems without expensive paired-data collection.
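The paper's exact procedure is not reproduced here, but the core idea of aligning parameter updates in a shared singular subspace before merging can be illustrated. The sketch below is a minimal, hypothetical example, not the authors' algorithm: it assumes both specialists were fine-tuned from the same base checkpoint, and the function name `ssam_merge`, the `rank` and `alpha` parameters, and the projection step are all illustrative choices.

```python
import torch

def ssam_merge(w_base: torch.Tensor,
               w_vision: torch.Tensor,
               w_audio: torch.Tensor,
               rank: int = 64,
               alpha: float = 0.5) -> torch.Tensor:
    """Illustrative merge of one weight matrix from two specialist
    models fine-tuned from the same shared base checkpoint.
    (Hypothetical sketch; not the published SSAM procedure.)"""
    # Task-specific updates relative to the shared base weights.
    delta_v = w_vision - w_base
    delta_a = w_audio - w_base

    # Orthonormal basis for the top singular subspace of the vision update.
    u_v, _, _ = torch.linalg.svd(delta_v, full_matrices=False)
    basis = u_v[:, :rank]  # shape (m, rank)

    # Project the audio update onto that subspace so the two updates
    # share a common set of directions, reducing interference.
    delta_a_aligned = basis @ (basis.T @ delta_a)

    # Blend the aligned updates and add them back to the base.
    return w_base + alpha * delta_v + (1.0 - alpha) * delta_a_aligned
```

Applied layer by layer across a state dict, a routine like this would merge, for example, the shared language-backbone weights of a Vision-LLM and an Audio-LLM; the actual alignment procedure and hyperparameters are specified in the paper.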

From the abstract

Multimodal large language models (MLLMs) achieve strong performance by jointly processing inputs from multiple modalities, such as vision, audio, and language. However, building such models or extending them to new modalities often requires large paired datasets and substantial computational resources. Since many pretrained MLLMs (e.g., vision-language or audio-language) are publicly available, we ask whether we can merge them into a single MLLM that can handle multiple modalities. Merging MLLMs […]