Discovers that language-centric training in Multimodal LLMs actively degrades their internal visual representation quality.
March 24, 2026
Original Paper
Predictive Regularization Against Visual Representation Degradation in Multimodal Large Language Models
arXiv · 2603.20808
The Takeaway
This paper identifies 'visual representation degradation' as a primary reason MLLMs underperform on purely visual tasks. It introduces Predictive Regularization (PRe) to preserve visual features during training, offering a way to scale multimodal models without sacrificing core visual competence.
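The digest does not reproduce the paper's exact PRe objective, but the general idea of a predictive regularizer can be sketched as an auxiliary loss in which a small head predicts the frozen initial visual features from the model's mid-layer visual-token states; all shapes, names, and the linear-head form below are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def predictive_reg_loss(hidden, target, W):
    """Hypothetical auxiliary loss: a linear head W maps mid-layer
    visual-token states back onto the frozen initial visual features;
    the mean squared error penalizes representation drift."""
    pred = hidden @ W                       # (n_patches, d_vis)
    return float(np.mean((pred - target) ** 2))

# Toy shapes (made up): 16 patch tokens, LLM width 32, vision width 8.
hidden = rng.normal(size=(16, 32))          # mid-layer visual-token states
target = rng.normal(size=(16, 8))           # frozen initial visual features
W = rng.normal(size=(32, 8)) * 0.1          # trainable prediction head

loss = predictive_reg_loss(hidden, target, W)
```

In training, such a term would be added to the usual language-modeling loss with a weighting coefficient, so that optimizing language objectives cannot freely erase visually informative structure.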
From the abstract
While Multimodal Large Language Models (MLLMs) excel at vision-language tasks, the cost of their language-driven training on internal visual foundational competence remains unclear. In this paper, we conduct a detailed diagnostic analysis to unveil a pervasive issue: visual representation degradation in MLLMs. Specifically, we find that compared to the initial visual features, the visual representation in the middle layers of the LLM exhibits a degradation in both global function and patch structure…
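The two failure modes the abstract names, degradation of the global representation and of patch-level structure, could be probed by comparing mid-layer states against the initial visual features. The sketch below is a hypothetical diagnostic, not the paper's protocol: it scores global drift with cosine similarity of mean-pooled features, and patch-structure drift with the correlation between the two patch-similarity (Gram) matrices.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def global_similarity(init_feats, mid_feats):
    """Compare mean-pooled ('global') representations of the image."""
    return cosine(init_feats.mean(axis=0), mid_feats.mean(axis=0))

def patch_structure_similarity(init_feats, mid_feats):
    """Compare patch-to-patch similarity structure: correlation of the
    off-diagonal entries of the two normalized Gram matrices."""
    def gram_offdiag(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        g = x @ x.T
        return g[~np.eye(len(g), dtype=bool)]
    return float(np.corrcoef(gram_offdiag(init_feats),
                             gram_offdiag(mid_feats))[0, 1])

rng = np.random.default_rng(0)
init = rng.normal(size=(16, 8))               # toy initial patch features
mid = init + 0.5 * rng.normal(size=(16, 8))   # toy drifted mid-layer features

# Identical features score 1.0 on both probes; drift pulls both below 1.
```

Lower scores on either probe would indicate that language-driven training has moved the mid-layer representation away from the visual encoder's original geometry.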