Exposes 'hidden clones' in VLM ensembles, where models from the same family share correlated errors that naive voting mechanisms fail to detect.
March 19, 2026
Original Paper
Hidden Clones: Exposing and Fixing Family Bias in Vision-Language Model Ensembles
arXiv · 2603.17111
The Takeaway
The paper shows that VLM ensemble diversity is largely an illusion: effective voter counts can be as low as 2.5 regardless of how many models are ensembled. It proposes 'family-aware' voting algorithms that recover significant accuracy by accounting for each model's architectural heritage.
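The paper's exact voting algorithms are not detailed in this summary; a minimal sketch of one family-aware scheme, in which each architectural family is first collapsed to a single vote so clones cannot outvote independent models (the two-level majority and the `families` mapping are illustrative assumptions, not the paper's method):

```python
from collections import Counter, defaultdict

def family_aware_vote(answers, families):
    """Majority vote in which each model family gets exactly one vote.

    answers:  dict of model name -> predicted answer
    families: dict of model name -> family name
    """
    # Step 1: plain majority vote inside each family.
    per_family = defaultdict(Counter)
    for model, ans in answers.items():
        per_family[families[model]][ans] += 1
    # Step 2: one vote per family at the top level.
    top = Counter()
    for counts in per_family.values():
        top[counts.most_common(1)[0][0]] += 1
    return top.most_common(1)[0][0]

# Naive voting would let three clones outvote two independent models:
answers = {"a1": "cat", "a2": "cat", "a3": "cat",   # same family, shared error
           "b1": "dog", "c1": "dog"}                # two independent families
families = {"a1": "A", "a2": "A", "a3": "A", "b1": "B", "c1": "C"}
print(family_aware_vote(answers, families))  # → dog
```

A plain 5-way majority on this example would return "cat", the correlated error; collapsing family A to one vote flips the outcome.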
From the abstract
Ensembling Vision-Language Models (VLMs) from different providers maximizes benchmark accuracy, yet models from the same architectural family share correlated errors that standard voting ignores. We study this structure across 17 VLMs from 8 families on VQAv2, TextVQA, and GQA. Family-correlated errors reduce effective ensemble dimensionality to 2.5-3.6 independent voters and create a Misleading tier (1.5-6.5% of questions) where correlated majority errors destroy accuracy to 0% despite the best […]
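The "effective voter" figure can be read as an effective sample size under correlated errors. One standard back-of-the-envelope estimator (an assumption here, not necessarily the paper's) uses the average pairwise error correlation: n_eff = n / (1 + (n - 1) * rho_bar):

```python
import numpy as np

def effective_voters(errors):
    """Effective number of independent voters from a binary error matrix.

    errors: (n_models, n_questions) array of 0/1 error indicators.
    Applies n_eff = n / (1 + (n - 1) * mean pairwise correlation),
    the classic effective-sample-size formula under equicorrelation.
    """
    n = errors.shape[0]
    corr = np.corrcoef(errors)
    off_diag = corr[~np.eye(n, dtype=bool)]     # drop self-correlations
    rho = max(off_diag.mean(), 0.0)             # clamp at independence
    return n / (1 + (n - 1) * rho)

# Synthetic example: 6 models, but only 2 families with strongly
# shared errors (each "clone" flips its family's error pattern 10%
# of the time). All names and parameters here are illustrative.
rng = np.random.default_rng(0)
base_a = rng.integers(0, 2, 1000)
base_b = rng.integers(0, 2, 1000)
flip = lambda b: np.where(rng.random(1000) < 0.1, 1 - b, b)
errors = np.stack([flip(base_a) for _ in range(3)] +
                  [flip(base_b) for _ in range(3)])
print(effective_voters(errors))  # well below 6, near the paper's 2.5-3.6 range
```

Six nominal models collapse to roughly two-to-three effective voters, which mirrors the 2.5-3.6 dimensionality the abstract reports for real VLM families.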