Group discussions between similar AI models actually lower performance compared to letting a single model think through a problem alone.
Multi-agent debate is often touted as a way to fix errors, but it frequently leads to sycophantic conformity. When similar models talk to each other, they tend to agree with the first wrong answer suggested rather than correcting it. This digital groupthink destabilizes correct reasoning that a lone model might have reached on its own. Engineers should rethink the "more heads are better" approach for homogeneous systems: effective error correction requires diversity of perspective or isolated self-reflection rather than simple consensus-seeking.
The Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate
arXiv · 2605.00914
Multi-agent debate, in which teams of LLMs iteratively exchange rationales and vote on answers, is widely deployed under the assumption that peer review filters out hallucinations. Yet the failure dynamics of homogeneous debate remain poorly understood. We therefore report findings from a controlled empirical study of teams of $N{=}10$ homogeneous agents (Qwen2.5-7B, Llama-3.1-8B, Ministral-3-8B) across $R{=}3$ debate rounds on two high-difficulty benchmarks (GSM-Hard and MMLU-Hard). We compare peer debate against an isolated self-reflection baseline.
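To make the two conditions concrete, here is a minimal sketch of an unguided homogeneous debate loop with majority voting alongside an isolated self-reflection baseline. The query_model helper, the prompt wording, and the return format are illustrative assumptions rather than the paper's actual implementation; only the overall protocol (exchange rationales, revise, vote) follows the abstract.

```python
# Minimal sketch, under assumptions: query_model is a hypothetical stand-in
# for a call to one shared base model (e.g. Qwen2.5-7B); prompt wording and
# the (rationale, answer) return format are illustrative only.
from collections import Counter


def query_model(prompt: str) -> tuple[str, str]:
    """Return (rationale, final_answer) from the shared base model. Stubbed here."""
    raise NotImplementedError("plug in your LLM client")


def homogeneous_debate(question: str, n_agents: int = 10, n_rounds: int = 3) -> str:
    """Unguided peer debate: agents read each other's rationales, then majority-vote."""
    # Round 0: every agent answers the question independently.
    transcripts = [
        query_model(f"Question: {question}\nThink step by step, then answer.")
        for _ in range(n_agents)
    ]
    for _ in range(n_rounds):
        revised = []
        for i in range(n_agents):
            # Each agent is shown the other agents' rationales and may revise.
            peers = "\n\n".join(r for j, (r, _) in enumerate(transcripts) if j != i)
            prompt = (
                f"Question: {question}\n"
                f"Other agents' reasoning:\n{peers}\n"
                "Reconsider and give your final answer."
            )
            revised.append(query_model(prompt))
        transcripts = revised
    # Team answer = simple majority vote over the last round's answers.
    answers = [answer for _, answer in transcripts]
    return Counter(answers).most_common(1)[0][0]


def isolated_self_reflection(question: str, n_rounds: int = 3) -> str:
    """Baseline: a single agent critiques only its own reasoning, never a peer's."""
    rationale, answer = query_model(f"Question: {question}\nThink step by step, then answer.")
    for _ in range(n_rounds):
        prompt = (
            f"Question: {question}\n"
            f"Your previous reasoning:\n{rationale}\n"
            "Check it for mistakes and give a corrected final answer."
        )
        rationale, answer = query_model(prompt)
    return answer
```

In the debate condition every agent is an instance of the same base model, which is exactly the homogeneity the paper identifies as the driver of conformity; the self-reflection baseline removes peer influence while keeping the same number of revision rounds.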