AI & ML Breaks Assumption

Auditing 'Silicon Bureaucracy' reveals that LLM benchmark scores are often inflated by contamination-related memory reactivation rather than genuine generalization.

March 24, 2026

Original Paper

Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks

Yiliang Song, Hongjun An, Jiangan Chen, Xuanchen Yan, Huan Song, Jiawei Shao, Xuelong Li

arXiv · 2603.21636

The Takeaway

The paper introduces a router–worker audit showing that models can reassemble benchmark cues even from noisy perturbations. It forces a re-evaluation of how we interpret model leaderboards and suggests that score confidence varies widely even among models with similar performance levels.
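The flavor of audit the takeaway describes, checking whether a score survives meaning-preserving perturbations of benchmark items, can be sketched minimally as below. This is an illustrative toy, not the paper's actual router–worker method: the function name, the toy memorizing "model," and the trivial perturbation are all assumptions made for the example.

```python
import random

def contamination_sensitivity(model, items, perturb, n_trials=5, seed=0):
    """Estimate how much a benchmark score depends on surface form.

    model(prompt) -> answer string
    items: list of (prompt, gold_answer) pairs
    perturb(prompt, rng) -> semantically equivalent rewording
    Returns (clean_acc, perturbed_acc, sensitivity); a large gap
    suggests memorization rather than generalization.
    """
    rng = random.Random(seed)
    clean = sum(model(p) == g for p, g in items) / len(items)
    perturbed_runs = []
    for _ in range(n_trials):
        acc = sum(model(perturb(p, rng)) == g for p, g in items) / len(items)
        perturbed_runs.append(acc)
    perturbed = sum(perturbed_runs) / n_trials
    return clean, perturbed, clean - perturbed

# Toy stand-in: a "model" that has memorized exact prompt strings.
memorized = {"2+2=?": "4", "capital of France?": "Paris"}
model = lambda p: memorized.get(p, "unknown")
# Trivial surface perturbation that preserves meaning.
perturb = lambda p, rng: p.replace("?", " ?")
items = [("2+2=?", "4"), ("capital of France?", "Paris")]
print(contamination_sensitivity(model, items, perturb))  # (1.0, 0.0, 1.0)
```

The memorizing toy scores perfectly on verbatim prompts but collapses under even a whitespace change, which is exactly the kind of gap between exam-oriented competence and principled capability the paper's framing targets.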

From the abstract

Public benchmarks increasingly govern how large language models (LLMs) are ranked, selected, and deployed. We frame this benchmark-centered regime as Silicon Bureaucracy and AI Test-Oriented Education, and argue that it rests on a fragile assumption: that benchmark scores directly reflect genuine generalization. In practice, however, such scores may conflate exam-oriented competence with principled capability, especially when contamination and semantic leakage are difficult to exclude from modern …