Auditing 'Silicon Bureaucracy' reveals that LLM benchmark scores are often inflated by contamination-related memory reactivation rather than genuine generalization.
March 24, 2026
Original Paper
Silicon Bureaucracy and AI Test-Oriented Education: Contamination Sensitivity and Score Confidence in LLM Benchmarks
arXiv · 2603.21636
The Takeaway
The paper introduces a router-worker audit showing that models can reassemble benchmark cues from noisy perturbations. This forces a re-evaluation of how we read model leaderboards and suggests that score confidence varies widely even across models with similar headline performance.
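As a minimal sketch of what a perturbation-based contamination audit could look like (the noise model, the `model_complete` callable, and all function names below are illustrative assumptions, not the authors' router-worker implementation): corrupt a benchmark item, ask the model to restore it, and measure how closely the output matches the original. Consistently high reconstruction on items the model should never have memorized is a contamination signal.

```python
import difflib
import random

def perturb(text: str, drop_rate: float = 0.3, seed: int = 0) -> str:
    """Corrupt a benchmark item by randomly dropping words (an illustrative noise model)."""
    rng = random.Random(seed)
    kept = [w for w in text.split() if rng.random() > drop_rate]
    return " ".join(kept)

def reconstruction_score(original: str, reconstruction: str) -> float:
    """Character-level similarity between the model's output and the original item (0..1)."""
    return difflib.SequenceMatcher(None, original, reconstruction).ratio()

def audit_item(model_complete, item: str, n_trials: int = 5) -> float:
    """Average how well a model re-assembles the original item from noisy versions.

    `model_complete` is a hypothetical callable mapping a perturbed prompt to the
    model's output. High reconstruction on items the model should not have memorized
    points to contamination; low reconstruction is consistent with generalization.
    """
    scores = [
        reconstruction_score(item, model_complete(perturb(item, seed=s)))
        for s in range(n_trials)
    ]
    return sum(scores) / len(scores)
```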
From the abstract
Public benchmarks increasingly govern how large language models (LLMs) are ranked, selected, and deployed. We frame this benchmark-centered regime as Silicon Bureaucracy and AI Test-Oriented Education, and argue that it rests on a fragile assumption: that benchmark scores directly reflect genuine generalization. In practice, however, such scores may conflate exam-oriented competence with principled capability, especially when contamination and semantic leakage are difficult to exclude from modern training corpora.
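The score-confidence point is easy to make concrete. The paper's own estimator is not reproduced here; as an illustrative stand-in, a per-item bootstrap shows how the same headline accuracy can carry very different uncertainty depending on how many items back it.

```python
import random

def bootstrap_ci(per_item_correct: list[int], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float, float]:
    """Bootstrap a (1 - alpha) confidence interval for a benchmark accuracy.

    `per_item_correct` holds one 0/1 outcome per benchmark item.
    Returns (point_estimate, lower, upper).
    """
    rng = random.Random(seed)
    n = len(per_item_correct)
    point = sum(per_item_correct) / n
    # Resample items with replacement and recompute the score each time.
    resampled = sorted(
        sum(rng.choices(per_item_correct, k=n)) / n for _ in range(n_boot)
    )
    lower = resampled[int((alpha / 2) * n_boot)]
    upper = resampled[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper

# The same 80% headline score is far less certain on 100 items than on 1,000.
print(bootstrap_ci([1] * 80 + [0] * 20))      # e.g. (0.80, ~0.72, ~0.88)
print(bootstrap_ci([1] * 800 + [0] * 200))    # e.g. (0.80, ~0.78, ~0.82)
```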