AI & ML · Breaks Assumption

Standard confidence calibration is structurally biased when ground truth labels are ambiguous or annotators disagree.

March 25, 2026

Original Paper

Confidence Calibration under Ambiguous Ground Truth

Linwei Tao, Haoyang Luo, Minjing Dong, Chang Xu

arXiv · 2603.22879

The Takeaway

The authors show that temperature scaling systematically underestimates uncertainty in multi-annotator settings and provide new calibrators that optimize against the full label distribution. This is critical for high-stakes domains like medicine or content moderation where 'ground truth' is rarely a single consensus label.
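To see the bias concretely, here is a minimal sketch contrasting the two fitting targets: standard temperature scaling fit by NLL on hard majority-vote labels, versus the same scalar temperature fit by cross-entropy against each example's full annotator vote distribution. This is not the authors' method; the Dirichlet annotator model, the logit scale, and the choice of cross-entropy as the distribution-aware objective are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(logits, temperature):
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    expz = np.exp(z)
    return expz / expz.sum(axis=1, keepdims=True)

def nll_majority(temperature, logits, majority_labels):
    # Standard temperature-scaling objective: NLL on hard majority-vote labels.
    probs = softmax(logits, temperature)
    return -np.mean(np.log(probs[np.arange(len(majority_labels)), majority_labels] + 1e-12))

def nll_distribution(temperature, logits, vote_dist):
    # Distribution-aware objective: cross-entropy against the annotator vote distribution.
    probs = softmax(logits, temperature)
    return -np.mean(np.sum(vote_dist * np.log(probs + 1e-12), axis=1))

rng = np.random.default_rng(0)
n, k = 5000, 3

# Hypothetical ambiguous data: per-example annotator distributions drawn from a
# Dirichlet, so many examples carry genuine disagreement.
vote_dist = rng.dirichlet(alpha=[0.8] * k, size=n)
majority = vote_dist.argmax(axis=1)

# A model whose logits track the annotator distribution but at the wrong scale.
logits = 4.0 * np.log(vote_dist + 1e-3) + rng.normal(0, 0.5, size=(n, k))

t_hard = minimize_scalar(nll_majority, args=(logits, majority),
                         bounds=(0.05, 20.0), method="bounded").x
t_soft = minimize_scalar(nll_distribution, args=(logits, vote_dist),
                         bounds=(0.05, 20.0), method="bounded").x

print(f"T fit on majority votes:    {t_hard:.2f}")
print(f"T fit on vote distribution: {t_soft:.2f}")
# On this synthetic ambiguous data, t_hard comes out below t_soft: calibrating to
# majority labels keeps the model overconfident relative to the annotator distribution.
```

The direction of the gap is intuitive: the hard-label objective rewards sharpening toward the argmax, so whenever annotators genuinely split, the recovered temperature stays too small.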

From the Abstract

Confidence calibration assumes a unique ground-truth label per input, yet this assumption fails wherever annotators genuinely disagree. Post-hoc calibrators fitted on majority-voted labels (the standard single-label targets used in practice) can appear well-calibrated under conventional evaluation yet remain substantially miscalibrated against the underlying annotator distribution. We show that this failure is structural: under simplifying assumptions, Temperature Scaling is biased toward temperatures that are too low, leaving the model overconfident relative to the annotator distribution.
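The evaluation gap the abstract describes is easy to reproduce. In the sketch below (the paper's exact metrics are not given in this excerpt), a model that confidently predicts the majority label on uniformly ambiguous 60/40 examples scores near zero under conventional ECE yet sits far from the annotator distribution under a simple total-variation measure.

```python
import numpy as np

def ece_majority(probs, majority_labels, n_bins=15):
    # Conventional ECE: bin top-label confidence against accuracy w.r.t. majority votes.
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == majority_labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

def distributional_gap(probs, vote_dist):
    # Mean total-variation distance to the annotator distribution -- one illustrative
    # distribution-aware error; the paper's metric may differ.
    return 0.5 * np.abs(probs - vote_dist).sum(axis=1).mean()

n = 1000
# Hypothetical setup: every example is genuinely ambiguous (annotators split 60/40),
# and the majority label is always class 0.
vote_dist = np.tile([0.6, 0.4], (n, 1))
majority = np.zeros(n, dtype=int)
# A model that confidently predicts the majority label on every example.
probs = np.tile([0.99, 0.01], (n, 1))

print(f"ECE vs. majority labels:    {ece_majority(probs, majority):.3f}")      # ~0.01
print(f"TV gap vs. annotator dist.: {distributional_gap(probs, vote_dist):.3f}")  # 0.39
```

Because the model's predicted label always matches the majority vote, its 0.99 confidence looks almost perfectly calibrated under ECE, even though it assigns nearly all its mass to a label that 40% of annotators reject.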