Standard confidence calibration is structurally biased when ground truth labels are ambiguous or annotators disagree.
March 25, 2026
Original Paper
Confidence Calibration under Ambiguous Ground Truth
arXiv · 2603.22879
The Takeaway
The authors show that temperature scaling systematically underestimates uncertainty in multi-annotator settings and propose new calibrators that are fitted against the full annotator label distribution. This matters in high-stakes domains like medicine or content moderation, where "ground truth" is rarely a single consensus label.
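A toy sketch of the underlying issue (not the paper's calibrators; the data, the grid-search fitting, and the overconfident-model construction are all hypothetical): fitting a temperature against majority-voted one-hot labels rewards sharpening toward the majority class, while fitting against the soft annotator distribution recovers a larger, better-calibrated temperature.

```python
import numpy as np

def softmax(logits, T):
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, targets, grid=np.linspace(0.25, 4.0, 400)):
    """Pick T minimizing cross-entropy between softmax(logits/T) and targets.

    `targets` may be one-hot (majority vote) or soft annotator distributions.
    """
    losses = [
        -(targets * np.log(softmax(logits, T) + 1e-12)).sum(axis=1).mean()
        for T in grid
    ]
    return grid[int(np.argmin(losses))]

# Toy setup: 3 classes, annotators genuinely disagree.
rng = np.random.default_rng(0)
n, k = 2000, 3
# Soft "ground truth": per-example annotator label proportions.
soft = rng.dirichlet(np.ones(k) * 0.7, size=n)
# An overconfident model: its logits overstate the soft labels by a factor of 3.
logits = 3.0 * np.log(soft + 1e-12)
# Majority-vote one-hot targets, as used in standard post-hoc calibration.
hard = np.eye(k)[soft.argmax(axis=1)]

T_hard = fit_temperature(logits, hard)
T_soft = fit_temperature(logits, soft)
print(T_hard, T_soft)
```

In this construction the soft-label fit recovers T ≈ 3 (which exactly undoes the overconfidence), while the majority-vote fit prefers the smallest temperature in the grid, i.e. it pushes the model to be *more* confident, illustrating the structural bias the abstract describes.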
From the abstract
Confidence calibration assumes a unique ground-truth label per input, yet this assumption fails wherever annotators genuinely disagree. Post-hoc calibrators fitted on majority-voted labels (the standard single-label targets used in practice) can appear well-calibrated under conventional evaluation yet remain substantially miscalibrated against the underlying annotator distribution. We show that this failure is structural: under simplifying assumptions, Temperature Scaling is biased toward temper […]