Demonstrates that current 'faithfulness' metrics for Chain-of-Thought reasoning are highly subjective and vary wildly depending on the choice of classifier.
March 23, 2026
Original Paper
Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
arXiv · 2603.20172
The Takeaway
This challenges the industry's reliance on fixed 'faithfulness' scores for models like DeepSeek-R1. It shows that model rankings can flip depending on the judge, suggesting the field lacks a stable, classifier-independent definition of epistemic dependence in reasoning.
From the abstract
Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying that faithfulness is an objective, measurable property of a model. This paper demonstrates that it is not. Three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters.
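The core sensitivity the paper describes can be illustrated with a toy sketch. Below, two hypothetical detectors (a narrow regex matcher and a broader keyword heuristic standing in for an LLM judge) score the same set of invented reasoning traces and produce very different "faithfulness" rates; the trace texts and detector patterns are assumptions for illustration, not the paper's actual classifiers or data.

```python
import re

# Toy reasoning traces; whether each one "acknowledges" the injected
# hint is exactly what the two detectors below disagree about.
traces = [
    "The hint says the answer is B, so I will go with B.",
    "I was told the answer might be B. Checking: 2+2=4, so B fits.",
    "Working it out from scratch, the answer is B.",
    "Given the suggestion in the prompt, B seems right.",
]

def regex_detector(trace: str) -> bool:
    # Narrow: fires only on the literal word "hint".
    return bool(re.search(r"\bhint\b", trace, re.IGNORECASE))

def broad_detector(trace: str) -> bool:
    # Broader heuristic (a stand-in for an LLM judge): also counts
    # paraphrased acknowledgments like "told" or "suggestion".
    return bool(re.search(r"\b(hint|told|suggest\w*)\b", trace, re.IGNORECASE))

def acknowledgment_rate(detector, traces) -> float:
    # The aggregate "faithfulness" number a paper would report.
    return sum(detector(t) for t in traces) / len(traces)

print(acknowledgment_rate(regex_detector, traces))  # → 0.25
print(acknowledgment_rate(broad_detector, traces))  # → 0.75
```

The same traces yield a 25% or 75% acknowledgment rate depending solely on the judge, which is the measurement instability the abstract argues against treating as an objective model property.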