Large language models judge mistakes much more harshly when they appear at the beginning of a document than at the end.
April 23, 2026
Original Paper
Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring
arXiv · 2604.18835
The Takeaway
LLMs exhibit a severe positional bias that undermines their use as automated judges for similarity scoring. A model's scoring distribution forms a stable, model-specific fingerprint. These models fail to read documents holistically, instead giving disproportionate weight to early content. This could lead to unfair grading or flawed legal reviews when an LLM is used to compare long texts, and it calls for a serious rethink of how LLM-as-a-judge systems are deployed in critical workflows.
From the abstract
We propose a scalable, multifactorial experimental framework that systematically probes LLM sensitivity to subtle semantic changes in pairwise document comparison. We analogize this as a needle-in-a-haystack problem: a single semantically altered sentence (the needle) is embedded within surrounding context (the hay), and we vary the perturbation type (negation, conjunction swap, named entity replacement), context type (original vs. topically unrelated), needle position, and document length across […]
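To make the setup concrete, here is a minimal sketch of how such a needle-in-a-haystack document pair could be constructed. The function names and the toy perturbation rules below are illustrative assumptions, not the authors' actual implementation; real perturbations would need proper NLP tooling.

```python
# Hypothetical sketch of the experimental setup described above: build a
# pair of documents that differ only in one "needle" sentence, embedded
# at a chosen position within surrounding "hay". The perturbation rules
# are deliberately simplistic stand-ins for the paper's three types.

def perturb(sentence: str, kind: str) -> str:
    """Apply a toy semantic perturbation to the needle sentence."""
    if kind == "negation":
        return sentence.replace(" is ", " is not ", 1)
    if kind == "conjunction_swap":
        return sentence.replace(" and ", " or ", 1)
    if kind == "entity_replacement":
        return sentence.replace("Paris", "Berlin", 1)
    raise ValueError(f"unknown perturbation: {kind}")

def build_pair(hay: list[str], needle: str, kind: str, position: int):
    """Embed the needle and its perturbed twin at `position` in the hay."""
    original = hay[:position] + [needle] + hay[position:]
    altered = hay[:position] + [perturb(needle, kind)] + hay[position:]
    return " ".join(original), " ".join(altered)

hay = [f"Filler sentence number {i}." for i in range(10)]
doc_a, doc_b = build_pair(hay, "The treaty is signed in Paris.", "negation", 0)
# doc_a and doc_b differ only in the first sentence; a position-robust
# judge should score this pair the same as one with the needle at the end.
```

Sweeping `position` from the start to the end of the hay, while holding everything else fixed, is what isolates the positional bias the paper reports.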