AI & ML · Breaks Assumption

Introduces Cross-Context Verification (CCV) to detect benchmark contamination, finding that contamination is binary: models either recall solutions perfectly or lack reasoning entirely.

March 24, 2026

Original Paper

Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis

Tae-Eun Song

arXiv · 2603.21454

The Takeaway

The paper challenges existing contamination detection methods by arguing that 'reasoning absence' is the decisive discriminator between recall and genuine problem solving. It also reports that 33% of problems previously labeled as contaminated were false positives, which calls for a change in how model integrity is audited.

From the abstract

LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods--paraphrase consistency, n-gram overlap, perplexity analysis--never directly observe whether a model reasons or recalls. Meanwhile, simply repeating verification degrades accuracy: multi-turn review generates false positives faster than it discovers true errors, suggesting that structural approaches are needed. We introduce Cross-Conte