A statistical test that researchers have treated for decades as the safe choice may be producing misleading results.
April 29, 2026
Original Paper
Stop Using the Wilcoxon Test: Myth, Misconception and Misuse in IR Research
arXiv · 2604.25349
The Takeaway
The Wilcoxon signed-rank test is widely treated as the safe, non-parametric alternative to the t-test. This paper argues that the test is routinely misapplied in Information Retrieval (IR) and often fails to provide the safety it promises. According to the authors, it can signal statistical significance where none exists, which calls into question the validity of published findings that rely on it. The assumptions the test requires are rarely met in modern datasets. The authors call for more appropriate statistical methods so that the field measures real progress rather than chasing noise.
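For readers who have not seen the mechanics, here is a minimal sketch of what the two tests compute from paired per-topic score differences. It is not taken from the paper: the data is hypothetical, the function names are ours, and no p-values are computed. The point is only that the signed-rank statistic uses the ranks of the absolute differences, while the t statistic uses their mean and spread. Textbooks commonly state that the signed-rank test also assumes the differences are symmetric about their center, which is the kind of assumption the paper scrutinizes.

```python
import math
from statistics import mean, stdev


def signed_rank_statistic(diffs):
    """Return (W_plus, W_minus): rank sums of positive and negative differences.

    Zero differences are dropped and tied absolute values share an average rank.
    """
    nonzero = [d for d in diffs if d != 0]
    # Indices of the differences, ordered by absolute value.
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(order):
        # Extend j over a run of tied absolute values.
        j = i
        while j + 1 < len(order) and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, nonzero) if d < 0)
    return w_plus, w_minus


def paired_t_statistic(diffs):
    """Return the paired t statistic: mean difference over its standard error."""
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


# Hypothetical per-topic differences (system A score minus system B score).
diffs = [0.10, -0.02, 0.05, 0.30, -0.01, 0.00, 0.08, -0.04]

w_plus, w_minus = signed_rank_statistic(diffs)
t_stat = paired_t_statistic(diffs)
print(f"W+ = {w_plus}, W- = {w_minus}, t = {t_stat:.3f}")
```

In this toy data the large 0.30 difference contributes only its rank (7) to the signed-rank statistic, while it carries its full magnitude into the t statistic. The two tests are therefore sensitive to different features of the same data.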
From the abstract
In benchmarking of Information Retrieval systems, the Wilcoxon signed-rank test is often treated as a safer alternative to the t-test. This belief is fueled by textbooks and recommendations that portray Wilcoxon as the proper non-parametric alternative because metric scores are not normally distributed. We argue that this narrative is misleading and harmful. A careful review of Statistics textbooks reveals inconsistencies and omissions in how the assumptions underlying these tests are presented,