Introduces a 'clone-robust' mechanism (YRWR) to prevent AI model producers from strategically gaming the rankings in crowd-sourced arenas like Chatbot Arena.
March 31, 2026
Original Paper
Strategic Candidacy in Generative AI Arenas
arXiv · 2603.26891
The Takeaway
As LMArena/Chatbot Arena scores become the de facto metric for model success, the risk of 'model cloning' (submitting minor variants to inflate rank) increases. This paper provides the formal game-theoretic correction needed to keep crowdsourced AI evaluation honest and statistically reliable as the field scales.
From the abstract
AI arenas, which rank generative models from pairwise preferences of users, are a popular method for measuring the relative performance of models in the course of their organic use. Because rankings are computed from noisy preferences, there is a concern that model producers can exploit this randomness by submitting many models (e.g., multiple variants of essentially the same model) and thereby artificially improve the rank of their top models. This can lead to degradations in the quality […]
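A minimal sketch (not from the paper, which gives the formal game-theoretic treatment) of why noisy rankings reward cloning: if each submission's observed score is the true skill plus noise, a producer who submits several near-identical clones and keeps the best one gets an upward-biased estimate. The Gaussian noise model and the helper names here are illustrative assumptions.

```python
import random

random.seed(0)

def observed_score(true_skill: float, noise: float) -> float:
    # One noisy arena estimate of a model's skill; Gaussian noise
    # stands in for finite-sample error in preference-based rankings.
    return random.gauss(true_skill, noise)

def best_of_clones(true_skill: float, noise: float, n_clones: int) -> float:
    # A producer submits n_clones near-identical variants and is
    # credited with the best observed score among them.
    return max(observed_score(true_skill, noise) for _ in range(n_clones))

TRIALS, NOISE, SKILL = 20_000, 1.0, 0.0
single = sum(best_of_clones(SKILL, NOISE, 1) for _ in range(TRIALS)) / TRIALS
cloned = sum(best_of_clones(SKILL, NOISE, 5) for _ in range(TRIALS)) / TRIALS

print(f"single submission, mean observed score: {single:+.3f}")
print(f"best of 5 clones,  mean observed score: {cloned:+.3f}")
```

Even though every clone has identical true skill, the best-of-five score is systematically higher than a single honest submission, which is exactly the randomness-exploiting strategy a clone-robust mechanism must neutralize.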