AI & ML Paradigm Shift

AdaRubric generates task-specific evaluation rubrics on the fly, significantly outperforming static rubrics in correlation with human judgments and in agent-training outcomes.

March 24, 2026

Original Paper

AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation

Liang Ding

arXiv · 2603.21362

The Takeaway

AdaRubric moves LLM-as-a-judge evaluation from rigid, error-prone static rubrics to adaptive, dimension-aware feedback. Agents trained on its filtered preference pairs showed significant performance gains (+6.8-8.5 pp) and faster RL convergence, without manual rubric engineering.

From the abstract

LLM-as-Judge evaluation fails on agent tasks because a fixed rubric cannot capture what matters for a given task: code debugging demands Correctness and Error Handling; web navigation demands Goal Alignment and Action Efficiency. We present ADARUBRIC, which closes this gap by generating task-specific evaluation rubrics on the fly from task descriptions, scoring trajectories step-by-step with confidence-weighted per-dimension feedback, and filtering preference pairs with the novel DimensionAwareFilter.
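The abstract's pipeline can be sketched in a few lines. The paper does not publish the exact scoring or filtering equations, so everything below is a minimal illustration under assumptions: the `DimensionScore` type, the confidence-weighted mean in `aggregate`, and the `margin` rule in `dimension_aware_filter` are hypothetical stand-ins for AdaRubric's actual per-dimension scoring and DimensionAwareFilter logic.

```python
from dataclasses import dataclass

@dataclass
class DimensionScore:
    dimension: str     # rubric dimension, e.g. "Correctness" or "Goal Alignment"
    score: float       # judge score for this dimension, assumed in [0, 1]
    confidence: float  # judge's self-reported confidence, assumed in [0, 1]

def aggregate(scores: list[DimensionScore]) -> float:
    """Confidence-weighted mean over per-dimension feedback (assumed form)."""
    total_weight = sum(s.confidence for s in scores)
    return sum(s.score * s.confidence for s in scores) / total_weight

def dimension_aware_filter(traj_a, traj_b, margin=0.1):
    """Hypothetical filter: keep a preference pair only when the
    aggregate winner also beats the loser by `margin` on every
    rubric dimension, discarding pairs with mixed dimension signals."""
    winner, loser = (traj_a, traj_b) if aggregate(traj_a) >= aggregate(traj_b) else (traj_b, traj_a)
    loser_by_dim = {s.dimension: s.score for s in loser}
    consistent = all(s.score >= loser_by_dim[s.dimension] + margin for s in winner)
    return (winner, loser) if consistent else None
```

For example, a trajectory that dominates another on both Correctness and Goal Alignment would survive the filter as a training pair, while a pair whose winner trails on one dimension would be dropped despite a higher aggregate score.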