A production-grade framework that converts LLM/RAG evaluation into a deployment decision workflow using Pareto frontiers and CI gates.
March 31, 2026
Original Paper
LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications
arXiv · 2603.27355
The Takeaway
The harness moves evaluation away from static scores to operationally grounded metrics (latency, cost, groundedness), letting teams programmatically block 'unsafe' prompt or model variants before they reach production.
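As a rough illustration of the gating idea (not the paper's actual API), a CI step can compare a candidate variant's measured metrics against per-metric thresholds and fail the build on any violation. The metric names and bounds below are invented for the sketch:

```python
# Hypothetical CI quality gate: block a candidate prompt/model variant
# when any operational metric violates its threshold. Metric names and
# thresholds are illustrative, not taken from the paper.
THRESHOLDS = {
    "groundedness": ("min", 0.85),    # answers must be supported by retrieved context
    "p95_latency_ms": ("max", 1200),  # tail-latency budget
    "cost_per_call_usd": ("max", 0.01),
}

def gate(metrics: dict) -> tuple[bool, list[str]]:
    """Return (passed, violations) for a candidate's measured metrics."""
    violations = []
    for name, (kind, bound) in THRESHOLDS.items():
        value = metrics[name]
        if kind == "min" and value < bound:
            violations.append(f"{name}={value} below minimum {bound}")
        elif kind == "max" and value > bound:
            violations.append(f"{name}={value} above maximum {bound}")
    return (not violations, violations)

# A variant with weak groundedness fails the gate even though it is
# fast and cheap; the violation list explains why.
passed, why = gate({"groundedness": 0.80,
                    "p95_latency_ms": 900,
                    "cost_per_call_usd": 0.004})
```

In a pipeline, a non-empty violation list would translate into a non-zero exit code, which is what actually blocks the deployment.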
From the abstract
We present a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow. The system combines automated benchmarks, OpenTelemetry observability, and CI quality gates under a minimal API contract, then aggregates workflow success, policy compliance, groundedness, retrieval hit rate, cost, and p95 latency into scenario-weighted readiness scores with Pareto frontiers. We evaluate the harness on ticket-routing workflows and BEIR grounding tasks (SciFact a […]
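The aggregation step can be sketched in miniature; the weighting scheme and the two-objective (quality vs. cost) frontier below are assumptions for illustration, not the paper's exact formulas:

```python
# Illustrative sketch: scenario-weighted readiness scoring and a
# Pareto frontier over (quality, cost), where higher quality and
# lower cost are both preferred. All numbers are made up.

def readiness(scores: dict, weights: dict) -> float:
    """Scenario-weighted readiness: weighted mean of per-scenario scores."""
    total = sum(weights.values())
    return sum(weights[s] * scores[s] for s in scores) / total

def pareto_frontier(candidates):
    """Keep candidates that no other candidate dominates
    (at least as good on both axes, strictly better on one)."""
    frontier = []
    for name, q, c in candidates:
        dominated = any(q2 >= q and c2 <= c and (q2 > q or c2 < c)
                        for _, q2, c2 in candidates)
        if not dominated:
            frontier.append(name)
    return frontier

# (name, readiness score, cost per call): C is dominated by B, which
# is both higher-quality and cheaper, so only A and B survive.
cands = [("A", 0.90, 0.010), ("B", 0.85, 0.004), ("C", 0.80, 0.008)]
frontier = pareto_frontier(cands)
```

Presenting the frontier rather than a single "best" variant lets teams pick a deployment point that matches their quality/cost trade-off.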