A production-grade framework that converts LLM/RAG evaluation into a deployment decision workflow using Pareto frontiers and CI gates.
March 31, 2026
Original Paper
LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications
arXiv · 2603.27355
The Takeaway
The harness moves evaluation away from static scores to operationally grounded metrics (latency, cost, groundedness), letting teams programmatically block 'unsafe' prompt or model variants before they reach production.
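As a rough illustration of the gating idea (not the paper's actual API), a CI step can compare a candidate variant's measured metrics against per-metric thresholds and fail the build on any violation. The metric names and bounds below are invented for the sketch:

```python
# Hypothetical CI quality gate: block a candidate prompt/model variant
# when any operational metric violates its threshold. Metric names and
# thresholds are illustrative, not taken from the paper.
THRESHOLDS = {
    "groundedness": ("min", 0.85),    # answers must be supported by retrieved context
    "p95_latency_ms": ("max", 1200),  # tail-latency budget
    "cost_per_call_usd": ("max", 0.01),
}

def gate(metrics: dict) -> tuple[bool, list[str]]:
    """Return (passed, violations) for a candidate's measured metrics."""
    violations = []
    for name, (kind, bound) in THRESHOLDS.items():
        value = metrics[name]
        if kind == "min" and value < bound:
            violations.append(f"{name}={value} below minimum {bound}")
        elif kind == "max" and value > bound:
            violations.append(f"{name}={value} above maximum {bound}")
    return (not violations, violations)

# A variant with weak groundedness fails the gate even though it is
# fast and cheap; the violation list explains why.
passed, why = gate({"groundedness": 0.80,
                    "p95_latency_ms": 900,
                    "cost_per_call_usd": 0.004})
```

In a pipeline, a non-empty violation list would translate into a non-zero exit code, which is what actually blocks the deployment.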
From the abstract
We present a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow. The system combines automated benchmarks, OpenTelemetry observability, and CI quality gates under a minimal API contract, then aggregates workflow success, policy compliance, groundedness, retrieval hit rate, cost, and p95 latency into scenario-weighted readiness scores with Pareto frontiers. We evaluate the harness on ticket-routing workflows and BEIR grounding tasks (SciFact a […]
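The aggregation step can be sketched in miniature; the weighting scheme and the two-objective (quality vs. cost) frontier below are assumptions for illustration, not the paper's exact formulas:

```python
# Illustrative sketch: scenario-weighted readiness scoring and a
# Pareto frontier over (quality, cost), where higher quality and
# lower cost are both preferred. All numbers are made up.

def readiness(scores: dict, weights: dict) -> float:
    """Scenario-weighted readiness: weighted mean of per-scenario scores."""
    total = sum(weights.values())
    return sum(weights[s] * scores[s] for s in scores) / total

def pareto_frontier(candidates):
    """Keep candidates that no other candidate dominates
    (at least as good on both axes, strictly better on one)."""
    frontier = []
    for name, q, c in candidates:
        dominated = any(q2 >= q and c2 <= c and (q2 > q or c2 < c)
                        for _, q2, c2 in candidates)
        if not dominated:
            frontier.append(name)
    return frontier

# (name, readiness score, cost per call): C is dominated by B, which
# is both higher-quality and cheaper, so only A and B survive.
cands = [("A", 0.90, 0.010), ("B", 0.85, 0.004), ("C", 0.80, 0.008)]
frontier = pareto_frontier(cands)
```

Presenting the frontier rather than a single "best" variant lets teams pick a deployment point that matches their quality/cost trade-off.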