Evaluation

How to measure retrieval quality, answer quality, and safety, and how to detect regressions before they reach production.

LAST_UPDATED: 2025-05

Evaluation determines whether your AI system is performing as expected. Without it, "looks correct" becomes the only quality signal, and that signal often fails silently. A governed AI system requires a defined evaluation harness that runs on index changes, prompt updates, and model upgrades, as well as on a regular schedule.

Key Takeaways

  • Retrieval evaluation is more actionable than answer evaluation alone.
  • Adversarial and edge cases are as important as golden happy-path cases.
  • Evaluation datasets drift; maintain them as living assets.

Evaluation pipeline

Retrieval, answer quality, and safety are scored independently and compared against a baseline.

Evaluation dataset composition

A good dataset covers the full range of expected inputs, not just the cases where the system should succeed.

Golden cases

Representative queries with verified correct answers and expected source citations.

40–60% of dataset

Basis for regression comparison across changes

Adversarial cases

Queries designed to elicit hallucination, policy bypass, prompt injection, or unsafe tool use.

20–30% of dataset

Essential for safety governance; often the most revealing tests

Edge cases

Ambiguous queries, out-of-scope questions, queries with no relevant sources, multi-domain queries.

20–30% of dataset

Tests graceful degradation and refusal behaviour
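
The composition guidance above can be checked mechanically. The following is a minimal sketch, assuming each case carries a category label; the `EvalCase` fields and the `composition_report` helper are hypothetical names introduced for illustration, and the target ranges are the ones listed above.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

# Target share of the dataset per category, taken from the ranges above.
TARGET_SHARES = {
    "golden": (0.40, 0.60),
    "adversarial": (0.20, 0.30),
    "edge": (0.20, 0.30),
}


@dataclass
class EvalCase:
    """One evaluation case (hypothetical schema for illustration)."""
    case_id: str
    category: str  # "golden", "adversarial", or "edge"
    query: str
    expected_answer: Optional[str] = None  # verified answer for golden cases
    expected_sources: List[str] = field(default_factory=list)


def composition_report(cases):
    """Return {category: (share, within_target)} for a list of EvalCase."""
    counts = Counter(case.category for case in cases)
    total = len(cases)
    report = {}
    for category, (low, high) in TARGET_SHARES.items():
        share = counts.get(category, 0) / total if total else 0.0
        report[category] = (share, low <= share <= high)
    return report
```

A report like this could run whenever the dataset changes, so that a dataset drifting towards golden-only cases is noticed early.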

Retrieval quality metrics

Retrieval quality should be measured independently from answer quality. A good retrieval result can compensate for a weak model, but a strong model cannot compensate for a poor retrieval result.

Metric | What it measures | When to use
Precision@k | Fraction of top-k results that are relevant | When relevance of top results matters most
Recall@k | Fraction of relevant sources that appear in top-k | When missing a relevant source is a critical failure
MRR (Mean Reciprocal Rank) | How highly the first relevant result is ranked | When the top result disproportionately influences the answer
nDCG (Normalised Discounted Cumulative Gain) | Ranking quality across all retrieved results | When the full ranked list matters, not just the top result
Freshness | Age of retrieved sources vs. defined SLA | When answer currency is a quality requirement
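
The ranking metrics in the table can be computed from a ranked list of retrieved document IDs and a set of relevant IDs. This is a minimal sketch, assuming binary relevance (a source is either relevant or not) and no duplicate IDs in the ranked list; the function names are illustrative and not taken from any particular library.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Mean of reciprocal ranks over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(retrieved, relevant, k):
    """nDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

For example, with `retrieved = ["a", "b", "c", "d"]` and `relevant = {"a", "c", "x"}`, `precision_at_k(retrieved, relevant, 2)` is 0.5 and `recall_at_k(retrieved, relevant, 4)` is 2/3. Graded relevance would need a gain value per document rather than a set.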

Answer quality and safety metrics

Answer quality

  • Faithfulness: does the answer only use information from retrieved context?
  • Correctness: does it match the expected answer for golden cases?
  • Citation rate: fraction of answers that cite a source
  • Refusal rate: fraction of out-of-scope queries correctly refused
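
Faithfulness and correctness typically need a human or model-based judge, so they are not sketched here. Citation rate and refusal rate are simple ratios. The following is a minimal sketch, assuming each evaluated response is recorded with a scope label, a refusal flag, and its cited sources; the `AnswerRecord` schema is hypothetical, and counting citation rate over non-refused answers only is an assumption rather than something the definitions above specify.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnswerRecord:
    """One evaluated response (hypothetical schema for illustration)."""
    in_scope: bool  # False for out-of-scope queries
    refused: bool   # True if the system declined to answer
    cited_sources: List[str] = field(default_factory=list)


def citation_rate(records):
    """Fraction of non-refused answers that cite at least one source."""
    answered = [r for r in records if not r.refused]
    if not answered:
        return 0.0
    return sum(1 for r in answered if r.cited_sources) / len(answered)


def refusal_rate(records):
    """Fraction of out-of-scope queries that were correctly refused."""
    out_of_scope = [r for r in records if not r.in_scope]
    if not out_of_scope:
        return 0.0
    return sum(1 for r in out_of_scope if r.refused) / len(out_of_scope)
```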

Safety checks

  • Policy bypass rate: adversarial queries that elicited a policy-violating response
  • Prompt injection success rate: injections that caused unsafe behaviour
  • PII in response: responses containing identifiable personal data
  • Unauthorised tool calls: tool invocations outside the granted allow-list
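
Some of these checks can be automated with simple rules. The sketch below assumes tool calls are logged by name and that adversarial results are recorded as booleans. The email pattern is a deliberately narrow illustration of a PII check and would not catch names, addresses, or identifiers; all function names are hypothetical.

```python
import re

# Illustrative pattern only: matches simple email addresses, not PII in general.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def unauthorised_tool_calls(tool_calls, allow_list):
    """Return the tool names that were invoked but are not on the allow-list."""
    return [name for name in tool_calls if name not in allow_list]


def contains_email(text):
    """True if the response text contains something shaped like an email address."""
    return EMAIL_PATTERN.search(text) is not None


def rate(flags):
    """Fraction of True values, e.g. adversarial cases that bypassed policy."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0
```

For example, `rate([True, False, False, False])` gives a policy bypass rate of 0.25 for four adversarial cases where one elicited a violating response.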

Regression testing

Regression testing compares current evaluation scores against a stored baseline. Any change that degrades a metric beyond a defined threshold should block or flag the deployment.

Triggers for regression run

  • Index rebuild or source update
  • System prompt or prompt template change
  • Model version upgrade
  • Retrieval policy change
  • Weekly scheduled run (baseline drift detection)

Regression thresholds (examples)

faithfulness: drop > 5% → block
citation_rate: drop > 10% → warn
policy_bypass: any increase → block
pii_in_response: any occurrence → block
precision@5: drop > 8% → warn
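
The example thresholds above can be turned into a deployment gate. This is a minimal sketch, assuming "drop" means a relative drop against the stored baseline (the thresholds do not say whether the percentages are relative or absolute) and that both score sets are plain dictionaries; the rule names and the `gate` function are hypothetical.

```python
BLOCK, WARN, PASS = "block", "warn", "pass"

# metric -> (rule, limit, action); values mirror the example thresholds above.
RULES = {
    "faithfulness": ("max_drop", 0.05, BLOCK),
    "citation_rate": ("max_drop", 0.10, WARN),
    "policy_bypass": ("no_increase", None, BLOCK),
    "pii_in_response": ("must_be_zero", None, BLOCK),
    "precision@5": ("max_drop", 0.08, WARN),
}


def check_metric(name, baseline, current):
    """Return the action triggered for one metric, or PASS."""
    rule, limit, action = RULES[name]
    if rule == "max_drop":
        if baseline <= 0:
            return PASS  # no meaningful baseline to compare against
        # Assumption: the threshold is a relative drop against the baseline.
        violated = (baseline - current) / baseline > limit
    elif rule == "no_increase":
        violated = current > baseline
    else:  # "must_be_zero"
        violated = current > 0
    return action if violated else PASS


def gate(baseline, current):
    """Return (overall_action, per_metric_actions) for a regression run."""
    results = {
        name: check_metric(name, baseline[name], current[name]) for name in RULES
    }
    if BLOCK in results.values():
        overall = BLOCK
    elif WARN in results.values():
        overall = WARN
    else:
        overall = PASS
    return overall, results
```

A CI job could call `gate` after each trigger and fail the pipeline when the overall action is `"block"`, while surfacing `"warn"` results for review.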

Evaluation dataset maintenance

A dataset that is built once at launch and never updated is insufficient, because both the domain and the system evolve. Treat evaluation datasets as living assets with a structured maintenance process.

  • Add new cases when human reviewers correct the system, as corrections indicate gaps.
  • Review and update expected answers whenever source content changes significantly.
  • Add new adversarial cases after every security review or red-team exercise.
  • Assign a dataset owner and perform quarterly reviews.

Best practice

Store the evaluation dataset in version control alongside the system prompts. A dataset change is as significant as a code change: version it, review it, and track when it was last validated against domain owner expectations.

Failure modes

  • Evaluation only runs at launch; regressions from index or prompt changes go undetected.
  • Dataset contains only golden happy-path cases; safety and edge case failures are invisible.
  • Metrics are tracked but thresholds are not defined; no deployment gates exist.
  • Expected answers become stale after domain content changes.
  • Evaluation is treated as a one-person task with no backup; it is skipped under time pressure.

Checklist

  • Evaluation dataset includes golden, adversarial, and edge case categories.
  • Retrieval and answer quality are scored independently.
  • Regression thresholds are defined and enforced on index/prompt/model changes.
  • Safety metrics (policy bypass, PII in response) block deployment on any regression.
  • Dataset has a named owner and a quarterly review cadence.