Evaluation determines whether your AI system is performing as expected. Without it, "looks correct" becomes the only quality signal, which often fails silently. A governed AI system requires a defined evaluation harness that runs on index changes, prompt updates, model upgrades, and a regular schedule.
Key Takeaways
- • Retrieval evaluation is more actionable than answer evaluation alone.
- • Adversarial and edge cases are as important as golden happy-path cases.
- • Evaluation datasets drift; maintain them as living assets.
Evaluation pipeline
Retrieval, answer quality, and safety are scored independently and compared against a baseline.
Evaluation dataset composition
A good dataset covers the full range of expected inputs, not just the cases where the system should succeed.
Golden cases
Representative queries with verified correct answers and expected source citations.
40–60% of dataset
Basis for regression comparison across changes
Adversarial cases
Queries designed to elicit hallucination, policy bypass, prompt injection, or unsafe tool use.
20–30% of dataset
Essential for safety governance; often the most revealing tests
Edge cases
Ambiguous queries, out-of-scope questions, queries with no relevant sources, multi-domain queries.
20–30% of dataset
Tests graceful degradation and refusal behaviour
Retrieval quality metrics
Retrieval quality should be measured independently from answer quality. A good retrieval result can compensate for a weak model; a poor retrieval result cannot be compensated by a strong model.
| Metric | What it measures | When to use |
|---|---|---|
| Precision@k | Fraction of top-k results that are relevant | When relevance of top results matters most |
| Recall@k | Fraction of relevant sources that appear in top-k | When missing a relevant source is a critical failure |
| MRR (Mean Reciprocal Rank) | How highly the first relevant result is ranked | When the top result disproportionately influences the answer |
| nDCG (Normalised Discounted Cumulative Gain) | Ranking quality across all retrieved results | When the full ranked list matters, not just the top result |
| Freshness | Age of retrieved sources vs. defined SLA | When answer currency is a quality requirement |
Answer quality and safety metrics
Answer quality
- Faithfulness: does the answer only use information from retrieved context?
- Correctness: does it match the expected answer for golden cases?
- Citation rate: fraction of answers that cite a source
- Refusal rate: fraction of out-of-scope queries correctly refused
Safety checks
- Policy bypass rate: adversarial queries that elicited a policy-violating response
- Prompt injection success rate: injections that caused unsafe behaviour
- PII in response: responses containing identifiable personal data
- Unauthorised tool calls: tool invocations outside the granted allow-list
Regression testing
Regression testing compares current evaluation scores against a stored baseline. Any change that degrades a metric beyond a defined threshold should block or flag the deployment.
Triggers for regression run
- → Index rebuild or source update
- → System prompt or prompt template change
- → Model version upgrade
- → Retrieval policy change
- → Weekly scheduled run (baseline drift detection)
Regression thresholds (examples)
faithfulness: drop > 5% → block
citation_rate: drop > 10% → warn
policy_bypass: any increase → block
pii_in_response: any occurrence → block
precision@5: drop > 8% → warn
Evaluation dataset maintenance
Maintaining a static dataset at launch is insufficient, as the domain and system will evolve. Treat evaluation datasets as living assets with a structured maintenance process.
- → Add new cases when human reviewers correct the system, as corrections indicate gaps.
- → Review and update expected answers whenever source content changes significantly.
- → Include adversarial cases after every security review or red-team exercise.
- → Assign a dataset owner and perform quarterly reviews.
Failure modes
- ! Evaluation only runs at launch; regressions from index or prompt changes go undetected.
- ! Dataset contains only golden happy-path cases; safety and edge case failures are invisible.
- ! Metrics are tracked but thresholds are not defined; no deployment gates exist.
- ! Expected answers become stale after domain content changes.
- ! Evaluation is treated as a one-person task with no backup; it is skipped under time pressure.
Checklist
- □ Evaluation dataset includes golden, adversarial, and edge case categories.
- □ Retrieval and answer quality are scored independently.
- □ Regression thresholds are defined and enforced on index/prompt/model changes.
- □ Safety metrics (policy bypass, PII in response) block deployment on any regression.
- □ Dataset has a named owner and a quarterly review cadence.