AI Evals: What I learned shipping production-grade billing systems
Part 1: Evals 101 - What Are Evaluations?
The Core Concept
When you build an AI system that answers questions using documents (a RAG system), you need to know: does it work?
An evaluation is a systematic way to measure your system’s performance using metrics. Instead of manually checking a few examples and saying “looks good,” you:
- Create a test dataset of questions with known correct answers
- Run your system on those questions
- Measure how often it gets the right answer
- Track this score over time
Example:
You build a system to answer questions about your company’s HR policies.
Manual testing: “Hey, let me try 3 questions. Seems fine!”
Evaluation-based testing:
- Test dataset: 100 HR policy questions
- Your system: 73 correct, 27 incorrect
- Accuracy score: 73%
- After you improve chunking: 84 correct
- New accuracy: 84%
- Confidence: The change worked
Why This Matters
Before you have evals:
- You don’t know if changes make things better or worse
- You can’t catch regressions (new code breaking old functionality)
- You can’t compare different approaches objectively
After you have evals:
- Every code change has a score
- You know exactly what improved and what got worse
- You can make data-driven decisions
The Two-Part System
Every RAG system has two components to evaluate:
1. Retrieval: Did you find the right documents?
- Input: User question
- Output: List of relevant documents
- Metric: Did the most relevant document appear in your top results?
2. Generation: Did you create the right answer?
- Input: Question + Retrieved documents
- Output: Generated answer
- Metric: Is the answer correct and supported by the documents?
If retrieval fails, generation can’t succeed. If you retrieve the wrong documents, even the best language model can’t give the right answer.
Part 2: Basic Evaluation Metrics
Retrieval Metrics
These measure how well your system finds relevant documents.
Mean Reciprocal Rank (MRR)
What it measures: How quickly users find what they need.
How it works:
- For each question, find the rank (position) of the first relevant document
- Calculate reciprocal rank = 1 / rank
- Average across all questions
Example:
| Question | Relevant Doc Position | Reciprocal Rank |
|---|---|---|
| Q1 | 1st result | 1/1 = 1.0 |
| Q2 | 3rd result | 1/3 = 0.33 |
| Q3 | Not in top 10 | 0 |
MRR = (1.0 + 0.33 + 0) / 3 = 0.44
What this means: On average, the first relevant result appears around position 2-3.
Good score: MRR > 0.7 means most relevant documents appear in top 2
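A minimal sketch of the calculation, using the ranks from the example table above (None marks a query where no relevant document was found):

def mean_reciprocal_rank(first_relevant_ranks):
    """first_relevant_ranks: rank of the first relevant doc per query, or None."""
    reciprocal_ranks = [1 / r if r is not None else 0 for r in first_relevant_ranks]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Ranks from the example: Q1 -> 1, Q2 -> 3, Q3 -> not found
print(mean_reciprocal_rank([1, 3, None]))  # ~0.44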
Recall@k
What it measures: What percentage of relevant documents appear in your top k results?
Example:
- Question: “What is our parental leave policy?”
- Relevant documents in the entire database: 3
- Your system returns the top 10 results: 2 of those 3 relevant docs are in the top 10
Recall@10 = 2/3 = 0.67 (you found 67% of relevant documents)
Good score: Recall@10 > 0.8 means you’re finding most relevant information
Precision@k
What it measures: What percentage of your top k results are actually relevant?
Example:
- Your system returns the top 10 results
- 6 of those 10 are actually relevant to the question
Precision@10 = 6/10 = 0.6 (60% of what you returned was useful)
Good score: Precision@5 > 0.7 means most of what you return is relevant
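Both metrics take only a few lines; a minimal sketch, assuming retrieved_ids is the ranked list your system returned and relevant_ids is the ground-truth set:

def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of all relevant docs that appear in the top k
    found = set(retrieved_ids[:k]) & set(relevant_ids)
    return len(found) / len(relevant_ids)

def precision_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the top k results that are relevant
    relevant = set(relevant_ids)
    return sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant) / k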
Generation Metrics
These measure the quality of the generated answer.
Correctness
What it measures: Did the answer solve the user’s question?
How to measure:
Option 1 - Manual: Human reads the answer and marks it correct/incorrect
Option 2 - LLM-as-judge: Use another AI model to grade the answer
def evaluate_answer(question, generated_answer, expected_answer):
    """
    Compare generated answer to expected answer
    Returns: CORRECT, PARTIALLY_CORRECT, or INCORRECT
    """
    # Use an LLM to judge
    prompt = f"""
    Question: {question}
    Expected Answer: {expected_answer}
    Generated Answer: {generated_answer}

    Is the generated answer correct?
    Reply with: CORRECT, PARTIALLY_CORRECT, or INCORRECT
    """
    return llm_judge(prompt)
Faithfulness (Groundedness)
What it measures: Is every claim in the answer supported by the retrieved documents?
This catches hallucinations - when the model makes up information not in the source documents.
Example:
Retrieved document: “Our company offers 12 weeks of parental leave.”
Generated answer: “The company offers 12 weeks of parental leave, which can be extended to 16 weeks with manager approval.”
Faithfulness check: ❌ FAILED
- “12 weeks” ✓ supported by document
- “extended to 16 weeks with manager approval” ✗ NOT in document (hallucination)
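A common way to automate this check is another LLM-as-judge pass that grades every claim against the retrieved text; a minimal sketch, reusing the llm_judge placeholder from earlier:

def check_faithfulness(retrieved_text, generated_answer):
    """Returns FAITHFUL, or UNFAITHFUL plus the unsupported claims."""
    prompt = f"""
    Source documents:
    {retrieved_text}

    Answer to check:
    {generated_answer}

    List each factual claim in the answer. For every claim, state whether it is
    supported by the source documents.
    Reply with FAITHFUL if all claims are supported, otherwise UNFAITHFUL and
    list the unsupported claims.
    """
    return llm_judge(prompt)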
Answer Completeness
What it measures: Did the answer include all important information?
Example:
Expected answer should contain: [12 weeks leave, applies to all employees, must be taken within first year]
Generated answer mentions: [12 weeks leave, applies to all employees]
Completeness = 2/3 = 0.67 (missing one key fact)
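A simple way to score this is to list the key facts for each test question and check them one by one; a rough sketch, where contains_fact is an assumed helper (an exact keyword match or another LLM-judge call):

def completeness(generated_answer, key_facts):
    # key_facts: the facts the answer is expected to mention
    covered = sum(1 for fact in key_facts if contains_fact(generated_answer, fact))
    return covered / len(key_facts)

# Example above: 2 of 3 key facts mentioned -> 0.67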
Part 3: Building Your First Evaluation
Step 1: Create a Test Dataset
Start with 20-30 real questions that users have asked (or will ask).
Example for HR Policy Q&A:
[
{
"id": "hr_001",
"question": "How many weeks of parental leave do we offer?",
"expected_answer": "12 weeks of paid parental leave",
"relevant_doc_ids": ["policies/parental_leave.pdf"],
"category": "benefits"
},
{
"id": "hr_002",
"question": "What is the remote work policy for engineers?",
"expected_answer": "Engineers can work remotely up to 3 days per week",
"relevant_doc_ids": ["policies/remote_work.pdf"],
"category": "work_arrangements"
}
// ... 18 more questions
]
Step 2: Run Your System
For each question in your test dataset:
- Run retrieval to get documents
- Generate an answer from those documents
- Record what happened
results = []
for item in test_dataset:
    # Retrieval
    retrieved_docs = retrieval_system.search(item["question"], top_k=5)

    # Generation
    answer = llm.generate(
        question=item["question"],
        context=retrieved_docs
    )

    # Record
    results.append({
        "question_id": item["id"],
        "retrieved_docs": retrieved_docs,
        "generated_answer": answer,
        "expected_answer": item["expected_answer"]
    })
Step 3: Calculate Metrics
# Retrieval metrics
def calculate_mrr(results, test_dataset):
    # Index the test items by id so results can look them up
    dataset_by_id = {item["id"]: item for item in test_dataset}
    reciprocal_ranks = []
    for result in results:
        # Find position of first relevant doc
        relevant_doc_ids = dataset_by_id[result["question_id"]]["relevant_doc_ids"]
        for position, doc in enumerate(result["retrieved_docs"], start=1):
            if doc.id in relevant_doc_ids:
                reciprocal_ranks.append(1 / position)
                break
        else:
            reciprocal_ranks.append(0)  # No relevant doc found
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Generation metrics
def calculate_correctness(results):
    correct = 0
    for result in results:
        if llm_judge(result["generated_answer"], result["expected_answer"]) == "CORRECT":
            correct += 1
    return correct / len(results)
Step 4: Track Over Time
# Save results
evaluation_run = {
    "timestamp": "2024-01-15",
    "version": "v1.0",
    "metrics": {
        "mrr": 0.65,
        "recall@5": 0.78,
        "precision@5": 0.82,
        "correctness": 0.73,
        "faithfulness": 0.85
    }
}

# Compare to previous version
previous_run = load_previous_run("v0.9")
print(f"MRR improved from {previous_run['metrics']['mrr']} to {evaluation_run['metrics']['mrr']}")
Part 4: Detailed Case Study - Exa.ai People Search
Source: https://exa.ai/blog/people-search-benchmark
The Problem
Exa built a search engine to find people based on their roles, skills, and locations. For example:
- “Senior software engineer in San Francisco”
- “VP of Product at Figma”
- “Director of sales operations in Chicago SaaS companies”
They indexed 1 billion people from LinkedIn profiles, company websites, and conference speaker bios.
Challenge: How do you know if your people search actually works?
You can’t manually test 1 billion profiles. You need systematic evaluation.
Step 1: Understanding User Queries
Exa analyzed 10,000 historical search queries and found 3 patterns:
Pattern 1: Role-based search (most common)
- “VP of product at Figma”
- “CTO at Stripe”
- Someone looking for a specific person at a specific company
- Ground truth: That person’s actual profile exists or doesn’t
Pattern 2: Skill/role discovery
- “Senior backend engineers in Austin”
- “Marketing directors with B2B SaaS experience”
- Finding anyone who matches multiple criteria
- Ground truth: Multiple correct answers, must check if results match ALL criteria
Pattern 3: Individual lookup
- “Sarah Chen at Microsoft”
- Name + company affiliation
- Ground truth: That specific person’s profile
Step 2: Building the Evaluation Dataset
They created 1,400 test queries stratified across job functions:
| Job Function | Number of Queries | Examples |
|---|---|---|
| Engineering | 365 | “Senior DevOps engineer”, “Security architect” |
| Marketing | 180 | “Growth marketing lead”, “Brand manager” |
| Sales | 160 | “Enterprise account executive”, “SDR manager” |
| Product | 90 | “Senior product manager”, “Head of product” |
| Finance | 85 | “FP&A analyst”, “Controller” |
| HR/People | 100 | “People operations manager”, “Recruiter” |
| Legal | 70 | “Corporate counsel”, “Compliance officer” |
| Design | 100 | “Product designer”, “Creative director” |
| Data/Analytics | 70 | “Data scientist”, “Analytics engineer” |
| Trust & Safety | 80 | “Trust & Safety specialist”, “Policy analyst” |
Example query structure:
{
"query_id": "eng_042",
"query_text": "Senior software engineer in San Francisco",
"category": "engineering",
"query_type": "role_discovery",
"criteria": {
"role": "Senior Software Engineer",
"location": "San Francisco",
"seniority": "Senior"
}
}
Step 3: Choosing Evaluation Methods
Exa used different metrics for different query types:
For Pattern 1 & 3 (Specific person lookup):
Use standard retrieval metrics (Recall@k, MRR, Precision)
# Example evaluation
query = "VP of Product at Figma"
ground_truth_profile = "figma.com/team/jeff-weinstein"  # The actual VP

results = search_engine.search(query, top_k=10)

# Check if correct profile is in results
if ground_truth_profile in results:
    position = results.index(ground_truth_profile) + 1
    reciprocal_rank = 1 / position
else:
    reciprocal_rank = 0
For Pattern 2 (Discovery queries):
Use LLM-as-a-judge because there’s no single correct answer.
For each result, check if it matches ALL criteria:
def evaluate_person_result(query_criteria, person_profile):
    """
    Check if person matches all criteria in the query
    Returns: 1 if match, 0 if no match
    """
    prompt = f"""
    Query criteria:
    - Role: {query_criteria['role']}
    - Location: {query_criteria['location']}
    - Seniority: {query_criteria['seniority']}

    Person profile:
    {person_profile}

    Does this person match ALL criteria?
    Check:
    1. Is their role a match? (e.g., "Software Engineer" matches "Senior Software Engineer")
    2. Is their location a match? (must be exact city)
    3. Is their seniority level correct? (Senior vs Junior vs Staff)

    Reply with: MATCH or NO_MATCH
    Explain which criteria failed if NO_MATCH.
    """
    result = llm_judge(prompt)
    return 1 if result == "MATCH" else 0
Grading is strict:
Query: “Senior software engineer in San Francisco”
| Result | Role | Location | Seniority | Score |
|---|---|---|---|---|
| Person A | Senior SWE | San Francisco | Senior | ✅ 1 |
| Person B | Senior SWE | San Diego | Senior | ❌ 0 (wrong city) |
| Person C | Junior SWE | San Francisco | Junior | ❌ 0 (wrong level) |
| Person D | Senior PM | San Francisco | Senior | ❌ 0 (wrong role) |
Overall score: 1/4 = 25% precision
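The query-level precision is just the average of those per-result scores; a minimal sketch:

def precision_for_query(query_criteria, returned_profiles):
    scores = [evaluate_person_result(query_criteria, p) for p in returned_profiles]
    return sum(scores) / len(scores)

# Table above: [1, 0, 0, 0] -> 0.25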
Step 4: Running the Evaluation
Exa tested three search engines:
Test setup:
- Same 1,400 queries for all engines
- Each engine returns top 10 results
- Measure: Recall@1 (first result correct), Recall@10 (correct answer in top 10), Precision
Results:
| Search Engine | R@1 | R@10 | Precision |
|---|---|---|---|
| Exa | 72.0% | 94.5% | 63.3% |
| Brave | 44.4% | 77.9% | 30.2% |
| Parallel | 20.8% | 74.7% | 26.9% |
What these numbers mean:
Exa’s R@1 = 72%
- For 72% of queries, the first result is correct
- User doesn’t need to scroll
- Fast, low-friction experience
Exa’s R@10 = 94.5%
- For 94.5% of queries, the correct answer appears somewhere in top 10
- Answer is findable, even if not first
Exa’s Precision = 63.3%
- Of all results returned, 63.3% are actually relevant
- Still some noise, but better than competitors
Compare to Parallel’s R@1 = 20.8%
- Only 1 in 5 queries gets the right first result
- Users scroll through ~5 results on average
- Higher friction, slower
Step 5: Understanding What Made Exa Better
The 51-percentage-point improvement in R@1 (from 21% to 72%) came from:
1. Fine-tuned embeddings
- Trained specifically on person-query pairs
- “Senior engineer in SF” embeds close to profiles with {role: Senior Engineer, location: SF}
- Not general-purpose text embeddings
2. Hybrid retrieval
- Semantic search (embeddings) for fuzzy matching
- Matches “Software Engineer” to “SWE”
- Keyword search (BM25) for exact terms
- Ensures “San Francisco” doesn’t match “San Jose”
- Combined with Reciprocal Rank Fusion (see the sketch after this list)
3. Profile consolidation
- Same person appears on LinkedIn, company site, conference speaker page
- System merges these into one profile
- Avoids duplicate results
4. Fresh data
- 50 million profile updates per week
- Job changes reflected quickly
- Location updates captured
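Reciprocal Rank Fusion itself is only a few lines; a minimal sketch of the general technique (not Exa's implementation) that merges a semantic ranking and a BM25 ranking, where each input is a list of doc IDs ordered by that retriever and k=60 is the conventional smoothing constant:

def reciprocal_rank_fusion(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([semantic_results, bm25_results])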
Key Takeaways from Exa Case Study
Different query types need different evaluation methods
- Factual lookups → standard metrics
- Open-ended discovery → LLM-as-judge
Strict grading prevents false positives
- All criteria must match
- “Close enough” = wrong
Real user queries matter
- 10,000 historical queries → understand actual usage
- Synthetic queries miss real patterns
Public benchmarks drive progress
- Open-sourcing the eval forces competitors to report same metrics
- Can’t cherry-pick easy examples anymore
Part 5: Intermediate Concepts
LLM-as-a-Judge: Automated Evaluation
For large-scale testing, you can’t manually review every answer. LLM-as-a-judge automates this.
How it works:
def llm_as_judge(question, answer, expected_answer):
    prompt = f"""
    You are evaluating an AI system's answer.

    Question: {question}
    Expected: {expected_answer}
    Actual: {answer}

    Rate the actual answer:
    - CORRECT: Fully answers the question
    - PARTIALLY_CORRECT: Right idea but incomplete
    - INCORRECT: Wrong or irrelevant

    Provide your rating and a brief explanation.
    """
    response = gpt4(prompt)  # or claude, or another strong model
    return parse_rating(response)
Benefits:
- Scales to thousands of evaluations
- Consistent scoring (doesn’t get tired like humans)
- Fast (seconds vs. hours for human review)
Limitations:
- Has biases (covered in advanced topics)
- Not perfect - should be calibrated against human judgment
- Best used for regression testing, not absolute measurement
ADVANCED: LLM-as-Judge Biases
LLM judges have systematic biases you should know about:
Verbosity Bias: Judges favor longer answers even when brief answers are better.
- Solution: Include length guidance in judging prompt
Position Bias: In pairwise comparisons, judges favor the first option.
- Solution: Randomize order and run twice
Circular Logic: If judge model = generation model, it’s biased toward its own outputs.
- Solution: Use different model family as judge
Calibration with Cohen’s Kappa:
# 1. Get human judgments on 100 samples
human_scores = [evaluate_manually(q) for q in sample_queries[:100]]

# 2. Get LLM judgments on same samples
llm_scores = [llm_as_judge(q) for q in sample_queries[:100]]

# 3. Calculate agreement
from sklearn.metrics import cohen_kappa_score
kappa = cohen_kappa_score(human_scores, llm_scores)

# kappa > 0.75: Good agreement
# kappa 0.60-0.75: Moderate agreement
# kappa < 0.60: Poor agreement, recalibrate judge
The Precision-Recall Trade-off
You can increase recall (finding more relevant documents) OR increase precision (returning only relevant documents), but it’s hard to maximize both.
Example:
Your system can retrieve k documents (k = 5, 10, or 20).
| k value | Recall@k | Precision@k | Why |
|---|---|---|---|
| k=5 | 0.65 | 0.82 | Few docs, most relevant, but miss some |
| k=10 | 0.83 | 0.71 | More docs, find more, but some noise |
| k=20 | 0.91 | 0.58 | Find almost everything, but half is junk |
How to decide:
Consider your use case:
High-stakes compliance queries: Optimize for recall (k=20)
- Missing information is worse than noise
- Accept lower precision
Fast user Q&A: Optimize for precision (k=5)
- Users want quick, clean answers
- Accept that you’ll miss some edge cases
Real-world constraint: Latency
More documents = more tokens in LLM context = higher latency:
- k=5: ~2,000 tokens, 0.8s generation time
- k=10: ~4,000 tokens, 1.2s generation time
- k=20: ~8,000 tokens, 2.1s generation time
If your SLA is <1.5s response time, k=20 is not an option.
ADVANCED: The Pareto Frontier
You’re optimizing three competing objectives:
- Recall (completeness)
- Precision (accuracy)
- Latency (speed)
You cannot maximize all three. This creates a Pareto frontier - the set of optimal trade-offs.
High Recall
    ^
    |  B
    |  A (you are here)
    |  C
    |
Low Precision -----------------> High Precision
- Point A: High precision, medium recall, fast (k=5)
- Point B: High recall, medium precision, slow (k=20)
- Point C: Medium on both, medium speed (k=10)
How production teams decide:
Set hard constraints
- Latency must be < 2.0s (eliminates options above this)
- Cost must be < $0.05/query (eliminates expensive configs)
Among remaining options, optimize for primary metric
- If primary = user satisfaction → optimize precision
- If primary = compliance → optimize recall
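That decision process is easy to encode: filter candidate configurations by the hard constraints, then pick the best remaining one on your primary metric. A sketch using the recall/precision/latency numbers from the tables above (the per-config cost values are illustrative):

configs = [
    {"k": 5,  "recall": 0.65, "precision": 0.82, "latency_s": 0.8, "cost": 0.02},
    {"k": 10, "recall": 0.83, "precision": 0.71, "latency_s": 1.2, "cost": 0.03},
    {"k": 20, "recall": 0.91, "precision": 0.58, "latency_s": 2.1, "cost": 0.06},
]

# 1. Apply hard constraints
feasible = [c for c in configs if c["latency_s"] < 2.0 and c["cost"] < 0.05]

# 2. Optimize the primary metric among what remains
best = max(feasible, key=lambda c: c["precision"])  # k=5 here; use c["recall"] for compliance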
From Braintrust research: “RAG systems face strict latency requirements. Users tolerate maybe 2-3 seconds for answers. Retrieval plus generation easily exceeds this budget without optimization.”
Building a Golden Dataset
A “golden dataset” is your regression test suite. It’s called “golden” because:
- Carefully curated from real failures
- Manually annotated by experts
- Small enough to reason about (50-100 queries)
- Represents your actual query distribution
How to build one:
Step 1: Collect failure cases
Look for:
- Support tickets that required escalation
- Questions that got wrong answers
- Queries that returned “I don’t know” but should have answered
- Edge cases that exposed gaps
Step 2: Stratify by category
Example for billing Q&A system:
| Category | Count | Example Query |
|---|---|---|
| Proration | 15 | “How do we handle mid-cycle upgrades?” |
| Credits | 12 | “Can credits roll over to next month?” |
| Invoicing | 18 | “When are invoices generated?” |
| Taxes | 10 | “How is sales tax calculated?” |
| Edge cases | 20 | “What if customer downgrades same day as upgrade?” |
Step 3: Annotate thoroughly
{
"query_id": "billing_015",
"query": "How do we handle proration when customer upgrades mid-cycle?",
"expected_answer": "We use day-based proration. Credit unused days from old plan, charge prorated amount for new plan.",
"relevant_doc_ids": [
"docs/proration_logic.md",
"docs/billing_calculations.md"
],
"category": "proration",
"difficulty": "medium",
"requires_calculation": true,
"should_refuse": false
}
Step 4: Get expert review
Have domain experts validate:
- Is the expected answer correct?
- Are the relevant docs complete?
- Is the difficulty rating accurate?
Part 6: Continuous Evaluation
Automated Testing Pipeline
Once you have a golden dataset, integrate it into your development workflow:
# ci_cd_pipeline.py
def run_evaluation_suite():
    """Run before every deployment"""
    # Load golden dataset
    golden_dataset = load_golden_dataset()

    # Run current system
    results = []
    for query in golden_dataset:
        retrieved = retrieval_system.search(query["question"])
        answer = llm.generate(query["question"], retrieved)
        results.append({
            "query_id": query["id"],
            "answer": answer,
            "retrieved": retrieved
        })

    # Calculate metrics
    metrics = {
        "mrr": calculate_mrr(results, golden_dataset),
        "recall@5": calculate_recall_at_k(results, golden_dataset, k=5),
        "correctness": calculate_correctness(results, golden_dataset),
        "faithfulness": calculate_faithfulness(results, golden_dataset)
    }

    # Check thresholds
    assert metrics["correctness"] >= 0.85, "Correctness below threshold!"
    assert metrics["faithfulness"] >= 0.90, "Faithfulness below threshold!"

    return metrics
When to run:
- Before every deployment (CI/CD)
- After changing retrieval logic
- After updating document corpus
- When switching LLM models
- Daily as a sanity check
A/B Testing in Production
Golden datasets don’t capture real user behavior. You need production testing.
Setup:
# Route 90% to version A, 10% to version B
def handle_query(user_query):
    # Note: Python's built-in hash() is randomized per process for strings;
    # use a stable hash (e.g., hashlib) so a user always sees the same version
    user_id = get_user_id()
    if hash(user_id) % 100 < 90:
        # Version A (current production)
        return version_a.answer(user_query)
    else:
        # Version B (new candidate)
        return version_b.answer(user_query)
Track metrics:
- User feedback (thumbs up/down)
- Follow-up question rate (did first answer satisfy?)
- Session length
- Task completion rate
Decision criteria:
After 2 weeks:
- If Version B’s thumbs up rate > Version A’s: ramp B to 50%
- Monitor for 1 more week
- If still better: ramp B to 100%
- If worse at any point: rollback to A
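A rough sketch of that ramp logic, assuming you log a thumbs-up rate per version over the evaluation window (in practice you would also check statistical significance before ramping):

def decide_ramp(thumbs_up_a, thumbs_up_b, current_b_traffic_pct):
    if thumbs_up_b <= thumbs_up_a:
        return 0      # worse at any point: roll back to A
    if current_b_traffic_pct < 50:
        return 50     # better: ramp B to 50% and keep monitoring
    return 100        # still better after monitoring: ramp B to 100%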
Part 7: Advanced Topics
GraphRAG for Cross-Document Reasoning
ADVANCED TOPIC
The Problem:
Traditional RAG fails on questions like:
- “What are the main themes across all Q4 board decks?”
- “How has our security posture evolved over the past year?”
- “What consistent patterns appear in customer feedback?”
These require synthesizing information across many documents. You can’t answer by retrieving 5 chunks.
GraphRAG Approach:
Build knowledge graph
- Extract entities (companies, people, concepts)
- Extract relationships between entities
- Store as graph: nodes = entities, edges = relationships
Detect communities
- Find clusters of related entities
- Example: All entities related to “product development” form one community
Generate community summaries
- Each community gets an LLM-generated summary
- These summaries become the retrieval units
Query-time synthesis
- Retrieve relevant community summaries
- Synthesize into final answer
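A very rough sketch of the indexing side, assuming an LLM-backed extract_entities_and_relations helper and using networkx for community detection (corpus and llm are placeholders; this is a toy outline of the idea, not the GraphRAG implementation):

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

graph = nx.Graph()
for doc in corpus:
    # extract_entities_and_relations is an assumed helper that returns
    # (entity, relation, entity) triples from one document
    for entity_a, relation, entity_b in extract_entities_and_relations(doc):
        graph.add_edge(entity_a, entity_b, relation=relation)

# Cluster related entities into communities
communities = greedy_modularity_communities(graph)

# Summarize each community; these summaries become the retrieval units
community_summaries = [
    llm.summarize(list(graph.subgraph(members).edges(data=True)))  # assumed LLM helper
    for members in communities
]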
Trade-off:
- Much better for global questions
- 40x more expensive to build index
- Overkill for simple fact lookup
When to use:
- 10% of queries that need cross-document synthesis
- Use regular RAG for the other 90%
Drift Detection
ADVANCED TOPIC
Your system’s performance can degrade over time even without code changes:
Behavioral Drift: Model gets worse
- Provider updates the model
- Performance degrades
- Detection: Track metrics over time, alert if drop >10%
Concept Drift: User queries shift
- Users start asking about “AI agents”
- Your docs use term “autonomous systems”
- Embeddings miss the connection
- Detection: Track query distribution, alert if significant shift
# Weekly drift check
current_week_queries = get_queries(last_7_days)
baseline_queries = golden_dataset_queries

# Cluster and compare distributions
from scipy.spatial.distance import jensenshannon

divergence = jensenshannon(
    cluster_distribution(current_week_queries),
    cluster_distribution(baseline_queries)
)

if divergence > 0.3:
    alert("Query distribution has shifted significantly")
Refusal Accuracy
ADVANCED TOPIC
In high-stakes domains, saying “I don’t know” when uncertain is safer than hallucinating.
Metrics:
Refusal Precision: When system refuses, is it correct to refuse?
- System refused 20 queries
- 18 were actually unanswerable
- Refusal Precision = 18/20 = 90%
Refusal Recall: When system should refuse, does it?
- 25 queries were unanswerable
- System refused 18 of them
- Refusal Recall = 18/25 = 72%
The system hallucinated answers for 7 queries it should have refused.
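Computing both metrics is straightforward once each test query is labeled as answerable or not; a minimal sketch:

def refusal_metrics(results):
    # Each result has 'refused' and 'should_refuse' booleans
    refused = [r for r in results if r["refused"]]
    should_refuse = [r for r in results if r["should_refuse"]]

    precision = sum(r["should_refuse"] for r in refused) / len(refused)
    recall = sum(r["refused"] for r in should_refuse) / len(should_refuse)
    return precision, recall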
Implementation:
def generate_answer(query, retrieved_chunks):
    # Filter low-confidence chunks
    high_conf = [c for c in retrieved_chunks if c.score > 0.75]

    if not high_conf:
        return "I don't have enough information to answer this."

    if max(c.score for c in retrieved_chunks) < 0.80:
        return "I'm not confident in the retrieved information."

    # Proceed with generation
    return llm.generate(query, high_conf)