Why Evaluation Is Hard for Code Retrieval
Evaluating code retrieval systems is surprisingly difficult. The obvious approach — test on popular open-source repositories — has a fundamental flaw: LLMs have already seen these codebases during training.
When we first tested on Flask and FastAPI, our "no context" baseline (LLM with no code provided) scored 7/10 on answer quality. The model wasn't using our retrieval — it was reciting memorized knowledge from training data.
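To make the check concrete, here is a minimal sketch of that kind of memorization test: ask the same question with and without retrieved code and compare the two answers. The model ID, prompt wiring, and the `answer` helper are illustrative, not our actual harness.

```python
# Illustrative memorization check: if the no-context answer is already good
# on a famous repo, the benchmark is measuring recall of training data,
# not retrieval quality.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str, context: str | None = None) -> str:
    """Ask the model, optionally pasting retrieved code above the question."""
    prompt = question if context is None else f"{context}\n\n{question}"
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

question = "How does Flask tear down the request context?"
retrieved = "<contents of the files the retriever returned>"

baseline = answer(question)                      # no code provided
augmented = answer(question, context=retrieved)  # retrieval-augmented
```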
This realization changed our entire evaluation strategy.
Our Approach: Test Where LLMs Have No Prior
We now focus our evaluation on repositories where LLMs are unlikely to have prior knowledge:
| Tier | Description | Questions | Why It Matters |
|---|---|---|---|
| Private | 5 private GitHub repos | 47 | Zero LLM prior — gold standard |
| Obscure | Public repos with <1K stars | 24 | Minimal training influence |
| Famous | Flask, FastAPI, etc. | 106 | Upper bound (inflated) |
Private repositories are our gold standard because we know the LLM hasn't seen them. On these repos, the "no context" baseline drops to 4.9/10 — proof that retrieval is actually providing value.
Two Benchmarks, Two Questions
We run two separate benchmarks that answer different questions:
Benchmark 1: Retrieval Quality
"Does ContextPacker find the right files?"
For each question, we have human-labeled ground truth: the files that actually contain the answer. We measure how well the system ranks these files using industry-standard metrics (a short code sketch of how we compute them follows the list):
- NDCG@10 — How good is the ranking quality? (0-1, higher is better)
- Hit@10 — Did we find any relevant file in the top 10?
- MRR — How high does the first relevant file rank, on average?
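For readers who want to reproduce the numbers, here is a minimal sketch of these three metrics under binary relevance (a file is either labeled relevant or not). The file paths in the example are made up.

```python
import math

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k: each hit is discounted by its rank position."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, path in enumerate(ranked[:k])
        if path in relevant
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

def hit_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """1.0 if any relevant file appears in the top k, else 0.0."""
    return float(any(path in relevant for path in ranked[:k]))

def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant file (0.0 if none); MRR is the mean over questions."""
    for i, path in enumerate(ranked):
        if path in relevant:
            return 1.0 / (i + 1)
    return 0.0

# Example: ground truth says app/ctx.py and app/wrappers.py contain the answer.
ranked = ["app/ctx.py", "app/cli.py", "app/wrappers.py"]
relevant = {"app/ctx.py", "app/wrappers.py"}
print(ndcg_at_k(ranked, relevant), hit_at_k(ranked, relevant), reciprocal_rank(ranked, relevant))
```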
Retrieval Results (Private Repos)
Benchmark 2: End-to-End Answer Quality
"Do those files help the LLM answer correctly?"
Good retrieval doesn't guarantee good answers. So we also measure whether an LLM produces correct answers when given our files. We use LLM-as-judge scoring against human-written key facts.
To reduce bias, we use cross-vendor judging: OpenAI models generate answers, Google models judge them. This reduces the tendency of a judge to favor answers written in its own output style.
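Here is a rough sketch of what that cross-vendor loop looks like, using the OpenAI and google-generativeai SDKs. The prompts, model IDs, and 0-10 rubric wording are illustrative rather than our production judge.

```python
import os

from openai import OpenAI
import google.generativeai as genai

openai_client = OpenAI()  # reads OPENAI_API_KEY
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
judge = genai.GenerativeModel("gemini-2.0-flash")

def generate_answer(question: str, context: str) -> str:
    """Answer with an OpenAI model, given the retrieved files as context."""
    resp = openai_client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": f"{context}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content

def judge_answer(question: str, answer: str, key_facts: list[str]) -> str:
    """Have a Gemini model grade the answer against human-written key facts."""
    rubric = "\n".join(f"- {fact}" for fact in key_facts)
    prompt = (
        f"Question: {question}\n\nCandidate answer:\n{answer}\n\n"
        f"Key facts a correct answer must contain:\n{rubric}\n\n"
        "Give a 0-10 score and list any key facts that are missing."
    )
    return judge.generate_content(prompt).text
```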
E2E Results (22 Questions)
Cross-vendor judging: GPT-4.1-mini → Gemini-2.0-flash
What We Learned Along the Way
Building this evaluation taught us several lessons that might help others in the space:
1. Famous Repos Inflate All Scores
Testing on Flask or React doesn't tell you much — LLMs already know these codebases. Your retrieval might not be helping at all.
Solution: Test on private repos or obscure projects where the "no context" baseline actually fails.
2. Symbol Extraction Matters
Just showing file paths isn't enough. Adding function and class names (via AST parsing) improved our NDCG from 0.85 to 0.92 — a significant jump.
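For Python files, the standard library's `ast` module is enough to pull out those names. A minimal sketch follows; the `src/` layout is hypothetical.

```python
import ast
from pathlib import Path

def extract_symbols(path: Path) -> list[str]:
    """Collect function and class names from a Python file via the stdlib AST."""
    tree = ast.parse(path.read_text(encoding="utf-8"))
    symbols = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            symbols.append(node.name)
    return symbols

# The retriever then sees "src/app.py: Flask, full_dispatch_request, ..."
# instead of a bare path.
for path in Path("src").rglob("*.py"):
    print(f"{path}: {', '.join(extract_symbols(path))}")
```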
3. Your Judge Prompt Is Probably Broken
Vague evaluation criteria like "should explain error handling" reward confident bullshit. We switched to requiring exact symbol names in answers:
❌ "Should explain the request lifecycle"
✓ "Must mention RequestContext and full_dispatch_request()"
4. Cross-Vendor Judging Reduces Bias
When GPT-4 judges GPT-4's answers, scores inflate by ~0.5 points. Using different model families (OpenAI for answers, Google for judging) produces more honest scores.
Honest Limitations
We believe in transparent evaluation. Here's what you should know:
Documented Caveats
- We wrote the questions — possible unconscious bias toward what we handle well
- Sample sizes are small — 47 private repo questions, 22 E2E questions
- No external validation — questions not reviewed by independent parties
- Results are directional — treat as optimistic upper bounds, not rigorous proof
We're sharing this data because it's useful, not because it's perfect. If you're evaluating code retrieval systems, we hope our methodology and lessons learned help you avoid the same traps.
Comparison to Industry Baselines
For context, here's how different approaches perform on code search benchmarks (from published research):
| System | NDCG | MRR | Source |
|---|---|---|---|
| BM25 (lexical) | 0.31 | 0.40 | CodeSearchNet 2019 |
| CodeBERT | 0.69 | 0.72 | Microsoft 2020 |
| UniXcoder | 0.75 | 0.78 | Microsoft 2022 |
| ContextPacker | 0.92 | 0.89 | Our private repo benchmark |
Note: These numbers come from different benchmarks and aren't directly comparable. We include them for rough context on what "good" looks like in this space.
The Bottom Line
Our evaluation shows two things:
- Context clearly helps — +3.6 points over the "no context" baseline
- ContextPacker matches embedding-based retrieval quality — without the infrastructure
The value proposition isn't "better than embeddings" — it's "same quality, dramatically simpler." No vector database, no pre-indexing, no sync to maintain. Just call the API.