Why Evaluation Is Hard for Code Retrieval
Evaluating code retrieval systems is surprisingly difficult. The obvious approach — test on popular open-source repositories — has a fundamental flaw: LLMs have already seen these codebases during training.
When we first tested on Flask and FastAPI, our "no context" baseline (LLM with no code provided) scored 7/10 on answer quality. The model wasn't using our retrieval — it was reciting memorized knowledge from training data.
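To make the check concrete, here is a minimal sketch of that kind of memorization test: ask the same question with and without retrieved code and compare the two answers. The model ID, prompt wiring, and the `answer` helper are illustrative, not our actual harness.

```python
# Illustrative memorization check: if the no-context answer is already good
# on a famous repo, the benchmark is measuring recall of training data,
# not retrieval quality.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str, context: str | None = None) -> str:
    """Ask the model, optionally pasting retrieved code above the question."""
    prompt = question if context is None else f"{context}\n\n{question}"
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

question = "How does Flask tear down the request context?"
retrieved = "<contents of the files the retriever returned>"

baseline = answer(question)                      # no code provided
augmented = answer(question, context=retrieved)  # retrieval-augmented
```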
This realization changed our entire evaluation strategy.
Our Approach: Test Where LLMs Have No Prior
We now focus our evaluation on repositories where LLMs are unlikely to have prior knowledge:
| Tier | Description | Questions | Why It Matters |
|---|---|---|---|
| Private | 5 private GitHub repos | 47 | Zero LLM prior — gold standard |
| Obscure | Public repos with <1K stars | 24 | Minimal training influence |
| Famous | Flask, FastAPI, etc. | 106 | Upper bound (inflated) |
Private repositories are our gold standard because we know the LLM hasn't seen them. On these repos, the "no context" baseline drops to 4.9/10 — proof that retrieval is actually providing value.
Two Benchmarks, Two Questions
We run two separate benchmarks that answer different questions:
Benchmark 1: Retrieval Quality
"Does ContextPacker find the right files?"
For each question, we have human-labeled ground truth: the files that actually contain the answer. We measure how well the system ranks these files using industry-standard metrics (a short code sketch of how we compute them follows the list):
- NDCG@10 — How good is the ranking quality? (0-1, higher is better)
- Hit@10 — Did we find any relevant file in the top 10?
- MRR — How high does the first relevant file rank, on average?
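For readers who want to reproduce the numbers, here is a minimal sketch of these three metrics under binary relevance (a file is either labeled relevant or not). The file paths in the example are made up.

```python
import math

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k: each hit is discounted by its rank position."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, path in enumerate(ranked[:k])
        if path in relevant
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

def hit_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    """1.0 if any relevant file appears in the top k, else 0.0."""
    return float(any(path in relevant for path in ranked[:k]))

def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant file (0.0 if none); MRR is the mean over questions."""
    for i, path in enumerate(ranked):
        if path in relevant:
            return 1.0 / (i + 1)
    return 0.0

# Example: ground truth says app/ctx.py and app/wrappers.py contain the answer.
ranked = ["app/ctx.py", "app/cli.py", "app/wrappers.py"]
relevant = {"app/ctx.py", "app/wrappers.py"}
print(ndcg_at_k(ranked, relevant), hit_at_k(ranked, relevant), reciprocal_rank(ranked, relevant))
```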
Retrieval Results (Private Repos)
Benchmark 2: End-to-End Answer Quality
"Do those files help the LLM answer correctly?"
Good retrieval doesn't guarantee good answers. So we also measure whether an LLM produces correct answers when given our files. We use LLM-as-judge scoring against human-written key facts.
To reduce bias, we use cross-vendor judging: OpenAI models generate answers, Google models judge them. This reduces the tendency of a judge to favor answers written in its own output style.
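Here is a rough sketch of what that cross-vendor loop looks like, using the OpenAI and google-generativeai SDKs. The prompts, model IDs, and 0-10 rubric wording are illustrative rather than our production judge.

```python
import os

from openai import OpenAI
import google.generativeai as genai

openai_client = OpenAI()  # reads OPENAI_API_KEY
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
judge = genai.GenerativeModel("gemini-2.0-flash")

def generate_answer(question: str, context: str) -> str:
    """Answer with an OpenAI model, given the retrieved files as context."""
    resp = openai_client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user", "content": f"{context}\n\nQuestion: {question}"}],
    )
    return resp.choices[0].message.content

def judge_answer(question: str, answer: str, key_facts: list[str]) -> str:
    """Have a Gemini model grade the answer against human-written key facts."""
    rubric = "\n".join(f"- {fact}" for fact in key_facts)
    prompt = (
        f"Question: {question}\n\nCandidate answer:\n{answer}\n\n"
        f"Key facts a correct answer must contain:\n{rubric}\n\n"
        "Give a 0-10 score and list any key facts that are missing."
    )
    return judge.generate_content(prompt).text
```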
E2E Results (22 Questions)
Cross-vendor judging: GPT-4.1-mini → Gemini-2.0-flash
What We Learned Along the Way
Building this evaluation taught us several lessons that might help others in the space:
1. Famous Repos Inflate All Scores
Testing on Flask or React doesn't tell you much — LLMs already know these codebases. Your retrieval might not be helping at all.
Solution: Test on private repos or obscure projects where the "no context" baseline actually fails.
2. Symbol Extraction Matters
Just showing file paths isn't enough. Adding function and class names (via AST parsing) improved our NDCG from 0.85 to 0.92 — a significant jump.
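For Python files, the standard library's `ast` module is enough to pull out those names. A minimal sketch follows; the `src/` layout is hypothetical.

```python
import ast
from pathlib import Path

def extract_symbols(path: Path) -> list[str]:
    """Collect function and class names from a Python file via the stdlib AST."""
    tree = ast.parse(path.read_text(encoding="utf-8"))
    symbols = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            symbols.append(node.name)
    return symbols

# The retriever then sees "src/app.py: Flask, full_dispatch_request, ..."
# instead of a bare path.
for path in Path("src").rglob("*.py"):
    print(f"{path}: {', '.join(extract_symbols(path))}")
```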
3. Your Judge Prompt Is Probably Broken
Vague evaluation criteria like "should explain error handling" reward confident bullshit. We switched to requiring exact symbol names in answers:
❌ "Should explain the request lifecycle"
✓ "Must mention RequestContext and full_dispatch_request()"
4. Cross-Vendor Judging Reduces Bias
When GPT-4 judges GPT-4's answers, scores inflate by ~0.5 points. Using different model families (OpenAI for answers, Google for judging) produces more honest scores.
Honest Limitations
We believe in transparent evaluation. Here's what you should know:
Documented Caveats
- We wrote the questions — possible unconscious bias toward what we handle well
- Sample sizes are small — 47 private repo questions, 22 E2E questions
- No external validation — questions not reviewed by independent parties
- Results are directional — treat as optimistic upper bounds, not rigorous proof
We're sharing this data because it's useful, not because it's perfect. If you're evaluating code retrieval systems, we hope our methodology and lessons learned help you avoid the same traps.
Comparison to Industry Baselines
For context, here's how different approaches perform on code search benchmarks (from published research):
| System | NDCG | MRR | Source |
|---|---|---|---|
| BM25 (lexical) | 0.31 | 0.40 | CodeSearchNet 2019 |
| CodeBERT | 0.69 | 0.72 | Microsoft 2020 |
| UniXcoder | 0.75 | 0.78 | Microsoft 2022 |
| ContextPacker | 0.92 | 0.89 | Our private repo benchmark |
Note: These numbers come from different benchmarks and aren't directly comparable. We include them for rough context on what "good" looks like in this space.
The Bottom Line
Our evaluation shows two things:
- Context clearly helps — +3.6 points over the "no context" baseline
- ContextPacker matches embedding-based retrieval quality — without the infrastructure
The value proposition isn't "better than embeddings" — it's "same quality, dramatically simpler." No vector database, no pre-indexing, no sync to maintain. Just call the API.