ADR-010: Commit-Driven A/B Agent Evaluation Harness — Methodology & Design¶
| Field | Value |
|---|---|
| ID | ADR-010 |
| Status | 🟡 Proposed |
| Date | 2026-03-06 |
| System | tools/commit_eval · agentic-workflows-v2 · agentic-v2-eval |
| Authors | Platform Engineering |
| Reviewers | Research Infra, ML Platform, Security |
| Supersedes | (none) |
1. TL;DR¶
We adopt a SWE-bench–inspired evaluation pattern pointed at our own commits: extract requirements from `git show`, create one git worktree at `<commit>~1` (the pre-task state), run two contestant strategies sequentially (A → reset → B), extract git patches, and score both using a two-layer system — LLM-as-judge for semantic quality plus rubric-driven static analysis for structural standards. No Docker. One worktree, sequential execution, real commits as ground truth.
2. Status History¶
| Date | Status | Note |
|---|---|---|
| 2026-03-06 | 🟡 Proposed | Initial design — commit-driven A/B harness |
3. Context & Problem Statement¶
The platform team has built a multi-agent workflow runtime capable of generating code. The pressing question is: how do we objectively measure whether one agent, workflow, or prompt strategy produces better code than another?
┌──────────────────────────────────────────────────────────────────┐
│ THREE QUESTIONS THE EVAL HARNESS MUST ANSWER │
├──────────────────────────────────────────────────────────────────┤
│ Q1 │ What is ground-truth "correct" output for a code task? │
│ Q2 │ How do we isolate two strategies to compare them fairly? │
│ Q3 │ How do we score a generated patch vs. the gold standard? │
└──────────────────────────────────────────────────────────────────┘
3.1 Existing Approaches and Their Limits¶
| Approach | Problem |
|---|---|
| Manual code review | Relies on expert judgment; not repeatable; introduces reviewer bias; doesn't scale |
| Synthetic benchmarks (HumanEval, MBPP) | Toy problems; doesn't reflect real production code patterns or domain conventions |
| Hold-out test sets | Requires pre-labeling; quickly goes stale; task diversity is narrow |
| Arbitrary prompts with LLM judge | No ground truth; "good" is undefined without a reference implementation |
| SWE-bench (public) | Docker-based; public repos only; fixed 2,294 pre-labeled issues; cannot evaluate own repo commits |
3.2 The SWE-bench Insight — Commits as Ground Truth¶
Jimenez, Yang, Wettig, Yao, Pei, Press, and Narasimhan introduced SWE-bench at ICLR 2024 (arXiv:2310.06770), building a benchmark of 2,294 real GitHub issues from 12 popular Python repositories. Every instance is a pull request that:
- Is associated with an issue (the task specification)
- Modifies one or more test files (the test oracle)
- Has an implementation diff (the gold standard)
Gold solutions edit an average of 1.7 files, 3.0 functions, and 32.8 lines. The initial RAG-only baseline solved just 1.96% of instances; even the "oracle" condition (given the exact files to edit) reached only 4.80%. This intentionally high bar forced the field toward proper agent scaffolding and validated the commit-as-ground-truth structure as a rigorous, non-trivial evaluation surface.
┌───────────────────────────────────────────────────────┐
│ COMMIT AS GROUND TRUTH (the SWE-bench insight) │
│ │
│ git show <sha>: │
│ commit message → task specification │
│ diff → gold standard implementation │
│ test changes → test oracle │
│ <sha>~1 state → pre-task baseline │
└───────────────────────────────────────────────────────┘
3.3 Extension to Own Repos — The Automation Gap¶
SWE-rebench (Badertdinov et al., NeurIPS 2025, arXiv:2505.20411) demonstrated that the SWE-bench pattern can be fully automated and extended to arbitrary repositories. The pipeline filters approximately 450,000 pull requests across 30,000+ repositories and uses LLM-driven extraction to produce 21,000+ verified interactive tasks. A key finding: models like GPT-4.1 show a measurable performance drop on March–April 2025 tasks compared to earlier dates, suggesting contamination in fixed benchmarks. This validates the need for continuously updated, private-commit evaluation.
SWE-Bench++ (Wang et al., arXiv:2512.17419, 2025) further demonstrated scalable benchmark generation across 11 languages and 3,971 repositories, achieving 137% higher environment yield than the SetUpAgent baseline on Python repos by using template-guided Dockerfile synthesis and adaptive test log parsing.
The gap: no existing tool combines (1) own private repo, (2) arbitrary commit SHA, (3) own workflow/prompt/agent strategies as contestants, (4) own domain rubrics, and (5) real-time streaming results to a web UI.
4. Decision¶
4.1 Commit Extraction via git show¶
`CommitExtractor` runs `git show --stat --patch <sha>` and extracts:
- Commit message body → task context
- Full unified diff → `gold_patch`
- File list from `--stat` → `affected_files`
- Test files (heuristic: `test_*.py`, `*_test.py`, paths under `tests/`) → `test_oracle_files`
- An LLM call via `LLMClient.generate_text()` synthesizes a clean `requirements_prompt`

These fields populate a `TaskInstance`; a minimal sketch follows below.
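A minimal sketch of the `TaskInstance` container these fields land in. Only the field names listed above are taken from this ADR; defaults and ordering are assumptions.

```python
# Sketch only: field names come from the extraction list above; defaults and
# anything not listed there are assumptions, not the final model.
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    commit_sha: str                                   # evaluated commit
    repo_path: str                                    # local path or bare clone of the repo
    requirements_prompt: str                          # LLM-synthesized task specification
    gold_patch: str                                   # full unified diff from git show
    affected_files: list[str] = field(default_factory=list)      # parsed from --stat
    test_oracle_files: list[str] = field(default_factory=list)   # heuristic test-file match
```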
Extraction prompt template:
Given the following git commit message and code diff from a production repository,
extract a concise but complete requirements specification that captures:
1. The problem being solved or feature being added
2. The constraints and acceptance criteria
3. The expected behavior changes visible in the diff
Focus on what, not how — describe the task as it would be given to a developer
who has not seen the solution.
Commit message: {commit_message}
Diff (truncated to {max_tokens} tokens):
{diff}
Supports both local paths and GitHub URLs. For GitHub URLs, the extractor first runs `git clone --bare <url> /tmp/eval-repo-{uuid}` and then proceeds as for a local path.
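A minimal sketch of the extraction flow, assuming the `TaskInstance` fields above. The `_git_show` helper, the call signature of `LLMClient.generate_text()`, and the character-level diff truncation are illustrative, not the final `CommitExtractor` API.

```python
# Illustrative sketch of the extraction flow; helper names and the truncation
# strategy are assumptions. TaskInstance is the dataclass sketched above;
# EXTRACTION_PROMPT is the template shown above, stored as a module constant.
import re
import subprocess

TEST_FILE_RE = re.compile(r"(^|/)(test_[^/]+\.py|[^/]+_test\.py)$|(^|/)tests/")


def _git_show(repo_path: str, sha: str) -> str:
    """Raw output of `git show --stat --patch <sha>`."""
    return subprocess.run(
        ["git", "-C", repo_path, "show", "--stat", "--patch", sha],
        check=True, capture_output=True, text=True,
    ).stdout


def extract_task(repo_path: str, sha: str, llm_client) -> "TaskInstance":
    raw = _git_show(repo_path, sha)
    # Crude split: everything before the first file header is message + stat.
    message, _, rest = raw.partition("\ndiff --git ")
    gold_patch = "diff --git " + rest if rest else ""
    affected = re.findall(r"^diff --git a/(\S+) b/", gold_patch, flags=re.M)
    requirements = llm_client.generate_text(          # existing LLMClient entry point
        EXTRACTION_PROMPT.format(
            commit_message=message,
            max_tokens=8_000,
            diff=gold_patch[:32_000],                 # char-level stand-in for token truncation
        )
    )
    return TaskInstance(
        commit_sha=sha,
        repo_path=repo_path,
        requirements_prompt=requirements,
        gold_patch=gold_patch,
        affected_files=affected,
        test_oracle_files=[f for f in affected if TEST_FILE_RE.search(f)],
    )
```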
4.2 Sandbox Isolation via git worktree (not Docker)¶
SandboxManager creates one git worktree at the pre-commit state:
git worktree add /tmp/eval-{uuid} <sha>~1        # isolate at pre-task state
[run Contestant A — agent writes files]
git -C /tmp/eval-{uuid} checkout -f HEAD         # restore tracked files to the pre-task state
git -C /tmp/eval-{uuid} clean -fd                # remove untracked files Contestant A created
[run Contestant B — agent writes files]
git worktree remove /tmp/eval-{uuid} --force     # cleanup
Used as a context manager:
sandbox = SandboxManager(repo_path, pre_commit_sha)
with sandbox as worktree_path:
    trial_a = await runner.run(contestant_a, task, worktree_path)
    sandbox.reset()
    trial_b = await runner.run(contestant_b, task, worktree_path)
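A minimal sketch of `SandboxManager` consistent with the commands above; attribute names and the `_git` helper are assumptions. The reset also runs `git clean -fd` so untracked files written by Contestant A do not leak into Contestant B's run.

```python
# Illustrative sketch of SandboxManager; attribute names and the _git helper
# are assumptions of this sketch.
import subprocess
import uuid
from pathlib import Path


class SandboxManager:
    def __init__(self, repo_path: str, pre_commit_sha: str):
        self.repo_path = repo_path
        self.pre_commit_sha = pre_commit_sha            # typically "<sha>~1"
        self.worktree_path = Path(f"/tmp/eval-{uuid.uuid4()}")

    def _git(self, *args: str, cwd: Path | str | None = None) -> None:
        subprocess.run(["git", "-C", str(cwd or self.repo_path), *args],
                       check=True, capture_output=True, text=True)

    def __enter__(self) -> Path:
        # Create the worktree detached at the pre-task state.
        self._git("worktree", "add", str(self.worktree_path), self.pre_commit_sha)
        return self.worktree_path

    def reset(self) -> None:
        # Discard Contestant A's output: restore tracked files, then drop
        # untracked files the agent created (checkout -f alone leaves them).
        self._git("checkout", "-f", "HEAD", cwd=self.worktree_path)
        self._git("clean", "-fd", cwd=self.worktree_path)

    def __exit__(self, *exc) -> None:
        self._git("worktree", "remove", str(self.worktree_path), "--force")
```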
4.3 Sequential Contestant Execution (one worktree, not two)¶
┌────────────────────────────────────────────────────────────┐
│ SEQUENTIAL EVALUATION FLOW │
│ │
│ worktree @ sha~1 │
│ │ │
│ ├── [Run Contestant A] │
│ │ agent / workflow / prompt writes files │
│ │ git diff HEAD → patch_a │
│ │ pytest oracle → test_results_a │
│ │ │
│ ├── git checkout -f HEAD (reset — microseconds) │
│ │ │
│ ├── [Run Contestant B] │
│ │ agent / workflow / prompt writes files │
│ │ git diff HEAD → patch_b │
│ │ pytest oracle → test_results_b │
│ │ │
│ └── git worktree remove --force │
└────────────────────────────────────────────────────────────┘
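A minimal sketch of the per-contestant patch and test-oracle steps in the flow above, assuming the `pytest-json-report` dependency from section 5. The helper names and the stage/unstage trick are illustrative, not the final `PatchExtractor`/`TestRunner` API.

```python
# Illustrative sketch of the PatchExtractor / TestRunner steps; helper names and
# the stage/unstage trick are assumptions. The pytest flags come from the
# pytest-json-report plugin added in section 5.
import json
import subprocess
from pathlib import Path


def _git(worktree: Path, *args: str) -> str:
    return subprocess.run(["git", "-C", str(worktree), *args],
                          check=True, capture_output=True, text=True).stdout


def extract_patch(worktree: Path) -> str:
    """Capture everything the contestant wrote as a unified diff against HEAD."""
    _git(worktree, "add", "-A")                    # stage new files so they appear in the diff
    patch = _git(worktree, "diff", "--cached", "HEAD")
    _git(worktree, "reset", "HEAD")                # unstage again; sandbox.reset() cleans up
    return patch


def run_test_oracle(worktree: Path, test_files: list[str]) -> dict | None:
    """Run the commit's own test files, if any; test failures are data, not errors."""
    if not test_files:
        return None                                # test phase is optional (see section 7.2)
    report = worktree / ".pytest-report.json"      # written after the patch is captured
    subprocess.run(
        ["pytest", *test_files, "--json-report", f"--json-report-file={report}"],
        cwd=worktree, check=False,
    )
    return json.loads(report.read_text()) if report.exists() else None
```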
4.4 Two-Layer Scoring¶
┌─────────────────────────────────────────────────────────────┐
│ LAYER 1: LLM-as-Judge (semantic quality) │
│ │
│ tools/agents/benchmarks/llm_evaluator.py:evaluate_with_llm │
│ │
│ Input: requirements_prompt + generated patch + gold patch │
│ Output: EvaluationResult, 0–10 per dimension │
│ │
│ Dimension Weight What it catches │
│ ─────────────── ────── ────────────────────────────── │
│ completeness 25% Task fully addressed? │
│ correctness 25% Code correct vs. gold standard? │
│ quality 20% Clean, readable, idiomatic? │
│ specificity 15% Concrete, not boilerplate? │
│ alignment 15% Matches original intent? │
│ │
│ LAYER 2: Rubric-Driven Static Analysis (structural std.) │
│ │
│ agentic-v2-eval/src/agentic_v2_eval/scorer.py:Scorer │
│ │
│ Input: patch text (raw diff) │
│ Output: ScoringResult, 0.0–1.0 weighted rubric score │
│ Rubric: coding_standards.yaml (Style & Formatting, │
│ Type Safety, Naming & Structure, Error Handling, │
│ Testing, Security & Privacy, ML Reproducibility, │
│ Deployment Readiness) │
└─────────────────────────────────────────────────────────────┘
Both layers are required. The LLM judge evaluates semantic intent and functional
correctness but is known to reward fluency over structure. The rubric layer
mechanically catches missing type annotations, bare except clauses, undocumented
functions, and other project standards that the LLM does not reliably penalize.
Integration note: `evaluate_with_llm()` and `Scorer` exist independently in the codebase today and are not currently wired together. The two-layer scoring pipeline described here is new integration work proposed by this ADR, not reuse of an existing composed system.

Async note: `evaluate_with_llm()` is a synchronous function (regular `def`, not `async def`). Since the `Comparator` orchestrator is async, calls to `evaluate_with_llm()` must be wrapped with `asyncio.to_thread()` to avoid blocking the event loop.
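The sketch below shows one way the `Comparator` could compose the two layers consistent with both notes. The 50/50 blend, the positional arguments passed to `evaluate_with_llm()`, and the result attribute names (`weighted_score`, `score`) are assumptions, since this ADR does not pin down those signatures.

```python
# Illustrative composition of the two layers; import paths mirror the file
# paths listed in this ADR, and the attribute names plus the 50/50 blend are
# assumptions, not pinned-down APIs.
import asyncio

from agentic_v2_eval.scorer import Scorer                              # existing rubric scorer
from tools.agents.benchmarks.llm_evaluator import evaluate_with_llm    # existing, synchronous


async def score_trial(requirements: str, patch: str, gold_patch: str, scorer: Scorer) -> dict:
    # Layer 1: LLM-as-judge is a regular def, so run it off the event loop.
    llm_result = await asyncio.to_thread(evaluate_with_llm, requirements, patch, gold_patch)

    # Layer 2: rubric-driven static analysis over the raw diff (0.0-1.0).
    rubric_result = scorer.score(patch)

    llm_score = llm_result.weighted_score / 10.0        # normalize 0-10 to 0-1
    return {
        "llm": llm_score,
        "rubric": rubric_result.score,
        "combined": 0.5 * llm_score + 0.5 * rubric_result.score,
    }
```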
5. Files Changed¶
| File | Purpose |
|---|---|
| `tools/commit_eval/__init__.py` | Package exports |
| `tools/commit_eval/models.py` | `TaskInstance`, `Contestant`, `Trial`, `ComparisonResult` |
| `tools/commit_eval/extractor.py` | `CommitExtractor` (git show → LLM → `TaskInstance`) |
| `tools/commit_eval/sandbox.py` | `SandboxManager` (git worktree lifecycle context manager) |
| `tools/commit_eval/runner.py` | `ContestantRunner` (workflow / prompt / agent dispatch) |
| `tools/commit_eval/patch.py` | `PatchExtractor` + `TestRunner` |
| `tools/commit_eval/evaluator.py` | `EvalHarness` (two-layer scoring) |
| `tools/commit_eval/comparator.py` | `Comparator` (main orchestrator) |
| `tools/commit_eval/reporter.py` | `ComparisonReporter` (wraps existing HTML/MD reporters) |
| `tools/commit_eval/cli.py` | `typer` CLI entry point |
| `tools/__init__.py` | Add `commit_eval` export |
| `pyproject.toml` (root) | Add `pytest-json-report` dependency; add `agentic-v2-eval` as a cross-package dependency (required for `tools/commit_eval/` to import `agentic_v2_eval.scorer`). Note: `typer >=0.9,<1` is already declared in `agentic-workflows-v2/pyproject.toml`, so it is not a net-new dependency to the workspace. |
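For illustration, a hedged sketch of the `typer` entry point in `tools/commit_eval/cli.py`. The command name, every option name, and the `run_comparison` helper are hypothetical; only the typer API calls themselves are real.

```python
# Hypothetical CLI surface for tools/commit_eval/cli.py; every option name and
# the run_comparison helper are illustrative, not the final interface.
import asyncio
from typing import Optional

import typer

app = typer.Typer(help="Commit-driven A/B evaluation harness")


async def run_comparison(repo: str, sha: str, a: str, b: str,
                         requirements_file: Optional[str]) -> object:
    # Hypothetical wiring of CommitExtractor, SandboxManager, ContestantRunner,
    # EvalHarness, and Comparator as described in section 4.
    ...


@app.command()
def compare(
    repo: str = typer.Option(..., help="Local path or GitHub URL of the target repo"),
    sha: str = typer.Option(..., help="Commit SHA to use as ground truth"),
    contestant_a: str = typer.Option(..., help="Strategy A: workflow / prompt / agent id"),
    contestant_b: str = typer.Option(..., help="Strategy B: workflow / prompt / agent id"),
    requirements_file: Optional[str] = typer.Option(None, help="Manual override for extracted requirements"),
) -> None:
    """Run both contestants against <sha>~1 and print the comparison result."""
    result = asyncio.run(run_comparison(repo, sha, contestant_a, contestant_b, requirements_file))
    typer.echo(result)


if __name__ == "__main__":
    app()
```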
6. Rationale¶
6.1 Production Precedents for Commit-as-Ground-Truth Evaluation¶
| System | Approach | Relevance to This ADR |
|---|---|---|
| SWE-bench (Jimenez et al., ICLR 2024) | 2,294 real GitHub PRs; Docker sandbox; patch vs. test oracle | Direct methodology inspiration; validates commit as verifiable task source |
| SWE-rebench (NeurIPS 2025) | Automated LLM extraction from 450K+ PRs; 21K+ tasks; decontamination-aware | Proves that LLM can reliably extract requirements from git show output at scale |
| SWE-Bench++ (Wang et al., arXiv:2512.17419, 2025) | 11 languages, 3,971 repos; 137% yield improvement; template-guided env setup | Confirms the base/before/after three-snapshot model; shows test oracle generation is automatable |
| Aider SWE-bench (aider.chat, 2024) | 26.3% on SWE-bench Lite (SOTA); uses git repo map + AST analysis | Demonstrates that commit-driven eval is the de facto standard for code agent benchmarking |
| SWE-bench Live (arXiv:2505.23419) | Continuously updatable commit corpus; broader repo coverage | Validates continuously updated private-commit eval as superior to fixed benchmarks |
6.2 git worktree over Docker¶
| Factor | git worktree | Docker container |
|---|---|---|
| Startup time | ~100 ms (worktree checkout) | 5+ seconds per container launch, per published benchmarks |
| Context switch in CI | ~30 seconds (worktree-based) | 10+ minutes (container provision + image pull) |
| Windows support | Native — no daemon required | Requires Docker Desktop; problematic in non-admin environments |
| Reset cost | `git checkout -f HEAD` plus `git clean -fd`; local file operations, sub-second | Destroy + recreate container (seconds) |
| Disk footprint | One extra checkout of tracked files (object store shared with the main repo) | Full container image layer stack |
| Isolation depth | File system only | OS-level (separate namespaces, network stack) |
| Sufficient for single-repo eval | Yes — no cross-repo contamination possible | Overkill — cross-process isolation not needed |
| Required for multi-repo at scale | Not ideal | Yes |
For a single-repo, single-commit evaluation (the targeted use case), file system isolation via git worktree is sufficient. Docker's OS-level isolation guarantees are unnecessary when both contestants run against the same repository with no external service dependencies.
6.3 Sequential over Parallel (one worktree, not two)¶
| Factor | Sequential (one worktree) | Parallel (two worktrees) |
|---|---|---|
| Implementation complexity | Simple: reset between runs | Two independent lifecycle objects |
| Disk usage | One repo copy | Two repo copies |
| Environmental drift risk | None — identical binary, env vars | Possible if process spawns diverge |
| Test oracle port conflicts | None — sequential execution | Possible if tests bind to fixed ports |
| Wall-clock time | A time + B time | max(A time, B time) |
Chosen: sequential. Evaluation fidelity — identical environment for both contestants — is more important than wall-clock speed for a tool used interactively or in CI. The time saving from parallel execution does not justify the added implementation complexity or the risk of environmental bias between the two sandboxes.
6.4 Two-Layer Scoring vs. Single LLM Judge¶
A pure LLM judge has a documented blind spot: it rewards fluency and apparent confidence,
not structural correctness. Empirically, high-scoring LLM-reviewed outputs routinely omit
type hints, swallow errors with `except Exception: pass`, and rely on magic numbers — all
patterns the LLM judge marks as acceptable because they read naturally. The rubric layer
catches these mechanically and deterministically.
| Scoring Layer | Catches | Misses |
|---|---|---|
| LLM-as-judge | Semantic correctness, task alignment, code intent, logical completeness | Structural standards, missing annotations, bare excepts, naming violations |
| Rubric (coding_standards) | Style & Formatting, Type Safety, Naming & Structure, Error Handling, Testing, Security & Privacy, ML Reproducibility, Deployment Readiness | Semantic correctness, business logic correctness |
| Both combined | Full spectrum — semantic AND structural | — |
7. Consequences¶
7.1 Positive Outcomes¶
| Outcome | Mechanism |
|---|---|
| Real, diverse task corpus | Every commit in the repo is a potential eval case with no pre-labeling |
| Version-controlled ground truth | Gold patch is in git history; immutable and auditable |
| No synthetic benchmark staleness | Each evaluation targets actual production code, not toy problems |
| Reuses existing eval infrastructure | evaluate_with_llm(), Scorer, rubrics, reporters unchanged |
| Domain-specific standards | Existing coding_standards.yaml rubric already calibrated for this codebase |
7.2 Trade-offs and Risks¶
| Risk | Severity | Mitigation |
|---|---|---|
| Requirements extraction quality depends on commit message quality | Medium | Flag low-confidence extractions; allow manual --requirements-file override |
| git worktree incompatible with sparse checkout or certain submodule configs | Low | Detect and warn at SandboxManager.__init__; document known incompatibilities |
| Test oracle absent if commit doesn't include test changes | Low | Test phase is optional; LLM + rubric scoring still runs without it |
| LLM judge latency adds significant wall-clock time per eval | Low | Run scoring asynchronously after both trials complete; cache by (sha, contestant_hash) |
| Sequential execution means total runtime is A + B elapsed time | Low | Stream phase progress events so UI stays responsive; add per-contestant timeout |
| Gold patch contains implementation details that bias the LLM judge | Low | Strip the gold patch from the requirements_prompt template; pass it only to the scorer |
8. Alternatives Considered¶
| Alternative | Rejected Because |
|---|---|
| Docker containers for sandbox isolation | 5+ second startup per run; requires Docker Desktop on Windows; OS-level isolation exceeds requirements for single-repo eval |
| Two worktrees in parallel | Added implementation complexity; environmental drift risk between sandboxes undermines evaluation parity |
| GitHub Actions runners as sandboxes | Minutes to provision; no offline support; incompatible with local iterative use |
| `difflib` similarity score as evaluation | Token-level similarity does not capture semantic correctness; brittle to formatting changes; rewards verbosity |
| Human review board | Not scalable; introduces reviewer bias; incompatible with automated CI evaluation |
| Single LLM judge (no rubric) | Misses structural violations; documented blind spot for annotation and error-handling standards |
| Single rubric (no LLM judge) | Cannot assess semantic correctness, task alignment, or logical completeness |
| promptfoo as the orchestration layer | Node.js dependency; no git worktree integration; no patch-level comparison; no private-commit support |
9. References¶
| Citation | Relevance |
|---|---|
| Jimenez, Yang, Wettig, Yao, Pei, Press, Narasimhan — SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (ICLR 2024, arXiv:2310.06770) | Original commit-as-ground-truth benchmark; establishes 1.96% RAG baseline, test oracle methodology |
| Badertdinov et al. — SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents (NeurIPS 2025, arXiv:2505.20411) | Automated LLM extraction from 450K+ PRs; 21K+ tasks; decontamination evidence |
| Wang et al. — SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks (arXiv:2512.17419, 2025) | Multi-language scalable commit extraction; base/before/after snapshot model |
| Zhang et al. — SWE-bench Live (arXiv:2505.23419) | Continuously updatable commit corpus; broader repo coverage than fixed benchmarks |
| Aider Team — SWE-bench technical results (aider.chat, 2024) | 26.3% on SWE-bench Lite using git repo map + AST; validates commit-driven eval as de facto standard |
| `tools/agents/benchmarks/llm_evaluator.py` | Existing `evaluate_with_llm()` function (5 dimensions, 0–10 scale, weighted) |
| `agentic-v2-eval/src/agentic_v2_eval/scorer.py` | Existing `Scorer` class (rubric-driven, weighted 0.0–1.0) |
| `agentic-v2-eval/src/agentic_v2_eval/rubrics/coding_standards.yaml` | Default structural rubric for Python code evaluation |
| git-scm.com — git-worktree documentation | Worktree semantics, checkout -f HEAD reset behavior, remove --force cleanup |
10. Decision Map¶
┌──────────────────────────────────────────────────────────────────────┐
│ ADR-010 DECISION MAP │
│ │
│ Evaluation Task Source │
│ ├── Synthetic benchmark (HumanEval, MBPP) ──────── REJECTED │
│ ├── Manual task specification ──────────────────── REJECTED │
│ └── Real git commit (git show <sha>) ────────────── CHOSEN │
│ │
│ Sandbox Isolation │
│ ├── Docker container ────────────────────────────── REJECTED │
│ ├── Two worktrees in parallel ───────────────────── REJECTED │
│ └── One git worktree, sequential ────────────────── CHOSEN │
│ │
│ Scoring │
│ ├── difflib similarity score ────────────────────── REJECTED │
│ ├── LLM-as-judge only ───────────────────────────── REJECTED │
│ ├── Rubric only ─────────────────────────────────── REJECTED │
│ └── LLM-as-judge + rubric (two layers) ──────────── CHOSEN │
│ │
│ Scope │
│ ├── Public repos only (SWE-bench) ───────────────── INSUFFICIENT │
│ └── Own private repo + arbitrary commit SHA ─────── CHOSEN │
└──────────────────────────────────────────────────────────────────────┘