ADR-010: Commit-Driven A/B Agent Evaluation Harness — Methodology & Design¶
| Field | Value |
|---|---|
| ID | ADR-010 |
| Status | 🟡 Proposed |
| Date | 2026-03-06 |
| System | tools/commit_eval · agentic-workflows-v2 · agentic-v2-eval |
| Authors | Platform Engineering |
| Reviewers | Research Infra, ML Platform, Security |
| Supersedes | (none) |
1. TL;DR¶
We adopt a SWE-bench–inspired evaluation pattern pointed at our own commits: extract requirements from `git show`, create one git worktree at `<commit>~1` (the pre-task state), run two contestant strategies sequentially (A → reset → B), extract git patches, and score both using a two-layer system — LLM-as-judge for semantic quality plus rubric-driven static analysis for structural standards. No Docker. One worktree, sequential execution, real commits as ground truth.
2. Status History¶
| Date | Status | Note |
|---|---|---|
| 2026-03-06 | 🟡 Proposed | Initial design — commit-driven A/B harness |
3. Context & Problem Statement¶
The platform team has built a multi-agent workflow runtime capable of generating code. The pressing question is: how do we objectively measure whether one agent, workflow, or prompt strategy produces better code than another?
┌──────────────────────────────────────────────────────────────────┐
│ THREE QUESTIONS THE EVAL HARNESS MUST ANSWER │
├──────────────────────────────────────────────────────────────────┤
│ Q1 │ What is ground-truth "correct" output for a code task? │
│ Q2 │ How do we isolate two strategies to compare them fairly? │
│ Q3 │ How do we score a generated patch vs. the gold standard? │
└──────────────────────────────────────────────────────────────────┘
3.1 Existing Approaches and Their Limits¶
| Approach | Problem |
|---|---|
| Manual code review | Relies on expert judgment; not repeatable; introduces reviewer bias; doesn't scale |
| Synthetic benchmarks (HumanEval, MBPP) | Toy problems; doesn't reflect real production code patterns or domain conventions |
| Hold-out test sets | Requires pre-labeling; quickly goes stale; task diversity is narrow |
| Arbitrary prompts with LLM judge | No ground truth; "good" is undefined without a reference implementation |
| SWE-bench (public) | Docker-based; public repos only; fixed 2,294 pre-labeled issues; cannot evaluate own repo commits |
3.2 The SWE-bench Insight — Commits as Ground Truth¶
Jimenez, Yang, Wettig, Yao, Pei, Press, and Narasimhan introduced SWE-bench at ICLR 2024 (arXiv:2310.06770), building a benchmark of 2,294 real GitHub issues from 12 popular Python repositories. Every instance is a pull request that:
- Is associated with an issue (the task specification)
- Modifies one or more test files (the test oracle)
- Has an implementation diff (the gold standard)
Gold solutions edit an average of 1.7 files, 3.0 functions, and 32.8 lines. The initial RAG-only baseline solved just 1.96% of instances; even the "oracle" condition (given the exact files to edit) reached only 4.80%. This intentionally high bar forced the field toward proper agent scaffolding and validated the commit-as-ground-truth structure as a rigorous, non-trivial evaluation surface.
┌───────────────────────────────────────────────────────┐
│ COMMIT AS GROUND TRUTH (the SWE-bench insight) │
│ │
│ git show <sha>: │
│ commit message → task specification │
│ diff → gold standard implementation │
│ test changes → test oracle │
│ <sha>~1 state → pre-task baseline │
└───────────────────────────────────────────────────────┘
3.3 Extension to Own Repos — The Automation Gap¶
SWE-rebench (Badertdinov et al., NeurIPS 2025, arXiv:2505.20411) demonstrated that the SWE-bench pattern can be fully automated and extended to arbitrary repositories. The pipeline filters approximately 450,000 pull requests across 30,000+ repositories and uses LLM-driven extraction to produce 21,000+ verified interactive tasks. A key finding: models like GPT-4.1 show a measurable performance drop on March–April 2025 tasks compared to earlier dates, suggesting contamination in fixed benchmarks. This validates the need for continuously updated, private-commit evaluation.
SWE-Bench++ (Wang et al., arXiv:2512.17419, 2025) further demonstrated scalable benchmark generation across 11 languages and 3,971 repositories, achieving 137% higher environment yield than the SetUpAgent baseline on Python repos by using template-guided Dockerfile synthesis and adaptive test log parsing.
The gap: no existing tool combines (1) own private repo, (2) arbitrary commit SHA, (3) own workflow/prompt/agent strategies as contestants, (4) own domain rubrics, and (5) real-time streaming results to a web UI.
4. Decision¶
4.1 Commit Extraction via git show¶
`CommitExtractor` runs `git show --stat --patch <sha>` and extracts:
- Commit message body → task context
- Full unified diff → `gold_patch`
- File list from `--stat` → `affected_files`
- Test files (heuristic: `test_*.py`, `*_test.py`, paths under `tests/`) → `test_oracle_files`
- An LLM call via `LLMClient.generate_text()` synthesizes a clean `requirements_prompt`

These fields populate a `TaskInstance`; a minimal sketch follows below.
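A minimal sketch of the `TaskInstance` container these fields land in. Only the field names listed above are taken from this ADR; defaults and ordering are assumptions.

```python
# Sketch only: field names come from the extraction list above; defaults and
# anything not listed there are assumptions, not the final model.
from dataclasses import dataclass, field


@dataclass
class TaskInstance:
    commit_sha: str                                   # evaluated commit
    repo_path: str                                    # local path or bare clone of the repo
    requirements_prompt: str                          # LLM-synthesized task specification
    gold_patch: str                                   # full unified diff from git show
    affected_files: list[str] = field(default_factory=list)      # parsed from --stat
    test_oracle_files: list[str] = field(default_factory=list)   # heuristic test-file match
```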
Extraction prompt template:
Given the following git commit message and code diff from a production repository,
extract a concise but complete requirements specification that captures:
1. The problem being solved or feature being added
2. The constraints and acceptance criteria
3. The expected behavior changes visible in the diff
Focus on what, not how — describe the task as it would be given to a developer
who has not seen the solution.
Commit message: {commit_message}
Diff (truncated to {max_tokens} tokens):
{diff}
Supports both local paths and GitHub URLs. For GitHub URLs, the extractor first runs `git clone --bare <url> /tmp/eval-repo-{uuid}` and then proceeds as for a local path.
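A minimal sketch of the extraction flow, assuming the `TaskInstance` fields above. The `_git_show` helper, the call signature of `LLMClient.generate_text()`, and the character-level diff truncation are illustrative, not the final `CommitExtractor` API.

```python
# Illustrative sketch of the extraction flow; helper names and the truncation
# strategy are assumptions. TaskInstance is the dataclass sketched above;
# EXTRACTION_PROMPT is the template shown above, stored as a module constant.
import re
import subprocess

TEST_FILE_RE = re.compile(r"(^|/)(test_[^/]+\.py|[^/]+_test\.py)$|(^|/)tests/")


def _git_show(repo_path: str, sha: str) -> str:
    """Raw output of `git show --stat --patch <sha>`."""
    return subprocess.run(
        ["git", "-C", repo_path, "show", "--stat", "--patch", sha],
        check=True, capture_output=True, text=True,
    ).stdout


def extract_task(repo_path: str, sha: str, llm_client) -> "TaskInstance":
    raw = _git_show(repo_path, sha)
    # Crude split: everything before the first file header is message + stat.
    message, _, rest = raw.partition("\ndiff --git ")
    gold_patch = "diff --git " + rest if rest else ""
    affected = re.findall(r"^diff --git a/(\S+) b/", gold_patch, flags=re.M)
    requirements = llm_client.generate_text(          # existing LLMClient entry point
        EXTRACTION_PROMPT.format(
            commit_message=message,
            max_tokens=8_000,
            diff=gold_patch[:32_000],                 # char-level stand-in for token truncation
        )
    )
    return TaskInstance(
        commit_sha=sha,
        repo_path=repo_path,
        requirements_prompt=requirements,
        gold_patch=gold_patch,
        affected_files=affected,
        test_oracle_files=[f for f in affected if TEST_FILE_RE.search(f)],
    )
```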
4.2 Sandbox Isolation via git worktree (not Docker)¶
SandboxManager creates one git worktree at the pre-commit state:
git worktree add /tmp/eval-{uuid} <sha>~1        # isolate at pre-task state
[run Contestant A — agent writes files]
git -C /tmp/eval-{uuid} checkout -f HEAD         # restore tracked files to the pre-task state
git -C /tmp/eval-{uuid} clean -fd                # remove untracked files Contestant A created
[run Contestant B — agent writes files]
git worktree remove /tmp/eval-{uuid} --force     # cleanup
Used as a context manager:
sandbox = SandboxManager(repo_path, pre_commit_sha)
with sandbox as worktree_path:
    trial_a = await runner.run(contestant_a, task, worktree_path)
    sandbox.reset()
    trial_b = await runner.run(contestant_b, task, worktree_path)
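A minimal sketch of `SandboxManager` consistent with the commands above; attribute names and the `_git` helper are assumptions. The reset also runs `git clean -fd` so untracked files written by Contestant A do not leak into Contestant B's run.

```python
# Illustrative sketch of SandboxManager; attribute names and the _git helper
# are assumptions of this sketch.
import subprocess
import uuid
from pathlib import Path


class SandboxManager:
    def __init__(self, repo_path: str, pre_commit_sha: str):
        self.repo_path = repo_path
        self.pre_commit_sha = pre_commit_sha            # typically "<sha>~1"
        self.worktree_path = Path(f"/tmp/eval-{uuid.uuid4()}")

    def _git(self, *args: str, cwd: Path | str | None = None) -> None:
        subprocess.run(["git", "-C", str(cwd or self.repo_path), *args],
                       check=True, capture_output=True, text=True)

    def __enter__(self) -> Path:
        # Create the worktree detached at the pre-task state.
        self._git("worktree", "add", str(self.worktree_path), self.pre_commit_sha)
        return self.worktree_path

    def reset(self) -> None:
        # Discard Contestant A's output: restore tracked files, then drop
        # untracked files the agent created (checkout -f alone leaves them).
        self._git("checkout", "-f", "HEAD", cwd=self.worktree_path)
        self._git("clean", "-fd", cwd=self.worktree_path)

    def __exit__(self, *exc) -> None:
        self._git("worktree", "remove", str(self.worktree_path), "--force")
```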
4.3 Sequential Contestant Execution (one worktree, not two)¶
┌────────────────────────────────────────────────────────────┐
│ SEQUENTIAL EVALUATION FLOW │
│ │
│ worktree @ sha~1 │
│ │ │
│ ├── [Run Contestant A] │
│ │ agent / workflow / prompt writes files │
│ │ git diff HEAD → patch_a │
│ │ pytest oracle → test_results_a │
│ │ │
│ ├── git checkout -f HEAD (reset — microseconds) │
│ │ │
│ ├── [Run Contestant B] │
│ │ agent / workflow / prompt writes files │
│ │ git diff HEAD → patch_b │
│ │ pytest oracle → test_results_b │
│ │ │
│ └── git worktree remove --force │
└────────────────────────────────────────────────────────────┘
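A minimal sketch of the per-contestant patch and test-oracle steps in the flow above, assuming the `pytest-json-report` dependency from section 5. The helper names and the stage/unstage trick are illustrative, not the final `PatchExtractor`/`TestRunner` API.

```python
# Illustrative sketch of the PatchExtractor / TestRunner steps; helper names and
# the stage/unstage trick are assumptions. The pytest flags come from the
# pytest-json-report plugin added in section 5.
import json
import subprocess
from pathlib import Path


def _git(worktree: Path, *args: str) -> str:
    return subprocess.run(["git", "-C", str(worktree), *args],
                          check=True, capture_output=True, text=True).stdout


def extract_patch(worktree: Path) -> str:
    """Capture everything the contestant wrote as a unified diff against HEAD."""
    _git(worktree, "add", "-A")                    # stage new files so they appear in the diff
    patch = _git(worktree, "diff", "--cached", "HEAD")
    _git(worktree, "reset", "HEAD")                # unstage again; sandbox.reset() cleans up
    return patch


def run_test_oracle(worktree: Path, test_files: list[str]) -> dict | None:
    """Run the commit's own test files, if any; test failures are data, not errors."""
    if not test_files:
        return None                                # test phase is optional (see section 7.2)
    report = worktree / ".pytest-report.json"      # written after the patch is captured
    subprocess.run(
        ["pytest", *test_files, "--json-report", f"--json-report-file={report}"],
        cwd=worktree, check=False,
    )
    return json.loads(report.read_text()) if report.exists() else None
```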
4.4 Two-Layer Scoring¶
┌─────────────────────────────────────────────────────────────┐
│ LAYER 1: LLM-as-Judge (semantic quality) │
│ │
│ tools/agents/benchmarks/llm_evaluator.py:evaluate_with_llm │
│ │
│ Input: requirements_prompt + generated patch + gold patch │
│ Output: EvaluationResult, 0–10 per dimension │
│ │
│ Dimension Weight What it catches │
│ ─────────────── ────── ────────────────────────────── │
│ completeness 25% Task fully addressed? │
│ correctness 25% Code correct vs. gold standard? │
│ quality 20% Clean, readable, idiomatic? │
│ specificity 15% Concrete, not boilerplate? │
│ alignment 15% Matches original intent? │
│ │
│ LAYER 2: Rubric-Driven Static Analysis (structural std.) │
│ │
│ agentic-v2-eval/src/agentic_v2_eval/scorer.py:Scorer │
│ │
│ Input: patch text (raw diff) │
│ Output: ScoringResult, 0.0–1.0 weighted rubric score │
│ Rubric: coding_standards.yaml (Style & Formatting, │
│ Type Safety, Naming & Structure, Error Handling, │
│ Testing, Security & Privacy, ML Reproducibility, │
│ Deployment Readiness) │
└─────────────────────────────────────────────────────────────┘
Both layers are required. The LLM judge evaluates semantic intent and functional
correctness but is known to reward fluency over structure. The rubric layer
mechanically catches missing type annotations, bare except clauses, undocumented
functions, and other project standards that the LLM does not reliably penalize.
Integration note: `evaluate_with_llm()` and `Scorer` exist independently in the codebase today and are not currently wired together. The two-layer scoring pipeline described here is new integration work proposed by this ADR, not reuse of an existing composed system.

Async note: `evaluate_with_llm()` is a synchronous function (regular `def`, not `async def`). Since the `Comparator` orchestrator is async, calls to `evaluate_with_llm()` must be wrapped with `asyncio.to_thread()` to avoid blocking the event loop.
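The sketch below shows one way the `Comparator` could compose the two layers consistent with both notes. The 50/50 blend, the positional arguments passed to `evaluate_with_llm()`, and the result attribute names (`weighted_score`, `score`) are assumptions, since this ADR does not pin down those signatures.

```python
# Illustrative composition of the two layers; import paths mirror the file
# paths listed in this ADR, and the attribute names plus the 50/50 blend are
# assumptions, not pinned-down APIs.
import asyncio

from agentic_v2_eval.scorer import Scorer                              # existing rubric scorer
from tools.agents.benchmarks.llm_evaluator import evaluate_with_llm    # existing, synchronous


async def score_trial(requirements: str, patch: str, gold_patch: str, scorer: Scorer) -> dict:
    # Layer 1: LLM-as-judge is a regular def, so run it off the event loop.
    llm_result = await asyncio.to_thread(evaluate_with_llm, requirements, patch, gold_patch)

    # Layer 2: rubric-driven static analysis over the raw diff (0.0-1.0).
    rubric_result = scorer.score(patch)

    llm_score = llm_result.weighted_score / 10.0        # normalize 0-10 to 0-1
    return {
        "llm": llm_score,
        "rubric": rubric_result.score,
        "combined": 0.5 * llm_score + 0.5 * rubric_result.score,
    }
```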
5. Files Changed¶
| File | Purpose |
|---|---|
| `tools/commit_eval/__init__.py` | Package exports |
| `tools/commit_eval/models.py` | `TaskInstance`, `Contestant`, `Trial`, `ComparisonResult` |
| `tools/commit_eval/extractor.py` | `CommitExtractor` (git show → LLM → `TaskInstance`) |
| `tools/commit_eval/sandbox.py` | `SandboxManager` (git worktree lifecycle context manager) |
| `tools/commit_eval/runner.py` | `ContestantRunner` (workflow / prompt / agent dispatch) |
| `tools/commit_eval/patch.py` | `PatchExtractor` + `TestRunner` |
| `tools/commit_eval/evaluator.py` | `EvalHarness` (two-layer scoring) |
| `tools/commit_eval/comparator.py` | `Comparator` (main orchestrator) |
| `tools/commit_eval/reporter.py` | `ComparisonReporter` (wraps existing HTML/MD reporters) |
| `tools/commit_eval/cli.py` | `typer` CLI entry point |
| `tools/__init__.py` | Add `commit_eval` export |
| `pyproject.toml` (root) | Add `pytest-json-report` dependency; add `agentic-v2-eval` as a cross-package dependency (required for `tools/commit_eval/` to import `agentic_v2_eval.scorer`). Note: `typer >=0.9,<1` is already declared in `agentic-workflows-v2/pyproject.toml`, so it is not a net-new dependency to the workspace. |
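For illustration, a hedged sketch of the `typer` entry point in `tools/commit_eval/cli.py`. The command name, every option name, and the `run_comparison` helper are hypothetical; only the typer API calls themselves are real.

```python
# Hypothetical CLI surface for tools/commit_eval/cli.py; every option name and
# the run_comparison helper are illustrative, not the final interface.
import asyncio
from typing import Optional

import typer

app = typer.Typer(help="Commit-driven A/B evaluation harness")


async def run_comparison(repo: str, sha: str, a: str, b: str,
                         requirements_file: Optional[str]) -> object:
    # Hypothetical wiring of CommitExtractor, SandboxManager, ContestantRunner,
    # EvalHarness, and Comparator as described in section 4.
    ...


@app.command()
def compare(
    repo: str = typer.Option(..., help="Local path or GitHub URL of the target repo"),
    sha: str = typer.Option(..., help="Commit SHA to use as ground truth"),
    contestant_a: str = typer.Option(..., help="Strategy A: workflow / prompt / agent id"),
    contestant_b: str = typer.Option(..., help="Strategy B: workflow / prompt / agent id"),
    requirements_file: Optional[str] = typer.Option(None, help="Manual override for extracted requirements"),
) -> None:
    """Run both contestants against <sha>~1 and print the comparison result."""
    result = asyncio.run(run_comparison(repo, sha, contestant_a, contestant_b, requirements_file))
    typer.echo(result)


if __name__ == "__main__":
    app()
```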
6. Rationale¶
6.1 Production Precedents for Commit-as-Ground-Truth Evaluation¶
| System | Approach | Relevance to This ADR |
|---|---|---|
| SWE-bench (Jimenez et al., ICLR 2024) | 2,294 real GitHub PRs; Docker sandbox; patch vs. test oracle | Direct methodology inspiration; validates commit as verifiable task source |
| SWE-rebench (NeurIPS 2025) | Automated LLM extraction from 450K+ PRs; 21K+ tasks; decontamination-aware | Proves that LLM can reliably extract requirements from git show output at scale |
| SWE-Bench++ (Wang et al., arXiv:2512.17419, 2025) | 11 languages, 3,971 repos; 137% yield improvement; template-guided env setup | Confirms the base/before/after three-snapshot model; shows test oracle generation is automatable |
| Aider SWE-bench (aider.chat, 2024) | 26.3% on SWE-bench Lite (SOTA); uses git repo map + AST analysis | Demonstrates that commit-driven eval is the de facto standard for code agent benchmarking |
| SWE-bench Live (arXiv:2505.23419) | Continuously updatable commit corpus; broader repo coverage | Validates continuously updated private-commit eval as superior to fixed benchmarks |
6.2 git worktree over Docker¶
| Factor | git worktree | Docker container |
|---|---|---|
| Startup time | ~100 ms (worktree checkout) | 5+ seconds per container launch, per published benchmarks |
| Context switch in CI | ~30 seconds (worktree-based) | 10+ minutes (container provision + image pull) |
| Windows support | Native — no daemon required | Requires Docker Desktop; problematic in non-admin environments |
| Reset cost | `git checkout -f HEAD` plus `git clean -fd`; local file operations, sub-second | Destroy + recreate container (seconds) |
| Disk footprint | One extra checkout of tracked files (object store shared with the main repo) | Full container image layer stack |
| Isolation depth | File system only | OS-level (separate namespaces, network stack) |
| Sufficient for single-repo eval | Yes — no cross-repo contamination possible | Overkill — cross-process isolation not needed |
| Required for multi-repo at scale | Not ideal | Yes |
For a single-repo, single-commit evaluation (the targeted use case), file system isolation via git worktree is sufficient. Docker's OS-level isolation guarantees are unnecessary when both contestants run against the same repository with no external service dependencies.
6.3 Sequential over Parallel (one worktree, not two)¶
| Factor | Sequential (one worktree) | Parallel (two worktrees) |
|---|---|---|
| Implementation complexity | Simple: reset between runs | Two independent lifecycle objects |
| Disk usage | One repo copy | Two repo copies |
| Environmental drift risk | None — identical binary, env vars | Possible if process spawns diverge |
| Test oracle port conflicts | None — sequential execution | Possible if tests bind to fixed ports |
| Wall-clock time | A time + B time | max(A time, B time) |
Chosen: sequential. Evaluation fidelity — identical environment for both contestants — is more important than wall-clock speed for a tool used interactively or in CI. The time saving from parallel execution does not justify the added implementation complexity or the risk of environmental bias between the two sandboxes.
6.4 Two-Layer Scoring vs. Single LLM Judge¶
A pure LLM judge has a documented blind spot: it rewards fluency and apparent confidence,
not structural correctness. Empirically, high-scoring LLM-reviewed outputs routinely omit
type hints, swallow errors with `except Exception: pass`, and rely on magic numbers — all
patterns the LLM judge marks as acceptable because they read naturally. The rubric layer
catches these mechanically and deterministically.
| Scoring Layer | Catches | Misses |
|---|---|---|
| LLM-as-judge | Semantic correctness, task alignment, code intent, logical completeness | Structural standards, missing annotations, bare excepts, naming violations |
| Rubric (coding_standards) | Style & Formatting, Type Safety, Naming & Structure, Error Handling, Testing, Security & Privacy, ML Reproducibility, Deployment Readiness | Semantic correctness, business logic correctness |
| Both combined | Full spectrum — semantic AND structural | — |
7. Consequences¶
7.1 Positive Outcomes¶
| Outcome | Mechanism |
|---|---|
| Real, diverse task corpus | Every commit in the repo is a potential eval case with no pre-labeling |
| Version-controlled ground truth | Gold patch is in git history; immutable and auditable |
| No synthetic benchmark staleness | Each evaluation targets actual production code, not toy problems |
| Reuses existing eval infrastructure | evaluate_with_llm(), Scorer, rubrics, reporters unchanged |
| Domain-specific standards | Existing coding_standards.yaml rubric already calibrated for this codebase |
7.2 Trade-offs and Risks¶
| Risk | Severity | Mitigation |
|---|---|---|
| Requirements extraction quality depends on commit message quality | Medium | Flag low-confidence extractions; allow manual --requirements-file override |
| git worktree incompatible with sparse checkout or certain submodule configs | Low | Detect and warn at SandboxManager.__init__; document known incompatibilities |
| Test oracle absent if commit doesn't include test changes | Low | Test phase is optional; LLM + rubric scoring still runs without it |
| LLM judge latency adds significant wall-clock time per eval | Low | Run scoring asynchronously after both trials complete; cache by (sha, contestant_hash) |
| Sequential execution means total runtime is A + B elapsed time | Low | Stream phase progress events so UI stays responsive; add per-contestant timeout |
| Gold patch contains implementation details that bias the LLM judge | Low | Strip the gold patch from the requirements_prompt template; pass it only to the scorer |
8. Alternatives Considered¶
| Alternative | Rejected Because |
|---|---|
| Docker containers for sandbox isolation | 5+ second startup per run; requires Docker Desktop on Windows; OS-level isolation exceeds requirements for single-repo eval |
| Two worktrees in parallel | Added implementation complexity; environmental drift risk between sandboxes undermines evaluation parity |
| GitHub Actions runners as sandboxes | Minutes to provision; no offline support; incompatible with local iterative use |
| `difflib` similarity score as evaluation | Token-level similarity does not capture semantic correctness; brittle to formatting changes; rewards verbosity |
| Human review board | Not scalable; introduces reviewer bias; incompatible with automated CI evaluation |
| Single LLM judge (no rubric) | Misses structural violations; documented blind spot for annotation and error-handling standards |
| Single rubric (no LLM judge) | Cannot assess semantic correctness, task alignment, or logical completeness |
| promptfoo as the orchestration layer | Node.js dependency; no git worktree integration; no patch-level comparison; no private-commit support |
9. References¶
| Citation | Relevance |
|---|---|
| Jimenez, Yang, Wettig, Yao, Pei, Press, Narasimhan — SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (ICLR 2024, arXiv:2310.06770) | Original commit-as-ground-truth benchmark; establishes 1.96% RAG baseline, test oracle methodology |
| Badertdinov et al. — SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents (NeurIPS 2025, arXiv:2505.20411) | Automated LLM extraction from 450K+ PRs; 21K+ tasks; decontamination evidence |
| Wang et al. — SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks (arXiv:2512.17419, 2025) | Multi-language scalable commit extraction; base/before/after snapshot model |
| Zhang et al. — SWE-bench Live (arXiv:2505.23419) | Continuously updatable commit corpus; broader repo coverage than fixed benchmarks |
| Aider Team — SWE-bench technical results (aider.chat, 2024) | 26.3% on SWE-bench Lite using git repo map + AST; validates commit-driven eval as de facto standard |
| `tools/agents/benchmarks/llm_evaluator.py` | Existing `evaluate_with_llm()` function (5 dimensions, 0–10 scale, weighted) |
| `agentic-v2-eval/src/agentic_v2_eval/scorer.py` | Existing `Scorer` class (rubric-driven, weighted 0.0–1.0) |
| `agentic-v2-eval/src/agentic_v2_eval/rubrics/coding_standards.yaml` | Default structural rubric for Python code evaluation |
| git-scm.com — git-worktree documentation | Worktree semantics, checkout -f HEAD reset behavior, remove --force cleanup |
10. Decision Map¶
┌──────────────────────────────────────────────────────────────────────┐
│ ADR-010 DECISION MAP │
│ │
│ Evaluation Task Source │
│ ├── Synthetic benchmark (HumanEval, MBPP) ──────── REJECTED │
│ ├── Manual task specification ──────────────────── REJECTED │
│ └── Real git commit (git show <sha>) ────────────── CHOSEN │
│ │
│ Sandbox Isolation │
│ ├── Docker container ────────────────────────────── REJECTED │
│ ├── Two worktrees in parallel ───────────────────── REJECTED │
│ └── One git worktree, sequential ────────────────── CHOSEN │
│ │
│ Scoring │
│ ├── difflib similarity score ────────────────────── REJECTED │
│ ├── LLM-as-judge only ───────────────────────────── REJECTED │
│ ├── Rubric only ─────────────────────────────────── REJECTED │
│ └── LLM-as-judge + rubric (two layers) ──────────── CHOSEN │
│ │
│ Scope │
│ ├── Public repos only (SWE-bench) ───────────────── INSUFFICIENT │
│ └── Own private repo + arbitrary commit SHA ─────── CHOSEN │
└──────────────────────────────────────────────────────────────────────┘