LLM-as-Judge¶
When an artifact's quality cannot be reduced to a unit-test predicate — prose, summaries, code reviews, design rationales — Agentic Runtimes delegates scoring to a language model. The judge is itself a workflow step with strict input/output contracts; its score becomes a first-class field on the run record and feeds the production gating thresholds.
This page covers what the judge does, how it is prompted, and how the 0.0–10.0 score is normalized.
What the judge actually scores¶
The judge does not produce a single subjective verdict. It is invoked once per rubric dimension — coherence, fluency, relevance, groundedness, faithfulness — and returns an integer 1–5 for that dimension only. The harness then:
- Normalizes each per-dimension score to 0.0–1.0 with (score - 1) / 4
- Aggregates per the rubric's policy (mean, min, weighted average, or per-dimension gating)
- Multiplies by 10 to produce the final 0.0–10.0 result reported to callers
This indirection matters. A single overall "rate this 0–10" score collapses signal that is actually orthogonal — a response can be fluent but ungrounded, or coherent but irrelevant — and the gating logic depends on that orthogonality to make safe ship/quarantine decisions.
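As a worked sketch of the pipeline above (the five dimension scores here are illustrative, not from a real run):

```python
# Illustrative raw judge scores (integers 1-5), one per rubric dimension
raw = {"coherence": 5, "fluency": 4, "relevance": 4,
       "groundedness": 3, "faithfulness": 5}

# Step 1: normalize each score to 0.0-1.0
normalized = {d: (s - 1) / 4.0 for d, s in raw.items()}

# Step 2: aggregate with the default "mean" policy
agg = sum(normalized.values()) / len(normalized)  # 0.8

# Step 3: scale to the 0.0-10.0 range reported to callers
final = agg * 10.0  # 8.0
```

Note how the per-dimension breakdown survives step 1: the weak groundedness score (0.5) is still visible to any per-dimension gate even though the mean smooths it out.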
The judge protocol¶
For each dimension the harness sends two messages to the judge model:
- System message — establishes the judge's role and output contract:
  You are an expert evaluator. Your goal is to evaluate the {dimension} of a response. Output ONLY a single number from 1 to 5 on the last line of your reply.
- User message — the dimension's prompt template (from quality.yaml) populated with the input and the response under test.
The harness then parses the integer score from the last line of the reply. Any non-integer reply is logged, retried once with a stricter re-prompt, and on second failure escalated to the run's error stream without contaminating the score.
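A minimal sketch of that parse-and-retry logic (the function names, the `strict` flag, and the callback parameters are illustrative, not the harness's actual API):

```python
import re

def parse_judge_score(reply: str):
    """Return the integer 1-5 from the last non-empty line, or None."""
    lines = [ln.strip() for ln in reply.strip().splitlines() if ln.strip()]
    if lines and re.fullmatch(r"[1-5]", lines[-1]):
        return int(lines[-1])
    return None

def judge_dimension(call_judge, log, error_stream):
    # First attempt, then exactly one stricter re-prompt; a second
    # failure goes to the error stream and never enters the score.
    score = parse_judge_score(call_judge(strict=False))
    if score is None:
        log("non-integer judge reply; retrying with stricter re-prompt")
        score = parse_judge_score(call_judge(strict=True))
    if score is None:
        error_stream("judge failed twice for this dimension")
    return score
```

Requiring the number alone on the last line keeps the parse unambiguous even when the judge's step-by-step reasoning mentions other digits.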
Example — coherence dimension¶
The packaged quality.yaml defines coherence as:
Coherence refers to the logical and orderly presentation of ideas in a response. A coherent response flows naturally, connects ideas logically, maintains consistent focus on the topic, and presents information in a well-structured manner that is easy for the reader to follow.
The five-level rubric:
| Score | Label | Description |
|---|---|---|
| 1 | Incoherent | Lacks any logical structure; ideas are disconnected or contradictory |
| 2 | Mostly Incoherent | Major coherence issues, frequent topic shifts, weak logical connections |
| 3 | Somewhat Coherent | Basic structure but noticeable gaps in logic or abrupt transitions |
| 4 | Coherent | Well-organized with clear logical flow; minor issues do not impede understanding |
| 5 | Highly Coherent | Excellent logical flow, smooth transitions, consistent focus, well-structured presentation |
The full prompt template embeds the query and response in a # Data
block and ends with a # Task instruction that requires step-by-step
reasoning followed by a single number on the last line.
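Assembling the two messages can be sketched as follows (the {query}/{response} placeholder names inside the template are assumptions about the quality.yaml format):

```python
SYSTEM = ("You are an expert evaluator. Your goal is to evaluate the "
          "{dimension} of a response. Output ONLY a single number from 1 "
          "to 5 on the last line of your reply.")

def build_judge_messages(dimension, template, query, response):
    # template: this dimension's prompt from quality.yaml, which embeds
    # the inputs in a "# Data" block and ends with a "# Task" instruction.
    return [
        {"role": "system", "content": SYSTEM.format(dimension=dimension)},
        {"role": "user",
         "content": template.format(query=query, response=response)},
    ]
```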
Score normalization¶
# Per-dimension raw score: 1, 2, 3, 4, or 5 (rejected otherwise)
normalized = (raw_score - 1) / 4.0  # → 0.0, 0.25, 0.5, 0.75, 1.0

# Aggregate across dimensions per the rubric's policy
match policy:
    case "mean":
        agg = mean(normalized_per_dim)
    case "min":
        agg = min(normalized_per_dim)
    case "weighted":
        agg = sum(w * n for w, n in zip(weights, normalized_per_dim))
    case "per_dimension":
        agg = {dim: n for dim, n in zip(dims, normalized_per_dim)}

# Final 0.0–10.0 score
final = agg * 10.0
The default aggregation policy is mean. min is the strictest
useful choice — it lets any one weak dimension veto the artifact —
and is the better setting when the gate sits downstream of a
high-stakes production decision.
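To see the difference, compare the two policies on the same illustrative normalized scores, where one dimension is weak:

```python
# Illustrative normalized per-dimension scores after (raw - 1) / 4
normalized = {"coherence": 1.0, "fluency": 1.0, "groundedness": 0.25}

# mean hides the weak dimension behind the strong ones
mean_final = sum(normalized.values()) / len(normalized) * 10.0  # 7.5

# min lets the weak dimension veto the artifact
min_final = min(normalized.values()) * 10.0                     # 2.5
```

A gate set at, say, 6.0 would ship this artifact under mean and quarantine it under min.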
Reproducibility controls¶
LLM judging is non-deterministic by default. The harness offers three levers to tighten reproducibility:
- Temperature. Judge calls run at temperature=0.0 unless the caller overrides. Lower temperature reduces variance in the integer score.
- Self-consistency. Setting judge.samples=N runs the judge N times per dimension and takes the median. The default is 1; raising to 3 typically narrows the variance to ±0.5 on the final 0–10 score.
- Provider pinning. The model router can be told to pin a single provider/model for judge calls so different runs of the same artifact are scored on the same scale. Without pinning, model substitution can cause score drift across runs.
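The self-consistency lever reduces to a median over repeated calls; a minimal sketch, where sample_fn stands in for one judge invocation returning an integer score (the name is illustrative):

```python
from statistics import median

def self_consistent_score(sample_fn, n=3):
    """Call the judge n times for one dimension; take the median score."""
    return median(sample_fn() for _ in range(n))
```

The median is preferred over the mean here because it keeps the per-dimension score on the original integer scale when n is odd and is robust to a single outlier reply.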
All three are configured in the workflow's evaluation: block, not in
the rubric YAML — the rubric describes what to score, the
configuration block describes how strictly.
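A hypothetical evaluation: block tying the three levers together might look like this — only judge.samples and the temperature default are documented on this page; the pinning key names are assumptions and the router's actual keys may differ:

```yaml
evaluation:
  judge:
    temperature: 0.0      # documented default; override here if needed
    samples: 3            # self-consistency: median of 3 runs per dimension
    # Hypothetical pinning keys:
    pin_provider: true
    model: <provider>/<model-id>
```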
Where this lands in the run record¶
The judge score appears in two places:
- outputs.json — top-level rubric_score (0.0–1.0) and a rubric_breakdown map of per-dimension scores
- steps/<judge-step>/ — full prompt, raw reply, parsed score, and any retry traces, for audit
Both files are version-controlled artifacts of the run. Production gating reads the top-level score; auditors read the step record to understand how it was produced.
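An illustrative outputs.json fragment — the rubric_score and rubric_breakdown field names are as described above, but the values and exact shape here are made up:

```json
{
  "rubric_score": 0.8,
  "rubric_breakdown": {
    "coherence": 1.0,
    "fluency": 0.75,
    "relevance": 0.75,
    "groundedness": 0.5,
    "faithfulness": 1.0
  }
}
```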
What to read next¶
- Rubrics — the eight YAML rubrics the judge consumes
- Production Gating — the thresholds that turn a 0.0–10.0 score into a ship or quarantine decision
- ADR-009 — Scoring Enhancements — historical rationale for the per-dimension protocol
- ADR-010 — Eval Harness Methodology — architectural decisions behind the runner