LLM-as-Judge

When an artifact's quality cannot be reduced to a unit-test predicate — prose, summaries, code reviews, design rationales — Agentic Runtimes delegates scoring to a language model. The judge is itself a workflow step with strict input/output contracts; its score becomes a first-class field on the run record and feeds the production gating thresholds.

This page covers what the judge does, how it is prompted, and how its raw scores are normalized into the final 0.0–10.0 result.

What the judge actually scores

The judge does not produce a single subjective verdict. It is invoked once per rubric dimension — coherence, fluency, relevance, groundedness, faithfulness — and returns an integer 1–5 for that dimension only. The harness then:

  1. Normalizes each per-dimension score to 0.0–1.0 with (score - 1) / 4
  2. Aggregates per the rubric's policy (mean, min, weighted average, or per-dimension gating)
  3. Multiplies by 10 to produce the final 0.0–10.0 result reported to callers
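
For example, hypothetical raw scores of 5, 4, 2, 5, and 3 across the five dimensions normalize to 1.0, 0.75, 0.25, 1.0, and 0.5; a mean aggregation yields 0.7, which is reported as 7.0.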

This indirection matters. A single overall "rate this 0–10" score collapses signal that is actually orthogonal — a response can be fluent but ungrounded, or coherent but irrelevant — and the gating logic depends on that orthogonality to make safe ship/quarantine decisions.

The judge protocol

For each dimension the harness sends two messages to the judge model:

  1. System message — establishes the judge's role and output contract:

You are an expert evaluator. Your goal is to evaluate the {dimension} of a response. Output ONLY a single number from 1 to 5 on the last line of your reply.

  2. User message — the dimension's prompt template (from quality.yaml) populated with the input and the response under test.

The harness then parses the integer score from the last line of the reply. Any non-integer reply is logged, retried once with a stricter re-prompt, and on second failure escalated to the run's error stream without contaminating the score.
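
A minimal sketch of this call-and-parse loop, assuming a generic call_model hook (the harness's real client API is not shown here; on the second failure the real harness escalates to the error stream rather than raising):

import re

SYSTEM_TEMPLATE = (
    "You are an expert evaluator. Your goal is to evaluate the {dimension} of a "
    "response. Output ONLY a single number from 1 to 5 on the last line of your reply."
)

def judge_dimension(call_model, dimension: str, user_prompt: str) -> int:
    """Score one rubric dimension, retrying once on an unparseable reply."""
    prompt = user_prompt
    for attempt in range(2):
        reply = call_model(system=SYSTEM_TEMPLATE.format(dimension=dimension),
                           user=prompt, temperature=0.0)
        lines = reply.strip().splitlines() or [""]
        last_line = lines[-1].strip()
        if re.fullmatch(r"[1-5]", last_line):
            return int(last_line)
        # First failure: retry once with a stricter re-prompt; second failure: escalate.
        prompt = user_prompt + "\n\nReply with ONLY a single integer from 1 to 5 on the last line."
    raise ValueError(f"judge returned no parseable 1-5 score for {dimension!r}")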

Example — coherence dimension

The packaged quality.yaml defines coherence as:

Coherence refers to the logical and orderly presentation of ideas in a response. A coherent response flows naturally, connects ideas logically, maintains consistent focus on the topic, and presents information in a well-structured manner that is easy for the reader to follow.

The five-level rubric:

Score | Label              | Description
1     | Incoherent         | Lacks any logical structure; ideas are disconnected or contradictory
2     | Mostly Incoherent  | Major coherence issues, frequent topic shifts, weak logical connections
3     | Somewhat Coherent  | Basic structure but noticeable gaps in logic or abrupt transitions
4     | Coherent           | Well-organized with clear logical flow; minor issues do not impede understanding
5     | Highly Coherent    | Excellent logical flow, smooth transitions, consistent focus, well-structured presentation

The full prompt template embeds the query and response in a # Data block and ends with a # Task instruction that requires step-by-step reasoning followed by a single number on the last line.
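
For orientation, the assembled user message might look roughly like the sketch below. The exact wording and field names come from the dimension's entry in quality.yaml, so everything here beyond the # Data / # Task structure is an assumption:

# Hypothetical assembly of the coherence user message; the placeholder names
# (coherence_definition, coherence_rubric, query, response_under_test) are illustrative.
USER_TEMPLATE = """{definition}

Rating rubric:
{rubric_table}

# Data
Query: {query}
Response: {response}

# Task
Reason step by step about the coherence of the response, then output a single
number from 1 to 5 on the last line of your reply.
"""

user_prompt = USER_TEMPLATE.format(
    definition=coherence_definition,
    rubric_table=coherence_rubric,
    query=query,
    response=response_under_test,
)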

Score normalization

from statistics import mean

# Per-dimension raw score: 1, 2, 3, 4, or 5 (rejected otherwise)
# Each raw score maps to (score - 1) / 4 → 0.0, 0.25, 0.5, 0.75, 1.0
normalized_per_dim = [(raw - 1) / 4.0 for raw in raw_scores]

# Aggregate across dimensions per rubric policy
match policy:
    case "mean":            agg = mean(normalized_per_dim)
    case "min":             agg = min(normalized_per_dim)
    case "weighted":        agg = sum(w * n for w, n in zip(weights, normalized_per_dim))
    case "per_dimension":   agg = {dim: n for dim, n in zip(dims, normalized_per_dim)}

# Final 0.0–10.0 score (per_dimension instead yields a map gated dimension-by-dimension)
final = agg * 10.0

The default aggregation policy is mean. min is the strictest useful choice — it lets any one weak dimension veto the artifact — and is the right default when the gate is downstream of a high-stakes production decision.
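
Continuing the hypothetical scores above (5, 4, 2, 5, 3), min would report 2.5 rather than the mean's 7.0: the single weak dimension (raw 2, normalized 0.25) vetoes an otherwise strong artifact.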

Reproducibility controls

LLM judging is non-deterministic by default. The harness offers three levers to tighten reproducibility:

  • Temperature. Judge calls run at temperature=0.0 unless the caller overrides. Lower temperature reduces variance in the integer score.
  • Self-consistency. Setting judge.samples=N runs the judge N times per dimension and takes the median. The default is 1; raising to 3 typically narrows the variance to ±0.5 on the final 0–10 score.
  • Provider pinning. The model router can be told to pin a single provider/model for judge calls so different runs of the same artifact are scored on the same scale. Without pinning, model substitution can cause score drift across runs.

All three are configured in the workflow's evaluation: block, not in the rubric YAML — the rubric describes what to score, the configuration block describes how strictly.
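
As an illustration of the self-consistency lever, a sketch that reuses the judge_dimension helper above; the actual harness wiring may differ:

from statistics import median

def judge_with_self_consistency(call_model, dimension, user_prompt, samples=1):
    """Run the judge `samples` times for one dimension and take the median raw score."""
    scores = [judge_dimension(call_model, dimension, user_prompt) for _ in range(samples)]
    return median(scores)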

Where this lands in the run record

The judge score appears in two places:

  1. outputs.json — top-level rubric_score: 0.0–1.0 and a rubric_breakdown: map of per-dimension scores
  2. steps/<judge-step>/ — full prompt, raw reply, parsed score, and any retry traces, for audit

Both files are version-controlled artifacts of the run. Production gating reads the top-level score; auditors read the step record to understand how it was produced.
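
For concreteness, the top-level fields in outputs.json might look roughly like this under the default mean policy (all values illustrative, not taken from a real run):

{
  "rubric_score": 0.7,
  "rubric_breakdown": {
    "coherence": 0.75,
    "fluency": 1.0,
    "relevance": 0.5,
    "groundedness": 0.75,
    "faithfulness": 0.5
  }
}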