Production Gating

Production gating is the policy layer between the evaluation harness and the runtime's "this artifact is shippable" verdict. It turns a collection of rubric scores into a single boolean and, when the gate fails, surfaces the dimension that failed so the run can be quarantined rather than silently downgraded.

The headline rule:

An artifact ships only if coverage_score >= 0.80 and every in-scope quality dimension meets its threshold.

The rest of this page expands on what coverage_score is, what the quality dimensions are, and why the thresholds sit where they do.

The four gating dimensions

Every artifact emerges from a workflow with four scored dimensions attached. Each is normalized to 0.0–1.0; the gate threshold for each is configurable per workflow but defaults to the values below.

Dimension    Default threshold    What it asserts
Coverage     0.80                 The artifact addresses all required aspects of the input
Quality      0.70                 The artifact meets the rubric's content quality bar
Agreement    0.70                 Independent judges agree within tolerance (multi-sample)
Recency      0.50                 If the artifact cites sources, those sources are recent enough

coverage_score is the strictest of the four. It is also the only one that defaults to 0.80; the rest sit at 0.70 or below. That asymmetry is deliberate: an artifact that covers every required topic poorly is closer to "needs another pass" than to "do not ship," while an artifact that covers only half the topics, however brilliantly, is still a partial answer and is unsafe to ship.
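
In code, the headline rule and the table above reduce to a single check. The following is a minimal sketch, not the runtime's actual API; DEFAULT_THRESHOLDS and gate are illustrative names, and the scores are assumed to arrive as a flat mapping of dimension name to normalized score:

# Illustrative sketch only; names and shapes are assumptions, not the runtime's API.
DEFAULT_THRESHOLDS = {
    "coverage": 0.80,   # the headline floor
    "quality": 0.70,
    "agreement": 0.70,
    "recency": 0.50,
}

def gate(scores: dict, thresholds: dict = DEFAULT_THRESHOLDS):
    """Return (ships, failed_dimension). Only in-scope dimensions (those present in scores) are checked."""
    for dimension, threshold in thresholds.items():
        if dimension in scores and scores[dimension] < threshold:
            return False, dimension  # surface the failing dimension so the run can be quarantined
    return True, None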

Why coverage_score >= 0.80?

Three reasons drove the 0.80 floor:

  1. Calibration data. Across the harness's regression set, judges reach inter-rater agreement above 0.85 on coverage. Setting the gate at 0.80 leaves headroom for legitimate near-misses without admitting artifacts the judges disagree on.
  2. Asymmetric cost. Shipping an under-covered artifact in a federal evaluation context is meaningfully worse than re-running the workflow. Re-runs cost tokens; misleading artifacts cost trust.
  3. No silent degradation. The gate is enforced as >=, not >. A score of exactly 0.80 ships; a score of 0.7999 does not. The runtime never silently rounds.
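
To make the boundary in point 3 concrete, here is the check applied to both values, reusing the illustrative gate() sketch above:

gate({"coverage": 0.80, "quality": 0.75, "agreement": 0.90})    # (True, None): exactly 0.80 ships
gate({"coverage": 0.7999, "quality": 0.75, "agreement": 0.90})  # (False, "coverage"): never rounded up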

The threshold can be lowered for development or experimentation by setting evaluation.gating.coverage_threshold: 0.70 in the workflow's YAML, but production deployments inherit 0.80 from the package default. Lowering below 0.60 logs a warning at workflow load time.
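
A sketch of what that load-time behavior could look like, assuming the gating block is parsed from the workflow YAML into a plain dict; the loader below is hypothetical, and only the key path, the 0.80 default, and the 0.60 warning floor come from the paragraph above:

import logging

logger = logging.getLogger(__name__)

PACKAGE_DEFAULT_COVERAGE = 0.80  # what production deployments inherit

def load_coverage_threshold(workflow_config: dict) -> float:
    """Hypothetical loader: apply the workflow override, fall back to the package default."""
    gating = workflow_config.get("evaluation", {}).get("gating", {})
    threshold = gating.get("coverage_threshold", PACKAGE_DEFAULT_COVERAGE)
    if threshold < 0.60:
        logger.warning("coverage_threshold %.2f is below 0.60; the gate may admit under-covered artifacts", threshold)
    return threshold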

Quality, agreement, and recency

The remaining three dimensions are gated independently. An artifact must clear all four to ship.

Quality (default 0.70)

Quality is the aggregate of the per-dimension judge scores from quality.yaml — coherence, fluency, relevance, groundedness, faithfulness. The aggregation policy is configurable; mean is the default, min is the strict choice for high-stakes contexts.
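
The two policies differ only in the reduction step. A minimal sketch, assuming the per-dimension judge scores are already normalized to 0.0–1.0 (the values below are made up for illustration):

from statistics import mean

judge_scores = {  # illustrative values, not real judge output
    "coherence": 0.82, "fluency": 0.90, "relevance": 0.78,
    "groundedness": 0.71, "faithfulness": 0.69,
}

quality_mean = mean(judge_scores.values())  # default policy: ~0.78, clears the 0.70 gate
quality_min = min(judge_scores.values())    # strict policy: 0.69, fails the 0.70 gate

The same artifact can ship under mean and be quarantined under min, which is the intended difference between the two policies.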

The 0.70 floor marks the band where independent judges consistently agree the artifact is "useful as written." Below 0.70 the judges start to disagree and their written rationales hedge ("adequate, but..."), which is exactly the band a production gate should reject.

Agreement (default 0.70)

When the rubric requests multi-sample judging (samples >= 2), agreement is the proportion of samples whose scores fall within ±1 of the median. A score of 1.0 means every sample agreed; below 0.70 means the judge itself was inconsistent on this artifact, which is sufficient grounds to re-run.
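
A sketch of that calculation, assuming the per-sample scores are still on the judge's raw integer scale, where a tolerance of ±1 is meaningful; the helper name is illustrative:

from statistics import median

def agreement_score(sample_scores: list) -> float:
    """Fraction of samples whose score falls within ±1 of the median sample score."""
    mid = median(sample_scores)
    within = sum(1 for s in sample_scores if abs(s - mid) <= 1)
    return within / len(sample_scores)

agreement_score([4, 4, 5, 2])  # 0.75: three of four samples sit within ±1 of the median (4.0), so the gate passes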

Recency (default 0.50)

For artifacts that cite external sources (research summaries, RAG-grounded reports), recency scores the freshness of the citations. The 0.50 floor is intentionally permissive — many legitimate citations are years old — but exists to catch artifacts that lean entirely on stale or deprecated sources.
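
How the freshness number is computed is not specified on this page. Purely as an illustration of why the 0.50 floor only catches uniformly stale sourcing, one plausible shape is a per-citation linear decay; everything below (function name, horizon, ages) is an assumption:

def recency_score(citation_ages_years: list, horizon_years: int = 10) -> float:
    """Illustrative only: each citation's freshness decays linearly to zero over an assumed horizon."""
    freshness = [max(0.0, 1 - age / horizon_years) for age in citation_ages_years]
    return sum(freshness) / len(freshness)

recency_score([3, 5])        # 0.6: citations a few years old clear the permissive floor
recency_score([15, 20, 25])  # 0.0: everything stale, which even the 0.50 floor rejects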

Rationale — why thresholds, not pass/fail tests?

Earlier iterations of the runtime gated artifacts on string-match unit tests: "the response must contain X" or "the JSON must have Y." Those tests were brittle. They flagged stylistic variations as failures and admitted hollow templated responses that happened to contain the target substring.

The gating thresholds sidestep both failure modes. The judge scores the artifact's meaning, not its surface form, and the threshold is calibrated against a regression set, so any score that meets it has a known, measurable correspondence to "shippable" in the ground truth.

When a gate fails

A failed gate produces a structured outcome rather than an exception:

{
  "status": "quarantined",
  "gate": "coverage",
  "score": 0.74,
  "threshold": 0.80,
  "remediation": "rerun_with_higher_tier"
}

The runtime persists the artifact and the failure record together, emits a quarantine event on the run channel, and — if configured — auto-routes the failed run to a higher-tier model for a retry. The failure does not propagate as an exception unless the workflow explicitly opts in via evaluation.gating.fail_on_quarantine: true.

This shape is what makes the gate compatible with bounded loops like iterative_review: the loop body inspects the gate result and decides whether to rework rather than crashing on a quality miss.
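
A sketch of such a loop body, assuming the gate result arrives as the dict shown above; the decision labels and the attempt budget are hypothetical stand-ins for whatever the workflow defines:

def review_iteration(gate_result: dict, attempt: int, max_attempts: int = 3) -> str:
    """Hypothetical bounded-loop decision: rework on quarantine until the attempt budget runs out."""
    if gate_result.get("status") != "quarantined":
        return "ship"
    if attempt < max_attempts:
        return "rework"      # e.g. follow the record's remediation hint, such as rerun_with_higher_tier
    return "escalate"        # budget exhausted: surface for human review instead of crashing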

Where to read more