Rubrics

Every evaluation in Agentic Runtimes scores an artifact against a YAML rubric. The rubrics live alongside the eval harness at agentic-v2-eval/src/agentic_v2_eval/rubrics/ and ship as part of the package, so users get the same scoring model the runtime authors use internally.

There are eight rubrics in the canonical set. They split into two families — scoring rubrics (criteria + weights) and judge rubrics (prompt-template based) — covered in turn below.

Family 1 — Scoring rubrics

Scoring rubrics are flat YAML files that declare a list of criteria, each with a name, a weight (0.0–1.0), an optional description, and an optional levels: block mapping integer scores to qualitative descriptors. The harness multiplies each criterion's score by its weight and sums to get the final rubric score.
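As a rough illustration, a minimal scoring rubric has the shape sketched below. The criterion names and exact key spellings here are illustrative, not copied from the packaged files; the rubrics described in this section are the authoritative reference.

criteria:
  - name: accuracy                # illustrative criterion, not from a packaged rubric
    weight: 0.6
    description: How accurate are the predictions?
    levels:
      5: Fully correct on all checked inputs
      0: Entirely incorrect
  - name: completeness
    weight: 0.4
    description: Does the output cover all required aspects?

With scores of 4 and 3 on a 0–5 scale, the final rubric score under this sketch would be 0.6 × 4 + 0.4 × 3 = 3.6.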

default.yaml

The fallback rubric used when no domain-specific rubric is supplied. Three criteria whose weights sum to 1.0:

| Criterion | Weight | Description |
| --- | --- | --- |
| Accuracy | 0.50 | How accurate are the predictions? |
| Completeness | 0.30 | Does the output cover all required aspects? |
| Efficiency | 0.20 | How quickly is the task completed? |

Use when you have an evaluation in flight and have not yet authored a domain-specific rubric.

code.yaml

Specialized rubric for code-generation outputs. Six criteria, each with a six-level descriptor scale (0–5):

| Criterion | Weight | Highlights |
| --- | --- | --- |
| Correctness | 0.30 | 5 = passes all tests, handles edge cases; 0 = does not run |
| Completeness | 0.20 | 5 = all requirements implemented; 0 = no meaningful implementation |
| Code Quality | 0.20 | 5 = excellent style and docs; 0 = unreadable |
| Efficiency | 0.15 | Reasonable time and space complexity |
| Robustness | 0.10 | Handling of malformed or boundary inputs |
| Documentation | 0.05 | Inline comments, docstrings, type hints |

Use for code_review, bug_resolution, and fullstack_generation outputs.

agent.yaml

Rubric for agent behavior — judges whether the agent stayed in role, respected its boundaries, and used tools appropriately. Best paired with the LLM-as-judge protocol, since agent behavior is harder to score with flat criteria.

pattern.yaml

Rubric for evaluating whether an agentic pattern (consensus, refine loop, react loop, etc.) was applied correctly. Targets the pattern catalog and is invoked when comparing two implementations of the same pattern.

coding_standards.yaml

A standards-conformance rubric. Scores Python code against the project's coding standards: strict PEP 8 naming, mypy --strict typing, ruff rule coverage, no magic numbers, no bare except clauses. Used as a CI gate, not a quality grade.
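A rough sketch of such a gate, using the load_rubric/evaluate API documented later on this page; the standards_gate helper, the 0.8 threshold, and the run_record argument are placeholders rather than project conventions, and the right threshold depends on how the harness normalizes scores:

import sys
from agentic_v2_eval import load_rubric, evaluate

def standards_gate(run_record, threshold: float = 0.8) -> None:
    """Fail the CI job when the coding-standards score falls below `threshold`."""
    result = evaluate(load_rubric("coding_standards"), run_record)
    sys.exit(0 if result.score >= threshold else 1)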

Family 2 — Judge rubrics

Judge rubrics are prompt templates rather than flat criteria. They are consumed by an LLM-as-judge that produces a 1–5 score on a single dimension per call. The template contains the definition, the rubric levels, and a placeholder for the input/output pair to evaluate.
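Roughly, a single dimension in a judge rubric is shaped like the sketch below; the key names and wording are illustrative, not copied from the packaged files:

coherence:
  definition: Logical flow, consistency, and structure of the response
  levels:
    1: Disjointed; ideas do not follow from one another
    5: Clear progression with no contradictions
  prompt_template: |
    Rate the coherence of the response on a scale of 1 to 5,
    using the level definitions above.
    {input_output_pair}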

quality.yaml

The largest and most-used judge rubric. Defines five orthogonal quality dimensions, each with its own 1–5 rubric and prompt template:

| Dimension | What it measures |
| --- | --- |
| Coherence | Logical flow, consistency, and structure |
| Fluency | Grammar, word choice, sentence structure |
| Relevance | Whether the answer addresses the actual query |
| Groundedness | Whether claims are supported by the supplied context |
| Faithfulness | Whether the response stays within the source material |

Each dimension is scored independently; the harness aggregates per the caller's preference (mean, min, weighted average, or per-dimension gating).
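A small sketch of those aggregation choices on the caller's side, using made-up per-dimension scores; the weights and the pass threshold of 3 are arbitrary examples, not harness defaults:

scores = {"coherence": 4, "fluency": 5, "relevance": 4,
          "groundedness": 3, "faithfulness": 4}

mean_score = sum(scores.values()) / len(scores)            # simple mean
min_score = min(scores.values())                           # worst dimension
weights = {"coherence": 0.2, "fluency": 0.1, "relevance": 0.3,
           "groundedness": 0.2, "faithfulness": 0.2}
weighted = sum(w * scores[d] for d, w in weights.items())  # weighted average
passed = all(s >= 3 for s in scores.values())              # per-dimension gating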

prompt_pattern.yaml

Judge rubric for prompts themselves — does the prompt follow a known pattern (zero-shot, few-shot, chain-of-thought, ReAct)? Used when auditing prompt assets in the prompts/ directory.

prompt_standard.yaml

Companion rubric to prompt_pattern.yaml that scores prompt hygiene — clear role definition, explicit boundaries, output format specified, critical rules listed. Used by the persona linter.

How rubrics are loaded

from agentic_v2_eval import load_rubric, evaluate

rubric = load_rubric("code")     # resolves to code.yaml
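# run_record is the run/artifact being scored, constructed by your evaluation setup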
result = evaluate(rubric, run_record)
print(result.score, result.per_criterion)

load_rubric accepts either a bare name (resolved against the packaged rubric directory) or an absolute path to a custom YAML file. Custom rubrics must match the same schema — the harness rejects malformed rubrics at load time, before any LLM cost is incurred.
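Loading a custom rubric therefore looks the same; only the argument changes. The sketch below uses a hypothetical path and catches a generic Exception, since the concrete error type raised for a malformed file is not specified here:

from agentic_v2_eval import load_rubric

custom_path = "/opt/eval/rubrics/team_rubric.yaml"   # hypothetical absolute path

try:
    rubric = load_rubric(custom_path)                # custom rubrics use the same schema
except Exception as exc:                             # rejected at load time, before any LLM call
    raise SystemExit(f"rubric failed validation: {exc}")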

Adding a custom rubric

  1. Drop a YAML file in your project under any directory you control.
  2. Match the schema of code.yaml (criteria + weights + levels) for scoring, or quality.yaml (definitions + prompt templates) for judging.
  3. Pass the absolute path to load_rubric() or wire it into your workflow's evaluation: block.
  4. The harness validates that weights sum to 1.0 (±0.01 tolerance) and that every criterion has a non-empty description; a quick local pre-flight for both checks is sketched below.
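A minimal local pre-flight that mirrors those two checks, assuming a top-level criteria: list as in the sketch near the top of this page; this is a user-side convenience, not the harness's own validator:

import yaml  # PyYAML

def preflight(path: str) -> None:
    """Mirror the harness's load-time checks before submitting a custom rubric."""
    with open(path) as fh:
        doc = yaml.safe_load(fh)
    total = sum(c["weight"] for c in doc["criteria"])
    if abs(total - 1.0) > 0.01:
        raise ValueError(f"weights sum to {total:.3f}, expected 1.0 (±0.01)")
    for c in doc["criteria"]:
        if not c.get("description"):
            raise ValueError(f"criterion {c.get('name')!r} has an empty description")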