
Architecture: agentic-v2-eval

Executive Summary

agentic-v2-eval (v0.3.0) is a rubric-driven evaluation framework for agentic workflows. It provides LLM-as-judge scoring, structural pattern evaluation (ReAct, CoVe, Reflexion, RAG), output quality metrics, batch and streaming runners with configurable concurrency, and multiple report formats (JSON, Markdown, HTML). All evaluator dependencies are injected through structural protocols, enabling deterministic unit testing without live API calls.

The package is a workspace dependency of the broader monorepo. agentic-tools supplies the LLMClient and is lazy-loaded at call time so that import overhead is not paid unless evaluation actually runs. Install via pip install -e ".[dev]" from the agentic-v2-eval/ directory.


Technology Stack

| Component | Technology | Notes |
|---|---|---|
| Language | Python 3.11+ | |
| Build backend | hatchling | pyproject.toml as single config source |
| Rubric parsing | PyYAML | YAML rubric files loaded at call time |
| LLM access | agentic-tools (workspace dep) | Lazy-loaded via adapters/ bridge |
| Test runner | pytest + pytest-asyncio | asyncio mode: auto |
| Coverage | pytest-cov | 80% threshold enforced in CI |
| Static analysis | mypy | --strict mode |
| Linting | ruff | Rules: E, F, W, I, N, UP, S, B, A, C4, SIM, TCH, RUF |

Architecture Pattern

Plugin Registry + Protocol-Based Dependency Injection

Evaluators are registered in a central EvaluatorRegistry keyed by string name. Callers request an evaluator by name; the registry resolves and returns the concrete instance. New evaluator types can be added without touching any runner or reporter code.

All evaluators depend on LLMClientProtocol, a structural protocol defined in interfaces.py. The concrete LLMClient from agentic-tools satisfies this protocol at runtime. Tests inject a mock that also satisfies the protocol, ensuring no real API calls are made in the test suite.

Rubrics are loaded from YAML files at call time, not at import time. This keeps evaluator classes free of file-system state and makes rubric substitution trivial at the CLI level.
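
As a rough illustration of the registry-plus-protocol combination, the sketch below uses the documented EvaluatorRegistry and LLMClientProtocol names but invents the protocol's method shape, the factory signature, and the mock purely for demonstration:

# Illustrative sketch only: the protocol method and registry internals are
# assumptions, not the package's actual signatures.
from typing import Callable, Protocol

class LLMClientProtocol(Protocol):
    def complete(self, prompt: str) -> str: ...  # hypothetical method name

class MockJudge:
    """Satisfies the protocol structurally, so tests never touch a real API."""
    def complete(self, prompt: str) -> str:
        return "0.75"

class EvaluatorRegistry:
    def __init__(self) -> None:
        self._factories: dict[str, Callable[[LLMClientProtocol], object]] = {}

    def register(self, name: str, factory: Callable[[LLMClientProtocol], object]) -> None:
        self._factories[name] = factory

    def create(self, name: str, client: LLMClientProtocol) -> object:
        return self._factories[name](client)  # resolve by string key

registry = EvaluatorRegistry()
registry.register("quality", lambda client: {"kind": "quality", "client": client})
evaluator = registry.create("quality", MockJudge())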


Package Structure

src/agentic_v2_eval/
├── evaluators/
│   ├── base.py          # Abstract Evaluator base class
│   ├── llm.py           # LLMEvaluator — choice-anchored LLM judge
│   ├── pattern.py       # PatternEvaluator — agentic pattern conformance
│   ├── quality.py       # QualityEvaluator — output quality dimensions
│   └── standard.py      # StandardEvaluator — prompt quality grading
├── metrics/
│   ├── accuracy.py      # accuracy, precision, recall, F1, confusion matrix
│   ├── performance.py   # execution_time, memory_usage, throughput, latency_percentiles
│   └── quality.py       # code_quality (AST), lint_score, complexity_score
├── reporters/
│   ├── json.py          # JsonReporter
│   ├── markdown.py      # MarkdownReporter
│   └── html.py          # HtmlReporter (self-contained, embedded CSS)
├── runners/
│   ├── batch.py         # BatchRunner — sync, sequential, generic T/R
│   └── streaming.py     # StreamingRunner + AsyncStreamingRunner
├── rubrics/             # 8 YAML rubric definition files
│   ├── default.yaml
│   ├── agent.yaml
│   ├── code.yaml
│   ├── coding_standards.yaml
│   ├── pattern.yaml
│   ├── quality.yaml
│   ├── prompt_standard.yaml
│   └── prompt_pattern.yaml
├── sandbox/
│   ├── base.py          # BaseSandbox abstract class
│   └── local.py         # LocalSubprocessSandbox with safe_mode
├── adapters/
│   └── llm_client.py    # Bridge to agentic-tools LLMClient (lazy-loaded)
├── interfaces.py        # LLMClientProtocol, Evaluator protocols
├── scorer.py            # YAML-rubric weighted scoring engine
├── datasets.py          # Lazy bridge to tools.agents.benchmarks
└── __main__.py          # CLI entry point: evaluate, report

Evaluator System

The EvaluatorRegistry holds four built-in evaluator types. Each is registered by a string key and resolved by name at runtime.

LLMEvaluator

Strategy: Choice-anchored LLM-as-judge on a 5-point discrete scale.

  • Scale values: 0, 0.25, 0.5, 0.75, 1.0
  • Default judge model: gh:gpt-4o
  • Score extraction: The judge prompt instructs the model to end its response with a bare numeric score on a new line. The evaluator reads the last line of the response and matches it against the five allowed values. Any deviation raises a parse error; there is no silent fallback to zero. A parsing sketch follows this list.
  • Use case: General-purpose rubric-driven scoring where human-proxy judgment is needed.
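
A minimal sketch of that last-line rule, with a hypothetical helper name (the evaluator's real parsing code is not reproduced here):

# Illustrative score extraction; the function name and exception type are assumptions.
ALLOWED_SCORES = {"0", "0.25", "0.5", "0.75", "1.0"}

def extract_choice_score(judge_response: str) -> float:
    lines = judge_response.strip().splitlines()
    last_line = lines[-1].strip() if lines else ""
    if last_line not in ALLOWED_SCORES:
        # No silent fallback to zero: anything off-scale is a parse error.
        raise ValueError(f"judge response did not end with an allowed score: {last_line!r}")
    return float(last_line)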

PatternEvaluator

Strategy: Structural conformance scoring for agentic prompt patterns.

  • Supported patterns: ReAct, CoVe (Chain-of-Verification), Reflexion, RAG
  • Score structure: PatternScore dataclass with 17 fields covering 7 universal dimensions and up to 3 pattern-specific dimensions per pattern type.
  • Variance reduction: The evaluator runs the same prompt 20 times and reports the median score across runs to smooth stochastic LLM output (see the sketch after this list).
  • Hard gates: Four pre-conditions must all be met before the weighted score is computed. Failing any gate produces an immediate overall failure regardless of weighted score.

| Gate | Minimum | Meaning |
|---|---|---|
| POI (Pattern Observability Index) | 4 | Pattern structure must be detectable in the output |
| PC (Pattern Completeness) | 4 | All required pattern steps must be present |
| CA (Criterion Adherence) | 4 | The LLM followed the criterion instructions |
| PR (Pattern Ratio) | 0.75 | At least 75% of pattern-specific dimensions must pass |
  • Use case: Verifying that agentic outputs adhere to a defined reasoning pattern.
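
A compact sketch of how the median aggregation and hard gates described above can combine (the field names and exact flow are assumptions, not the evaluator's real code):

# Illustrative aggregation: median over repeated runs, gated by binary pre-conditions.
from statistics import median

def aggregate_pattern_runs(run_scores: list[float], gates: dict[str, float]) -> tuple[float, bool]:
    gates_ok = (
        gates["POI"] >= 4
        and gates["PC"] >= 4
        and gates["CA"] >= 4
        and gates["PR"] >= 0.75
    )
    score = median(run_scores)  # e.g. the 20 runs of the same prompt
    return score, gates_ok and score >= 0.75  # 0.75 is pattern.yaml's pass threshold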

QualityEvaluator

Strategy: Five independent LLM judge calls, one per quality dimension.

| Dimension | Description |
|---|---|
| Coherence | Logical consistency and flow across the output |
| Fluency | Grammatical correctness and natural language quality |
| Relevance | On-topic alignment with the input prompt |
| Groundedness | Claims supported by provided context or evidence |
| Similarity | Semantic overlap with a reference output (if provided) |
  • Each dimension is scored by a separate LLM call using the rubric definitions in quality.yaml (see the sketch after this list).
  • Use case: Output quality assessment independent of task correctness.
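
A rough sketch of the one-call-per-dimension loop (the prompt wording and the client method are assumptions; only the five dimensions and the last-line score convention come from this document):

# Hypothetical sketch: one judge call per dimension, scores keyed by dimension name.
DIMENSIONS = ("coherence", "fluency", "relevance", "groundedness", "similarity")

def score_quality(client, output: str, context: str) -> dict[str, float]:
    scores: dict[str, float] = {}
    for dimension in DIMENSIONS:
        prompt = (
            f"Rate the {dimension} of the output below given the context.\n\n"
            f"Context:\n{context}\n\nOutput:\n{output}\n\n"
            "End with one of 0, 0.25, 0.5, 0.75, 1.0 on its own line."
        )
        # Assumes the same last-line convention sketched for LLMEvaluator above.
        scores[dimension] = float(client.complete(prompt).strip().splitlines()[-1])
    return scores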

StandardEvaluator

Strategy: Five prompt-quality dimensions scored 0–10, with letter grade and pass/fail determination.

| Dimension | Scale |
|---|---|
| Clarity | 0–10 |
| Effectiveness | 0–10 |
| Structure | 0–10 |
| Specificity | 0–10 |
| Completeness | 0–10 |
  • Overall score: Unweighted mean of the five dimension scores.
  • Grade mapping: A (≥ 9.0), B (≥ 8.0), C (≥ 7.0), D (≥ 6.0), F (< 6.0) (see the sketch after this list)
  • Pass threshold: overall_score ≥ 7.0
  • Use case: Prompt engineering quality review and grading of prompt templates.
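
A small sketch of the mean, grade, and pass computation (the function name is hypothetical; the thresholds are the ones listed above):

# Illustrative grading; dimension scores are on the 0-10 scale described above.
def grade_standard_score(dimension_scores: dict[str, float]) -> tuple[float, str, bool]:
    overall = sum(dimension_scores.values()) / len(dimension_scores)  # unweighted mean
    if overall >= 9.0:
        grade = "A"
    elif overall >= 8.0:
        grade = "B"
    elif overall >= 7.0:
        grade = "C"
    elif overall >= 6.0:
        grade = "D"
    else:
        grade = "F"
    return overall, grade, overall >= 7.0  # pass threshold

grade_standard_score({"clarity": 8, "effectiveness": 7, "structure": 9, "specificity": 6, "completeness": 8})
# -> (7.6, "C", True)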

Rubric System

All eight rubric files reside in src/agentic_v2_eval/rubrics/. Each criterion specifies a weight and optional description. The sum of weights within each rubric equals 1.0.

| Rubric File | Criteria | Purpose | Pass Threshold |
|---|---|---|---|
| default.yaml | 3 | General (Accuracy 0.5, Completeness 0.3, Efficiency 0.2) | |
| agent.yaml | 6 | Scoring agent workflow outputs | 0.70 |
| code.yaml | 5 | Scoring code generation quality | 0.75 |
| pattern.yaml | 6 | Agentic pattern adherence with hard gates | 0.75 |
| coding_standards.yaml | 8 | Python/ML coding standards conformance | 0.70 |
| quality.yaml | 5 | LLM judge dimension definitions for QualityEvaluator | |
| prompt_standard.yaml | 5 | Judge prompt definitions for StandardEvaluator | |
| prompt_pattern.yaml | 4 | Judge prompts for ReAct/CoVe/Reflexion/RAG patterns | |
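
The rubric files themselves are not reproduced here; as a sketch of the shape a parsed rubric might take after yaml.safe_load, the keys below are inferred from the Criterion dataclass and the pass thresholds above rather than copied from any actual file:

# Hypothetical parsed rubric; key names are inferred, not taken from the real YAML files.
example_rubric = {
    "name": "example",
    "pass_threshold": 0.75,
    "criteria": [
        {"name": "correctness", "weight": 0.4, "description": "Does the output do what was asked?"},
        {"name": "readability", "weight": 0.3, "description": "Structure, naming, clarity"},
        {"name": "robustness", "weight": 0.3, "description": "Error handling and edge cases"},
    ],
}
assert abs(sum(c["weight"] for c in example_rubric["criteria"]) - 1.0) < 1e-9  # weights sum to 1.0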

Hard Gates (pattern.yaml)

Hard gates are evaluated as binary pre-conditions before weighted scoring begins. If any gate fails, the overall result is marked as failed regardless of the weighted score.

| Gate | Minimum Value | Meaning |
|---|---|---|
| POI | 4 | Pattern structure is observable in the output |
| PC | 4 | All required pattern steps are present |
| CA | 4 | The output adhered to criterion instructions |
| PR | 0.75 | ≥ 75% of pattern-specific dimensions passed |

Scoring Engine (scorer.py)

The Scorer class loads a YAML rubric and computes both raw and weighted scores.

@dataclass
class Criterion:
    name: str
    weight: float
    description: str
    min_score: float = 0.0
    max_score: float = 1.0

@dataclass
class ScoringResult:
    criteria_scores: dict[str, float]
    total_score: float         # unweighted mean
    weighted_score: float      # sum(score * weight) — range [0, 1]
    passed: bool               # weighted_score >= rubric pass_threshold

Rubric files are loaded at call time, not at module import, keeping the scorer stateless. Multiple rubrics can be applied to the same result set in a single run by passing different --rubric arguments.
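
A compact sketch of the arithmetic the dataclasses imply (the Scorer's actual method names may differ):

# Illustrative scoring math; assumes per-criterion scores already normalized to [0, 1].
def weighted_rubric_score(criteria_scores: dict[str, float], weights: dict[str, float], pass_threshold: float) -> tuple[float, float, bool]:
    total_score = sum(criteria_scores.values()) / len(criteria_scores)        # unweighted mean
    weighted_score = sum(criteria_scores[n] * weights[n] for n in weights)    # in [0, 1] when weights sum to 1.0
    return total_score, weighted_score, weighted_score >= pass_threshold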


Runners

All runners are generic over T (input type) and R (result type).

| Runner | Mode | Concurrency | Result Order | Use Case |
|---|---|---|---|---|
| BatchRunner[T, R] | Synchronous | Sequential | Submission order | CI pipelines; finite input list, all results needed before reporting |
| StreamingRunner[T, R] | Synchronous iterator | Sequential | Submission order | Terminal progress display; no async overhead required |
| AsyncStreamingRunner[T, R] | Asynchronous | asyncio.Semaphore (max_concurrency=5) | Completion order (FIRST_COMPLETED) | I/O-bound scoring; up to 5 concurrent LLM evaluations in flight |

BatchRunner exposes a continue_on_error: bool flag. When True, per-item errors are recorded in the result without halting the batch.

AsyncStreamingRunner uses FIRST_COMPLETED wait strategy so results are yielded as each evaluation finishes rather than in the order they were submitted.
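
A minimal sketch of that completion-order pattern, not the runner's actual interface:

# Illustrative asyncio pattern: bounded concurrency, results yielded as they complete.
import asyncio
from collections.abc import AsyncIterator, Awaitable, Callable, Iterable

async def stream_results(
    items: Iterable[str],
    evaluate: Callable[[str], Awaitable[dict]],
    max_concurrency: int = 5,
) -> AsyncIterator[dict]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item: str) -> dict:
        async with semaphore:  # at most max_concurrency evaluations in flight
            return await evaluate(item)

    pending = {asyncio.create_task(run_one(item)) for item in items}
    while pending:
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            yield task.result()  # completion order, not submission order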


Reporters

All three reporters share a calculate_summary() utility function and accept configurable options at construction time.

| Reporter | Output Format | Notes |
|---|---|---|
| JsonReporter | .json | Structured output suitable for downstream automation |
| MarkdownReporter | .md | Human-readable tables suitable for PR comments or GitHub Wiki |
| HtmlReporter | .html | Self-contained file with embedded CSS; no external assets required |

Both class-based (reporter.generate(results)) and functional (generate_json_report(results)) interfaces are available for all three formats.
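
A hypothetical usage sketch; the import path, constructor options, and results shape are assumptions, while generate(results) and generate_json_report(results) are the documented entry points:

# Hypothetical usage; only the two call styles are documented, the rest is illustrative.
from agentic_v2_eval.reporters.json import JsonReporter, generate_json_report

results = [{"name": "case-1", "weighted_score": 0.82, "passed": True}]

report_text = JsonReporter().generate(results)   # class-based interface
same_report = generate_json_report(results)      # functional interface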


Metrics

accuracy.py

  • accuracy(y_true, y_pred) — standard classification accuracy
  • precision_recall_f1(y_true, y_pred) — macro-averaged precision, recall, F1
  • confusion_matrix(y_true, y_pred) — raw confusion matrix as nested list

performance.py

  • execution_time(fn) — decorator that records wall-clock elapsed time
  • memory_usage(fn) — decorator using tracemalloc to measure peak memory
  • throughput(count, elapsed_s) — items per second
  • benchmark(fn, n_runs) — aggregate min/mean/max over n_runs
  • latency_percentiles(samples, percentiles) — P50/P95/P99 from a list of latency samples
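
As an illustration of the percentile metric, here is a simple rounded-rank reimplementation; it is not necessarily the package's method:

# Illustrative percentile computation over latency samples (rounded-rank selection).
def latency_percentiles(samples: list[float], percentiles: tuple[float, ...] = (50, 95, 99)) -> dict[float, float]:
    ordered = sorted(samples)
    last_index = len(ordered) - 1
    return {p: ordered[round(p / 100 * last_index)] for p in percentiles}

latency_percentiles([120.0, 95.0, 210.0, 88.0, 133.0])
# -> {50: 120.0, 95: 210.0, 99: 210.0}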

quality.py (metrics)

  • code_quality(source) — AST-based static analysis returning a normalized score
  • lint_score(source) — invokes ruff programmatically and returns a pass-rate score
  • complexity_score(source) — cyclomatic complexity via AST visitor

Sandbox

LocalSubprocessSandbox executes evaluated code in an isolated subprocess.

| Property | Value |
|---|---|
| Default timeout | 30 seconds |
| Safe mode | Enabled by default |
| Blocked commands (safe_mode) | 24 commands including rm, wget, curl, nc, kill, chmod, chown, dd, mkfs, and additional destructive/network commands |
| Path escape prevention | Absolute paths outside the sandbox root are rejected before execution |

The sandbox provides subprocess-level guardrails only. It does not offer container-level isolation. For higher assurance, wrap the evaluator in a container image.
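
A sketch of the two subprocess-level guardrails, with a hypothetical helper name and a truncated block list (the real list has 24 entries):

# Illustrative pre-execution checks; the helper name and error type are assumptions.
from pathlib import Path

BLOCKED_COMMANDS = {"rm", "wget", "curl", "nc", "kill", "chmod", "chown", "dd", "mkfs"}

def precheck(command: list[str], target: Path, sandbox_root: Path, safe_mode: bool = True) -> None:
    if safe_mode and command and command[0] in BLOCKED_COMMANDS:
        raise PermissionError(f"blocked command in safe_mode: {command[0]}")
    if not target.resolve().is_relative_to(sandbox_root.resolve()):
        raise PermissionError(f"path escapes sandbox root: {target}")  # reject paths outside the root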


Adapters

adapters/llm_client.py provides a lazy-loading bridge to agentic-tools. The import of LLMClient from tools.llm.llm_client is deferred until the first evaluation call. This means the agentic-v2-eval package can be imported without agentic-tools installed, enabling lighter-weight test environments that mock LLMClientProtocol directly.
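
A minimal sketch of the lazy-import bridge (the function name is hypothetical; the deferred import target is the one named above):

# Illustrative lazy bridge: the heavy import happens on first use, not at package import.
def get_llm_client(**kwargs):
    from tools.llm.llm_client import LLMClient  # deferred until evaluation actually runs
    return LLMClient(**kwargs)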


CLI Reference

The CLI is the __main__.py entry point, registered as the agentic-v2-eval console script.

# Score a results file against a rubric
agentic-v2-eval evaluate results.json [--rubric rubric.yaml] [--output scored.json]

# Generate a formatted report from scored results
agentic-v2-eval report results.json --format {json,markdown,html} --output out_file

evaluate arguments

| Argument | Required | Default | Description |
|---|---|---|---|
| results.json | Yes | | JSON file containing LLM outputs to evaluate |
| --rubric | No | default.yaml | Path to a YAML rubric file |
| --output | No | scored.json | Output path for scored results |

report arguments

| Argument | Required | Default | Description |
|---|---|---|---|
| results.json | Yes | | Scored results file (output of evaluate) |
| --format | Yes | | One of json, markdown, html |
| --output | Yes | | Output file path |

Public API

The package exports 16 symbols from its top-level __init__.py.

| Symbol | Type | Description |
|---|---|---|
| COHERENCE | Constant | Quality dimension identifier for Coherence |
| FLUENCY | Constant | Quality dimension identifier for Fluency |
| GROUNDEDNESS | Constant | Quality dimension identifier for Groundedness |
| RELEVANCE | Constant | Quality dimension identifier for Relevance |
| SIMILARITY | Constant | Quality dimension identifier for Similarity |
| Evaluator | Protocol | Base structural protocol for all evaluators |
| EvaluatorRegistry | Class | Plugin registry mapping names to evaluator instances |
| LLMClientProtocol | Protocol | Structural protocol satisfied by LLMClient and test mocks |
| LLMEvaluatorDefinition | Dataclass | Configuration for an LLMEvaluator instance |
| PatternEvaluator | Class | Evaluator for agentic pattern conformance |
| PatternScore | Dataclass | 17-field score result for pattern evaluation |
| QualityEvaluator | Class | Evaluator for five output quality dimensions |
| Scorer | Class | YAML-rubric weighted scoring engine |
| ScoringResult | Dataclass | Result of a Scorer.score() call |
| StandardEvaluator | Class | Evaluator for prompt engineering quality |
| StandardScore | Dataclass | Scored result from StandardEvaluator |

Testing

| Property | Value |
|---|---|
| Test files | 11 |
| Approximate test count | ~215 |
| asyncio mode | auto (pytest-asyncio) |
| Coverage gate | 80% (enforced in CI) |
| Live API calls | None; all tests mock LLMClientProtocol |
| Test markers | integration (skipped in fast mode), slow |

Set up and run the suite locally:

cd agentic-v2-eval
pip install -e ".[dev]"
python -m pytest tests/ -v
python -m pytest tests/ --cov=agentic_v2_eval --cov-report=term-missing

Static analysis:

mypy --strict src/agentic_v2_eval/
ruff check src/agentic_v2_eval/

All function signatures and class attributes must carry type annotations. mypy --strict enforces this in CI and blocks merge on violations.