Architecture: agentic-tools

Executive Summary

agentic-tools (v0.1.0) is the shared utility layer for the monorepo. It provides a multi-provider LLM client abstraction (nine providers), a disk-backed response cache, model probing and discovery, benchmark infrastructure, a structured error taxonomy, a research library builder, and script bootstrap helpers. It is consumed by agentic-v2-eval and by the agentic-workflows-v2 runtime.

The design follows a static facade + provider adapters + two-level cache pattern. Callers never instantiate a client object; they call LLMClient.generate_text(...) as a static method. The method dispatches to one of nine provider backends based on a name prefix in the model string. A 24-hour, SHA-256-keyed disk cache sits in front of all provider calls when enabled.

The package is the uv workspace root. Members agentic-workflows-v2 and agentic-v2-eval declare it as a workspace dependency.


Technology Stack

| Component | Technology | Notes |
|---|---|---|
| Language | Python 3.11+ | |
| Build backend | hatchling | pyproject.toml as single config source |
| Workspace manager | uv | Workspace root; members: agentic-workflows-v2, agentic-v2-eval |
| Config / rubric parsing | PyYAML | |
| HTTP (async) | aiohttp | Async provider backends |
| Data validation | pydantic v2 | model_dump() / model_validate() — not .dict() / .parse_obj() |
| OpenAI SDK | openai | OpenAI and Azure OpenAI providers |
| Anthropic SDK | anthropic | Claude provider |
| Numeric | numpy | Upper bound < 3 for semver safety |
| Optional: Google | google-generativeai | Gemini provider |
| Optional: ONNX | onnxruntime-genai | Local ONNX model inference |
| Optional: Windows AI | winrt-runtime | Phi Silica NPU (Copilot+ PC) |

Package Structure

tools/
├── llm/                          # 21 modules — LLM client and provider layer
│   ├── llm_client.py             # LLMClient static facade
│   ├── provider_adapters.py      # Dispatch table; all provider adapter classes
│   ├── probe_providers.py        # Model availability probing (all providers)
│   ├── probe_providers_cloud.py  # Cloud-specific probe logic
│   ├── probe_providers_local.py  # Local model probe logic
│   ├── probe_discovery.py        # Cross-provider discovery orchestration
│   ├── probe_discovery_providers.py  # Per-provider discovery helpers
│   ├── probe_config.py           # Probe configuration dataclasses
│   ├── model_probe.py            # ModelProbe class with persistent JSON cache
│   ├── model_inventory.py        # Model inventory management
│   ├── model_locks.py            # Concurrency locks for model loading
│   ├── local_model.py            # ONNX model wrapper
│   ├── local_models.py           # Local model registry
│   ├── local_model_cli.py        # CLI for local model operations
│   ├── local_model_discovery.py  # Auto-detect models from ~/.cache/aigallery
│   ├── model_bakeoff.py          # Multi-model comparison runner
│   ├── bakeoff_tasks.py          # Bakeoff task definitions
│   ├── bakeoff_reporting.py      # Bakeoff result formatting
│   ├── rank_models.py            # Score-based model ranking
│   ├── langchain_adapter.py      # LangChain adapter bridge
│   └── windows_ai.py             # Windows AI (Phi Silica NPU) integration
├── core/                         # 9 modules — shared utilities
│   ├── config.py                 # ModelConfig, PathConfig, Config dataclasses
│   ├── errors.py                 # ErrorCode StrEnum, LLMClientError, classify_error
│   ├── cache.py                  # ResponseCache (SHA-256, 24h TTL, disk-backed)
│   ├── response_cache.py         # Low-level cache read/write primitives
│   ├── _encoding.py              # Internal encoding helpers
│   ├── tool_init.py              # ToolInit bootstrap dataclass + with_retry decorator
│   ├── prompt_db.py              # Prompt template database
│   ├── local_media.py            # Local media file handling
│   └── model_availability.py    # Runtime model availability checks
├── agents/benchmarks/            # 11 modules — benchmark infrastructure
│   ├── registry.py               # BenchmarkRegistry + BENCHMARK_DEFINITIONS
│   ├── config.py                 # BenchmarkConfig dataclass + preset configs
│   ├── datasets.py               # Dataset loaders for each benchmark
│   ├── runner.py                 # BenchmarkRunner (sync)
│   ├── async_runner.py           # AsyncBenchmarkRunner
│   ├── llm_evaluator.py          # LLM-as-judge evaluator (5 weighted dimensions)
│   ├── scoring.py                # Score aggregation and grade assignment
│   ├── reporting.py              # Result formatters
│   ├── task.py                   # BenchmarkTask dataclass
│   ├── loader.py                 # load_benchmark() entry point
│   └── __init__.py               # Public API exports
├── research/                     # 2 modules — research library builder
│   ├── library_builder.py        # build_library() — scan, classify, consolidate
│   └── helpers.py                # URL classification, domain lists
└── tests/                        # 10 test modules — all mocked, no live LLM calls

LLM Client

Interface

class LLMClient:
    @staticmethod
    def generate_text(
        model_name: str,
        prompt: str,
        system_instruction: str | None = None,
        temperature: float = 0.7,
        max_tokens: int = 4096,
    ) -> str: ...

LLMClient is a pure static-method facade. There is no instance state. Thread safety is delegated to the cache layer, which uses a threading.Lock on all read/write operations.
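A minimal usage sketch (the model name, prompt, and system instruction are illustrative; the import paths follow the Stable Public API table below):

from tools.llm.llm_client import LLMClient, LLMClientError

try:
    answer = LLMClient.generate_text(
        model_name="ollama:llama3",   # prefix selects the Ollama backend
        prompt="Summarize the cache design in one sentence.",
        system_instruction="You are a concise technical writer.",
        temperature=0.2,
    )
    print(answer)
except LLMClientError as err:
    print(f"{err.code}: {err.model}")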

Call Flow

Caller
  └─ LLMClient.generate_text(model_name, prompt, ...)  ← static facade
       ├─ ResponseCache.get(sha256_key)                  ← cache layer
       │    └─ hit → return cached response
       └─ miss → provider_adapters.dispatch(prefix, ...)
            ├─ local:*          → LocalONNXAdapter
            ├─ ollama:*         → OllamaRESTAdapter
            ├─ windows-ai:*     → WindowsAIAdapter (Phi Silica NPU)
            ├─ azure-foundry:*  → AzureFoundryAdapter
            ├─ azure-openai:*   → AzureOpenAIAdapter (slot failover)
            ├─ gh:*             → GitHubModelsAdapter (gh CLI subprocess)
            ├─ openai:*         → OpenAIAdapter
            ├─ gemini:*         → GeminiAdapter
            └─ claude:*         → AnthropicAdapter

Provider Routing

The model_name string prefix selects the backend. Unknown prefixes raise LLMClientError with ErrorCode.UNAVAILABLE_MODEL.

| Prefix | Provider | Backend |
|---|---|---|
| local:* | Local ONNX | onnxruntime-genai; auto-detects from ~/.cache/aigallery or LOCAL_MODEL_PATH |
| ollama:* | Ollama REST | urllib.request to localhost:11434 (default port) |
| windows-ai:* | Windows Copilot Runtime (Phi Silica NPU) | winrt-runtime; requires Windows 11 Copilot+ PC hardware |
| azure-foundry:* | Azure AI Foundry | urllib.request; env: AZURE_FOUNDRY_ENDPOINT, AZURE_FOUNDRY_KEY |
| azure-openai:* | Azure OpenAI Service | openai.AzureOpenAI; numbered slot failover (AZURE_OPENAI_API_KEY_0 through _n) |
| gh:* | GitHub Models | gh CLI subprocess; env: GITHUB_TOKEN; maps 18+ friendly names to GitHub model IDs |
| openai:* | OpenAI | openai.OpenAI SDK; env: OPENAI_API_KEY |
| gemini:* | Google Gemini | google.generativeai SDK; env: GEMINI_API_KEY |
| claude:* | Anthropic Claude | anthropic.Anthropic SDK; env: ANTHROPIC_API_KEY |
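
The dispatch itself reduces to splitting on the first colon. A simplified sketch of the routing described above (the real dispatch table lives in tools/llm/provider_adapters.py; the string values here stand in for adapter classes):

_ADAPTERS = {
    "local": "LocalONNXAdapter",
    "ollama": "OllamaRESTAdapter",
    "windows-ai": "WindowsAIAdapter",
    "azure-foundry": "AzureFoundryAdapter",
    "azure-openai": "AzureOpenAIAdapter",
    "gh": "GitHubModelsAdapter",
    "openai": "OpenAIAdapter",
    "gemini": "GeminiAdapter",
    "claude": "AnthropicAdapter",
}

def resolve_adapter(model_name: str) -> str:
    prefix, sep, bare_model = model_name.partition(":")
    if not sep or prefix not in _ADAPTERS:
        # the real code raises LLMClientError with ErrorCode.UNAVAILABLE_MODEL
        raise ValueError(f"unknown provider prefix in {model_name!r}")
    return _ADAPTERS[prefix]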

Remote Provider Gate

Remote providers (OpenAI, Anthropic, Gemini, Azure variants) are disabled by default to prevent accidental spend during local development and CI runs.

| PROMPTEVAL_ALLOW_REMOTE | Effect |
|---|---|
| unset / 0 (default) | Remote providers raise LLMClientError(PERMISSION_DENIED) |
| 1 | All nine providers are active |

Local providers (local:*, ollama:*, windows-ai:*, gh:*) are always allowed.
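
A sketch of how the gate can be enforced at dispatch time (the helper name is hypothetical; the env-var semantics follow the table above):

import os

_REMOTE_PREFIXES = {"openai", "gemini", "claude", "azure-openai", "azure-foundry"}

def ensure_remote_allowed(prefix: str) -> None:
    if prefix in _REMOTE_PREFIXES and os.environ.get("PROMPTEVAL_ALLOW_REMOTE") != "1":
        # the real code raises LLMClientError with ErrorCode.PERMISSION_DENIED
        raise PermissionError(
            f"remote provider '{prefix}' blocked; set PROMPTEVAL_ALLOW_REMOTE=1"
        )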


Response Cache

| Property | Value |
|---|---|
| Key algorithm | SHA-256 of (model_name, prompt, system_instruction, temperature, max_tokens) |
| TTL | 24 hours |
| Storage location | ~/.cache/prompts-eval/responses/ |
| Max size | 500 MB (LRU eviction when exceeded) |
| Thread safety | threading.Lock on all read/write operations |
| Enable | PROMPTS_CACHE_ENABLED=1 |
| Disable | Unset or PROMPTS_CACHE_ENABLED=0 |

Disk layout:

~/.cache/prompts-eval/responses/
└── <sha256_hex>.json    # one file per unique (model, prompt, params) combination

Each cache entry is a JSON object containing response, model, timestamp, and ttl_seconds fields.
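
A sketch of the key derivation (the exact serialization inside ResponseCache is not documented here; JSON with sorted keys is one deterministic choice):

import hashlib
import json

def cache_key(model_name: str, prompt: str, system_instruction: str | None,
              temperature: float, max_tokens: int) -> str:
    # hash all five request parameters so any change produces a new key
    payload = json.dumps(
        {"model": model_name, "prompt": prompt, "system": system_instruction,
         "temperature": temperature, "max_tokens": max_tokens},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()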

Disable the cache when running evaluations that intentionally test stochastic model behavior (for example, the PatternEvaluator 20-run median computation), or when exercising cache-miss code paths in tests.


Model Probing (ModelProbe)

ModelProbe in tools/llm/model_probe.py provides runtime model availability probing with a persistent JSON backing cache.

  • probe(model_name) — checks whether a single model responds to a minimal prompt
  • discover_all_models() — runs all provider probes in parallel and returns a consolidated list of available models
  • Results are written to a JSON file under ~/.cache/prompts-eval/model_probe/ and reused across sessions until invalidated

Provider-specific probe logic lives in probe_providers_cloud.py and probe_providers_local.py, coordinated by probe_discovery.py.
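
A usage sketch (the constructor signature and the probe() return type are assumptions; only the method names are documented above):

from tools.llm.model_probe import ModelProbe

probe = ModelProbe()                     # assumed no-arg constructor
if probe.probe("ollama:llama3"):         # assumed truthy result on success
    print("ollama:llama3 responds")

available = probe.discover_all_models()  # parallel probe across all providers
print(f"{len(available)} models available")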


Error Taxonomy

tools.core.errors defines a StrEnum with ten values and helper functions.

| Code | Meaning | Retryable |
|---|---|---|
| SUCCESS | No error | n/a |
| UNAVAILABLE_MODEL | Model not found or not loaded | No |
| PERMISSION_DENIED | Remote provider blocked by env gate | No |
| RATE_LIMITED | Provider rate limit exceeded | Yes |
| TIMEOUT | Request exceeded timeout | Yes |
| PARSE_ERROR | Could not parse LLM response | Yes |
| FILE_NOT_FOUND | Required file path does not exist | No |
| INVALID_INPUT | Input failed validation | No |
| NETWORK_ERROR | Network-level failure | Yes |
| INTERNAL_ERROR | Unexpected error in client code | No |

Errors surface as LLMClientError:

class LLMClientError(RuntimeError):
    code: ErrorCode
    model: str              # model_name that was requested
    original_error: Exception | None

Helper functions:

  • classify_error(exc: Exception) -> ErrorCode — inspects exception type and message to return the appropriate code
  • is_retryable(code: ErrorCode) -> bool — returns True for RATE_LIMITED, TIMEOUT, NETWORK_ERROR, PARSE_ERROR
  • is_permanent(code: ErrorCode) -> bool — convenience inverse of is_retryable

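An illustrative use of the helpers (the wrapper function is hypothetical; with_retry, described below, automates the same decision):

from tools.core.errors import classify_error, is_retryable

def call_and_classify(fn):
    try:
        return fn()
    except Exception as exc:
        code = classify_error(exc)
        if is_retryable(code):   # RATE_LIMITED, TIMEOUT, NETWORK_ERROR, PARSE_ERROR
            return None          # caller may retry
        raise                    # permanent errors propagate immediately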


Config (tools.core.config)

@dataclass
class ModelConfig:
    gen_model: str      # from GEN_MODEL env var
    rev_model: str      # from REV_MODEL env var
    ref_model: str      # from REF_MODEL env var

@dataclass
class PathConfig:
    cache_dir: Path
    model_dir: Path
    output_dir: Path

@dataclass
class Config:
    models: ModelConfig
    paths: PathConfig

default_config() reads GEN_MODEL, REV_MODEL, REF_MODEL environment variables and returns a Config instance. Defaults are applied for unset variables so the package is usable out of the box for local-only providers.
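
Typical usage (the attribute access follows the dataclass definitions above):

from tools.core.config import default_config

cfg = default_config()        # reads GEN_MODEL, REV_MODEL, REF_MODEL
print(cfg.models.gen_model)   # package default applied when the env var is unset
print(cfg.paths.cache_dir)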


ToolInit and with_retry

ToolInit is a dataclass for script bootstrap. It performs startup environment and path checks and writes structured JSONL log entries.

@dataclass
class ToolInit:
    check_env: list[str]        # required env var names
    check_models: list[str]     # required model prefixes to verify
    check_paths: list[str]      # required file/dir paths
    log_file: str | None        # JSONL log path; None = no log
    exit_code: int              # 0 = success after all checks

ToolInit.run() validates all checks sequentially. On any failure it writes a JSONL error entry and raises SystemExit with a non-zero code.
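
A bootstrap sketch (the field values are illustrative):

from tools.core.tool_init import ToolInit

init = ToolInit(
    check_env=["GEN_MODEL"],          # fail fast if the env var is missing
    check_models=["ollama:"],         # verify the prefix resolves to a provider
    check_paths=["./prompts"],        # required directory must exist
    log_file="bootstrap.jsonl",       # structured JSONL log destination
    exit_code=0,
)
init.run()   # raises SystemExit with a non-zero code on the first failed check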

with_retry Decorator

def with_retry(
    max_attempts: int = 3,
    backoff_base: float = 1.0,
    transient_only: bool = True,
) -> Callable: ...

Wraps a function with exponential backoff retry logic. When transient_only=True (default), retries only on errors where is_retryable() returns True. Permanent errors such as PERMISSION_DENIED are re-raised immediately without consuming retry budget.
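
Typical usage (the model name is illustrative; the exact delay schedule, e.g. backoff_base * 2**attempt, is an assumption):

from tools.core.tool_init import with_retry
from tools.llm.llm_client import LLMClient

@with_retry(max_attempts=3, backoff_base=1.0)
def generate(prompt: str) -> str:
    # transient failures (rate limits, timeouts, network errors) are retried;
    # permanent ones such as PERMISSION_DENIED propagate immediately
    return LLMClient.generate_text(model_name="ollama:llama3", prompt=prompt)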


Benchmark Infrastructure

Located in tools/agents/benchmarks/. Provides a registry of nine named benchmarks (eight built-in plus a user-supplied custom-local), an LLM-as-judge evaluator, preset run configurations, and both sync and async runners.

Supported Benchmarks

| Benchmark ID | Task Count | Description |
|---|---|---|
| swe-bench | 2,294 | Full GitHub issue → patch set |
| swe-bench-verified | 500 | Manually verified subset |
| swe-bench-lite | 300 | Lightweight subset for fast iteration |
| humaneval | 164 | OpenAI function synthesis benchmark |
| humaneval-plus | 164 | Extended test cases for HumanEval |
| mbpp | 974 | Mostly Basic Python Problems |
| mbpp-sanitized | 427 | Cleaned subset of MBPP |
| codeclash | 100 | Competitive programming problems |
| custom-local | variable | User-provided local benchmark |

BenchmarkTask

@dataclass
class BenchmarkTask:
    task_id: str
    benchmark_id: str
    prompt: str
    instruction: str
    repo: str | None
    base_commit: str | None
    golden_patch: str | None
    test_cases: list[str]
    pass_criteria: str

LLM-as-Judge Evaluator

The benchmark LLM evaluator (llm_evaluator.py) scores responses on a 0.0–10.0 scale with five weighted dimensions.

| Dimension | Weight |
|---|---|
| Completeness | 0.25 |
| Correctness | 0.25 |
| Quality | 0.20 |
| Specificity | 0.15 |
| Alignment | 0.15 |

Grade thresholds: A (≥ 9.0), B (≥ 8.0), C (≥ 7.0), D (≥ 6.0), F (< 6.0).
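
The aggregation is a straight weighted sum. A sketch consistent with the weights and thresholds above (function names are illustrative; the real logic lives in scoring.py):

WEIGHTS = {"completeness": 0.25, "correctness": 0.25, "quality": 0.20,
           "specificity": 0.15, "alignment": 0.15}

def weighted_score(dims: dict[str, float]) -> float:
    # each dimension is scored 0.0-10.0; the weights sum to 1.0
    return sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)

def grade(score: float) -> str:
    for threshold, letter in ((9.0, "A"), (8.0, "B"), (7.0, "C"), (6.0, "D")):
        if score >= threshold:
            return letter
    return "F"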

Preset Configurations

| Preset | Benchmarks Included | Use Case |
|---|---|---|
| quick-test | HumanEval (first 20 tasks) | Fast smoke test |
| swe-bench-eval | SWE-bench-verified | Standard SWE-bench run |
| local-dev | custom-local | Local iteration |
| full-eval | All benchmarks | Complete evaluation suite |
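
Loading a benchmark's tasks (the return type of load_benchmark is assumed to be a list of BenchmarkTask):

from tools.agents.benchmarks.loader import load_benchmark

tasks = load_benchmark("humaneval")   # 164 BenchmarkTask instances
smoke = tasks[:20]                    # the slice used by the quick-test preset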

Research Library Builder

tools.research.library_builder provides build_library(sources: list[str]) -> ResearchLibrary.

  • Scans the repository for research materials (markdown files, notebooks, annotated references)
  • Classifies source URLs against two domain lists:
      • Approved domains (35+): arxiv.org, acm.org, ieee.org, nature.com, openai.com, anthropic.com, huggingface.co, microsoft.com, and similar authoritative sources
      • Caution domains: unverified blogs, marketing sites
  • Consolidates materials into a ResearchLibrary artifact with coverage_score and source_quality_score fields
  • Research gating thresholds (used by evaluation pipelines): coverage_score >= 0.80, source_quality_score >= 0.80

Domain classification uses keyword heuristics and falls back to an LLM call for ambiguous URLs.
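
A sketch of the keyword-heuristic stage (the domain list is an excerpt of the approved set above; the real helper escalates ambiguous hosts to an LLM):

from urllib.parse import urlparse

APPROVED = {"arxiv.org", "acm.org", "ieee.org", "nature.com", "huggingface.co"}  # excerpt

def classify_url(url: str) -> str:
    host = urlparse(url).netloc.lower().removeprefix("www.")
    if any(host == d or host.endswith("." + d) for d in APPROVED):
        return "approved"
    return "caution"   # ambiguous hosts go to the LLM fallback in the real code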


Stable Public API

The following symbols form the cross-package boundary consumed by agentic-v2-eval and agentic-workflows-v2. They are stable and must not be removed or changed incompatibly.

| Symbol | Module | Description |
|---|---|---|
| LLMClient | tools.llm.llm_client | Static LLM facade |
| LLMClientError | tools.llm.llm_client | Base error class with code and model fields |
| LangChainAdapter | tools.llm.langchain_adapter | LangChain-compatible adapter wrapping LLMClient |
| ModelProbe | tools.llm.model_probe | Runtime model availability prober with persistent cache |
| ErrorCode | tools.core.errors | StrEnum of 10 error codes |
| classify_error | tools.core.errors | Map exception to ErrorCode |
| is_retryable | tools.core.errors | Check if error code is retryable |
| ResponseCache | tools.core.cache | Disk-backed SHA-256 response cache |
| default_config | tools.core.config | Return Config from environment variables |
| init_tool | tools.core.tool_init | Convenience wrapper for ToolInit.run() |
| with_retry | tools.core.tool_init | Exponential backoff retry decorator |
| BenchmarkRegistry | tools.agents.benchmarks.registry | Registry of all named benchmarks |
| BenchmarkConfig | tools.agents.benchmarks.config | Run configuration dataclass |
| load_benchmark | tools.agents.benchmarks.loader | Load tasks for a named benchmark |
| BENCHMARK_DEFINITIONS | tools.agents.benchmarks.registry | Dict of all registered benchmark metadata |

Testing

| Property | Value |
|---|---|
| Test modules | 10 |
| Live LLM calls | None (all providers are mocked) |
| asyncio mode | auto (pytest-asyncio) |

Run the suite from the package root:

cd tools
pip install -e ".[dev]"
python -m pytest tests/ -v
python -m pytest tests/ --cov=tools --cov-report=term-missing

Static analysis:

mypy --strict tools/
ruff check tools/

All test doubles satisfy the same structural protocols as the real implementations, ensuring mocks remain in sync with the actual interface.


Known Issues

Bare Module Import Paths

Affected files: tools/llm/llm_client.py, tools/core/cache.py

These files contain bare module imports (from cache import ..., from response_cache import ...) that function when the files are executed directly from their containing directory but silently disable caching when the package is installed conventionally. The Python import resolver cannot find cache as a top-level module; the try/except ImportError catches the failure, and caching is skipped without any warning to the caller.
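
An illustrative reconstruction of the pattern (not a verbatim excerpt):

try:
    from cache import ResponseCache      # bare import: resolves only when CWD is the module's directory
except ImportError:
    ResponseCache = None                 # caching silently disabled under a normal install

# Proposed fix: a package-relative import, which resolves regardless of CWD
# from .cache import ResponseCache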

Workaround: Always install the package with pip install -e . (editable mode) from the tools/ directory, or use the workspace uv install, which handles path resolution correctly.

Fix required: Change bare imports to relative imports (from .cache import ..., from .response_cache import ...).