# Architecture: agentic-tools

## Executive Summary
agentic-tools (v0.1.0) is the shared utility layer for the monorepo. It provides a multi-provider LLM client abstraction (nine providers), a disk-backed response cache, model probing and discovery, benchmark infrastructure, a structured error taxonomy, a research library builder, and script bootstrap helpers. It is consumed by agentic-v2-eval and by the agentic-workflows-v2 runtime.

The design follows a static facade + provider adapters + two-level cache pattern. Callers never instantiate a client object; they call LLMClient.generate_text(...) as a static method. The method dispatches to one of nine provider backends based on a name prefix in the model string. A 24-hour, SHA-256-keyed disk cache sits in front of all provider calls when enabled.
The package is the uv workspace root. Members agentic-workflows-v2 and agentic-v2-eval declare it as a workspace dependency.
## Technology Stack
| Component | Technology | Notes |
|---|---|---|
| Language | Python | 3.11+ |
| Build backend | hatchling | pyproject.toml as single config source |
| Workspace manager | uv | Workspace root; members: agentic-workflows-v2, agentic-v2-eval |
| Config / rubric parsing | PyYAML | — |
| HTTP (async) | aiohttp | Async provider backends |
| Data validation | pydantic v2 | model_dump() / model_validate() — not .dict() / .parse_obj() |
| OpenAI SDK | openai | OpenAI and Azure OpenAI providers |
| Anthropic SDK | anthropic | Claude provider |
| Numeric | numpy | Upper bound < 3 for semver safety |
| Optional: Google | google-generativeai | Gemini provider |
| Optional: ONNX | onnxruntime-genai | Local ONNX model inference |
| Optional: Windows AI | winrt-runtime | Phi Silica NPU (Copilot+ PC) |
## Package Structure

```
tools/
├── llm/                          # 21 modules — LLM client and provider layer
│ ├── llm_client.py # LLMClient static facade
│ ├── provider_adapters.py # Dispatch table; all provider adapter classes
│ ├── probe_providers.py # Model availability probing (all providers)
│ ├── probe_providers_cloud.py # Cloud-specific probe logic
│ ├── probe_providers_local.py # Local model probe logic
│ ├── probe_discovery.py # Cross-provider discovery orchestration
│ ├── probe_discovery_providers.py # Per-provider discovery helpers
│ ├── probe_config.py # Probe configuration dataclasses
│ ├── model_probe.py # ModelProbe class with persistent JSON cache
│ ├── model_inventory.py # Model inventory management
│ ├── model_locks.py # Concurrency locks for model loading
│ ├── local_model.py # ONNX model wrapper
│ ├── local_models.py # Local model registry
│ ├── local_model_cli.py # CLI for local model operations
│ ├── local_model_discovery.py # Auto-detect models from ~/.cache/aigallery
│ ├── model_bakeoff.py # Multi-model comparison runner
│ ├── bakeoff_tasks.py # Bakeoff task definitions
│ ├── bakeoff_reporting.py # Bakeoff result formatting
│ ├── rank_models.py # Score-based model ranking
│ ├── langchain_adapter.py # LangChain adapter bridge
│ └── windows_ai.py # Windows AI (Phi Silica NPU) integration
├── core/ # 9 modules — shared utilities
│ ├── config.py # ModelConfig, PathConfig, Config dataclasses
│ ├── errors.py # ErrorCode StrEnum, LLMClientError, classify_error
│ ├── cache.py # ResponseCache (SHA-256, 24h TTL, disk-backed)
│ ├── response_cache.py # Low-level cache read/write primitives
│ ├── _encoding.py # Internal encoding helpers
│ ├── tool_init.py # ToolInit bootstrap dataclass + with_retry decorator
│ ├── prompt_db.py # Prompt template database
│ ├── local_media.py # Local media file handling
│ └── model_availability.py # Runtime model availability checks
├── agents/benchmarks/ # 11 modules — benchmark infrastructure
│ ├── registry.py # BenchmarkRegistry + BENCHMARK_DEFINITIONS
│ ├── config.py # BenchmarkConfig dataclass + preset configs
│ ├── datasets.py # Dataset loaders for each benchmark
│ ├── runner.py # BenchmarkRunner (sync)
│ ├── async_runner.py # AsyncBenchmarkRunner
│ ├── llm_evaluator.py # LLM-as-judge evaluator (5 weighted dimensions)
│ ├── scoring.py # Score aggregation and grade assignment
│ ├── reporting.py # Result formatters
│ ├── task.py # BenchmarkTask dataclass
│ ├── loader.py # load_benchmark() entry point
│ └── __init__.py # Public API exports
├── research/ # 2 modules — research library builder
│ ├── library_builder.py # build_library() — scan, classify, consolidate
│ └── helpers.py # URL classification, domain lists
└── tests/                        # 10 test modules — all mocked, no live LLM calls
```

## LLM Client

### Interface
```python
class LLMClient:
    @staticmethod
    def generate_text(
        model_name: str,
        prompt: str,
        system_instruction: str | None = None,
        temperature: float = 0.7,
        max_tokens: int = 4096,
    ) -> str: ...
```
LLMClient is a pure static-method facade. There is no instance state. Thread safety is delegated to the cache layer, which uses a threading.Lock on all read/write operations.
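A minimal call sketch (the model name here is illustrative; any registered prefix works):

```python
from tools.llm.llm_client import LLMClient, LLMClientError

try:
    answer = LLMClient.generate_text(
        model_name="ollama:llama3",   # "ollama:" prefix routes to the Ollama REST adapter
        prompt="Summarize the repo layout in one sentence.",
        system_instruction="Answer concisely.",
        temperature=0.2,
    )
    print(answer)
except LLMClientError as err:
    print(f"{err.code}: model={err.model}")
```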
### Call Flow

```
Caller
└─ LLMClient.generate_text(model_name, prompt, ...)   ← static facade
   ├─ ResponseCache.get(sha256_key)                   ← cache layer
   │    └─ hit  → return cached response
   └─ miss → provider_adapters.dispatch(prefix, ...)
        ├─ local:*         → LocalONNXAdapter
        ├─ ollama:*        → OllamaRESTAdapter
        ├─ windows-ai:*    → WindowsAIAdapter (Phi Silica NPU)
        ├─ azure-foundry:* → AzureFoundryAdapter
        ├─ azure-openai:*  → AzureOpenAIAdapter (slot failover)
        ├─ gh:*            → GitHubModelsAdapter (gh CLI subprocess)
        ├─ openai:*        → OpenAIAdapter
        ├─ gemini:*        → GeminiAdapter
        └─ claude:*        → AnthropicAdapter
```
### Provider Routing
The model_name string prefix selects the backend. Unknown prefixes raise LLMClientError with ErrorCode.UNAVAILABLE_MODEL.
| Prefix | Provider | Backend |
|---|---|---|
| local:* | Local ONNX | onnxruntime-genai; auto-detects from ~/.cache/aigallery or LOCAL_MODEL_PATH |
| ollama:* | Ollama REST | urllib.request to localhost:11434 (default port) |
| windows-ai:* | Windows Copilot Runtime (Phi Silica NPU) | winrt-runtime; requires Windows 11 Copilot+ PC hardware |
| azure-foundry:* | Azure AI Foundry | urllib.request; env: AZURE_FOUNDRY_ENDPOINT, AZURE_FOUNDRY_KEY |
| azure-openai:* | Azure OpenAI Service | openai.AzureOpenAI; numbered slot failover (AZURE_OPENAI_API_KEY_0 through _n) |
| gh:* | GitHub Models | gh CLI subprocess; env: GITHUB_TOKEN; maps 18+ friendly names to GitHub model IDs |
| openai:* | OpenAI | openai.OpenAI SDK; env: OPENAI_API_KEY |
| gemini:* | Google Gemini | google.generativeai SDK; env: GEMINI_API_KEY |
| claude:* | Anthropic Claude | anthropic.Anthropic SDK; env: ANTHROPIC_API_KEY |
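A sketch of what prefix dispatch amounts to. The dispatch table lives in tools/llm/provider_adapters.py; the names _ADAPTERS and dispatch below are illustrative, not the module's actual internals:

```python
# Illustrative sketch of prefix dispatch; names here are hypothetical.
_ADAPTERS: dict[str, str] = {
    "local": "LocalONNXAdapter",
    "ollama": "OllamaRESTAdapter",
    "claude": "AnthropicAdapter",
    # ... remaining prefixes elided ...
}

def dispatch(model_name: str) -> tuple[str, str]:
    """Split 'ollama:llama3' into its adapter name and bare model id."""
    prefix, _, bare_model = model_name.partition(":")
    adapter = _ADAPTERS.get(prefix)
    if adapter is None:
        # The real code raises LLMClientError with ErrorCode.UNAVAILABLE_MODEL.
        raise ValueError(f"unknown provider prefix: {prefix!r}")
    return adapter, bare_model
```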
### Remote Provider Gate
Remote providers (OpenAI, Anthropic, Gemini, Azure variants) are disabled by default to prevent accidental spend during local development and CI runs.
| Environment Variable | Default | Effect |
|---|---|---|
| PROMPTEVAL_ALLOW_REMOTE | unset / 0 | Remote providers raise LLMClientError(PERMISSION_DENIED) |
| PROMPTEVAL_ALLOW_REMOTE=1 | — | All nine providers are active |
Local providers (local:*, ollama:*, windows-ai:*, gh:*) are always allowed.
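A minimal sketch of the gate check, assuming the env var is read at dispatch time (the helper names below are hypothetical):

```python
import os

_REMOTE_PREFIXES = {"openai", "gemini", "claude", "azure-openai", "azure-foundry"}

def _remote_allowed() -> bool:
    # Hypothetical helper: the gate opens only when the var is exactly "1".
    return os.environ.get("PROMPTEVAL_ALLOW_REMOTE") == "1"

def check_gate(prefix: str) -> None:
    if prefix in _REMOTE_PREFIXES and not _remote_allowed():
        # The real code raises LLMClientError with ErrorCode.PERMISSION_DENIED.
        raise PermissionError(
            f"remote provider {prefix!r} blocked; set PROMPTEVAL_ALLOW_REMOTE=1 to enable"
        )
```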
## Response Cache
| Property | Value |
|---|---|
| Key algorithm | SHA-256 of (model_name, prompt, system_instruction, temperature, max_tokens) |
| TTL | 24 hours |
| Storage location | ~/.cache/prompts-eval/responses/ |
| Max size | 500 MB (LRU eviction when exceeded) |
| Thread safety | threading.Lock on all read/write operations |
| Enable | PROMPTS_CACHE_ENABLED=1 |
| Disable | Unset or PROMPTS_CACHE_ENABLED=0 |
Disk layout:
```
~/.cache/prompts-eval/responses/
└── <sha256_hex>.json   # one file per unique (model, prompt, params) combination
```
Each cache entry is a JSON object containing response, model, timestamp, and ttl_seconds fields.
Disable the cache when running evaluations that intentionally test stochastic model behavior (for example, the PatternEvaluator 20-run median computation), or when exercising cache-miss code paths in tests.
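A sketch of how the SHA-256 key could be derived from the call parameters; the exact serialization used by ResponseCache is an assumption here:

```python
import hashlib
import json

def cache_key(model_name: str, prompt: str, system_instruction: str | None,
              temperature: float, max_tokens: int) -> str:
    # Assumed serialization: a stable JSON encoding of every input that
    # influences the response. The real ResponseCache may encode differently.
    payload = json.dumps(
        [model_name, prompt, system_instruction, temperature, max_tokens],
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```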
## Model Probing (ModelProbe)
ModelProbe in tools/llm/model_probe.py provides runtime model availability probing with a persistent JSON backing cache.
- probe(model_name) — checks whether a single model responds to a minimal prompt
- discover_all_models() — runs all provider probes in parallel and returns a consolidated list of available models
- Results are written to a JSON file under ~/.cache/prompts-eval/model_probe/ and reused across sessions until invalidated
Provider-specific probe logic lives in probe_providers_cloud.py and probe_providers_local.py, coordinated by probe_discovery.py.
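A usage sketch, assuming probe() returns a truthy availability flag and ModelProbe needs no constructor arguments:

```python
from tools.llm.model_probe import ModelProbe

probe = ModelProbe()  # assumed no-arg constructor; cache dir defaults as above

# Check one model, then enumerate everything reachable right now.
if probe.probe("ollama:llama3"):          # illustrative model name
    print("ollama:llama3 is responding")

available = probe.discover_all_models()   # results cached across sessions
print(f"{len(available)} models available")
```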
## Error Taxonomy

tools.core.errors defines the ErrorCode StrEnum with ten values, plus helper functions.
| Code | Meaning | Retryable |
|---|---|---|
| SUCCESS | No error | — |
| UNAVAILABLE_MODEL | Model not found or not loaded | No |
| PERMISSION_DENIED | Remote provider blocked by env gate | No |
| RATE_LIMITED | Provider rate limit exceeded | Yes |
| TIMEOUT | Request exceeded timeout | Yes |
| PARSE_ERROR | Could not parse LLM response | Yes |
| FILE_NOT_FOUND | Required file path does not exist | No |
| INVALID_INPUT | Input failed validation | No |
| NETWORK_ERROR | Network-level failure | Yes |
| INTERNAL_ERROR | Unexpected error in client code | No |
```python
class LLMClientError(RuntimeError):
    code: ErrorCode
    model: str                         # model_name that was requested
    original_error: Exception | None
```
Helper functions:
- classify_error(exc: Exception) -> ErrorCode — inspects exception type and message to return the appropriate code
- is_retryable(code: ErrorCode) -> bool — returns True for RATE_LIMITED, TIMEOUT, NETWORK_ERROR, PARSE_ERROR
- is_permanent(code: ErrorCode) -> bool — convenience inverse of is_retryable
Transient error codes (retryable): RATE_LIMITED, TIMEOUT, NETWORK_ERROR, PARSE_ERROR.
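A small sketch of how a caller might combine these helpers (the wrapper itself is illustrative):

```python
from tools.core.errors import classify_error, is_retryable

def call_with_classification(fn):
    """Illustrative: run fn(), classify any failure, and report retryability."""
    try:
        return fn()
    except Exception as exc:  # classification needs the raw exception
        code = classify_error(exc)
        if is_retryable(code):
            print(f"transient failure ({code}); safe to retry")
        else:
            print(f"permanent failure ({code}); giving up")
        raise
```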
## Config (tools.core.config)
```python
@dataclass
class ModelConfig:
    gen_model: str   # from GEN_MODEL env var
    rev_model: str   # from REV_MODEL env var
    ref_model: str   # from REF_MODEL env var

@dataclass
class PathConfig:
    cache_dir: Path
    model_dir: Path
    output_dir: Path

@dataclass
class Config:
    models: ModelConfig
    paths: PathConfig
```
default_config() reads GEN_MODEL, REV_MODEL, REF_MODEL environment variables and returns a Config instance. Defaults are applied for unset variables so the package is usable out of the box for local-only providers.
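Usage is a one-liner; the override below is illustrative:

```python
import os
from tools.core.config import default_config

os.environ.setdefault("GEN_MODEL", "ollama:llama3")  # illustrative override

cfg = default_config()
print(cfg.models.gen_model)   # value of GEN_MODEL, or the package default
print(cfg.paths.cache_dir)    # resolved cache directory
```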
## ToolInit and with_retry
ToolInit is a dataclass for script bootstrap. It performs startup environment and path checks and writes structured JSONL log entries.
```python
@dataclass
class ToolInit:
    check_env: list[str]      # required env var names
    check_models: list[str]   # required model prefixes to verify
    check_paths: list[str]    # required file/dir paths
    log_file: str | None      # JSONL log path; None = no log
    exit_code: int            # 0 = success after all checks
```
ToolInit.run() validates all checks sequentially. On any failure it writes a JSONL error entry and raises SystemExit with a non-zero code.
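A bootstrap sketch for a script entry point; the check values are examples, not required settings:

```python
from tools.core.tool_init import ToolInit

init = ToolInit(
    check_env=["GEN_MODEL"],        # illustrative env check
    check_models=["ollama:"],       # illustrative model prefix check
    check_paths=["./prompts"],      # illustrative path check
    log_file="bootstrap.jsonl",
    exit_code=0,
)
init.run()  # raises SystemExit with a non-zero code on any failed check
```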
### with_retry Decorator
```python
def with_retry(
    max_attempts: int = 3,
    backoff_base: float = 1.0,
    transient_only: bool = True,
) -> Callable: ...
```
Wraps a function with exponential backoff retry logic. When transient_only=True (default), retries only on errors where is_retryable() returns True. Permanent errors such as PERMISSION_DENIED are re-raised immediately without consuming retry budget.
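Applied to a provider call it might look like this (the function body is illustrative):

```python
from tools.core.tool_init import with_retry
from tools.llm.llm_client import LLMClient

@with_retry(max_attempts=5, backoff_base=2.0)  # exponential backoff between tries
def summarize(text: str) -> str:
    # Transient errors (rate limit, timeout, network) are retried;
    # permanent codes such as PERMISSION_DENIED propagate immediately.
    return LLMClient.generate_text(
        model_name="ollama:llama3",   # illustrative model name
        prompt=f"Summarize:\n{text}",
    )
```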
## Benchmark Infrastructure

Located in tools/agents/benchmarks/. Provides a registry of eight named benchmarks plus a custom-local slot, an LLM-as-judge evaluator, preset run configurations, and both sync and async runners.
### Supported Benchmarks

| Benchmark ID | Task Count | Description |
|---|---|---|
| swe-bench | 2,294 | Full GitHub issue → patch set |
| swe-bench-verified | 500 | Manually verified subset |
| swe-bench-lite | 300 | Lightweight subset for fast iteration |
| humaneval | 164 | OpenAI function synthesis benchmark |
| humaneval-plus | 164 | Extended test cases for HumanEval |
| mbpp | 974 | Mostly Basic Python Problems |
| mbpp-sanitized | 427 | Cleaned subset of MBPP |
| codeclash | 100 | Competitive programming problems |
| custom-local | variable | User-provided local benchmark |
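Loading tasks by ID is a single call; a sketch, assuming load_benchmark returns the task list:

```python
from tools.agents.benchmarks.loader import load_benchmark

tasks = load_benchmark("humaneval")   # one BenchmarkTask per problem (assumed)
print(f"{len(tasks)} tasks")          # expected: 164 for humaneval
first = tasks[0]
print(first.task_id, first.pass_criteria)
```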
### BenchmarkTask

```python
@dataclass
class BenchmarkTask:
    task_id: str
    benchmark_id: str
    prompt: str
    instruction: str
    repo: str | None
    base_commit: str | None
    golden_patch: str | None
    test_cases: list[str]
    pass_criteria: str
```
### LLM-as-Judge Evaluator
The benchmark LLM evaluator (llm_evaluator.py) scores responses on a 0.0–10.0 scale with five weighted dimensions.
| Dimension | Weight |
|---|---|
| Completeness | 0.25 |
| Correctness | 0.25 |
| Quality | 0.20 |
| Specificity | 0.15 |
| Alignment | 0.15 |
Grade thresholds: A (≥ 9.0), B (≥ 8.0), C (≥ 7.0), D (≥ 6.0), F (< 6.0).
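The aggregation is a straightforward weighted sum over the five dimensions; a sketch using the weights and grade boundaries above (function names are illustrative):

```python
WEIGHTS = {
    "completeness": 0.25,
    "correctness":  0.25,
    "quality":      0.20,
    "specificity":  0.15,
    "alignment":    0.15,
}

def overall_score(dims: dict[str, float]) -> float:
    """Weighted 0.0-10.0 score from per-dimension judge scores."""
    return sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)

def grade(score: float) -> str:
    for threshold, letter in [(9.0, "A"), (8.0, "B"), (7.0, "C"), (6.0, "D")]:
        if score >= threshold:
            return letter
    return "F"

# Example: strong completeness/correctness, weaker elsewhere.
s = overall_score({"completeness": 9.5, "correctness": 9.0,
                   "quality": 8.0, "specificity": 7.0, "alignment": 8.0})
print(round(s, 3), grade(s))   # 8.475 B
```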
### Preset Configurations

| Preset | Benchmarks Included | Use Case |
|---|---|---|
| quick-test | HumanEval (first 20 tasks) | Fast smoke test |
| swe-bench-eval | SWE-bench-verified | Standard SWE-bench run |
| local-dev | custom-local | Local iteration |
| full-eval | All benchmarks | Complete evaluation suite |
## Research Library Builder
tools.research.library_builder provides build_library(sources: list[str]) -> ResearchLibrary.
- Scans the repository for research materials (markdown files, notebooks, annotated references)
- Classifies source URLs against two domain lists:
  - Approved domains (35+): arxiv.org, acm.org, ieee.org, nature.com, openai.com, anthropic.com, huggingface.co, microsoft.com, and similar authoritative sources
  - Caution domains: unverified blogs, marketing sites
- Consolidates materials into a ResearchLibrary artifact with coverage_score and source_quality_score fields
- Research gating thresholds (used by evaluation pipelines): coverage_score >= 0.80, source_quality_score >= 0.80
Domain classification uses keyword heuristics and falls back to an LLM call for ambiguous URLs.
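End-to-end usage might look like this; the source URL is illustrative, and the gate check mirrors the thresholds above:

```python
from tools.research.library_builder import build_library

# Illustrative source list; build_library scans and classifies these.
library = build_library(["https://arxiv.org/abs/2303.08774"])

passes_gate = (
    library.coverage_score >= 0.80
    and library.source_quality_score >= 0.80
)
print("research gate:", "pass" if passes_gate else "fail")
```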
## Stable Public API
The following symbols form the cross-package boundary consumed by agentic-v2-eval and agentic-workflows-v2. They are stable and must not be removed or changed incompatibly.
| Symbol | Module | Description |
|---|---|---|
| LLMClient | tools.llm.llm_client | Static LLM facade |
| LLMClientError | tools.llm.llm_client | Base error class with code and model fields |
| LangChainAdapter | tools.llm.langchain_adapter | LangChain-compatible adapter wrapping LLMClient |
| ModelProbe | tools.llm.model_probe | Runtime model availability prober with persistent cache |
| ErrorCode | tools.core.errors | StrEnum of 10 error codes |
| classify_error | tools.core.errors | Map exception to ErrorCode |
| is_retryable | tools.core.errors | Check if error code is retryable |
| ResponseCache | tools.core.cache | Disk-backed SHA-256 response cache |
| default_config | tools.core.config | Return Config from environment variables |
| init_tool | tools.core.tool_init | Convenience wrapper for ToolInit.run() |
| with_retry | tools.core.tool_init | Exponential backoff retry decorator |
| BenchmarkRegistry | tools.agents.benchmarks.registry | Registry of all named benchmarks |
| BenchmarkConfig | tools.agents.benchmarks.config | Run configuration dataclass |
| load_benchmark | tools.agents.benchmarks.loader | Load tasks for a named benchmark |
| BENCHMARK_DEFINITIONS | tools.agents.benchmarks.registry | Dict of all registered benchmark metadata |
## Testing
| Property | Value |
|---|---|
| Test modules | 10 |
| Live LLM calls | None — all providers are mocked |
| asyncio mode | auto (pytest-asyncio) |
```
cd tools
pip install -e ".[dev]"
python -m pytest tests/ -v
python -m pytest tests/ --cov=tools --cov-report=term-missing
```
Static analysis: all test doubles satisfy the same structural protocols as the real implementations, so mocks stay in sync with the actual interface.
## Known Issues

### Bare Module Import Paths

Affected files: tools/llm/llm_client.py, tools/core/cache.py
These files contain bare module imports (from cache import ..., from response_cache import ...) that function when the files are executed directly from their containing directory but silently disable caching when the package is installed conventionally. The Python import resolver cannot find cache as a top-level module; the try/except ImportError catches the failure, and caching is skipped without any warning to the caller.
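The failure mode in miniature. This paraphrases the try/except shape described above; the imported name read_entry is hypothetical:

```python
# Paraphrased failure mode (tools/core/cache.py importing its sibling module).
try:
    from response_cache import read_entry   # bare import — resolves only when
except ImportError:                          # tools/core/ is the working dir
    read_entry = None                        # caching silently disabled

def get(key: str):
    if read_entry is None:                   # no warning reaches the caller
        return None                          # behaves as a permanent cache miss
    return read_entry(key)
```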
Workaround: always install the package in editable mode (pip install -e .) from the tools/ directory, or use the uv workspace install, which handles path resolution correctly.
Fix required: Change bare imports to relative imports (from .cache import ..., from .response_cache import ...).