Changelog¶

All notable changes to this project are documented here.

Unreleased ¶

Security (Sprint 1 — Tool Safety & Silent Failures)¶

S1-01 — Sanitization middleware fail-closed — server/middleware/__init__.py now returns HTTP 400 on sanitization errors instead of silently passing unvalidated input downstream. Any unhandled exception in the sanitization pipeline is treated as a block, not a pass-through.
S1-02 — File tools fail-closed without AGENTIC_FILE_BASE_DIR — All six file tools (FileReadTool, FileWriteTool, FileCopyTool, FileMoveTool, FileDeleteTool, DirectoryCreateTool) reject operations with a clear error when AGENTIC_FILE_BASE_DIR is unset, preventing unintentional host filesystem access. .env.example updated with required operator guidance.
S1-03 — run_id traversal test corpus — Added parametrized tests covering path traversal, null-byte injection, and unicode normalization bypass attempts against the existing run_id validator.
S1-04 — _validate_ast attribute-escape closed — The expression evaluator's AST sandbox now rejects attribute access (ast.Attribute nodes) that could escape to __class__.__mro__ chains or dunder traversal. Existing safe-expression corpus extended with adversarial fixtures.
S1-05 — ShellTool allowlist replaces substring blocklist — AGENTIC_SHELL_ALLOWED_COMMANDS env var (comma-separated command names) controls which executables ShellTool may run. When unset, all shell commands are disabled (fail-closed). Defeats double-space, absolute-path, uppercase, and fullwidth-unicode blocklist bypasses. .env.example entry added. _SHELL_METACHARS + _split_command preserved as a second safety layer.
S1-06 — CodeExecutionTool sandbox hardened — __import__ replaced in the wrapper's _safe_builtins with a constrained importer that enforces the same _DANGEROUS_IMPORTS blocklist at runtime, closing __import__('os') bypass while preserving import math etc. resource.setrlimit(RLIMIT_AS=512 MB, RLIMIT_NPROC=32) added via preexec_fn on POSIX to prevent memory-bomb and fork-bomb DoS. Sandbox escape test corpus added.
S1-06 follow-up — loader-traversal escape closed — "sys.modules" and "__loader__" added to _DANGEROUS_PATTERNS, blocking sys.modules['builtins'].__loader__.load_module('os') traversal on Windows. Regression tests added.
S1-07 — Sparse subprocess env across all tools — New agentic_v2/tools/subprocess_utils.py with minimal_subprocess_env() helper. GitTool and ShellTool (including ShellExecTool) now pass env=minimal_subprocess_env() to all subprocess calls, preventing API key leakage to child processes. Test corpus verifies OPENAI_API_KEY, ANTHROPIC_API_KEY, GITHUB_TOKEN, etc. are excluded.

Documentation¶

Architecture umbrella + roadmap + honesty docs (commit 41bd0d8) — new docs/ARCHITECTURE.md index linking the four existing architecture-*.md deep-dives (closes the broken docs/README.md reference); new docs/ROADMAP.md surfacing Epic 1/2/3/5/6 status, explicit Epic 4 tombstone, and proposed Epic 7 scope (closes the in-repo backlog gap flagged by the final review); new docs/KNOWN_LIMITATIONS.md documenting the 35 unresolved agentic-v2-eval mypy findings, the empty-window SLO trivial-pass, the schema-drift root-only blind spot, and the Python→TypeScript manual mirror; new docs/MIGRATIONS.md with first entries for the presentation/ extraction, the AgentProtocol.run signature tightening, and the langchain adapter deprecation.
Epic retrospectives + three new ADRs (commit 7b082fd) — retrospective plan docs for Epics 3 (DevEx), 5 (UI Polish), and 6 (Eval Depth), covering stories, commits, load-bearing decisions, and process notes; ADR-014 (Pydantic discriminated-union wire format for execution events), ADR-015 (SLO rolling window stored in git as JSON artifacts — acknowledges the empty-window trivial-pass as open debt), ADR-016 (GitHub Models via GITHUB_TOKEN as default E2E LLM provider — documents the zero-cost-vs-vendor-coupling trade-off); ADR-INDEX.md refreshed from 10 → 13 ADRs with updated lineage chains. Archive headers added to the Epic 1/2 prospective plan docs so unchecked boxes read as history rather than WIP.
Stale-artifact triage and count-drift fixes (commit 205b314) — deletes ~2,050 lines of orphaned content: MCP_IMPLEMENTATION_{COMPLETE,PLAN,STATUS}.md, LANGCHAIN_MIGRATION_PLAN.md (directly contradicted ADR-013), chatlg.md (1052-line raw conversation log at agentic-v2-eval/), playwright-tester-training-prompt.md, and several handoff.md stubs. Count drift corrected in ONBOARDING.md / GLOSSARY.md / CLAUDE.md (24 → 7 agent personas, 78+ → 100+ test files, 10 → 6 workflow definitions). Corrected a factual claim in docs/MIGRATIONS.md: presentation/ is not fully gone — the top-level directory retains leftover theme-collection scripts and raw-themes data pending a follow-up cleanup.
Post-v0.3.0 doc cleanup (commit e7c2a69) — drops superseded IMPLEMENTATION_SUMMARY.md, MASTER_MANIFEST.md, and docs/contribution-guide.md (all replaced by newer artifacts landed during the doc overhaul); relocates docs/eval-harness/* planning artifacts to planning-artifacts/eval-harness/ so the user-facing docs/ tree only contains user-facing content; adds a ⚠️ STALE banner to ACTIVE_VS_LEGACY_TOOLING_MAP.md linking to current sources of truth rather than rewriting it blind.

Cleanup & Refactoring¶

UI build artifacts untracked (commit 9522baf) — agentic-workflows-v2/ui/dist/index.html and agentic-workflows-v2/ui/tsconfig.tsbuildinfo are now in .gitignore; they were showing up dirty after every npm run build and polluting every PR diff. Fulfills the spawned task filed during Sprint A triage.

Stabilization (Sprint B)¶

SLO p95 empty-window trivial-pass fixed (commit c9c4f33, Sprint B #2) — readP95 in agentic-workflows-v2/ui/e2e/slo-storage.ts previously returned 0 on an empty rolling window, which silently satisfied the <= 2000ms assertion and produced a permanently green gate after any slo-data branch reset. Now throws a new InsufficientDataError when the window has fewer than DEFAULT_MIN_SAMPLES (= 10) records, and the Playwright spec converts that to test.skip(...) during bootstrap. Bootstrap semantics: deferred, not passed. From the 11th nightly onward the p95 budget is a hard gate.
All 35 agentic-v2-eval mypy findings cleared (commit ed78ee2, Sprint B #1) — dropped continue-on-error: true from .github/workflows/eval-package-ci.yml in the same commit. Workstream breakdown: types-PyYAML added (2 errors), _eval_one refactored to a discriminated union tuple[Literal[True], R] | tuple[Literal[False], Exception] (6 errors across runners/streaming.py), optional-import guards via _require_*_module() helpers in datasets.py (6 union-attr + 1 no-any-return), 10 missing annotations added, 10 no-any-return fixes via typed locals. 241 tests still pass. One new # type: ignore[return-value] at runners/streaming.py:199 documented with justification (inspect.isawaitable narrowing limitation).
Wire-format drift gate (commit ae3878c, Sprint B #3) — new automated Python ↔ TypeScript mirror for agentic_v2.contracts.events.ExecutionEvent. agentic_v2/contracts/events.py is the source of truth; scripts/generate_ts_types.py emits tests/schemas/events.schema.json; ui/scripts/generate-ts-types.mjs uses json-schema-to-typescript to emit ui/src/api/events.generated.ts. New wire-format-drift CI job regenerates and fails the PR on mismatch. The migration caught three latent client-type mismatches (status: StepStatus vs. the wire string, non-nullable input/output that are actually nullable, criteria shape drift on EvaluationCompleteEvent), now coerced at the useWorkflowStream.ts boundary. Client-only transport events (error, keepalive, connection_established) live in a hand-defined ChannelEvent union since they are not in the Python contract. Contributor docs updated.
Sprint B #5 — Placeholder / no-LLM mode. AGENTIC_NO_LLM=1 installs a deterministic placeholder at both engine chokepoints (get_client() → MockBackend; get_chat_model() → PlaceholderChatModel). Native and LangChain engines both run end-to-end without provider credentials. Not a simulator — structured parsers and evaluation runs still need real keys. (commit c2aff71)

0.3.0 — 2026-04-22¶

First tracked release of the agentic-workflows-v2 platform. Bundles Epic 1 (Platform Foundation), Epic 2 (Observable Execution), Epic 3 (DevEx / Windows), Epic 5 (Console UI Polish), and Epic 6 (Evaluation & Data Depth). See Known Limitations below for items shipped with caveats and Migration Notes for breaking changes vs. the prior unversioned state.

Note on Epic 4. Epic numbering jumps from 3 to 5. Epic 4 was never authored in this repository — the number was skipped during planning. Not a regression, just a tombstone for the record.

New Features¶

Epic 1 — Platform Foundation (agentic-workflows-v2)
Typed core protocols — AgentProtocol.run no longer accepts/returns Any; signature tightened to object so type checkers stop treating agent I/O as opaque. Companion ToolProtocol conformance test added.
Consolidated Settings — All environment variable reads routed through a single typed pydantic-settings class; scattered os.environ lookups removed so misconfigured deployments fail fast at startup instead of deep inside a run.
Adapter registry test isolation — Autouse fixture resets the AdapterRegistry singleton between tests, eliminating cross-test leakage and flakes when suites register alternate engine backends.
Schema-drift CI gate — New snapshot test on contracts/ Pydantic models fails the build on any unreviewed wire-format change; scripts/generate_schemas.py refreshes the canonical snapshot.
OTEL parent-child trace assertion — Regression test verifies the engine → agent span chain is preserved end-to-end so distributed traces remain connected in Jaeger / Tempo.
Golden-output regression test — Deterministic fixture locks the code_review workflow's final output; any behavioral drift in the native engine trips the test.
CI lint + coverage enforcement — Ruff runs as a required job and the 80% coverage floor is now enforced in GitHub Actions rather than advisory.
MCP results cleanup — Ruff UP006 / S324 fixes across mcp/results remove legacy typing forms and insecure hash defaults.
Epic 2 — Observable Execution (agentic-workflows-v2)
Typed execution-event wire format — New contracts/events.py Pydantic discriminated union covers workflow_start, step_start, step_end, step_complete, step_error, workflow_end, and evaluation_*. WebSocket and SSE broadcasts validate before emit; client union in ui/src/api/types.ts stays in lockstep.
Live DAG animation — @xyflow/react nodes and edges animate through queued → running → complete / error states as events arrive on /ws/execution/{run_id}, so users can watch a run progress without opening server logs.
StepNode B2 redesign — Each node now renders an ASCII status chip, LLM tier pill, token counter, and a streaming-progress bar driven by live events.
Step drill-down panel — Click a node to see a five-field detail pane (inputs, outputs, status, timing, errors) that handles partial state gracefully while a step is still running.
Playwright streaming PR gate — New E2E job runs the streaming flow 5× per PR using data-testid hooks added across the UI; any single failure blocks merge.
WebSocket reconnect-replay test — Fault-injection E2E kills the socket mid-run and asserts the server's 500-event replay buffer restores UI state on reconnect.
Time-to-first-span SLO + p95 gate — New measurement test records first-span latency and fails the build if the p95 regresses past the contract.
Nightly 50× reliability job — Streaming E2E runs 50× on a nightly cron with a rolling flake-rate gate; reconnect and p95 contracts hardened alongside.
Epic 3 — DevEx / Windows (agentic-workflows-v2)
Windows bring-up hardening — scripts/setup-dev.ps1 one-command bootstrap validated in CI so a fresh Windows clone installs deps, validates workflows, and runs a smoke test without manual intervention.
port-guard devex tool — Detects and reports processes holding dev ports (8010 / 5173 / 6006) before server startup so conflicts surface with an actionable message instead of a cryptic bind error.
workspace-test-runner tool — Single entry point to run the right test suite for whichever package you're in (backend pytest, eval pytest, UI vitest).
workflow-linter tool — Validates YAML workflow definitions against the required-fields contract (name, agent, description, depends_on, inputs, outputs) before they reach the runtime.
Windows Unicode CLI fix — agentic CLI no longer crashes on the Windows default codepage when rendering non-ASCII output; CLI verification added to CI to prevent regressions.
Epic 6 — Evaluation & Data Depth (agentic-workflows-v2)
Contracts — Additive Pydantic v2 models (EvaluationCriterionDetail, ScoreLayersModel, HardGatesModel, FloorViolationModel, RunEvaluationDetail, DatasetSampleSummary, …) mirrored as TypeScript interfaces in ui/src/api/types.ts; EvaluationCompleteEvent expanded with passed, pass_threshold, criteria.
Tokens-30d stat — RunsSummaryResponse gains tokens_30d; run_logger.summary() aggregates tokens across all runs started in the last 30 days; Dashboard live stat wired to real data.
GET /runs/{filename}/evaluation — Returns full rubric breakdown for any stored run, including criteria scores, score layers, hard gates, and floor violations.
Dataset sample endpoints — GET /eval/datasets/sample-list?dataset_id=…&limit=… and GET /eval/datasets/sample-detail?dataset_id=…&sample_index=…; dataset IDs use query params to handle slash characters.
Evaluations page — EvaluationRubricAccordion with lazy [+]/[-] expansion: criterion table (normalized %, ASCII progress bar, [FLOOR] badge), score layers, hard gate [OK]/[FAIL] rows, floor violation list.
Datasets page — 3-pane browser: dataset catalog → SampleIndexGrid (paginated [<]/[>]) → DatasetDetailPane (collapsible [meta +/-], field rendering, JSON viewer, workflow preview badge).
Epic 5 — Console UI Polish (agentic-workflows-v2/ui)
StatusBadge migrated to ASCII bracket format: [OK ] [RUN] [ERR] [WARN] using --b-* CSS tokens; works across dark / paper / bolt themes.
useHotkeys hook — global keyboard shortcuts (n / f / / / j / k / Esc) with input-focus guard and unmount cleanup.
Dashboard filter — / and f focus the filter input; Esc clears and blurs; narrows runs by workflow name or run ID.
State pages — EmptyState ($ no <entity> yet), ErrorBanner ([!] {msg}), NotFoundPage (404 terminal-style), AppErrorBoundary (React error boundary).
Skip-to-main link — visually hidden, appears on first Tab; <main id="main-content"> as target.
Focus ring audit — focus:ring-1 focus:ring-b-clay/50 added to all interactive elements; audit notes at docs/a11y-focus-ring-audit.md.
Paper theme contrast QA — --b-text-dim on --b-bg1 verified at 7.45:1 (passes AA); bolt 5.92:1; dark 3.80:1 (dim tier, intentional).
BDagMini — pure SVG static DAG thumbnail; reuses layoutDAG (Kahn topological sort); linear chains render as vertical, parallel branches center-aligned per rank; themed via CSS vars.
skill-architect agent — New AI persona specialized in designing, extracting, and refactoring skills as reusable prompt programs. Added to the canonical agent roster with full documentation.
verify-and-correct skill — Bounded self-correction loop: automatically runs tests, lint, and type checks after code changes, then retries fixes (up to a limit) before reporting back. Reduces back-and-forth on build failures.
session-plan skill — Plan a focused session with 1–2 goals, explicit success criteria, and a TODO checklist. Prevents mega-sessions that hit rate limits by scoping work upfront.
Headless prompts — New infrastructure for running agent prompts without an interactive session.
Adapter registry CLI wiring (ADR-001) — Orchestrator, workflow runner, and CLI now route through the adapter registry, making engine backends (native DAG, LangGraph) fully swappable at runtime.
SmartModelRouter hardening (ADR-002) — Added persistence across restarts, Retry-After header support, and degraded-mode fallback so the router stays operational when providers are rate-limited or unavailable.

Improvements¶

CI pipeline upgrades — Updated actions/setup-python from v4 → v5 across all workflows. Added cross-package E2E job, expanded security scanning with pip-audit, and tightened dependency review to fail on severity moderate.
Release pipeline — New tag-triggered deploy.yml with build provenance for reproducible releases.
Performance regression detection — Rewrote performance-benchmark.yml to compare against a stored baseline and fail on detected regressions.
Secret sanitization rule — Always-on sanitization middleware now redacts API keys, tokens, and passwords before they reach LLM context. Covers all agent sessions by default.
Backlog tracking — Added 11 new backlog tickets (rows 34–44) covering architectural debt, test gaps, and documentation tasks.

Security¶

P0 tool-layer hardening — http_ops, search_ops, and shell_ops received critical security fixes: SSRF URL validation, shell injection blocklist (20+ blocked patterns), shlex.split with shell=False, path traversal guards, and __builtins__ restriction in the code execution sandbox.

Bug Fixes & Documentation¶

Fixed factual drift across monorepo documentation (CLAUDE.md, AGENTS.md, README files, ADRs).
Corrected 40+ dangling references to deleted prompts and removed workflows.
Removed 23 duplicate tests, fixed 2 broken tests as part of ADR-008 test coverage overhaul.
Aligned stale agent config entries with the current agent roster.

Test Coverage (ADR-008)¶

Added tests for 12+ previously untested modules, including evaluation pipeline, workflow pipeline, server backends, LLM judge, and scoring.
Hardened server test coverage for streaming backends and session handling.
Overall test coverage for agentic_v2 raised toward the 80% target.

Cleanup & Refactoring¶

Presentation system extracted — The presentation/deck builder has been moved to its own standalone repository (present). 245 files removed from this repo; raw-themes data and scripts are preserved in the new repo.
Removed deprecated agent prompts, workflow definitions, Copilot config, and stale GitHub Actions workflows.
Removed dead prompt constants, obsolete artifact files, and cleaned up .gitignore.

Migration Notes¶

presentation/ system extracted to a standalone repo (present). If you imported anything from presentation/* or ran its deck-builder / token-generation scripts, those paths no longer exist here. 245 files moved.
What moved: brutalist deck builder, JSX + PPTX slide generators, theme registry, raw-theme token configs, Storybook catalog, and the accompanying test suite.
What stayed: decks-generated/ (the output artifacts from the old builder) remains in this repo until downstream consumers migrate.
Action required: Update imports of presentation.* or src/tokens/* to point at the new present repo, OR pin a pre-2026-04 commit of this repo if you are not ready to migrate.
agentic_v2.core.protocols.AgentProtocol.run signature tightened from Any to object. Code that relied on implicit-Any call sites may now surface previously hidden mypy errors; update call sites to use the bounded TypeVars (TInput / TOutput) from agentic_v2.agents.base.
langchain adapter is deprecated in favor of the native DAG engine (see ADR-013). Importing agentic_v2.langchain.* still works but emits a DeprecationWarning. Plan to migrate off before v0.5.0.

Known Limitations¶

This release ships with the following known issues. All are tracked for Sprint B or later; none block the primary workflows documented in docs/ONBOARDING.md.

35 mypy findings in agentic-v2-eval/ are now visible in CI (continue-on-error: true, not blocking merges). One of them (agentic_v2_eval/runners/streaming.py:216-235) raises an object that is not derived from BaseException and is a real bug worth fixing in Sprint B. Tracked as "Sprint B #7-followup" in eval-package-ci.yml.
SLO p95 gate can pass trivially on empty data. readP95({ windowDays: 7 }) in ui/e2e/slo-storage.ts returns 0 when the rolling window has no records, which silently satisfies the <= 2000ms assertion. Mitigated by the nightly workflow appending one sample per run, but the first night after an slo-data branch reset is a free pass.
Schema-drift guard is root-properties-only. tests/test_schema_drift.py checks the top-level properties of covered Pydantic models; field removals on nested models referenced via $defs, as well as any type narrowing (e.g., str -> Literal[...]), pass silently. Tracked for extension in Sprint B.
No true placeholder / no-LLM mode. Both local runs and CI require at least one configured LLM provider. In CI we use GITHUB_TOKEN + models: read to reach GitHub Models (zero-cost on public repos). The Epic 2 plan referenced a placeholder toggle, but the toggle was not implemented. See ADR-016 (planned) for the rationale.
Python -> TypeScript wire-format mirror is manual. agentic_v2/contracts/events.py is hand-mirrored in ui/src/api/types.ts; there is no codegen or drift-detection gate. An intentional schema change must be made in both files; a missed edit ships as a silent frontend bug.
UI build artifacts are git-tracked. agentic-workflows-v2/ui/dist/index.html and agentic-workflows-v2/ui/tsconfig.tsbuildinfo are historical tracked files that re-dirty after every npm run build. A follow-up task is filed; expect git status noise after building the UI locally.

Earlier History¶

For changes prior to March 2026, see the git log:

git log --oneline --before="2026-03-01"

Changelog¶

Unreleased¶

Security (Sprint 1 — Tool Safety & Silent Failures)¶

Documentation¶

Cleanup & Refactoring¶

Stabilization (Sprint B)¶

0.3.0 — 2026-04-22¶

New Features¶

Improvements¶

Security¶

Bug Fixes & Documentation¶

Test Coverage (ADR-008)¶

Cleanup & Refactoring¶

Migration Notes¶

Known Limitations¶

Earlier History¶

Version Links¶

Unreleased ¶