Skip to content

Changelog

All notable changes to this project are documented here.


Unreleased

Security (Sprint 1 — Tool Safety & Silent Failures)

  • S1-01 — Sanitization middleware fail-closedserver/middleware/__init__.py now returns HTTP 400 on sanitization errors instead of silently passing unvalidated input downstream. Any unhandled exception in the sanitization pipeline is treated as a block, not a pass-through.
  • S1-02 — File tools fail-closed without AGENTIC_FILE_BASE_DIR — All six file tools (FileReadTool, FileWriteTool, FileCopyTool, FileMoveTool, FileDeleteTool, DirectoryCreateTool) reject operations with a clear error when AGENTIC_FILE_BASE_DIR is unset, preventing unintentional host filesystem access. .env.example updated with required operator guidance.
  • S1-03 — run_id traversal test corpus — Added parametrized tests covering path traversal, null-byte injection, and unicode normalization bypass attempts against the existing run_id validator.
  • S1-04 — _validate_ast attribute-escape closed — The expression evaluator's AST sandbox now rejects attribute access (ast.Attribute nodes) that could escape to __class__.__mro__ chains or dunder traversal. Existing safe-expression corpus extended with adversarial fixtures.
  • S1-05 — ShellTool allowlist replaces substring blocklistAGENTIC_SHELL_ALLOWED_COMMANDS env var (comma-separated command names) controls which executables ShellTool may run. When unset, all shell commands are disabled (fail-closed). Defeats double-space, absolute-path, uppercase, and fullwidth-unicode blocklist bypasses. .env.example entry added. _SHELL_METACHARS + _split_command preserved as a second safety layer.
  • S1-06 — CodeExecutionTool sandbox hardened__import__ replaced in the wrapper's _safe_builtins with a constrained importer that enforces the same _DANGEROUS_IMPORTS blocklist at runtime, closing __import__('os') bypass while preserving import math etc. resource.setrlimit(RLIMIT_AS=512 MB, RLIMIT_NPROC=32) added via preexec_fn on POSIX to prevent memory-bomb and fork-bomb DoS. Sandbox escape test corpus added.
  • S1-06 follow-up — loader-traversal escape closed"sys.modules" and "__loader__" added to _DANGEROUS_PATTERNS, blocking sys.modules['builtins'].__loader__.load_module('os') traversal on Windows. Regression tests added.
  • S1-07 — Sparse subprocess env across all tools — New agentic_v2/tools/subprocess_utils.py with minimal_subprocess_env() helper. GitTool and ShellTool (including ShellExecTool) now pass env=minimal_subprocess_env() to all subprocess calls, preventing API key leakage to child processes. Test corpus verifies OPENAI_API_KEY, ANTHROPIC_API_KEY, GITHUB_TOKEN, etc. are excluded.

Documentation

  • Architecture umbrella + roadmap + honesty docs (commit 41bd0d8) — new docs/ARCHITECTURE.md index linking the four existing architecture-*.md deep-dives (closes the broken docs/README.md reference); new docs/ROADMAP.md surfacing Epic 1/2/3/5/6 status, explicit Epic 4 tombstone, and proposed Epic 7 scope (closes the in-repo backlog gap flagged by the final review); new docs/KNOWN_LIMITATIONS.md documenting the 35 unresolved agentic-v2-eval mypy findings, the empty-window SLO trivial-pass, the schema-drift root-only blind spot, and the Python→TypeScript manual mirror; new docs/MIGRATIONS.md with first entries for the presentation/ extraction, the AgentProtocol.run signature tightening, and the langchain adapter deprecation.
  • Epic retrospectives + three new ADRs (commit 7b082fd) — retrospective plan docs for Epics 3 (DevEx), 5 (UI Polish), and 6 (Eval Depth), covering stories, commits, load-bearing decisions, and process notes; ADR-014 (Pydantic discriminated-union wire format for execution events), ADR-015 (SLO rolling window stored in git as JSON artifacts — acknowledges the empty-window trivial-pass as open debt), ADR-016 (GitHub Models via GITHUB_TOKEN as default E2E LLM provider — documents the zero-cost-vs-vendor-coupling trade-off); ADR-INDEX.md refreshed from 10 → 13 ADRs with updated lineage chains. Archive headers added to the Epic 1/2 prospective plan docs so unchecked boxes read as history rather than WIP.
  • Stale-artifact triage and count-drift fixes (commit 205b314) — deletes ~2,050 lines of orphaned content: MCP_IMPLEMENTATION_{COMPLETE,PLAN,STATUS}.md, LANGCHAIN_MIGRATION_PLAN.md (directly contradicted ADR-013), chatlg.md (1052-line raw conversation log at agentic-v2-eval/), playwright-tester-training-prompt.md, and several handoff.md stubs. Count drift corrected in ONBOARDING.md / GLOSSARY.md / CLAUDE.md (24 → 7 agent personas, 78+ → 100+ test files, 10 → 6 workflow definitions). Corrected a factual claim in docs/MIGRATIONS.md: presentation/ is not fully gone — the top-level directory retains leftover theme-collection scripts and raw-themes data pending a follow-up cleanup.
  • Post-v0.3.0 doc cleanup (commit e7c2a69) — drops superseded IMPLEMENTATION_SUMMARY.md, MASTER_MANIFEST.md, and docs/contribution-guide.md (all replaced by newer artifacts landed during the doc overhaul); relocates docs/eval-harness/* planning artifacts to planning-artifacts/eval-harness/ so the user-facing docs/ tree only contains user-facing content; adds a ⚠️ STALE banner to ACTIVE_VS_LEGACY_TOOLING_MAP.md linking to current sources of truth rather than rewriting it blind.

Cleanup & Refactoring

  • UI build artifacts untracked (commit 9522baf) — agentic-workflows-v2/ui/dist/index.html and agentic-workflows-v2/ui/tsconfig.tsbuildinfo are now in .gitignore; they were showing up dirty after every npm run build and polluting every PR diff. Fulfills the spawned task filed during Sprint A triage.

Stabilization (Sprint B)

  • SLO p95 empty-window trivial-pass fixed (commit c9c4f33, Sprint B #2) — readP95 in agentic-workflows-v2/ui/e2e/slo-storage.ts previously returned 0 on an empty rolling window, which silently satisfied the <= 2000ms assertion and produced a permanently green gate after any slo-data branch reset. Now throws a new InsufficientDataError when the window has fewer than DEFAULT_MIN_SAMPLES (= 10) records, and the Playwright spec converts that to test.skip(...) during bootstrap. Bootstrap semantics: deferred, not passed. From the 11th nightly onward the p95 budget is a hard gate.
  • All 35 agentic-v2-eval mypy findings cleared (commit ed78ee2, Sprint B #1) — dropped continue-on-error: true from .github/workflows/eval-package-ci.yml in the same commit. Workstream breakdown: types-PyYAML added (2 errors), _eval_one refactored to a discriminated union tuple[Literal[True], R] | tuple[Literal[False], Exception] (6 errors across runners/streaming.py), optional-import guards via _require_*_module() helpers in datasets.py (6 union-attr + 1 no-any-return), 10 missing annotations added, 10 no-any-return fixes via typed locals. 241 tests still pass. One new # type: ignore[return-value] at runners/streaming.py:199 documented with justification (inspect.isawaitable narrowing limitation).
  • Wire-format drift gate (commit ae3878c, Sprint B #3) — new automated Python ↔ TypeScript mirror for agentic_v2.contracts.events.ExecutionEvent. agentic_v2/contracts/events.py is the source of truth; scripts/generate_ts_types.py emits tests/schemas/events.schema.json; ui/scripts/generate-ts-types.mjs uses json-schema-to-typescript to emit ui/src/api/events.generated.ts. New wire-format-drift CI job regenerates and fails the PR on mismatch. The migration caught three latent client-type mismatches (status: StepStatus vs. the wire string, non-nullable input/output that are actually nullable, criteria shape drift on EvaluationCompleteEvent), now coerced at the useWorkflowStream.ts boundary. Client-only transport events (error, keepalive, connection_established) live in a hand-defined ChannelEvent union since they are not in the Python contract. Contributor docs updated.
  • Sprint B #5 — Placeholder / no-LLM mode. AGENTIC_NO_LLM=1 installs a deterministic placeholder at both engine chokepoints (get_client()MockBackend; get_chat_model()PlaceholderChatModel). Native and LangChain engines both run end-to-end without provider credentials. Not a simulator — structured parsers and evaluation runs still need real keys. (commit c2aff71)

0.3.0 — 2026-04-22

First tracked release of the agentic-workflows-v2 platform. Bundles Epic 1 (Platform Foundation), Epic 2 (Observable Execution), Epic 3 (DevEx / Windows), Epic 5 (Console UI Polish), and Epic 6 (Evaluation & Data Depth). See Known Limitations below for items shipped with caveats and Migration Notes for breaking changes vs. the prior unversioned state.

Note on Epic 4. Epic numbering jumps from 3 to 5. Epic 4 was never authored in this repository — the number was skipped during planning. Not a regression, just a tombstone for the record.

New Features

  • Epic 1 — Platform Foundation (agentic-workflows-v2)
  • Typed core protocolsAgentProtocol.run no longer accepts/returns Any; signature tightened to object so type checkers stop treating agent I/O as opaque. Companion ToolProtocol conformance test added.
  • Consolidated Settings — All environment variable reads routed through a single typed pydantic-settings class; scattered os.environ lookups removed so misconfigured deployments fail fast at startup instead of deep inside a run.
  • Adapter registry test isolation — Autouse fixture resets the AdapterRegistry singleton between tests, eliminating cross-test leakage and flakes when suites register alternate engine backends.
  • Schema-drift CI gate — New snapshot test on contracts/ Pydantic models fails the build on any unreviewed wire-format change; scripts/generate_schemas.py refreshes the canonical snapshot.
  • OTEL parent-child trace assertion — Regression test verifies the engine → agent span chain is preserved end-to-end so distributed traces remain connected in Jaeger / Tempo.
  • Golden-output regression test — Deterministic fixture locks the code_review workflow's final output; any behavioral drift in the native engine trips the test.
  • CI lint + coverage enforcement — Ruff runs as a required job and the 80% coverage floor is now enforced in GitHub Actions rather than advisory.
  • MCP results cleanup — Ruff UP006 / S324 fixes across mcp/results remove legacy typing forms and insecure hash defaults.

  • Epic 2 — Observable Execution (agentic-workflows-v2)

  • Typed execution-event wire format — New contracts/events.py Pydantic discriminated union covers workflow_start, step_start, step_end, step_complete, step_error, workflow_end, and evaluation_*. WebSocket and SSE broadcasts validate before emit; client union in ui/src/api/types.ts stays in lockstep.
  • Live DAG animation@xyflow/react nodes and edges animate through queued → running → complete / error states as events arrive on /ws/execution/{run_id}, so users can watch a run progress without opening server logs.
  • StepNode B2 redesign — Each node now renders an ASCII status chip, LLM tier pill, token counter, and a streaming-progress bar driven by live events.
  • Step drill-down panel — Click a node to see a five-field detail pane (inputs, outputs, status, timing, errors) that handles partial state gracefully while a step is still running.
  • Playwright streaming PR gate — New E2E job runs the streaming flow 5× per PR using data-testid hooks added across the UI; any single failure blocks merge.
  • WebSocket reconnect-replay test — Fault-injection E2E kills the socket mid-run and asserts the server's 500-event replay buffer restores UI state on reconnect.
  • Time-to-first-span SLO + p95 gate — New measurement test records first-span latency and fails the build if the p95 regresses past the contract.
  • Nightly 50× reliability job — Streaming E2E runs 50× on a nightly cron with a rolling flake-rate gate; reconnect and p95 contracts hardened alongside.

  • Epic 3 — DevEx / Windows (agentic-workflows-v2)

  • Windows bring-up hardeningscripts/setup-dev.ps1 one-command bootstrap validated in CI so a fresh Windows clone installs deps, validates workflows, and runs a smoke test without manual intervention.
  • port-guard devex tool — Detects and reports processes holding dev ports (8010 / 5173 / 6006) before server startup so conflicts surface with an actionable message instead of a cryptic bind error.
  • workspace-test-runner tool — Single entry point to run the right test suite for whichever package you're in (backend pytest, eval pytest, UI vitest).
  • workflow-linter tool — Validates YAML workflow definitions against the required-fields contract (name, agent, description, depends_on, inputs, outputs) before they reach the runtime.
  • Windows Unicode CLI fixagentic CLI no longer crashes on the Windows default codepage when rendering non-ASCII output; CLI verification added to CI to prevent regressions.

  • Epic 6 — Evaluation & Data Depth (agentic-workflows-v2)

  • Contracts — Additive Pydantic v2 models (EvaluationCriterionDetail, ScoreLayersModel, HardGatesModel, FloorViolationModel, RunEvaluationDetail, DatasetSampleSummary, …) mirrored as TypeScript interfaces in ui/src/api/types.ts; EvaluationCompleteEvent expanded with passed, pass_threshold, criteria.
  • Tokens-30d statRunsSummaryResponse gains tokens_30d; run_logger.summary() aggregates tokens across all runs started in the last 30 days; Dashboard live stat wired to real data.
  • GET /runs/{filename}/evaluation — Returns full rubric breakdown for any stored run, including criteria scores, score layers, hard gates, and floor violations.
  • Dataset sample endpointsGET /eval/datasets/sample-list?dataset_id=…&limit=… and GET /eval/datasets/sample-detail?dataset_id=…&sample_index=…; dataset IDs use query params to handle slash characters.
  • Evaluations pageEvaluationRubricAccordion with lazy [+]/[-] expansion: criterion table (normalized %, ASCII progress bar, [FLOOR] badge), score layers, hard gate [OK]/[FAIL] rows, floor violation list.
  • Datasets page — 3-pane browser: dataset catalog → SampleIndexGrid (paginated [<]/[>]) → DatasetDetailPane (collapsible [meta +/-], field rendering, JSON viewer, workflow preview badge).

  • Epic 5 — Console UI Polish (agentic-workflows-v2/ui)

  • StatusBadge migrated to ASCII bracket format: [OK ] [RUN] [ERR] [WARN] using --b-* CSS tokens; works across dark / paper / bolt themes.
  • useHotkeys hook — global keyboard shortcuts (n / f / / / j / k / Esc) with input-focus guard and unmount cleanup.
  • Dashboard filter — / and f focus the filter input; Esc clears and blurs; narrows runs by workflow name or run ID.
  • State pages — EmptyState ($ no <entity> yet), ErrorBanner ([!] {msg}), NotFoundPage (404 terminal-style), AppErrorBoundary (React error boundary).
  • Skip-to-main link — visually hidden, appears on first Tab; <main id="main-content"> as target.
  • Focus ring audit — focus:ring-1 focus:ring-b-clay/50 added to all interactive elements; audit notes at docs/a11y-focus-ring-audit.md.
  • Paper theme contrast QA — --b-text-dim on --b-bg1 verified at 7.45:1 (passes AA); bolt 5.92:1; dark 3.80:1 (dim tier, intentional).
  • BDagMini — pure SVG static DAG thumbnail; reuses layoutDAG (Kahn topological sort); linear chains render as vertical, parallel branches center-aligned per rank; themed via CSS vars.

  • skill-architect agent — New AI persona specialized in designing, extracting, and refactoring skills as reusable prompt programs. Added to the canonical agent roster with full documentation.

  • verify-and-correct skill — Bounded self-correction loop: automatically runs tests, lint, and type checks after code changes, then retries fixes (up to a limit) before reporting back. Reduces back-and-forth on build failures.
  • session-plan skill — Plan a focused session with 1–2 goals, explicit success criteria, and a TODO checklist. Prevents mega-sessions that hit rate limits by scoping work upfront.
  • Headless prompts — New infrastructure for running agent prompts without an interactive session.
  • Adapter registry CLI wiring (ADR-001) — Orchestrator, workflow runner, and CLI now route through the adapter registry, making engine backends (native DAG, LangGraph) fully swappable at runtime.
  • SmartModelRouter hardening (ADR-002) — Added persistence across restarts, Retry-After header support, and degraded-mode fallback so the router stays operational when providers are rate-limited or unavailable.

Improvements

  • CI pipeline upgrades — Updated actions/setup-python from v4 → v5 across all workflows. Added cross-package E2E job, expanded security scanning with pip-audit, and tightened dependency review to fail on severity moderate.
  • Release pipeline — New tag-triggered deploy.yml with build provenance for reproducible releases.
  • Performance regression detection — Rewrote performance-benchmark.yml to compare against a stored baseline and fail on detected regressions.
  • Secret sanitization rule — Always-on sanitization middleware now redacts API keys, tokens, and passwords before they reach LLM context. Covers all agent sessions by default.
  • Backlog tracking — Added 11 new backlog tickets (rows 34–44) covering architectural debt, test gaps, and documentation tasks.

Security

  • P0 tool-layer hardeninghttp_ops, search_ops, and shell_ops received critical security fixes: SSRF URL validation, shell injection blocklist (20+ blocked patterns), shlex.split with shell=False, path traversal guards, and __builtins__ restriction in the code execution sandbox.

Bug Fixes & Documentation

  • Fixed factual drift across monorepo documentation (CLAUDE.md, AGENTS.md, README files, ADRs).
  • Corrected 40+ dangling references to deleted prompts and removed workflows.
  • Removed 23 duplicate tests, fixed 2 broken tests as part of ADR-008 test coverage overhaul.
  • Aligned stale agent config entries with the current agent roster.

Test Coverage (ADR-008)

  • Added tests for 12+ previously untested modules, including evaluation pipeline, workflow pipeline, server backends, LLM judge, and scoring.
  • Hardened server test coverage for streaming backends and session handling.
  • Overall test coverage for agentic_v2 raised toward the 80% target.

Cleanup & Refactoring

  • Presentation system extracted — The presentation/deck builder has been moved to its own standalone repository (present). 245 files removed from this repo; raw-themes data and scripts are preserved in the new repo.
  • Removed deprecated agent prompts, workflow definitions, Copilot config, and stale GitHub Actions workflows.
  • Removed dead prompt constants, obsolete artifact files, and cleaned up .gitignore.

Migration Notes

  • presentation/ system extracted to a standalone repo (present). If you imported anything from presentation/* or ran its deck-builder / token-generation scripts, those paths no longer exist here. 245 files moved.
  • What moved: brutalist deck builder, JSX + PPTX slide generators, theme registry, raw-theme token configs, Storybook catalog, and the accompanying test suite.
  • What stayed: decks-generated/ (the output artifacts from the old builder) remains in this repo until downstream consumers migrate.
  • Action required: Update imports of presentation.* or src/tokens/* to point at the new present repo, OR pin a pre-2026-04 commit of this repo if you are not ready to migrate.
  • agentic_v2.core.protocols.AgentProtocol.run signature tightened from Any to object. Code that relied on implicit-Any call sites may now surface previously hidden mypy errors; update call sites to use the bounded TypeVars (TInput / TOutput) from agentic_v2.agents.base.
  • langchain adapter is deprecated in favor of the native DAG engine (see ADR-013). Importing agentic_v2.langchain.* still works but emits a DeprecationWarning. Plan to migrate off before v0.5.0.

Known Limitations

This release ships with the following known issues. All are tracked for Sprint B or later; none block the primary workflows documented in docs/ONBOARDING.md.

  • 35 mypy findings in agentic-v2-eval/ are now visible in CI (continue-on-error: true, not blocking merges). One of them (agentic_v2_eval/runners/streaming.py:216-235) raises an object that is not derived from BaseException and is a real bug worth fixing in Sprint B. Tracked as "Sprint B #7-followup" in eval-package-ci.yml.
  • SLO p95 gate can pass trivially on empty data. readP95({ windowDays: 7 }) in ui/e2e/slo-storage.ts returns 0 when the rolling window has no records, which silently satisfies the <= 2000ms assertion. Mitigated by the nightly workflow appending one sample per run, but the first night after an slo-data branch reset is a free pass.
  • Schema-drift guard is root-properties-only. tests/test_schema_drift.py checks the top-level properties of covered Pydantic models; field removals on nested models referenced via $defs, as well as any type narrowing (e.g., str -> Literal[...]), pass silently. Tracked for extension in Sprint B.
  • No true placeholder / no-LLM mode. Both local runs and CI require at least one configured LLM provider. In CI we use GITHUB_TOKEN + models: read to reach GitHub Models (zero-cost on public repos). The Epic 2 plan referenced a placeholder toggle, but the toggle was not implemented. See ADR-016 (planned) for the rationale.
  • Python -> TypeScript wire-format mirror is manual. agentic_v2/contracts/events.py is hand-mirrored in ui/src/api/types.ts; there is no codegen or drift-detection gate. An intentional schema change must be made in both files; a missed edit ships as a silent frontend bug.
  • UI build artifacts are git-tracked. agentic-workflows-v2/ui/dist/index.html and agentic-workflows-v2/ui/tsconfig.tsbuildinfo are historical tracked files that re-dirty after every npm run build. A follow-up task is filed; expect git status noise after building the UI locally.

Earlier History

For changes prior to March 2026, see the git log:

git log --oneline --before="2026-03-01"