Changelog¶
All notable changes to this project are documented here.
Unreleased¶
Security (Sprint 1 — Tool Safety & Silent Failures)¶
- S1-01 — Sanitization middleware fail-closed —
server/middleware/__init__.pynow returns HTTP 400 on sanitization errors instead of silently passing unvalidated input downstream. Any unhandled exception in the sanitization pipeline is treated as a block, not a pass-through. - S1-02 — File tools fail-closed without
AGENTIC_FILE_BASE_DIR— All six file tools (FileReadTool,FileWriteTool,FileCopyTool,FileMoveTool,FileDeleteTool,DirectoryCreateTool) reject operations with a clear error whenAGENTIC_FILE_BASE_DIRis unset, preventing unintentional host filesystem access..env.exampleupdated with required operator guidance. - S1-03 —
run_idtraversal test corpus — Added parametrized tests covering path traversal, null-byte injection, and unicode normalization bypass attempts against the existingrun_idvalidator. - S1-04 —
_validate_astattribute-escape closed — The expression evaluator's AST sandbox now rejects attribute access (ast.Attributenodes) that could escape to__class__.__mro__chains or dunder traversal. Existing safe-expression corpus extended with adversarial fixtures. - S1-05 —
ShellToolallowlist replaces substring blocklist —AGENTIC_SHELL_ALLOWED_COMMANDSenv var (comma-separated command names) controls which executablesShellToolmay run. When unset, all shell commands are disabled (fail-closed). Defeats double-space, absolute-path, uppercase, and fullwidth-unicode blocklist bypasses..env.exampleentry added._SHELL_METACHARS+_split_commandpreserved as a second safety layer. - S1-06 —
CodeExecutionToolsandbox hardened —__import__replaced in the wrapper's_safe_builtinswith a constrained importer that enforces the same_DANGEROUS_IMPORTSblocklist at runtime, closing__import__('os')bypass while preservingimport mathetc.resource.setrlimit(RLIMIT_AS=512 MB, RLIMIT_NPROC=32)added viapreexec_fnon POSIX to prevent memory-bomb and fork-bomb DoS. Sandbox escape test corpus added. - S1-06 follow-up — loader-traversal escape closed —
"sys.modules"and"__loader__"added to_DANGEROUS_PATTERNS, blockingsys.modules['builtins'].__loader__.load_module('os')traversal on Windows. Regression tests added. - S1-07 — Sparse subprocess env across all tools — New
agentic_v2/tools/subprocess_utils.pywithminimal_subprocess_env()helper.GitToolandShellTool(includingShellExecTool) now passenv=minimal_subprocess_env()to all subprocess calls, preventing API key leakage to child processes. Test corpus verifiesOPENAI_API_KEY,ANTHROPIC_API_KEY,GITHUB_TOKEN, etc. are excluded.
Documentation¶
- Architecture umbrella + roadmap + honesty docs (commit
41bd0d8) — newdocs/ARCHITECTURE.mdindex linking the four existingarchitecture-*.mddeep-dives (closes the brokendocs/README.mdreference); newdocs/ROADMAP.mdsurfacing Epic 1/2/3/5/6 status, explicit Epic 4 tombstone, and proposed Epic 7 scope (closes the in-repo backlog gap flagged by the final review); newdocs/KNOWN_LIMITATIONS.mddocumenting the 35 unresolvedagentic-v2-evalmypy findings, the empty-window SLO trivial-pass, the schema-drift root-only blind spot, and the Python→TypeScript manual mirror; newdocs/MIGRATIONS.mdwith first entries for thepresentation/extraction, theAgentProtocol.runsignature tightening, and thelangchainadapter deprecation. - Epic retrospectives + three new ADRs (commit
7b082fd) — retrospective plan docs for Epics 3 (DevEx), 5 (UI Polish), and 6 (Eval Depth), covering stories, commits, load-bearing decisions, and process notes; ADR-014 (Pydantic discriminated-union wire format for execution events), ADR-015 (SLO rolling window stored in git as JSON artifacts — acknowledges the empty-window trivial-pass as open debt), ADR-016 (GitHub Models viaGITHUB_TOKENas default E2E LLM provider — documents the zero-cost-vs-vendor-coupling trade-off);ADR-INDEX.mdrefreshed from 10 → 13 ADRs with updated lineage chains. Archive headers added to the Epic 1/2 prospective plan docs so unchecked boxes read as history rather than WIP. - Stale-artifact triage and count-drift fixes (commit
205b314) — deletes ~2,050 lines of orphaned content:MCP_IMPLEMENTATION_{COMPLETE,PLAN,STATUS}.md,LANGCHAIN_MIGRATION_PLAN.md(directly contradicted ADR-013),chatlg.md(1052-line raw conversation log atagentic-v2-eval/),playwright-tester-training-prompt.md, and severalhandoff.mdstubs. Count drift corrected inONBOARDING.md/GLOSSARY.md/CLAUDE.md(24 → 7 agent personas, 78+ → 100+ test files, 10 → 6 workflow definitions). Corrected a factual claim indocs/MIGRATIONS.md:presentation/is not fully gone — the top-level directory retains leftover theme-collection scripts and raw-themes data pending a follow-up cleanup. - Post-v0.3.0 doc cleanup (commit
e7c2a69) — drops supersededIMPLEMENTATION_SUMMARY.md,MASTER_MANIFEST.md, anddocs/contribution-guide.md(all replaced by newer artifacts landed during the doc overhaul); relocatesdocs/eval-harness/*planning artifacts toplanning-artifacts/eval-harness/so the user-facingdocs/tree only contains user-facing content; adds a ⚠️ STALE banner toACTIVE_VS_LEGACY_TOOLING_MAP.mdlinking to current sources of truth rather than rewriting it blind.
Cleanup & Refactoring¶
- UI build artifacts untracked (commit
9522baf) —agentic-workflows-v2/ui/dist/index.htmlandagentic-workflows-v2/ui/tsconfig.tsbuildinfoare now in.gitignore; they were showing up dirty after everynpm run buildand polluting every PR diff. Fulfills the spawned task filed during Sprint A triage.
Stabilization (Sprint B)¶
- SLO p95 empty-window trivial-pass fixed (commit
c9c4f33, Sprint B #2) —readP95inagentic-workflows-v2/ui/e2e/slo-storage.tspreviously returned0on an empty rolling window, which silently satisfied the<= 2000msassertion and produced a permanently green gate after anyslo-databranch reset. Now throws a newInsufficientDataErrorwhen the window has fewer thanDEFAULT_MIN_SAMPLES(= 10) records, and the Playwright spec converts that totest.skip(...)during bootstrap. Bootstrap semantics: deferred, not passed. From the 11th nightly onward the p95 budget is a hard gate. - All 35
agentic-v2-evalmypy findings cleared (commited78ee2, Sprint B #1) — droppedcontinue-on-error: truefrom.github/workflows/eval-package-ci.ymlin the same commit. Workstream breakdown:types-PyYAMLadded (2 errors),_eval_onerefactored to a discriminated uniontuple[Literal[True], R] | tuple[Literal[False], Exception](6 errors acrossrunners/streaming.py), optional-import guards via_require_*_module()helpers indatasets.py(6 union-attr + 1 no-any-return), 10 missing annotations added, 10no-any-returnfixes via typed locals. 241 tests still pass. One new# type: ignore[return-value]atrunners/streaming.py:199documented with justification (inspect.isawaitablenarrowing limitation). - Wire-format drift gate (commit
ae3878c, Sprint B #3) — new automated Python ↔ TypeScript mirror foragentic_v2.contracts.events.ExecutionEvent.agentic_v2/contracts/events.pyis the source of truth;scripts/generate_ts_types.pyemitstests/schemas/events.schema.json;ui/scripts/generate-ts-types.mjsusesjson-schema-to-typescriptto emitui/src/api/events.generated.ts. Newwire-format-driftCI job regenerates and fails the PR on mismatch. The migration caught three latent client-type mismatches (status: StepStatusvs. the wirestring, non-nullableinput/outputthat are actually nullable,criteriashape drift onEvaluationCompleteEvent), now coerced at theuseWorkflowStream.tsboundary. Client-only transport events (error,keepalive,connection_established) live in a hand-definedChannelEventunion since they are not in the Python contract. Contributor docs updated. - Sprint B #5 — Placeholder / no-LLM mode.
AGENTIC_NO_LLM=1installs a deterministic placeholder at both engine chokepoints (get_client()→MockBackend;get_chat_model()→PlaceholderChatModel). Native and LangChain engines both run end-to-end without provider credentials. Not a simulator — structured parsers and evaluation runs still need real keys. (commitc2aff71)
0.3.0 — 2026-04-22¶
First tracked release of the agentic-workflows-v2 platform. Bundles
Epic 1 (Platform Foundation), Epic 2 (Observable Execution),
Epic 3 (DevEx / Windows), Epic 5 (Console UI Polish), and
Epic 6 (Evaluation & Data Depth). See Known Limitations below
for items shipped with caveats and Migration Notes for breaking
changes vs. the prior unversioned state.
Note on Epic 4. Epic numbering jumps from 3 to 5. Epic 4 was never authored in this repository — the number was skipped during planning. Not a regression, just a tombstone for the record.
New Features¶
- Epic 1 — Platform Foundation (agentic-workflows-v2)
- Typed core protocols —
AgentProtocol.runno longer accepts/returnsAny; signature tightened toobjectso type checkers stop treating agent I/O as opaque. CompanionToolProtocolconformance test added. - Consolidated
Settings— All environment variable reads routed through a single typedpydantic-settingsclass; scatteredos.environlookups removed so misconfigured deployments fail fast at startup instead of deep inside a run. - Adapter registry test isolation — Autouse fixture resets the
AdapterRegistrysingleton between tests, eliminating cross-test leakage and flakes when suites register alternate engine backends. - Schema-drift CI gate — New snapshot test on
contracts/Pydantic models fails the build on any unreviewed wire-format change;scripts/generate_schemas.pyrefreshes the canonical snapshot. - OTEL parent-child trace assertion — Regression test verifies the engine → agent span chain is preserved end-to-end so distributed traces remain connected in Jaeger / Tempo.
- Golden-output regression test — Deterministic fixture locks the
code_reviewworkflow's final output; any behavioral drift in the native engine trips the test. - CI lint + coverage enforcement — Ruff runs as a required job and the 80% coverage floor is now enforced in GitHub Actions rather than advisory.
-
MCP results cleanup — Ruff
UP006/S324fixes acrossmcp/resultsremove legacy typing forms and insecure hash defaults. -
Epic 2 — Observable Execution (agentic-workflows-v2)
- Typed execution-event wire format — New
contracts/events.pyPydantic discriminated union coversworkflow_start,step_start,step_end,step_complete,step_error,workflow_end, andevaluation_*. WebSocket and SSE broadcasts validate before emit; client union inui/src/api/types.tsstays in lockstep. - Live DAG animation —
@xyflow/reactnodes and edges animate through queued → running → complete / error states as events arrive on/ws/execution/{run_id}, so users can watch a run progress without opening server logs. - StepNode B2 redesign — Each node now renders an ASCII status chip, LLM tier pill, token counter, and a streaming-progress bar driven by live events.
- Step drill-down panel — Click a node to see a five-field detail pane (inputs, outputs, status, timing, errors) that handles partial state gracefully while a step is still running.
- Playwright streaming PR gate — New E2E job runs the streaming flow 5× per PR using
data-testidhooks added across the UI; any single failure blocks merge. - WebSocket reconnect-replay test — Fault-injection E2E kills the socket mid-run and asserts the server's 500-event replay buffer restores UI state on reconnect.
- Time-to-first-span SLO + p95 gate — New measurement test records first-span latency and fails the build if the p95 regresses past the contract.
-
Nightly 50× reliability job — Streaming E2E runs 50× on a nightly cron with a rolling flake-rate gate; reconnect and p95 contracts hardened alongside.
-
Epic 3 — DevEx / Windows (agentic-workflows-v2)
- Windows bring-up hardening —
scripts/setup-dev.ps1one-command bootstrap validated in CI so a fresh Windows clone installs deps, validates workflows, and runs a smoke test without manual intervention. port-guarddevex tool — Detects and reports processes holding dev ports (8010 / 5173 / 6006) before server startup so conflicts surface with an actionable message instead of a cryptic bind error.workspace-test-runnertool — Single entry point to run the right test suite for whichever package you're in (backend pytest, eval pytest, UI vitest).workflow-lintertool — Validates YAML workflow definitions against the required-fields contract (name,agent,description,depends_on,inputs,outputs) before they reach the runtime.-
Windows Unicode CLI fix —
agenticCLI no longer crashes on the Windows default codepage when rendering non-ASCII output; CLI verification added to CI to prevent regressions. -
Epic 6 — Evaluation & Data Depth (agentic-workflows-v2)
- Contracts — Additive Pydantic v2 models (
EvaluationCriterionDetail,ScoreLayersModel,HardGatesModel,FloorViolationModel,RunEvaluationDetail,DatasetSampleSummary, …) mirrored as TypeScript interfaces inui/src/api/types.ts;EvaluationCompleteEventexpanded withpassed,pass_threshold,criteria. - Tokens-30d stat —
RunsSummaryResponsegainstokens_30d;run_logger.summary()aggregates tokens across all runs started in the last 30 days; Dashboard live stat wired to real data. GET /runs/{filename}/evaluation— Returns full rubric breakdown for any stored run, including criteria scores, score layers, hard gates, and floor violations.- Dataset sample endpoints —
GET /eval/datasets/sample-list?dataset_id=…&limit=…andGET /eval/datasets/sample-detail?dataset_id=…&sample_index=…; dataset IDs use query params to handle slash characters. - Evaluations page —
EvaluationRubricAccordionwith lazy[+]/[-]expansion: criterion table (normalized %, ASCII progress bar,[FLOOR]badge), score layers, hard gate[OK]/[FAIL]rows, floor violation list. -
Datasets page — 3-pane browser: dataset catalog →
SampleIndexGrid(paginated[<]/[>]) →DatasetDetailPane(collapsible[meta +/-], field rendering, JSON viewer, workflow preview badge). -
Epic 5 — Console UI Polish (agentic-workflows-v2/ui)
StatusBadgemigrated to ASCII bracket format:[OK ][RUN][ERR][WARN]using--b-*CSS tokens; works across dark / paper / bolt themes.useHotkeyshook — global keyboard shortcuts (n / f / / / j / k / Esc) with input-focus guard and unmount cleanup.- Dashboard filter —
/andffocus the filter input;Escclears and blurs; narrows runs by workflow name or run ID. - State pages —
EmptyState($ no <entity> yet),ErrorBanner([!] {msg}),NotFoundPage(404 terminal-style),AppErrorBoundary(React error boundary). - Skip-to-main link — visually hidden, appears on first Tab;
<main id="main-content">as target. - Focus ring audit —
focus:ring-1 focus:ring-b-clay/50added to all interactive elements; audit notes atdocs/a11y-focus-ring-audit.md. - Paper theme contrast QA —
--b-text-dimon--b-bg1verified at 7.45:1 (passes AA); bolt 5.92:1; dark 3.80:1 (dim tier, intentional). -
BDagMini— pure SVG static DAG thumbnail; reuseslayoutDAG(Kahn topological sort); linear chains render as vertical, parallel branches center-aligned per rank; themed via CSS vars. -
skill-architectagent — New AI persona specialized in designing, extracting, and refactoring skills as reusable prompt programs. Added to the canonical agent roster with full documentation. verify-and-correctskill — Bounded self-correction loop: automatically runs tests, lint, and type checks after code changes, then retries fixes (up to a limit) before reporting back. Reduces back-and-forth on build failures.session-planskill — Plan a focused session with 1–2 goals, explicit success criteria, and a TODO checklist. Prevents mega-sessions that hit rate limits by scoping work upfront.- Headless prompts — New infrastructure for running agent prompts without an interactive session.
- Adapter registry CLI wiring (ADR-001) — Orchestrator, workflow runner, and CLI now route through the adapter registry, making engine backends (native DAG, LangGraph) fully swappable at runtime.
- SmartModelRouter hardening (ADR-002) — Added persistence across restarts,
Retry-Afterheader support, and degraded-mode fallback so the router stays operational when providers are rate-limited or unavailable.
Improvements¶
- CI pipeline upgrades — Updated
actions/setup-pythonfrom v4 → v5 across all workflows. Added cross-package E2E job, expanded security scanning withpip-audit, and tightened dependency review to fail on severitymoderate. - Release pipeline — New tag-triggered
deploy.ymlwith build provenance for reproducible releases. - Performance regression detection — Rewrote
performance-benchmark.ymlto compare against a stored baseline and fail on detected regressions. - Secret sanitization rule — Always-on sanitization middleware now redacts API keys, tokens, and passwords before they reach LLM context. Covers all agent sessions by default.
- Backlog tracking — Added 11 new backlog tickets (rows 34–44) covering architectural debt, test gaps, and documentation tasks.
Security¶
- P0 tool-layer hardening —
http_ops,search_ops, andshell_opsreceived critical security fixes: SSRF URL validation, shell injection blocklist (20+ blocked patterns),shlex.splitwithshell=False, path traversal guards, and__builtins__restriction in the code execution sandbox.
Bug Fixes & Documentation¶
- Fixed factual drift across monorepo documentation (CLAUDE.md, AGENTS.md, README files, ADRs).
- Corrected 40+ dangling references to deleted prompts and removed workflows.
- Removed 23 duplicate tests, fixed 2 broken tests as part of ADR-008 test coverage overhaul.
- Aligned stale agent config entries with the current agent roster.
Test Coverage (ADR-008)¶
- Added tests for 12+ previously untested modules, including evaluation pipeline, workflow pipeline, server backends, LLM judge, and scoring.
- Hardened server test coverage for streaming backends and session handling.
- Overall test coverage for
agentic_v2raised toward the 80% target.
Cleanup & Refactoring¶
- Presentation system extracted — The presentation/deck builder has been moved to its own standalone repository (
present). 245 files removed from this repo; raw-themes data and scripts are preserved in the new repo. - Removed deprecated agent prompts, workflow definitions, Copilot config, and stale GitHub Actions workflows.
- Removed dead prompt constants, obsolete artifact files, and cleaned up
.gitignore.
Migration Notes¶
presentation/system extracted to a standalone repo (present). If you imported anything frompresentation/*or ran its deck-builder / token-generation scripts, those paths no longer exist here. 245 files moved.- What moved: brutalist deck builder, JSX + PPTX slide generators, theme registry, raw-theme token configs, Storybook catalog, and the accompanying test suite.
- What stayed:
decks-generated/(the output artifacts from the old builder) remains in this repo until downstream consumers migrate. - Action required: Update imports of
presentation.*orsrc/tokens/*to point at the newpresentrepo, OR pin a pre-2026-04 commit of this repo if you are not ready to migrate. agentic_v2.core.protocols.AgentProtocol.runsignature tightened fromAnytoobject. Code that relied on implicit-Anycall sites may now surface previously hiddenmypyerrors; update call sites to use the boundedTypeVars (TInput/TOutput) fromagentic_v2.agents.base.langchainadapter is deprecated in favor of the native DAG engine (see ADR-013). Importingagentic_v2.langchain.*still works but emits aDeprecationWarning. Plan to migrate off before v0.5.0.
Known Limitations¶
This release ships with the following known issues. All are tracked for Sprint B or later; none block the primary workflows documented in docs/ONBOARDING.md.
- 35 mypy findings in
agentic-v2-eval/are now visible in CI (continue-on-error: true, not blocking merges). One of them (agentic_v2_eval/runners/streaming.py:216-235) raises an object that is not derived fromBaseExceptionand is a real bug worth fixing in Sprint B. Tracked as "Sprint B #7-followup" ineval-package-ci.yml. - SLO p95 gate can pass trivially on empty data.
readP95({ windowDays: 7 })inui/e2e/slo-storage.tsreturns0when the rolling window has no records, which silently satisfies the<= 2000msassertion. Mitigated by the nightly workflow appending one sample per run, but the first night after anslo-databranch reset is a free pass. - Schema-drift guard is root-properties-only.
tests/test_schema_drift.pychecks the top-levelpropertiesof covered Pydantic models; field removals on nested models referenced via$defs, as well as any type narrowing (e.g.,str->Literal[...]), pass silently. Tracked for extension in Sprint B. - No true placeholder / no-LLM mode. Both local runs and CI require at least one configured LLM provider. In CI we use
GITHUB_TOKEN+models: readto reach GitHub Models (zero-cost on public repos). The Epic 2 plan referenced a placeholder toggle, but the toggle was not implemented. See ADR-016 (planned) for the rationale. - Python -> TypeScript wire-format mirror is manual.
agentic_v2/contracts/events.pyis hand-mirrored inui/src/api/types.ts; there is no codegen or drift-detection gate. An intentional schema change must be made in both files; a missed edit ships as a silent frontend bug. - UI build artifacts are git-tracked.
agentic-workflows-v2/ui/dist/index.htmlandagentic-workflows-v2/ui/tsconfig.tsbuildinfoare historical tracked files that re-dirty after everynpm run build. A follow-up task is filed; expectgit statusnoise after building the UI locally.
Earlier History¶
For changes prior to March 2026, see the git log: