ADR-012: UI Overhaul — Evaluation Hub & A/B Comparison Interface¶
| Field | Value |
|---|---|
| ID | ADR-012 |
| Status | 🟡 Proposed |
| Date | 2026-03-06 |
| System | agentic-workflows-v2/ui |
| Authors | Platform Engineering |
| Reviewers | Frontend, UX |
| Extends | ADR-011 (API & Interface Design) |
1. TL;DR¶
The existing EvaluationsPage is overhauled from a passive filtered table into a real evaluation hub with two tabs and a "Compare Agents" call-to-action. Two additive screens are introduced: a 2-step wizard for configuring a new comparison, and a single-page result view that auto-transitions from live progress to final results without navigation. All real-time streaming reuses
connectExecutionStream()verbatim — no new client-side WebSocket infrastructure (a new server-side route handler for/ws/eval/{run_id}is required). Contestant configuration is inline in the wizard (no file editing). A visual workflow pipeline builder is identified as the logical next phase but explicitly deferred.
2. Status History¶
| Date | Status | Note |
|---|---|---|
| 2026-03-06 | 🟡 Proposed | Initial UI overhaul design |
3. Context & Problem Statement¶
3.1 Current UI State Audit¶
The existing UI has seven pages. Six are working well and unchanged; one needs an overhaul:
| Page | Route | Disposition | Reason |
|---|---|---|---|
| Dashboard | / |
Keep — minor CTA addition | Add "Compare Agents" shortcut card |
| Workflows | /workflows |
Keep | Workflow library — unchanged |
| Workflow Detail | /workflows/:name |
Keep | DAG + step detail — unchanged |
| Datasets | /datasets |
Keep | Dataset browser — unchanged |
| Evaluations | /evaluations |
Overhaul as hub | Currently thin: only filters runs by score |
| Run Detail | /runs/:filename |
Keep | Step log viewer — unchanged |
| Live | /live/:runId |
Keep | Proven WebSocket real-time pattern to reuse |
EvaluationsPage — current state:
/evaluations (current)
┌──────────────────────────────────────────────────────┐
│ Evaluations │
│ │
│ Workflow | Score | Grade | Steps | Date | Details │
│ (filtered list: runs where evaluation_score != null)│
│ │
│ No entry point to run a new evaluation. │
│ No comparison capability or comparison history. │
│ No A/B results. Passive observer only. │
└──────────────────────────────────────────────────────┘
Note: The "no entry point" observation applies specifically to
EvaluationsPage.tsx.WorkflowDetailPage.tsxalready has an evaluation entry point viaRunConfigForm, which supports evaluation dataset selection, rubric selection, and multi-sample batch runs. The gap is thatEvaluationsPage— the page whose purpose is evaluations — lacks any entry point to initiate one.
This page is a passive read-only table. It provides no user action and is not an evaluation hub.
3.2 UX Requirements¶
The eval harness (ADR-010/011) introduces a primary user action: start and observe an A/B comparison. That action has no home in the current UI:
┌─────────────────────────────────────────────────────────────────┐
│ UX REQUIREMENTS FOR THE EVAL HARNESS UI │
├─────────────────────────────────────────────────────────────────┤
│ R1 │ Start a new A/B comparison from the browser │
│ R2 │ Observe live progress during a 30-120 min comparison │
│ R3 │ See results on the same page — no navigation disruption │
│ R4 │ Configure contestants inline — no terminal or file edit │
│ R5 │ Browse comparison history alongside scored workflow runs │
└─────────────────────────────────────────────────────────────────┘
3.3 Real-Time Infrastructure Assessment¶
The existing WebSocket infrastructure is proven stable in production:
// api/websocket.ts — connectExecutionStream()
// Deployed: stable for 30–60 minute workflow runs in LivePage
// Auto-reconnect: 5 retries, exponential backoff (retryDelayMs × 2^(retryCount-1) → 1s, 2s, 4s, 8s, 16s)
// pathPrefix option: pass pathPrefix="eval" to reach /ws/eval/{runId} — no new function needed
// Used by: useWorkflowStream.ts → LivePage.tsx
This is production infrastructure. The eval feature reuses it by passing pathPrefix: "eval"
to connectExecutionStream() — zero new client-side WebSocket code required.
useEvalStream.ts mirrors useWorkflowStream.ts structurally with eval-specific
event types — the only difference is the shape of the state object.
4. Decision¶
4.1 EvaluationsPage → Evaluation Hub (tab extension)¶
Change principle: additive only. The existing "Scored Runs" table is unchanged in content and styling. A tab selector and "Compare Agents" button are added above it.
/evaluations — overhauled
┌──────────────────────────────────────────────────────────┐
│ Evaluations [Compare Agents →] │
│ │
│ [Scored Runs] [Comparisons] ← tab toggle │
│ │
│ Scored Runs tab: │
│ Workflow │ Score │ Grade │ Steps │ Date │ Details │
│ (existing table — content, sorting, and styling intact) │
│ │
│ Comparisons tab: │
│ Commit │ A vs B │ Winner │ Margin │ Date │ ▶ │
│ abc123 │ workflow vs prompt│ A │ +1.4 │ Mar 6│ ▶ │
│ def456 │ agent vs agent │ tie │ 0.0 │ Mar 5│ ▶ │
└──────────────────────────────────────────────────────────┘
4.2 Compare Wizard — 2 Steps, One Decision Per Step¶
Rationale: GOV.UK Design System's "one thing per page" pattern, validated through lab testing on Register to vote and GOV.UK Verify (design-system.service.gov.uk). Each step focuses attention on a single decision and validates before advancing.
Step 1 — Commit (2 fields, immediate feedback):
Step 1 of 2: Select Commit
┌────────────────────────────────────────────┐
│ │
│ Repository path │
│ ┌──────────────────────────────────────┐ │
│ │ /path/to/repo │ │
│ └──────────────────────────────────────┘ │
│ │
│ Commit SHA [HEAD~1 shortcut] │
│ ┌──────────────────────────────────────┐ │
│ │ abc123def │ │
│ └──────────────────────────────────────┘ │
│ Commit validated: "fix: resolve edge ..." │
│ │
│ [Next →] │
└────────────────────────────────────────────┘
Validation on "Next": send POST /eval/validate-commit — verifies repo path is accessible
and commit SHA resolves. Inline commit summary shown on success.
Step 2 — Contestants (inline config expands by type):
Step 2 of 2: Configure Contestants
┌────────────────────────────────────────────┐
│ │
│ Contestant A │
│ [workflow] [prompt] [agent] ← pill tabs │
│ │
│ -- if workflow: -- │
│ Workflow: [code_review ▼] │
│ Model override: [default ▼] │
│ │
│ -- if prompt: -- │
│ ┌────────────────────────────────────┐ │
│ │ You are an expert Python developer │ │
│ │ specializing in ... │ │
│ └────────────────────────────────────┘ │
│ Model: [claude-sonnet-4-6 ▼] │
│ Temperature: [===●=====] 0.7 │
│ │
│ -- if agent: -- │
│ Class: [ClaudeAgent ] Model: [▼] │
│ System prompt: [optional override...] │
│ │
│ Contestant B [workflow] [prompt] [agent] │
│ (same expanded form for B) │
│ │
│ Rubric: [coding_standards ▼] │
│ │
│ [← Back] [Run Comparison] │
└────────────────────────────────────────────┘
On submit: POST /eval/compare → navigate to /evaluations/compare/{run_id}.
Existing API surface: A partial eval API already exists —
api/client.ts:64defineslistEvaluationDatasets()callingGET /api/eval/datasets. The proposed endpoints (POST /eval/validate-commit,POST /eval/compare) are additive to this existing/eval/namespace.
4.3 Comparison Run Page — Single Page, Two States¶
Design principle: GitHub Actions / Vercel deployment pattern — progress and result on the same URL. No navigation disruption during a long-running job.
State 1 — Running (WebSocket connected, live event-driven updates):
/evaluations/compare/{run_id} (state: running)
┌──────────────────────────────────────────────┐
│ ← Back abc123... ◉ running │
│ │
│ ✅ Extract requirements done 3.2s │
│ ◉ Run Contestant A running 48s │
│ ○ Run Contestant B waiting │
│ ○ Score & generate report waiting │
│ │
│ ┌────────────────────────────────────────┐ │
│ │ Contestant A: code_review │ │
│ │ type: workflow · claude-sonnet-4-6 │ │
│ │ 14 files modified │ │
│ └────────────────────────────────────────┘ │
└──────────────────────────────────────────────┘
State 2 — Complete (auto-transition in place, same URL, no navigation):
/evaluations/compare/{run_id} (state: complete)
┌──────────────────────────────────────────────┐
│ ← Back abc123... ✅ complete │
│ │
│ ┌──────────────────────────────────────┐ │
│ │ A WINS +1.4 points │ │
│ │ code_review outscores experimental │ │
│ └──────────────────────────────────────┘ │
│ │
│ Dimension A B │
│ completeness ████████░ ██████░░░ │
│ correctness ███████░░ █████████ │
│ quality ████████░ ███████░░ │
│ specificity ███████░░ ██████░░░ │
│ alignment █████████ ███████░░ │
│ Rubric (0.0–1.0) 0.783 0.614 │
│ │
│ Tests A: 12/14 ✅ B: 9/14 ✅ │
│ │
│ [▶ Show patch diff] │
│ [↓ HTML report] [↓ Markdown report] │
└──────────────────────────────────────────────┘
The page does not navigate to a new route. The useEvalStream hook drives a state
transition from status: "running" to status: "complete", and the JSX conditionally
renders the progress tracker or the result view.
4.4 Real-Time Hook (useEvalStream.ts)¶
Mirrors useWorkflowStream.ts with eval-specific event types:
type EvalEvent =
| { type: "eval_start"; run_id: string }
| { type: "phase_start"; phase: "extract" | "run_a" | "run_b" | "score" }
| { type: "phase_end"; phase: string; status: "done" | "error"; elapsed_ms: number }
| { type: "eval_complete"; winner: "A" | "B" | "tie"; score_a: number;
score_b: number; result: ComparisonResult }
| { type: "error"; message: string }
interface EvalStreamState {
phases: Record<string, "waiting" | "running" | "done" | "error">;
status: "connecting" | "running" | "complete" | "error";
result: ComparisonResult | null;
error: string | null;
}
export function useEvalStream(runId: string | null): EvalStreamState {
// Mirrors useWorkflowStream.ts exactly — four independent useState calls, no useReducer:
const [phases, setPhases] = useState<EvalStreamState["phases"]>({});
const [status, setStatus] = useState<EvalStreamState["status"]>("connecting");
const [result, setResult] = useState<ComparisonResult | null>(null);
const [error, setError] = useState<string | null>(null);
useEffect(() => {
if (!runId) return;
const stream = connectExecutionStream(runId, (event) => {
// dispatch on event.type → call appropriate setter
}, { pathPrefix: "eval" });
return () => stream.close();
}, [runId]);
return { phases, status, result, error };
}
WebSocket URL: ws://host/ws/eval/{run_id} — reached via pathPrefix: "eval" option; same
host, no new client-side WebSocket infrastructure required.
4.5 New Files¶
| File | Lines (est.) | Purpose |
|---|---|---|
pages/eval/NewComparisonPage.tsx |
140 | 2-step wizard page (/evaluations/compare/new) |
pages/eval/ComparisonRunPage.tsx |
170 | Progress + result, 2 states (/evaluations/compare/:id) |
components/eval/ComparisonWizard.tsx |
100 | Wizard step logic + validation state |
components/eval/ContestantConfig.tsx |
70 | Type pill toggle + ref input (used for A and B) |
components/eval/ContestantPropertiesForm.tsx |
90 | Expands fields by type (workflow/prompt/agent) |
components/eval/EvalProgressTracker.tsx |
80 | 4-step linear progress indicator |
components/eval/ComparisonResultView.tsx |
130 | Winner banner + dimension bars + test summary |
components/eval/PatchDiff.tsx |
80 | Collapsible side-by-side unified diff viewer |
hooks/useEvalStream.ts |
70 | WebSocket hook (mirrors useWorkflowStream) |
api/eval.ts |
50 | Typed fetch wrappers for eval REST endpoints |
4.6 Modified Files¶
| File | Change |
|---|---|
pages/EvaluationsPage.tsx |
Add "Compare Agents →" CTA button + Comparisons tab |
App.tsx |
Add routes: /evaluations/compare/new, /evaluations/compare/:id |
Sidebar: unchanged. The Evaluations link already exists in the nav. Comparison
features live under /evaluations/* — no new nav item required.
4.7 Long-Term Roadmap — In-App Workflow Pipeline Builder¶
Visual drag-and-drop creation and editing of workflow YAML files. React Flow is
already installed and used in WorkflowDAG.tsx. This is deferred to a future phase.
Deferral rationale: The eval harness is a prerequisite for knowing which workflow patterns perform well. It would be premature to build a pipeline authoring UI before having data on what effective pipelines look like.
Future location: /workflows/new and /workflows/:name/edit
Future — two editing modes (toggle):
Visual: React Flow DAG → drag steps → configure → export YAML
YAML: Monaco/CodeMirror editor with JSON schema validation
Bidirectional sync: edit graph ↔ edit YAML (consistent state)
Future backend needed:
POST /workflows → save new YAML to workflows/definitions/
PUT /workflows/:name → update existing definition
GET /workflows/:name/schema → JSON schema for editor validation
5. Rationale¶
5.1 Two-Step Wizard over Single Long Form¶
Nielsen Norman Group's research on form usability (nngroup.com) establishes that wizards "dynamically display relevant fields based on users' prior input" and "minimize the physical effort required to fill out the form and save users from spending attentional resources to scan and filter irrelevant questions."
The key dependency in this form: Step 2 (contestant configuration) requires knowing the
repo path from Step 1 in order to populate the workflow dropdown from GET /workflows.
This dependency makes a strict two-step split natural — not merely cosmetic.
| Factor | 2-Step Wizard | Single Long Form |
|---|---|---|
| Cognitive load per screen | One decision area | All decisions simultaneously |
| Validation feedback | Per-step, immediate (commit validated on "Next") | At submit — ambiguous which field failed |
| Progressive disclosure | Contestant config expands based on type selected | All possible fields visible simultaneously |
| Error recovery | Back button goes to specific failing step | Start over from top |
| Mobile / narrow viewport | One focused section per screen | Vertical scrolling required |
| Step dependency handling | Step 2 receives validated Step 1 data | Must pass validated data down within one form |
5.2 Single-Page Progress → Result (same URL, no navigation)¶
| Factor | Single Page (auto-transition) | Separate Result Route |
|---|---|---|
| Navigation disruption | None — transitions in place during live run | User leaves progress URL mid-operation |
| Deep-linkable | Yes — /evaluations/compare/{id} shows current state |
Requires two bookmarkable URLs |
| WebSocket lifecycle | Single connection, no reconnect on navigate | New connection required on route change |
| Back button behavior | Clear — goes back to /evaluations | Ambiguous — which "back" goes where? |
| Share URL semantics | One URL shows the current state of any run | Two URLs for two "phases" of one run |
Precedents: GitHub Actions workflow run page and Vercel deployment status page both use this pattern. A single URL serves both the in-progress log and the final result. Users can bookmark, share, or return to a URL and see the current state — running or complete — without confusion.
5.3 Inline Contestant Configuration¶
Without inline config, the workflow to set up a prompt contestant is:
- Open terminal → write prompt text to a
.mdfile - Note the file path
- Return to browser → paste file path into the wizard
- If the prompt changes: repeat steps 1–3
With inline config (textarea for prompt type), this collapses to typing the prompt
directly into the wizard. The workflow type gets a dropdown populated from GET /workflows.
The agent type gets a class name input. No file editing, no terminal context-switching.
This is the same design principle used by OpenAI Playground and PromptLayer: reduce the friction between "I have an idea for a prompt" and "I can evaluate it."
5.4 Design Token Consistency¶
All new components must use the existing design token set:
| Token | Purpose |
|---|---|
bg-surface-1, bg-surface-2 |
Card and panel backgrounds |
bg-accent-blue/10, text-accent-blue |
Active state, links |
border-white/5 |
Subtle borders |
text-gray-400, text-gray-500, text-gray-600 |
Secondary text hierarchy |
btn-ghost |
Ghost button variant (Back, Cancel) |
tabular-nums |
Score display, counters |
No new Tailwind classes. The comparison result dimension bars use the same
bg-green-500 / bg-amber-500 / bg-red-500 pattern established in CriterionRow
in LivePage.tsx.
Known defect:
bg-surface-hoveris used in the currentEvaluationsPage.tsx:60but is NOT defined intailwind.config.jsorglobals.css. Tailwind JIT silently drops undefined classes, sobg-surface-hoverproduces no styles. Implementers studying the existing page for token precedents should be aware of this phantom token and avoid replicating it.
5.5 Production Precedents for This UI Pattern¶
| Pattern | Precedent | Applied In |
|---|---|---|
| Evaluation hub with tabs | DORA Dashboard (deploy frequency, failure rate, etc.) | Scored Runs + Comparisons tabs |
| Wizard for parameterized job setup | GitHub Actions workflow_dispatch input form |
2-step Compare Wizard |
| Single-page progress → result auto-transition | GitHub Actions run page, Vercel deploy status | ComparisonRunPage 2 states |
| Dimension-per-row bar chart | SWE-bench leaderboard, HELM benchmark results | ComparisonResultView dimension bars |
| Inline prompt config in UI | OpenAI Playground, PromptLayer | ContestantPropertiesForm for prompt type |
| Wizard step validation before advance | Multi-step checkout (Stripe, Shopify) | Step 1 commit validation before Step 2 |
6. Consequences¶
6.1 Positive Outcomes¶
| Outcome | Mechanism |
|---|---|
| No file editing to run a comparison | Inline contestant config in wizard |
| Live feedback for 30–120 min evals | WebSocket progress events via useEvalStream |
| Comparison history browsable with scored runs | Two-tab EvaluationsPage hub |
| No new client-side WebSocket infrastructure | connectExecutionStream() called with pathPrefix: "eval" — reaches /ws/eval/{run_id} with zero new client code. Note: a new server-side WebSocket route handler for /ws/eval/{run_id} must be written — the server currently only registers /ws/execution/{run_id} |
| Six of seven pages untouched | Only EvaluationsPage modified; all routes additive |
6.2 Trade-offs and Risks¶
| Risk | Severity | Mitigation |
|---|---|---|
| Wizard Step 1 validation requires network round-trip | Medium | Client validates path format locally; server validates SHA; debounced on blur |
| WebSocket keepalive needed for 2+ hour eval runs at load balancer | Medium | Server sends JSON keepalive ping every 30s; client reconnect absorbs any drops |
prompt_text in wizard may be large (tens of KB) |
Low | Frontend limits textarea to 16KB; documented as constraint |
| Comparisons tab empty state needs guidance copy | Low | Empty state component with "No comparisons yet — click Compare Agents" |
| Future pipeline builder (React Flow visual editor) has substantial scope | Low | Explicitly deferred; documented in roadmap section |
7. Alternatives Considered¶
| Alternative | Rejected Because |
|---|---|
Separate /compare top-level section in sidebar |
Discoverability: users expect eval features under Evaluations; adds a 6th nav item without justification |
| Full-page modal for the wizard | No deep-link URL for in-progress comparison; back button behavior undefined inside modal; route-based navigation is cleaner |
| SSE for progress stream | Creates a second streaming protocol; diverges from the proven connectExecutionStream() WebSocket infrastructure (also ADR-011 §6.2) |
| promptfoo web UI for comparison setup | Node.js dependency stack; no native integration with Python/FastAPI backend; no private-commit support; introduces a second UI framework |
| React Query polling (no WebSocket) | 30–120 min runs require hundreds of polling requests; no push semantics; latency between completion and display |
| Separate progress URL + separate result URL | Navigation disruption during live run; bookmark semantics are ambiguous; WebSocket lifecycle breaks on navigate |
| Build pipeline editor now (not deferred) | Eval harness should produce evidence of which workflow patterns work before committing to a pipeline authoring UI; React Flow visual editor scope is large |
8. References¶
| Citation | Relevance |
|---|---|
| Nielsen Norman Group — 4 Principles to Reduce Cognitive Load in Forms (nngroup.com) | Wizard pattern justification; progressive disclosure; one-focus-per-screen |
| Nielsen Norman Group — Wizards: Definition and Design Recommendations (nngroup.com/articles/wizards/) | Source for §5.1 quote: "dynamically display relevant fields based on users' prior input" |
| GOV.UK Design System — Question pages: One thing per page (design-system.service.gov.uk) | Validated through lab testing on Register to vote and GOV.UK Verify; starting pattern for multi-step flows |
| Smashing Magazine — Better Form Design: One Thing Per Page (smashingmagazine.com, 2017) | User research: single-question pages reduce errors and improve completion rates |
| GitHub Actions — workflow run page (github.com) | Single-URL progress → result auto-transition pattern |
| Vercel — deployment status page (vercel.com) | Same-URL live-to-complete state transition precedent |
| DORA State of DevOps 2024 — Dashboard design | Multi-metric tab panel inspiration for Evaluations hub |
| SWE-bench web results viewer (swebench.com) | Dimension-per-row score bar display inspiration |
api/websocket.ts:connectExecutionStream() |
Existing WebSocket client infrastructure — reused verbatim |
hooks/useWorkflowStream.ts |
Existing WebSocket React hook — mirrored for eval event types |
pages/LivePage.tsx |
Proven single-page real-time display; CriterionRow bar chart pattern |
React Flow documentation — @xyflow/react (reactflow.dev) |
Visual builder option for future pipeline editor phase (React Flow v12 publishes under the @xyflow org) |
9. Decision Map¶
┌──────────────────────────────────────────────────────────────────────┐
│ ADR-012 DECISION MAP │
│ │
│ EvaluationsPage Disposition │
│ ├── Leave as-is (passive filtered table) ─────── REJECTED │
│ ├── Replace with comparison-only view ────────── REJECTED │
│ └── Tab extension: hub + CTA + Comparisons ───── CHOSEN │
│ │
│ Comparison Setup UX │
│ ├── Single long form (all fields) ────────────── REJECTED │
│ ├── Full-page modal ──────────────────────────── REJECTED │
│ └── 2-step wizard (one decision per screen) ──── CHOSEN │
│ │
│ Progress and Result Display │
│ ├── Separate progress route + separate result ── REJECTED │
│ └── Single route, auto-transition in place ────── CHOSEN │
│ │
│ Real-Time Streaming │
│ ├── React Query polling ──────────────────────── REJECTED │
│ ├── SSE ──────────────────────────────────────── REJECTED │
│ └── Reuse connectExecutionStream() (WebSocket) ── CHOSEN │
│ │
│ Contestant Configuration │
│ ├── File path only (user edits files in terminal) REJECTED │
│ └── Inline in wizard (textarea/dropdown/inputs) ── CHOSEN │
│ │
│ Workflow Pipeline Builder │
│ ├── Build now (React Flow visual editor) ─────── DEFERRED │
│ └── Flag in roadmap — post eval harness data ──── CURRENT SCOPE │
└──────────────────────────────────────────────────────────────────────┘