# ADR-007: Multidimensional Classification Matrix and Stop Policy for Iterative Deep Research
| Field | Value |
|---|---|
| ID | ADR-007 |
| Status | 🟡 Proposed |
| Date | 2026-02-23 |
| System | agentic-workflows-v2 · deep-research pipeline |
| Authors | Platform Engineering |
| Reviewers | Research Infra, ML Platform |
| Supersedes | (none) |
## 1. TL;DR
We abandon the single weighted "Confidence Index" (CI) scalar as the gate for stopping iterative research rounds. Instead, we adopt a DORA-style multidimensional classification matrix in which every dimension must independently reach "High" or better before stopping. The CI scalar is retained only as a tiebreaker for the coalesce() best-of-N selector.
## 2. Status History
| Date | Status | Note |
|---|---|---|
| 2026-02-10 | 🔵 Draft | Initial CI-weighted-sum proposal |
| 2026-02-18 | 🟠 Under Review | Challenge: compensability masks single-dimension failures |
| 2026-02-23 | 🟡 Proposed | Pivoted to multidimensional matrix; CI demoted to tiebreaker |
## 3. Context & Problem Statement
The agentic-workflows-v2 system runs a 10-node deep research pipeline with bounded iterative rounds R1–R4. Each round produces a research artifact. The system must answer three questions before committing to a round result or continuing:
```text
┌─────────────────────────────────────────────────────────────────┐
│           THREE QUESTIONS THE STOP POLICY MUST ANSWER           │
├─────────────────────────────────────────────────────────────────┤
│ Q1 │ How should multi-dimensional research quality be scored?   │
│ Q2 │ When should iterative refinement stop?                     │
│ Q3 │ How should source temporal freshness factor in?            │
└─────────────────────────────────────────────────────────────────┘
```
### 3.1 Pipeline Architecture
```mermaid
flowchart LR
    subgraph PIPELINE ["🔬 Deep Research Pipeline (R1–R4)"]
        direction LR
        R1["Round 1\nResearch Artifact"] -->|score| E1{Gate?}
        E1 -->|PASS| C["coalesce()\nbest-of-N"]
        E1 -->|FAIL| R2["Round 2\nRefined Artifact"]
        R2 -->|score| E2{Gate?}
        E2 -->|PASS| C
        E2 -->|FAIL| R3["Round 3"]
        R3 -->|score| E3{Gate?}
        E3 -->|PASS| C
        E3 -->|FAIL| R4["Round 4\n(Max)"]
        R4 -->|score| C
    end
    C --> OUT["✅ Final Output"]
```
### 3.2 Why the Original CI Formula Failed
The original proposal was a weighted arithmetic mean (WAM):
```text
CI = 0.25 × coverage + 0.20 × source_quality + 0.20 × agreement
   + 0.20 × verification + 0.15 × recency

Stop when: CI ≥ 0.80 AND recent_sources_count ≥ 10 AND critical_contradictions == 0
```
The compensability problem: a WAM lets high scores in one dimension mask failure in another. All five dimensions are shown so the arithmetic is checkable:
| Scenario | Coverage | Src Quality | Agreement | Verification | Recency | WAM Result | Actual Quality |
|---|---|---|---|---|---|---|---|
| Nominal | 0.90 | 0.89 | 0.89 | 0.85 | 0.80 | 0.87 ✅ | Good |
| Masked failure | 0.95 | 0.95 | 0.95 | 0.20 | 0.95 | 0.80 ✅ | ❌ Dangerous |
| True failure | 0.40 | 0.40 | 0.40 | 0.40 | 0.40 | 0.40 ❌ | Correctly caught |
A verification score of 0.20 represents essentially unverified claims, yet the masked-failure row still clears the CI ≥ 0.80 gate. The single composite number hides the failure.
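The masking arithmetic is easy to verify directly. A minimal sketch using the WAM weights from the formula above, assuming strong (0.95) scores everywhere except a failed verification dimension; the helper names are illustrative:

```python
# Illustrative check of the compensability problem: the old WAM composite
# vs. a simple non-compensatory floor check on the same scores.

WEIGHTS = {
    "coverage": 0.25,
    "source_quality": 0.20,
    "agreement": 0.20,
    "verification": 0.20,
    "recency": 0.15,
}

def wam(scores):
    """Old composite CI: weighted arithmetic mean over all five dimensions."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

masked_failure = {
    "coverage": 0.95,
    "source_quality": 0.95,
    "agreement": 0.95,
    "verification": 0.20,   # essentially unverified claims
    "recency": 0.95,
}

print(f"WAM = {wam(masked_failure):.2f}")                    # WAM = 0.80
print(f"floor ok = {min(masked_failure.values()) >= 0.75}")  # floor ok = False
```

The composite clears the 0.80 gate while the non-compensatory minimum check immediately flags the failed dimension.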
## 4. Decision
Adopt a DORA-style multidimensional classification matrix as the primary stop gate. Demote the CI weighted sum to a secondary tiebreaker role only.
### 4.1 The Multidimensional Classification Matrix
Each research dimension is independently scored into one of four performance tiers:
**Research Quality Classification Matrix**

| Tier | Coverage | Src Quality | Agreement | Verification | Recency |
|---|---|---|---|---|---|
| 🏆 Elite | ≥ 0.90 | ≥ 0.90 | ≥ 0.90 | ≥ 0.90 | ≥ 0.90 |
| ✅ High | ≥ 0.75 | ≥ 0.75 | ≥ 0.75 | ≥ 0.75 | ≥ 0.75 |
| ⚠️ Medium | ≥ 0.50 | ≥ 0.50 | ≥ 0.50 | ≥ 0.50 | ≥ 0.50 |
| ❌ Low | < 0.50 | < 0.50 | < 0.50 | < 0.50 | < 0.50 |

Recency is scored on the same 0–1 scale as the other dimensions: the proportion of sources inside the domain-adaptive window (see §5.4).
Dimension definitions:
- Coverage – breadth of topics addressed relative to the query scope
- Source Quality – authority, peer-review status, and citation depth of sources
- Agreement – cross-source corroboration; absence of contradicting claims
- Verification – active fact-checking and claim provenance tracing
- Recency – proportion of sources published within the domain-adaptive window
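Classifying a single dimension score into a tier is a simple threshold walk. A minimal sketch using the floors from the matrix above; the names are illustrative:

```python
# Minimal sketch: map one dimension score (0.0-1.0) to a performance tier.
# Floor values match the classification matrix; names are illustrative.

TIER_FLOORS = [          # checked from the top tier down
    ("Elite", 0.90),
    ("High", 0.75),
    ("Medium", 0.50),
]

def classify_dimension(score: float) -> str:
    """Return the highest tier whose floor the score clears, else 'Low'."""
    for tier, floor in TIER_FLOORS:
        if score >= floor:
            return tier
    return "Low"

print(classify_dimension(0.92))  # Elite
print(classify_dimension(0.75))  # High
print(classify_dimension(0.20))  # Low
```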
### 4.2 Stop Gate Logic
```mermaid
flowchart TD
    GS["📊 Score All 5 Dimensions"]
    GS --> CK1{"All dimensions\n≥ High?"}
    CK1 -->|No| CK1F["❌ CONTINUE\nto next round"]
    CK1 -->|Yes| CK2{"recent_sources_count\n≥ 10?"}
    CK2 -->|No| CK2F["❌ CONTINUE\nto next round"]
    CK2 -->|Yes| CK3{"critical_contradictions\n== 0?"}
    CK3 -->|No| CK3F["🛑 HALT (contradiction)\ncoalesce() selects best prior round"]
    CK3 -->|Yes| PASS["✅ STOP\nThis round passes"]
    CK1F --> REG{"Consecutive\nregression\ndetected?"}
    CK2F --> REG
    REG -->|Yes| STOP2["🛑 PATIENCE STOP\ncoalesce() selects best round"]
    REG -->|No| NEXT["▶ Next Round"]
```
Stopping conditions (formal):
| Condition | Type | Trigger |
|---|---|---|
| All dimensions ≥ High | Non-compensatory conjunction | Primary pass gate |
| recent_sources_count ≥ 10 | Hard floor | Minimum evidence gate |
| critical_contradictions == 0 | Veto override | Single disqualifier |
| Consecutive regression (patience = 1) | ML early stopping | Performance degradation |
| Round R4 reached | Hard bound | Max iteration guard |
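The five conditions compose into one decision per round. The sketch below mirrors the flowchart ordering (conjunction and evidence floor first, contradiction veto on passing rounds, patience stop on failing ones, then the hard bound); every function and field name is an illustrative assumption, not the pipeline's real API:

```python
# Hedged sketch of the stop gate: non-compensatory tier conjunction,
# hard evidence floor, contradiction veto, patience stop, hard bound.
# All names are illustrative assumptions.

TIER_RANK = {"Low": 0, "Medium": 1, "High": 2, "Elite": 3}

def stop_decision(tiers, recent_sources, critical_contradictions,
                  regressed_consecutively, round_index, max_rounds=4):
    gate_ok = (all(TIER_RANK[t] >= TIER_RANK["High"] for t in tiers.values())
               and recent_sources >= 10)
    if gate_ok:
        if critical_contradictions > 0:
            return "HALT"            # veto: coalesce() picks best prior round
        return "STOP"                # this round passes
    if regressed_consecutively:
        return "PATIENCE_STOP"       # patience = 1 early stopping
    if round_index >= max_rounds:
        return "STOP_MAX_ROUNDS"     # R4 hard bound
    return "CONTINUE"

all_high = {d: "High" for d in
            ("coverage", "source_quality", "agreement", "verification", "recency")}
print(stop_decision(all_high, recent_sources=12, critical_contradictions=0,
                    regressed_consecutively=False, round_index=2))  # STOP
```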
### 4.3 Best-of-N Tiebreaker (CI as Secondary Scorer)
When coalesce() must rank multiple passing rounds, or select the "least bad" round when all fail, the CI weighted sum is used only at this step:
```text
CI (tiebreaker) = 0.25 × coverage
                + 0.20 × source_quality
                + 0.20 × agreement
                + 0.20 × verification
                + 0.15 × recency
```
⚠️ This weight vector has no empirical calibration. See §7 (Open Questions) for sensitivity analysis requirements.
Alternative tiebreaker weight candidates:
| Weight Strategy | Coverage | Src Quality | Agreement | Verification | Recency | Basis |
|---|---|---|---|---|---|---|
| Proposed | 0.25 | 0.20 | 0.20 | 0.20 | 0.15 | Intuitive ranking |
| Equal weights | 0.20 | 0.20 | 0.20 | 0.20 | 0.20 | Dawes (1979) |
| ROC weights | 0.46 | 0.17 | 0.17 | 0.17 | 0.04 | Barron & Barrett (1996) |
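Whichever weight vector survives the sensitivity analysis, the tiebreaker's role stays the same: rank candidates inside coalesce(), never gate. A minimal sketch with the Proposed weights; the round data and function names are illustrative assumptions:

```python
# Sketch of the tiebreaker role: CI ranks candidate rounds inside
# coalesce(); it never acts as a pass/fail gate. Names are illustrative.

WEIGHTS = {"coverage": 0.25, "source_quality": 0.20, "agreement": 0.20,
           "verification": 0.20, "recency": 0.15}

def tiebreaker_ci(scores):
    """Weighted-sum CI used only for ranking, not gating."""
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)

def coalesce(candidate_rounds):
    """Best-of-N: highest tiebreaker CI wins (also the 'least bad' fallback)."""
    return max(candidate_rounds, key=tiebreaker_ci)

r2 = {"coverage": 0.80, "source_quality": 0.78, "agreement": 0.82,
      "verification": 0.76, "recency": 0.79}
r3 = {"coverage": 0.85, "source_quality": 0.80, "agreement": 0.84,
      "verification": 0.81, "recency": 0.77}
print(coalesce([r2, r3]) is r3)  # True: r3 edges out r2 on the weighted sum
```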
## 5. Rationale
### 5.1 Production Precedents for Multidimensional Gates
| System | Approach | Compensatory? | Analog in This ADR |
|---|---|---|---|
| DORA Metrics | Elite/High/Medium/Low per dimension; no aggregate | ❌ No | Direct inspiration for tier model |
| SonarQube | Conjunctive A-ratings; any failure = overall FAIL | ❌ No | critical_contradictions == 0 veto |
| Google SRE Error Budget | Binary gate; budget exhaustion halts all changes | ❌ No | Hard floor constraints |
| Dynatrace Quality Gates | key_SLI flag creates hard non-compensatory gate | ❌ No (for key SLIs) | Dimension-level "veto" semantics |
| OpenAI Evals | Per-metric scores; no built-in composite | ❌ No | Per-dimension reporting |
| Stanford HELM | Equal-weighted composite (abandoned) | ✅ Yes (original) | ⚠️ Counter-example: HELM moved away |
### 5.2 Compensability Danger Illustrated
```text
┌───────────────────────────────────────────────────────────┐
│ "The Masking Problem": WAM Hides Catastrophic Failure     │
│                                                           │
│ Coverage      ███████████████████░  0.95                  │
│ Src Quality   ███████████████████░  0.95                  │
│ Agreement     ███████████████████░  0.95                  │
│ Verification  ████░░░░░░░░░░░░░░░░  0.20  ← FAILURE       │
│ Recency       ███████████████████░  0.95                  │
│                                                           │
│ WAM Result:   0.80 ✅  (passes the old CI ≥ 0.80 gate!)   │
│ Matrix Gate:  ❌ FAIL  (Verification = Low)               │
└───────────────────────────────────────────────────────────┘
```
### 5.3 Iterative Refinement Degradation: Why Bounds Matter
LLM iterative self-improvement has empirically established limits:
```mermaid
xychart-beta
    title "Quality vs. Refinement Round (SELF-REFINE empirical pattern)"
    x-axis ["R1", "R2", "R3", "R4", "R5"]
    y-axis "Relative Quality" 0 --> 130
    line [85, 105, 117, 119, 112]
```
| Research | Finding | Implication |
|---|---|---|
| SELF-REFINE (NeurIPS 2023) | Quality plateaus by round 3–4 | R1–R4 bound is empirically justified |
| Huang et al. (ICLR 2024) | Without external feedback, self-correction degrades | Consecutive-regression stop is necessary |
| Snell et al. (2024) | 38% of revisions corrupt correct answers | coalesce() best-of-N is required, not optional |
| Gao, Schulman & Hilton (ICML 2023) | BoN degrades past N ≈ 16 under proxy scoring | 4-round bound is safely within optimum |
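The degradation pattern is what motivates the consecutive-regression stop. A minimal sketch of that signal (patience = 1), using the quality series from the chart above; the function name is an illustrative assumption:

```python
# Minimal sketch of the consecutive-regression signal (patience = 1) that
# feeds the PATIENCE STOP: flag a round whose quality dropped below its
# predecessor's. Name and data shape are illustrative assumptions.

def regressed(quality_by_round):
    """True when the latest round scored below the one before it."""
    return len(quality_by_round) >= 2 and quality_by_round[-1] < quality_by_round[-2]

print(regressed([85, 105, 117, 119]))        # R4 improved on R3 -> False
print(regressed([85, 105, 117, 119, 112]))   # R5 dipped below R4 -> True
```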
### 5.4 Recency: Why 183 Days Is Wrong
The prior 183-day freshness window was borrowed from international tax residency law (the IRS Substantial Presence Test), not from information retrieval; it has no IR basis.
Domain-adaptive window targets (to replace 183-day hard cutoff):
| Domain | Recommended Freshness Window | Rationale |
|---|---|---|
| Current events / politics | 7–30 days | High publication velocity |
| Technology releases | 30–90 days | Rapid version churn |
| Academic research surveys | 12β24 months | Peer review cycle |
| Foundational science | Recency not required | Timeless literature |
| Financial market data | Hours–days | Real-time decay |
Decision: Replace the fixed 183-day window with a domain-adaptive freshness classifier that infers the appropriate window from source publication rate (following the Google QDF model).
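One way to operationalize this decision: look up the window per domain and score recency as the proportion of sources inside it, matching the Recency dimension definition above. The domain keys, window values, and every name below are illustrative assumptions, not a committed schema:

```python
# Hedged sketch of a domain-adaptive freshness window. WINDOW_DAYS encodes
# the target windows from the table above (upper bounds, illustrative);
# the recency score is the proportion of sources inside the window.

from datetime import date, timedelta

WINDOW_DAYS = {
    "current_events": 30,         # high publication velocity
    "technology": 90,             # rapid version churn
    "academic_survey": 730,       # peer review cycle (~24 months)
    "financial": 1,               # real-time decay
    "foundational_science": None, # recency not required
}

def recency_score(domain, publication_dates, today):
    """Proportion of sources published inside the domain's freshness window."""
    window = WINDOW_DAYS.get(domain)
    if window is None:
        return 1.0                # timeless literature always counts as fresh
    if not publication_dates:
        return 0.0
    cutoff = today - timedelta(days=window)
    fresh = sum(1 for d in publication_dates if d >= cutoff)
    return fresh / len(publication_dates)

today = date(2026, 2, 23)
print(recency_score("technology", [date(2026, 1, 15), date(2024, 6, 1)], today))
# one of two sources falls inside the 90-day window -> 0.5
```

A production version would infer the window from source publication rate per the QDF approach rather than a static lookup table.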
## 6. Consequences
### 6.1 Positive Outcomes
| Outcome | Mechanism |
|---|---|
| No more masked failures | Non-compensatory conjunction; one Low blocks the gate |
| Aligns with industry direction | DORA, OpenAI Evals, HELM all moved to per-dimension reporting |
| Empirically bounded iterations | R1–R4 cap validated by SELF-REFINE & BoN overoptimization research |
| Scores are interpretable | Each dimension has a clear tier; no opaque composite |
### 6.2 Trade-offs and Risks
| Risk | Severity | Mitigation |
|---|---|---|
| CI tiebreaker weights are uncalibrated | 🟠 Medium | Sensitivity analysis (§7) before deployment |
| Recency tier thresholds are provisional | 🟠 Medium | Domain-adaptive window per §5.4 |
| All-fail scenarios need fallback | 🟡 Low | coalesce() selects the highest-CI "least bad" round |
| No confidence intervals on tier scores | 🟡 Low | Follow Anthropic "Adding Error Bars" guidance |
| recent_sources_count ≥ 10 floor is arbitrary | 🟡 Low | Validate against corpus quality benchmarks |
## 7. Open Questions & Required Actions
| # | Action | Owner | Priority | Deadline |
|---|---|---|---|---|
| 1 | Sensitivity analysis: test Proposed vs. Equal vs. ROC tiebreaker weights on historical pipeline outputs | Research Infra | 🔴 P1 | Before prod |
| 2 | Implement domain-adaptive recency window; replace 183-day hard cutoff | Platform Eng | 🔴 P1 | Before prod |
| 3 | Add confidence intervals to dimension tier classifications | ML Platform | 🟠 P2 | Sprint +1 |
| 4 | Validate recent_sources_count ≥ 10 floor against corpus size benchmarks | Research Infra | 🟠 P2 | Sprint +2 |
| 5 | Difficulty-adaptive thresholds: consider varying the "High" bar by query complexity | Research | 🟡 P3 | Backlog |
## 8. Alternatives Considered
| Alternative | Description | Rejected Because |
|---|---|---|
| Weighted CI as primary gate (prior design) | WAM of 5 dimensions with 0.80 threshold | Compensability masks catastrophic single-dimension failures |
| Geometric mean | Partial compensability penalty | Double-penalizes low scores alongside explicit floor gates; over-penalizes |
| Pure minimum (SonarQube-strict) | Gate = min(all dimensions) ≥ 0.75 | Too brittle; a single noisy dimension could permanently stall the pipeline |
| No composite, conjunction-only | Report tier vector; no CI tiebreaker at all | coalesce() requires a scalar to rank candidates; cannot be avoided |
## 9. References
| Citation | Relevance |
|---|---|
| DORA State of DevOps 2024 | Foundation for multidimensional tier model |
| Madaan et al. – SELF-REFINE (NeurIPS 2023) | Validates R1–R4 iteration bound |
| Huang et al. (ICLR 2024) | Demonstrates intrinsic self-correction degradation |
| Snell et al. (2024) – Sequential Revision Models | Motivates coalesce() best-of-N over final-round selection |
| Gao, Schulman & Hilton (ICML 2023) | BoN overoptimization scaling laws; validates 4-round bound |
| Dawes (1979) – The Robust Beauty of Improper Linear Models | Equal-weight baseline for tiebreaker |
| Barron & Barrett (1996) – ROC Weights | ROC weight alternative for tiebreaker |
| Prechelt (1998) – Early Stopping | Patience-based stopping; maps to consecutive-regression mechanism |
| Barbaresi (2020) – htmldate (JOSS) | Date extraction reliability limits |
| Anthropic – Adding Error Bars to Evals (2024) | Confidence interval guidance |
## 10. Decision Record Summary
```text
┌────────────────────────────────────────────────────────────┐
│                    ADR-007 DECISION MAP                    │
├────────────────────────────────────────────────────────────┤
│                                                            │
│ OLD: one CI score ──► threshold gate ──► stop/continue     │
│                                                            │
│ NEW: 5 dimensions ──► each classified independently        │
│        │                                                   │
│        ├─ All ≥ High?  AND  sources ≥ 10?                  │
│        │               AND  contradictions == 0?           │
│        │                  └──► ✅ STOP                     │
│        │                                                   │
│        └─ Any < High? ──► CONTINUE, or PATIENCE STOP       │
│                           after regression                 │
│                                                            │
│ coalesce(): CI score used ONLY to rank candidates          │
└────────────────────────────────────────────────────────────┘
```
Approved by: (pending review) Next review date: 2026-03-23