ADR-007: Multidimensional Classification Matrix and Stop Policy for Iterative Deep Research


| Field | Value |
|---|---|
| ID | ADR-007 |
| Status | 🟡 Proposed |
| Date | 2026-02-23 |
| System | agentic-workflows-v2 · deep-research pipeline |
| Authors | Platform Engineering |
| Reviewers | Research Infra, ML Platform |
| Supersedes | (none) |

1. TL;DR

We abandon the single weighted "Confidence Index" scalar as the gate for stopping iterative research rounds. Instead, we adopt a DORA-style multidimensional classification matrix where every dimension must independently achieve "High" or better before stopping. The CI scalar is retained only as a tiebreaker for the coalesce() best-of-N selector.


2. Status History

| Date | Status | Note |
|---|---|---|
| 2026-02-10 | 🔵 Draft | Initial CI-weighted-sum proposal |
| 2026-02-18 | 🟠 Under Review | Challenge: compensability masks single-dimension failures |
| 2026-02-23 | 🟡 Proposed | Pivoted to multidimensional matrix; CI demoted to tiebreaker |

3. Context & Problem Statement

The agentic-workflows-v2 system runs a 10-node deep research pipeline with bounded iterative rounds R1–R4. Each round produces a research artifact. The system must answer three questions before committing to a round result or continuing:

┌───────────────────────────────────────────────────────────────┐
│          THREE QUESTIONS THE STOP POLICY MUST ANSWER          │
├───────────────────────────────────────────────────────────────┤
│ Q1 │ How should multi-dimensional research quality be scored? │
│ Q2 │ When should iterative refinement stop?                   │
│ Q3 │ How should source temporal freshness factor in?          │
└───────────────────────────────────────────────────────────────┘

3.1 Pipeline Architecture

```mermaid
flowchart LR
    subgraph PIPELINE ["🔬 Deep Research Pipeline (R1–R4)"]
        direction LR
        R1["Round 1\nResearch Artifact"] -->|score| E1{Gate?}
        E1 -->|PASS| C["coalesce()\nbest-of-N"]
        E1 -->|FAIL| R2["Round 2\nRefined Artifact"]
        R2 -->|score| E2{Gate?}
        E2 -->|PASS| C
        E2 -->|FAIL| R3["Round 3"]
        R3 -->|score| E3{Gate?}
        E3 -->|PASS| C
        E3 -->|FAIL| R4["Round 4\n(Max)"]
        R4 -->|score| C
    end
    C --> OUT["✅ Final Output"]
```
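The bounded round loop above can be sketched in a few lines. This is a minimal illustration only: `run_round`, `score_round`, and `passes_gate` are hypothetical stand-ins for the real pipeline nodes.

```python
# Illustrative driver for the bounded R1-R4 loop with best-of-N coalescing.
# All callables are assumed stand-ins, not the production node API.
MAX_ROUNDS = 4

def run_pipeline(run_round, score_round, passes_gate):
    """Iterate rounds until the gate passes or the R4 hard bound is hit,
    then coalesce: return the best-scoring candidate, not the last one."""
    candidates = []
    artifact = None
    for r in range(1, MAX_ROUNDS + 1):
        artifact = run_round(r, artifact)   # refine the previous artifact
        scores = score_round(artifact)
        candidates.append((scores, artifact))
        if passes_gate(scores):
            break
    # coalesce(): best-of-N selection over every round produced so far
    return max(candidates, key=lambda c: c[0])[1]
```

Note that the winner is chosen over all rounds, not the final one; this matters because later rounds can regress (see §5.3).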

3.2 Why the Original CI Formula Failed

The original proposal was a weighted arithmetic mean (WAM):

CI = 0.25 × coverage + 0.20 × source_quality + 0.20 × agreement + 0.20 × verification + 0.15 × recency

Stop when: CI ≥ 0.80 AND recent_sources_count ≥ 10 AND critical_contradictions == 0

The compensability problem — a WAM lets high scores in one dimension mask failure in another:

| Scenario | Coverage | Src Quality | Agreement | Verification | Recency | WAM | Actual Quality |
|---|---|---|---|---|---|---|---|
| Nominal | 0.90 | 0.90 | 0.90 | 0.85 | 0.80 | 0.88 ✅ | Good |
| Masked failure | 0.95 | 0.95 | 0.95 | 0.20 | 0.95 | 0.80 ✅ | ❌ Dangerous |
| True failure | 0.40 | 0.40 | 0.40 | 0.40 | 0.40 | 0.40 ❌ | Correctly caught |

A verification score of 0.20 represents essentially unverified claims — yet the WAM still reaches the 0.80 threshold and passes the gate. The single composite number hides the failure.
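The masking effect is easy to reproduce. The sketch below implements the WAM formula above with its published weights; the field names are illustrative, not the production schema.

```python
# The original (now-rejected) compensatory gate, for illustration only.
WEIGHTS = {
    "coverage": 0.25,
    "source_quality": 0.20,
    "agreement": 0.20,
    "verification": 0.20,
    "recency": 0.15,
}

def weighted_ci(scores: dict[str, float]) -> float:
    """Weighted arithmetic mean (WAM) of the five dimension scores."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)

# "Masked failure": four strong dimensions hide an unverified artifact.
masked = {
    "coverage": 0.95,
    "source_quality": 0.95,
    "agreement": 0.95,
    "verification": 0.20,  # effectively unverified claims
    "recency": 0.95,
}
ci = weighted_ci(masked)
# Scores are reported to two decimals, so compare on the rounded value.
print(f"CI = {ci:.2f}, clears 0.80 gate: {round(ci, 2) >= 0.80}")
```

Despite a catastrophic verification score, the composite clears the old gate.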


4. Decision

Adopt a DORA-style multidimensional classification matrix as the primary stop gate. Demote the CI weighted sum to a secondary tiebreaker role only.

4.1 The Multidimensional Classification Matrix

Each research dimension is independently scored into one of four performance tiers:

RESEARCH QUALITY CLASSIFICATION MATRIX

| Tier | Coverage | Src Quality | Agreement | Verification | Recency |
|---|---|---|---|---|---|
| 🏆 Elite | ≥ 0.90 | ≥ 0.90 | ≥ 0.90 | ≥ 0.90 | ≥ 0.90 |
| ✅ High | ≥ 0.75 | ≥ 0.75 | ≥ 0.75 | ≥ 0.75 | ≥ 0.75 |
| ⚠️ Medium | ≥ 0.50 | ≥ 0.50 | ≥ 0.50 | ≥ 0.50 | ≥ 0.50 |
| ❌ Low | < 0.50 | < 0.50 | < 0.50 | < 0.50 | < 0.50 |

(Recency is scored as the proportion of sources falling inside the domain-adaptive freshness window of §5.4, so it shares the same 0–1 thresholds as the other dimensions.)

Dimension definitions:

  • Coverage — breadth of topics addressed relative to the query scope
  • Source Quality — authority, peer-review status, and citation depth of sources
  • Agreement — cross-source corroboration; absence of contradicting claims
  • Verification — active fact-checking and claim provenance tracing
  • Recency — proportion of sources published within the domain-adaptive window

4.2 Stop Gate Logic

```mermaid
flowchart TD
    GS["📊 Score All 5 Dimensions"]
    GS --> CK1{"All dimensions\n≥ High?"}
    CK1 -->|No| CK1F["❌ CONTINUE\nto next round"]
    CK1 -->|Yes| CK2{"recent_sources_count\n≥ 10?"}
    CK2 -->|No| CK2F["❌ CONTINUE\nto next round"]
    CK2 -->|Yes| CK3{"critical_contradictions\n== 0?"}
    CK3 -->|No| CK3F["🛑 HALT (contradiction)\ncoalesce() selects best prior round"]
    CK3 -->|Yes| PASS["✅ STOP\nThis round passes"]

    CK1F --> REG{"Consecutive\nregression\ndetected?"}
    CK2F --> REG
    REG -->|Yes| STOP2["🛑 PATIENCE STOP\ncoalesce() selects best round"]
    REG -->|No| NEXT["▶ Next Round"]
```

Stopping conditions (formal):

| Condition | Type | Trigger |
|---|---|---|
| All dimensions ≥ High | Non-compensatory conjunction | Primary pass gate |
| recent_sources_count ≥ 10 | Hard floor | Minimum evidence gate |
| critical_contradictions == 0 | Veto override | Single disqualifier |
| Consecutive regression (patience = 1) | ML early stopping | Performance degradation |
| Round R4 reached | Hard bound | Max iteration guard |
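The conjunction, floor, veto, and patience conditions can be sketched as one decision function. The names below are illustrative stand-ins, not the production API; the R4 hard bound lives outside this function, in the round loop itself.

```python
# Sketch of the stop-gate conjunction formalized above.
PASSING_TIERS = {"Elite", "High"}

def stop_gate(tiers: dict[str, str],
              recent_sources_count: int,
              critical_contradictions: int,
              regression_streak: int,
              patience: int = 1) -> str:
    """Return 'STOP', 'HALT', 'PATIENCE_STOP', or 'CONTINUE'."""
    if (all(t in PASSING_TIERS for t in tiers.values())
            and recent_sources_count >= 10):
        # Veto override: one critical contradiction disqualifies the round
        # and hands selection back to coalesce() over prior rounds.
        return "HALT" if critical_contradictions > 0 else "STOP"
    # ML-style early stopping: quality regressed `patience` rounds in a row.
    if regression_streak >= patience:
        return "PATIENCE_STOP"
    return "CONTINUE"
```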

4.3 Best-of-N Tiebreaker (CI as Secondary Scorer)

When coalesce() must rank multiple passing rounds — or select the "least bad" round when all fail — the CI weighted sum is used only at this step:

CI (tiebreaker) = 0.25 × coverage
                + 0.20 × source_quality
                + 0.20 × agreement
                + 0.20 × verification
                + 0.15 × recency

⚠️ This weight vector has no empirical calibration. See §7 (Open Questions) for sensitivity analysis requirements.

Alternative tiebreaker weight candidates:

| Weight Strategy | Coverage | Src Quality | Agreement | Verification | Recency | Basis |
|---|---|---|---|---|---|---|
| Proposed | 0.25 | 0.20 | 0.20 | 0.20 | 0.15 | Intuitive ranking |
| Equal weights | 0.20 | 0.20 | 0.20 | 0.20 | 0.20 | Dawes (1979) |
| ROC weights (middle ranks tied) | 0.46 | 0.17 | 0.17 | 0.17 | 0.04 | Barron & Barrett (1996) |
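For reference, ROC weights can be derived directly from the rank ordering. The sketch below follows the Barron & Barrett rank-order-centroid formula; averaging the three tied middle ranks (source quality, agreement, verification) is an assumption matching the proposed ranking, where those three carry equal weight.

```python
# Rank-Order Centroid (ROC) weights for n ranked criteria.
def roc_weights(n: int) -> list[float]:
    """ROC weight for rank i (1-based): (1/n) * sum_{k=i..n} 1/k."""
    return [sum(1.0 / k for k in range(i, n + 1)) / n for i in range(1, n + 1)]

w = roc_weights(5)          # rank 1 = coverage, rank 5 = recency
tied = sum(w[1:4]) / 3      # average the tied middle ranks 2-4
print([round(x, 2) for x in w])   # [0.46, 0.26, 0.16, 0.09, 0.04]
print(round(tied, 2))             # 0.17
```

The weights sum to 1 by construction, unlike an ad hoc vector.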

5. Rationale

5.1 Production Precedents for Multidimensional Gates

| System | Approach | Compensatory? | Analog in This ADR |
|---|---|---|---|
| DORA Metrics | Elite/High/Medium/Low per dimension; no aggregate | ❌ No | Direct inspiration for tier model |
| SonarQube | Conjunctive A-ratings; any failure = overall FAIL | ❌ No | critical_contradictions == 0 veto |
| Google SRE Error Budget | Binary gate; budget exhaustion halts all changes | ❌ No | Hard floor constraints |
| Dynatrace Quality Gates | key_SLI flag creates hard non-compensatory gate | ❌ No (for key SLIs) | Dimension-level "veto" semantics |
| OpenAI Evals | Per-metric scores; no built-in composite | ❌ No | Per-dimension reporting |
| Stanford HELM | Equal-weighted composite (abandoned) | ✅ Yes (original) | ⚠️ Counter-example — HELM moved away |

5.2 Compensability Danger Illustrated

┌──────────────────────────────────────────────────────────┐
│  "The Masking Problem" — WAM Hides Catastrophic Failure  │
│                                                          │
│  Coverage      ███████████████████░  0.95                │
│  Src Quality   ███████████████████░  0.95                │
│  Agreement     ███████████████████░  0.95                │
│  Verification  ████░░░░░░░░░░░░░░░░  0.20  ← FAILURE     │
│  Recency       ███████████████████░  0.95                │
│                                      ──────              │
│  WAM Result:   0.80  ✅  (clears old CI ≥ 0.80 gate)     │
│  Matrix Gate:  ❌ FAIL  (Verification = Low)             │
└──────────────────────────────────────────────────────────┘

5.3 Iterative Refinement Degradation — Why Bounds Matter

LLM iterative self-improvement has empirically established limits:

```mermaid
xychart-beta
    title "Quality vs. Refinement Round (SELF-REFINE empirical pattern)"
    x-axis ["R1", "R2", "R3", "R4", "R5"]
    y-axis "Relative Quality" 0 --> 130
    line [85, 105, 117, 119, 112]
```

| Research | Finding | Implication |
|---|---|---|
| SELF-REFINE (NeurIPS 2023) | Quality plateaus by round 3–4 | R1–R4 bound is empirically justified |
| Huang et al. (ICLR 2024) | Without external feedback, self-correction degrades | Consecutive-regression stop is necessary |
| Snell et al. (2024) | 38% of revisions corrupt correct answers | coalesce() best-of-N is required, not optional |
| Gao, Schulman & Hilton (ICML 2023) | BoN degrades past N≈16 under proxy scoring | 4-round bound is safely within optimum |

5.4 Recency — Why 183 Days Is Wrong

The prior 183-day freshness window was borrowed from international tax residency law (IRS Substantial Presence Test), not information retrieval. It has no IR basis.

Domain-adaptive window targets (to replace 183-day hard cutoff):

| Domain | Recommended Freshness Window | Rationale |
|---|---|---|
| Current events / politics | 7–30 days | High publication velocity |
| Technology releases | 30–90 days | Rapid version churn |
| Academic research surveys | 12–24 months | Peer review cycle |
| Foundational science | Recency not required | Timeless literature |
| Financial market data | Hours–days | Real-time decay |

Decision: Replace the fixed 183-day window with a domain-adaptive freshness classifier that infers the appropriate window from source publication rate (following the Google QDF model).
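One way such a classifier could infer the window is from publication cadence alone. The sketch below is purely illustrative: the bucket boundaries are assumptions echoing the table above, not a calibrated QDF model, and the real classifier may use different signals entirely.

```python
# Hypothetical sketch: derive a freshness window from the median gap
# between source publication dates. Bucket boundaries are assumptions.
from datetime import date

def adaptive_window_days(pub_dates: list[date]) -> int:
    """Pick a freshness window (days) from observed publication cadence."""
    if len(pub_dates) < 2:
        return 365  # too little signal: fall back to a broad window
    ordered = sorted(pub_dates)
    gaps = sorted((b - a).days for a, b in zip(ordered, ordered[1:]))
    median_gap = gaps[len(gaps) // 2]
    if median_gap <= 2:    # near-daily publication: fast-moving topic
        return 30
    if median_gap <= 14:   # weekly-ish churn: technology-release cadence
        return 90
    return 540             # slow-moving domain: roughly a peer-review cycle

fast_topic = [date(2026, 1, d) for d in (2, 3, 5, 6, 8)]
print(adaptive_window_days(fast_topic))  # high velocity -> 30-day window
```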


6. Consequences

6.1 Positive Outcomes

| Outcome | Mechanism |
|---|---|
| No more masked failures | Non-compensatory conjunction; one Low blocks the gate |
| Aligns with industry direction | DORA, OpenAI Evals, HELM all moved to per-dimension reporting |
| Empirically bounded iterations | R1–R4 cap validated by SELF-REFINE & BoN overoptimization research |
| Scores are interpretable | Each dimension has a clear tier; no opaque composite |

6.2 Trade-offs and Risks

| Risk | Severity | Mitigation |
|---|---|---|
| CI tiebreaker weights are uncalibrated | 🟠 Medium | Sensitivity analysis (§7) before deployment |
| Recency tier thresholds are provisional | 🟠 Medium | Domain-adaptive window per §5.4 |
| All-fail scenarios need fallback | 🟡 Low | coalesce() selects the highest-CI "least bad" round |
| No confidence intervals on tier scores | 🟡 Low | Follow Anthropic "Adding Error Bars" guidance |
| recent_sources_count ≥ 10 floor is arbitrary | 🟡 Low | Validate against corpus quality benchmarks |

7. Open Questions & Required Actions

| # | Action | Owner | Priority | Deadline |
|---|---|---|---|---|
| 1 | Sensitivity analysis: test Proposed vs. Equal vs. ROC tiebreaker weights on historical pipeline outputs | Research Infra | 🔴 P1 | Before prod |
| 2 | Implement domain-adaptive recency window; replace 183-day hard cutoff | Platform Eng | 🔴 P1 | Before prod |
| 3 | Add confidence intervals to dimension tier classifications | ML Platform | 🟠 P2 | Sprint +1 |
| 4 | Validate recent_sources_count ≥ 10 floor against corpus size benchmarks | Research Infra | 🟠 P2 | Sprint +2 |
| 5 | Difficulty-adaptive thresholds: consider varying "High" bar by query complexity | Research | 🟡 P3 | Backlog |

8. Alternatives Considered

| Alternative | Description | Rejected Because |
|---|---|---|
| Weighted CI as primary gate (prior design) | WAM of 5 dimensions with 0.80 threshold | Compensability masks catastrophic single-dimension failures |
| Geometric mean | Partial compensability penalty | Double-penalizes low scores alongside explicit floor gates; over-penalizes |
| Pure minimum (SonarQube-strict) | Gate = min(all dimensions) ≥ 0.75 | Too brittle; single noisy dimension could permanently stall pipeline |
| No composite, conjunct-only | Report tier vector; no CI tiebreaker at all | coalesce() requires a scalar to rank candidates; cannot avoid |
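The behavioral differences between these aggregators are easy to see numerically. The sketch below scores the masked-failure vector from §5.2 under each strategy; weights follow the proposed tiebreaker formula, and the helper names are illustrative.

```python
# Comparing aggregation strategies on the "masked failure" vector
# (four strong dimensions, verification at 0.20). Illustration only.
import math

def wam(scores, weights):
    """Weighted arithmetic mean: fully compensatory."""
    return sum(w * s for w, s in zip(weights, scores))

def gmean(scores):
    """Geometric mean: partially compensatory, punishes low scores."""
    return math.prod(scores) ** (1 / len(scores))

def floor_gate(scores):
    """Pure minimum (SonarQube-strict): entirely non-compensatory."""
    return min(scores)

masked = [0.95, 0.95, 0.95, 0.20, 0.95]
w = [0.25, 0.20, 0.20, 0.20, 0.15]
print(round(wam(masked, w), 2), round(gmean(masked), 2), floor_gate(masked))
# The WAM clears the 0.80 gate; the geometric mean (~0.70) and the
# minimum (0.20) both expose the verification failure.
```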

9. References

| Citation | Relevance |
|---|---|
| DORA State of DevOps 2024 | Foundation for multidimensional tier model |
| Madaan et al. — SELF-REFINE (NeurIPS 2023) | Validates R1–R4 iteration bound |
| Huang et al. (ICLR 2024) | Demonstrates intrinsic self-correction degradation |
| Snell et al. (2024) — Sequential Revision Models | Motivates coalesce() best-of-N over final-round selection |
| Gao, Schulman & Hilton (ICML 2023) | BoN overoptimization scaling laws; validates 4-round bound |
| Dawes (1979) — Robust Beauty of Improper Models | Equal-weight baseline for tiebreaker |
| Barron & Barrett (1996) — ROC Weights | ROC weight alternative for tiebreaker |
| Prechelt (1998) — Early Stopping | Patience-based stopping; maps to consecutive-regression mechanism |
| Barbaresi (2020) — htmldate (JOSS) | Date extraction reliability limits |
| Anthropic — Adding Error Bars to Evals (2024) | Confidence interval guidance |

10. Decision Record Summary

┌──────────────────────────────────────────────────────────┐
│                    ADR-007 DECISION MAP                  │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  OLD:  One CI score ──→ threshold gate ──→ stop/continue │
│                                                          │
│  NEW:  5 Dimensions ──→ each classified independently    │
│             │                                            │
│             ├── All ≥ High? ──→ AND ──→ sources ≥ 10?    │
│             │                   AND ──→ contradictions=0 │
│             │                    │                       │
│             │                  STOP ✅                   │
│             │                                            │
│             └── Any < High? ──→ CONTINUE or PATIENCE     │
│                                 STOP after regression    │
│                                                          │
│  coalesce(): CI score used ONLY to rank candidates       │
└──────────────────────────────────────────────────────────┘

Approved by: (pending review) · Next review date: 2026-03-23