
Consensus

consensus() runs N independent completions in parallel with the same prompt at a higher-than-usual temperature, normalizes whitespace in each response, and votes on the results. The winning response is returned with an agreement_ratio score so the caller can gate on confidence.

When to use / when not to use

| Use it when… | Avoid it when… |
| --- | --- |
| The answer is a short factual or classification label. | The answer is long-form prose; exact-match voting on long strings rarely agrees. |
| You can tolerate the extra cost and want to reduce hallucination. | Latency matters more than reliability. |
| You can run calls in parallel (the provider supports concurrency). | The model is rate-limited tightly enough that N parallel calls trigger 429s. |
| Tie handling is acceptable (you check tie_count). | You require a single deterministic answer per prompt. |
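
The last row matters in practice: a tie means the top answer was not a clean win. A minimal gate over the documented metadata keys might look like the following — the `is_trustworthy` helper and the 0.6 threshold are illustrative choices, not part of the library:

```python
def is_trustworthy(metadata: dict, min_agreement: float = 0.6) -> bool:
    """Decide whether to act on a consensus result.

    `metadata` is the PatternResult.metadata mapping documented under
    "Metadata keys". The threshold is an illustrative default.
    """
    # tie_count > 1 means two or more answers shared the top vote count.
    if metadata["tie_count"] > 1:
        return False
    return metadata["agreement_ratio"] >= min_agreement

print(is_trustworthy({"agreement_ratio": 0.8, "unique_responses": 2, "tie_count": 1}))  # True
print(is_trustworthy({"agreement_ratio": 0.4, "unique_responses": 3, "tie_count": 2}))  # False
```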

Call flow

```mermaid
sequenceDiagram
    participant App
    participant consensus
    participant Provider
    App->>consensus: consensus(provider, prompt, num_samples=5)
    consensus->>consensus: spawn 5 coros via gather_strict
    par parallel
        consensus->>Provider: complete(messages, temperature=0.9)
        consensus->>Provider: complete(messages, temperature=0.9)
        consensus->>Provider: complete(messages, temperature=0.9)
        consensus->>Provider: complete(messages, temperature=0.9)
        consensus->>Provider: complete(messages, temperature=0.9)
    end
    consensus->>consensus: normalize whitespace + Counter most_common
    alt strategy = unanimous and not all match
        consensus-->>App: raise ConsensusFailedError
    else
        consensus-->>App: PatternResult(value=winner, score=agreement_ratio, ...)
    end
```
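
The normalize-and-vote step in the diagram can be sketched in plain Python. This is an illustration of the mechanism only, not the library's implementation (which lives in executionkit/patterns/consensus.py):

```python
import re
from collections import Counter

def vote(responses: list[str]) -> tuple[str, float, int]:
    """Normalize whitespace, tally votes, and report agreement."""
    # Collapse whitespace runs so cosmetic differences don't split votes.
    normalized = [re.sub(r"\s+", " ", r.strip()) for r in responses]
    counts = Counter(normalized)
    winner, top_count = counts.most_common(1)[0]
    agreement_ratio = top_count / len(responses)
    # Responses tied for the top count; 1 means a clean win.
    tie_count = sum(1 for c in counts.values() if c == top_count)
    return winner, agreement_ratio, tie_count

print(vote(["billing", "billing\n", " billing", "tech", "other"]))
# → ('billing', 0.6, 1)
```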

Minimal example

```python
import asyncio
import os
from executionkit import Provider, consensus

async def main() -> None:
    async with Provider(
        base_url="https://api.openai.com/v1",
        api_key=os.environ["OPENAI_API_KEY"],
        model="gpt-4o-mini",
    ) as provider:
        result = await consensus(
            provider,
            "Classify this support ticket as exactly one of "
            "'billing', 'tech', or 'other':\n\n"
            "'My card was charged twice this month.'",
            num_samples=5,
            strategy="majority",                # or "unanimous"
        )

        print(result.value)                          # 'billing'
        print(result.metadata["agreement_ratio"])    # e.g. 0.8 = 4 of 5
        print(result.metadata["unique_responses"])   # 2
        print(result.metadata["tie_count"])          # 1 = no tie
        print(result.cost.llm_calls)                 # 5

asyncio.run(main())
```

Configuration knobs

| Parameter | Default | Description |
| --- | --- | --- |
| `num_samples` | `5` | Parallel completions to run. Must be >= 1. |
| `strategy` | `"majority"` | `"majority"` or `"unanimous"`. Accepts a VotingStrategy enum or string. |
| `temperature` | `0.9` | Higher values produce more diverse samples (better for voting). |
| `max_tokens` | `4096` | Per-completion token cap. |
| `max_concurrency` | `5` | Semaphore limit for parallel calls. |
| `retry` | `DEFAULT_RETRY` | Per-call retry config for transient errors. |
| `max_cost` | `None` | TokenUsage budget shared across all samples. |
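
Putting several knobs together, a call might look like this. Treat it as a sketch: the parameter names follow the table above, but the `TokenUsage` import location and constructor arguments are assumptions, not confirmed API.

```python
from executionkit import consensus
from executionkit import TokenUsage  # assumed import location

async def classify(provider, prompt: str) -> str:
    result = await consensus(
        provider,
        prompt,
        num_samples=7,
        strategy="unanimous",
        temperature=0.7,        # slightly lower diversity for a tight label set
        max_tokens=16,          # short labels don't need the 4096 default
        max_concurrency=3,      # stay under the provider's rate limit
        max_cost=TokenUsage(llm_calls=7),  # assumed constructor: cap total calls
    )
    return result.value
```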

Metadata keys

| Key | Type | Meaning |
| --- | --- | --- |
| `agreement_ratio` | float | Fraction of samples matching the winner (`top_count / num_samples`). |
| `unique_responses` | int | Number of distinct response strings observed (after whitespace normalization). |
| `tie_count` | int | Number of responses tied for the top vote count. 1 = clean win. |

Cost characteristics

  • O(num_samples) LLM calls. All calls are issued concurrently up to max_concurrency.
  • Parallelizable. Total wall-clock latency ≈ slowest sample, not the sum.
  • Budget enforcement is TOCTOU-safe. checked_complete reserves the call slot before awaiting, so concurrent samples cannot race past max_cost.llm_calls.
  • No retry amplification by default. Each sample uses the shared RetryConfig; transient failures retry the failing sample only.
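
The concurrency claim can be illustrated with a stand-in for the sampling step. This simplifies away gather_strict, retries, and budget reservation, keeping only the semaphore-capped fan-out:

```python
import asyncio
import time

async def run_limited(coros, max_concurrency: int):
    # Semaphore caps how many samples are in flight at once,
    # mirroring the max_concurrency knob.
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(coro):
        async with sem:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))

async def fake_sample(delay: float) -> str:
    await asyncio.sleep(delay)
    return "billing"

start = time.perf_counter()
results = asyncio.run(run_limited([fake_sample(0.1) for _ in range(5)], max_concurrency=5))
elapsed = time.perf_counter() - start
print(results)
print(f"{elapsed:.2f}s")  # ≈ 0.1 s: wall clock tracks the slowest sample, not the sum
```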

Errors

| Exception | Cause |
| --- | --- |
| `ValueError` | `num_samples < 1`. |
| `ConsensusFailedError` | `strategy="unanimous"` and responses differ. |
| `BudgetExhaustedError` | `max_cost` exceeded mid-sample. |
| `RateLimitError` / `ProviderError` | Bubbled up from `Provider.complete` after retry exhaustion. |
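
A sketch of when ConsensusFailedError fires and how a caller might handle it. The exception class is stubbed locally so the snippet runs standalone; the branch mirrors the unanimous/majority split in the call-flow diagram:

```python
from collections import Counter

class ConsensusFailedError(Exception):
    """Local stand-in for the library's exception."""

def decide(responses: list[str], strategy: str = "majority") -> str:
    counts = Counter(responses)
    winner, top = counts.most_common(1)[0]
    # "unanimous" demands every sample agree; any disagreement raises.
    if strategy == "unanimous" and top != len(responses):
        raise ConsensusFailedError(f"{len(counts)} distinct answers")
    return winner

print(decide(["tech", "tech", "billing"]))  # tech
try:
    decide(["tech", "tech", "billing"], strategy="unanimous")
except ConsensusFailedError as exc:
    # Typical fallback: rerun with strategy="majority" or escalate to review.
    print("unanimous failed:", exc)  # unanimous failed: 2 distinct answers
```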

Tips

  • Whitespace is normalized for voting (re.sub(r"\s+", " ", text.strip())). Two responses differing only in trailing newlines count as identical. The original (un-normalized) winning string is returned.
  • Use a higher temperature (0.7–1.0) than you would for a single call; diverse samples are what make voting effective.
  • Constrain the answer space in the prompt ("answer with exactly one of: …"). Free-form responses rarely vote cleanly.
  • Gate on agreement_ratio before trusting the answer:
if result.metadata["agreement_ratio"] < 0.6:
    # Fall back to a stronger model or human review
    ...

Source

executionkit/patterns/consensus.py