consensus() issues N independent completions of the same prompt in parallel (typically at an elevated temperature), normalizes whitespace, and votes on the responses. The winning response is returned along with an agreement_ratio score so the caller can gate on confidence.
```python
import asyncio
import os

from executionkit import Provider, consensus


async def main() -> None:
    async with Provider(
        base_url="https://api.openai.com/v1",
        api_key=os.environ["OPENAI_API_KEY"],
        model="gpt-4o-mini",
    ) as provider:
        result = await consensus(
            provider,
            "Classify this support ticket as exactly one of "
            "'billing', 'tech', or 'other':\n\n"
            "'My card was charged twice this month.'",
            num_samples=5,
            strategy="majority",  # or "unanimous"
        )
        print(result.value)                         # 'billing'
        print(result.metadata["agreement_ratio"])   # e.g. 0.8 = 4 of 5
        print(result.metadata["unique_responses"])  # 2
        print(result.metadata["tie_count"])         # 1 = no tie
        print(result.cost.llm_calls)                # 5


asyncio.run(main())
```
Whitespace is normalized for voting (`re.sub(r"\s+", " ", text.strip())`), so two responses differing only in trailing newlines count as identical. The original (un-normalized) winning string is returned.
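The normalize-and-vote step can be sketched with the standard library. This is an illustration of the technique, not executionkit's internals; `normalize` and `vote` are hypothetical names:

```python
import re
from collections import Counter


def normalize(text: str) -> str:
    # Collapse every run of whitespace so formatting-only differences vote together.
    return re.sub(r"\s+", " ", text.strip())


def vote(responses: list[str]) -> tuple[str, float]:
    """Return the winning original string and its agreement ratio."""
    counts = Counter(normalize(r) for r in responses)
    winner_norm, winner_votes = counts.most_common(1)[0]
    # Hand back the first un-normalized response matching the winning form.
    winner = next(r for r in responses if normalize(r) == winner_norm)
    return winner, winner_votes / len(responses)


samples = ["billing", "billing\n", " billing", "tech", "billing"]
label, ratio = vote(samples)
print(label, ratio)  # billing 0.8
```

Note that the ratio is computed over normalized forms, so the two whitespace variants above count toward the same bucket.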
Use higher temperature (0.7–1.0) than you would for a single call — diverse samples are what voting fixes.
Constrain the answer space in the prompt ("answer with exactly one of: …"). Free-form responses rarely vote cleanly.
Gate on agreement_ratio before trusting the answer:
```python
if result.metadata["agreement_ratio"] < 0.6:
    # Fall back to a stronger model or human review
    ...
```