Iterative Refinement¶

refine_loop() generates an initial response, scores it, then asks the model to improve it. The loop stops when the score crosses target_score, score deltas stall beyond patience, or max_iterations is hit. The best-scoring response is returned.

When to use / when not to use¶

Use it when…	Avoid it when…
You can write or pick an evaluator that scores quality on `[0.0, 1.0]`.	Latency matters more than quality.
Quality matters more than cost (writing, code review, summaries).	Each iteration is unlikely to actually improve the answer (e.g. multiple-choice).
You can bound iterations (e.g. `max_iterations=4`).	The task is purely factual — use Consensus instead.
You want a quality gate (`target_score`) on output.	The default LLM-as-judge evaluator is unsafe for adversarial input — write a custom evaluator.

Call flow¶

sequenceDiagram
    participant App
    participant refine
    participant Provider
    participant Evaluator
    App->>refine: refine_loop(provider, prompt, target_score=0.9)
    refine->>Provider: complete(prompt)
    Provider-->>refine: initial response
    refine->>Evaluator: score(initial)
    Evaluator-->>refine: 0.72
    loop until converged or max_iterations
        refine->>Provider: complete(prompt + best + "improve")
        Provider-->>refine: refined response
        refine->>Evaluator: score(refined)
        Evaluator-->>refine: 0.86
        refine->>refine: ConvergenceDetector.should_stop(score)
    end
    refine-->>App: PatternResult(value=best, score=best_score, ...)

Minimal example¶

import asyncio
import os
from executionkit import Provider, refine_loop

async def main() -> None:
    async with Provider(
        base_url="https://api.openai.com/v1",
        api_key=os.environ["OPENAI_API_KEY"],
        model="gpt-4o-mini",
    ) as provider:
        result = await refine_loop(
            provider,
            "Write a one-paragraph explanation of the Turing test "
            "for a non-technical reader.",
            target_score=0.85,
            max_iterations=4,
            patience=2,
        )

        print(result.value)                            # best response
        print(result.score)                            # e.g. 0.91
        print(result.metadata["iterations"])           # e.g. 2
        print(result.metadata["converged"])            # True
        print(result.metadata["score_history"])        # [0.72, 0.86, 0.91]

asyncio.run(main())

Custom evaluator¶

For production, supply your own evaluator. The default uses an LLM-as-judge prompt with XML-delimiter sandboxing — fine for development, but you should write a deterministic or domain-specific scorer when input may contain adversarial content.

async def length_evaluator(text: str, _: object) -> float:
    """Score 1.0 for 80–200 word answers, lower for outliers."""
    n = len(text.split())
    if 80 <= n <= 200:
        return 1.0
    if n < 80:
        return n / 80
    return max(0.0, 1.0 - (n - 200) / 200)

result = await refine_loop(
    provider,
    "Summarize the Turing test in 80–200 words.",
    evaluator=length_evaluator,
    target_score=0.95,
)

Configuration knobs¶

Parameter	Default	Description
`evaluator`	`None`	`async (text, provider) -> float in [0,1]`. `None` uses the default LLM-as-judge.
`max_eval_chars`	`32_768`	Truncation limit for text passed to the default evaluator.
`target_score`	`0.9`	Convergence target. The loop stops when score >= this.
`max_iterations`	`5`	Maximum refinement iterations after the initial generation.
`patience`	`3`	Stale-delta iterations before declaring convergence.
`delta_threshold`	`0.01`	Minimum meaningful score improvement.
`temperature`	`0.7`	Sampling temperature for generation calls (evaluator uses `0.1`).
`max_tokens`	`4096`	Per-completion token cap.
`max_cost`	`None`	Optional `TokenUsage` budget across all calls.
`retry`	`DEFAULT_RETRY`	Per-call retry config.

Metadata keys¶

Key	Type	Meaning
`iterations`	`int`	Refinement iterations performed (`0` = converged on first attempt).
`converged`	`bool`	`True` if the loop converged before `max_iterations`.
`score_history`	`list[float]`	Score at each iteration, including the initial generation.

Cost characteristics¶

Up to 2 × (1 + max_iterations) LLM calls when using the default evaluator (one generation + one evaluation per round). A custom evaluator that doesn't call the LLM cuts this in half.
Sequential. Each iteration depends on the previous response — no parallelism.
Best-result tracking. The returned value is always the highest-scoring response seen, even if a later iteration regressed.
max_cost is checked before every call and raises BudgetExhaustedError immediately on overrun.

Errors¶

Exception	Cause
`ValueError`	Default evaluator parsed a score outside `[0, 10]`.
`BudgetExhaustedError`	`max_cost` exceeded between iterations.
`RateLimitError` / `ProviderError`	Bubbled from `Provider.complete` after retry exhaustion.

Security note¶

The default evaluator wraps the text being scored in <response_to_rate> XML delimiters and instructs the LLM to ignore any instructions inside them. This mitigates prompt injection attacks where adversarial content in a generated response would otherwise override the scoring instruction. Text is also truncated to max_eval_chars (default 32 768) before being sent to the evaluator. Even with these defenses, LLM-as-judge is not safe against motivated attackers — write a custom evaluator for production workloads with untrusted input.

Source¶

executionkit/patterns/refine_loop.py