refine_loop() generates an initial response, scores it, then asks the model to improve it. The loop stops when the score crosses target_score, score deltas stall beyond patience, or max_iterations is hit. The best-scoring response is returned.
importasyncioimportosfromexecutionkitimportProvider,refine_loopasyncdefmain()->None:asyncwithProvider(base_url="https://api.openai.com/v1",api_key=os.environ["OPENAI_API_KEY"],model="gpt-4o-mini",)asprovider:result=awaitrefine_loop(provider,"Write a one-paragraph explanation of the Turing test ""for a non-technical reader.",target_score=0.85,max_iterations=4,patience=2,)print(result.value)# best responseprint(result.score)# e.g. 0.91print(result.metadata["iterations"])# e.g. 2print(result.metadata["converged"])# Trueprint(result.metadata["score_history"])# [0.72, 0.86, 0.91]asyncio.run(main())
For production, supply your own evaluator. The default uses an LLM-as-judge prompt with XML-delimiter sandboxing — fine for development, but you should write a deterministic or domain-specific scorer when input may contain adversarial content.
asyncdeflength_evaluator(text:str,_:object)->float:"""Score 1.0 for 80–200 word answers, lower for outliers."""n=len(text.split())if80<=n<=200:return1.0ifn<80:returnn/80returnmax(0.0,1.0-(n-200)/200)result=awaitrefine_loop(provider,"Summarize the Turing test in 80–200 words.",evaluator=length_evaluator,target_score=0.95,)
Up to 2 × (1 + max_iterations) LLM calls when using the default evaluator (one generation + one evaluation per round). A custom evaluator that doesn't call the LLM cuts this in half.
Sequential. Each iteration depends on the previous response — no parallelism.
Best-result tracking. The returned value is always the highest-scoring response seen, even if a later iteration regressed.
max_cost is checked before every call and raises BudgetExhaustedError immediately on overrun.
The default evaluator wraps the text being scored in <response_to_rate> XML delimiters and instructs the LLM to ignore any instructions inside them. This mitigates prompt injection attacks where adversarial content in a generated response would otherwise override the scoring instruction. Text is also truncated to max_eval_chars (default 32 768) before being sent to the evaluator. Even with these defenses, LLM-as-judge is not safe against motivated attackers — write a custom evaluator for production workloads with untrusted input.