# ADR-016: GitHub Models via GITHUB_TOKEN as Default E2E LLM Provider
- **Status:** Accepted
- **Date:** 2026-04-22
- **Related:** Epic 2 (Observable Execution), Epic 6 (Evaluation & Data Depth)
## Context
Every CI job that exercises the platform end-to-end — Playwright streaming 5×, nightly reliability 50×, evaluation runs in `eval-package-ci.yml`, prompt quality gates, performance benchmarks — needs a working LLM provider. Without one, the workflows either skip the interesting code path or fail at the model call.
The candidate providers we consider in CI:
| Provider | Credential shape | Cost model | Rate limit |
|---|---|---|---|
| OpenAI | `OPENAI_API_KEY` | Paid per token | Tier-based per account |
| Anthropic | `ANTHROPIC_API_KEY` | Paid per token | Tier-based per account |
| Google Gemini | `GEMINI_API_KEY` | Free tier + paid | Aggressive rate limits on free tier |
| Azure OpenAI | `AZURE_OPENAI_API_KEY_*` | Paid | Per-deployment |
| GitHub Models | `GITHUB_TOKEN` | Free for GitHub users, with daily cap | Daily quota |
| Ollama / local ONNX | none | CPU/RAM only | None |
For open-source CI — where every contributor PR must run against the same gates — we need a provider that:

1. Requires no secret management for contributors. Maintainer-owned secrets are fine; forks cannot read them.
2. Costs nothing to operate. Real LLM costs on a CI pipeline that runs 50× nightly plus on every PR would accumulate fast.
3. Covers enough model variety that the tier-routing code paths are actually exercised.
Ollama and local ONNX satisfy 1 and 2 but not 3 on GitHub-hosted runners — the runner CPU budget is insufficient for tier-2+ model latency expectations without aggressive timeouts that themselves become a flake source.
## Decision
Use GitHub Models via `GITHUB_TOKEN` as the default LLM provider for all CI-executed end-to-end paths.
- `GITHUB_TOKEN` is automatically available on GitHub-hosted runners in workflows checked out from any branch of this repo.
- The SmartModelRouter is configured to prefer `gh:*` model identifiers when a live `GITHUB_TOKEN` is present (see the sketch after this list).
- Tier-3 workflows route to `gh:gpt-4o`; tier-1 / tier-2 route to `gh:gpt-4o-mini`. Both are in the GitHub Models catalogue as of 2026-04.
- Developers who want to run the same gates locally can create a personal access token with the `read:org` scope and export it as `GITHUB_TOKEN`.
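A minimal sketch of that preference logic follows. The names `GH_MODEL_BY_TIER` and `select_model` are hypothetical, not the real SmartModelRouter API; the actual implementation lives in `agentic_v2/models/smart_router.py`.

```python
import os

# Hypothetical sketch of the gh:* preference described above. The names
# here are illustrative, not the real SmartModelRouter API.
GH_MODEL_BY_TIER = {
    1: "gh:gpt-4o-mini",  # tier-1 and tier-2 share the smaller model
    2: "gh:gpt-4o-mini",
    3: "gh:gpt-4o",       # tier-3 routes to the larger model
}

def select_model(tier: int, configured_fallback: str) -> str:
    """Prefer a GitHub Models (gh:*) identifier when a live token exists."""
    if os.environ.get("GITHUB_TOKEN"):
        return GH_MODEL_BY_TIER.get(tier, "gh:gpt-4o-mini")
    # No token (e.g. fork CI): use whatever provider is configured,
    # which in practice means the deterministic tier-0 path.
    return configured_fallback
```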
Forks that do not receive `GITHUB_TOKEN` fall back to deterministic (tier-0) workflows plus a skip-set for the LLM-dependent jobs. PRs from forks that touch LLM-dependent code cannot fully validate in fork CI — maintainers re-run the job against the base branch after review.
## Consequences
### Positive
- Zero out-of-pocket LLM cost for the repo maintainer.
- No contributor needs an OpenAI or Anthropic key to land a PR.
- GitHub Models covers `gpt-4o`, `gpt-4o-mini`, and Phi / Llama variants — enough model surface that router fallback paths are genuinely tested.
- `GITHUB_TOKEN` is scoped to the workflow run — no long-lived credential to rotate.
### Negative
- Provider outages break CI. GitHub Models rate-limit events, maintenance windows, and model retirements cascade into "CI is red" with no code change. This has happened; see `KNOWN_LIMITATIONS.md` §3.1.
- Daily quota ceiling. Running the platform end-to-end ~100× per day (every PR push + the nightly 50×) approaches the daily cap. We rely on the cap being generous; a tightening would force us to revisit.
- Fork CI is asymmetric. A contributor's PR in their fork runs a subset of gates; maintainers re-run against the base branch. This is a workflow cost, not a technical blocker.
- Vendor coupling. The platform officially supports 8+ providers; CI validates only one. Regressions in Anthropic or Gemini backends are caught only by local dev testing or by a paying customer.
- Not suitable for cleared federal deploys. The real deploy target uses Azure OpenAI or on-prem providers; the CI choice is a pragmatic open-source compromise, not a statement of intended production posture.
### Mitigations in place
- Provider-specific unit tests run with mocked backends, so a Gemini-specific bug can be caught and validated without GitHub Models being available.
- Tier-0 (`test_deterministic`) is the only guaranteed-green workflow — used for smoke tests where LLM availability cannot be assumed.
- The router's circuit-breaker pattern means a GitHub Models outage degrades gracefully rather than crashing — though CI still fails because the gate expects a successful run. A sketch of the breaker behaviour follows this list.
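The sketch below illustrates the general circuit-breaker pattern, assuming hypothetical names and thresholds (`CircuitBreaker`, `failure_threshold`, `reset_after`); the router's actual implementation may differ.

```python
import time

class CircuitBreaker:
    """Hypothetical sketch of the breaker pattern around a provider backend.

    Names and thresholds are illustrative, not the router's real API.
    """

    def __init__(self, failure_threshold: int = 3, reset_after: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # set when the breaker trips

    def allow(self) -> bool:
        """Return False while the breaker is open and cooling down."""
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return False
            # Half-open: let one probe call through to test the provider.
            self.opened_at = None
            self.failures = 0
        return True

    def record(self, success: bool) -> None:
        """Track call outcomes; trip the breaker after repeated failures."""
        if success:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

A caller checks `allow()` before each provider call and reports the outcome via `record()`. When GitHub Models is down, the router can route around the dead backend instead of crashing, which is the degrade-gracefully behaviour described above.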
## Alternatives considered
- Paid OpenAI + maintainer secret. Rejected because cost is non-trivial at nightly 50× cadence and forks still cannot access the secret.
- Mocked LLM with recorded responses. Considered for the deterministic fixture work (see `test(golden): add deterministic golden-output regression test`, commit `a6ffeba`). Used for specific regression tests but not extensible to the full streaming surface.
- Self-hosted Ollama runner. Viable but introduces runner-fleet maintenance burden.
- Azure OpenAI via maintainer secret. Works for maintainer runs; does not solve the fork asymmetry.
## Review cadence
Re-evaluate when:
- GitHub Models tightens the daily quota in a way that breaks nightly cadence.
- GitHub Models is deprecated or materially re-priced.
- A separate hermetic-mock provider is built (Sprint B candidate — see ROADMAP).
- The platform gets a paying production deployment that is willing to fund a proper provider for CI.
## Implementation references
- Token provisioning: `.github/workflows/*.yml` — every job that references `GITHUB_TOKEN` in `env:`.
- Router configuration: `agentic-workflows-v2/agentic_v2/models/smart_router.py` and `backends.py` (the `gh:*` provider prefix).
- Fork-skip logic: job-level `if: secrets.GITHUB_TOKEN != ''` guards in relevant workflows. A combined sketch follows this list.
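For orientation, here is a hypothetical workflow excerpt combining the provisioning and guard patterns. The job name, step names, and `make` target are invented; the guard is shown at step level via the `env` context, one portable way to express the same skip. The workflow files themselves are authoritative.

```yaml
# Hypothetical excerpt, not copied from the repo's workflows.
jobs:
  e2e-streaming:
    runs-on: ubuntu-latest
    env:
      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}  # token provisioning
    steps:
      - uses: actions/checkout@v4
      - name: Run LLM-dependent gate
        # Skip when no usable token is present (e.g. fork CI).
        if: ${{ env.GITHUB_TOKEN != '' }}
        run: make e2e-streaming  # placeholder for the real gate command
```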