ADR-017: Dataset Identifiers as Query Parameters, Not Path Segments¶
Status: Accepted Date: 2026-04-22 Related: Epic 6 (Evaluation & Data Depth), Sprint B #4 (Dataset API HF Integration)
Context¶
The evaluation surface exposes three dataset-aware endpoints on the FastAPI server:
GET /api/eval/datasets— list available datasets.GET /api/eval/datasets/sample-list?dataset_source=…&dataset_id=…&offset=…&limit=…— paginate summaries.GET /api/eval/datasets/sample-detail?dataset_source=…&dataset_id=…&sample_index=…— fetch full sample.
All three pass the dataset identifier as a query parameter (?dataset_id=humaneval)
rather than a path segment (/api/eval/datasets/humaneval/sample-list).
During Sprint B #4 the orchestrator flagged this as a candidate bug: the
query-param design looks less RESTful than /datasets/{id}/samples, and
the current registry uses short slug-like IDs (humaneval, swe-bench-verified)
so nothing today forces the query-param shape. Real-usage exercise against
HuggingFace confirmed the design is correct — this ADR ratifies the decision
after review so the question does not re-open in a future sprint.
Decision¶
Dataset identity travels on the query string, not the path, for every evaluation endpoint that references a dataset.
dataset_source("repository" | "local") anddataset_idare both query parameters.- The client in
ui/src/api/client.tsconstructs these URLs withURLSearchParams, which produces correctly percent-encoded pairs for any allowed identifier shape. - Path segments are reserved for resource kind (
/eval/datasets/sample-list), not resource identity.
Rationale¶
The identifiers we may need to accept¶
The current registry uses short aliases. Future registries we have already scoped but not yet shipped include:
| Source | Example ID | Slashes? |
|---|---|---|
| Current registry | humaneval |
no |
| Current registry | swe-bench-verified |
no |
| Raw HuggingFace repo | codeparrot/apps |
yes |
| Raw HuggingFace repo | openai/openai_humaneval |
yes |
| GitHub gold-standard | princeton-nlp/SWE-bench_Verified |
yes |
As soon as the registry accepts any Tier-B source that reuses an upstream identifier verbatim (which is the ergonomic default — rewriting every HF ID into a slug is friction we do not want), identifiers contain forward slashes.
Why path segments break under slashes¶
- Reverse proxies and WAFs normalize paths inconsistently. Azure Front
Door, CloudFront, and nginx each handle
%2Fin path segments differently. Some unescape, some reject, some rewrite. Query strings have no such ambiguity — the?boundary is well-defined and%2Finside the query is transparent to every proxy we have surveyed. - Framework routers split on raw slashes first. FastAPI's router
resolves
/eval/datasets/codeparrot/apps/sample-listas a four-segment path, not a dataset-id-with-slash. Working around it requires regex routes ({dataset_id:path}) that defeat automatic OpenAPI parameter typing and leak into every downstream client generator. - OpenAPI clients mis-encode the segment. Generated clients for TS,
Python, and Go all encode path params through
encodeURIComponentby default — which escapes/to%2Fand re-triggers the proxy normalization problem in step 1.
Query params sidestep all three issues with no tradeoff today (the current registry does not exercise the slash case) and no migration cost later (endpoints already accept the shape).
Consequences¶
Positive¶
- Slash-safe forever. Whenever the registry grows to accept raw HuggingFace or GitHub identifiers, the endpoints Just Work — no URL rewrites, no proxy-config audits, no client regeneration.
- Percent-encoding is the client's problem, and the client already handles it.
URLSearchParamsin the TypeScript client correctly encodes/,:,#, and Unicode. Custom path-segment escaping would duplicate that logic. - Endpoint surface stays uniform. Every dataset-aware endpoint follows
the same
?dataset_source=…&dataset_id=…shape regardless of identifier content.
Negative¶
- Slightly less idiomatic-REST on the surface. Strict REST fans prefer
/datasets/{id}/samples. We accept this aesthetic cost — safety beats style on this axis. - Deep-linking relies on query strings. Shared links embed the query params verbatim. Mitigated: the UI already preserves query params across navigation.
Neutral¶
- Caching is unaffected. Most caches (CloudFront, Cloudflare, nginx) key on path + query by default; this decision does not change cache behavior.
Alternatives Considered¶
Alt 1: Path segments with regex-path-capture ({dataset_id:path})¶
Rejected. Loses OpenAPI parameter typing, forces handler code to re-parse the tail, and still hits the proxy-normalization problem in step 1 above.
Alt 2: Base64url-encoded dataset IDs in path segments¶
Rejected. Adds ceremony on both sides (encode client, decode server), breaks the human-readability of URLs in logs and bookmarks, and only sidesteps the slash problem at the cost of turning every other debugging affordance off.
Alt 3: Hybrid — path for slug IDs, query for raw HF IDs¶
Rejected. Two endpoint shapes for the same operation doubles the client surface, the documentation burden, and the test matrix, for no upside over "always query."
Lineage¶
Builds on commit c990538 (original /api/eval/datasets endpoints landing).
This ADR ratifies the existing design after Sprint B #4 review confirmed it
is correct and forward-compatible. No code changes land with this ADR — it is
a decision record only.
Implementation Notes¶
- FastAPI route declarations:
agentic_v2/server/routes/evaluation_routes.py(list_dataset_samples,get_dataset_sample_detail). - Client:
agentic-workflows-v2/ui/src/api/client.ts— usesURLSearchParamsfor dataset-endpoint URL construction. - Integration test:
tests/server/test_evaluation_routes.pyexercises the query-param shape end-to-end.