feat(eval): per-agent per-version eval scorecard + release gate #1873
itomek wants to merge 20 commits into
Tests encode the full acceptance criteria for gaia.eval.release_scorecard and gaia.eval.scorecard_gate before any implementation exists. Includes the email benchmark fixture used by the adapter tests.
…s 1-3)
Core harness-agnostic scorecard generator and standalone release gate.
- ResultPayload dataclass, compute_aggregate (guard empty/zero-weight)
- render_scorecard + parse_scorecard (safe_load on first ---...--- slice)
- validate_scorecard + REQUIRED_FIELDS; anchored semver path guard
- latest_version_below (stdlib int-tuple, skips non-semver filenames)
- carry_forward (patch-only, sets inherited_from, raises on minor/major)
- scorecard_gate.main(argv)->int with --version/--manifest/--allow-regression
- 38/44 tests pass; 4 adapter tests pending gen_scorecard.py (incr 4)
- 1 CI test pending workflow update (incr 6)
- 1 loose-coupling test false-positive: pytest_benchmark matches 'benchmark'
…est (increments 4)
- gen_scorecard.py: reads benchmark scorecard.json (or any scenarios JSON) + ground_truth.json -> ResultPayload -> writes scorecards/<version>.md
- Judged = quality.category_accuracy is finite float in [0,1]; zero judged raises
- test_cases_run = sum(total_emails over judged); dataset_size excl _meta
- Path derivation mirrors stamp_version.py (parents[...] from __file__)
- Fix loose-coupling test: subprocess instead of sys.modules (avoids pytest_benchmark FP) (orchestrator-authorized replacement)
- 43/44 tests pass; 1 remaining = CI workflow test (incr 6)
…nts 5-6)
- docs/reference/eval-scorecard.mdx: schema, storage, formula, versioning policy, gate
- docs/docs.json: nav entry in Evaluation Framework group
- hub/agents/python/hello-world/scorecards/0.1.0.md: generator-produced generalization proof
- hub/agents/npm/agent-email/scorecards/.gitkeep: placeholder for real scorecard
- hub/agents/npm/agent-email/README.md: eval scorecard link to ./scorecards/0.2.4.md
- hub/agents/npm/agent-email/package.json: add scorecards/ to files array
- .github/workflows/release_agent_email.yml: scorecard-gate job + publish.needs update
- lint fixes: remove unused imports from test files; black/isort/pylint/flake8 clean
- 44/44 target tests pass; lint: ALL QUALITY CHECKS PASSED
The AC requires 'missing ANY required field ⇒ invalid', but the validator
only checked 5 top-level keys. Add nested checks for agent.{name,version},
recipe.dataset.{reference,size}, recipe.{methodology,config},
results.{test_cases_run,metrics}, aggregate.{name,formula,value}, with
non-dict-parent guards and a non-empty metrics-list requirement. Add
TestSchemaValidator cases for missing nested fields, empty metrics, and
non-dict sections. Also baseline sys.modules before import in the
loose-coupling test so editable-install path finders don't false-positive.
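The nested checks described in this commit message can be sketched roughly as follows. This is an illustration only, not the code in release_scorecard.py: the `REQUIRED_NESTED` table, `DATASET_FIELDS`, and the `missing_fields` helper are hypothetical names, and only the field paths come from the message above. Top-level keys outside these four sections are not covered here.

```python
from typing import Any, Dict, List, Tuple

# Field paths come from the commit message; the table layout is an assumption.
REQUIRED_NESTED: Dict[str, Tuple[str, ...]] = {
    "agent": ("name", "version"),
    "recipe": ("methodology", "config", "dataset"),
    "results": ("test_cases_run", "metrics"),
    "aggregate": ("name", "formula", "value"),
}
DATASET_FIELDS = ("reference", "size")


def missing_fields(card: Any) -> List[str]:
    """Return dotted paths of required fields that are absent or malformed."""
    if not isinstance(card, dict):
        return ["<root>"]
    problems: List[str] = []
    for section, fields in REQUIRED_NESTED.items():
        parent = card.get(section)
        if not isinstance(parent, dict):  # non-dict-parent guard
            problems.append(section)
            continue
        problems += [f"{section}.{f}" for f in fields if f not in parent]

    # recipe.dataset is one level deeper, so it gets its own guard.
    recipe = card.get("recipe")
    if isinstance(recipe, dict) and "dataset" in recipe:
        dataset = recipe["dataset"]
        if isinstance(dataset, dict):
            problems += [
                f"recipe.dataset.{f}" for f in DATASET_FIELDS if f not in dataset
            ]
        else:
            problems.append("recipe.dataset")

    # The metrics list must exist and be non-empty.
    results = card.get("results")
    if isinstance(results, dict) and "metrics" in results:
        metrics = results["metrics"]
        if not isinstance(metrics, list) or not metrics:
            problems.append("results.metrics")
    return problems
```

A card is treated as valid when `missing_fields(card)` returns an empty list; returning the dotted paths rather than a bare boolean makes a failing gate easier to diagnose.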
The benchmark scorecard.json has no top-level model/limit, so config.limit was always null — defeating the comparability note in eval-scorecard.mdx. Add a --limit CLI arg threaded into config.limit, and derive config.model from the run's scenarios[0].category (the model id in benchmark output), falling back to gaia-agent.yaml models[0]. Drop the dead list-comprehension in the final print.
Match the rest of release_agent_email.yml, which already uses actions/setup-python@v6.
Generate the email-triage agent's v0.2.4 release scorecard from an actual `gaia eval benchmark` run (Gemma-4-E4B, 25 of 220 corpus emails) on AMD Strix Halo hardware: category_accuracy 0.04 -> aggregate 4.0/100. The low value is a taxonomy mismatch (the agent's triage labels and the ground-truth priority labels overlap only on 'urgent'), not triage quality -- tracked in #1266 and recorded in the scorecard's own methodology. Adapter hardening: store a repo-relative ground_truth path (no absolute-path leak in the published artifact), record the eval limit for comparability, and carry the taxonomy caveat. README surfaces the aggregate with the caveat and a relative link; docs example aligned to the 4-category label set.
… taxonomy ref
- Add .github/workflows/email_scorecard_refresh.yml: on agent/corpus changes the self-hosted AMD runner re-runs the eval, regenerates the scorecard, commits it when the score holds/improves, and FAILS on a regression (same-version vs the committed card + cross-version via scorecard_gate). Hosted-CI backstop stays the release-time scorecard-gate job.
- Add .claude/skills/adding-eval-scorecard: a phased skill so adopting a scorecard is invocable, not a prose walkthrough; referenced from eval-scorecard.mdx.
- Document the update/reject loop in eval-scorecard.mdx.
- Correct the scorecard's taxonomy reference from the closed #1266 (old 4-way) to #1874 (corpus labels stale vs schema-2.0 5-bucket taxonomy); regenerate the card.
Adds `eval_scorecard_url` and `eval_score` fields end-to-end through the worker catalog pipeline so the Agent Hub listing can show a benchmark aggregate and link to the full scorecard. Worker: `evalScorecardKey()` storage helper, optional `eval_scorecard` multipart part in POST /publish (stored as `eval-scorecard.md` per version), YAML front-matter parse of `aggregate.value` in `toIndexEntry`, and both fields carried through `rebuildIndex`. Missing/unparseable scorecard yields undefined fields, never throws. Publish: `--eval-scorecard <path>` flag in `publish_to_r2.py`; the GHA release workflow conditionally passes the versioned scorecard file when it exists under `hub/agents/npm/agent-email/scorecards/<version>.md`. Python catalog: `merge_with_registry` threads the two new optional fields from the R2 index entry into the unified catalog dict so the UI backend serves them alongside existing agent metadata. Tests: two focused tests in routes.test.ts cover the present/absent scorecard cases (69 tests total, all pass).
Adds `eval_score` and `eval_scorecard_url` optional fields to `AgentInfo` in the frontend type definitions. When an agent has an eval score, the detail modal renders an "Eval scorecard" section showing the numeric score out of 100, with a "View scorecard" link when the URL is present. Renders nothing when neither field is set (no empty section).
Verdict: Approve with suggestions
This adds a per-agent / per-version release eval scorecard (a versioned Markdown+YAML artifact, a standalone presence+regression release gate, a self-hosted refresh loop, and hub/Agent-UI surfacing), with the email agent as the first real adopter. It's a well-built, well-tested feature: harness-agnostic core, strong TDD coverage, fail-loudly on zero judged scenarios, and an unusually honest write-up of the 4.0 labeling artifact. Nothing here blocks the merge. The one thing worth resolving before any non-email agent publishes: the committed hello-world scorecard uses hand-authored numbers (
No security concerns.
🔍 Technical details
🟡 Important
Hand-authored hello-world scorecard would surface as a real hub score (
🟢 Minor
Deprecated (
Several call sites
Strengths
After #1875 relabeled the eval corpus to the schema-2.0 triage taxonomy, the email agent's predictions and the ground-truth labels share one vocabulary, so category_accuracy now measures real agreement: 0.40 over 25 of 220 emails -> aggregate 40.0/100 (was 4.0, a labeling artifact). Fresh gaia eval benchmark run on AMD Strix Halo. Drop the now-resolved #1874 caveat from the adapter methodology + README; align the dataset description to the schema-2.0 taxonomy.
@@ -0,0 +1,68 @@
---
Use README.md, SKILL.md, SPEC.md as examples. A single SCORECARD.md that gets updated and packaged up for release is what we need, not a separate named version file.
  "SKILL.md",
- "LICENSE"
+ "LICENSE",
+ "scorecards/"
We don't want all scorecards across all versions shipped with every single release.
@@ -0,0 +1,318 @@
#!/usr/bin/env python3
How do we actually run eval to reproduce the results?
@@ -0,0 +1,68 @@
---
schema_version: 1
Scorecard should include reproduction steps or a reference to a doc that contains those instructions.
ground_truth: tests/fixtures/email/ground_truth.json
limit: 25
results:
test_cases_run: 25
40% is a pretty bad score; do we know why?
@claude review this PR in detail. Make sure the eval scorecard clearly outlines the steps to reproduce the results and that the scorecard provides high confidence of the agent being production-ready. Identify any gaps or issues.
Can you include a screenshot of the rendered scorecard on the email agent hub site?
The scorecard system is well-built and well-tested, but measured against your two bars ("clearly reproducible" and "high confidence of production-ready") it's only partway there. The honest answer to @kovtcharov-amd's earlier 40% question is the crux: this scorecard demonstrates the email agent is not production-ready yet, and it's correct to surface that rather than hide it.
On the 40%. After the #1875 corpus relabel, 40.0 is genuine agreement between the agent's triage labels and the ground truth (the earlier 4.0 was a vocabulary mismatch, now fixed). On a 5-bucket taxonomy, random chance is ~20%, so the Gemma-4-E4B agent is ~2× chance but agrees with the ground truth on only ~2 of every 5 emails, well below any production bar. The card is doing its job by reporting this number truthfully.
Reproducibility: partial. The reproduce command exists, but in the docs, not in the card; and the card pins only
Confidence: statistically weak as recorded. Only 25 of 220 emails (11%) were scored. A 40% point estimate on n=25 carries roughly a ±19-point 95% confidence interval, so you genuinely can't distinguish 40% from ~25% or ~60%. A production-confidence claim needs the full 220 run (or the confidence interval reported on the card).
Biggest diagnostic gap (also answers "why 40%"): the benchmark already computes per-category accuracy, but the card averages it into one number. For triage, missing an
Plus the previously-flagged hand-authored hello-world
Net: merge the system on its merits, but I wouldn't read this card as a production-readiness sign-off; it's an honest "needs work" baseline. Concrete asks below.
🔍 Technical details
Reproducibility
Sample size / CI
Per-category breakdown
hello-world fabricated score (re-affirming the earlier review)
Minor
@kovtcharov-amd, flagging for your call: the scorecard framework is solid and worth landing, but I'd treat the 40.0 as a baseline-to-improve, not a green light. The separate "rendered hub screenshot" request can't be produced in-PR: the hub worker goes live only after a manual
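For reference, the "roughly ±19-point" figure in the review above can be reproduced with a normal-approximation (Wald) interval. The review does not say which method it used, so treat this as one plausible reading; `wald_interval` is a hypothetical helper, not part of the eval code.

```python
import math
from typing import Tuple


def wald_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation (Wald) interval for a proportion, clamped to [0, 1]."""
    if n <= 0:
        raise ValueError("n must be positive")
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# 40% accuracy measured on 25 emails, as recorded on the card at the time.
low, high = wald_interval(0.40, 25)
print(f"n=25: {low:.3f} .. {high:.3f}")  # n=25: 0.208 .. 0.592
```

If the same 40% were observed on all 220 emails, the same formula would give a half-width of about 6.5 points, which is why the sample size matters more than the point estimate here. The Wald interval is known to be rough at small n; a Wilson interval would be a reasonable alternative.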
How to render the scorecard as a tab on the hub page
Right now the scorecard surfaces as a score badge + external link in the detail modal, not as a doc tab. The reason: the hub page builds its doc tabs from the markdown body fields the worker puts in
Making it a tab is the same shape as the
🔍 Concrete steps
1. Worker: persist the scorecard markdown into
That's the whole worker change
2. Hub frontend: add the tab. The public hub page (the consumer of
3. (Optional) Electron
Net: one field carried through the worker + one tab in the hub frontend, both following the
…production section
Storage convention changes from scorecards/<version>.md to a single SCORECARD.md updated in place (versioned via publish snapshot, same as README.md).
- release_scorecard.py: add reproduction_command to ResultPayload; render_scorecard emits a Reproduction section; carry_forward reads version from front matter instead of filename stem; remove latest_version_below (per-version dirs gone); fix utcnow -> now(utc)
- scorecard_gate.py: redesigned to accept --scorecard SCORECARD.md + optional --baseline-file / --baseline-ref (mutually exclusive); no --scorecards-dir or --version flags; --baseline-ref resolves via git show; absence at ref = first adoption pass; git-shellout-free when --baseline-file is used
- gen_scorecard.py: writes hub/agents/npm/agent-email/SCORECARD.md (not scorecards/<ver>.md); supplies reproduction_command with exact env vars and commands
- tests: updated for new carry_forward signature, new gate interface, reproduction section assertions, second-agent generalization test, utcnow -> now(utc)
… agent
- hub/agents/npm/agent-email/SCORECARD.md: generated from relabeled-corpus run (placeholder; orchestrator will regenerate from full run)
- hub/agents/npm/agent-email/package.json: files array includes SCORECARD.md, removes scorecards/ (don't ship all versions in the npm tarball)
- hub/agents/npm/agent-email/README.md: scorecard link updated to ./SCORECARD.md
- Delete hub/agents/npm/agent-email/scorecards/ (per-version dir, now obsolete)
- Delete hub/agents/python/hello-world/scorecards/ (contained fabricated 90.0 score)
…RD.md
- storage.ts: evalScorecardKey now returns SCORECARD.md (was eval-scorecard.md)
- publish.ts: update comment for SCORECARD.md
- routes.test.ts: expect eval_scorecard_url to end in /SCORECARD.md
- publish_to_r2.py: update --eval-scorecard help text to reference SCORECARD.md
- release_agent_email.yml: scorecard-gate uses new --scorecard / --baseline-ref interface; computes prev tag via git describe; publish step points at SCORECARD.md
- email_scorecard_refresh.yml: use SCORECARD.md env var throughout; same-version check and cross-version gate use new gate interface with --baseline-ref
…onvention
- eval-scorecard.mdx: storage convention is now a single SCORECARD.md (not scorecards/<ver>.md); gate uses --scorecard + --baseline-ref/--baseline-file; carry_forward reads version from front matter; Reproduction section documented; npm files include SCORECARD.md only (not scorecards/ dir)
- SKILL.md: doc-root/SCORECARD.md as single file; reproduction_command in adapter; gate CLI updated to --scorecard / --baseline-ref pattern; Phase 4 examples updated
Move the eval into the release pipeline; don't run it per-PR
The eval gating is wired backwards. The real eval lives in
It's also never actually executed: both pushes on this branch queued with 0 jobs, waiting on the absent
The eval should run once, at tagged release, as a publish blocker, and meet-or-beat the previously published score:
Net: eval runs at release only, the published number is tied to the shipped bytes, "no worse than last release" becomes a hard publish gate, and scorecards stop being mid-PR branch mutations.
🔍 Two correctness issues spotted while reviewing the eval code
- subprocess.run: add check=False (W1510)
- Remove bare f-strings with no interpolated vars (W1309)
- black reformatted test_scorecard_gate.py
…reproduction
Regenerate the email v0.2.4 SCORECARD.md from a full-corpus gaia eval benchmark run on AMD Strix Halo: category_accuracy 0.46 over 100 of 220 emails (the triage tool processes up to 100 per call) -> aggregate 46.0/100. Errors are dominated by the inherently-ambiguous fyi<->needs_response boundary; the model over-assigns NEEDS_RESPONSE.
Fix the adapter's reproduction command to be portable (generic /tmp/email-eval output dir, full model/mbox/ground-truth/output-dir flags), with no local absolute path in the published artifact. README reflects 46.0.
Thanks for the review. Addressed all of it; pushed to the branch.
Single
Reproduction (your "how do we actually run eval?"): every scorecard now has a
"Why only 25?" → now the full corpus. Re-ran over the whole corpus on AMD Strix Halo. The triage tool processes up to 100 emails per call, so the effective run is 100 of 220 labeled examples, recorded distinctly (
"40%, do we know why?" Yes. Full-corpus result is
[bot review] Fixed both: removed the committed hello-world card (its hand-authored
Gate, after the redesign
The gate now compares a single
One thing to confirm: the cross-version baseline resolves the previous release via
Closes #1862
Why this matters
Every GAIA hub agent ships a README and changelog, but nothing told a prospective user how well the agent actually performs — no accuracy, no test counts, no dataset size. An agent looked like a proof-of-concept, and a quality regression could ship silently.
This adds a standardized per-agent eval scorecard: a single `SCORECARD.md` per agent (updated in place per release and packaged like `README.md`/`SPEC.md`/`SKILL.md`) recording the eval recipe, measured results (accuracy, test-cases-run, dataset size), a deterministic aggregate score a reader can recompute by hand, and reproduction steps. A standalone release gate blocks packaging on a missing or regressed scorecard, a self-hosted refresh loop keeps it current (update-if-better / reject-if-worse), and the score is surfaced on the hub listing + Agent UI. The email agent is the first end-to-end adopter; the format is agent-agnostic (proven by a generalization test).
What changed
- `src/gaia/eval/`: `release_scorecard.py` (payload → generator → validator → carry-forward; stdlib + PyYAML only) and `scorecard_gate.py` (standalone gate). Distinct from the per-run `scorecard.py`.
- `SCORECARD.md`: one file per agent, versioned via the publish snapshot (R2 stores `agents/<id>/<version>/SCORECARD.md`). `package.json` ships only the current file.
- `hub/agents/npm/agent-email/SCORECARD.md`, linked + surfaced from the canonical npm README; gate wired into `release_agent_email.yml` as a `publish`-blocking job.
- `email_scorecard_refresh.yml`: self-hosted AMD runner re-runs the eval on agent/corpus changes → commits the refreshed card if the score holds/improves, fails on a regression.
- `eval_score` + `eval_scorecard_url` (parsed from the published `SCORECARD.md`); publish uploads it to R2, and the Agent UI detail modal renders score + link.
- `## Reproduction` section; `.claude/skills/adding-eval-scorecard/` makes adoption invocable; `docs/reference/eval-scorecard.mdx` documents schema, formula, versioning, the gate, and the refresh loop.
The number, and why
Real `gaia eval benchmark` run on AMD Strix Halo (Gemma-4-E4B), full corpus: `category_accuracy` 0.46 → aggregate 46.0/100 over 100 of 220 labeled emails (the triage tool processes up to 100 per call; recorded distinctly as `test_cases_run: 100` / `dataset.size: 220`).
Errors are dominated by the inherently-ambiguous fyi ↔ needs_response boundary (28 of 54), with the model over-assigning `NEEDS_RESPONSE`. That is ~2.3× the random baseline for a 4B local model on 5-way subjective triage: honest, with headroom. (An earlier 4.0 was a stale-label artifact; the corpus was relabeled to the schema-2.0 taxonomy by #1875, which closed the #1874 I filed, and this branch is current with it.)
How it was tested
- `tsc --noEmit` clean (worker + webui); `python util/lint.py --all` → ALL QUALITY CHECKS PASSED.
- `exit 0`; baseline 55.0 vs candidate 46.0 → `exit 1`; `--allow-regression` → `exit 0`. Reader recompute: `round(100 × 0.46, 2) = 46.0` ✓. Generalization (second agent through the same generator) → gate `exit 0`.
Notes
- `main` (includes the relabel from #1875, "fix(eval): relabel email triage corpus to schema-2.0 5-bucket taxonomy"); PR diff is exactly the scorecard system's files; it does not re-touch #1875's corpus/metrics.
- `wrangler deploy` + an email re-publish; in-PR verification is typecheck/build/worker-tests, not live hub rendering.
- `git describe --tags --match 'agent-pkg-email-*'`; confirm the tag pattern matches the email release convention.
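The reader recompute and the gate outcomes listed under "How it was tested" can be sketched as below. This is a simplified model under stated assumptions, not the code in `scorecard_gate.py`: `aggregate_from_accuracy` and `gate_exit_code` are hypothetical names, and the rule "a missing baseline passes as first adoption" is taken from the commit message that introduced `--baseline-ref`.

```python
from typing import Optional


def aggregate_from_accuracy(category_accuracy: float) -> float:
    """Recompute the 0-100 aggregate by hand: round(100 x accuracy, 2)."""
    return round(100 * category_accuracy, 2)


def gate_exit_code(
    candidate: float,
    baseline: Optional[float],
    allow_regression: bool = False,
) -> int:
    """Return 0 to let the release proceed, 1 to block it."""
    if baseline is None:
        # No earlier scorecard to compare against: treated as first adoption.
        return 0
    if candidate >= baseline or allow_regression:
        # Holding or improving the score passes; an explicit override also passes.
        return 0
    return 1


score = aggregate_from_accuracy(0.46)
print(score)                                                # 46.0
print(gate_exit_code(score, None))                          # 0
print(gate_exit_code(score, 55.0))                          # 1
print(gate_exit_code(score, 55.0, allow_regression=True))   # 0
```

Keeping the comparison as a pure function of two numbers makes the gate easy to test without a git checkout; resolving the baseline (from a file or a git ref) can then stay a separate concern.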