feat(eval): per-agent per-version eval scorecard + release gate by itomek · Pull Request #1873 · amd/gaia

itomek · 2026-06-26T14:11:50Z

Why this matters

Every GAIA hub agent ships a README and changelog, but nothing told a prospective user how well the agent actually performs — no accuracy, no test counts, no dataset size. An agent looked like a proof-of-concept, and a quality regression could ship silently.

This adds a standardized per-agent eval scorecard: a single SCORECARD.md per agent (updated in place per release and packaged like README.md/SPEC.md/SKILL.md) recording the eval recipe, measured results (accuracy, test-cases-run, dataset size), a deterministic aggregate score a reader can recompute by hand, and reproduction steps. A standalone release gate blocks packaging on a missing or regressed scorecard, a self-hosted refresh loop keeps it current (update-if-better / reject-if-worse), and the score is surfaced on the hub listing + Agent UI. The email agent is the first end-to-end adopter; the format is agent-agnostic (proven by a generalization test).

What changed

Core, harness-agnostic (src/gaia/eval/): release_scorecard.py (payload → generator → validator → carry-forward; stdlib + PyYAML only) and scorecard_gate.py (standalone gate). Distinct from the per-run scorecard.py.
Single SCORECARD.md: one file per agent, versioned via the publish snapshot (R2 stores agents/<id>/<version>/SCORECARD.md). package.json ships only the current file.
Email adoption: real hub/agents/npm/agent-email/SCORECARD.md, linked + surfaced from the canonical npm README; gate wired into release_agent_email.yml as a publish-blocking job.
Keep-it-honest loop (email_scorecard_refresh.yml): self-hosted AMD runner re-runs the eval on agent/corpus changes → commits the refreshed card if the score holds/improves, fails on a regression.
Hub display: the Cloudflare worker serves eval_score + eval_scorecard_url (parsed from the published SCORECARD.md), publish uploads it to R2, the Agent UI detail modal renders score + link.
Reproduction + skill + docs: each scorecard carries a ## Reproduction section; .claude/skills/adding-eval-scorecard/ makes adoption invocable; docs/reference/eval-scorecard.mdx documents schema, formula, versioning, the gate, and the refresh loop.

The number, and why

Real gaia eval benchmark run on AMD Strix Halo (Gemma-4-E4B), full corpus: category_accuracy 0.46 → aggregate 46.0/100 over 100 of 220 labeled emails (the triage tool processes up to 100 per call — recorded distinctly as test_cases_run: 100 / dataset.size: 220).

Errors are dominated by the inherently ambiguous fyi ↔ needs_response boundary (28 of 54 errors), with the model over-assigning NEEDS_RESPONSE. At 0.46 this is ~2.3× the 0.20 random baseline for 5-way triage, from a 4B local model on a subjective task: honest, with headroom. (An earlier 4.0 was a stale-label artifact; the corpus was relabeled to the schema-2.0 taxonomy by #1875, which closed #1874, the issue I filed, and this branch is current with it.)

How it was tested

65 Python unit tests + 69 hub-worker tests; tsc --noEmit clean (worker + webui); python util/lint.py --all → ALL QUALITY CHECKS PASSED.
Gate (real exit codes, at 46.0): presence → exit 0; baseline 55.0 vs candidate 46.0 → exit 1; --allow-regression → exit 0. Reader recompute: round(100 × 0.46, 2) = 46.0 ✓. Generalization (second agent through the same generator) → gate exit 0.
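The by-hand recompute above can be sketched in a few lines. This is a minimal illustration, assuming the aggregate is a weighted mean of metric values scaled to 0-100 and rounded to two places. The `Metric` tuple and this `compute_aggregate` signature are hypothetical and may differ from the real `release_scorecard.py`.

```python
from typing import NamedTuple, Sequence


class Metric(NamedTuple):
    # Hypothetical shape; the real ResultPayload fields may differ.
    name: str
    value: float          # expected in [0, 1]
    weight: float = 1.0


def compute_aggregate(metrics: Sequence[Metric]) -> float:
    """Weighted mean of metric values, scaled to 0-100, rounded to 2 places."""
    total_weight = sum(m.weight for m in metrics)
    if not metrics or total_weight <= 0:
        # Guard the empty / zero-weight case instead of emitting a silent 0.0.
        raise ValueError("no metrics with positive weight")
    mean = sum(m.value * m.weight for m in metrics) / total_weight
    return round(100 * mean, 2)


print(compute_aggregate([Metric("category_accuracy", 0.46)]))  # 46.0
```

With a single metric of weight 1 this reduces to round(100 × 0.46, 2), the same arithmetic a reader would do by hand.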

Notes

Branch is current with main, including the corpus relabel from #1875 (fix(eval): relabel email triage corpus to schema-2.0 5-bucket taxonomy). The PR diff is exactly the scorecard system's files; it does not re-touch #1875's corpus/metrics.
The hub-display worker goes live only after a manual wrangler deploy + an email re-publish; in-PR verification is typecheck/build/worker-tests, not live hub rendering.
The release gate resolves the previous version via git describe --tags --match 'agent-pkg-email-*'; confirm the tag pattern matches the email release convention.

Tests encode the full acceptance criteria for gaia.eval.release_scorecard and gaia.eval.scorecard_gate before any implementation exists. Includes the email benchmark fixture used by the adapter tests.

…s 1-3)

Core harness-agnostic scorecard generator and standalone release gate.

- ResultPayload dataclass, compute_aggregate (guard empty/zero-weight)
- render_scorecard + parse_scorecard (safe_load on first ---...--- slice)
- validate_scorecard + REQUIRED_FIELDS; anchored semver path guard
- latest_version_below (stdlib int-tuple, skips non-semver filenames)
- carry_forward (patch-only, sets inherited_from, raises on minor/major)
- scorecard_gate.main(argv)->int with --version/--manifest/--allow-regression
- 38/44 tests pass; 4 adapter tests pending gen_scorecard.py (incr 4)
- 1 CI test pending workflow update (incr 6)
- 1 loose-coupling test false-positive: pytest_benchmark matches 'benchmark'

…est (increments 4)

- gen_scorecard.py: reads benchmark scorecard.json (or any scenarios JSON) + ground_truth.json -> ResultPayload -> writes scorecards/<version>.md
- Judged = quality.category_accuracy is finite float in [0,1]; zero judged raises
- test_cases_run = sum(total_emails over judged); dataset_size excl _meta
- Path derivation mirrors stamp_version.py (parents[...] from __file__)
- Fix loose-coupling test: subprocess instead of sys.modules (avoids pytest_benchmark FP) (orchestrator-authorized replacement)
- 43/44 tests pass; 1 remaining = CI workflow test (incr 6)

…nts 5-6)

- docs/reference/eval-scorecard.mdx: schema, storage, formula, versioning policy, gate
- docs/docs.json: nav entry in Evaluation Framework group
- hub/agents/python/hello-world/scorecards/0.1.0.md: generator-produced generalization proof
- hub/agents/npm/agent-email/scorecards/.gitkeep: placeholder for real scorecard
- hub/agents/npm/agent-email/README.md: eval scorecard link to ./scorecards/0.2.4.md
- hub/agents/npm/agent-email/package.json: add scorecards/ to files array
- .github/workflows/release_agent_email.yml: scorecard-gate job + publish.needs update
- lint fixes: remove unused imports from test files; black/isort/pylint/flake8 clean
- 44/44 target tests pass; lint: ALL QUALITY CHECKS PASSED

The AC requires 'missing ANY required field ⇒ invalid', but the validator only checked 5 top-level keys. Add nested checks for agent.{name,version}, recipe.dataset.{reference,size}, recipe.{methodology,config}, results.{test_cases_run,metrics}, aggregate.{name,formula,value}, with non-dict-parent guards and a non-empty metrics-list requirement. Add TestSchemaValidator cases for missing nested fields, empty metrics, and non-dict sections. Also baseline sys.modules before import in the loose-coupling test so editable-install path finders don't false-positive.
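The nested checks described in this commit can be sketched as below. This is a minimal sketch under stated assumptions: the field paths come from the commit message, while `REQUIRED_NESTED`, `validate_nested`, and the error strings are hypothetical names chosen for illustration, not the actual `validate_scorecard` implementation.

```python
from typing import Any, Dict, List, Tuple

# Nested required fields as listed in the commit message above.
REQUIRED_NESTED: Dict[Tuple[str, ...], List[str]] = {
    ("agent",): ["name", "version"],
    ("recipe",): ["methodology", "config"],
    ("recipe", "dataset"): ["reference", "size"],
    ("results",): ["test_cases_run", "metrics"],
    ("aggregate",): ["name", "formula", "value"],
}


def validate_nested(card: Any) -> List[str]:
    """Return human-readable problems; an empty list means valid."""
    errors: List[str] = []
    for path, fields in REQUIRED_NESTED.items():
        node: Any = card
        for key in path:
            # Non-dict-parent guard: never call .get() on a str/list/None.
            node = node.get(key) if isinstance(node, dict) else None
        dotted = ".".join(path)
        if not isinstance(node, dict):
            errors.append(f"{dotted}: missing or not a mapping")
            continue
        errors.extend(f"{dotted}.{f}: missing" for f in fields if f not in node)
    results = card.get("results") if isinstance(card, dict) else None
    metrics = results.get("metrics") if isinstance(results, dict) else None
    if metrics is not None and (not isinstance(metrics, list) or not metrics):
        errors.append("results.metrics: must be a non-empty list")
    return errors
```

Returning a list of problems rather than raising on the first one is a design choice made here so that a single validation pass reports every missing field.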

The benchmark scorecard.json has no top-level model/limit, so config.limit was always null — defeating the comparability note in eval-scorecard.mdx. Add a --limit CLI arg threaded into config.limit, and derive config.model from the run's scenarios[0].category (the model id in benchmark output), falling back to gaia-agent.yaml models[0]. Drop the dead list-comprehension in the final print.

Match the rest of release_agent_email.yml, which already uses actions/setup-python@v6.

Generate the email-triage agent's v0.2.4 release scorecard from an actual `gaia eval benchmark` run (Gemma-4-E4B, 25 of 220 corpus emails) on AMD Strix Halo hardware: category_accuracy 0.04 -> aggregate 4.0/100. The low value is a taxonomy mismatch (the agent's triage labels and the ground-truth priority labels overlap only on 'urgent'), not triage quality -- tracked in #1266 and recorded in the scorecard's own methodology. Adapter hardening: store a repo-relative ground_truth path (no absolute-path leak in the published artifact), record the eval limit for comparability, and carry the taxonomy caveat. README surfaces the aggregate with the caveat and a relative link; docs example aligned to the 4-category label set.

… taxonomy ref

- Add .github/workflows/email_scorecard_refresh.yml: on agent/corpus changes the self-hosted AMD runner re-runs the eval, regenerates the scorecard, commits it when the score holds/improves, and FAILS on a regression (same-version vs the committed card + cross-version via scorecard_gate). Hosted-CI backstop stays the release-time scorecard-gate job.
- Add .claude/skills/adding-eval-scorecard: a phased skill so adopting a scorecard is invocable, not a prose walkthrough; referenced from eval-scorecard.mdx.
- Document the update/reject loop in eval-scorecard.mdx.
- Correct the scorecard's taxonomy reference from the closed #1266 (old 4-way) to #1874 (corpus labels stale vs schema-2.0 5-bucket taxonomy); regenerate the card.

Adds `eval_scorecard_url` and `eval_score` fields end-to-end through the worker catalog pipeline so the Agent Hub listing can show a benchmark aggregate and link to the full scorecard.

Worker: `evalScorecardKey()` storage helper, optional `eval_scorecard` multipart part in POST /publish (stored as `eval-scorecard.md` per version), YAML front-matter parse of `aggregate.value` in `toIndexEntry`, and both fields carried through `rebuildIndex`. Missing/unparseable scorecard yields undefined fields, never throws.

Publish: `--eval-scorecard <path>` flag in `publish_to_r2.py`; the GHA release workflow conditionally passes the versioned scorecard file when it exists under `hub/agents/npm/agent-email/scorecards/<version>.md`.

Python catalog: `merge_with_registry` threads the two new optional fields from the R2 index entry into the unified catalog dict so the UI backend serves them alongside existing agent metadata.

Tests: two focused tests in routes.test.ts cover the present/absent scorecard cases (69 tests total, all pass).

Adds `eval_score` and `eval_scorecard_url` optional fields to `AgentInfo` in the frontend type definitions. When an agent has an eval score, the detail modal renders an "Eval scorecard" section showing the numeric score out of 100, with a "View scorecard" link when the URL is present. Renders nothing when neither field is set (no empty section).

github-actions · 2026-06-26T15:18:39Z

Verdict: Approve with suggestions

This adds a per-agent / per-version release eval scorecard — a versioned Markdown+YAML artifact, a standalone presence+regression release gate, a self-hosted refresh loop, and hub/Agent-UI surfacing — with the email agent as the first real adopter. It's a well-built, well-tested feature: harness-agnostic core, strong TDD coverage, fail-loudly on zero judged scenarios, and an unusually honest write-up of the 4.0 labeling artifact. Nothing here is blocking to merge.

The one thing worth resolving before any non-email agent publishes: the committed hello-world scorecard uses hand-authored numbers (90.0), which the new hub/UI path would surface verbatim as a real eval_score with no "illustrative" marker. That directly contradicts the system's own hard rule ("the scorecard MUST come from an actual eval, never hand-authored numbers"). It's harmless as an in-repo format demo, but it would mislead users if hello-world is ever published — so either keep it unpublished, or mark illustrative cards machine-readably so the hub doesn't show a fabricated score next to the email agent's honest one.

No security concerns.

🔍 Technical details

🟡 Important

Hand-authored hello-world scorecard would surface as a real hub score (hub/agents/python/hello-world/scorecards/0.1.0.md)
The card carries invented numbers (response_quality: 0.9 → aggregate.value: 90.0). The new publish path (publish_to_r2.py --eval-scorecard → worker parseScorecardScore → eval_score → AgentDetailModal) surfaces aggregate.value verbatim with no notion of "illustrative". This contradicts SKILL.md Phase 3 ("hard gate … never hand-authored numbers") and eval-scorecard.mdx. The body's "Illustrative metric" note is human-only — eval_score: 90.0 on the hub carries no caveat, so a fabricated 90.0 would sit next to the email agent's honest 4.0.
Recommend one of: (a) don't ship/publish a scorecard for the reference agent, or (b) add a machine-readable flag (e.g. illustrative: true in front matter) that the gate tolerates and the worker/UI badge or exclude from eval_score. If shipping it purely as an in-repo format example is intentional, a one-line note in eval-scorecard.mdx saying the hello-world card is illustrative-only and must not be published would close the contradiction.

🟢 Minor

Deprecated datetime.datetime.utcnow() (src/gaia/eval/release_scorecard.py:1708)
utcnow() is deprecated as of Python 3.12 and returns a naive datetime — inconsistent with gen_scorecard.py which correctly uses the timezone-aware form. Same pattern in the two test _make_payload helpers. Align on the tz-aware call:

        generated_at=datetime.datetime.now(datetime.timezone.utc).isoformat(),

(Several call sites — release_scorecard.py:1708, tests/unit/eval/test_release_scorecard.py, tests/unit/eval/test_scorecard_gate.py — share the pattern; worth a single sweep rather than separate fixes.)

Strengths

Loose coupling is enforced, not just asserted in prose — test_no_benchmark_or_agent_modules_imported baselines sys.modules in a fresh subprocess before importing, so it actually proves the core pulls in neither the harness nor the agent package.
Fail-loudly where it matters — build_payload raises on zero judged scenarios instead of silently emitting 0.0, and the worker's parseScorecardScore is the correct inverse: it never throws so a malformed scorecard can't break the catalog build.
Honest methodology — the 4.0 artifact is explained in the machine-readable methodology string, the README, and a tracking issue (fix(eval): email triage corpus ground-truth labels stale vs schema-2.0 5-bucket taxonomy #1874) rather than hidden or inflated, which is exactly the integrity the feature is meant to provide.
Hosted gate stays dependency-light correctly — gaia/eval/__init__.py is empty, so the release-gate job's pip install -e . pyyaml imports scorecard_gate without the eval extras; the split between the hosted reject-on-worse gate and the self-hosted run-and-refresh loop is sound, and the refresh trigger paths exclude the npm scorecard dir so auto-commit can't self-retrigger.

After #1875 relabeled the eval corpus to the schema-2.0 triage taxonomy, the email agent's predictions and the ground-truth labels share one vocabulary, so category_accuracy now measures real agreement: 0.40 over 25 of 220 emails -> aggregate 40.0/100 (was 4.0, a labeling artifact). Fresh gaia eval benchmark run on AMD Strix Halo. Drop the now-resolved #1874 caveat from the adapter methodology + README; align the dataset description to the schema-2.0 taxonomy.

…scorecard

kovtcharov-amd · 2026-06-26T17:08:00Z

@@ -0,0 +1,68 @@
+---


Use README.md, SKILL.md, SPEC.md as examples. A single SCORECARD.md that gets updated and packaged up for release is what we need, not a separate named version file.

kovtcharov-amd · 2026-06-26T17:08:26Z

    "SKILL.md",
-    "LICENSE"
+    "LICENSE",
+    "scorecards/"


We don't want to ship all scorecards across all versions with every single release.

kovtcharov-amd · 2026-06-26T17:09:34Z

@@ -0,0 +1,318 @@
+#!/usr/bin/env python3


How do we actually run eval to reproduce the results?

kovtcharov-amd · 2026-06-26T17:10:25Z

@@ -0,0 +1,68 @@
+---
+schema_version: 1


Scorecard should include reproduction steps or a reference to a doc that contains those instructions.

kovtcharov-amd · 2026-06-26T17:10:41Z

+    ground_truth: tests/fixtures/email/ground_truth.json
+    limit: 25
+results:
+  test_cases_run: 25


Why only 25?

kovtcharov-amd · 2026-06-26T17:13:51Z

40% is a pretty bad score, do we know why?

kovtcharov-amd · 2026-06-26T17:14:52Z

@claude review this PR in detail. Make sure the eval scorecard clearly outlines the steps to reproduce the results and the scorecard provides high confidence of the agent being production-ready. Identify any gaps or issues.

kovtcharov-amd · 2026-06-26T17:15:35Z

Can you include a screenshot of the rendered scorecard of the email agent hub site?

github-actions · 2026-06-26T17:17:36Z

The scorecard system is well-built and well-tested, but measured against your two bars — "clearly reproducible" and "high confidence of production-ready" — it's only partway there, and the honest answer to @kovtcharov-amd's earlier 40% question is the crux: this scorecard demonstrates the email agent is not production-ready yet, and it's correct to surface that rather than hide it.

On the 40%. After the #1875 corpus relabel, 40.0 is genuine agreement between the agent's triage labels and the ground truth (the earlier 4.0 was a vocabulary mismatch, now fixed). On a 5-bucket taxonomy, random chance is ~20%, so the Gemma-4-E4B agent is ~2× chance but agrees with only ~2 of every 5 emails — well below any production bar. The card is doing its job by reporting this number truthfully.

Reproducibility — partial. The reproduce command exists, but in the docs, not in the card; and the card pins only model name + limit. It does not pin the things that actually determine the number: the GAIA commit, the Lemonade Server version, the exact Gemma model build/quantization, the hardware (Strix Halo), and decoding params (temperature/seed). Re-run on a different model build or box and the 40 can move — so "reproduce exactly from the card alone" isn't guaranteed today.

Confidence — statistically weak as recorded. Only 25 of 220 emails (11%) were scored. A 40% point estimate on n=25 carries roughly a ±19-point 95% confidence interval — you genuinely can't distinguish 40% from ~25% or ~60%. A production-confidence claim needs the full 220 run (or the CI reported on the card).

Biggest diagnostic gap (also answers "why 40%"): the benchmark already computes per-category accuracy, but the card averages it into one number. For triage, missing an urgent email matters far more than mislabeling a promotional one — and right now nobody can see which buckets are failing. A per-bucket breakdown / confusion matrix is the single most useful addition, both for the score's credibility and for knowing what to fix.

Plus the previously-flagged hand-authored hello-world 90.0 — it would render on the hub as a real score next to the email agent's honest 40, so resolve it before any second agent publishes.

Net: merge the system on its merits, but I wouldn't read this card as a production-readiness sign-off — it's an honest "needs work" baseline. Concrete asks below.

🔍 Technical details

Reproducibility

Reproduce command lives only in docs/reference/eval-scorecard.mdx:167-175; the committed card (hub/agents/npm/agent-email/scorecards/0.2.4.md) recipe.config has harness/model/corpus/ground_truth/limit but no copy-paste command and no environment pin.
Add to recipe.config (or a new environment block): GAIA git SHA, Lemonade Server version, exact model id/quantization/build hash, hardware, and temperature/seed. Without these the regression gate (scorecard_gate.py, aggregate-only compare) can move for reasons unrelated to the agent — exactly the failure mode the --limit Warning at eval-scorecard.mdx:99-101 calls out, generalized to the whole environment.

Sample size / CI

results.test_cases_run: 25, recipe.dataset.size: 220. SE ≈ √(0.4·0.6/25) ≈ 0.098 → 95% CI ≈ 40% ± 19pp (~21–59%). Recommend either --limit 220 (full corpus) for the release card, or record n and the CI alongside the point estimate so the gate/readers aren't comparing noise.
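The interval arithmetic above can be checked with a short script. This is a sketch using the normal-approximation (Wald) interval, which is what the SE formula in the comment implies; `wald_interval` is a hypothetical helper name, and for small n a Wilson interval would generally be a more careful choice.

```python
import math
from typing import Tuple


def wald_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (standard error, lower bound, upper bound) for a proportion."""
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    half_width = z * se
    # Clamp to [0, 1] because a proportion cannot leave that range.
    return se, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


se, low, high = wald_interval(0.40, 25)
print(f"SE={se:.3f}  95% CI=[{low:.3f}, {high:.3f}]")
# SE=0.098  95% CI=[0.208, 0.592]
```

Calling the same helper with a larger `n` shows how quickly the interval tightens, which is the argument for recording `n` next to the point estimate.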

Per-category breakdown

tests/fixtures/eval/email_benchmark_scorecard.json shows each scenario already carries quality.category_accuracy per category; build_payload in hub/agents/python/email/packaging/gen_scorecard.py collapses them to a single mean (category_accuracy 0.4). The per-bucket signal is computed and then discarded. Emitting per-class precision/recall (esp. urgent recall) into results.metrics would make the card actionable and let the aggregate weight safety-critical buckets.
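One way the per-bucket signal could be kept rather than averaged away is sketched below. It assumes (gold label, predicted label) pairs can be recovered from the benchmark output, which the source does not confirm; `per_class_recall`, `confusion`, and the toy data are hypothetical and are not real eval results.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (gold_label, predicted_label)


def per_class_recall(pairs: Iterable[Pair]) -> Dict[str, float]:
    """Recall per gold label: correct predictions / emails with that label."""
    totals: Counter = Counter()
    hits: Counter = Counter()
    for gold, predicted in pairs:
        totals[gold] += 1
        if gold == predicted:
            hits[gold] += 1
    return {label: hits[label] / totals[label] for label in totals}


def confusion(pairs: Iterable[Pair]) -> DefaultDict[str, Counter]:
    """Confusion counts indexed as matrix[gold][predicted]."""
    matrix: DefaultDict[str, Counter] = defaultdict(Counter)
    for gold, predicted in pairs:
        matrix[gold][predicted] += 1
    return matrix


# Toy data for illustration only; not taken from the email corpus.
toy = [
    ("urgent", "urgent"),
    ("urgent", "needs_response"),
    ("fyi", "needs_response"),
    ("fyi", "fyi"),
    ("needs_response", "needs_response"),
]
print(per_class_recall(toy))
print(confusion(toy)["fyi"]["needs_response"])
```

Emitting the recall dict into `results.metrics` would make a low urgent recall visible on the card even when the overall mean looks acceptable.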

hello-world fabricated score (re-affirming the earlier review)

hub/agents/python/hello-world/scorecards/0.1.0.md has aggregate.value: 90.0 from hand-authored response_quality: 0.9. The publish path (publish_to_r2.py --eval-scorecard → worker parseScorecardScore → eval_score → AgentDetailModal.tsx) surfaces it verbatim with no "illustrative" marker, contradicting the system's own "never hand-authored numbers" rule. Either don't publish it, or add a machine-readable illustrative: true the gate tolerates and the worker excludes from eval_score.

Minor

release_scorecard.py:1708 uses deprecated naive datetime.datetime.utcnow(); gen_scorecard.py uses the tz-aware form. Align (also in the two test _make_payload helpers).

@kovtcharov-amd — flagging for your call: the scorecard framework is solid and worth landing, but I'd treat the 40.0 as a baseline-to-improve, not a green light. The separate "rendered hub screenshot" request can't be produced in-PR — the hub worker goes live only after a manual wrangler deploy + email re-publish (per the PR's own Notes), so there's nothing rendered to capture until then.

kovtcharov · 2026-06-26T17:23:50Z

How to render the scorecard as a tab on the hub page

Right now the scorecard surfaces as a score badge + external link in the detail modal, not as a doc tab. The reason: the hub page builds its doc tabs from the markdown body fields the worker puts in IndexEntry — readme, changelog, and the two this PR added, spec and skill. For the scorecard the worker reads the markdown (readEvalScorecard) but then only keeps the parsed number + URL (eval_score, eval_scorecard_url) and discards the body, so there's nothing for a tab to render against.

Making it a tab is the same shape as the spec/skill work already in this PR — carry the markdown through, then add the tab:

🔍 Concrete steps

1. Worker — persist the scorecard markdown into IndexEntry (mirror spec/skill):

workers/agent-hub/src/types.ts — add the field next to spec/skill:

/** Eval scorecard markdown of the latest version; "" if none was published. */
eval_scorecard: string;

workers/agent-hub/src/catalog.ts: in toIndexEntry, evalScorecard is already a parameter (today it is consumed only for parseScorecardScore and the URL). Also store the body alongside spec and skill:
```
spec,
skill,
eval_scorecard: evalScorecard ?? "",
```

That's the whole worker change — evalScorecardKey, readEvalScorecard, the R2 upload, and rebuildIndex plumbing all already exist. Add one route test asserting entry.eval_scorecard carries the body (copy the spec/skill assertion you just added in test/routes.test.ts).

2. Hub frontend — add the tab. The public hub page (the consumer of index.json, per docs/spec/agent-hub-restructure.mdx) renders readme/changelog/spec/skill as sanitized-markdown tabs. Add a "Scorecard" entry to that same tab list pointing at entry.eval_scorecard, reusing the existing markdown renderer, and only show the tab when the body is non-empty (so agents without a scorecard don't get an empty tab). Keep the existing badge as a quick-glance summary — the tab is the full recipe + recomputation.

3. (Optional) Electron AgentDetailModal. This modal has no tab strip — it's stacked sections, so leaving the badge + "View scorecard" link there is reasonable. If you want the body inline here too, render agent.eval_scorecard through the same markdown component used for other sections.

Net: one field carried through the worker + one tab in the hub frontend, both following the spec/skill pattern this PR already establishes. The badge stays as the at-a-glance score; the tab gives readers the full recipe and the by-hand recomputation without leaving the page.

…production section

Storage convention changes from scorecards/<version>.md to a single SCORECARD.md updated in place (versioned via publish snapshot, same as README.md).

- release_scorecard.py: add reproduction_command to ResultPayload; render_scorecard emits a Reproduction section; carry_forward reads version from front matter instead of filename stem; remove latest_version_below (per-version dirs gone); fix utcnow -> now(utc)
- scorecard_gate.py: redesigned to accept --scorecard SCORECARD.md + optional --baseline-file / --baseline-ref (mutually exclusive); no --scorecards-dir or --version flags; --baseline-ref resolves via git show; absence at ref = first adoption pass; git-shellout-free when --baseline-file is used
- gen_scorecard.py: writes hub/agents/npm/agent-email/SCORECARD.md (not scorecards/<ver>.md); supplies reproduction_command with exact env vars and commands
- tests: updated for new carry_forward signature, new gate interface, reproduction section assertions, second-agent generalization test, utcnow -> now(utc)

… agent

- hub/agents/npm/agent-email/SCORECARD.md: generated from relabeled-corpus run (placeholder; orchestrator will regenerate from full run)
- hub/agents/npm/agent-email/package.json: files array includes SCORECARD.md, removes scorecards/ (don't ship all versions in the npm tarball)
- hub/agents/npm/agent-email/README.md: scorecard link updated to ./SCORECARD.md
- Delete hub/agents/npm/agent-email/scorecards/ (per-version dir, now obsolete)
- Delete hub/agents/python/hello-world/scorecards/ (contained fabricated 90.0 score)

…RD.md

- storage.ts: evalScorecardKey now returns SCORECARD.md (was eval-scorecard.md)
- publish.ts: update comment for SCORECARD.md
- routes.test.ts: expect eval_scorecard_url to end in /SCORECARD.md
- publish_to_r2.py: update --eval-scorecard help text to reference SCORECARD.md
- release_agent_email.yml: scorecard-gate uses new --scorecard / --baseline-ref interface; computes prev tag via git describe; publish step points at SCORECARD.md
- email_scorecard_refresh.yml: use SCORECARD.md env var throughout; same-version check and cross-version gate use new gate interface with --baseline-ref

…onvention

- eval-scorecard.mdx: storage convention is now a single SCORECARD.md (not scorecards/<ver>.md); gate uses --scorecard + --baseline-ref/--baseline-file; carry_forward reads version from front matter; Reproduction section documented; npm files include SCORECARD.md only (not scorecards/ dir)
- SKILL.md: doc-root/SCORECARD.md as single file; reproduction_command in adapter; gate CLI updated to --scorecard / --baseline-ref pattern; Phase 4 examples updated

kovtcharov · 2026-06-26T17:33:37Z

Move the eval into the release pipeline — don't run it per-PR

The eval gating is wired backwards. The real eval lives in email_scorecard_refresh.yml, which fires on every push to a non-main branch touching hub/agents/python/email/** or tests/fixtures/email/** (and auto-commits a regenerated scorecard back to the branch). Meanwhile the release pipeline's scorecard-gate job runs no eval at all — it parses a pre-committed card on ubuntu-latest. So the published score is whatever was hand-committed, and the "keep-it-honest" eval never runs against the bytes being shipped.

It's also never actually executed: both pushes on this branch queued with 0 jobs, waiting on the absent [self-hosted, lemonade-eval] runner (e.g. run 28252399720, pending 30+ min). So the refresh loop has run zero times.

The eval should run once, at tagged release, as a publish blocker — and meet-or-beat the previously published score:

Drop the push: trigger from email_scorecard_refresh.yml (keep workflow_dispatch as a manual refresh, or delete the file). PRs shouldn't run the eval or auto-commit scorecards — that's a 90-min self-hosted job on a single shared backend slot, fired per commit.
Replace the file-parsing scorecard-gate in release_agent_email.yml with a real-eval gate on [self-hosted, lemonade-eval] that, before publish:
- runs gaia eval benchmark for the tagged version,
- regenerates the card via gen_scorecard.py,
- fails unless the fresh aggregate ≥ the previously published version's score (reuse gaia.eval.scorecard_gate's cross-version check — just feed it the freshly-evaluated card instead of a committed one),
- and publishes that fresh card.
Keep publish needs:-ing this job, so a regression or an offline runner blocks the release — fail loud, never publish without proof.

Net: eval runs at release only, the published number is tied to the shipped bytes, "no worse than last release" becomes a hard publish gate, and scorecards stop being mid-PR branch mutations.

🔍 Two correctness issues spotted while reviewing the eval code

gen_scorecard.py (build_payload) — aggregate is an unweighted scenario mean, reported against a summed email count. category_accuracy = sum(s.quality.category_accuracy for s in judged) / len(judged) while test_cases_run = sum(total_emails). With one scenario (today's run) it's fine, but a multi-scenario run with unequal total_emails (e.g. --experiments, multiple models, partial runs) reports a per-scenario mean as if it were per-email accuracy — and that's the headline number the gate compares. Weight each scenario by its total_emails.
scorecard_gate.py:230 — float(candidate) < float(prev) doesn't reject non-finite values. validate_scorecard checks aggregate.value is present but not that it's finite; a hand-edited/corrupt card with value: .nan makes every comparison x < nan → False → permanent PASS. Low likelihood (the generator can't emit NaN), but a math.isfinite guard in validate_scorecard closes it cheaply.
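Both issues can be illustrated with a small sketch. It assumes each scenario contributes a `(category_accuracy, total_emails)` pair; `weighted_accuracy` and `is_regression` are hypothetical helpers written for this example, not the code in `gen_scorecard.py` or `scorecard_gate.py`.

```python
import math
from typing import Iterable, Tuple


def weighted_accuracy(scenarios: Iterable[Tuple[float, int]]) -> float:
    """Per-email accuracy: weight each scenario by its total_emails."""
    judged = [
        (acc, n) for acc, n in scenarios
        if math.isfinite(acc) and 0.0 <= acc <= 1.0 and n > 0
    ]
    if not judged:
        # Fail loudly rather than emit a silent 0.0.
        raise ValueError("zero judged scenarios")
    return sum(acc * n for acc, n in judged) / sum(n for _, n in judged)


def is_regression(candidate: float, baseline: float) -> bool:
    """Compare only after rejecting NaN/inf, so a corrupt card cannot pass."""
    if not (math.isfinite(candidate) and math.isfinite(baseline)):
        raise ValueError("aggregate.value must be finite")
    return candidate < baseline


# The unweighted mean of (0.9 over 10 emails, 0.4 over 90 emails) is 0.65;
# weighting by email count gives 45 correct out of 100.
print(round(weighted_accuracy([(0.9, 10), (0.4, 90)]), 4))  # 0.45
print(float("nan") < 55.0)  # False, which is why an unguarded gate passes NaN
```

With a single scenario the weighted and unweighted means coincide, which matches the observation that today's one-scenario run is unaffected.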

- subprocess.run: add check=False (W1510)
- Remove bare f-strings with no interpolated vars (W1309)
- black reformatted test_scorecard_gate.py

…reproduction Regenerate the email v0.2.4 SCORECARD.md from a full-corpus gaia eval benchmark run on AMD Strix Halo: category_accuracy 0.46 over 100 of 220 emails (the triage tool processes up to 100 per call) -> aggregate 46.0/100. Errors are dominated by the inherently-ambiguous fyi<->needs_response boundary; the model over-assigns NEEDS_RESPONSE. Fix the adapter's reproduction command to be portable (generic /tmp/email-eval output dir, full model/mbox/ground-truth/output-dir flags) — no local absolute path in the published artifact. README reflects 46.0.

itomek · 2026-06-26T17:42:37Z

Thanks for the review — addressed all of it; pushed to the branch.

Single SCORECARD.md (your main point): switched from per-version scorecards/<version>.md files to one SCORECARD.md per agent, updated in place and packaged per release — exactly like README.md / SPEC.md / SKILL.md. Per-version uniqueness rides the publish snapshot (R2 stores it at agents/<id>/<version>/SCORECARD.md). Deleted the scorecards/ dirs; package.json now ships only SCORECARD.md, not all versions.

Reproduction (your "how do we actually run eval?"): every scorecard now has a ## Reproduction section with the exact, copy-pasteable commands (env + gaia eval benchmark + gen_scorecard.py) plus links to the docs and the new adding-eval-scorecard skill.

"Why only 25?" → now the full corpus. Re-ran over the whole corpus on AMD Strix Halo. The triage tool processes up to 100 emails per call, so the effective run is 100 of 220 labeled examples — recorded distinctly (test_cases_run: 100, dataset.size: 220).

"40% — do we know why?" Yes. Full-corpus result is category_accuracy 0.46 → aggregate 46.0. The errors are dominated by the genuinely-ambiguous fyi ↔ needs_response boundary (28 of 54 errors), and the model over-assigns NEEDS_RESPONSE (urgent/promotional leak into it). Per-class recall 36–54%. For a 4B local model (Gemma-4-E4B) on a 5-way subjective triage task, that's ~2.3× the random baseline — a real number with clear headroom (prompt / few-shot tuning), recorded honestly rather than inflated.

[bot review] Fixed both: removed the committed hello-world card (its hand-authored 90.0 would have surfaced as a real hub eval_score) — generalization is now proven by a unit test instead; and swept the deprecated datetime.utcnow() → tz-aware.

Gate, after the redesign

The gate now compares a single SCORECARD.md against a baseline: --baseline-file <path> (unit tests) or --baseline-ref <git-ref> (CI uses the previous release tag, best-effort). Presence-only passes when there's no baseline. Verified live: presence → exit 0; baseline 55.0 vs candidate 46.0 → exit 1; --allow-regression → exit 0. 65 Python + 69 worker tests pass; lint clean; both tsc --noEmit clean.
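The three verified outcomes reduce to a small decision function. This is a sketch of the decision only, assuming the aggregates have already been parsed from the two cards; `gate_exit_code` is a hypothetical name, and the real `scorecard_gate.main` also handles argument parsing and baseline resolution.

```python
from typing import Optional


def gate_exit_code(candidate: float,
                   baseline: Optional[float],
                   allow_regression: bool = False) -> int:
    """Return 0 to let the release proceed, 1 to block it."""
    if baseline is None:
        # Presence-only: first adoption, nothing to compare against.
        return 0
    if candidate < baseline and not allow_regression:
        return 1
    return 0


print(gate_exit_code(46.0, None))                         # 0
print(gate_exit_code(46.0, 55.0))                         # 1
print(gate_exit_code(46.0, 55.0, allow_regression=True))  # 0
```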

One thing to confirm: the cross-version baseline resolves the previous release via git describe --tags --match 'agent-pkg-email-*'. If the email release tags use a different pattern, point me at it and I'll fix the match.

Tomasz Iniewicz added 8 commits June 25, 2026 18:18

test(eval): add TDD tests for release scorecard + gate (frozen contract)

753b88d


ci(eval): pin scorecard-gate setup-python to @v6

2f931a1


github-actions Bot added the documentation, devops, eval, tests, and performance labels Jun 26, 2026

Tomasz Iniewicz and others added 4 commits June 26, 2026 10:42

Merge branch 'main' into feat/issue-1862-eval-scorecard

a3dd996

itomek marked this pull request as ready for review June 26, 2026 15:14

itomek requested a review from kovtcharov-amd as a code owner June 26, 2026 15:14

itomek self-assigned this Jun 26, 2026

Tomasz Iniewicz added 2 commits June 26, 2026 12:49

Merge remote-tracking branch 'origin/main' into feat/issue-1862-eval-…

178266c

…scorecard

kovtcharov-amd reviewed Jun 26, 2026


Tomasz Iniewicz added 4 commits June 26, 2026 13:29

Tomasz Iniewicz added 2 commits June 26, 2026 13:38

fix(eval): scorecard_gate pylint and black formatting

20dbdbe

