
feat(eval): per-agent per-version eval scorecard + release gate #1873

Open
itomek wants to merge 20 commits into main from feat/issue-1862-eval-scorecard

@itomek commented Jun 26, 2026

Collaborator

Closes #1862

Why this matters

Every GAIA hub agent ships a README and changelog, but nothing told a prospective user how well the agent actually performs — no accuracy, no test counts, no dataset size. An agent looked like a proof-of-concept, and a quality regression could ship silently.

This adds a standardized per-agent eval scorecard: a single SCORECARD.md per agent (updated in place per release and packaged like README.md/SPEC.md/SKILL.md) recording the eval recipe, measured results (accuracy, test-cases-run, dataset size), a deterministic aggregate score a reader can recompute by hand, and reproduction steps. A standalone release gate blocks packaging on a missing or regressed scorecard, a self-hosted refresh loop keeps it current (update-if-better / reject-if-worse), and the score is surfaced on the hub listing + Agent UI. The email agent is the first end-to-end adopter; the format is agent-agnostic (proven by a generalization test).

What changed

  • Core, harness-agnostic (src/gaia/eval/): release_scorecard.py (payload → generator → validator → carry-forward; stdlib + PyYAML only) and scorecard_gate.py (standalone gate). Distinct from the per-run scorecard.py.
  • Single SCORECARD.md: one file per agent, versioned via the publish snapshot (R2 stores agents/<id>/<version>/SCORECARD.md). package.json ships only the current file.
  • Email adoption: real hub/agents/npm/agent-email/SCORECARD.md, linked + surfaced from the canonical npm README; gate wired into release_agent_email.yml as a publish-blocking job.
  • Keep-it-honest loop (email_scorecard_refresh.yml): self-hosted AMD runner re-runs the eval on agent/corpus changes → commits the refreshed card if the score holds/improves, fails on a regression.
  • Hub display: the Cloudflare worker serves eval_score + eval_scorecard_url (parsed from the published SCORECARD.md), publish uploads it to R2, the Agent UI detail modal renders score + link.
  • Reproduction + skill + docs: each scorecard carries a ## Reproduction section; .claude/skills/adding-eval-scorecard/ makes adoption invocable; docs/reference/eval-scorecard.mdx documents schema, formula, versioning, the gate, and the refresh loop.

The number, and why

Real gaia eval benchmark run on AMD Strix Halo (Gemma-4-E4B), full corpus: category_accuracy 0.46 → aggregate 46.0/100 over 100 of 220 labeled emails (the triage tool processes up to 100 per call — recorded distinctly as test_cases_run: 100 / dataset.size: 220).

Errors are dominated by the inherently-ambiguous fyi ↔ needs_response boundary (28 of 54), with the model over-assigning NEEDS_RESPONSE. ~2.3× the random baseline for a 4B local model on 5-way subjective triage — honest, with headroom. (An earlier 4.0 was a stale-label artifact; the corpus was relabeled to the schema-2.0 taxonomy by #1875, which closed the #1874 I filed, and this branch is current with it.)
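The arithmetic behind the headline number can be checked in a few lines. This is a sketch, assuming the aggregate is simply 100 × category_accuracy rounded to two places (the recompute quoted under "How it was tested") and that the random baseline is a uniform 1-in-5 guess; the function names are illustrative and are not the PR's API.

```python
def aggregate_score(category_accuracy: float) -> float:
    """Scale an accuracy in [0, 1] to a 0-100 score, rounded to 2 places."""
    return round(100 * category_accuracy, 2)


def ratio_over_random(category_accuracy: float, num_classes: int) -> float:
    """How many times better than uniform random guessing over num_classes."""
    return category_accuracy / (1 / num_classes)


# 46 of the 100 evaluated emails matched the ground-truth label.
print(aggregate_score(0.46))                   # 46.0
print(round(ratio_over_random(0.46, 5), 2))    # 2.3
```

The same two calls with 0.40 give 40.0 and 2.0, which matches the "~2× chance" reading of the earlier 25-email run.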

How it was tested

  • 65 Python unit tests + 69 hub-worker tests; tsc --noEmit clean (worker + webui); python util/lint.py --all → ALL QUALITY CHECKS PASSED.
  • Gate (real exit codes, at 46.0): presence → exit 0; baseline 55.0 vs candidate 46.0 → exit 1; --allow-regression → exit 0. Reader recompute: round(100 × 0.46, 2) = 46.0 ✓. Generalization (second agent through the same generator) → gate exit 0.
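The gate outcomes listed above reduce to a small decision table. The following is a minimal sketch of that table, not the code in scorecard_gate.py: `gate_exit_code` is a hypothetical name, and returning 1 for a missing candidate is an assumption based on the statement that the gate blocks packaging on a missing scorecard.

```python
from typing import Optional


def gate_exit_code(
    candidate: Optional[float],
    baseline: Optional[float],
    allow_regression: bool = False,
) -> int:
    """Return a process exit code: 0 lets the release proceed, 1 blocks it."""
    if candidate is None:
        return 1  # no scorecard at all (assumed to block)
    if baseline is None:
        return 0  # presence-only check: nothing to compare against
    if candidate < baseline and not allow_regression:
        return 1  # score dropped below the baseline
    return 0


print(gate_exit_code(46.0, None))                          # 0
print(gate_exit_code(46.0, 55.0))                          # 1
print(gate_exit_code(46.0, 55.0, allow_regression=True))   # 0
```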

Notes

Tomasz Iniewicz added 8 commits June 25, 2026 18:18
Tests encode the full acceptance criteria for gaia.eval.release_scorecard
and gaia.eval.scorecard_gate before any implementation exists. Includes
the email benchmark fixture used by the adapter tests.
…s 1-3)

Core harness-agnostic scorecard generator and standalone release gate.
- ResultPayload dataclass, compute_aggregate (guard empty/zero-weight)
- render_scorecard + parse_scorecard (safe_load on first ---...--- slice)
- validate_scorecard + REQUIRED_FIELDS; anchored semver path guard
- latest_version_below (stdlib int-tuple, skips non-semver filenames)
- carry_forward (patch-only, sets inherited_from, raises on minor/major)
- scorecard_gate.main(argv)->int with --version/--manifest/--allow-regression
- 38/44 tests pass; 4 adapter tests pending gen_scorecard.py (incr 4)
- 1 CI test pending workflow update (incr 6)
- 1 loose-coupling test false-positive: pytest_benchmark matches 'benchmark'
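As an illustration of the version handling this commit describes (int-tuple comparison, skipping non-semver filenames, patch-only carry-forward), here is a stdlib-only sketch. The function names come from the commit message, but the bodies, signatures, and the placement of `inherited_from` at the top level of the card are assumptions.

```python
import copy
import re
from typing import Iterable, Optional, Tuple

# Matched with fullmatch, so "1.2", "1.2.3-rc1" and path-like strings are rejected.
_SEMVER = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def parse_semver(text: str) -> Tuple[int, int, int]:
    match = _SEMVER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a plain MAJOR.MINOR.PATCH version: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def latest_version_below(filenames: Iterable[str], target: str) -> Optional[str]:
    """Highest '<version>.md' strictly below target; other filenames are skipped."""
    ceiling = parse_semver(target)
    best = None
    for name in filenames:
        stem = name[:-3] if name.endswith(".md") else name
        try:
            version = parse_semver(stem)
        except ValueError:
            continue  # e.g. ".gitkeep"
        if version < ceiling and (best is None or version > best):
            best = version
    return None if best is None else ".".join(str(part) for part in best)


def carry_forward(card: dict, new_version: str) -> dict:
    """Reuse a card for a patch release only; minor/major bumps need a fresh eval."""
    old_version = card["agent"]["version"]
    old, new = parse_semver(old_version), parse_semver(new_version)
    if old[:2] != new[:2] or new[2] <= old[2]:
        raise ValueError(f"{old_version} -> {new_version} is not a patch bump")
    carried = copy.deepcopy(card)
    carried["agent"]["version"] = new_version
    carried["inherited_from"] = old_version
    return carried
```

Comparing integer tuples matters here: as strings, "0.2.10" sorts below "0.2.3", whereas (0, 2, 10) correctly sorts above (0, 2, 3).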
…est (increments 4)

- gen_scorecard.py: reads benchmark scorecard.json (or any scenarios JSON)
  + ground_truth.json -> ResultPayload -> writes scorecards/<version>.md
- Judged = quality.category_accuracy is finite float in [0,1]; zero judged raises
- test_cases_run = sum(total_emails over judged); dataset_size excl _meta
- Path derivation mirrors stamp_version.py (parents[...] from __file__)
- Fix loose-coupling test: subprocess instead of sys.modules (avoids pytest_benchmark FP)
  (orchestrator-authorized replacement)
- 43/44 tests pass; 1 remaining = CI workflow test (incr 6)
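The "judged" rule and the counts in this commit can be sketched as follows. This is an approximation under stated assumptions: the scenario shape (`quality.category_accuracy`, `total_emails`) follows the field names in the commit message, while the ground-truth layout (a mapping with a `_meta` bookkeeping key) and the function names are guesses made for illustration.

```python
import math
from typing import Dict, List


def is_judged(scenario: dict) -> bool:
    """A scenario counts only if its accuracy is a finite float in [0, 1]."""
    quality = scenario.get("quality")
    if not isinstance(quality, dict):
        return False
    value = quality.get("category_accuracy")
    return isinstance(value, float) and math.isfinite(value) and 0.0 <= value <= 1.0


def summarize(scenarios: List[dict]) -> Dict[str, int]:
    judged = [s for s in scenarios if is_judged(s)]
    if not judged:
        # Fail loudly instead of emitting a misleading 0.0 score.
        raise ValueError("no judged scenarios in benchmark output")
    return {
        "judged_scenarios": len(judged),
        "test_cases_run": sum(s["total_emails"] for s in judged),
    }


def dataset_size(ground_truth: dict) -> int:
    """Count labeled entries, excluding the assumed '_meta' key."""
    return sum(1 for key in ground_truth if key != "_meta")
```

Raising on zero judged scenarios keeps a broken run from being recorded as a real 0.0 result.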
…nts 5-6)

- docs/reference/eval-scorecard.mdx: schema, storage, formula, versioning policy, gate
- docs/docs.json: nav entry in Evaluation Framework group
- hub/agents/python/hello-world/scorecards/0.1.0.md: generator-produced generalization proof
- hub/agents/npm/agent-email/scorecards/.gitkeep: placeholder for real scorecard
- hub/agents/npm/agent-email/README.md: eval scorecard link to ./scorecards/0.2.4.md
- hub/agents/npm/agent-email/package.json: add scorecards/ to files array
- .github/workflows/release_agent_email.yml: scorecard-gate job + publish.needs update
- lint fixes: remove unused imports from test files; black/isort/pylint/flake8 clean
- 44/44 target tests pass; lint: ALL QUALITY CHECKS PASSED
The AC requires 'missing ANY required field ⇒ invalid', but the validator
only checked 5 top-level keys. Add nested checks for agent.{name,version},
recipe.dataset.{reference,size}, recipe.{methodology,config},
results.{test_cases_run,metrics}, aggregate.{name,formula,value}, with
non-dict-parent guards and a non-empty metrics-list requirement. Add
TestSchemaValidator cases for missing nested fields, empty metrics, and
non-dict sections. Also baseline sys.modules before import in the
loose-coupling test so editable-install path finders don't false-positive.
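A nested required-field check with non-dict-parent guards might look like the sketch below. The field paths are the ones listed in this commit message; the table layout, the error strings, and the `validate` name are illustrative assumptions, not the validator's actual interface.

```python
from typing import List

# Parent path -> keys that must exist under it (paths taken from the commit message).
REQUIRED_NESTED = {
    "agent": ("name", "version"),
    "recipe": ("dataset", "methodology", "config"),
    "recipe.dataset": ("reference", "size"),
    "results": ("test_cases_run", "metrics"),
    "aggregate": ("name", "formula", "value"),
}


def validate(card: object) -> List[str]:
    """Return a list of problems; an empty list means the card is valid."""
    errors: List[str] = []
    for path, fields in REQUIRED_NESTED.items():
        node = card
        for part in path.split("."):
            # Guard: a string or list where a mapping is expected must not crash.
            node = node.get(part) if isinstance(node, dict) else None
        if not isinstance(node, dict):
            errors.append(f"{path}: missing or not a mapping")
            continue
        errors.extend(f"{path}.{name}: missing" for name in fields if name not in node)
    results = card.get("results") if isinstance(card, dict) else None
    if isinstance(results, dict) and "metrics" in results:
        metrics = results["metrics"]
        if not isinstance(metrics, list) or not metrics:
            errors.append("results.metrics: must be a non-empty list")
    return errors
```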
The benchmark scorecard.json has no top-level model/limit, so config.limit
was always null — defeating the comparability note in eval-scorecard.mdx.
Add a --limit CLI arg threaded into config.limit, and derive config.model
from the run's scenarios[0].category (the model id in benchmark output),
falling back to gaia-agent.yaml models[0]. Drop the dead list-comprehension
in the final print.
Match the rest of release_agent_email.yml, which already uses
actions/setup-python@v6.
Generate the email-triage agent's v0.2.4 release scorecard from an actual
`gaia eval benchmark` run (Gemma-4-E4B, 25 of 220 corpus emails) on AMD
Strix Halo hardware: category_accuracy 0.04 -> aggregate 4.0/100. The low
value is a taxonomy mismatch (the agent's triage labels and the ground-truth
priority labels overlap only on 'urgent'), not triage quality -- tracked in
#1266 and recorded in the scorecard's own methodology.

Adapter hardening: store a repo-relative ground_truth path (no absolute-path
leak in the published artifact), record the eval limit for comparability, and
carry the taxonomy caveat. README surfaces the aggregate with the caveat and a
relative link; docs example aligned to the 4-category label set.
github-actions bot added the documentation, devops, eval, tests, and performance labels Jun 26, 2026
Tomasz Iniewicz and others added 4 commits June 26, 2026 10:42
… taxonomy ref

- Add .github/workflows/email_scorecard_refresh.yml: on agent/corpus changes the
  self-hosted AMD runner re-runs the eval, regenerates the scorecard, commits it
  when the score holds/improves, and FAILS on a regression (same-version vs the
  committed card + cross-version via scorecard_gate). Hosted-CI backstop stays the
  release-time scorecard-gate job.
- Add .claude/skills/adding-eval-scorecard: a phased skill so adopting a scorecard
  is invocable, not a prose walkthrough; referenced from eval-scorecard.mdx.
- Document the update/reject loop in eval-scorecard.mdx.
- Correct the scorecard's taxonomy reference from the closed #1266 (old 4-way) to
  #1874 (corpus labels stale vs schema-2.0 5-bucket taxonomy); regenerate the card.
Adds `eval_scorecard_url` and `eval_score` fields end-to-end through the
worker catalog pipeline so the Agent Hub listing can show a benchmark
aggregate and link to the full scorecard.

Worker: `evalScorecardKey()` storage helper, optional `eval_scorecard`
multipart part in POST /publish (stored as `eval-scorecard.md` per version),
YAML front-matter parse of `aggregate.value` in `toIndexEntry`, and both
fields carried through `rebuildIndex`. Missing/unparseable scorecard yields
undefined fields, never throws.
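The "never throws" contract can be illustrated with a Python analogue of the worker's score extraction. The real helper is TypeScript and is not shown in this thread, so everything below is an assumption: it takes the first `---`…`---` slice and does a plain line scan for `aggregate.value` rather than a full YAML parse, returning None where the worker would leave the field undefined.

```python
import math
import re
from typing import Optional

# First '---' ... '---' block at the very start of the file.
_FRONT_MATTER = re.compile(r"\A---\s*\n(.*?)\n---\s*(?:\n|\Z)", re.DOTALL)


def parse_scorecard_score(markdown: Optional[str]) -> Optional[float]:
    """Return aggregate.value as a finite float, or None; never raises."""
    match = _FRONT_MATTER.match(markdown or "")
    if match is None:
        return None
    in_aggregate = False
    for line in match.group(1).splitlines():
        if not line.strip():
            continue
        if not line.startswith((" ", "\t")):
            # A top-level key: track whether we are inside 'aggregate:'.
            in_aggregate = line.strip() == "aggregate:"
            continue
        if in_aggregate:
            key, _, raw = line.strip().partition(":")
            if key == "value":
                try:
                    value = float(raw.strip())
                except ValueError:
                    return None
                return value if math.isfinite(value) else None
    return None
```

Returning None for a malformed card means one bad upload degrades to "no score shown" instead of breaking the catalog build for every agent.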

Publish: `--eval-scorecard <path>` flag in `publish_to_r2.py`; the GHA
release workflow conditionally passes the versioned scorecard file when it
exists under `hub/agents/npm/agent-email/scorecards/<version>.md`.

Python catalog: `merge_with_registry` threads the two new optional fields
from the R2 index entry into the unified catalog dict so the UI backend
serves them alongside existing agent metadata.

Tests: two focused tests in routes.test.ts cover the present/absent
scorecard cases (69 tests total, all pass).
Adds `eval_score` and `eval_scorecard_url` optional fields to `AgentInfo`
in the frontend type definitions. When an agent has an eval score, the
detail modal renders an "Eval scorecard" section showing the numeric score
out of 100, with a "View scorecard" link when the URL is present. Renders
nothing when neither field is set (no empty section).
itomek marked this pull request as ready for review June 26, 2026 15:14
itomek requested a review from kovtcharov-amd as a code owner June 26, 2026 15:14
itomek self-assigned this Jun 26, 2026
@github-actions

Contributor

Verdict: Approve with suggestions

This adds a per-agent / per-version release eval scorecard — a versioned Markdown+YAML artifact, a standalone presence+regression release gate, a self-hosted refresh loop, and hub/Agent-UI surfacing — with the email agent as the first real adopter. It's a well-built, well-tested feature: harness-agnostic core, strong TDD coverage, fail-loudly on zero judged scenarios, and an unusually honest write-up of the 4.0 labeling artifact. Nothing here is blocking to merge.

The one thing worth resolving before any non-email agent publishes: the committed hello-world scorecard uses hand-authored numbers (90.0), which the new hub/UI path would surface verbatim as a real eval_score with no "illustrative" marker. That directly contradicts the system's own hard rule ("the scorecard MUST come from an actual eval, never hand-authored numbers"). It's harmless as an in-repo format demo, but it would mislead users if hello-world is ever published — so either keep it unpublished, or mark illustrative cards machine-readably so the hub doesn't show a fabricated score next to the email agent's honest one.

No security concerns.

🔍 Technical details

🟡 Important

Hand-authored hello-world scorecard would surface as a real hub score (hub/agents/python/hello-world/scorecards/0.1.0.md)
The card carries invented numbers (response_quality: 0.9 → aggregate.value: 90.0). The new publish path (publish_to_r2.py --eval-scorecard → worker parseScorecardScore → eval_score → AgentDetailModal) surfaces aggregate.value verbatim with no notion of "illustrative". This contradicts SKILL.md Phase 3 ("hard gate … never hand-authored numbers") and eval-scorecard.mdx. The body's "Illustrative metric" note is human-only: eval_score: 90.0 on the hub carries no caveat, so a fabricated 90.0 would sit next to the email agent's honest 4.0.
Recommend one of: (a) don't ship/publish a scorecard for the reference agent, or (b) add a machine-readable flag (e.g. illustrative: true in front matter) that the gate tolerates and the worker/UI badge or exclude from eval_score. If shipping it purely as an in-repo format example is intentional, a one-line note in eval-scorecard.mdx saying the hello-world card is illustrative-only and must not be published would close the contradiction.

🟢 Minor

Deprecated datetime.datetime.utcnow() (src/gaia/eval/release_scorecard.py:1708)
utcnow() is deprecated as of Python 3.12 and returns a naive datetime — inconsistent with gen_scorecard.py which correctly uses the timezone-aware form. Same pattern in the two test _make_payload helpers. Align on the tz-aware call:

        generated_at=datetime.datetime.now(datetime.timezone.utc).isoformat(),

(Several call sites — release_scorecard.py:1708, tests/unit/eval/test_release_scorecard.py, tests/unit/eval/test_scorecard_gate.py — share the pattern; worth a single sweep rather than separate fixes.)

Strengths

  • Loose coupling is enforced, not just asserted in prose: test_no_benchmark_or_agent_modules_imported baselines sys.modules in a fresh subprocess before importing, so it actually proves the core pulls in neither the harness nor the agent package.
  • Fail-loudly where it matters: build_payload raises on zero judged scenarios instead of silently emitting 0.0, and the worker's parseScorecardScore is the correct inverse: it never throws so a malformed scorecard can't break the catalog build.
  • Honest methodology: the 4.0 artifact is explained in the machine-readable methodology string, the README, and a tracking issue (fix(eval): email triage corpus ground-truth labels stale vs schema-2.0 5-bucket taxonomy #1874) rather than hidden or inflated, which is exactly the integrity the feature is meant to provide.
  • Hosted gate stays dependency-light, correctly: gaia/eval/__init__.py is empty, so the release-gate job's pip install -e . pyyaml imports scorecard_gate without the eval extras; the split between the hosted reject-on-worse gate and the self-hosted run-and-refresh loop is sound, and the refresh trigger paths exclude the npm scorecard dir so auto-commit can't self-retrigger.
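The subprocess-based loose-coupling check praised in the first bullet can be approximated as below. This is a generic sketch, not the PR's test: the helper names are hypothetical, and the demo probes the stdlib `csv` module against an `xml` prefix purely so the example runs anywhere, whereas the real test would target the scorecard core and the harness/agent packages.

```python
import json
import subprocess
import sys

# Runs in a fresh interpreter: snapshot sys.modules, import the target,
# then report only the modules that the import added.
_PROBE = """
import importlib, json, sys
before = set(sys.modules)
importlib.import_module(sys.argv[1])
print(json.dumps(sorted(set(sys.modules) - before)))
"""


def newly_imported(module_name: str) -> set:
    result = subprocess.run(
        [sys.executable, "-c", _PROBE, module_name],
        capture_output=True,
        text=True,
        check=True,
    )
    return set(json.loads(result.stdout))


def coupling_violations(module_name: str, forbidden_prefixes: tuple) -> list:
    """Modules pulled in by the import whose names start with a forbidden prefix."""
    return sorted(
        name for name in newly_imported(module_name)
        if name.startswith(forbidden_prefixes)
    )


print(coupling_violations("csv", ("xml",)))
```

Taking the baseline inside the child process is what avoids false positives from modules that the test runner or an editable-install path finder had already loaded.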

Tomasz Iniewicz added 2 commits June 26, 2026 12:49
After #1875 relabeled the eval corpus to the schema-2.0 triage taxonomy, the
email agent's predictions and the ground-truth labels share one vocabulary, so
category_accuracy now measures real agreement: 0.40 over 25 of 220 emails ->
aggregate 40.0/100 (was 4.0, a labeling artifact). Fresh gaia eval benchmark run
on AMD Strix Halo. Drop the now-resolved #1874 caveat from the adapter
methodology + README; align the dataset description to the schema-2.0 taxonomy.
@@ -0,0 +1,68 @@
---

Collaborator

Use README.md, SKILL.md, SPEC.md as examples. A single SCORECARD.md that gets updated and packaged up for release is what we need, not a separate named version file.

Comment thread on hub/agents/npm/agent-email/package.json (outdated)
  "SKILL.md",
- "LICENSE"
+ "LICENSE",
+ "scorecards/"

Collaborator

We don't want the scorecards for all versions shipped with every single release.

@@ -0,0 +1,318 @@
#!/usr/bin/env python3

Collaborator

How do we actually run eval to reproduce the results?

@@ -0,0 +1,68 @@
---
schema_version: 1

Collaborator

Scorecard should include reproduction steps or a reference to a doc that contains those instructions.

ground_truth: tests/fixtures/email/ground_truth.json
limit: 25
results:
test_cases_run: 25

Collaborator

Why only 25?

@kovtcharov-amd

Collaborator

40% is a pretty bad score, do we know why?

@kovtcharov-amd

Collaborator

@claude review this PR in detail. Make sure the eval scorecard clearly outlines the steps to reproduce the results and the scorecard provides high confidence of the agent being production-ready. Identify any gaps or issues.

@kovtcharov-amd

Collaborator

Can you include a screenshot of the email agent's scorecard as rendered on the hub site?

@github-actions

Contributor

The scorecard system is well-built and well-tested, but measured against your two bars — "clearly reproducible" and "high confidence of production-ready" — it's only partway there, and the honest answer to @kovtcharov-amd's earlier 40% question is the crux: this scorecard demonstrates the email agent is not production-ready yet, and it's correct to surface that rather than hide it.

On the 40%. After the #1875 corpus relabel, 40.0 is genuine agreement between the agent's triage labels and the ground truth (the earlier 4.0 was a vocabulary mismatch, now fixed). On a 5-bucket taxonomy, random chance is ~20%, so the Gemma-4-E4B agent is ~2× chance but agrees with only ~2 of every 5 emails — well below any production bar. The card is doing its job by reporting this number truthfully.

Reproducibility — partial. The reproduce command exists, but in the docs, not in the card; and the card pins only model name + limit. It does not pin the things that actually determine the number: the GAIA commit, the Lemonade Server version, the exact Gemma model build/quantization, the hardware (Strix Halo), and decoding params (temperature/seed). Re-run on a different model build or box and the 40 can move — so "reproduce exactly from the card alone" isn't guaranteed today.

Confidence — statistically weak as recorded. Only 25 of 220 emails (11%) were scored. A 40% point estimate on n=25 carries roughly a ±19-point 95% confidence interval — you genuinely can't distinguish 40% from ~25% or ~60%. A production-confidence claim needs the full 220 run (or the CI reported on the card).

Biggest diagnostic gap (also answers "why 40%"): the benchmark already computes per-category accuracy, but the card averages it into one number. For triage, missing an urgent email matters far more than mislabeling a promotional one — and right now nobody can see which buckets are failing. A per-bucket breakdown / confusion matrix is the single most useful addition, both for the score's credibility and for knowing what to fix.

Plus the previously-flagged hand-authored hello-world 90.0 — it would render on the hub as a real score next to the email agent's honest 40, so resolve it before any second agent publishes.

Net: merge the system on its merits, but I wouldn't read this card as a production-readiness sign-off — it's an honest "needs work" baseline. Concrete asks below.

🔍 Technical details

Reproducibility

  • Reproduce command lives only in docs/reference/eval-scorecard.mdx:167-175; the committed card (hub/agents/npm/agent-email/scorecards/0.2.4.md) recipe.config has harness/model/corpus/ground_truth/limit but no copy-paste command and no environment pin.
  • Add to recipe.config (or a new environment block): GAIA git SHA, Lemonade Server version, exact model id/quantization/build hash, hardware, and temperature/seed. Without these the regression gate (scorecard_gate.py, aggregate-only compare) can move for reasons unrelated to the agent — exactly the failure mode the --limit Warning at eval-scorecard.mdx:99-101 calls out, generalized to the whole environment.

Sample size / CI

  • results.test_cases_run: 25, recipe.dataset.size: 220. SE ≈ √(0.4·0.6/25) ≈ 0.098 → 95% CI ≈ 40% ± 19pp (~21–59%). Recommend either --limit 220 (full corpus) for the release card, or record n and the CI alongside the point estimate so the gate/readers aren't comparing noise.
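The interval quoted above follows from the normal-approximation formula for a binomial proportion. A short sketch that reproduces it is below; the function name is illustrative, and z = 1.96 is the usual 95% multiplier.

```python
import math
from typing import Tuple


def normal_approx_interval(p: float, n: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Return (standard_error, low, high) for an observed proportion p over n trials."""
    se = math.sqrt(p * (1 - p) / n)
    return se, p - z * se, p + z * se


se, low, high = normal_approx_interval(0.40, 25)
print(f"{se:.3f}")               # 0.098
print(f"{low:.2f}..{high:.2f}")  # 0.21..0.59
```

By the same formula, the later 100-email run at 0.46 has a margin of roughly ±10 points, so a larger n narrows the interval but still leaves it wide.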

Per-category breakdown

  • tests/fixtures/eval/email_benchmark_scorecard.json shows each scenario already carries quality.category_accuracy per category; build_payload in hub/agents/python/email/packaging/gen_scorecard.py collapses them to a single mean (category_accuracy 0.4). The per-bucket signal is computed and then discarded. Emitting per-class precision/recall (esp. urgent recall) into results.metrics would make the card actionable and let the aggregate weight safety-critical buckets.
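A per-class breakdown of the kind suggested here can be computed from (truth, predicted) label pairs. The sketch below is generic: the function name and output shape are assumptions, and the six label pairs are invented toy data, not results from the email corpus.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def per_class_report(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, Optional[float]]]:
    """Per-label recall, precision and support from (truth, predicted) pairs."""
    truth_totals, pred_totals, correct = Counter(), Counter(), Counter()
    for truth, predicted in pairs:
        truth_totals[truth] += 1
        pred_totals[predicted] += 1
        if truth == predicted:
            correct[truth] += 1
    report = {}
    for label in sorted(set(truth_totals) | set(pred_totals)):
        report[label] = {
            # None marks an undefined ratio (label never occurred / never predicted).
            "recall": correct[label] / truth_totals[label] if truth_totals[label] else None,
            "precision": correct[label] / pred_totals[label] if pred_totals[label] else None,
            "support": truth_totals[label],
        }
    return report


# Invented toy data: a model that over-assigns needs_response.
toy = [
    ("urgent", "urgent"), ("urgent", "needs_response"),
    ("fyi", "fyi"), ("fyi", "needs_response"),
    ("needs_response", "needs_response"), ("promotional", "needs_response"),
]
for label, row in per_class_report(toy).items():
    print(label, row)
```

In the toy data, needs_response has perfect recall but 0.25 precision, which is the pattern an over-assigned bucket produces and which a single averaged accuracy hides.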

hello-world fabricated score (re-affirming the earlier review)

  • hub/agents/python/hello-world/scorecards/0.1.0.md has aggregate.value: 90.0 from hand-authored response_quality: 0.9. The publish path (publish_to_r2.py --eval-scorecard → worker parseScorecardScore → eval_score → AgentDetailModal.tsx) surfaces it verbatim with no "illustrative" marker, contradicting the system's own "never hand-authored numbers" rule. Either don't publish it, or add a machine-readable illustrative: true the gate tolerates and the worker excludes from eval_score.

Minor

  • release_scorecard.py:1708 uses deprecated naive datetime.datetime.utcnow(); gen_scorecard.py uses the tz-aware form. Align (also in the two test _make_payload helpers).

@kovtcharov-amd — flagging for your call: the scorecard framework is solid and worth landing, but I'd treat the 40.0 as a baseline-to-improve, not a green light. The separate "rendered hub screenshot" request can't be produced in-PR — the hub worker goes live only after a manual wrangler deploy + email re-publish (per the PR's own Notes), so there's nothing rendered to capture until then.

@kovtcharov

Collaborator

How to render the scorecard as a tab on the hub page

Right now the scorecard surfaces as a score badge + external link in the detail modal, not as a doc tab. The reason: the hub page builds its doc tabs from the markdown body fields the worker puts in IndexEntry: readme, changelog, and the two this PR added, spec and skill. For the scorecard the worker reads the markdown (readEvalScorecard) but then only keeps the parsed number + URL (eval_score, eval_scorecard_url) and discards the body, so there's nothing for a tab to render against.

Making it a tab is the same shape as the spec/skill work already in this PR — carry the markdown through, then add the tab:

🔍 Concrete steps

1. Worker — persist the scorecard markdown into IndexEntry (mirror spec/skill):

  • workers/agent-hub/src/types.ts — add the field next to spec/skill:
    /** Eval scorecard markdown of the latest version; "" if none was published. */
    eval_scorecard: string;
  • workers/agent-hub/src/catalog.ts: in toIndexEntry, evalScorecard is already a param (you only consume it for parseScorecardScore + the URL today). Also store the body next to spec and skill:
    spec,
    skill,
    eval_scorecard: evalScorecard ?? "",

That's the whole worker change — evalScorecardKey, readEvalScorecard, the R2 upload, and rebuildIndex plumbing all already exist. Add one route test asserting entry.eval_scorecard carries the body (copy the spec/skill assertion you just added in test/routes.test.ts).

2. Hub frontend — add the tab. The public hub page (the consumer of index.json, per docs/spec/agent-hub-restructure.mdx) renders readme/changelog/spec/skill as sanitized-markdown tabs. Add a "Scorecard" entry to that same tab list pointing at entry.eval_scorecard, reusing the existing markdown renderer, and only show the tab when the body is non-empty (so agents without a scorecard don't get an empty tab). Keep the existing badge as a quick-glance summary — the tab is the full recipe + recomputation.

3. (Optional) Electron AgentDetailModal. This modal has no tab strip — it's stacked sections, so leaving the badge + "View scorecard" link there is reasonable. If you want the body inline here too, render agent.eval_scorecard through the same markdown component used for other sections.

Net: one field carried through the worker + one tab in the hub frontend, both following the spec/skill pattern this PR already establishes. The badge stays as the at-a-glance score; the tab gives readers the full recipe and the by-hand recomputation without leaving the page.

Tomasz Iniewicz added 4 commits June 26, 2026 13:29
…production section

Storage convention changes from scorecards/<version>.md to a single SCORECARD.md
updated in place (versioned via publish snapshot, same as README.md).

- release_scorecard.py: add reproduction_command to ResultPayload; render_scorecard
  emits a Reproduction section; carry_forward reads version from front matter instead
  of filename stem; remove latest_version_below (per-version dirs gone); fix utcnow
  -> now(utc)
- scorecard_gate.py: redesigned to accept --scorecard SCORECARD.md + optional
  --baseline-file / --baseline-ref (mutually exclusive); no --scorecards-dir or
  --version flags; --baseline-ref resolves via git show; absence at ref = first
  adoption pass; git-shellout-free when --baseline-file is used
- gen_scorecard.py: writes hub/agents/npm/agent-email/SCORECARD.md (not
  scorecards/<ver>.md); supplies reproduction_command with exact env vars and commands
- tests: updated for new carry_forward signature, new gate interface, reproduction
  section assertions, second-agent generalization test, utcnow -> now(utc)
… agent

- hub/agents/npm/agent-email/SCORECARD.md: generated from relabeled-corpus run
  (placeholder; orchestrator will regenerate from full run)
- hub/agents/npm/agent-email/package.json: files array includes SCORECARD.md,
  removes scorecards/ (don't ship all versions in the npm tarball)
- hub/agents/npm/agent-email/README.md: scorecard link updated to ./SCORECARD.md
- Delete hub/agents/npm/agent-email/scorecards/ (per-version dir, now obsolete)
- Delete hub/agents/python/hello-world/scorecards/ (contained fabricated 90.0 score)
…RD.md

- storage.ts: evalScorecardKey now returns SCORECARD.md (was eval-scorecard.md)
- publish.ts: update comment for SCORECARD.md
- routes.test.ts: expect eval_scorecard_url to end in /SCORECARD.md
- publish_to_r2.py: update --eval-scorecard help text to reference SCORECARD.md
- release_agent_email.yml: scorecard-gate uses new --scorecard / --baseline-ref
  interface; computes prev tag via git describe; publish step points at SCORECARD.md
- email_scorecard_refresh.yml: use SCORECARD.md env var throughout; same-version
  check and cross-version gate use new gate interface with --baseline-ref
…onvention

- eval-scorecard.mdx: storage convention is now a single SCORECARD.md (not
  scorecards/<ver>.md); gate uses --scorecard + --baseline-ref/--baseline-file;
  carry_forward reads version from front matter; Reproduction section documented;
  npm files include SCORECARD.md only (not scorecards/ dir)
- SKILL.md: doc-root/SCORECARD.md as single file; reproduction_command in adapter;
  gate CLI updated to --scorecard / --baseline-ref pattern; Phase 4 examples updated
@kovtcharov

Collaborator

Move the eval into the release pipeline — don't run it per-PR

The eval gating is wired backwards. The real eval lives in email_scorecard_refresh.yml, which fires on every push to a non-main branch touching hub/agents/python/email/** or tests/fixtures/email/** (and auto-commits a regenerated scorecard back to the branch). Meanwhile the release pipeline's scorecard-gate job runs no eval at all — it parses a pre-committed card on ubuntu-latest. So the published score is whatever was hand-committed, and the "keep-it-honest" eval never runs against the bytes being shipped.

It's also never actually executed: both pushes on this branch queued with 0 jobs, waiting on the absent [self-hosted, lemonade-eval] runner (e.g. run 28252399720, pending 30+ min). So the refresh loop has run zero times.

The eval should run once, at tagged release, as a publish blocker — and meet-or-beat the previously published score:

  1. Drop the push: trigger from email_scorecard_refresh.yml (keep workflow_dispatch as a manual refresh, or delete the file). PRs shouldn't run the eval or auto-commit scorecards — that's a 90-min self-hosted job on a single shared backend slot, fired per commit.
  2. Replace the file-parsing scorecard-gate in release_agent_email.yml with a real-eval gate on [self-hosted, lemonade-eval] that, before publish:
    • runs gaia eval benchmark for the tagged version,
    • regenerates the card via gen_scorecard.py,
    • fails unless the fresh aggregate ≥ the previously published version's score (reuse gaia.eval.scorecard_gate's cross-version check — just feed it the freshly-evaluated card instead of a committed one),
    • and publishes that fresh card.
  3. Keep publish needs:-ing this job, so a regression or an offline runner blocks the release — fail loud, never publish without proof.

Net: eval runs at release only, the published number is tied to the shipped bytes, "no worse than last release" becomes a hard publish gate, and scorecards stop being mid-PR branch mutations.

🔍 Two correctness issues spotted while reviewing the eval code
  • gen_scorecard.py (build_payload) — aggregate is an unweighted scenario mean, reported against a summed email count. category_accuracy = sum(s.quality.category_accuracy for s in judged) / len(judged) while test_cases_run = sum(total_emails). With one scenario (today's run) it's fine, but a multi-scenario run with unequal total_emails (e.g. --experiments, multiple models, partial runs) reports a per-scenario mean as if it were per-email accuracy — and that's the headline number the gate compares. Weight each scenario by its total_emails.
  • scorecard_gate.py:230: float(candidate) < float(prev) doesn't reject non-finite values. validate_scorecard checks aggregate.value is present but not that it's finite; a hand-edited/corrupt card with value: .nan makes every comparison x < nan → False → permanent PASS. Low likelihood (the generator can't emit NaN), but a math.isfinite guard in validate_scorecard closes it cheaply.
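Both issues are easy to demonstrate in isolation. The sketch below is not the PR's code: it uses a flattened, hypothetical scenario shape (`category_accuracy` and `total_emails` at the top level) and invented numbers to show how the unweighted and email-weighted means diverge, and how a finiteness guard turns a NaN comparison from a silent pass into an error.

```python
import math
from typing import List


def unweighted_accuracy(scenarios: List[dict]) -> float:
    """Per-scenario mean: every scenario counts equally, whatever its size."""
    return sum(s["category_accuracy"] for s in scenarios) / len(scenarios)


def weighted_accuracy(scenarios: List[dict]) -> float:
    """Per-email accuracy: each scenario is weighted by its total_emails."""
    total = sum(s["total_emails"] for s in scenarios)
    if total <= 0:
        raise ValueError("no emails evaluated")
    return sum(s["category_accuracy"] * s["total_emails"] for s in scenarios) / total


def is_regression(candidate: float, previous: float) -> bool:
    candidate, previous = float(candidate), float(previous)
    if not (math.isfinite(candidate) and math.isfinite(previous)):
        raise ValueError("aggregate.value must be a finite number")
    return candidate < previous


# Invented numbers: a small easy scenario and a large hard one.
runs = [
    {"category_accuracy": 0.9, "total_emails": 10},
    {"category_accuracy": 0.4, "total_emails": 90},
]
print(round(unweighted_accuracy(runs), 2))  # 0.65
print(round(weighted_accuracy(runs), 2))    # 0.45
print(float("nan") < 55.0)                  # False: an unguarded gate would pass
```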

Tomasz Iniewicz added 2 commits June 26, 2026 13:38
- subprocess.run: add check=False (W1510)
- Remove bare f-strings with no interpolated vars (W1309)
- black reformatted test_scorecard_gate.py
…reproduction

Regenerate the email v0.2.4 SCORECARD.md from a full-corpus gaia eval benchmark
run on AMD Strix Halo: category_accuracy 0.46 over 100 of 220 emails (the triage
tool processes up to 100 per call) -> aggregate 46.0/100. Errors are dominated by
the inherently-ambiguous fyi<->needs_response boundary; the model over-assigns
NEEDS_RESPONSE. Fix the adapter's reproduction command to be portable (generic
/tmp/email-eval output dir, full model/mbox/ground-truth/output-dir flags) — no
local absolute path in the published artifact. README reflects 46.0.
@itomek commented Jun 26, 2026

Collaborator Author

Thanks for the review — addressed all of it; pushed to the branch.

Single SCORECARD.md (your main point): switched from per-version scorecards/<version>.md files to one SCORECARD.md per agent, updated in place and packaged per release — exactly like README.md / SPEC.md / SKILL.md. Per-version uniqueness rides the publish snapshot (R2 stores it at agents/<id>/<version>/SCORECARD.md). Deleted the scorecards/ dirs; package.json now ships only SCORECARD.md, not all versions.

Reproduction (your "how do we actually run eval?"): every scorecard now has a ## Reproduction section with the exact, copy-pasteable commands (env + gaia eval benchmark + gen_scorecard.py) plus links to the docs and the new adding-eval-scorecard skill.

"Why only 25?" → now the full corpus. Re-ran over the whole corpus on AMD Strix Halo. The triage tool processes up to 100 emails per call, so the effective run is 100 of 220 labeled examples — recorded distinctly (test_cases_run: 100, dataset.size: 220).

"40% — do we know why?" Yes. Full-corpus result is category_accuracy 0.46 → aggregate 46.0. The errors are dominated by the genuinely-ambiguous fyi ↔ needs_response boundary (28 of 54 errors), and the model over-assigns NEEDS_RESPONSE (urgent/promotional leak into it). Per-class recall 36–54%. For a 4B local model (Gemma-4-E4B) on a 5-way subjective triage task, that's ~2.3× the random baseline — a real number with clear headroom (prompt / few-shot tuning), recorded honestly rather than inflated.

[bot review] Fixed both: removed the committed hello-world card (its hand-authored 90.0 would have surfaced as a real hub eval_score) — generalization is now proven by a unit test instead; and swept the deprecated datetime.utcnow() → tz-aware.

Gate, after the redesign

The gate now compares a single SCORECARD.md against a baseline: --baseline-file <path> (unit tests) or --baseline-ref <git-ref> (CI uses the previous release tag, best-effort). Presence-only passes when there's no baseline. Verified live: presence → exit 0; baseline 55.0 vs candidate 46.0 → exit 1; --allow-regression → exit 0. 65 Python + 69 worker tests pass; lint clean; both tsc --noEmit clean.
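The redesigned command-line surface described above can be sketched with argparse. The flag names are the ones this thread mentions; the parser layout and the `read_baseline_from_ref` helper are assumptions for illustration, not the contents of scorecard_gate.py. The helper treats any failing `git show` as "no baseline", which also covers a mistyped ref, so a real gate may want to distinguish those cases.

```python
import argparse
import subprocess
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="scorecard_gate")
    parser.add_argument("--scorecard", required=True, help="path to SCORECARD.md")
    # At most one baseline source may be given; with neither, the check is presence-only.
    baseline = parser.add_mutually_exclusive_group()
    baseline.add_argument("--baseline-file", help="path to a previous SCORECARD.md")
    baseline.add_argument("--baseline-ref", help="git ref holding the previous card")
    parser.add_argument("--allow-regression", action="store_true")
    return parser


def read_baseline_from_ref(ref: str, path: str) -> Optional[str]:
    """Return the file's content at the given ref, or None if git cannot show it."""
    result = subprocess.run(
        ["git", "show", f"{ref}:{path}"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout if result.returncode == 0 else None


args = build_parser().parse_args(["--scorecard", "SCORECARD.md", "--baseline-file", "old.md"])
print(args.baseline_file, args.baseline_ref, args.allow_regression)  # old.md None False
```

Passing both --baseline-file and --baseline-ref makes argparse exit with status 2, which is how the mutual exclusion is enforced in this sketch.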

One thing to confirm: the cross-version baseline resolves the previous release via git describe --tags --match 'agent-pkg-email-*'. If the email release tags use a different pattern, point me at it and I'll fix the match.

Labels

devops DevOps/infrastructure changes documentation Documentation changes eval Evaluation framework changes performance Performance-critical changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(eval): per-agent per-version eval scorecard + release gate (presence + regression)

3 participants