amd · itomek · Jun 25, 2026
@@ -0,0 +1,85 @@
+---
+name: "adding-eval-scorecard"
+description: "Adopt the per-agent eval scorecard for a GAIA hub agent: write the harness→payload adapter, run the eval to produce a REAL scorecard, link + surface it from the agent's README, wire the release gate, and (for a new agent) generalize the format. Use when asked to 'add a scorecard', 'adopt the eval scorecard', 'generate the scorecard for <agent>', or wire scorecard CI for an agent. Builds on docs/reference/eval-scorecard.mdx and the email agent reference adapter."
+---
+
+# Adding an Eval Scorecard to a GAIA Agent
+
+Adopt the release **eval scorecard** ([`docs/reference/eval-scorecard.mdx`](../../../docs/reference/eval-scorecard.mdx)) for one hub agent. The system is `harness → result payload → generator → scorecard`, with a standalone presence+regression release gate. The **email agent is the reference implementation** — mirror it.
+
+**Core modules (do not modify; reuse):**
+- `src/gaia/eval/release_scorecard.py` — `ResultPayload`, `compute_aggregate`, `render_scorecard`, `write_scorecard`, `validate_scorecard`, `carry_forward`, `latest_version_below`. Harness-agnostic (stdlib + PyYAML only).
+- `src/gaia/eval/scorecard_gate.py` — the standalone gate (`python -m gaia.eval.scorecard_gate`).
+- Reference adapter: `hub/agents/python/email/packaging/gen_scorecard.py`.
+
+This is a **phased checklist with a hard gate at the real-eval step** — the scorecard MUST come from an actual eval run, never hand-authored numbers.
+
+## Phase 1 — Locate the agent's surfaces
+
+1. **Version source of truth** = the `version:` field in `<agent>/gaia-agent.yaml`. Never invent a parallel scheme.
+2. **Canonical README** (where the scorecard is linked + surfaced): for an npm-published agent it is the npm client README (e.g. `hub/agents/npm/<id>/README.md`), NOT a `packaging/README.md`. For a Python-only agent it is `hub/agents/python/<id>/README.md`. Confirm which by checking what `release_agent_<id>.yml` publishes (`README:` env) — the published README is the one to link.
+3. **doc-root** = the directory holding that canonical README. Scorecards live at `<doc-root>/scorecards/<version>.md`.
+4. **Eval vehicle**: what existing harness produces this agent's accuracy metric? (email → `gaia eval benchmark` over `tests/fixtures/email/`.) If none exists, STOP and surface that — propose the minimal harness before building; do not invent numbers.
+
+## Phase 2 — Write the adapter (harness → payload)
+
+Copy `hub/agents/python/email/packaging/gen_scorecard.py` as the template. The adapter:
+- imports ONLY `gaia.eval.release_scorecard` (never the harness or agent package — preserve loose coupling);
+- reads the harness output, builds a `ResultPayload`;
+- defines **"judged"** explicitly and **raises loudly** if zero results are judged (no silent 0.0);
+- records **dataset size** (total labeled examples) and **test_cases_run** (subset executed) as DISTINCT fields;
+- stores **repo-relative** paths only (never a local absolute path — it ships in a published artifact);
+- records the eval `limit`/config so future regression checks are comparable;
+- writes to `<doc-root>/scorecards/<version>.md`.
+
+Add an offline unit test against a committed sample harness-output fixture (see `tests/fixtures/eval/email_benchmark_scorecard.json` + `tests/unit/eval/test_release_scorecard.py::TestEmailAdapter`) so the adapter is testable without a live model.
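Note on Phase 2: the explicit "judged" definition and the loud failure on zero judged results could look roughly like the sketch below. It is a standalone illustration, not the reference adapter. The `HarnessResult` shape, its `expected`/`predicted` fields, and the `judged_accuracy` helper are all hypothetical; a real adapter would build a `ResultPayload` from `gaia.eval.release_scorecard` rather than return a dict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HarnessResult:
    """Hypothetical per-example harness record (not the real GAIA schema)."""
    expected: str
    predicted: Optional[str]  # None means the harness produced no verdict


def judged_accuracy(results: list[HarnessResult], dataset_size: int) -> dict:
    """Compute accuracy over judged results only; fail loudly if none were judged."""
    # "Judged" here means the harness returned a prediction to compare with the label.
    judged = [r for r in results if r.predicted is not None]
    if not judged:
        raise ValueError(
            f"0 of {len(results)} results were judged; refusing to record a 0.0 score"
        )
    correct = sum(1 for r in judged if r.predicted == r.expected)
    return {
        "dataset_size": dataset_size,    # total labeled examples
        "test_cases_run": len(results),  # subset actually executed
        "judged": len(judged),
        "accuracy": correct / len(judged),
    }
```

Keeping `dataset_size` and `test_cases_run` as separate keys mirrors the checklist's requirement that the two never be conflated.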
+
+## Phase 3 — Run the REAL eval (hard gate — no hand-authored numbers)
+
+The accuracy number must come from an actual run. For the email agent:
+
+```bash
+# Real eval needs Lemonade + the model. Prefer AMD hardware (Strix Halo / Ryzen AI);
+# the [self-hosted, lemonade-eval] runner is the canonical environment.
+GAIA_AGENT_TOOL_TIMEOUT=900 \
+PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring \
+PYTHONPATH="$(pwd)" \
+  <venv>/bin/gaia eval benchmark \
+    --model Gemma-4-E4B-it-GGUF \
+    --mbox-path tests/fixtures/email/synthetic_inbox.mbox \
+    --ground-truth tests/fixtures/email/ground_truth.json \
+    --limit 25 --output-dir <persistent-dir>
+
+<venv>/bin/python hub/agents/python/email/packaging/gen_scorecard.py \
+    --benchmark-dir <persistent-dir> --limit 25
+```
+
+**Headless gotchas (see memory `project-email-benchmark-headless-gotchas`):**
+- `PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring` — the email agent's calendar-connector resolution blocks forever on the macOS Keychain (and can stall on Linux SecretService) in non-interactive contexts. Without this it hangs at 0% CPU during agent construction.
+- `PYTHONPATH="$(pwd)"` — the benchmark imports `tests.fixtures.email.*`; the console script doesn't add the repo root.
+- `GAIA_AGENT_TOOL_TIMEOUT=900` — triage of N emails is one tool call; the 180s default abandons it on slow backends, yielding a degenerate 0-email FAIL run.
+- Write `--output-dir` to a **persistent** dir, not `/tmp` (cleared on session resume).
+- Record honestly: if the metric is low for a known reason (e.g. a taxonomy/label mismatch), put the explanation in the adapter's `methodology` string and link the tracking issue — never inflate the number.
+
+## Phase 4 — Surface, link, and gate
+
+1. **Link + surface** from the canonical README: a one-line `Eval scorecard (vX.Y.Z): aggregate N/100 … ([./scorecards/X.Y.Z.md](./scorecards/X.Y.Z.md))`. The relative link must resolve in-repo.
+2. **npm `files`**: if the agent publishes on npm, add `scorecards/` to `package.json` `files` so the link resolves on the published package too.
+3. **Hub display**: a published scorecard surfaces on the agent's hub page / Agent UI detail view (see `workers/agent-hub` + `AgentDetailModal.tsx`); ensure the publish step uploads the scorecard alongside the README.
+4. **Release gate**: add a `scorecard-gate` job to `release_agent_<id>.yml` and list it in `publish.needs`. The job runs on a GitHub-hosted runner (it only parses committed files — no eval):
+   ```bash
+   python -m gaia.eval.scorecard_gate \
+     --scorecards-dir <doc-root>/scorecards \
+     --manifest hub/agents/python/<id>/gaia-agent.yaml
+   ```
+   The job must NOT have `continue-on-error`, an `environment:`, or a `permissions:` override (inherits `contents: read`; needs no secrets).
+5. **Auto-update/reject loop**: for re-running on agent changes and refreshing the committed scorecard, see `eval-scorecard.mdx` "Keeping the scorecard current" and the self-hosted refresh workflow — reject-on-worse is the gate; better-or-equal refreshes the committed card.
+
+## Phase 5 — Verify (evidence before "done")
+
+Run and capture: the generated `<version>.md`; the gate **passing** on it (exit 0); the gate **blocking** a manufactured regression (exit 1) and a missing card (exit 1); a by-hand recompute of the aggregate from `aggregate.components` matching the recorded value. Run `python util/lint.py --all` and the eval unit tests. These are the PR's real-world proof.
+
+## Versioning
+
+- **Patch** release → `carry_forward(prev_path, new_version)` (copies results verbatim, sets `inherited_from`); do NOT re-run the eval.
+- **Minor/major** release → re-run the eval (Phase 3); `carry_forward` refuses a non-patch bump with a "re-run" error.
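Note on Versioning: the patch versus minor/major rule reduces to a small version comparison. The sketch below assumes plain `X.Y.Z` version strings. `parse_version` and `is_patch_bump` are hypothetical helpers for illustration and are not the actual logic inside `carry_forward`.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split a plain 'X.Y.Z' string into integers (pre-release tags are not handled)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def is_patch_bump(prev: str, new: str) -> bool:
    """True when only the patch component increased; major and minor must match."""
    p_major, p_minor, p_patch = parse_version(prev)
    n_major, n_minor, n_patch = parse_version(new)
    return (p_major, p_minor) == (n_major, n_minor) and n_patch > p_patch
```

Under this rule a bump that `is_patch_bump` rejects would be the case where the eval has to be re-run.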
@@ -0,0 +1,162 @@
+# Copyright(C) 2025-2026 Advanced Micro Devices, Inc. All rights reserved.
+# SPDX-License-Identifier: MIT
+
+# Email agent eval-scorecard refresh + regression gate (#1862).
+#
+# Answers "how does a PR that changes the agent keep the scorecard honest?":
+# when the email agent's LLM-affecting code (or the eval corpus) changes, this
+# re-runs the REAL eval, regenerates the scorecard, and then:
+#   - score IMPROVED or held  -> commits the refreshed scorecard to the branch
+#   - score REGRESSED          -> fails the job (the worse card is NOT committed)
+#
+# `gaia eval benchmark` needs Lemonade on AMD hardware, so this runs ONLY on the
+# self-hosted [self-hosted, lemonade-eval] pool — GitHub-hosted runners cannot run
+# it. The release-time `scorecard-gate` job in release_agent_email.yml is the
+# hosted-CI backstop (it parses committed files only, no eval).
+#
+# Two regression checks run here:
+#   1. SAME-VERSION: fresh aggregate vs the currently-committed card for this
+#      version — stops a noisy/worse re-run from silently overwriting a good score.
+#   2. CROSS-VERSION: `gaia.eval.scorecard_gate` — fresh card vs the prior version.
+#
+# Auto-commit needs `contents: write` and only works on the repo's own branches;
+# a fork PR's GITHUB_TOKEN is read-only — for forks, run the eval locally / on AMD
+# hardware and commit the scorecard by hand (the release gate still enforces it).
+
+name: Email Agent Eval — scorecard refresh
+
+on:
+  workflow_dispatch:
+    inputs:
+      limit:
+        description: 'Messages to triage (must match the committed scorecard for comparability)'
+        required: false
+        default: '25'
+      model:
+        description: 'Lemonade model id'
+        required: false
+        default: 'Gemma-4-E4B-it-GGUF'
+  push:
+    branches-ignore:
+      - main
+    paths:
+      - 'hub/agents/python/email/**'
+      - 'tests/fixtures/email/**'
+      - 'src/gaia/eval/release_scorecard.py'
+      - 'src/gaia/eval/scorecard_gate.py'
+
+concurrency:
+  # Share the single Lemonade backend slot with the other self-hosted evals so two
+  # runs never race-evict each other's model (CLAUDE.md: evals run serially).
+  group: lemonade-eval
+  cancel-in-progress: false
+
+permissions:
+  contents: write   # auto-commit the refreshed scorecard to the branch
+
+env:
+  SCORECARD_DIR: hub/agents/npm/agent-email/scorecards
+  MANIFEST: hub/agents/python/email/gaia-agent.yaml
+  LIMIT: ${{ github.event.inputs.limit || '25' }}
+  MODEL: ${{ github.event.inputs.model || 'Gemma-4-E4B-it-GGUF' }}
+
+jobs:
+  refresh:
+    name: Re-run eval, refresh-or-reject scorecard
+    runs-on: [self-hosted, lemonade-eval]
+    timeout-minutes: 90
+    steps:
+      - name: Checkout (the pushed branch)
+        uses: actions/checkout@v6
+        with:
+          ref: ${{ github.head_ref || github.ref_name }}
+
+      - name: Set up Python
+        uses: actions/setup-python@v6
+        with:
+          python-version: '3.10'
+
+      - name: Install in isolated venv
+        run: |
+          python -m venv .venv-scorecard
+          source .venv-scorecard/bin/activate
+          python -m pip install --upgrade pip
+          pip install -e ".[dev,eval,api]"
+          echo "$PWD/.venv-scorecard/bin" >> "$GITHUB_PATH"
+
+      - name: Resolve version + capture currently-committed aggregate
+        id: pre
+        run: |
+          set -euo pipefail
+          VERSION=$(python -c "import yaml; print(yaml.safe_load(open('${MANIFEST}'))['version'])")
+          echo "version=${VERSION}" >> "$GITHUB_OUTPUT"
+          CARD="${SCORECARD_DIR}/${VERSION}.md"
+          # Aggregate of the card as committed on this branch (empty if new).
+          if git cat-file -e "HEAD:${CARD}" 2>/dev/null; then
+            git show "HEAD:${CARD}" > /tmp/committed_card.md
+            COMMITTED=$(python -c "from gaia.eval.release_scorecard import parse_scorecard; print(parse_scorecard(__import__('pathlib').Path('/tmp/committed_card.md'))['aggregate']['value'])")
+          else
+            COMMITTED=""
+          fi
+          echo "committed=${COMMITTED}" >> "$GITHUB_OUTPUT"
+          echo "Version ${VERSION}; committed aggregate: ${COMMITTED:-<none>}"
+
+      - name: Run the email-triage benchmark (real eval)
+        env:
+          # The agent's calendar-connector resolution blocks on the OS keyring in
+          # a headless context — disable it so construction doesn't hang.
+          PYTHON_KEYRING_BACKEND: keyring.backends.null.Keyring
+          # Triage of N emails is one tool call; the 180s default abandons it on a
+          # slow backend and yields a degenerate 0-email run.
+          GAIA_AGENT_TOOL_TIMEOUT: '900'
+          PYTHONPATH: ${{ github.workspace }}
+        run: |
+          set -euo pipefail
+          rm -rf eval-out && mkdir -p eval-out
+          gaia eval benchmark \
+            --model "${MODEL}" \
+            --mbox-path tests/fixtures/email/synthetic_inbox.mbox \
+            --ground-truth tests/fixtures/email/ground_truth.json \
+            --limit "${LIMIT}" \
+            --output-dir eval-out
+
+      - name: Regenerate the scorecard from the real run
+        run: |
+          set -euo pipefail
+          python hub/agents/python/email/packaging/gen_scorecard.py \
+            --benchmark-dir eval-out --limit "${LIMIT}"
+
+      - name: Same-version regression check (reject a worse re-run)
+        run: |
+          set -euo pipefail
+          VERSION="${{ steps.pre.outputs.version }}"
+          COMMITTED="${{ steps.pre.outputs.committed }}"
+          CARD="${SCORECARD_DIR}/${VERSION}.md"
+          FRESH=$(python -c "from gaia.eval.release_scorecard import parse_scorecard; print(parse_scorecard(__import__('pathlib').Path('${CARD}'))['aggregate']['value'])")
+          echo "fresh aggregate: ${FRESH} | committed: ${COMMITTED:-<none>}"
+          if [ -n "${COMMITTED}" ] && python -c "import sys; sys.exit(0 if float('${FRESH}') < float('${COMMITTED}') else 1)"; then
+            echo "::error::Scorecard regression for v${VERSION}: re-run scored ${FRESH} < committed ${COMMITTED}. Not committing the worse card. Investigate, or override intentionally via --allow-regression in a manual commit."
+            git checkout -- "${CARD}" || true
+            exit 1
+          fi
+          echo "No same-version regression — fresh score is >= committed."
+
+      - name: Cross-version gate (fresh card vs prior version)
+        run: |
+          set -euo pipefail
+          python -m gaia.eval.scorecard_gate \
+            --scorecards-dir "${SCORECARD_DIR}" \
+            --manifest "${MANIFEST}"
+
+      - name: Commit the refreshed scorecard (only if it changed for the better/equal)
+        run: |
+          set -euo pipefail
+          if [ -z "$(git status --porcelain -- "${SCORECARD_DIR}")" ]; then  # also catches a new, untracked card
+            echo "Scorecard unchanged — nothing to commit."
+            exit 0
+          fi
+          git config user.name  "${{ github.actor }}"
+          git config user.email "${{ github.actor }}@users.noreply.github.com"
+          git add "${SCORECARD_DIR}"
+          git commit -m "eval(email): refresh v${{ steps.pre.outputs.version }} scorecard from benchmark run"
+          git push origin "HEAD:${{ github.head_ref || github.ref_name }}"
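Note on the refresh-or-reject rule: the workflow applies it with inline `python -c` comparisons, and the same decision can be summarized as one function. This is an illustrative sketch only; `decide_refresh` is a hypothetical name, and it assumes aggregates arrive as numeric strings (or an empty value when no card is committed yet).

```python
from typing import Optional


def decide_refresh(fresh: str, committed: Optional[str]) -> str:
    """Return 'commit' or 'reject' for a freshly generated aggregate."""
    fresh_value = float(fresh)  # raises ValueError on a malformed score
    if not committed:
        return "commit"  # no card committed for this version yet
    if fresh_value < float(committed):
        return "reject"  # regression: keep the committed card
    return "commit"  # improved or held
```

Raising on a malformed score, instead of mapping it to either outcome, keeps a parsing failure from being mistaken for a clean run.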
@@ -266,11 +266,28 @@ jobs:
             echo "ok=false" >> "$GITHUB_OUTPUT"
           fi
 
+  # ── Stage 1b: scorecard presence + regression gate ─────────────────
+  scorecard-gate:
+    name: Scorecard gate
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6
+      - uses: actions/setup-python@v6
+        with:
+          python-version: "3.12"
+      - name: Install core + PyYAML
+        run: pip install -e . pyyaml
+      - name: Run scorecard gate
+        run: |
+          python -m gaia.eval.scorecard_gate \
+            --scorecards-dir hub/agents/npm/agent-email/scorecards \
+            --manifest hub/agents/python/email/gaia-agent.yaml
+
   # ── Stage 2: publish to the hub + npm (single atomic step) ─────────
   publish:
     name: Publish to Hub + npm
     runs-on: ubuntu-latest
-    needs: [build, verify-darwin-x64-compat]
+    needs: [build, verify-darwin-x64-compat, scorecard-gate]
     # Manual approval gate: the `agent-publish` environment is configured (repo
     # Settings → Environments) with required reviewers, so this job pauses until a
     # maintainer approves — the human backstop for an accidental/tampered release
@@ -458,13 +475,20 @@ jobs:
             case "$f" in *.json) continue ;; esac
             args+=(--artifact "$f")
           done
+          VER="${{ steps.ver.outputs.version }}"
+          scorecard_args=()
+          SCORECARD="hub/agents/npm/agent-email/scorecards/${VER}.md"
+          if [ -f "${SCORECARD}" ]; then
+            scorecard_args+=(--eval-scorecard "${SCORECARD}")
+          fi
           python hub/agents/python/email/packaging/publish_to_r2.py \
             --base-url "${GAIA_HUB_PUBLISH_URL:-${GAIA_HUB_BASE_URL:-https://hub.amd-gaia.ai}}" \
             --manifest "${MANIFEST}" \
             --readme "${README}" \
             --changelog "${CHANGELOG}" \
             --spec "${SPEC}" \
             --skill "${SKILL}" \
+            "${scorecard_args[@]}" \
             "${args[@]}" \
             --summary-out published.json
           echo "=== publish summary ==="

@@ -356,6 +356,7 @@
                 "group": "Evaluation Framework",
                 "pages": [
                   "reference/eval",
+                  "reference/eval-scorecard",
                   "eval"
                 ]
               },