20 commits
753b88d
test(eval): add TDD tests for release scorecard + gate (frozen contract)
Jun 25, 2026
2257088
feat(eval): add release_scorecard + scorecard_gate modules (increment…
Jun 25, 2026
5ed399c
feat(eval): add email adapter gen_scorecard.py + fix loose-coupling t…
Jun 25, 2026
a1dce4f
feat(eval): docs, hello-world scorecard, CI gate, npm wiring (increme…
Jun 25, 2026
78e45bf
feat(eval): deepen validate_scorecard with nested-field checks
Jun 25, 2026
019cc16
feat(eval): record eval limit + derive model in email scorecard adapter
Jun 25, 2026
2f931a1
ci(eval): pin scorecard-gate setup-python to @v6
Jun 25, 2026
e47bfaf
feat(eval): email v0.2.4 scorecard from real benchmark run
Jun 26, 2026
2ae55ec
feat(eval): scorecard refresh/reject CI loop, adoption skill, correct…
Jun 26, 2026
01d6da4
feat(eval): surface eval scorecard in Agent Hub worker and publish flow
Jun 26, 2026
add5172
feat(eval): show eval score and scorecard link in Agent UI detail modal
Jun 26, 2026
a3dd996
Merge branch 'main' into feat/issue-1862-eval-scorecard
itomek Jun 26, 2026
0eed445
feat(eval): regenerate email v0.2.4 scorecard against relabeled corpus
Jun 26, 2026
178266c
Merge remote-tracking branch 'origin/main' into feat/issue-1862-eval-…
Jun 26, 2026
f5971b6
refactor(eval): single SCORECARD.md per agent, new gate interface, re…
Jun 26, 2026
7e0ea56
refactor(eval): replace scorecards/ dirs with single SCORECARD.md per…
Jun 26, 2026
704ea08
refactor(eval): update hub worker, workflows, and publish for SCORECA…
Jun 26, 2026
40107bf
docs(eval): update scorecard docs and skill for single SCORECARD.md c…
Jun 26, 2026
20dbdbe
fix(eval): scorecard_gate pylint and black formatting
Jun 26, 2026
c2dcf6c
feat(eval): email SCORECARD.md from full-corpus run (46.0); portable …
Jun 26, 2026
85 changes: 85 additions & 0 deletions .claude/skills/adding-eval-scorecard/SKILL.md
@@ -0,0 +1,85 @@
---
name: "adding-eval-scorecard"
description: "Adopt the per-agent eval scorecard for a GAIA hub agent: write the harness→payload adapter, run the eval to produce a REAL scorecard, link + surface it from the agent's README, wire the release gate, and (for a new agent) generalize the format. Use when asked to 'add a scorecard', 'adopt the eval scorecard', 'generate the scorecard for <agent>', or wire scorecard CI for an agent. Builds on docs/reference/eval-scorecard.mdx and the email agent reference adapter."
---

# Adding an Eval Scorecard to a GAIA Agent

Adopt the release **eval scorecard** ([`docs/reference/eval-scorecard.mdx`](../../../docs/reference/eval-scorecard.mdx)) for one hub agent. The system is `harness → result payload → generator → scorecard`, with a standalone presence+regression release gate. The **email agent is the reference implementation** — mirror it.

**Core modules (do not modify; reuse):**
- `src/gaia/eval/release_scorecard.py` — `ResultPayload`, `compute_aggregate`, `render_scorecard`, `write_scorecard`, `validate_scorecard`, `carry_forward`, `latest_version_below`. Harness-agnostic (stdlib + PyYAML only).
- `src/gaia/eval/scorecard_gate.py` — the standalone gate (`python -m gaia.eval.scorecard_gate`).
- Reference adapter: `hub/agents/python/email/packaging/gen_scorecard.py`.

This is a **phased checklist with a hard gate at the real-eval step** — the scorecard MUST come from an actual eval run, never hand-authored numbers.

## Phase 1 — Locate the agent's surfaces

1. **Version source of truth** = the `version:` field in `<agent>/gaia-agent.yaml`. Never invent a parallel scheme.
2. **Canonical README** (where the scorecard is linked + surfaced): for an npm-published agent it is the npm client README (e.g. `hub/agents/npm/<id>/README.md`), NOT a `packaging/README.md`. For a Python-only agent it is `hub/agents/python/<id>/README.md`. Confirm which by checking what `release_agent_<id>.yml` publishes (`README:` env) — the published README is the one to link.
3. **doc-root** = the directory holding that canonical README. Scorecards live at `<doc-root>/scorecards/<version>.md`.
4. **Eval vehicle**: what existing harness produces this agent's accuracy metric? (email → `gaia eval benchmark` over `tests/fixtures/email/`.) If none exists, STOP and surface that — propose the minimal harness before building; do not invent numbers.
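
The first and third items can be illustrated with a minimal sketch that reads the manifest version and derives the scorecard path from a doc-root. The helper names and the sample manifest text (including the `id: email` line) are assumptions for illustration; the refresh workflow itself reads the version with `yaml.safe_load`, and the regex here is a stdlib-only simplification that only handles a flat top-level `version:` key.

```python
import re
from pathlib import PurePosixPath


def read_manifest_version(manifest_text: str) -> str:
    """Return the top-level `version:` value from gaia-agent.yaml text.

    Simplification: a regex instead of a YAML parser, so nested or
    multi-document manifests are out of scope for this sketch.
    """
    match = re.search(
        r"^version:\s*['\"]?([^'\"\s#]+)", manifest_text, flags=re.MULTILINE
    )
    if match is None:
        raise ValueError("manifest has no top-level 'version:' field")
    return match.group(1)


def scorecard_path(doc_root: str, version: str) -> PurePosixPath:
    """Build <doc-root>/scorecards/<version>.md as a repo-relative path."""
    return PurePosixPath(doc_root) / "scorecards" / f"{version}.md"


# Hypothetical manifest content; only the `version:` key matters here.
manifest = "id: email\nversion: 0.2.4\n"
version = read_manifest_version(manifest)
print(scorecard_path("hub/agents/npm/agent-email", version))
```

Raising on a missing `version:` field, rather than falling back to a default, keeps the manifest as the single source of truth.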

## Phase 2 — Write the adapter (harness → payload)

Copy `hub/agents/python/email/packaging/gen_scorecard.py` as the template. The adapter:
- imports ONLY `gaia.eval.release_scorecard` (never the harness or agent package — preserve loose coupling);
- reads the harness output, builds a `ResultPayload`;
- defines **"judged"** explicitly and **raises loudly** if zero results are judged (no silent 0.0);
- records **dataset size** (total labeled examples) and **test_cases_run** (subset executed) as DISTINCT fields;
- stores **repo-relative** paths only (never a local absolute path — it ships in a published artifact);
- records the eval `limit`/config so future regression checks are comparable;
- writes to `<doc-root>/scorecards/<version>.md`.

Add an offline unit test against a committed sample harness-output fixture (see `tests/fixtures/eval/email_benchmark_scorecard.json` + `tests/unit/eval/test_release_scorecard.py::TestEmailAdapter`) so the adapter is testable without a live model.
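
The adapter rules above can be sketched as follows. This is not the real `ResultPayload` API, whose fields are not shown in this document: `SketchPayload`, `build_payload`, and the assumed harness result shape (`expected` / `predicted` keys, with `predicted` set to `None` when the harness produced no answer) are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class SketchPayload:
    """Hypothetical stand-in for ResultPayload; field names are assumptions."""

    accuracy: float
    dataset_size: int    # total labeled examples in the corpus
    test_cases_run: int  # subset actually executed in this run
    limit: int           # eval limit, kept so later runs are comparable
    source_path: str     # repo-relative path to the harness output


def build_payload(results, dataset_size, limit, source_path):
    """Turn harness results into a payload, refusing degenerate input."""
    if PurePosixPath(source_path).is_absolute():
        raise ValueError(f"source_path must be repo-relative: {source_path}")
    # "Judged" is defined explicitly: the harness returned a prediction.
    judged = [r for r in results if r.get("predicted") is not None]
    if not judged:
        raise ValueError("zero judged results; refusing to emit a silent 0.0")
    correct = sum(1 for r in judged if r["predicted"] == r["expected"])
    return SketchPayload(
        accuracy=correct / len(judged),
        dataset_size=dataset_size,
        test_cases_run=len(results),
        limit=limit,
        source_path=source_path,
    )
```

An offline unit test can feed `build_payload` a small committed fixture and assert both the happy path and the two `ValueError` cases, without a live model.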

## Phase 3 — Run the REAL eval (hard gate — no hand-authored numbers)

The accuracy number must come from an actual run. For the email agent:

```bash
# Real eval needs Lemonade + the model. Prefer AMD hardware (Strix Halo / Ryzen AI);
# the [self-hosted, lemonade-eval] runner is the canonical environment.
GAIA_AGENT_TOOL_TIMEOUT=900 \
PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring \
PYTHONPATH="$(pwd)" \
<venv>/bin/gaia eval benchmark \
  --model Gemma-4-E4B-it-GGUF \
  --mbox-path tests/fixtures/email/synthetic_inbox.mbox \
  --ground-truth tests/fixtures/email/ground_truth.json \
  --limit 25 --output-dir <persistent-dir>

<venv>/bin/python hub/agents/python/email/packaging/gen_scorecard.py \
  --benchmark-dir <persistent-dir> --limit 25
```

**Headless gotchas (see memory `project-email-benchmark-headless-gotchas`):**
- `PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring` — the email agent's calendar-connector resolution blocks forever on the macOS Keychain (and can stall on Linux SecretService) in non-interactive contexts. Without this it hangs at 0% CPU during agent construction.
- `PYTHONPATH="$(pwd)"` — the benchmark imports `tests.fixtures.email.*`; the console script doesn't add the repo root.
- `GAIA_AGENT_TOOL_TIMEOUT=900` — triage of N emails is one tool call; the 180s default abandons it on slow backends, yielding a degenerate 0-email FAIL run.
- Write `--output-dir` to a **persistent** dir, not `/tmp` (cleared on session resume).
- Record honestly: if the metric is low for a known reason (e.g. a taxonomy/label mismatch), put the explanation in the adapter's `methodology` string and link the tracking issue — never inflate the number.
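
When the benchmark is launched from a script rather than a shell, the three environment settings above can be applied in one place. The following is a minimal sketch under that assumption; `headless_eval_env` and `benchmark_command` are hypothetical helper names, and the command only reuses the flags shown in the bash example.

```python
import os


def headless_eval_env(repo_root, base_env=None):
    """Return a copy of the environment with the headless workarounds set."""
    env = dict(os.environ if base_env is None else base_env)
    # Avoid blocking on the OS keyring in a non-interactive context.
    env["PYTHON_KEYRING_BACKEND"] = "keyring.backends.null.Keyring"
    # The benchmark imports tests.fixtures.email.*, so the repo root must be importable.
    env["PYTHONPATH"] = repo_root
    # Triage of N emails is a single tool call; give it more than the 180s default.
    env["GAIA_AGENT_TOOL_TIMEOUT"] = "900"
    return env


def benchmark_command(model, limit, output_dir):
    """Build the argv for the email benchmark using the flags from the example above."""
    return [
        "gaia", "eval", "benchmark",
        "--model", model,
        "--mbox-path", "tests/fixtures/email/synthetic_inbox.mbox",
        "--ground-truth", "tests/fixtures/email/ground_truth.json",
        "--limit", str(limit),
        "--output-dir", output_dir,
    ]
```

The two results can then be passed to `subprocess.run(cmd, env=env, check=True)`, with `output_dir` pointing at a persistent directory rather than `/tmp`.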

## Phase 4 — Surface, link, and gate

1. **Link + surface** from the canonical README: a one-line `Eval scorecard (vX.Y.Z): aggregate N/100 … ([./scorecards/X.Y.Z.md](./scorecards/X.Y.Z.md))`. The relative link must resolve in-repo.
2. **npm `files`**: if the agent publishes on npm, add `scorecards/` to `package.json` `files` so the link resolves on the published package too.
3. **Hub display**: a published scorecard surfaces on the agent's hub page / Agent UI detail view (see `workers/agent-hub` + `AgentDetailModal.tsx`); ensure the publish step uploads the scorecard alongside the README.
4. **Release gate**: add a `scorecard-gate` job to `release_agent_<id>.yml` and list it in `publish.needs`. The job runs on a GitHub-hosted runner (it only parses committed files — no eval):
   ```bash
   python -m gaia.eval.scorecard_gate \
     --scorecards-dir <doc-root>/scorecards \
     --manifest hub/agents/python/<id>/gaia-agent.yaml
   ```
   The job must NOT have `continue-on-error`, an `environment:`, or a `permissions:` override (inherits `contents: read`; needs no secrets).
5. **Auto-update/reject loop**: for re-running on agent changes and refreshing the committed scorecard, see `eval-scorecard.mdx` "Keeping the scorecard current" and the self-hosted refresh workflow — reject-on-worse is the gate; better-or-equal refreshes the committed card.
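
As a rough mental model of what a presence+regression gate decides, consider the sketch below. It is not the code in `scorecard_gate.py`: the function names, the `cards` mapping of version to aggregate, and the strict "lower than the prior version fails" rule are assumptions, and the real gate may apply different comparison rules.

```python
def parse_version(version):
    """Split 'X.Y.Z' into a tuple of ints so 0.2.10 sorts after 0.2.9."""
    return tuple(int(part) for part in version.split("."))


def gate_exit_code(cards, version):
    """Return 0 to allow the release, 1 to block it.

    cards maps a version string to its recorded aggregate score.
    """
    # Presence check: the manifest version must have a committed card.
    if version not in cards:
        return 1
    # Regression check: compare against the latest version below this one.
    below = [v for v in cards if parse_version(v) < parse_version(version)]
    if not below:
        return 0  # first scorecard, nothing to regress against
    prior = max(below, key=parse_version)
    return 1 if cards[version] < cards[prior] else 0


# Illustrative numbers only, not real eval results.
print(gate_exit_code({"0.2.3": 40.0, "0.2.4": 46.0}, "0.2.4"))
```

Because the decision depends only on committed files, a check of this shape needs no model, no secrets, and no write permissions, which is why the job can run on a GitHub-hosted runner.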

## Phase 5 — Verify (evidence before "done")

Run and capture: the generated `<version>.md`; the gate **passing** on it (exit 0); the gate **blocking** a manufactured regression (exit 1) and a missing card (exit 1); a by-hand recompute of the aggregate from `aggregate.components` matching the recorded value. Run `python util/lint.py --all` and the eval unit tests. These are the PR's real-world proof.

## Versioning

- **Patch** release → `carry_forward(prev_path, new_version)` (copies results verbatim, sets `inherited_from`); do NOT re-run the eval.
- **Minor/major** release → re-run the eval (Phase 3); `carry_forward` refuses a non-patch bump with a "re-run" error.
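
The patch-versus-minor/major distinction can be sketched as a small predicate. This is an illustration only: `is_patch_bump` is a hypothetical helper, and the rule it encodes (same major and minor, higher patch) is an assumption about what `carry_forward` treats as a patch bump.

```python
def is_patch_bump(prev_version, new_version):
    """True when only the patch component increases (e.g. 0.2.3 -> 0.2.4)."""
    prev = tuple(int(part) for part in prev_version.split("."))
    new = tuple(int(part) for part in new_version.split("."))
    if len(prev) != 3 or len(new) != 3:
        raise ValueError("expected versions of the form X.Y.Z")
    return prev[:2] == new[:2] and new[2] > prev[2]


def release_action(prev_version, new_version):
    """Name the step this section prescribes for the given bump."""
    if is_patch_bump(prev_version, new_version):
        return "carry_forward"
    return "re-run eval"
```

Under these assumptions, `release_action("0.2.3", "0.2.4")` selects the carry-forward path, while a bump to `0.3.0` or `1.0.0` selects a fresh eval run.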
162 changes: 162 additions & 0 deletions .github/workflows/email_scorecard_refresh.yml
@@ -0,0 +1,162 @@
# Copyright(C) 2025-2026 Advanced Micro Devices, Inc. All rights reserved.
# SPDX-License-Identifier: MIT

# Email agent eval-scorecard refresh + regression gate (#1862).
#
# Answers "how does a PR that changes the agent keep the scorecard honest?":
# when the email agent's LLM-affecting code (or the eval corpus) changes, this
# re-runs the REAL eval, regenerates the scorecard, and then:
# - score IMPROVED or held -> commits the refreshed scorecard to the branch
# - score REGRESSED -> fails the job (the worse card is NOT committed)
#
# `gaia eval benchmark` needs Lemonade on AMD hardware, so this runs ONLY on the
# self-hosted [self-hosted, lemonade-eval] pool — GitHub-hosted runners cannot run
# it. The release-time `scorecard-gate` job in release_agent_email.yml is the
# hosted-CI backstop (it parses committed files only, no eval).
#
# Two regression checks run here:
# 1. SAME-VERSION: fresh aggregate vs the currently-committed card for this
# version — stops a noisy/worse re-run from silently overwriting a good score.
# 2. CROSS-VERSION: `gaia.eval.scorecard_gate` — fresh card vs the prior version.
#
# Auto-commit needs `contents: write` and only works on the repo's own branches;
# a fork PR's GITHUB_TOKEN is read-only — for forks, run the eval locally / on AMD
# hardware and commit the scorecard by hand (the release gate still enforces it).

name: Email Agent Eval — scorecard refresh

on:
  workflow_dispatch:
    inputs:
      limit:
        description: 'Messages to triage (must match the committed scorecard for comparability)'
        required: false
        default: '25'
      model:
        description: 'Lemonade model id'
        required: false
        default: 'Gemma-4-E4B-it-GGUF'
  push:
    branches-ignore:
      - main
    paths:
      - 'hub/agents/python/email/**'
      - 'tests/fixtures/email/**'
      - 'src/gaia/eval/release_scorecard.py'
      - 'src/gaia/eval/scorecard_gate.py'

concurrency:
  # Share the single Lemonade backend slot with the other self-hosted evals so two
  # runs never race-evict each other's model (CLAUDE.md: evals run serially).
  group: lemonade-eval
  cancel-in-progress: false

permissions:
  contents: write # auto-commit the refreshed scorecard to the branch

env:
  SCORECARD_DIR: hub/agents/npm/agent-email/scorecards
  MANIFEST: hub/agents/python/email/gaia-agent.yaml
  LIMIT: ${{ github.event.inputs.limit || '25' }}
  MODEL: ${{ github.event.inputs.model || 'Gemma-4-E4B-it-GGUF' }}

jobs:
  refresh:
    name: Re-run eval, refresh-or-reject scorecard
    runs-on: [self-hosted, lemonade-eval]
    timeout-minutes: 90
    steps:
      - name: Checkout (the pushed branch)
        uses: actions/checkout@v6
        with:
          ref: ${{ github.head_ref || github.ref_name }}

      - name: Set up Python
        uses: actions/setup-python@v6
        with:
          python-version: '3.10'

      - name: Install in isolated venv
        run: |
          python -m venv .venv-scorecard
          source .venv-scorecard/bin/activate
          python -m pip install --upgrade pip
          pip install -e ".[dev,eval,api]"
          echo "$PWD/.venv-scorecard/bin" >> "$GITHUB_PATH"

      - name: Resolve version + capture currently-committed aggregate
        id: pre
        run: |
          set -euo pipefail
          VERSION=$(python -c "import yaml; print(yaml.safe_load(open('${MANIFEST}'))['version'])")
          echo "version=${VERSION}" >> "$GITHUB_OUTPUT"
          CARD="${SCORECARD_DIR}/${VERSION}.md"
          # Aggregate of the card as committed on this branch (empty if new).
          if git cat-file -e "HEAD:${CARD}" 2>/dev/null; then
            git show "HEAD:${CARD}" > /tmp/committed_card.md
            COMMITTED=$(python -c "from gaia.eval.release_scorecard import parse_scorecard; print(parse_scorecard(__import__('pathlib').Path('/tmp/committed_card.md'))['aggregate']['value'])")
          else
            COMMITTED=""
          fi
          echo "committed=${COMMITTED}" >> "$GITHUB_OUTPUT"
          echo "Version ${VERSION}; committed aggregate: ${COMMITTED:-<none>}"

      - name: Run the email-triage benchmark (real eval)
        env:
          # The agent's calendar-connector resolution blocks on the OS keyring in
          # a headless context — disable it so construction doesn't hang.
          PYTHON_KEYRING_BACKEND: keyring.backends.null.Keyring
          # Triage of N emails is one tool call; the 180s default abandons it on a
          # slow backend and yields a degenerate 0-email run.
          GAIA_AGENT_TOOL_TIMEOUT: '900'
          PYTHONPATH: ${{ github.workspace }}
        run: |
          set -euo pipefail
          rm -rf eval-out && mkdir -p eval-out
          gaia eval benchmark \
            --model "${MODEL}" \
            --mbox-path tests/fixtures/email/synthetic_inbox.mbox \
            --ground-truth tests/fixtures/email/ground_truth.json \
            --limit "${LIMIT}" \
            --output-dir eval-out

      - name: Regenerate the scorecard from the real run
        run: |
          set -euo pipefail
          python hub/agents/python/email/packaging/gen_scorecard.py \
            --benchmark-dir eval-out --limit "${LIMIT}"

      - name: Same-version regression check (reject a worse re-run)
        run: |
          set -euo pipefail
          VERSION="${{ steps.pre.outputs.version }}"
          COMMITTED="${{ steps.pre.outputs.committed }}"
          CARD="${SCORECARD_DIR}/${VERSION}.md"
          FRESH=$(python -c "from gaia.eval.release_scorecard import parse_scorecard; print(parse_scorecard(__import__('pathlib').Path('${CARD}'))['aggregate']['value'])")
          echo "fresh aggregate: ${FRESH} | committed: ${COMMITTED:-<none>}"
          if [ -n "${COMMITTED}" ] && python -c "import sys; sys.exit(0 if float('${FRESH}') < float('${COMMITTED}') else 1)"; then
            echo "::error::Scorecard regression for v${VERSION}: re-run scored ${FRESH} < committed ${COMMITTED}. Not committing the worse card. Investigate, or override intentionally via --allow-regression in a manual commit."
            git checkout -- "${CARD}" || true
            exit 1
          fi
          echo "No same-version regression — fresh score is >= committed."

      - name: Cross-version gate (fresh card vs prior version)
        run: |
          set -euo pipefail
          python -m gaia.eval.scorecard_gate \
            --scorecards-dir "${SCORECARD_DIR}" \
            --manifest "${MANIFEST}"

      - name: Commit the refreshed scorecard (only if it changed for the better/equal)
        run: |
          set -euo pipefail
          git add "${SCORECARD_DIR}" # stage first: plain `git diff` ignores a brand-new (untracked) card
          if git diff --cached --quiet -- "${SCORECARD_DIR}"; then
            echo "Scorecard unchanged — nothing to commit."
            exit 0
          fi
          git config user.name "${{ github.actor }}"
          git config user.email "${{ github.actor }}@users.noreply.github.com"
          git commit -m "eval(email): refresh v${{ steps.pre.outputs.version }} scorecard from benchmark run"
          git push origin "HEAD:${{ github.head_ref || github.ref_name }}"
26 changes: 25 additions & 1 deletion .github/workflows/release_agent_email.yml
@@ -266,11 +266,28 @@ jobs:
echo "ok=false" >> "$GITHUB_OUTPUT"
fi

  # ── Stage 1b: scorecard presence + regression gate ─────────────────
  scorecard-gate:
    name: Scorecard gate
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-python@v6
        with:
          python-version: "3.12"
      - name: Install core + PyYAML
        run: pip install -e . pyyaml
      - name: Run scorecard gate
        run: |
          python -m gaia.eval.scorecard_gate \
            --scorecards-dir hub/agents/npm/agent-email/scorecards \
            --manifest hub/agents/python/email/gaia-agent.yaml

  # ── Stage 2: publish to the hub + npm (single atomic step) ─────────
  publish:
    name: Publish to Hub + npm
    runs-on: ubuntu-latest
-   needs: [build, verify-darwin-x64-compat]
+   needs: [build, verify-darwin-x64-compat, scorecard-gate]
    # Manual approval gate: the `agent-publish` environment is configured (repo
    # Settings → Environments) with required reviewers, so this job pauses until a
    # maintainer approves — the human backstop for an accidental/tampered release
@@ -458,13 +475,20 @@ jobs:
case "$f" in *.json) continue ;; esac
args+=(--artifact "$f")
done
VER="${{ steps.ver.outputs.version }}"
scorecard_args=()
SCORECARD="hub/agents/npm/agent-email/scorecards/${VER}.md"
if [ -f "${SCORECARD}" ]; then
scorecard_args+=(--eval-scorecard "${SCORECARD}")
fi
python hub/agents/python/email/packaging/publish_to_r2.py \
--base-url "${GAIA_HUB_PUBLISH_URL:-${GAIA_HUB_BASE_URL:-https://hub.amd-gaia.ai}}" \
--manifest "${MANIFEST}" \
--readme "${README}" \
--changelog "${CHANGELOG}" \
--spec "${SPEC}" \
--skill "${SKILL}" \
"${scorecard_args[@]}" \
"${args[@]}" \
--summary-out published.json
echo "=== publish summary ==="
1 change: 1 addition & 0 deletions docs/docs.json
@@ -356,6 +356,7 @@
"group": "Evaluation Framework",
"pages": [
"reference/eval",
"reference/eval-scorecard",
"eval"
]
},