fix(eval): relabel email triage corpus to schema-2.0 5-bucket taxonomy by github-actions[bot] · Pull Request #1875 · amd/gaia

github-actions · 2026-06-26T14:43:57Z

The email-triage eval corpus was scored against the wrong vocabulary. The agent emits the schema-2.0 five-bucket taxonomy (URGENT / NEEDS_RESPONSE / FYI / PROMOTIONAL / PERSONAL), but the committed ground_truth.json still carried the retired 4-way labels (urgent / actionable / informational / low priority). Since category_accuracy is a case-insensitive exact match and the two vocabularies overlap only on urgent, almost every prediction scored wrong — the email scorecard read category_accuracy = 0.04 purely as a labeling artifact, not real quality. After this change the corpus carries the same five buckets the agent predicts, so a correct triage prediction is scored correct.

The corpus generator already maps to the schema-2.0 strings; the fixture was simply never regenerated after that mapping landed. Regenerating is deterministic — the .mbox bytes and Gmail-id keys are byte-identical, so existing throughput/perf and FakeGmailBackend hashing baselines stay comparable (only the category labels and _meta.taxonomy change).
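
The byte-stability claim above can be checked independently of the generator. The following is a minimal sketch, not the logic behind generate_mbox.py --verify: the helper names sha256_of and is_byte_identical are hypothetical, and it assumes you have a copy of the .mbox from before and after regeneration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops as soon as read() returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_byte_identical(before: Path, after: Path) -> bool:
    """True when both files hash to the same SHA-256 digest."""
    return sha256_of(before) == sha256_of(after)
```

If the two digests match, any baseline keyed on the mbox bytes should remain comparable across the relabel.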

Closes #1874

Test plan

PYTHONPATH=...:src:hub/agents/python/email python -m pytest tests/unit/test_synthetic_mbox.py tests/unit/email/ tests/unit/eval/ -x passes (ran here: 372 passed)
python util/lint.py --all passes
python tests/fixtures/email/generate_mbox.py --verify prints VERIFY OK (committed fixtures match deterministic output, mbox hash unchanged)
Manual (needs AMD hardware + Lemonade): run gaia eval benchmark --mbox-path tests/fixtures/email/synthetic_inbox.mbox --ground-truth tests/fixtures/email/ground_truth.json --limit 25 --output-dir <dir> and confirm category_accuracy moves from ~0.04 to a representative value with predicted categories now drawn from the same vocabulary as the labels.
Follow-up (AC #3, needs hardware): regenerate the real-run baselines (tests/fixtures/email/baseline_accuracy*.json via score_baseline.py) and the email scorecard at the next eligible version. These still hold stale-taxonomy numbers and are out of scope for an offline change.

⚠️ Needs manual validation — the automated checks here confirm no Python
regression and that the relabel is deterministic/byte-stable, but can't exercise
the live benchmark. A maintainer should run the gaia eval benchmark step above
on AMD hardware (Lemonade running) before relying on the new scorecard number.

🔍 Technical details

Root cause. tests/fixtures/email/generate_mbox.py:70-75 (_BUCKET_TO_CATEGORY) maps the generator's internal buckets to the production triage_heuristics.ALL_CATEGORIES strings, and line 365 routes every label through it. The committed ground_truth.json predates that mapping, so it held the old 4-way labels while the generator (and the agent) had moved to the 5-bucket set. src/gaia/eval/quality_metrics.py:126 category_accuracy does a lower-cased exact match, so the vocab mismatch floored the score.
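
To illustrate why the vocabulary mismatch floors the score, here is a small sketch of a lower-cased exact-match accuracy. It is not the code in quality_metrics.py: the function name exact_match_accuracy, the flat id-to-label dict shape, and the message ids are all assumptions made for the example.

```python
def exact_match_accuracy(predicted: dict[str, str], truth: dict[str, str]) -> float:
    """Fraction of labelled ids whose prediction equals the label, ignoring case."""
    if not truth:
        return 0.0
    hits = sum(
        1
        for msg_id, label in truth.items()
        if predicted.get(msg_id, "").lower() == label.lower()
    )
    return hits / len(truth)


# Hypothetical predictions in the schema-2.0 vocabulary.
predicted = {"m1": "URGENT", "m2": "NEEDS_RESPONSE", "m3": "FYI", "m4": "PROMOTIONAL"}

# The same four messages labelled in the retired and the current vocabulary.
old_truth = {"m1": "urgent", "m2": "actionable", "m3": "informational", "m4": "low priority"}
new_truth = {"m1": "URGENT", "m2": "NEEDS_RESPONSE", "m3": "FYI", "m4": "PROMOTIONAL"}

old_score = exact_match_accuracy(predicted, old_truth)  # only "urgent" can ever match
new_score = exact_match_accuracy(predicted, new_truth)
```

In this toy case every prediction is semantically right, yet only the urgent row can match the old labels, which is the same effect that produced the 0.04 reading.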

Changes:

tests/fixtures/email/ground_truth.json — regenerated via the deterministic generator (SEED=23023). Verified the .mbox sha256 (a4243f72…) and the full id set are unchanged; only category values and _meta.taxonomy differ. New realized counts: URGENT 47, NEEDS_RESPONSE 56, FYI 80, PROMOTIONAL 37 (PERSONAL not yet populated in the synthetic corpus — pre-existing, tracked by test(email): re-record the categorization baseline on the canonical CI backend #1438).
src/gaia/eval/quality_metrics.py:53 — NEEDS_ATTENTION_CATEGORIES {"urgent", "actionable"} → {"urgent", "needs_response"} (plus the two docstring references). Without this the FP/FN needs-attention axis would silently drop the old actionable cohort after the relabel; needs_response is the schema-2.0 successor. Compared lower-cased, matching the scorer.
tests/unit/test_synthetic_mbox.py: updated test_category_coverage_and_counts and test_meta_block_present to the 5-bucket strings, and added test_corpus_vocab_matches_scorer_taxonomy (AC #4), which asserts that the committed corpus vocabulary and the scorer's attention axis both stay a subset of ALL_CATEGORIES. This is a drift guard so a future taxonomy change can't silently re-break scoring.
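
The drift guard in the last item boils down to a subset check. A rough sketch follows, assuming a flat id-to-category mapping for the corpus labels; the constants below are local stand-ins that mirror the values named in this PR rather than imports from the real modules, and vocabulary_drift is a hypothetical helper, not the committed test.

```python
# Stand-ins for the production constants (assumed values, taken from this PR's text).
ALL_CATEGORIES = ("URGENT", "NEEDS_RESPONSE", "FYI", "PROMOTIONAL", "PERSONAL")
NEEDS_ATTENTION_CATEGORIES = {"urgent", "needs_response"}


def vocabulary_drift(labels, allowed) -> set[str]:
    """Return the labels (lower-cased) that are not part of the allowed taxonomy."""
    allowed_lower = {category.lower() for category in allowed}
    return {label.lower() for label in labels} - allowed_lower


# Hypothetical corpus labels: one relabelled corpus, one still on the retired set.
new_corpus = {"m1": "URGENT", "m2": "NEEDS_RESPONSE", "m3": "FYI", "m4": "PROMOTIONAL"}
old_corpus = {"m1": "urgent", "m2": "actionable", "m3": "informational", "m4": "low priority"}

new_drift = vocabulary_drift(new_corpus.values(), ALL_CATEGORIES)
old_drift = vocabulary_drift(old_corpus.values(), ALL_CATEGORIES)
axis_drift = vocabulary_drift(NEEDS_ATTENTION_CATEGORIES, ALL_CATEGORIES)
```

An empty set means the vocabulary is in sync. Returning the offending labels, rather than a bare boolean, makes a failing assertion message say which strings drifted.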

Verification run here: 372 email+eval unit tests pass; generate_mbox.py --verify self-checks clean; black + isort clean. The inline-GT unit tests in test_quality_metrics.py / test_benchmark.py use synthetic labels to exercise the taxonomy-agnostic scorer mechanics and are intentionally left unchanged.

The committed ground_truth.json carried the retired 4-way taxonomy (urgent/actionable/informational/low priority) while the agent emits the schema-2.0 5-bucket set (URGENT/NEEDS_RESPONSE/FYI/PROMOTIONAL/PERSONAL), so the benchmark's exact-match category_accuracy collapsed to ~0.04 — a vocabulary artifact, not real quality. The generator already maps to the schema-2.0 strings; regenerating relabels the corpus deterministically (mbox bytes and Gmail-id keys unchanged, so perf/hashing baselines stay comparable). Update the scorer's needs-attention axis (actionable -> needs_response) so the FP/FN axis stays coherent, refresh the corpus assertions, and add a drift guard asserting the corpus vocabulary stays a subset of the production taxonomy. Closes #1874

itomek · 2026-06-26T15:35:15Z

Verified on AMD Strix Halo (Gemma-4-E4B, real gaia eval benchmark run): this relabel fixes the #1874 artifact. Scoring the same 50 agent predictions against both label sets, category_accuracy goes 0.04 → 0.46 — it's now a real measure of agreement, not a vocabulary mismatch. Recommend merge; one follow-up worth noting below.

🔍 Verification details + a residual quality note

Run: current 5-bucket email agent over the unchanged synthetic_inbox.mbox, 50/220 emails, scored against the old labels vs this PR's labels (identical predictions):

Scored against	category_accuracy
old 4-way labels (pre-PR)	2/50 = 0.04
this PR's 5-bucket labels	23/50 = 0.46

Per-bucket recall (this PR's labels): FYI 52% · NEEDS_RESPONSE 46% · URGENT 40% · PROMOTIONAL 29%.
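
For readers who want to reproduce the per-bucket numbers on their own run, recall per bucket can be computed as sketched below. This assumes flat id-to-label dicts and the same lower-cased exact match as the scorer; per_bucket_recall is a hypothetical helper and the toy data is illustrative, not taken from the benchmark run.

```python
from collections import Counter


def per_bucket_recall(predicted: dict[str, str], truth: dict[str, str]) -> dict[str, float]:
    """Recall per ground-truth bucket: correct predictions / emails labelled with that bucket."""
    totals: Counter = Counter()
    hits: Counter = Counter()
    for msg_id, label in truth.items():
        bucket = label.lower()
        totals[bucket] += 1
        if predicted.get(msg_id, "").lower() == bucket:
            hits[bucket] += 1
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Toy data only, to show the shape of the result.
truth = {"a": "FYI", "b": "FYI", "c": "URGENT", "d": "PROMOTIONAL"}
predicted = {"a": "FYI", "b": "NEEDS_RESPONSE", "c": "urgent", "d": "FYI"}

recall = per_bucket_recall(predicted, truth)
```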

Residual (non-blocking) note: the relabel is a 1:1 rename by count — informational→FYI, actionable→NEEDS_RESPONSE, urgent→URGENT, low priority→PROMOTIONAL (old/new counts are identical: 80/56/47/37). The first three are sound; low priority→PROMOTIONAL is the weakest bucket (29%) — those former low-priority mails aren't actually promotional (the agent splits them across FYI/NEEDS_RESPONSE/PERSONAL), and the 5th bucket PERSONAL is unused in the corpus. The dominant error is genuine FYI↔NEEDS_RESPONSE boundary confusion by the model, not labeling. A content-based relabel of the former low-priority set (into PROMOTIONAL/FYI/PERSONAL by actual content) would lift the ceiling further — good as a follow-up, doesn't block this fix.
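
The "1:1 rename by count" observation can be written down directly. The mapping and counts below are the ones quoted in this thread; the variable names are just for illustration.

```python
from collections import Counter

# Old label -> new label, as described in the note above.
OLD_TO_NEW = {
    "informational": "FYI",
    "actionable": "NEEDS_RESPONSE",
    "urgent": "URGENT",
    "low priority": "PROMOTIONAL",
}

# Realized counts per old label, as quoted above (80/56/47/37).
OLD_COUNTS = {"informational": 80, "actionable": 56, "urgent": 47, "low priority": 37}

# A pure rename moves each old bucket's whole count onto exactly one new bucket.
new_counts: Counter = Counter()
for old_label, count in OLD_COUNTS.items():
    new_counts[OLD_TO_NEW[old_label]] += count

is_one_to_one = len(set(OLD_TO_NEW.values())) == len(OLD_TO_NEW)
total_emails = sum(new_counts.values())
```

Because the mapping is one-to-one, no email can land in PERSONAL and every former low-priority email becomes PROMOTIONAL regardless of content, which is the limitation the note points at. A content-based relabel would instead need a per-email decision.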

github-actions · 2026-06-26T15:47:36Z

🟡 The new test_corpus_vocab_matches_scorer_taxonomy test imports from gaia_agent_email without a skip guard, so it will error (not skip) on any standard dev environment that doesn't have the hub package installed — breaking pytest tests/unit/ for everyone who hasn't installed gaia-agent-email.

Every other test in this repo that touches gaia_agent_email guards with pytest.importorskip first. Add the same guard inside the new test before the import:

pytest.importorskip("gaia_agent_email")

🔍 Technical details

tests/unit/test_synthetic_mbox.py — the new test function does:

def test_corpus_vocab_matches_scorer_taxonomy(labels: dict) -> None:
    from gaia_agent_email.tools.triage_heuristics import ALL_CATEGORIES
    ...

Without a skip guard, ImportError causes an ERROR result (not a graceful skip) when gaia-agent-email isn't installed. The established repo pattern (see tests/unit/email/test_corpus_integrity.py:33, tests/unit/test_email_cli.py:18, tests/unit/test_agents_split.py:9) is pytest.importorskip("gaia_agent_email") before any import from that package.

Fix — add as the first line of the function body (or at module level):

pytest.importorskip("gaia_agent_email")
from gaia_agent_email.tools.triage_heuristics import ALL_CATEGORIES
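
To make the error-versus-skip distinction concrete, here is a stdlib-only analogue of the guard. It is a sketch of the idea, not pytest's implementation: import_or_skip is a hypothetical helper that converts ImportError into unittest.SkipTest, which test runners report as a skip instead of an error.

```python
import importlib
import unittest
from types import ModuleType


def import_or_skip(module_name: str) -> ModuleType:
    """Import a module, or signal 'skip this test' when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # A bare ImportError would surface as a test ERROR; SkipTest is reported as a skip.
        raise unittest.SkipTest(f"could not import {module_name!r}") from exc
```

In the actual test, pytest.importorskip("gaia_agent_email") is still the right call, since it matches the pattern used elsewhere in the repo.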

github-actions · 2026-06-26T16:19:22Z

🟡 The new test_corpus_vocab_matches_scorer_taxonomy test will raise ImportError in any environment that doesn't have gaia-agent-email installed, including the base CI run and any developer using only the core package. Every other unit test that touches gaia_agent_email in this repo gates on pytest.importorskip("gaia_agent_email") first; this one is missing that guard.

🔍 Technical details

tests/unit/test_synthetic_mbox.py (new test, ~line 2063):

def test_corpus_vocab_matches_scorer_taxonomy(labels: dict) -> None:
    from gaia_agent_email.tools.triage_heuristics import ALL_CATEGORIES  # no skip guard

gaia-agent-email is an optional extra (setup.py: "agent-email": ["gaia-agent-email"]), not part of the core install. The established pattern used in every other unit test that imports from this package — tests/unit/email/test_triage_heuristics.py, test_pre_scan_summary_fix.py, test_phishing_precision.py — is pytest.importorskip("gaia_agent_email") at module or function scope before the deferred import.

Fix: add pytest.importorskip("gaia_agent_email") as the first line of the test function body.

After #1875 relabeled the eval corpus to the schema-2.0 triage taxonomy, the email agent's predictions and the ground-truth labels share one vocabulary, so category_accuracy now measures real agreement: 0.40 over 25 of 220 emails -> aggregate 40.0/100 (was 4.0, a labeling artifact). Fresh gaia eval benchmark run on AMD Strix Halo. Drop the now-resolved #1874 caveat from the adapter methodology + README; align the dataset description to the schema-2.0 taxonomy.

github-actions Bot requested a review from kovtcharov-amd as a code owner June 26, 2026 14:43

github-actions Bot mentioned this pull request Jun 26, 2026

fix(eval): email triage corpus ground-truth labels stale vs schema-2.0 5-bucket taxonomy #1874

Closed

4 tasks

Merge branch 'main' into autofix/issue-1874

f1a98f5

github-actions Bot added the eval, tests, and performance labels Jun 26, 2026

itomek self-assigned this Jun 26, 2026

Merge branch 'main' into autofix/issue-1874

02cae1c

itomek approved these changes Jun 26, 2026

itomek added this pull request to the merge queue Jun 26, 2026

Merged via the queue into main with commit e379441 Jun 26, 2026
35 checks passed

itomek deleted the autofix/issue-1874 branch June 26, 2026 16:29

itomek mentioned this pull request Jun 26, 2026

feat(eval): per-agent per-version eval scorecard + release gate #1873

Open

itomek merged 3 commits into main from autofix/issue-1874
