fix(eval): relabel email triage corpus to schema-2.0 5-bucket taxonomy #1875
The committed ground_truth.json carried the retired 4-way taxonomy (urgent/actionable/informational/low priority) while the agent emits the schema-2.0 5-bucket set (URGENT/NEEDS_RESPONSE/FYI/PROMOTIONAL/PERSONAL), so the benchmark's exact-match category_accuracy collapsed to ~0.04 — a vocabulary artifact, not real quality. The generator already maps to the schema-2.0 strings; regenerating relabels the corpus deterministically (mbox bytes and Gmail-id keys unchanged, so perf/hashing baselines stay comparable). Update the scorer's needs-attention axis (actionable -> needs_response) so the FP/FN axis stays coherent, refresh the corpus assertions, and add a drift guard asserting the corpus vocabulary stays a subset of the production taxonomy. Closes #1874
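To illustrate why the needs-attention axis has to follow the relabel, here is a minimal sketch of an FP/FN count over (label, prediction) pairs. The function name `attention_errors` and the sample pairs are hypothetical and not taken from the scorer; only the two category sets mirror the before/after values named in this PR.

```python
from typing import Iterable, Set, Tuple

# Before/after membership of the needs-attention axis, as described in this PR.
OLD_AXIS: Set[str] = {"urgent", "actionable"}
NEW_AXIS: Set[str] = {"urgent", "needs_response"}


def attention_errors(
    pairs: Iterable[Tuple[str, str]], axis: Set[str]
) -> Tuple[int, int]:
    """Return (false_positives, false_negatives) on the needs-attention axis.

    Hypothetical helper: categories are compared lower-cased, as the PR
    says the scorer does.
    """
    fp = fn = 0
    for label, predicted in pairs:
        label_hit = label.lower() in axis
        pred_hit = predicted.lower() in axis
        if pred_hit and not label_hit:
            fp += 1  # flagged for attention, but the label says otherwise
        elif label_hit and not pred_hit:
            fn += 1  # needed attention, but the prediction missed it
    return fp, fn


# Made-up (label, prediction) pairs in the schema-2.0 vocabulary.
pairs = [
    ("NEEDS_RESPONSE", "FYI"),   # a missed email that needs a reply
    ("FYI", "NEEDS_RESPONSE"),   # a false alarm
    ("URGENT", "URGENT"),        # correct
]

print(attention_errors(pairs, OLD_AXIS))  # (0, 0): both errors are invisible
print(attention_errors(pairs, NEW_AXIS))  # (1, 1): both errors are counted
```

With the old set, no relabeled email ever matches `actionable`, so every NEEDS_RESPONSE miss or false alarm drops off the axis without any failure being raised.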
Verified on AMD Strix Halo (Gemma-4-E4B, real
🔍 Verification details + a residual quality note
Run: current 5-bucket email agent over the unchanged
Per-bucket recall (this PR's labels): FYI 52% · NEEDS_RESPONSE 46% · URGENT 40% · PROMOTIONAL 29%.
Residual (non-blocking) note: the relabel is a 1:1 rename by count.
🟡 The new test_corpus_vocab_matches_scorer_taxonomy imports gaia_agent_email without a skip guard. Every other test in this repo that touches gaia_agent_email first calls pytest.importorskip("gaia_agent_email").

🔍 Technical details

    def test_corpus_vocab_matches_scorer_taxonomy(labels: dict) -> None:
        from gaia_agent_email.tools.triage_heuristics import ALL_CATEGORIES
        ...

Without a skip guard, the test errors instead of skipping wherever gaia_agent_email is not installed. Fix: add as the first line of the function body (or at module level):

    pytest.importorskip("gaia_agent_email")
    from gaia_agent_email.tools.triage_heuristics import ALL_CATEGORIES
After #1875 relabeled the eval corpus to the schema-2.0 triage taxonomy, the email agent's predictions and the ground-truth labels share one vocabulary, so category_accuracy now measures real agreement: 0.40 over 25 of 220 emails -> aggregate 40.0/100 (was 4.0, a labeling artifact). Fresh gaia eval benchmark run on AMD Strix Halo. Drop the now-resolved #1874 caveat from the adapter methodology + README; align the dataset description to the schema-2.0 taxonomy.
The email-triage eval corpus was scored against the wrong vocabulary. The agent emits the schema-2.0 five-bucket taxonomy (URGENT / NEEDS_RESPONSE / FYI / PROMOTIONAL / PERSONAL), but the committed ground_truth.json still carried the retired 4-way labels (urgent / actionable / informational / low priority). Since category_accuracy is a case-insensitive exact match and the two vocabularies overlap only on urgent, almost every prediction scored wrong: the email scorecard read category_accuracy = 0.04 purely as a labeling artifact, not real quality. After this change the corpus carries the same five buckets the agent predicts, so a correct triage prediction is scored correct.

The corpus generator already maps to the schema-2.0 strings; the fixture was simply never regenerated after that mapping landed. Regenerating is deterministic: the .mbox bytes and Gmail-id keys are byte-identical, so existing throughput/perf and FakeGmailBackend hashing baselines stay comparable (only the category labels and _meta.taxonomy change).

Closes #1874
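The scoring artifact described above can be reproduced with a few lines. This is a sketch under stated assumptions, not the code in quality_metrics.py: the function name `exact_match_accuracy`, the message ids, and the four sample emails are made up for illustration. Only the two vocabularies come from this PR.

```python
from typing import Dict


def exact_match_accuracy(labels: Dict[str, str], predictions: Dict[str, str]) -> float:
    """Fraction of labeled ids whose prediction equals the label, ignoring case.

    Hypothetical stand-in for a case-insensitive exact-match metric.
    """
    if not labels:
        return 0.0
    hits = sum(
        1
        for msg_id, label in labels.items()
        if predictions.get(msg_id, "").lower() == label.lower()
    )
    return hits / len(labels)


# The agent predicts in the schema-2.0 vocabulary (ids are illustrative).
predictions = {
    "m1": "URGENT",
    "m2": "NEEDS_RESPONSE",
    "m3": "FYI",
    "m4": "PROMOTIONAL",
}

# Same four emails labeled with the retired 4-way vocabulary ...
old_labels = {
    "m1": "urgent",
    "m2": "actionable",
    "m3": "informational",
    "m4": "low priority",
}
# ... and with the schema-2.0 vocabulary after regeneration.
new_labels = dict(predictions)

print(exact_match_accuracy(old_labels, predictions))  # 0.25: only "urgent" overlaps
print(exact_match_accuracy(new_labels, predictions))  # 1.0
```

Even when every prediction is semantically right, the old labels can only match on the one shared string, so the metric is capped by the share of urgent emails rather than by triage quality.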
Test plan
- PYTHONPATH=...:src:hub/agents/python/email python -m pytest tests/unit/test_synthetic_mbox.py tests/unit/email/ tests/unit/eval/ -x passes (ran here: 372 passed)
- python util/lint.py --all passes
- python tests/fixtures/email/generate_mbox.py --verify prints VERIFY OK (committed fixtures match deterministic output, mbox hash unchanged)
- Run gaia eval benchmark --mbox-path tests/fixtures/email/synthetic_inbox.mbox --ground-truth tests/fixtures/email/ground_truth.json --limit 25 --output-dir <dir> and confirm category_accuracy moves from ~0.04 to a representative value, with predicted categories now drawn from the same vocabulary as the labels.
- Re-record the baselines (tests/fixtures/email/baseline_accuracy*.json via score_baseline.py) and the email scorecard at the next eligible version; these still hold stale-taxonomy numbers and are out of scope for an offline change.

🔍 Technical details
Root cause. tests/fixtures/email/generate_mbox.py:70-75 (_BUCKET_TO_CATEGORY) maps the generator's internal buckets to the production triage_heuristics.ALL_CATEGORIES strings, and line 365 routes every label through it. The committed ground_truth.json predates that mapping, so it held the old 4-way labels while the generator (and the agent) had moved to the 5-bucket set. src/gaia/eval/quality_metrics.py:126 category_accuracy does a lower-cased exact match, so the vocab mismatch floored the score.

Changes:
- tests/fixtures/email/ground_truth.json: regenerated via the deterministic generator (SEED=23023). Verified the .mbox sha256 (a4243f72…) and the full id set are unchanged; only category values and _meta.taxonomy differ. New realized counts: URGENT 47, NEEDS_RESPONSE 56, FYI 80, PROMOTIONAL 37 (PERSONAL is not yet populated in the synthetic corpus; this is pre-existing and tracked by #1438, "test(email): re-record the categorization baseline on the canonical CI backend").
- src/gaia/eval/quality_metrics.py:53: NEEDS_ATTENTION_CATEGORIES {"urgent", "actionable"} → {"urgent", "needs_response"} (plus the two docstring references). Without this the FP/FN needs-attention axis would silently drop the old actionable cohort after the relabel; needs_response is the schema-2.0 successor. Compared lower-cased, matching the scorer.
- tests/unit/test_synthetic_mbox.py: updated test_category_coverage_and_counts and test_meta_block_present to the 5-bucket strings, and added test_corpus_vocab_matches_scorer_taxonomy (AC #4), which asserts that the committed corpus vocabulary and the scorer's attention axis both stay a subset of ALL_CATEGORIES. This is a drift guard so a future taxonomy change can't silently re-break scoring.

Verification run here: 372 email+eval unit tests pass; generate_mbox.py --verify self-checks clean; black + isort clean. The inline-GT unit tests in test_quality_metrics.py / test_benchmark.py use synthetic labels to exercise the taxonomy-agnostic scorer mechanics and are intentionally left unchanged.
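The drift guard boils down to two subset checks. The sketch below shows the idea rather than the test as committed. It assumes a ground-truth layout of Gmail-id keys mapping to objects with a `category` field, plus underscore-prefixed metadata keys such as `_meta`; that layout, the helper name `vocabulary_drift`, and the sample values are assumptions. The constants are local stand-ins for the real ALL_CATEGORIES and NEEDS_ATTENTION_CATEGORIES.

```python
from typing import Dict, Iterable, List, Tuple

# Local stand-ins for the production constants named in this PR.
ALL_CATEGORIES = ("URGENT", "NEEDS_RESPONSE", "FYI", "PROMOTIONAL", "PERSONAL")
NEEDS_ATTENTION_CATEGORIES = {"urgent", "needs_response"}


def vocabulary_drift(
    labels: Dict[str, dict],
    attention_axis: Iterable[str],
    taxonomy: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Return (corpus categories, attention categories) missing from the taxonomy.

    Hypothetical helper; everything is compared lower-cased. Keys starting
    with "_" (for example "_meta") are treated as metadata and skipped.
    """
    allowed = {c.lower() for c in taxonomy}
    corpus_vocab = {
        entry["category"].lower()
        for key, entry in labels.items()
        if not key.startswith("_")
    }
    axis_vocab = {c.lower() for c in attention_axis}
    return sorted(corpus_vocab - allowed), sorted(axis_vocab - allowed)


# Made-up fixture fragments in the assumed layout.
current = {
    "_meta": {"taxonomy": "schema-2.0"},
    "id1": {"category": "URGENT"},
    "id2": {"category": "FYI"},
}
stale = {"id1": {"category": "actionable"}}

print(vocabulary_drift(current, NEEDS_ATTENTION_CATEGORIES, ALL_CATEGORIES))
# ([], [])
print(vocabulary_drift(stale, {"urgent", "actionable"}, ALL_CATEGORIES))
# (['actionable'], ['actionable'])
```

A test built on this would assert that both returned lists are empty, so a stale fixture or a stale attention set fails loudly instead of quietly depressing the score.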