
fix(eval): relabel email triage corpus to schema-2.0 5-bucket taxonomy#1875

Merged: itomek merged 3 commits into main from autofix/issue-1874 on Jun 26, 2026


@github-actions

Contributor

The email-triage eval corpus was scored against the wrong vocabulary. The agent emits the schema-2.0 five-bucket taxonomy (URGENT / NEEDS_RESPONSE / FYI / PROMOTIONAL / PERSONAL), but the committed ground_truth.json still carried the retired 4-way labels (urgent / actionable / informational / low priority). Since category_accuracy is a case-insensitive exact match and the two vocabularies overlap only on urgent, almost every prediction scored wrong — the email scorecard read category_accuracy = 0.04 purely as a labeling artifact, not real quality. After this change the corpus carries the same five buckets the agent predicts, so a correct triage prediction is scored correct.
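To see why the mismatch floors the metric, here is a minimal sketch of a lower-cased exact-match scorer. The function name `exact_match_accuracy` and the four sample ids are hypothetical and only illustrate the mechanism described above; the real implementation lives in quality_metrics.py and may differ in detail.

```python
def exact_match_accuracy(predicted: dict, truth: dict) -> float:
    """Fraction of shared ids whose labels are equal after lower-casing."""
    shared = predicted.keys() & truth.keys()
    if not shared:
        return 0.0
    hits = sum(predicted[i].lower() == truth[i].lower() for i in shared)
    return hits / len(shared)


# Hypothetical predictions in the schema-2.0 vocabulary.
predicted = {"m1": "URGENT", "m2": "NEEDS_RESPONSE", "m3": "FYI", "m4": "PROMOTIONAL"}

# The same four mails labeled with the retired 4-way vocabulary...
old_truth = {"m1": "urgent", "m2": "actionable", "m3": "informational", "m4": "low priority"}
# ...and with the five-bucket vocabulary the agent actually emits.
new_truth = {"m1": "URGENT", "m2": "NEEDS_RESPONSE", "m3": "FYI", "m4": "PROMOTIONAL"}

old_score = exact_match_accuracy(predicted, old_truth)  # only "urgent" can ever match
new_score = exact_match_accuracy(predicted, new_truth)
print(old_score, new_score)
```

Even with every prediction semantically right, the old labels can only be matched on the one shared string, which is the artifact this PR removes.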

The corpus generator already maps to the schema-2.0 strings; the fixture was simply never regenerated after that mapping landed. Regenerating is deterministic — the .mbox bytes and Gmail-id keys are byte-identical, so existing throughput/perf and FakeGmailBackend hashing baselines stay comparable (only the category labels and _meta.taxonomy change).
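The byte-stability claim can be checked roughly as follows. This is a sketch under stated assumptions: it treats ground_truth.json as a dict keyed by Gmail id with one extra `_meta` entry, and the helper name `relabel_is_byte_stable` is hypothetical. The actual check in this PR is `generate_mbox.py --verify`.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of the raw bytes, as used to compare the .mbox before and after."""
    return hashlib.sha256(data).hexdigest()


def relabel_is_byte_stable(mbox_before: bytes, mbox_after: bytes,
                           gt_before: dict, gt_after: dict) -> bool:
    """True when the mbox bytes and the id key set are unchanged by the relabel."""
    same_bytes = sha256_hex(mbox_before) == sha256_hex(mbox_after)
    # Assumption: every top-level key except "_meta" is a Gmail id.
    same_ids = set(gt_before) - {"_meta"} == set(gt_after) - {"_meta"}
    return same_bytes and same_ids


# Tiny illustrative inputs, not the real fixture contents.
mbox = b"From alice@example.com Thu Jan  1 00:00:00 2026\n\nhello\n"
before = {"_meta": {"taxonomy": "4-way"}, "id1": {"category": "actionable"}}
after = {"_meta": {"taxonomy": "schema-2.0"}, "id1": {"category": "NEEDS_RESPONSE"}}

stable = relabel_is_byte_stable(mbox, mbox, before, after)
print(stable)
```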

Closes #1874

Test plan

  • PYTHONPATH=...:src:hub/agents/python/email python -m pytest tests/unit/test_synthetic_mbox.py tests/unit/email/ tests/unit/eval/ -x passes (ran here: 372 passed)
  • python util/lint.py --all passes
  • python tests/fixtures/email/generate_mbox.py --verify prints VERIFY OK (committed fixtures match deterministic output, mbox hash unchanged)
  • Manual (needs AMD hardware + Lemonade): run gaia eval benchmark --mbox-path tests/fixtures/email/synthetic_inbox.mbox --ground-truth tests/fixtures/email/ground_truth.json --limit 25 --output-dir <dir> and confirm category_accuracy moves from ~0.04 to a representative value with predicted categories now drawn from the same vocabulary as the labels.
  • Follow-up (AC #3, needs hardware): regenerate the real-run baselines (tests/fixtures/email/baseline_accuracy*.json via score_baseline.py) and the email scorecard at the next eligible version. These still hold stale-taxonomy numbers and are out of scope for an offline change.

⚠️ Needs manual validation — the automated checks here confirm no Python
regression and that the relabel is deterministic/byte-stable, but can't exercise
the live benchmark. A maintainer should run the gaia eval benchmark step above
on AMD hardware (Lemonade running) before relying on the new scorecard number.

🔍 Technical details

Root cause. tests/fixtures/email/generate_mbox.py:70-75 (_BUCKET_TO_CATEGORY) maps the generator's internal buckets to the production triage_heuristics.ALL_CATEGORIES strings, and line 365 routes every label through it. The committed ground_truth.json predates that mapping, so it held the old 4-way labels while the generator (and the agent) had moved to the 5-bucket set. src/gaia/eval/quality_metrics.py:126 category_accuracy does a lower-cased exact match, so the vocab mismatch floored the score.

Changes:

  • tests/fixtures/email/ground_truth.json — regenerated via the deterministic generator (SEED=23023). Verified the .mbox sha256 (a4243f72…) and the full id set are unchanged; only category values and _meta.taxonomy differ. New realized counts: URGENT 47, NEEDS_RESPONSE 56, FYI 80, PROMOTIONAL 37 (PERSONAL not yet populated in the synthetic corpus — pre-existing, tracked by test(email): re-record the categorization baseline on the canonical CI backend #1438).
  • src/gaia/eval/quality_metrics.py:53: NEEDS_ATTENTION_CATEGORIES changes from {"urgent", "actionable"} to {"urgent", "needs_response"} (plus the two docstring references). Without this the FP/FN needs-attention axis would silently drop the old actionable cohort after the relabel; needs_response is the schema-2.0 successor. Compared lower-cased, matching the scorer.
  • tests/unit/test_synthetic_mbox.py: updated test_category_coverage_and_counts and test_meta_block_present to the 5-bucket strings, and added test_corpus_vocab_matches_scorer_taxonomy (AC #4), which asserts that the committed corpus vocabulary and the scorer's attention axis both stay a subset of ALL_CATEGORIES. This is a drift guard so a future taxonomy change can't silently re-break scoring.
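The drift guard described in the last bullet boils down to a subset check. The sketch below is not the committed test: `ALL_CATEGORIES` is a local stand-in for the production constant (its real container type is an assumption), and `out_of_vocab` is a hypothetical helper.

```python
# Stand-in for triage_heuristics.ALL_CATEGORIES; values taken from the PR description.
ALL_CATEGORIES = ("URGENT", "NEEDS_RESPONSE", "FYI", "PROMOTIONAL", "PERSONAL")
NEEDS_ATTENTION_CATEGORIES = {"urgent", "needs_response"}


def out_of_vocab(labels, production=ALL_CATEGORIES):
    """Return the lower-cased labels that are not in the production taxonomy."""
    allowed = {c.lower() for c in production}
    return {label.lower() for label in labels} - allowed


new_corpus = ["URGENT", "NEEDS_RESPONSE", "FYI", "PROMOTIONAL"]
old_corpus = ["urgent", "actionable", "informational", "low priority"]

new_drift = out_of_vocab(new_corpus)                  # empty: corpus is a subset
old_drift = out_of_vocab(old_corpus)                  # the retired labels show up here
attention_drift = out_of_vocab(NEEDS_ATTENTION_CATEGORIES)
print(new_drift, old_drift, attention_drift)
```

A test built on this idea would assert that both `new_drift` and `attention_drift` are empty, so the pre-PR corpus would have failed it immediately.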

Verification run here: 372 email+eval unit tests pass; generate_mbox.py --verify self-checks clean; black + isort clean. The inline-GT unit tests in test_quality_metrics.py / test_benchmark.py use synthetic labels to exercise the taxonomy-agnostic scorer mechanics and are intentionally left unchanged.

The committed ground_truth.json carried the retired 4-way taxonomy
(urgent/actionable/informational/low priority) while the agent emits the
schema-2.0 5-bucket set (URGENT/NEEDS_RESPONSE/FYI/PROMOTIONAL/PERSONAL),
so the benchmark's exact-match category_accuracy collapsed to ~0.04 — a
vocabulary artifact, not real quality.

The generator already maps to the schema-2.0 strings; regenerating
relabels the corpus deterministically (mbox bytes and Gmail-id keys
unchanged, so perf/hashing baselines stay comparable). Update the scorer's
needs-attention axis (actionable -> needs_response) so the FP/FN axis stays
coherent, refresh the corpus assertions, and add a drift guard asserting the
corpus vocabulary stays a subset of the production taxonomy.

Closes #1874
@itomek

itomek commented Jun 26, 2026

Collaborator

Verified on AMD Strix Halo (Gemma-4-E4B, real gaia eval benchmark run): this relabel fixes the #1874 artifact. Scoring the same 50 agent predictions against both label sets, category_accuracy goes 0.04 → 0.46 — it's now a real measure of agreement, not a vocabulary mismatch. Recommend merge; one follow-up worth noting below.

🔍 Verification details + a residual quality note

Run: current 5-bucket email agent over the unchanged synthetic_inbox.mbox, 50/220 emails, scored against the old labels vs this PR's labels (identical predictions):

| Scored against | category_accuracy |
| --- | --- |
| old 4-way labels (pre-PR) | 2/50 = 0.04 |
| this PR's 5-bucket labels | 23/50 = 0.46 |

Per-bucket recall (this PR's labels): FYI 52% · NEEDS_RESPONSE 46% · URGENT 40% · PROMOTIONAL 29%.
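For reference, per-bucket recall of this kind can be computed as below. This is a generic sketch, not the benchmark's own code; the helper name `per_bucket_recall` and the four sample pairs are made up for illustration and are not the 50 predictions from this run.

```python
from collections import Counter


def per_bucket_recall(pairs):
    """pairs: iterable of (true_label, predicted_label). Returns recall per true bucket."""
    support = Counter()  # how many mails truly belong to each bucket
    hits = Counter()     # how many of those were predicted correctly
    for truth, pred in pairs:
        support[truth] += 1
        if pred.lower() == truth.lower():
            hits[truth] += 1
    return {bucket: hits[bucket] / support[bucket] for bucket in support}


# Illustrative data only.
pairs = [
    ("FYI", "FYI"),
    ("FYI", "NEEDS_RESPONSE"),
    ("URGENT", "URGENT"),
    ("PROMOTIONAL", "FYI"),
]
recall = per_bucket_recall(pairs)
print(recall)
```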

Residual (non-blocking) note: the relabel is a 1:1 rename by count — informational→FYI, actionable→NEEDS_RESPONSE, urgent→URGENT, low priority→PROMOTIONAL (old/new counts are identical: 80/56/47/37). The first three are sound; low priority→PROMOTIONAL is the weakest bucket (29%) — those former low-priority mails aren't actually promotional (the agent splits them across FYI/NEEDS_RESPONSE/PERSONAL), and the 5th bucket PERSONAL is unused in the corpus. The dominant error is genuine FYI↔NEEDS_RESPONSE boundary confusion by the model, not labeling. A content-based relabel of the former low-priority set (into PROMOTIONAL/FYI/PERSONAL by actual content) would lift the ceiling further — good as a follow-up, doesn't block this fix.
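The "1:1 rename by count" observation can be reproduced with a small check. The mapping and counts below are the ones quoted in this thread; the `RENAME` dict and `is_pure_rename` helper are hypothetical names used only for this sketch.

```python
from collections import Counter

# Old 4-way label -> schema-2.0 bucket, as described in the note above.
RENAME = {
    "informational": "FYI",
    "actionable": "NEEDS_RESPONSE",
    "urgent": "URGENT",
    "low priority": "PROMOTIONAL",
}


def is_pure_rename(old_labels: dict, new_labels: dict, rename: dict) -> bool:
    """True when every id keeps its place and its new label is the renamed old label."""
    if old_labels.keys() != new_labels.keys():
        return False
    return all(rename.get(old_labels[i]) == new_labels[i] for i in old_labels)


# Counts quoted in the PR: 80/56/47/37.
old_counts = Counter({"informational": 80, "actionable": 56, "urgent": 47, "low priority": 37})
new_counts = Counter({RENAME[label]: n for label, n in old_counts.items()})
total = sum(new_counts.values())
print(new_counts, total)
```

Because the rename never inspects message content, no mail can land in PERSONAL, which is consistent with that bucket being empty in the corpus.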

github-actions bot added the eval (Evaluation framework changes), tests (Test changes), and performance (Performance-critical changes) labels on Jun 26, 2026
@github-actions

Contributor Author

🟡 The new test_corpus_vocab_matches_scorer_taxonomy test imports from gaia_agent_email without a skip guard, so it will error (not skip) on any standard dev environment that doesn't have the hub package installed — breaking pytest tests/unit/ for everyone who hasn't installed gaia-agent-email.

Every other test in this repo that touches gaia_agent_email guards with pytest.importorskip first. Add the same guard inside the new test before the import:

pytest.importorskip("gaia_agent_email")
🔍 Technical details

tests/unit/test_synthetic_mbox.py — the new test function does:

def test_corpus_vocab_matches_scorer_taxonomy(labels: dict) -> None:
    from gaia_agent_email.tools.triage_heuristics import ALL_CATEGORIES
    ...

Without a skip guard, ImportError causes an ERROR result (not a graceful skip) when gaia-agent-email isn't installed. The established repo pattern (see tests/unit/email/test_corpus_integrity.py:33, tests/unit/test_email_cli.py:18, tests/unit/test_agents_split.py:9) is pytest.importorskip("gaia_agent_email") before any import from that package.

Fix — add as the first line of the function body (or at module level):

pytest.importorskip("gaia_agent_email")
from gaia_agent_email.tools.triage_heuristics import ALL_CATEGORIES

@itomek itomek self-assigned this Jun 26, 2026
@github-actions

Contributor Author

🟡 The new test_corpus_vocab_matches_scorer_taxonomy test will throw ImportError in any environment that doesn't have gaia-agent-email installed — including the base CI run and any developer using only the core package. Every other unit test that touches gaia_agent_email in this repo gates on pytest.importorskip("gaia_agent_email") first; this one is missing that guard.

🔍 Technical details

tests/unit/test_synthetic_mbox.py (new test, ~line 2063):

def test_corpus_vocab_matches_scorer_taxonomy(labels: dict) -> None:
    from gaia_agent_email.tools.triage_heuristics import ALL_CATEGORIES  # no skip guard

gaia-agent-email is an optional extra (setup.py: "agent-email": ["gaia-agent-email"]), not part of the core install. The established pattern used in every other unit test that imports from this package — tests/unit/email/test_triage_heuristics.py, test_pre_scan_summary_fix.py, test_phishing_precision.py — is pytest.importorskip("gaia_agent_email") at module or function scope before the deferred import.

Fix: add pytest.importorskip("gaia_agent_email") as the first line of the test function body.

@itomek itomek added this pull request to the merge queue Jun 26, 2026
Merged via the queue into main with commit e379441 Jun 26, 2026
35 checks passed
@itomek itomek deleted the autofix/issue-1874 branch June 26, 2026 16:29
itomek pushed a commit that referenced this pull request Jun 26, 2026
After #1875 relabeled the eval corpus to the schema-2.0 triage taxonomy, the
email agent's predictions and the ground-truth labels share one vocabulary, so
category_accuracy now measures real agreement: 0.40 over 25 of 220 emails ->
aggregate 40.0/100 (was 4.0, a labeling artifact). Fresh gaia eval benchmark run
on AMD Strix Halo. Drop the now-resolved #1874 caveat from the adapter
methodology + README; align the dataset description to the schema-2.0 taxonomy.

Labels

eval Evaluation framework changes performance Performance-critical changes tests Test changes


Development

Successfully merging this pull request may close these issues.

fix(eval): email triage corpus ground-truth labels stale vs schema-2.0 5-bucket taxonomy
