Add production eCPS baseline resolver with integrity gate #1164

Open
MaxGhenis wants to merge 4 commits into main from claude/production-ecps-baseline

@MaxGhenis
Contributor

Fixes #1163

Summary

Adds a production eCPS baseline resolver so anything comparing against "the eCPS" always uses the same verified, pinned production dataset instead of whatever local enhanced_cps_2024.h5 happens to be on disk.

policyengine_us_data/utils/production_baseline.py:

  • resolve_production_ecps(version=None) — fetches the HF-published enhanced_cps_2024.h5 pinned to the installed policyengine-us-data version (uploads tag the HF commit with the version) into the local Hugging Face cache, and returns the path plus provenance (repo, revision, sha256, integrity checks).
  • assert_baseline_intact(path, required_nonzero_columns=...) — fails loudly with BaselineIntegrityError if a required column is missing or all-zero. Defaults to the four Social Security components + employment_income_before_lsr.
  • CLI: python -m policyengine_us_data.utils.production_baseline [--json]; make production-ecps.
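
The gate's contract can be sketched in plain Python. This is a minimal illustration, not the code in production_baseline.py: it assumes the dataset is already loaded into a dict that maps each column name to either a flat list of values or a period-keyed dict of values, whereas the real function takes a path to an .h5 file. Only the names assert_baseline_intact, BaselineIntegrityError, required_nonzero_columns, and the two column names come from this PR; the helper and the sample data are hypothetical.

```python
from typing import Iterable, Mapping, Sequence, Union

# A column is either flat values or a period-keyed group such as {"2024": [...]}.
Column = Union[Sequence[float], Mapping[str, Sequence[float]]]


class BaselineIntegrityError(RuntimeError):
    """Raised when a required column is missing or all-zero."""


def _values(column: Column) -> list:
    # Hypothetical helper: flatten a period-keyed group into one list.
    if isinstance(column, Mapping):
        return [v for period_values in column.values() for v in period_values]
    return list(column)


def assert_baseline_intact(
    data: Mapping[str, Column],
    required_nonzero_columns: Iterable[str],
) -> None:
    problems = []
    for name in required_nonzero_columns:
        if name not in data:
            problems.append(f"{name}: missing")
        elif not any(v != 0 for v in _values(data[name])):
            problems.append(f"{name}: all-zero")
    if problems:
        # Report every failing column at once instead of stopping at the first.
        raise BaselineIntegrityError("; ".join(problems))


# Illustrative data only, not real eCPS values.
healthy = {
    "social_security_retirement": {"2024": [0.0, 18_000.0]},
    "employment_income_before_lsr": [52_000.0, 0.0],
}
broken = {**healthy, "social_security_retirement": {"2024": [0.0, 0.0]}}
required = ["social_security_retirement", "employment_income_before_lsr"]

assert_baseline_intact(healthy, required)  # passes silently
try:
    assert_baseline_intact(broken, required)
except BaselineIntegrityError as err:
    message = str(err)
```

Collecting all problems before raising is a design choice of this sketch; it means one failed run names every broken column.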

Why

A local rebuild's enhanced_cps_2024.h5 recently lost social_security_retirement (dropped at the extended-CPS step), leaving a comparison baseline ~64% short on total Social Security and 100% short on SS retirement. A diagnostic then scored a candidate dataset as "winning" on Social Security purely because the baseline was broken. The published production eCPS for that same version is intact — only the local copy was broken. This change turns that silent failure mode into a hard error and gives every consumer one verified, pinned way to resolve the baseline.
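
The arithmetic behind that false "win" can be made concrete with a small worked example. The component totals below are hypothetical (arbitrary units, chosen only so that retirement is 64% of the total, matching the shortfall described above), the candidate's 10% shortfall is invented, and the intact baseline is assumed to equal the target.

```python
# Hypothetical totals in arbitrary units; not real eCPS figures.
intact_baseline = {
    "retirement": 64.0,
    "disability": 20.0,
    "survivors": 10.0,
    "dependents": 6.0,
}
broken_baseline = {**intact_baseline, "retirement": 0.0}  # retirement dropped

target = sum(intact_baseline.values())  # assumption: intact baseline hits the target
candidate_total = 90.0                  # hypothetical candidate, 10% short


def shortfall(total: float) -> float:
    """Fraction by which a total falls short of the target."""
    return (target - total) / target


broken_gap = shortfall(sum(broken_baseline.values()))  # 64% short
intact_gap = shortfall(sum(intact_baseline.values()))  # no shortfall
candidate_gap = shortfall(candidate_total)             # 10% short

# Against the broken baseline the candidate looks better...
candidate_wins_vs_broken = candidate_gap < broken_gap
# ...but against the intact baseline it does not.
candidate_wins_vs_intact = candidate_gap < intact_gap
```

The candidate's own error never changes; only the quality of the baseline it is compared against does.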

Testing

  • tests/unit/utils/test_production_baseline.py: the integrity gate passes on a healthy dataset, raises on a missing column or an all-zero column (the exact recurring bug), and handles period-keyed groups; the resolver returns provenance and propagates integrity failures (Hugging Face download mocked, so no network access).
  • Verified end-to-end against real Hugging Face: resolving the production pin fetched enhanced_cps_2024.h5 and the gate passed (all four SS components non-zero).
  • ruff format --check, ruff check, and scripts/run_quality_guards.py pass.
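
The mocked-download pattern in the unit tests might look roughly like the following. This is a sketch under stated assumptions, not the test file in this PR: resolve is a hypothetical stand-in that takes the download function as an argument, and the keyword names filename and revision are illustrative.

```python
import hashlib
import tempfile
from pathlib import Path
from unittest import mock


def resolve(download, version: str) -> dict:
    """Hypothetical stand-in for the resolver: fetch, then record provenance."""
    path = Path(download(filename="enhanced_cps_2024.h5", revision=version))
    return {
        "path": str(path),
        "revision": version,
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
    }


def test_resolver_returns_provenance_without_network() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        fake_file = Path(tmp) / "enhanced_cps_2024.h5"
        fake_file.write_bytes(b"not a real dataset")
        # The mock replaces the download, so nothing touches the network.
        fake_download = mock.Mock(return_value=str(fake_file))

        result = resolve(fake_download, version="0.0.0")

        fake_download.assert_called_once_with(
            filename="enhanced_cps_2024.h5", revision="0.0.0"
        )
        expected = hashlib.sha256(b"not a real dataset").hexdigest()
        assert result["revision"] == "0.0.0"
        assert result["sha256"] == expected
        return result


provenance = test_resolver_returns_provenance_without_network()
```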

Opening as draft per repo convention.

🤖 Generated with Claude Code

Max Ghenis and others added 3 commits June 4, 2026 21:51
Anything comparing against "the eCPS" (microplex, audits, replacement
diagnostics) could silently use a stale or broken *local* enhanced_cps_2024.h5.
This adds policyengine_us_data.utils.production_baseline so the canonical
production baseline is always resolved the same way:

- resolve_production_ecps() fetches the HF-published enhanced CPS pinned to the
  installed package version (uploads tag the HF commit with the version) into
  the local HF cache, and returns the path plus provenance (repo, revision,
  sha256, checks).
- assert_baseline_intact() fails loudly if a required column is missing or
  all-zero -- the recurring failure mode, most recently the extended-CPS step
  dropping social_security_retirement, which left a comparison baseline 64%
  short on total Social Security.
- `python -m policyengine_us_data.utils.production_baseline` and
  `make production-ecps` print the verified path or full JSON provenance.

Tests cover the gate (pass, missing column, all-zero column, period-keyed
groups) and the resolver with a mocked download.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
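
The two output modes of the CLI (verified path by default, full provenance with --json) can be sketched with argparse. Everything except the --json flag is an assumption here: the injected resolve callable, the placeholder path, and the provenance keys are hypothetical and exist only so the sketch runs offline.

```python
import argparse
import json


def _fake_resolve() -> dict:
    # Hypothetical provenance; the real resolver downloads and verifies a file.
    return {
        "path": "/tmp/hf-cache/enhanced_cps_2024.h5",
        "revision": "0.0.0",
    }


def main(argv=None, resolve=_fake_resolve) -> str:
    parser = argparse.ArgumentParser(prog="production_baseline")
    parser.add_argument(
        "--json", action="store_true", help="print full provenance as JSON"
    )
    args = parser.parse_args(argv)
    info = resolve()
    # Default: just the verified path, convenient for shell substitution.
    output = json.dumps(info, indent=2, sort_keys=True) if args.json else info["path"]
    print(output)
    return output


path_only = main([])
as_json = main(["--json"])
```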
Only social_security_retirement was guarded; a build dropping disability,
survivors, or dependents (as the extended-CPS step once dropped retirement)
would still publish an eCPS that under-counts Social Security. Require all
four components to be present and non-zero.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Gate only the robustly-populated Social Security components (retirement and
  disability). survivors/dependents are sparse and stay zero under the
  imputation fallback (_age_heuristic_ss_shares), so hard-gating them could
  reject legitimate builds.
- Align the upload validator to {retirement, disability} and correct its
  comment: REQUIRED_VARIABLES_BY_FILENAME is checked for presence/length, not
  non-zero.
- Clarify that the recorded sha256 is provenance only -- byte integrity is
  already guaranteed by hf_hub_download against the Hub hash; the gate checks
  column content.
- Wrap an unpublished/missing HF revision in a clear BaselineIntegrityError
  (pointing at the version override) instead of an opaque Hub traceback, with a
  regression test.
- Default the gate tests to the period-keyed-group layout real files use, keep
  a flat-dataset case, and rename the changelog fragment to match the PR (1164).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
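
The error-wrapping step for an unpublished revision can be sketched as follows. To keep the example dependency-free, RevisionNotFound is a stand-in for whatever exception the Hub client raises when a revision does not exist; fetch_pinned, the download signature, and the message wording are likewise hypothetical.

```python
class RevisionNotFound(Exception):
    """Stand-in for the Hub client's 'revision not found' error (assumption)."""


class BaselineIntegrityError(RuntimeError):
    """Raised when the production baseline cannot be resolved or verified."""


def fetch_pinned(download, version: str) -> str:
    try:
        return download(revision=version)
    except RevisionNotFound as exc:
        # Replace the opaque traceback with an actionable message, and keep
        # the original exception attached as __cause__ for debugging.
        raise BaselineIntegrityError(
            f"No published eCPS found for version {version!r}. "
            "Pass an explicit version to resolve a published release."
        ) from exc


def unpublished(revision: str) -> str:
    # Simulates a revision that was never uploaded.
    raise RevisionNotFound(revision)


try:
    fetch_pinned(unpublished, "9.9.9")
except BaselineIntegrityError as err:
    wrapped = err
```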
@MaxGhenis MaxGhenis marked this pull request as ready for review June 4, 2026 22:31
Never read or sum weight arrays directly, and never report unweighted record
counts or raw column sums as population figures. Compute population aggregates
via Microsimulation (microdf auto-weights with the household weight); if a
weight must be referenced at all, it is household_weight only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
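
A toy example shows why a raw column sum is not a population figure. The three households, their values, and their weights are invented, and the weights are multiplied by hand only to expose the gap; per the guidance above, real aggregates should come from Microsimulation rather than from code like this.

```python
# Hypothetical three-household sample; values and weights are illustrative.
social_security = [20_000.0, 0.0, 15_000.0]
household_weight = [1_000.0, 4_000.0, 500.0]

# Raw column sum: a property of the sample, not of the population.
raw_sum = sum(social_security)

# Each record stands in for `weight` households, so the population total
# scales every value by its household weight before summing.
weighted_total = sum(v * w for v, w in zip(social_security, household_weight))

# Same problem for counts: 3 records represent 5,500 households here.
record_count = len(social_security)
represented_households = sum(household_weight)
```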
Development

Successfully merging this pull request may close these issues.

Comparisons can silently use a stale or broken local eCPS baseline
