Vintage profiles: single source of truth for source years (kills stale-default footgun)#189

Draft
MaxGhenis wants to merge 5 commits into main from claude/vintage-profile-20260602

MaxGhenis commented Jun 2, 2026

Why

A build's source year was a free parameter re-specified in ~6 places per source — provider arg, checkpoint arg, CLI flag, build script, manifest, and the version-id name (...asec2025-calendar2024...). Six copies that drift. The ACS 2024-vs-2022 saga was exactly that: the default said 2024, the scripts said 2022, the manifest said 2022, and reviewers anchored on the wrong copies.

What — key on (dataset, year), derive everything

microplex_us.vintages:

  • DatasetProfile is addressed by (dataset, model_year): .key == ("mp_ecps", 2024), .name == "mp_ecps_2024", resolve_profile("mp_ecps", 2024).
  • version_id(variant, commit, build_date) derives the canonical build name from the profile, so asec{cps}-calendar{model} in the name cannot disagree with the data. Names become an output of the profile, never hand-typed.
  • source_years() exposes all five years from one place.
  • Release records each source's release and how its dollars reach the model year (native, or age_to with a factor family); __post_init__ enforces coherence: every source must reach model_year or declare a gap_reason.
  • MP_2024: CPS ASEC 2025 (income 2024) native spine · PUF 2015→2024 (SOI) · ACS 2024 · SIPP 2023→2024 · SCF 2022→2024.
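To make the shape of the registry concrete, here is a minimal sketch of how a profile keyed on (dataset, model_year) could look. It is not the code in microplex_us.vintages: the field names `release`, `data_year`, `factors`, and the `reaches` property are assumptions, the `"placeholder"` factor labels stand in for families this description does not name, and the coherence check is placed on the profile for simplicity. Only the `asec{cps}-calendar{model}` segment of the version id comes from the description; the rest of the name layout is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """One source: its release year and how its dollars reach the model year."""

    release: int                   # release (file) year of the source
    data_year: int                 # year the dollar amounts natively describe
    age_to: int | None = None      # target year when aged; None means native
    factors: str | None = None     # factor family used for aging (assumed field)
    gap_reason: str | None = None  # explicit reason for not reaching model_year

    def __post_init__(self) -> None:
        if self.age_to is not None and not self.factors:
            raise ValueError("an aged source must name its factor family")
        if self.gap_reason is not None and not self.gap_reason.strip():
            raise ValueError("gap_reason must not be empty")

    @property
    def reaches(self) -> int:
        """Year the dollars describe after any aging."""
        return self.data_year if self.age_to is None else self.age_to


@dataclass(frozen=True)
class DatasetProfile:
    dataset: str
    model_year: int
    cps: Release
    puf: Release
    acs: Release
    sipp: Release
    scf: Release

    def __post_init__(self) -> None:
        # Coherence: each source reaches model_year or declares why it does not.
        for label in ("cps", "puf", "acs", "sipp", "scf"):
            rel = getattr(self, label)
            if rel.reaches != self.model_year and rel.gap_reason is None:
                raise ValueError(
                    f"{label} reaches {rel.reaches}, not {self.model_year}, "
                    "and declares no gap_reason"
                )

    @property
    def key(self) -> tuple[str, int]:
        return (self.dataset, self.model_year)

    @property
    def name(self) -> str:
        return f"{self.dataset}_{self.model_year}"

    def version_id(self, variant: str, commit: str, build_date: str) -> str:
        # Hypothetical layout: only asec{cps}-calendar{model} is from the PR text.
        prefix = self.dataset.replace("_", "-")
        return (
            f"{prefix}-{variant}-asec{self.cps.release}"
            f"-calendar{self.model_year}-{commit}-{build_date}"
        )


MP_2024 = DatasetProfile(
    dataset="mp_ecps",
    model_year=2024,
    cps=Release(release=2025, data_year=2024),  # ASEC 2025, income year 2024
    puf=Release(release=2015, data_year=2015, age_to=2024, factors="soi"),
    acs=Release(release=2024, data_year=2024),
    sipp=Release(release=2023, data_year=2023, age_to=2024, factors="placeholder"),
    scf=Release(release=2022, data_year=2022, age_to=2024, factors="placeholder"),
)
```

In this sketch, declaring ACS as a bare 2022 release with no `gap_reason` fails at construction time, which is the failure mode the coherence check is meant to surface.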

Threaded through the build:

  • default_policyengine_us_data_rebuild_source_providers and run_policyengine_us_data_rebuild_checkpoint take profile (default MP_2024). The per-source *_year args become None-defaulting overrides that resolve from profile.source_years() — so there are no literal year defaults to drift.
  • The CLI takes --profile mp_ecps_2024; the --cps-source-year / --puf-target-year / --acs-year / --sipp-year / --scf-year flags are removed.
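The None-defaulting override pattern can be sketched as follows. This is an illustration, not the actual provider or checkpoint signature: `StubProfile` and `resolve_years` are hypothetical, and the dictionary keys are assumed to mirror the removed flag names (only `cps_source_year` and `acs_year` are named in the commit messages).

```python
from __future__ import annotations


class StubProfile:
    """Stand-in for a DatasetProfile; only source_years() matters here."""

    def __init__(self, name: str, years: dict[str, int]) -> None:
        self.name = name
        self._years = dict(years)

    def source_years(self) -> dict[str, int]:
        # Return a copy so callers cannot mutate the profile's years.
        return dict(self._years)


MP_2024 = StubProfile(
    "mp_ecps_2024",
    {
        "cps_source_year": 2025,
        "puf_target_year": 2024,
        "acs_year": 2024,
        "sipp_year": 2023,
        "scf_year": 2022,
    },
)


def resolve_years(
    profile: StubProfile = MP_2024,
    *,
    cps_source_year: int | None = None,
    puf_target_year: int | None = None,
    acs_year: int | None = None,
    sipp_year: int | None = None,
    scf_year: int | None = None,
) -> dict[str, int]:
    """Start from the profile's years; apply only explicitly passed overrides."""
    resolved = profile.source_years()
    overrides = {
        "cps_source_year": cps_source_year,
        "puf_target_year": puf_target_year,
        "acs_year": acs_year,
        "sipp_year": sipp_year,
        "scf_year": scf_year,
    }
    resolved.update({k: v for k, v in overrides.items() if v is not None})
    return resolved
```

With no overrides, `resolve_years()` returns exactly the profile's years, so a stale literal has nowhere to live; an explicit `resolve_years(acs_year=2022)` changes only that one entry.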

So the year is now a key you look up, not a value smeared across the call stack and the filesystem.

ACS correction

MP_2024.acs is the native 2024 release (codex #184 default + the acs_2024.h5 donor fallback), confirmed by the latest RC log (donor_source=acs_2024). An earlier revision wrongly pinned it to 2022 by anchoring on the stale manifest/scripts; corrected. ACS is excluded from the manifest-tie because MP loads a local acs_2024.h5 beyond the module's ACS_2022 baseline.

Tests

test_vintages.py (key/version_id/source_years/coherence/Release validation/manifest-tie for SIPP+SCF) + the resolution regression guard (year params default to None; resolved providers carry the profile's years). 132 tests pass across vintages/rebuild/checkpoint/us/cps/donor; ruff clean.
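The first half of the resolution regression guard (year params default to None) could be written roughly like this. The helper and the two stub functions are hypothetical stand-ins for the real provider and checkpoint signatures; the stale literals 2023 and 2024 are the ones described in the commit messages.

```python
import inspect


def literal_year_defaults(func):
    """Return {param: default} for *_year params that carry a literal default."""
    offenders = {}
    for name, param in inspect.signature(func).parameters.items():
        if not name.endswith("_year"):
            continue
        if param.default is inspect.Parameter.empty:
            continue  # a required param has no default to go stale
        if param.default is not None:
            offenders[name] = param.default
    return offenders


def stale_providers(cps_source_year=2023, acs_year=2024):
    """Old shape: literal year defaults that can drift from the real build."""


def profile_providers(profile=None, *, cps_source_year=None, acs_year=None):
    """New shape: years resolve from the profile unless explicitly overridden."""
```

A test can then assert `literal_year_defaults(profile_providers) == {}`, which fails as soon as someone reintroduces a literal year in the signature.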

Follow-ups (codex; those files aren't tracked here)

  • Build scripts pass --profile mp_ecps_2024 and derive the artifact name via profile.version_id(...) instead of hand-assembling it (and drop the stale --acs-year 2022).
  • Optionally remove the now-override *_year params entirely; bind the factors labels to the aging added in #185 (Age SIPP and SCF donors to target year).

🤖 Generated with Claude Code

MaxGhenis and others added 5 commits June 2, 2026 17:12
… years

Source release years were declared as literal defaults in three places (the
provider signature, the checkpoint signature, and the CLI), so they could drift
from the real build: cps_source_year defaulted to 2023 (income year 2022) while
every production build overrode it to 2025 via a shell flag. The stale literal
sat in three signatures and failed open -- nothing errored.

Introduce microplex_us.vintages. A DatasetProfile declares, in ONE place, the
model year a dataset represents and each source's release plus how its dollars
reach that year (native, or aged with a component-specific factor family). A
coherence check asserts every source reaches model_year or declares an explicit
gap_reason. MP_2024 is the current 2024 base dataset: CPS ASEC 2025 (income year
2024) native spine, PUF 2015->2024 via SOI factors, ACS 2024, SIPP 2023->2024,
SCF 2022->2024.

Thread the year defaults through MP_2024 so the value is defined once and the
safe path is the only path; the stale CPS default becomes the profile's 2025. A
regression guard asserts the provider defaults derive from the profile.

Foundation for codex: follow-ups are to drop the per-call --*-year args in favor
of `--profile`, and add a build-time gate that checks a produced artifact against
the active profile (freshness vs latest release + basis coherence).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Independent review found MP_2024.acs declared release 2024 while the ACS donor
loader is pinned to ACS_2022 (manifest default_year=2022) and is excluded from
TARGET_YEAR_UPRATED_SURVEYS (never aged). The real ACS vintage is 2022; the
provider default had silently drifted to 2024 vs what every build script
(--acs-year 2022), the manifest, and the gate1 build log actually load. The
profile enshrining 2024 defeated its own purpose.

- Correct MP_2024.acs to release 2022 with a declared gap_reason. The acs_year
  default now resolves to 2022 (matching the loader/scripts), so the build no
  longer needs to override it; the gap_reason flags that an ACS-2024 move is a
  loader migration, not silently assumed done.
- Add a manifest-tie test asserting each donor release equals the pe_source_
  impute manifest default_year -- catches exactly this profile-vs-loader drift.
- Extend the default-derivation regression guard to the checkpoint signature,
  not just the provider.
- Update the coherence test for the declared ACS gap; harden Release (reject an
  empty gap_reason) and get_profile (chain-free KeyError).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… 2024

Toward making the source year a single key rather than a value smeared across
the provider/checkpoint/CLI/scripts/names:

- Profiles are addressed by (dataset, model_year): DatasetProfile.key and .name
  (mp_ecps_2024), plus resolve_profile(dataset, year).
- version_id(variant, commit, build_date) derives the canonical build name from
  the profile, so the asec{cps}-calendar{model} years in the name cannot drift
  from the data (e.g. mp-ecps-shaped-asec2025-calendar2024-...). Names become an
  output of the profile, never hand-typed.
- source_years() exposes the per-source years from one place, so callers thread
  a profile instead of five loose year args.
- Correct MP_2024.acs to the native 2024 release (codex #184 default + the
  ACS-2024 donor H5 fallback). Drops the earlier 2022 gap that wrongly anchored
  on the stale manifest/scripts. ACS is excluded from the manifest-tie because
  MP loads a local acs_2024.h5 beyond the module's ACS_2022 baseline.

Next (same PR or follow-up): thread `--profile` / source_years() through the CLI
and build scripts, derive the artifact version_id from the profile, and retire
the per-year args and `--*-year` flags.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lags

A build is now keyed on a dataset profile -- the single (dataset, year) key --
instead of five loose year arguments smeared across the call stack:

- default_policyengine_us_data_rebuild_source_providers and
  run_policyengine_us_data_rebuild_checkpoint take `profile` (default MP_2024).
  The per-source *_year arguments become None-defaulting overrides that resolve
  from profile.source_years(), so there are no literal year defaults left to
  drift from the profile.
- The CLI takes `--profile mp_ecps_2024`; the per-source --cps-source-year /
  --puf-target-year / --acs-year / --sipp-year / --scf-year flags are removed and
  the checkpoint resolves the profile via get_profile.
- The regression guard now verifies the year params default to None and that the
  resolved providers carry the profile's years.

The source years (and, via version_id(), the build name) derive from that one
key. Build scripts pass --profile and derive the version-id from the profile
(codex follow-up; those scripts are not tracked in this PR).

132 tests pass across vintages/rebuild/checkpoint/us/cps/donor; ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cover the new CLI surface the review flagged as untested: assert that --profile
resolves through the vintage registry and threads the resolved profile onto the
checkpoint call, and that an unknown profile name fails loudly (KeyError) rather
than silently building the wrong dataset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>