Vintage profiles: single source of truth for source years (kills stale-default footgun)#189
Draft
MaxGhenis wants to merge 5 commits into
… years

Source release years were declared as literal defaults in three places (the provider signature, the checkpoint signature, and the CLI), so they could drift from the real build: cps_source_year defaulted to 2023 (income year 2022) while every production build overrode it to 2025 via a shell flag. The stale literal sat in three signatures and failed open -- nothing errored.

Introduce microplex_us.vintages. A DatasetProfile declares, in ONE place, the model year a dataset represents and each source's release plus how its dollars reach that year (native, or aged with a component-specific factor family). A coherence check asserts every source reaches model_year or declares an explicit gap_reason.

MP_2024 is the current 2024 base dataset: CPS ASEC 2025 (income year 2024) native spine, PUF 2015->2024 via SOI factors, ACS 2024, SIPP 2023->2024, SCF 2022->2024.

Thread the year defaults through MP_2024 so the value is defined once and the safe path is the only path; the stale CPS default becomes the profile's 2025. A regression guard asserts the provider defaults derive from the profile.

Foundation for codex: follow-ups are to drop the per-call --*-year args in favor of `--profile`, and add a build-time gate that checks a produced artifact against the active profile (freshness vs latest release + basis coherence).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
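The coherence rule in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `microplex_us.vintages`: the field names `dollars_year`, `factors`, and `sources`, and the choice to run the check in `DatasetProfile.__post_init__`, are assumptions made for the example. Only the CPS and PUF years are taken from the commit message.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Release:
    """One source's release and the year its dollars end up in (hypothetical shape)."""

    release: int                      # release year of the source file
    dollars_year: int                 # year its dollars represent after any aging
    factors: Optional[str] = None     # factor family used for aging; None means native
    gap_reason: Optional[str] = None  # why the source stops short of the model year


@dataclass(frozen=True)
class DatasetProfile:
    dataset: str
    model_year: int
    sources: Dict[str, Release] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every source must reach model_year or explicitly declare why it does not.
        for name, rel in self.sources.items():
            if rel.dollars_year != self.model_year and not rel.gap_reason:
                raise ValueError(
                    f"{name}: dollars are in {rel.dollars_year}, "
                    f"not {self.model_year}, and no gap_reason is declared"
                )


profile = DatasetProfile(
    dataset="mp_ecps",
    model_year=2024,
    sources={
        "cps": Release(release=2025, dollars_year=2024),                 # native spine
        "puf": Release(release=2015, dollars_year=2024, factors="soi"),  # aged via SOI
    },
)
```

With a check of this shape, a source left at 2022 without a `gap_reason` raises at construction time, which is the opposite of the old literal defaults that failed open.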
Independent review found MP_2024.acs declared release 2024 while the ACS donor loader is pinned to ACS_2022 (manifest default_year=2022) and is excluded from TARGET_YEAR_UPRATED_SURVEYS (never aged). The real ACS vintage is 2022; the provider default had silently drifted to 2024 vs what every build script (--acs-year 2022), the manifest, and the gate1 build log actually load. The profile enshrining 2024 defeated its own purpose.

- Correct MP_2024.acs to release 2022 with a declared gap_reason. The acs_year default now resolves to 2022 (matching the loader/scripts), so the build no longer needs to override it; the gap_reason flags that an ACS-2024 move is a loader migration, not silently assumed done.
- Add a manifest-tie test asserting each donor release equals the pe_source_impute manifest default_year -- catches exactly this profile-vs-loader drift.
- Extend the default-derivation regression guard to the checkpoint signature, not just the provider.
- Update the coherence test for the declared ACS gap; harden Release (reject an empty gap_reason) and get_profile (chain-free KeyError).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
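A manifest-tie check of the kind described above might look like the sketch below. The two dictionaries are hypothetical stand-ins for the profile's donor releases and the manifest's `default_year` values; the real test reads them from the repository. The helper name `manifest_mismatches` is also an assumption.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-ins for the profile's donor releases and the manifest defaults.
PROFILE_RELEASES: Dict[str, int] = {"sipp": 2023, "scf": 2022}
MANIFEST_DEFAULT_YEARS: Dict[str, int] = {"sipp": 2023, "scf": 2022}


def manifest_mismatches(
    profile_releases: Dict[str, int],
    manifest_default_years: Dict[str, int],
) -> Dict[str, Tuple[int, Optional[int]]]:
    """Return {source: (profile_year, manifest_year)} for every disagreement."""
    return {
        name: (year, manifest_default_years.get(name))
        for name, year in profile_releases.items()
        if manifest_default_years.get(name) != year
    }


def test_profile_matches_manifest() -> None:
    # Fails as soon as the profile and the loader's manifest disagree on a year.
    assert manifest_mismatches(PROFILE_RELEASES, MANIFEST_DEFAULT_YEARS) == {}


# The drift the review found: profile says ACS 2024, manifest says 2022.
drift = manifest_mismatches({"acs": 2024}, {"acs": 2022})
```

Returning the mismatches rather than a bare boolean keeps the failure message specific about which source drifted and in which direction.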
… 2024
Toward making the source year a single key rather than a value smeared across
the provider/checkpoint/CLI/scripts/names:
- Profiles are addressed by (dataset, model_year): DatasetProfile.key and .name
(mp_ecps_2024), plus resolve_profile(dataset, year).
- version_id(variant, commit, build_date) derives the canonical build name from
the profile, so the asec{cps}-calendar{model} years in the name cannot drift
from the data (e.g. mp-ecps-shaped-asec2025-calendar2024-...). Names become an
output of the profile, never hand-typed.
- source_years() exposes the per-source years from one place, so callers thread
a profile instead of five loose year args.
- Correct MP_2024.acs to the native 2024 release (codex #184 default + the
ACS-2024 donor H5 fallback). Drops the earlier 2022 gap that wrongly anchored
on the stale manifest/scripts. ACS is excluded from the manifest-tie because
MP loads a local acs_2024.h5 beyond the module's ACS_2022 baseline.
Next (same PR or follow-up): thread `--profile` / source_years() through the CLI
and build scripts, derive the artifact version_id from the profile, and retire
the per-year args and `--*-year` flags.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
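As a rough illustration of names being an output of the profile, the sketch below derives `key`, `name`, and a `version_id` from one object. The `cps_release` field, the `_REGISTRY` dict, and the layout of the id after the `calendar{model}` segment are assumptions for the example; the commit only shows the prefix `mp-ecps-shaped-asec2025-calendar2024-...`.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class DatasetProfile:
    dataset: str
    model_year: int
    cps_release: int  # hypothetical field; the real profile holds a Release per source

    @property
    def key(self) -> Tuple[str, int]:
        return (self.dataset, self.model_year)

    @property
    def name(self) -> str:
        return f"{self.dataset}_{self.model_year}"

    def version_id(self, variant: str, commit: str, build_date: str) -> str:
        # The years come from the profile, so the name cannot disagree with the data.
        stem = self.dataset.replace("_", "-")
        prefix = f"{stem}-{variant}-asec{self.cps_release}-calendar{self.model_year}"
        # The suffix layout (commit, then date) is an assumption for this sketch.
        return f"{prefix}-{commit}-{build_date}"


MP_2024 = DatasetProfile(dataset="mp_ecps", model_year=2024, cps_release=2025)
_REGISTRY: Dict[Tuple[str, int], DatasetProfile] = {MP_2024.key: MP_2024}


def resolve_profile(dataset: str, year: int) -> DatasetProfile:
    return _REGISTRY[(dataset, year)]
```

A caller would then write `resolve_profile("mp_ecps", 2024).version_id("shaped", commit, build_date)` instead of assembling the string by hand.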
…lags

A build is now keyed on a dataset profile -- the single (dataset, year) key -- instead of five loose year arguments smeared across the call stack:

- default_policyengine_us_data_rebuild_source_providers and run_policyengine_us_data_rebuild_checkpoint take `profile` (default MP_2024). The per-source *_year arguments become None-defaulting overrides that resolve from profile.source_years(), so there are no literal year defaults left to drift from the profile.
- The CLI takes `--profile mp_ecps_2024`; the per-source --cps-source-year / --puf-target-year / --acs-year / --sipp-year / --scf-year flags are removed and the checkpoint resolves the profile via get_profile.
- The regression guard now verifies the year params default to None and that the resolved providers carry the profile's years.

The source years (and, via version_id(), the build name) derive from that one key. Build scripts pass --profile and derive the version-id from the profile (codex follow-up; those scripts are not tracked in this PR).

132 tests pass across vintages/rebuild/checkpoint/us/cps/donor; ruff clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
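The None-defaulting override pattern can be sketched as below. The helper `resolve_years` and the mapping from the removed flag names to dictionary keys are assumptions for illustration; the year values are the ones MP_2024 declares in this PR.

```python
from typing import Dict, Optional

# Years MP_2024 declares; the key names mirror the removed CLI flags (an assumption).
MP_2024_SOURCE_YEARS: Dict[str, int] = {
    "cps_source_year": 2025,
    "puf_target_year": 2024,
    "acs_year": 2024,
    "sipp_year": 2023,
    "scf_year": 2022,
}


def resolve_years(
    profile_years: Dict[str, int], **overrides: Optional[int]
) -> Dict[str, int]:
    """Use the profile's year unless the caller passes a non-None override."""
    unknown = set(overrides) - set(profile_years)
    if unknown:
        raise TypeError(f"unknown year arguments: {sorted(unknown)}")
    resolved = {}
    for name, profile_year in profile_years.items():
        override = overrides.get(name)
        resolved[name] = profile_year if override is None else override
    return resolved


defaults = resolve_years(MP_2024_SOURCE_YEARS)
pinned = resolve_years(MP_2024_SOURCE_YEARS, acs_year=2022)
```

Because every parameter defaults to `None`, there is no literal year in the signature that could go stale; the only way to get a non-profile year is to pass one explicitly.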
Cover the new CLI surface the review flagged as untested: assert that --profile resolves through the vintage registry and threads the resolved profile onto the checkpoint call, and that an unknown profile name fails loudly (KeyError) rather than silently building the wrong dataset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
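The behaviour those tests cover might be exercised with something like the following sketch. The registry contents, the `parse_profile` helper, and the choice of `mp_ecps_2024` as the flag's default are assumptions; only the `--profile` flag, `get_profile`, and the chain-free `KeyError` come from the commits above.

```python
import argparse
from typing import Dict, List

# Hypothetical registry; the real one maps names to DatasetProfile objects.
_PROFILES: Dict[str, dict] = {
    "mp_ecps_2024": {"dataset": "mp_ecps", "model_year": 2024},
}


def get_profile(name: str) -> dict:
    try:
        return _PROFILES[name]
    except KeyError:
        known = ", ".join(sorted(_PROFILES))
        # "from None" drops the inner lookup error so the message stays readable.
        raise KeyError(f"unknown profile {name!r}; known: {known}") from None


def parse_profile(argv: List[str]) -> dict:
    parser = argparse.ArgumentParser()
    parser.add_argument("--profile", default="mp_ecps_2024")
    args = parser.parse_args(argv)
    return get_profile(args.profile)
```

Calling `parse_profile(["--profile", "mp_ecps_2024"])` returns the registered profile, while an unregistered name raises `KeyError` before any build work starts.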
Why
A build's source year was a free parameter re-specified in ~6 places per source: provider arg, checkpoint arg, CLI flag, build script, manifest, and the version-id name (`...asec2025-calendar2024...`). Six copies that drift. The ACS 2024-vs-2022 saga was exactly that: the default said 2024, the scripts said 2022, the manifest said 2022, and reviewers anchored on the wrong copies.

What: key on `(dataset, year)`, derive everything
`microplex_us.vintages`:
- `DatasetProfile` is addressed by `(dataset, model_year)`: `.key == ("mp_ecps", 2024)`, `.name == "mp_ecps_2024"`, `resolve_profile("mp_ecps", 2024)`.
- `version_id(variant, commit, build_date)` derives the canonical build name from the profile, so `asec{cps}-calendar{model}` in the name cannot disagree with the data. Names become an output of the profile, never hand-typed.
- `source_years()` exposes all five years from one place.
- `Release` carries each source's release + how its dollars reach the model year (native, or `age_to` with a factor family); `__post_init__` enforces coherence (reach `model_year` or declare a `gap_reason`).
- `MP_2024`: CPS ASEC 2025 (income 2024) native spine · PUF 2015→2024 (SOI) · ACS 2024 · SIPP 2023→2024 · SCF 2022→2024.

Threaded through the build:
- `default_policyengine_us_data_rebuild_source_providers` and `run_policyengine_us_data_rebuild_checkpoint` take `profile` (default `MP_2024`). The per-source `*_year` args become `None`-defaulting overrides that resolve from `profile.source_years()`, so there are no literal year defaults to drift.
- The CLI takes `--profile mp_ecps_2024`; the `--cps-source-year` / `--puf-target-year` / `--acs-year` / `--sipp-year` / `--scf-year` flags are removed.

So the year is now a key you look up, not a value smeared across the call stack and the filesystem.
ACS correction
`MP_2024.acs` is the native 2024 release (codex #184 default + the `acs_2024.h5` donor fallback), confirmed by the latest RC log (`donor_source=acs_2024`). An earlier revision wrongly pinned it to 2022 by anchoring on the stale manifest/scripts; corrected. ACS is excluded from the manifest-tie because MP loads a local `acs_2024.h5` beyond the module's `ACS_2022` baseline.

Tests
`test_vintages.py` (key/version_id/source_years/coherence/Release validation/manifest-tie for SIPP+SCF) + the resolution regression guard (year params default to `None`; resolved providers carry the profile's years). 132 tests pass across vintages/rebuild/checkpoint/us/cps/donor; ruff clean.

Follow-ups (codex; those files aren't tracked here)
- Build scripts pass `--profile mp_ecps_2024` and derive the artifact name via `profile.version_id(...)` instead of hand-assembling it (and drop the stale `--acs-year 2022`).
- Retire the `*_year` params entirely; bind `factors` labels to the "Age SIPP and SCF donors to target year" #185 aging.

🤖 Generated with Claude Code