Merged
23 changes: 18 additions & 5 deletions README.md
@@ -22,7 +22,7 @@ The app is fully static — all TANF benefits are precomputed into JSON files, s
| Data generation | Python, [PolicyEngine US](https://github.com/PolicyEngine/policyengine-us) |
| Hosting | GitHub Pages (via `docs/` folder) |

-**Current data version:** policyengine-us `1.598.0`
+**Current data version:** policyengine-us `1.715.3` + [Indiana TANF fix #8543](https://github.com/PolicyEngine/policyengine-us/pull/8543), tax year 2026; all 56 data files regenerated.

### Precomputed data grid

@@ -33,7 +33,10 @@ The app is fully static — all TANF benefits are precomputed into JSON files, s
| Adults | 1–2 | — |
| Children | 0–7 | — |

-This produces 15,376 simulations per state (~23 minutes each).
+This produces 15,376 benefit values per state. The vectorized generator
+(`precompute_vec.py`) computes a full state in ~2–5 seconds, roughly **600×
+faster** than the cell-by-cell generator, and has been validated as
+bit-for-bit identical to it. See [scripts/README.md](scripts/README.md).

## Getting Started

@@ -58,11 +61,21 @@ Requires Python 3.10+ and policyengine-us:
```bash
cd scripts
pip install -r requirements.txt
-python precompute.py                    # Generate all state JSON files
-python precompute.py --states CA,NY     # Generate specific states only
-python precompute.py --metadata-only    # Regenerate metadata.json only

+# Fast vectorized generator (recommended; ~2–5s per state)
+python precompute_vec.py                # Generate all state JSON files
+python precompute_vec.py --states CA,NY # Generate specific states only

+# Reference cell-by-cell generator (slow; metadata.json lives here)
+python precompute.py --states CA,NY     # Generate specific states (slow)
+python precompute.py --metadata-only    # Regenerate metadata.json only
```

+The two generators produce byte-for-byte identical per-state data;
+`precompute_vec.py` is just far faster, while `metadata.json` is written only
+by `precompute.py`. See [scripts/README.md](scripts/README.md) for how the
+vectorization works and the validation behind that claim.

Then rebuild the frontend:

```bash
2 changes: 1 addition & 1 deletion public/data/AK.json
2 changes: 1 addition & 1 deletion public/data/CA_1.json
2 changes: 1 addition & 1 deletion public/data/CA_2.json
2 changes: 1 addition & 1 deletion public/data/CT.json
2 changes: 1 addition & 1 deletion public/data/HI.json
2 changes: 1 addition & 1 deletion public/data/IL.json
2 changes: 1 addition & 1 deletion public/data/IN.json
2 changes: 1 addition & 1 deletion public/data/KS.json
2 changes: 1 addition & 1 deletion public/data/KY.json
2 changes: 1 addition & 1 deletion public/data/MA.json
2 changes: 1 addition & 1 deletion public/data/MN.json
2 changes: 1 addition & 1 deletion public/data/MT.json
2 changes: 1 addition & 1 deletion public/data/ND.json
2 changes: 1 addition & 1 deletion public/data/NE.json
2 changes: 1 addition & 1 deletion public/data/NH.json
2 changes: 1 addition & 1 deletion public/data/NY.json
2 changes: 1 addition & 1 deletion public/data/OH.json
2 changes: 1 addition & 1 deletion public/data/SC.json
2 changes: 1 addition & 1 deletion public/data/SD.json
2 changes: 1 addition & 1 deletion public/data/TX.json
2 changes: 1 addition & 1 deletion public/data/WA.json
2 changes: 1 addition & 1 deletion public/data/WI.json
2 changes: 1 addition & 1 deletion public/data/WY.json
2 changes: 1 addition & 1 deletion public/data/metadata.json

Large diffs are not rendered by default.

117 changes: 117 additions & 0 deletions scripts/README.md
@@ -0,0 +1,117 @@
# Data generation scripts

The frontend is fully static: every TANF benefit it shows is precomputed into
`public/data/<STATE>.json`. These scripts produce those files.

## Files

| File | Role |
|---|---|
| `calculator.py` | Builds a PolicyEngine situation for one household and returns its TANF benefit. Single source of truth for income injection and state-variable mapping. |
| `config.py` | State list, county→region/group mappings, default year. |
| `precompute.py` | **Reference** generator — one `Simulation` per grid cell. Slow but simple; also owns `metadata.json`. |
| `precompute_vec.py` | **Fast** generator — vectorized with PolicyEngine `axes`. Recommended. Produces byte-for-byte identical output. |

## The grid

Each state file is a full grid of **15,376** benefit values:

| Dimension | Values | Count |
|---|---|---|
| Earned income (monthly) | $0–$3,000, $100 steps | 31 |
| Unearned income (monthly) | $0–$3,000, $100 steps | 31 |
| Adults | 1, 2 | 2 |
| Children | 0–7 | 8 |

`31 × 31 × 2 × 8 = 15,376`. Stored as `data["<adults>_<children>_false"][earned_idx][unearned_idx]`
= the rounded **monthly** benefit.
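
As a minimal sketch of how a consumer might index this layout: the helper name `lookup_benefit` and the toy grid are hypothetical, and the meaning of the trailing `_false` flag is not described here, so it is copied through as a literal.

```python
STEP = 100          # dollars per grid step, per the table above
MAX_INCOME = 3000   # top of both income ranges


def lookup_benefit(data, adults, children, earned, unearned):
    """Return the stored monthly benefit for one grid cell.

    `data` is the parsed state JSON; incomes are monthly dollars and must
    fall on a $100 grid point.
    """
    for value in (earned, unearned):
        if value % STEP or not 0 <= value <= MAX_INCOME:
            raise ValueError(f"{value} is not on the $0-$3,000 grid")
    key = f"{adults}_{children}_false"  # suffix copied verbatim from the layout above
    return data[key][earned // STEP][unearned // STEP]


# Toy grid for illustration only: each cell encodes its own indices.
toy = {"1_2_false": [[e * 100 + u for u in range(31)] for e in range(31)]}
cell = lookup_benefit(toy, adults=1, children=2, earned=500, unearned=1200)
```

Incomes that fall between grid points are rejected here rather than rounded; how the frontend actually handles them is not covered by this README.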

## Why the vectorized generator is ~600× faster

The bottleneck was never the math — it was constructing **15,376 separate
`Simulation` objects per state**. `precompute_vec.py` constructs only **16**:

* The **household-structure** dimensions (adults × children) *can't* be
expressed as axes — they change the number of person entities — so they stay
a 16-iteration loop (2 adults × 8 children).
* The **income** dimensions (31 earned × 31 unearned = 961 cells) become two
PolicyEngine **axis groups**, so one `Simulation` computes all 961 cells in a
single vectorized pass.

That's 16 builds per state instead of 15,376. Measured on Illinois: **4.3 s
vectorized vs ~47.6 min cell-by-cell (~664×)**.
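
The arithmetic behind those figures can be checked in a few lines. This is only a worked calculation using the numbers quoted above; the variable names are illustrative.

```python
EARNED_CELLS = 31
UNEARNED_CELLS = 31
STRUCTURES = 2 * 8  # adults x children

cells_per_structure = EARNED_CELLS * UNEARNED_CELLS      # 961 income cells
builds_cell_by_cell = STRUCTURES * cells_per_structure   # one Simulation per cell
builds_vectorized = STRUCTURES                           # one Simulation per structure

# Illinois timings quoted above: 47.6 minutes vs 4.3 seconds.
speedup = (47.6 * 60) / 4.3
print(builds_cell_by_cell, builds_vectorized, round(speedup))
```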

### How the axes are built (and why it matches exactly)

`precompute_vec.py` reuses `calculator.create_situation` verbatim to build the
base household (with zero income), then attaches axes that reproduce *exactly*
the same inputs the per-cell code would have set:

* `employment_income` (annual) → one year-axis, `min=0, max=36000`.
* `tanf_gross_earned_income` (monthly) → **12 lock-step month-axes**, one per
month, `min=0, max=3000`. Parallel axes in a group step together, so cell *i*
gets `employment_income = i·1200` and each month `= i·100` — i.e.
`employment_income = monthly·12`, identical to `create_situation`.
* State-specific **person-level monthly** vars (DC, IL, MT, SC, TX) → the same
12-month treatment.
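
A rough sketch of what the earned-income axis group could look like. The key names (`name`, `period`, `min`, `max`, `count`) are assumed to follow PolicyEngine's situation `axes` format, the helper names are hypothetical, and the real construction in `precompute_vec.py` may differ. The unearned group and the state-specific variables are omitted.

```python
YEAR = 2026
COUNT = 31  # $0-$3,000 in $100 steps


def earned_axis_group(year=YEAR, count=COUNT):
    """One parallel-axis group: an annual axis plus twelve monthly axes."""
    group = [{"name": "employment_income", "period": str(year),
              "min": 0, "max": 36000, "count": count}]
    for month in range(1, 13):
        group.append({"name": "tanf_gross_earned_income",
                      "period": f"{year}-{month:02d}",
                      "min": 0, "max": 3000, "count": count})
    return group


def axis_values(axis):
    """Evenly spaced values an axis sweeps through (endpoints included)."""
    step = (axis["max"] - axis["min"]) / (axis["count"] - 1)
    return [axis["min"] + i * step for i in range(axis["count"])]


group = earned_axis_group()
annual = axis_values(group[0])
monthly = axis_values(group[1])
# Lock-step check: at every cell, annual income equals 12 x monthly income.
assert all(a == 12 * m for a, m in zip(annual, monthly))
```

Because every axis in the group has the same `count`, cell *i* picks the *i*-th value from each axis, which is what keeps the annual and monthly inputs consistent.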

Two subtleties that the code documents inline:

1. **Entity homogeneity.** PolicyEngine lays out a whole parallel-axis group
using the *first* axis's entity, so person-level and SPM-unit-level vars
can't share a group. The **SPM-unit annual** vars (CA, CO, NC) are therefore
set *after* the simulation is built, via `simulation.set_input`, with a
961-length array matching the cell layout.
2. **Cell orientation.** PolicyEngine expands axis group 0 (earned) as the
*inner/fast* index (`np.meshgrid` uses `'xy'` indexing), so the flat result
is laid out `[unearned][earned]`. The code reshapes then transposes (`.T`)
to the `[earned][unearned]` layout the frontend expects.
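
The orientation fix in item 2 can be illustrated without NumPy. This is a pure-Python stand-in for the reshape-then-transpose step, run on synthetic data; `to_earned_major` is a hypothetical name, not a function from the scripts.

```python
N = 31  # grid points per income dimension


def to_earned_major(flat, n=N):
    """Reshape a flat result whose fast index is earned, then transpose."""
    by_unearned = [flat[u * n:(u + 1) * n] for u in range(n)]  # [unearned][earned]
    return [list(col) for col in zip(*by_unearned)]            # [earned][unearned]


# Synthetic flat result: each entry records its own (earned, unearned) indices,
# with earned varying fastest, as described above.
flat = [(e, u) for u in range(N) for e in range(N)]
grid = to_earned_major(flat)
```

After the transpose, `grid[e][u]` holds the cell for earned index `e` and unearned index `u`, matching the layout the frontend reads.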

### Safety net

If the vectorized path raises for a single (adults, children) structure
(e.g. a state with parameters that don't resolve at the target year), that
structure falls back to the trusted cell-by-cell path without aborting the run,
so a failure there does not leave wrong or missing data in the output. Any
fallback is reported at the end of the run.
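
The fallback pattern might look roughly like the sketch below. Every function here is a hypothetical stand-in (the real scripts' function names and signatures are not shown in this README), and the failure on `children == 7` is contrived for the demonstration.

```python
def compute_structure(adults, children, vectorized, cell_by_cell, fallbacks):
    """Try the vectorized path; on any exception use the per-cell path."""
    try:
        return vectorized(adults, children)
    except Exception as exc:  # broad on purpose: any failure triggers the fallback
        fallbacks.append((adults, children, repr(exc)))
        return cell_by_cell(adults, children)


# Stand-in computations, for illustration only.
def fake_vectorized(adults, children):
    if children == 7:
        raise ValueError("parameter not resolved")
    return [[0] * 31 for _ in range(31)]


def fake_cell_by_cell(adults, children):
    return [[0] * 31 for _ in range(31)]


fallbacks = []
grids = {
    f"{a}_{c}_false": compute_structure(a, c, fake_vectorized, fake_cell_by_cell, fallbacks)
    for a in (1, 2) for c in range(8)
}
print(f"{len(fallbacks)} structure(s) fell back:", [(a, c) for a, c, _ in fallbacks])
```

Collecting the fallbacks in a list and printing them once at the end keeps the per-structure loop simple while still making every fallback visible.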

## Validation

`precompute_vec.py` was confirmed **bit-for-bit identical** to `precompute.py`
on 10 state files chosen to cover every code path, namely plain states,
person-monthly special vars (IL), county selection (CA), and SPM-unit annual vars (CA/CO):

```
AK CA_1 CA_2 CT HI IL IN KS KY CO → 0 mismatches across 138,384+ cells
```

To re-verify after any change, regenerate a state to a temp dir and diff:

```bash
python precompute_vec.py --states IL --output-dir /tmp/vec_check
python - <<'PY'
import json
a = json.load(open("../public/data/IL.json"))
b = json.load(open("/tmp/vec_check/IL.json"))
print("mismatches:", sum(a[k][e][u] != b[k][e][u]
                         for k in a for e in range(31) for u in range(31)))
PY
```
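
The one-liner above only counts mismatches. When a count is non-zero, a slightly longer sketch can report where the first few differences are. `find_mismatches` is a hypothetical helper, and it assumes both files hold grids of the same shape under each shared key.

```python
def find_mismatches(a, b, limit=5):
    """Return up to `limit` (key, earned_idx, unearned_idx) mismatch locations."""
    found = []
    for key in sorted(set(a) | set(b)):
        if key not in a or key not in b:
            found.append((key, None, None))  # structure missing from one file
        else:
            found.extend(
                (key, e, u)
                for e, row in enumerate(a[key])
                for u, value in enumerate(row)
                if value != b[key][e][u]
            )
        if len(found) >= limit:
            break
    return found[:limit]


# Tiny 2x2 grids for illustration; real files are 31x31 per key.
reference = {"1_0_false": [[0, 10], [20, 30]]}
candidate = {"1_0_false": [[0, 10], [20, 31]]}
print(find_mismatches(reference, candidate))
```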

## Usage

```bash
pip install -r requirements.txt

# Fast (recommended)
python precompute_vec.py # all states -> public/data/
python precompute_vec.py --states CA,NY # subset
python precompute_vec.py --states IL --output-dir /tmp/vec_check # don't clobber

# Reference / metadata
python precompute.py --states CA,NY # slow, cell-by-cell
python precompute.py --metadata-only # regenerate metadata.json
```

> Note: `metadata.json` (year, FPG, grid config, county data) is owned by
> `precompute.py --metadata-only`. `precompute_vec.py` only writes the per-state
> benefit grids.
23 changes: 15 additions & 8 deletions scripts/precompute.py
@@ -8,7 +8,7 @@
import os
import sys
import time
-from multiprocessing import Pool, cpu_count
+from multiprocessing import Pool

# Add scripts dir to path (calculator.py and config.py live here)
sys.path.insert(0, os.path.dirname(__file__))
@@ -17,7 +17,7 @@
from config import PILOT_STATES, CA_COUNTIES, PA_COUNTIES, VA_COUNTIES

# Grid configuration
-YEAR = 2025
+YEAR = 2026
EARNED_STEPS = list(range(0, 3001, 100)) # $0-$3000/mo in $100 steps (31 values)
UNEARNED_STEPS = list(range(0, 3001, 100)) # $0-$3000/mo in $100 steps (31 values)
ADULTS_RANGE = [1, 2]
@@ -43,7 +43,7 @@
}

OUTPUT_DIR = os.path.join(
-    os.path.dirname(__file__), "..", "frontend", "public", "data"
+    os.path.dirname(__file__), "..", "public", "data"
)


@@ -113,11 +113,11 @@ def build_county_list(counties):
pa_counties, pa_county_groups = build_county_list(PA_COUNTIES)
va_counties, va_county_groups = build_county_list(VA_COUNTIES)

-# Federal Poverty Guidelines 2025
+# Federal Poverty Guidelines 2026
fpg = {
-    "default": {"base": 15650, "per_additional": 5500},
-    "AK": {"base": 19560, "per_additional": 6880},
-    "HI": {"base": 18000, "per_additional": 6330},
+    "default": {"base": 15960, "per_additional": 5680},
+    "AK": {"base": 19950, "per_additional": 7100},
+    "HI": {"base": 18360, "per_additional": 6530},
}

from importlib.metadata import version as pkg_version
@@ -154,6 +154,7 @@ def build_county_list(counties):


def main():
+sys.stdout.reconfigure(line_buffering=True)
import argparse

parser = argparse.ArgumentParser()
@@ -166,6 +167,12 @@ def main():
action="store_true",
help="Only generate metadata.json",
)
+parser.add_argument(
+    "--workers",
+    type=int,
+    default=3,
+    help="Number of parallel worker processes (default: 3).",
+)
args = parser.parse_args()

os.makedirs(OUTPUT_DIR, exist_ok=True)
@@ -221,7 +228,7 @@ def main():
start = time.time()

# Use multiprocessing
-num_workers = min(cpu_count(), len(tasks))
+num_workers = min(args.workers, len(tasks))
print(f"Using {num_workers} workers...\n")

completed = 0