Migrate the GVL core to Rust (Phases 0–5): numba-free, byte-identical, cargo-standalone #262
Merged
…er design Scope: port get_diffs_sparse + choose_exonic_variants (genotypes) and the 7 flat-variant gather/fill kernels; delete dead filter_af; gate = parity + no regression. Fixes the Phase 2/3 double-count of the reconstruction kernels. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Task-by-task plan: port get_diffs_sparse + choose_exonic_variants + 7 flat gather/fill kernels to Rust, delete dead filter_af, parity + no-regression gate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pure-ndarray core in src/genotypes/, PyO3 in src/ffi/, dispatched via _dispatch (default rust). Offsets normalized to (2,n) int64. numba retained as parity reference. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…perseded by inline numpy) AF filtering happens in numpy in _haps.py/_flat_variants.py; the numba filter_af had zero production callers. Its dedicated unit test and two stale comment references are removed with it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…(parity) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…elds) Task 5's gather_rows hardcoded int32, silently truncating float32 dosage and arbitrary custom FORMAT field values. Dispatch by dtype: i32/f32 rust cores + dtype-preserving numba fallback for other dtypes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
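The failure mode this commit describes can be illustrated with a minimal sketch. Both functions below are hypothetical stand-ins, not the PR's kernels (which, per the commit, are separate i32/f32 cores plus a fallback for other dtypes). The sketch only contrasts a gather that forces every value through `i32` with one that is generic over the element type.

```rust
/// Naive port (hypothetical): every value is forced through i32,
/// so a fractional dosage such as 0.5 is truncated to 0.
fn gather_rows_i32_only(values: &[f32], idx: &[usize]) -> Vec<i32> {
    idx.iter().map(|&i| values[i] as i32).collect()
}

/// Dtype-preserving variant (hypothetical): generic over the element type,
/// so f32 dosage or i16 FORMAT values come back unchanged.
fn gather_rows<T: Copy>(values: &[T], idx: &[usize]) -> Vec<T> {
    idx.iter().map(|&i| values[i]).collect()
}
```

Calling `gather_rows_i32_only(&[0.5, 1.5, 2.0], &[0, 1, 2])` yields `[0, 1, 2]`, which is the kind of silent truncation the commit fixes; the generic version returns the original `f32` values.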
…ving) i32/f32 rust cores + dtype-preserving numba fallback for other dtypes (custom FORMAT fields, e.g. int16) — no down-cast. Parity-gated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…st (dtype-preserving) i32/f32 rust cores + dtype-preserving numba fallback for other dtypes (custom FORMAT fields, e.g. int16) — no down-cast. Parity-gated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rving) Two-level dummy-fill for allele bytes (uint8) AND token windows (int32). u8/i32 rust cores + dtype-preserving numba fallback. Parity-gated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ical) Flips GVL_BACKEND numba<->rust through the real variants getitem path; spy asserts the rust gather_rows_i32 kernel is invoked (non-vacuous); compares every RaggedVariants field byte-identically. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… lint/docstring cleanup test_flat_variants_type imported the pre-rename _gather_v_idxs_ss; point it at _gather_v_idxs_ss_numba. Also drop an unused strategy var, fix two stale docstring xrefs to the renamed numba gather helpers, and ruff-format. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ration branch Phase 2 genotype assembly + variant gather kernels ported (parity byte-identical, full tree green). filter_af deleted as dead. Records the dtype-preserving design (custom FORMAT fields), the measured ~7% rust-vs-numba read-path gap, and the cProfile finding that it is Python dispatch glue (np.ascontiguousarray = 62%), not rust compute. Per owner decision: drop per-phase throughput gate, accumulate the roadmap on the persistent `rust-migration` branch, restore the perf gate via a single-big-__getitem__-kernel optimization pass before one final merge. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…eal errors) Final-review finding: `except (KeyError, Exception)` could mask a real AF read-path regression as a skip. Catch only KeyError (AF key genuinely absent); let anything else propagate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1:1 parity twins for the 8 read-path numba kernel groups, plus begin read-path consolidation by fusing the haplotypes and tracks __getitem__ paths. Parity is the hard gate; throughput is recorded only (supersedes the stale throughput-gate line in the roadmap). Sequencing reference -> haps -> tracks -> fuse. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… plan 15 tasks across 4 sub-units (reference, haplotype reconstruction, track realignment+RLE, fused-path consolidation). Each kernel follows the Phase 2 port recipe: ndarray core + cargo tests -> ffi -> dispatch -> byte-identical hypothesis parity. Parity hard-gated; throughput recorded only. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
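The "byte-identical parity" step of the port recipe can be sketched as a small differential test. This is an illustration under assumptions, not the project's harness: the PR compares Rust against numba from Python using hypothesis, whereas here both sides are Rust, a tiny LCG stands in for the property-based generator, and every function name is hypothetical.

```rust
/// Reference ("oracle") implementation: builds a new buffer base by base.
fn revcomp_oracle(seq: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(seq.len());
    for &b in seq.iter().rev() {
        out.push(complement(b));
    }
    out
}

/// Candidate implementation: reverses and complements in place.
fn revcomp_in_place(seq: &mut [u8]) {
    seq.reverse();
    for b in seq.iter_mut() {
        *b = complement(*b);
    }
}

fn complement(b: u8) -> u8 {
    match b {
        b'A' => b'T',
        b'C' => b'G',
        b'G' => b'C',
        b'T' => b'A',
        other => other, // N and anything else pass through unchanged
    }
}

/// Deterministic input generator (an LCG), standing in for hypothesis strategies.
fn random_seq(state: &mut u64, max_len: usize) -> Vec<u8> {
    let mut next = || {
        *state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (*state >> 33) as usize
    };
    let len = next() % (max_len + 1);
    (0..len).map(|_| b"ACGTN"[next() % 5]).collect()
}

/// Run both implementations on `cases` generated inputs and return how many
/// outputs were not byte-identical (0 means parity holds on this sample).
fn count_parity_failures(seed: u64, cases: usize) -> usize {
    let mut state = seed;
    let mut failures = 0;
    for _ in 0..cases {
        let input = random_seq(&mut state, 64);
        let expected = revcomp_oracle(&input);
        let mut actual = input.clone();
        revcomp_in_place(&mut actual);
        if actual != expected {
            failures += 1;
        }
    }
    failures
}
```

A gate in this style passes only when `count_parity_failures` returns 0; a sampled check like this gives evidence of parity on the generated inputs, not a proof.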
…ilure for start>=clen (parity twin) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…arning (review fixes) I1: capture spy count after rust read, assert it is unchanged after numba read — proves the spy is wired only to the rust kernel, mirroring the guard in test_variants_dataset_parity.py. M1: remove with_tracks(False) call on a no-tracks fixture; the call was a no-op that only emitted a spurious "Dataset has no tracks" warning. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-tested) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eep-mask branches
…ity, default rust) Implements Task 5 of Phase 3: adds a Rust batch driver for reconstruct_haplotypes_from_sparse (plural), wires it into the dispatch registry with default=rust, and verifies byte-identical parity against the numba backend via Hypothesis property tests. Also fixes the parity strategy to constrain variant positions to [0, min_contig_len) — mirrors the production invariant that VCF variants are always within-contig — preventing false panics in the Rust kernel on out-of-range random inputs that the parallel numba kernel silently swallows via thread-local SystemError. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ip numba annotated flake When a deletion's ref_end advances ref_idx past the contig boundary, `ref_.len() - ref_idx` is negative. Mirror numba: compute out_end_idx = (out_idx + writable_ref).max(0) so the right-pad range matches exactly. Annotated parity test uses assume(False) to discard inputs where numba's parallel batch driver hits its pre-existing SystemError (negative slice index inside prange); the non-annotated test exercises full byte-identity. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
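The clamp in this commit can be sketched as follows. The variable names `out_idx`, `ref_idx`, `writable_ref` and `out_end_idx` come from the commit message; the function signature and the use of `isize` for the intermediate arithmetic are assumptions made for illustration, since the kernel's actual index types are not shown here.

```rust
/// Index at which right-padding starts in the output buffer.
///
/// `ref_len - ref_idx` is the number of reference bases still available to
/// copy. It is computed in signed arithmetic because a deletion can push
/// `ref_idx` past the end of the contig, making the difference negative;
/// with `usize` operands the same subtraction would overflow.
fn out_end_idx(out_idx: usize, ref_len: usize, ref_idx: usize) -> usize {
    let writable_ref = ref_len as isize - ref_idx as isize; // may be negative
    (out_idx as isize + writable_ref).max(0) as usize
}
```

For example, with `ref_len = 100` and `ref_idx = 105`, `writable_ref` is -5, so an `out_idx` of 10 gives an end index of 5, and an `out_idx` of 3 clamps to 0.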
…tch serial-only impl - Expand all three unsafe from_raw_parts_mut SAFETY comments in the batch loop to explicitly state the disjointness invariant: out_offsets required by calling contract to be monotonically non-decreasing → each [out_s..out_e] is a strictly non-overlapping address range; serial loop prevents aliasing UB. - Rename batch_two_queries_two_haplotypes → batch_correctness_two_queries and update doc comment to accurately describe a correctness check (not a serial-vs-parallel comparison); note GIL as reason rayon is omitted. - Add batch_correctness_with_snp test that applies a single SNP (C→T) to exercise the variant-application code path alongside reference-copy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mError; correct rayon-deferral comment Fix A: factor a _assert_non_annotated_parity helper that wraps the numba call in try/except SystemError → assume(False), mirroring the guard already present in _assert_annotated_parity. Eliminates latent CI flakiness for the ~0.2% of hypothesis inputs that trigger numba parallel=True crash in the non-annotated path (2000-example high-budget run: 0 uncaught errors). Fix B: replace the incorrect "GIL makes rayon useless" comment in src/reconstruct/mod.rs batch_correctness_two_queries with an accurate note: serial-only is a phase gate decision (throughput recorded not gated), and the loop is rayon-parallelizable later via the same disjoint-chunk split used in src/reference/mod.rs get_reference. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ith _dispatch Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…t calls, delete _dispatch Replace the 22 dispatched call sites across 6 files with direct rust callable references, remove all 20 register() blocks, delete _dispatch.py, delete dead test infra (_harness.py, test_harness_tuple.py, test_dispatch.py), and rewrite make_kernel_spy to monkeypatch the module-level rust symbol instead of mutating the dispatch registry. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…END in bench conftest (W5 B1) - generate_goldens: guard _dispatch import with try/except ImportError (_dispatch=None); _have_numba returns False when _dispatch is None; remove register-triggering side-effect imports (_flat_variants, _genotypes, _intervals, _reference, _tracks); fix E731 lambda-assignment in gen_inplace_kernels - benchmarks/conftest.py: remove dead GVL_BACKEND env manipulation from captured_haplotypes; fix stale _dispatch_get()/_REGISTRY comment in captured_realign_tracks; drop now-unused import os - _tracks.py: remove triple blank line (ruff format) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… GVL_BACKEND/_active_backend Remove all ~20 backend-conditional forks across _query.py, _haps.py, _reconstruct.py, _reference.py, and _tracks.py. Keep the Rust arm inline and delete the numba composed path at each site. RC accounting preserved byte-identically: _query.py and _reference.py numba post-passes deleted (Rust folds RC in-kernel); _tracks.py keeps its post-pass (unconditional now — tracks RC is Python-side on Rust). All 686 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…that import) Track-only path spies via _tracks_mod; the haps+tracks fused path is covered by test_fused_tracks_parity. The defensive _recon_mod spy broke after B2 deleted the now-unused intervals_to_tracks import from _reconstruct. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Removed all @nb.njit / @nb.vectorize decorators and `import numba as nb` from python/genvarloader/. Twelve modules touched. Zero numba decorators remain in genvarloader source. Key changes: - _threads.py: cap_numba_threads() → cap_threads(); seeds RAYON_NUM_THREADS for rayon global pool init; keeps optional numba.get_num_threads() cap for backward test compat during migration. - _flat_variants.py: replaced 5 numba dispatch fallbacks with dtype-preserving numpy equivalents (_gather_rows_numpy, _compact_keep_numpy, _fill_empty_scalar_numpy, _fill_empty_seq_numpy, _fill_empty_fixed_numpy) — fixes issue #231 (custom FORMAT fields, e.g. int16/int64 dtypes). - _genotypes.py/_tracks.py/_reference.py/_utils.py: deleted njit functions; restored pure Python oracles for parity/unit test compat (no decorators). - _intervals.py: deleted 4 njit functions + restored dispatch wrappers. - _flat_flanks.py/_sitesonly.py: removed decorators; bodies unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r pure-OS detection _threads.py: revert sub-agent's conditional numba import; use exact replacement from brief (OS-only, no numba ceiling). _reconstruct.py: drop stale _shift_and_realign_tracks_sparse_rust_wrapper import (ruff F401). tests/unit/test_threads.py: update to new no-numba semantics (env unclamped; threshold via monkeypatched cpu count). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…x B4 guard to own-code Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ard stays own-code B4 removed the conda numba pin, so pixi satisfied seqpro's transitive numba via a broken PyPI llvmlite (libllvmlite.so won't load) -> import genvarloader failed at collection. genvarloader's own code is numba-free; the pin only keeps seqpro working. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…n batch parallelism Add `parallel: bool` to the core batch kernel and all 5 FFI entries (reconstruct_haplotypes_from_sparse, reconstruct_haplotypes_fused, reconstruct_haplotypes_spliced_fused, reconstruct_annotated_haplotypes_fused, reconstruct_annotated_haplotypes_spliced_fused). The parallel branch carves disjoint per-k &mut [_] slices via split_at_mut chains over all active buffers (out u8 always; annot_v_idxs/annot_ref_pos i32 when Some) and dispatches via into_par_iter(), mirroring the proven get_reference idiom. Python callers (reconstruct_haplotypes_from_sparse in _genotypes.py, the 4 fused entries in _haps.py) compute should_parallelize(total_out_bytes) and pass it through. New test tests/parity/test_rayon_equivalence.py asserts serial == parallel == frozen golden for all 200 hypothesis cases. Gate: 64 parity tests pass, cargo test 17/17, ruff clean, clippy 0 errors (16 pre-existing warns). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
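The PR description gives the gate's threshold as `GVL_NUM_THREADS × 1 MiB`. A minimal sketch of such a gate follows. The real `should_parallelize` is called from the Python side with `total_out_bytes`; the Rust signature, the explicit `num_threads` parameter, and the choice of `>=` rather than `>` are assumptions.

```rust
const BYTES_PER_THREAD: usize = 1 << 20; // 1 MiB

/// Hypothetical gate: go parallel only when the batch's total output is at
/// least `num_threads` MiB, so small batches skip thread-pool overhead.
fn should_parallelize(total_out_bytes: usize, num_threads: usize) -> bool {
    total_out_bytes >= num_threads.saturating_mul(BYTES_PER_THREAD)
}
```

Under these assumptions, a batch one byte short of 4 MiB stays serial on 4 threads, and a batch of exactly 4 MiB goes parallel.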
…t comment Address C1 task-review Important findings: - I-1: add debug_assert!(s >= cursor && e >= s) to the parallel chunk-carve loop documenting/enforcing the out_offsets monotonicity contract (zero-cost in release; the same bounds drive the annotation carves). - I-2: correct the stale comment in test_rayon_equivalence.py — RUST_KERNELS now stores the C1 shim (parallel=False default) that forwards to the FFI, not the bare FFI function. Gate: 688 passed / 35 skipped / 2 xfailed; cargo reconstruct 17/17; ruff + clippy clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
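The chunk-carve idiom and its monotonicity assertion can be sketched with the standard library alone. This is not the PR's kernel: scoped threads from `std::thread` stand in for rayon's `into_par_iter()` so the example has no external dependencies, and the function name, the `fill` parameter, and the single `u8` buffer are hypothetical simplifications.

```rust
use std::thread;

/// Carve `out` into one disjoint `&mut [u8]` per query using `offsets`
/// (length n + 1, monotonically non-decreasing), then fill chunk k with
/// `fill[k]` on its own scoped thread.
fn fill_chunks_parallel(out: &mut [u8], offsets: &[usize], fill: &[u8]) {
    let mut chunks: Vec<&mut [u8]> = Vec::with_capacity(fill.len());
    let mut rest = out;
    let mut cursor = 0usize;
    for w in offsets.windows(2) {
        let (s, e) = (w[0], w[1]);
        // Monotonicity contract: checked in debug builds, compiled out in release.
        debug_assert!(s >= cursor && e >= s);
        let tail = std::mem::take(&mut rest);
        let (_gap, tail) = tail.split_at_mut(s - cursor); // skip bytes before s
        let (chunk, tail) = tail.split_at_mut(e - s); // this query's range
        chunks.push(chunk);
        rest = tail;
        cursor = e;
    }
    // Each thread owns exactly one chunk, so no raw pointer crosses threads.
    thread::scope(|scope| {
        for (chunk, &value) in chunks.into_iter().zip(fill) {
            scope.spawn(move || chunk.fill(value));
        }
    });
}
```

Because every chunk is produced by `split_at_mut`, the borrow checker itself guarantees the ranges are disjoint; the `debug_assert!` documents the caller's contract rather than providing the safety.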
…o_intervals (Task C2) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add parallel=bool to get_diffs_sparse (par_chunks_mut over flat output, one cell per work item) and intervals_to_tracks (split_at_mut cursor idiom, same as C1/C2). Thread parallel through all FFI entry points and Python callers (_genotypes.py, _intervals.py); add parallel=False shims for both kernels in _golden.py so existing replay callers are unaffected. Update genvarloader.pyi stub for intervals_to_tracks. Extend test_rayon_equivalence.py with serial==parallel==golden cases for both kernels. All 68 parity tests pass; 110 cargo tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…away micro-benchmarks C4 — Stage-C boundary for the W5 consolidation PR. - Roadmap: rewrite the W5 notes entry to cover all three stages (golden snapshot, numba deletion, rayon batch parallelism) and the per-kernel rayon rollout (C1 reconstruct, C2 tracks, C3 diffs/intervals). Phase 5 stays 🚧 (W6/PR6 is measure-and-merge). Correct the seqpro-numba note to "to be filed". - tests/benchmarks/test_micro.py: skip the 3 micro-benchmarks whose Python-level capture points were fused away in W3/W5 (reconstruct_haplotypes_from_sparse, intervals_to_tracks, shift_and_realign_tracks_sparse) — redesign onto the fused rust entries is deferred to W6. Fix the now-stale shift import to the rust wrapper. test_get_diffs_sparse + e2e benchmarks still run. This unbreaks whole-tree `pytest tests` / `pixi run test` (broken since B2/B3). Stage-C gate (controller-verified, fresh maturin --release): whole `pytest tests` = 973 passed / 44 skipped / 5 xfailed; cargo test --release 114; ruff + format + pyrefly + clippy clean; serial==parallel==golden across all kernels. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…kout Final-review caveat: post-W5 (numba deleted) re-running either golden generator would silently freeze rust == rust with no oracle cross-check, defeating the parity contract. Strengthen both generator docstrings from a passive note into an explicit DANGER warning. Docstring-only; no logic change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…-of-scope - W5 entry PR #TODO → #260. - Correct the seqpro caveat: removing numba from seqpro (ML4GLand/SeqPro) is out of scope (user decision 2026-06-27); W5's numba removal is gvl-only by design, so the transitive numba dep + its JIT-RSS floor remain intentionally. W6 perf re-baseline measures gvl-attributable deltas, not the seqpro JIT floor. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 5 W5: consolidation — golden snapshot + delete numba + rayon
…alone/seqpro verifications) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…urface glue Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ct stale Phase 1 note) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ead speedup + RSS Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rderline threshold claim Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Phase 5 W6 wrap-up: thin-shim audit + cargo-standalone + seqpro verification + perf re-baseline
Migrate the GVL core to Rust (Phases 0–5)
This is the big integration merge of the `rust-migration` branch. It ports GenVarLoader's core read/write data structures and algorithms from Python/numba to a self-contained Rust crate wrapped by a thin PyO3 (abi3) binding, and deletes numba from GVL's own code entirely. Python keeps only the ergonomic surface (`Dataset` indexing sugar, torch integration, validation/error messages) and dispatches into Rust for everything else.

Scope: 50 commits, 197 files, +29,790 / −2,366. Docs-and-tests-heavy: most of the line count is the differential-parity test suite and the migration roadmap.
Why
- `cargo test`-able Rust crate usable from Rust directly, with a type system that shrinks the code + testing surface.
- `gvl.write()`/`update()` and parity-or-better `Dataset.__getitem__`, with headroom for batch parallelism.

The migration contract (strangler fig + byte-identical parity)
Every kernel followed the same loop: implement in Rust on the native ragged layout → expose through `src/ffi/` behind a Python-side backend switch → differential-test byte-identical against the numba/Python impl on property-generated inputs → flip the default to Rust and delete the numba impl in the same bundled PR. `main` stayed shippable at every step; numba removal was continuous, not a big-bang. The ragged layout is consumed from `seqpro-core` (a pyo3-free rlib, crates.io `0.1.0`), not reimplemented in GVL.

What landed, by phase
- `src/ffi/` seam + first live kernel (`intervals_to_tracks`); reusable run-both-assert-byte-identical differential harness + hypothesis generators; `cargo test` wired into pixi; abi3 wheel confirmed; Carter baselines captured
- `seqpro-core` rlib owning the `Ragged` layout; ported the last ragged numba ops (`to_padded`, `reverse_complement`) to Rust; GVL consumes it as a crates.io dep; dropped `awkward` from the foundation
- `get_diffs_sparse`, `choose_exonic_variants`, the flat-variant gather/fill kernels; dtype-preserving dispatch (int32/float32 hot cores + arbitrary-dtype fallback, after a naive port silently corrupted float32 dosage / int16 FORMAT fields)
- … `rust-migration` … `__getitem__` kernels (plain/annotated/spliced haps, annotated-spliced, fused tracks) each crossing the FFI boundary once; format 2.0 zero-copy SoA storage + scale-guard
- `gvl.write()`/`update()` fully Rust-backed and numba-free: single-pass streaming bigWig writer (SoA `starts/ends/values.npy`), COITrees table/annot overlap engine; dead legacy write paths deleted

Performance
Final single-thread rust-vs-numba `__getitem__` A/B (Phase 5 W4, the last apples-to-apples comparison before numba was deleted; Carter, `chr22_geuv`, `NUMBA_NUM_THREADS=1`, full tables in `docs/roadmaps/phase-5-w4-final-ab.md`): rust is parity-or-better on every mode.

The annotated path went from the close-out laggard (0.65×) to a clear rust win after zero-copy interval marshalling + uninit output buffers; tracks-only went from a 0.63× regression to 1.07× after replacing per-interval `ndarray` slicing with raw-slice writes (#248); variant-windows collapsed an entire Python assembly pass into Rust (#250).

Write path (Phase 4, Carter, `chr22_geuv`): `gvl.write()` 1.934 s / 3.520 GB peak RSS; `gvl.update()` 0.081 s. The bigWig write slice is ~1.88× faster with ~28% less total allocation vs. the legacy path.

Rayon batch parallelism (Phase 5 W5): every read kernel has a `parallel` gate (`should_parallelize`, threshold = `GVL_NUM_THREADS × 1 MiB`) that dispatches `into_par_iter()`, never a raw `*mut` across threads, and is gated byte-identical to the serial golden (`tests/parity/test_rayon_equivalence.py`). The W6 re-baseline corpus stayed below the threshold so rayon ran serial there (a documented finding, not a regression); production-scale batches (SEQLEN≥131072 or BATCH≥256) cross it.

On-disk format & API changes (reviewer note)
- … `starts/ends/values.npy` sharing `offsets.npy`) so the memmaps cross the Python→Rust boundary zero-copy. Opening is gated; existing datasets migrate in place via `gvl.migrate`. This also closes a rust-only OOM-at-scale defect where the AoS layout forced a full per-sample-scale `np.ascontiguousarray` copy every batch (locked by `tests/integration/test_scale_guard.py`).
- `GVL_BACKEND` env var and `python/genvarloader/_dispatch.py` are gone: Python calls Rust directly. No runtime backend switch remains.
- … `TypeError` (the dead custom-`IntervalTrack` write path was removed).
- `seqpro` ≥ 0.20 (Rust-backed `Ragged`) and the `seqpro-core` 0.1.0 crates.io crate; `awkward` dropped from hot paths.

Verification gate (2026-06-27, HEAD of branch)
- … `cargo test --release` from a clean shell, no pixi/`PYO3_PYTHON` needed).

Known caveats (not blockers)
- … #242-family `start>=clen` clip and a reconstruct trailing-under-write case): numba is the buggy side and is not a valid oracle there; rust is correct in both. Documented at the Phase 3 gate.

Merge note
Per project policy this should land no-squash to preserve the per-phase commit history. After merge, backfill the Phase 5 `_PR:` link in `docs/roadmaps/rust-migration.md` (currently `—`), consistent with the prior-phase convention.

🤖 Generated with Claude Code