Releases: mcvickerlab/GenVarLoader
v0.36.0
v0.36.0 — The Rust Migration
This is the largest release since 2.0. GenVarLoader's entire core read and write path has been ported from Python/numba to a self-contained Rust crate behind a thin PyO3 (abi3) binding, and numba has been removed from GVL's own code entirely. Python keeps only the ergonomic surface — Dataset indexing, torch integration, validation and error messages — and dispatches into Rust for everything else.
The work landed as Phases 0–5 of the Rust migration roadmap (50+ commits, ~30k lines, mostly differential-parity tests and docs). Every kernel was ported under a strict contract: implement in Rust → differential-test byte-identical against the old numba implementation on property-generated inputs → flip the default and delete the numba version. main stayed shippable at every step.
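The port-then-verify contract can be illustrated with a toy kernel. This is a minimal sketch, not the project's test suite: `rc_reference` and `rc_candidate` are hypothetical stand-ins for the old and new implementations of a reverse-complement routine, and seeded random inputs stand in for the property-generated inputs mentioned above.

```python
import random

# Complement lookup used by the slow reference implementation.
_COMP = {ord(a): ord(b) for a, b in zip("ACGTN", "TGCAN")}
# Translation table used by the fast candidate implementation.
_TABLE = bytes.maketrans(b"ACGTN", b"TGCAN")


def rc_reference(seq: bytes) -> bytes:
    """Oracle: walk the sequence backwards, complementing each base."""
    return bytes(_COMP[b] for b in reversed(seq))


def rc_candidate(seq: bytes) -> bytes:
    """Candidate: complement via table lookup, then reverse."""
    return seq.translate(_TABLE)[::-1]


def differential_check(n_cases: int = 500, seed: int = 0) -> int:
    """Compare both implementations byte-for-byte on random inputs."""
    rng = random.Random(seed)
    for _ in range(n_cases):
        length = rng.randint(0, 64)
        seq = bytes(rng.choice(b"ACGTN") for _ in range(length))
        expected, actual = rc_reference(seq), rc_candidate(seq)
        assert expected == actual, f"mismatch on {seq!r}"
    return n_cases


checked = differential_check()
```

Only once a check like this passes across the whole input domain would the default be flipped and the reference deleted.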
⚠️ Breaking changes
On-disk dataset format → 2.0 (migration required)
Track intervals are now stored struct-of-arrays (starts.npy / ends.npy / values.npy sharing offsets.npy) instead of array-of-structs. This lets the memmaps cross the Python→Rust boundary zero-copy, and fixes an out-of-memory defect at sample-scale where the old layout forced a full contiguous copy on every batch.
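The difference between the two layouts can be sketched with the standard library alone. The data, dtypes, and helper name `row_view` below are illustrative assumptions, not the actual on-disk schema; the point is only that a struct-of-arrays field is one contiguous buffer that can be sliced as a view, while pulling a field out of array-of-structs storage means visiting every record and building a new buffer.

```python
import struct
from array import array

# Hypothetical toy data: three intervals (start, end, value).
# Row 0 owns intervals [0, 2), row 1 owns interval [2, 3).
records = [(10, 20, 1.5), (30, 45, 0.5), (5, 8, 2.0)]
offsets = array("q", [0, 2, 3])

# Array-of-structs: fields are interleaved record by record.
aos = b"".join(struct.pack("<iif", *r) for r in records)

# Extracting one field from AoS touches every record and copies.
aos_starts = array("i", (s for s, _, _ in struct.iter_unpack("<iif", aos)))

# Struct-of-arrays: one contiguous buffer per field, sharing `offsets`.
starts = array("i", (r[0] for r in records))
ends = array("i", (r[1] for r in records))
values = array("f", (r[2] for r in records))


def row_view(i: int):
    """Return zero-copy views of row i's starts, ends, and values."""
    lo, hi = offsets[i], offsets[i + 1]
    return (
        memoryview(starts)[lo:hi],
        memoryview(ends)[lo:hi],
        memoryview(values)[lo:hi],
    )


s, e, v = row_view(0)
```

Each view returned by `row_view` is contiguous and shares memory with its parent buffer, which is the property that lets a buffer be handed across a language boundary without copying.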
Datasets written by gvl < 0.36 must be migrated. Opening one now raises:
ValueError: Dataset at <path> uses format version 1.x but this genvarloader
expects 2.0.0. Run `genvarloader.migrate('<path>')` to upgrade it in place.
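A major-version gate of this kind can be sketched as follows. This is an illustration only: `check_format_version` and `open_or_explain` are hypothetical names, and the message text is not the library's exact wording.

```python
EXPECTED_FORMAT = "2.0.0"  # the version quoted in the error message above


def check_format_version(found: str, path: str,
                         expected: str = EXPECTED_FORMAT) -> None:
    """Raise ValueError when the dataset's major version differs."""
    found_major = int(found.split(".", 1)[0])
    expected_major = int(expected.split(".", 1)[0])
    if found_major != expected_major:
        raise ValueError(
            f"Dataset at {path} uses format version {found} but "
            f"{expected} is expected; migrate it before opening."
        )


def open_or_explain(found: str, path: str) -> str:
    """Show how a caller might react to the gate."""
    try:
        check_format_version(found, path)
    except ValueError as err:
        return f"needs migration: {err}"
    return "ok"
```

Gating on the major component only means minor and patch bumps within 2.x would still open without complaint.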
Migration
import genvarloader as gvl
gvl.migrate("path/to/dataset")  # in place, then open as usual
`gvl.migrate` is in-place, streaming, and crash-safe (peak extra disk is a single track's interval store; metadata is bumped last via atomic replace). It's idempotent (a no-op on an already-2.0 dataset) and leaves genotypes, regions, and the reference untouched. No need to re-run `gvl.write()`.
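The "bump metadata last" ordering is what makes an interrupted run safe to retry. The sketch below shows the general pattern under stated assumptions: the file name `metadata.json`, the `format_version` key layout, and the `convert` callback are all hypothetical, and this is not the actual `gvl.migrate` implementation.

```python
import json
import os
from pathlib import Path

TARGET_VERSION = "2.0.0"


def migrate_sketch(dataset_dir, convert) -> bool:
    """Return True if work was done, False if already at the target version."""
    meta_path = Path(dataset_dir) / "metadata.json"  # hypothetical file name
    meta = json.loads(meta_path.read_text())
    if meta.get("format_version") == TARGET_VERSION:
        return False  # idempotent: nothing to do

    # Rewrite the data first; metadata still records the old version.
    convert(Path(dataset_dir))

    # Bump the version last via write-to-temp + atomic rename, so a crash
    # before this point leaves metadata that still says "needs migration".
    meta["format_version"] = TARGET_VERSION
    tmp_path = meta_path.with_name(meta_path.name + ".tmp")
    with open(tmp_path, "w") as fh:
        json.dump(meta, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp_path, meta_path)
    return True
```

For the retry to be correct, the `convert` step itself has to tolerate being re-run on a partially converted dataset.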
Other breaking changes
- `GVL_BACKEND` removed. The numba/rust runtime backend switch and `_dispatch.py` are gone; Python now calls Rust directly. If you set `GVL_BACKEND` anywhere, drop it (it's a no-op now).
- Unsupported track types now raise `TypeError` at write time (the dead custom `IntervalTrack` write path was removed).
✨ New
- `gvl.migrate(path)`: the in-place 1.x → 2.0 dataset migrator described above (new public API).
- Self-contained Rust crate. The core data structures and algorithms are now a `cargo test`-able Rust crate (114 Rust tests), usable from Rust directly and built on the pyo3-free `seqpro-core` `Ragged` layout, with no reimplementation in GVL.
- Rust-accelerated variant-windows assembly: an entire Python assembly pass for windowed/bare-allele variant output now happens in a single Rust call.
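A ragged array is commonly stored as one flat data buffer plus row offsets. Assuming that is the kind of layout meant here, the sketch below shows the idea and a padded conversion; `ragged_to_padded` is a hypothetical helper for illustration, not the `seqpro-core` API.

```python
from array import array


def ragged_to_padded(data, row_offsets, pad):
    """Expand flat `data` + `row_offsets` into equal-length rows."""
    n_rows = len(row_offsets) - 1
    lengths = [row_offsets[i + 1] - row_offsets[i] for i in range(n_rows)]
    width = max(lengths, default=0)
    rows = []
    for i, n in enumerate(lengths):
        row = list(data[row_offsets[i]:row_offsets[i + 1]])
        rows.append(row + [pad] * (width - n))  # right-pad short rows
    return rows


# Three rows of lengths 3, 0, and 2 packed into one flat buffer.
flat = array("i", [7, 8, 9, 4, 5])
row_offsets = array("q", [0, 3, 3, 5])
padded = ragged_to_padded(flat, row_offsets, pad=-1)
```

Row `i` is the slice `flat[row_offsets[i]:row_offsets[i + 1]]`, so empty rows cost nothing beyond a repeated offset.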
⚡ Performance
Final single-thread rust-vs-numba Dataset.__getitem__ A/B (the last apples-to-apples comparison before numba was deleted; chr22_geuv, single thread). Rust is parity-or-better on every output mode:
| Output mode | rust ÷ numba |
|---|---|
| tracks-only | 1.07× |
| haplotypes / tracks+seqs | 1.66× |
| annotated | 1.43× |
| variants | 1.38× |
| variant-windows | 4.58× |
Highlights:
- Five fused `__getitem__` kernels (plain / annotated / spliced / annotated-spliced haplotypes, plus fused tracks) each cross the FFI boundary exactly once.
- `gvl.write()` / `update()` are fully Rust-backed: single-pass streaming bigWig writer and a COITrees-based table/annotation overlap engine. The bigWig write slice is ~1.88× faster with ~28% less allocation than the legacy path.
- Rayon batch parallelism on every read kernel, gated byte-identical to the serial result and only engaged above a size threshold (production-scale batches, e.g. SEQLEN ≥ 131072 or BATCH ≥ 256).
- GVL no longer contributes the ~3 GB llvmlite JIT cost it used to (note: `seqpro` still pulls numba transitively for now; its own numba removal is upstream).
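The size-gated parallelism described in the highlights can be sketched in Python with a thread pool standing in for Rayon. The predicate below simply reuses the two example figures quoted above; the real gating rule is not specified in these notes, and `kernel`, `should_parallelize`, and `run_batch` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

SEQLEN_THRESHOLD = 131_072  # example figures quoted in the notes above
BATCH_THRESHOLD = 256


def should_parallelize(batch: int, seqlen: int) -> bool:
    """Assumed predicate: go parallel only for production-scale batches."""
    return seqlen >= SEQLEN_THRESHOLD or batch >= BATCH_THRESHOLD


def kernel(row: bytes) -> bytes:
    """Stand-in per-row kernel: reverse the row."""
    return row[::-1]


def run_batch(rows):
    """Run `kernel` over every row, serially or in parallel."""
    seqlen = max((len(r) for r in rows), default=0)
    if not should_parallelize(len(rows), seqlen):
        return [kernel(r) for r in rows]
    with ThreadPoolExecutor() as pool:
        # map() yields results in input order, so the output is
        # identical to the serial path.
        return list(pool.map(kernel, rows))
```

Keeping the serial path for small batches avoids paying scheduling overhead where there is too little work to amortize it.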
📦 Dependencies
- Requires `seqpro` ≥ 0.20 (Rust-backed `Ragged`) and the `seqpro-core` 0.1.0 crate.
- `awkward` dropped from the hot paths.
🔬 Correctness notes
- The Rust port is byte-identical to the previous numba implementation across the differential-parity suite. Two narrow sub-domains are intentionally excluded from the parity oracle because numba was the buggy side there (the #242-family `start >= contig_len` interval clip, and a reconstruct trailing-fill under-write). Rust is correct in both, and these are now covered by direct oracle tests.
- Verification at release: 973 pytest passed / 44 skipped / 5 xfailed / 0 failed; 114 cargo tests passed; ruff, pyrefly, and clippy clean.
Full changelog (conventional commits)
Feat
- rayon: parallelize get_diffs_sparse + intervals_to_tracks (C3)
- rayon: parallelize shift_and_realign_tracks_sparse and tracks_to_intervals (Task C2)
- rayon: parallelize reconstruct_haplotypes_from_sparse with rayon batch parallelism
- delete numba backend — rust-only read path (Phase 5 W5)
- rust: fuse annotated+spliced haplotype reconstruction into one FFI crossing (Phase 5 W3)
- register rc_alleles dispatch (rust default, seqpro reference)
- rust: rc_alleles PyO3 wrapper + registration
- rust: rc_alleles_inplace core for variant-allele RC
- rust: debug_assert to_rc mask length in kernel RC blocks
- fold strand RC into rust kernels; numba post-pass retained as oracle
- rust: optional in-kernel RC for spliced haplotype kernel
- rust: optional in-kernel RC for annotated haplotype kernel
- rust: optional in-kernel reverse for track realign kernel
- rust: optional in-kernel RC for reconstruct_haplotypes_fused
- rust: optional in-kernel RC for get_reference
- rust: in-place reverse/reverse-complement primitives for read path
- py: assemble_variant_buffers numba oracle, rust shim, and dict parity harness
- ffi: assemble_variant_buffers_{u8,i32} pyfunctions
- variants: assemble_windows_mode (token windows + bare alleles)
- variants: assemble_variants_mode (alt/ref bytes + flank tokens)
- variants: add fetch_windows reference-read helper
- variants: add tokenize/slice_flanks/assemble_alt_window cores
- ffi: zero-copy boundary guard for sample-scale memmaps
- migrate: add gvl.migrate for 1.x AoS -> 2.0 SoA
- open: gate dataset open on format_version major
- format: store track intervals as struct-of-arrays (gvl 2.0)
- intervals: route intervals_to_tracks through backend dispatch (default rust)
- ffi: Rust intervals_to_tracks + ffi seam module
- utils: route splits_sum_le_value through backend dispatch (default rust)
- ffi: Rust splits_sum_le_value + ffi seam module
- dispatch: backend registry for Rust migration strangler window
- ragged: route to_padded through seqpro-core Rust bridge
- rust: consume seqpro-core via rlib; add ragged_to_padded bridge
Fix
- rayon: debug_assert offset monotonicity in C1 carve; correct test comment
- env: keep conda numba pin (seqpro needs working libllvmlite); guard stays own-code
- threads: remove conditional numba import; update thread tests for pure-OS detection
- test: drop stale _recon_mod.intervals_to_tracks spy (B2 removed that import)
- test: restore generate_goldens regeneration; clean dead GVL_BACKEND in bench conftest (W5 B1)
- reconstruct,tracks: pad full tail in numba trailing-fill on ref overshoot
- reconstruct: pad full tail when ref exhausted, not from index 0
- test: add `__init__.py` to disambiguate test_write collision; ruff fmt
- variants: drop unused ArrayView2 import
- indexing: SpliceIndexer.parse_idx double-applies sample-subset map
- intervals: clip sub-query interval starts in both kernels (#242)
- tracks: clamp writable_ref when deletion extends past track end
- reconstruct: guard non-annotated parity test against numba SystemError; correct rayon-deferral comment
- reconstruct: strengthen SAFETY comments; rename batch test to match serial-only impl
- reconstruct: clamp writable_ref when ref_idx past contig end; skip numba annotated flake
- reference: revert padded_slice leniency — mirror numba's loud failure for start>=clen (parity twin)
- test: update stale _gather_v_idxs_ss import after Task 5 rename; lint/docstring cleanup
- variants: gather_rows must preserve data dtype (dosage/custom fields)
- bench: profile variants variable-length (with_len is meaningless for variants)
- stub: sync genvarloader.pyi — bigwig_intervals rename + intervals_to_tracks
- test: drop stale rv._rag access in flat_mode_equivalence after Ragged subclass refactor
- dispatch: guard isinstance(Ragged) sites for RaggedVariants subclass
Refactor
- delete numba kernels; numpy fallbacks for #231 dtype paths
- backend: B2 — collapse backend-conditional branches; delete GVL_BACKEND/_active_backend
- dispatch: B1 — replace all get() call sites with direct rust calls, delete _dispatch
- rust: extract reverse::rc_row shared helper
- write: delete dead legacy track path + splits_sum_le_value
- drop unreachable spliced variant-RC guard
- route variant-allele RC through dispatched rc_alleles kernel
- genotypes: delete dead filter_af kernel + its dead test (superseded by inline numpy)
- variants: RaggedVariants subclasses Ragged, drop _rag composition
Perf
- rust: fuse rc_alleles_inplace — 186→308 instrs (rc_row inlined), drop Vec alloc + rescan
- rust: tune reconstruct_haplotypes_from_sparse — 2839→1279 instrs, 0.655→0.589 rust÷numba
- rust: tune rc_flat_rows_inplace — 212→283 instrs (vectorized), 0.664→0.635 rust÷numba
- rust: tune assemble_alt_window — 518→727 asm lines (memcpy-expanded), 1.146→0.835 ms/batch
- rust: tune slice_flanks — 389→429 total instrs (hot-pa...
v0.35.0
Feat
- promote gvl.Table to public API; remove experimental subpackage
- route Table/annot writes through Rust; vectorize contig norm
- back gvl.Table with Rust COITrees engine; drop polars-bio
- rust: streaming Table writer + PyO3 RustTable methods
- rust: materialize ordered intervals from offsets
- rust: COITrees overlap count for RustTable
- rust: table interval store + RustTable::build
- warn when opening datasets with variant-truncated track windows (#233)
- route annotation bigWig writes through rust behind switch
- route per-sample bigWig writes through rust behind GVL_RUST_BIGWIG_WRITE
- PyO3 binding for bigwig_write_track
- rust single-pass streaming bigWig write_track
Fix
- rag-variants: string-key field access + getattr for record fields
- rag-variants: getitem preserves leading fixed axes for slice/array indexing
- rag-variants: _share_offsets uses inner char offsets for opaque-string fields
- adapt GenVarLoader to rust-backed seqpro _core.Ragged backend
- un-skip annot e2e test; guard empty annot; prune dead Rust field
- floor track-write window at the input region (#233 follow-up)
- silence E402 for required sys.path bootstrap in bench corpus builder
- surface bigWig write_track I/O and contig errors as PyRuntimeError instead of panic
Refactor
- write: drop unreachable nomask branch; defensive max init; test empty-group region max
- write: per-region max + concat without awkward
- torch: to_nested_tensor accepts _core.Ragged only
- chunked: detect RaggedVariants by type, not ak.Array
- flat-variants: build RaggedVariants via _core.Ragged
- haps: allele layout returns _core.Ragged; AF filter without awkward
- splice: splice_map as _core.Ragged; drop awkward aggregations
- ragged: prepend_pad_itv via seqpro.rag.concatenate; drop awkward RC helpers
- shm: serialize RaggedVariants via _core buffers (no awkward)
- rag-variants: nested-tensor batch from char-view buffers; drop awkward walker
- rag-variants: pad via seqpro.rag.concatenate (empty-group sentinels)
- rag-variants: rc_ via flat allele-level reverse_complement_masked
- rag-variants: to_packed via record Ragged; drop awkward packing helpers
- rag-variants: record-Ragged wrapper skeleton (construction, access, indexing)
- make rust the default bigWig write path; delete legacy + switch
Perf
- vectorize _ragged_stack_tracks (remove per-batch Python loop on interval hot path)
v0.34.0
v0.33.0
Feat
- realign_tracks setting decouples track re-alignment from seq mode
- flat interval reconstruction via FlatIntervals
- add FlatIntervals flat-buffer interval container
Fix
- update RefTracks→SeqsTracks unit tests; pin flat seqs; trim dead union
Refactor
- generalize RefTracks into SeqsTracks (seqs + un-realigned tracks)
v0.32.0
Feat
- add gvl.update for atomic post-hoc track addition
- parallelize gvl.write over track categories (variants first)
- gvl.write accepts annot_tracks
- polars-bio annotation extraction + bigwig annot source
Fix
- guard duplicate track names in update; flat_ends symmetry; test polish
- restore DATASET_FORMAT_VERSION attribute docstring placement
- pyrefly search-path resolves local source; drop stale ignore and unused import
Refactor
- remove Dataset.write_annot_tracks and _annot_to_intervals
- _write_track takes explicit out_dir; add _write_ragged_intervals
v0.31.0
v0.30.0
Feat
- dataset: fold haplotypes into ploidy-1 union in variant decode (#222)
- dataset: reject haplotypes/annotated output under unphased_union (#222)
- dataset: report ploidy=1 and fold n_variants under unphased_union (#222)
- dataset: add unphased_union flag on Haps + with_settings (#222)
Fix
- dataset: keep n_variants int32 under unphased_union fold (#222)
Refactor
- variants: drop retired order-dependent germline-CCF inference (#222)
Perf
- threads: tolerate malformed GVL_NUM_THREADS at import (#221)
- variant-windows: single fused fetch for both-window decode (#221)
- flanks: fuse 3 ref-window fetches into 1 via flank slicing (#221)
- reference: dispatch get_reference kernel serial/parallel (#221)
- reference: dispatch fetch kernel serial/parallel by per-thread bytes (#221)
- threads: cap numba workers to cgroup cores + add dispatch predicate (#221)
v0.29.0
Feat
- flat: dummy padding for flank_tokens and variant-windows in get_variants_flat
- flat: thread unknown_token onto Haps for dummy token fill
- flat: fill_empty_groups fills flank_tokens with unknown_token
- flat: _FlatVariantWindows.fill_empty_groups (all-unknown_token dummy)
- flat: _fill_empty_fixed kernel for flank_tokens empty-row dummy fill
- flat: double_buffered carries flat buffers without awkward re-wrap
- flat: slice_chunk handles flat containers
- flat: shm write/read flat types (kind 1/2/3) without awkward
- flat: instance-axis getitem on flat containers
- flat: with_settings(dummy_variant=...) + export DummyVariant + non-variant guard
- flat: Haps.dummy_variant + apply empty-group fill in get_variants_flat
- flat: DummyVariant + empty-group fill kernels + fill_empty_groups
- flat: export FlatVariantWindows and VarWindowOpt
- flat: VarWindowOpt-driven per-allele variant-windows (window|allele matrix)
- flat: VarWindowOpt + per-allele window/allele computation + optional window fields
- flat: register variant-windows kind end-to-end
- flat: attach flank_tokens / emit windows from get_variants_flat
- flat: ref/alt window assembly + tokenization
- flat: _FlatWindow + _FlatVariantWindows two-level token types
- flat: compute ride-along flank tokens from reference
- flat: thread flank_length + token LUT settings onto Haps
- flat: byte->int token LUT builder for flank tokenization
- flat: A flat variant decode (get_variants_flat) with no awkward
- flat: A0 flat passthrough for seqs/haps/annot/reference outputs
- flat: output_format field + with_output_format + QueryView.flat_output
- flat: export FlatRagged/FlatAnnotatedHaps/FlatVariants/FlatAlleles
- flat: _FlatVariants/_FlatAlleles types with to_ragged()
Fix
- torch: reject variant-windows output over buffered transport up front
- flat: allow dummy_variant with variant-windows output kind
- flat: _fill_empty_seq preserves input dtype for token windows
- flat: reject flat+buffered variants with flank tokens at construction
- flat: harden _FlatVariants slice for flank_tokens; test polish
- flat: with_settings(dummy_variant=False) is a no-op on non-variant datasets
- flat: narrow _seqs to Haps before flank-field replace (pyrefly)
- flat: clear error for variant-windows + active tracks; drop dead alias
- flat: clear error for VarWindowOpt(ref='allele') on REF-less dataset
- flat: clearer splice-unsupported messages for variant-windows
- flat: windows reshape must keep ploidy dim ([1:-1] not [1:-2])
- flat: guard flank-without-LUT, document Haps fields, test validation paths
- flat: shape-driven _FlatAlleles.to_ragged fixes scalar-scalar squeeze byte-identity
- flat: _FlatAlleles.reshape re-appends ragged axis; docstrings + 2D-index test
- flat: add _FlatAlleles.squeeze so _FlatVariants.squeeze works
Refactor
- flat: clarify flat-variant reader names; annotated shm round-trip test
- flat: retire awkward _get_variants; ragged variants decode via flat path
- flat: normalize _FlatVariants.reshape shape arg; drop redundant import
v0.28.0
Feat
- variants: numba allele-pack kernel + layout decomposition for non-canonical views
- write: reject symbolic/breakend variants from SVAR inputs
- write: reject symbolic/breakend variants from PGEN inputs
- write: reject symbolic/breakend variants from VCF inputs
- write: add consolidated unsupported-variant validator
Fix
- variants: rc_() on sliced/reordered views via materialized copy
- variants: to_packed() handles sliced/reordered alt/ref via numba kernel
- open: default to variants when genotypes have no reference
- write: validate VCF variants before creating output directory
v0.27.0
Highlights: robust on-disk artifacts. `gvl.write` now creates datasets atomically under an advisory file lock, and `Dataset.open` validates dataset format version + structural integrity before use. FASTA caches move to a self-describing `.gvlfa` format that builds atomically, auto-migrates legacy `.gvl` caches, and auto-rebuilds when stale. Plus correct `drop_last` handling across all dataloader modes.
✨ Features
Atomic, concurrency-safe dataset creation
gvl.write builds into a private sibling temp directory and publishes via an atomic os.replace, so a destination directory is never observed half-written. A best-effort filelock lets parallel jobs sharing one destination avoid redundant rebuilds — correctness rests on the atomic rename; the lock is advisory only (new filelock dependency).
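The build-then-rename pattern can be sketched with the standard library. This is an illustration of the general technique, not the library's `atomic_dir` primitive; `publish_dir_atomically` is a hypothetical name, and the advisory lock is omitted.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def publish_dir_atomically(dest):
    """Yield a private temp directory; rename it to `dest` on success."""
    dest = Path(dest)
    # A sibling temp dir lives on the same filesystem as `dest`,
    # which is what allows the final rename to be atomic.
    tmp = Path(tempfile.mkdtemp(prefix=f".{dest.name}.", dir=dest.parent))
    try:
        yield tmp
        os.replace(tmp, dest)  # readers see nothing or the full tree
    except BaseException:
        shutil.rmtree(tmp, ignore_errors=True)  # never leave a partial build
        raise
```

A caller writes every file into the yielded directory and lets the `with` block exit normally to publish. The sketch assumes `dest` does not exist yet; how an existing destination is handled is not covered here.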
Dataset format versioning + integrity validation
Metadata now records a format_version, and Dataset.open validates both the version and structural integrity (file presence and sizes) before returning. An incompatible or corrupt dataset raises a clear ValueError instructing you to regenerate with gvl.write. Datasets do not auto-rebuild.
New .gvlfa FASTA cache
gvl.Reference.from_path now builds and reuses a self-describing, fingerprint-validated .gvlfa cache directory (and accepts a .gvlfa path directly). The cache is published atomically under a best-effort lock — concurrent builders sharing one reference are safe — and, unlike datasets, auto-rebuilds from its source when stale or missing. Legacy .fa.gvl caches are migrated in place by reusing their bytes.
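Fingerprint-based staleness detection can be sketched as below. What the real fingerprint contains is not stated in these notes, so this example assumes source size plus modification time; the file names and the `ensure_cache_sketch` helper are hypothetical, and atomic publishing and locking are left out for brevity.

```python
import json
import os
from pathlib import Path


def fingerprint(source) -> dict:
    """Assumed fingerprint: source size and modification time."""
    st = os.stat(source)
    return {"size": st.st_size, "mtime_ns": st.st_mtime_ns}


def cache_is_fresh(source, cache_dir) -> bool:
    """A cache is fresh when its stored fingerprint matches the source."""
    record = Path(cache_dir) / "fingerprint.json"  # hypothetical file name
    if not record.is_file():
        return False
    return json.loads(record.read_text()) == fingerprint(source)


def ensure_cache_sketch(source, cache_dir) -> str:
    """Reuse a fresh cache, otherwise rebuild it from the source."""
    if cache_is_fresh(source, cache_dir):
        return "reused"
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / "sequence.bin").write_bytes(Path(source).read_bytes())
    (cache_dir / "fingerprint.json").write_text(json.dumps(fingerprint(source)))
    return "rebuilt"
```

Storing the fingerprint inside the cache directory is what makes the cache self-describing: validity can be decided from the cache and its source alone.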
🐛 Fixes
- `to_dataloader(drop_last=...)` now honored across all modes: `drop_last` is no longer forwarded to the underlying `DataLoader` in default mode (it was double-applied); `buffered` / `double_buffered` modes honor `drop_last=False`; and `ChunkPlanner` keeps the trailing partial batch.
- BatchSampler conflict warning: `to_dataloader` now warns when a `BatchSampler` overrides an explicit `batch_size`.
- `.gvlfa` migration guards against stale/truncated legacy bytes, and a format-too-new sibling cache now raises instead of silently downgrading.
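The effect of `drop_last` on batch layout comes down to simple arithmetic, sketched here with hypothetical helper names. With `drop_last=False` the trailing partial batch is kept; with `drop_last=True` it is discarded.

```python
import math


def n_batches(n_items: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches an epoch yields."""
    if drop_last:
        return n_items // batch_size
    return math.ceil(n_items / batch_size)


def batch_sizes(n_items: int, batch_size: int, drop_last: bool) -> list:
    """Size of each batch, in order."""
    full, rest = divmod(n_items, batch_size)
    sizes = [batch_size] * full
    if rest and not drop_last:
        sizes.append(rest)  # keep the trailing partial batch
    return sizes
```

For 10 items and a batch size of 4, keeping the partial batch gives sizes 4, 4, 2, while dropping it gives 4, 4.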
🔧 Internals
- New `atomic_dir` directory-publish primitive (temp build + `os.replace`) underpins both dataset and FASTA-cache creation.
- `_fasta_cache` module: `FastaCache` models + fingerprinting, three-way source resolution, build/load/validity guards, legacy migration, and an `ensure_cache` orchestrator.
- Concurrency regression tests for atomic cache + dataset creation (closes #21); coverage for too-old format versions and the genotypes-without-ploidy branch.
Out of scope: genoray `.gvi` and pysam `.fai` / `.gzi` index files are created by those upstream libraries and are not covered by gvl's atomic/locked creation.
Feat
- _open: validate dataset format version + integrity on open
- _write: atomic dataset creation + format_version in Metadata
- _fasta_cache: publish cache atomically via atomic_dir + locked double-check
- _atomic: add atomic_dir directory-publish primitive
- fasta: use .gvlfa cache module and accept .gvlfa input
- _fasta_cache: add ensure_cache orchestrator and dispatch
- _fasta_cache: migrate legacy .gvl caches by reusing bytes
- _fasta_cache: add build, load, and validity guards
- _fasta_cache: add source hints and three-way resolution
- _fasta_cache: add FastaCache models and fingerprint
Fix
- torch: warn when BatchSampler overrides explicit batch_size
- torch: buffered modes honor drop_last=False
- torch: do not forward drop_last to DataLoader in default mode
- chunked: keep trailing partial batch in ChunkPlanner
- test_fasta: move mid-file imports to top (E402, CI lint)
- _fasta_cache: guard legacy migration against stale/truncated bytes
- _fasta_cache: raise on format-too-new sibling cache instead of silent downgrade
Refactor
- _write: use plain with-atomic_dir; restore warnings filter; add atomicity + format_version on-disk tests
- reference: build cache via ensure_cache, accept .gvlfa
- _fasta_cache: fix progress bar advance and tighten load status type