Releases: mcvickerlab/GenVarLoader

v0.36.0

@d-laub d-laub released this 28 Jun 02:33

v0.36.0 — The Rust Migration

This is the largest release since 2.0. GenVarLoader's entire core read and write path has been ported from Python/numba to a self-contained Rust crate behind a thin PyO3 (abi3) binding, and numba has been removed from GVL's own code entirely. Python keeps only the ergonomic surface — Dataset indexing, torch integration, validation and error messages — and dispatches into Rust for everything else.

The work landed as Phases 0–5 of the Rust migration roadmap (50+ commits, ~30k lines, mostly differential-parity tests and docs). Every kernel was ported under a strict contract: implement in Rust → differential-test byte-identical against the old numba implementation on property-generated inputs → flip the default and delete the numba version. main stayed shippable at every step.
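
The port contract described above can be pictured with a small differential test. This is only a sketch under stated assumptions: `paint_reference` and `paint_fast` are hypothetical stand-ins for an old and a new kernel (they are not GVL functions), and stdlib `random` with a fixed seed stands in for a property-testing library.

```python
import random


def paint_reference(starts, ends, values, length):
    """Reference kernel (hypothetical): paint intervals one position at a time."""
    track = [0.0] * length
    for s, e, v in zip(starts, ends, values):
        for pos in range(max(s, 0), min(e, length)):
            track[pos] = v
    return track


def paint_fast(starts, ends, values, length):
    """Candidate kernel (hypothetical): same result via slice assignment."""
    track = [0.0] * length
    for s, e, v in zip(starts, ends, values):
        lo, hi = max(s, 0), min(e, length)
        if lo < hi:
            track[lo:hi] = [v] * (hi - lo)
    return track


def random_case(rng, length):
    """Generate intervals that may start before 0 or past the end of the track."""
    n = rng.randint(0, 8)
    starts = [rng.randint(-5, length + 5) for _ in range(n)]
    ends = [s + rng.randint(0, 10) for s in starts]
    values = [rng.random() for _ in range(n)]
    return starts, ends, values


def differential_test(n_cases=500, length=32, seed=0):
    """Fail on the first input where the two kernels disagree."""
    rng = random.Random(seed)
    for _ in range(n_cases):
        case = random_case(rng, length)
        assert paint_reference(*case, length) == paint_fast(*case, length), case
    return n_cases
```

Once the candidate agrees with the reference on every generated input, the default can be flipped and the reference deleted, which is the order the contract prescribes.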


⚠️ Breaking changes

On-disk dataset format → 2.0 (migration required)

Track intervals are now stored struct-of-arrays (starts.npy / ends.npy / values.npy sharing offsets.npy) instead of array-of-structs. This lets the memmaps cross the Python→Rust boundary zero-copy, and fixes an out-of-memory defect at sample-scale where the old layout forced a full contiguous copy on every batch.
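
To make the layout change concrete, here is a minimal sketch of the two layouts. It uses stdlib `array` buffers rather than `.npy` memmaps, and the element types ("i", "f", "q") and the helper `intervals_for` are assumptions for illustration, not the on-disk dtypes or any GVL API.

```python
from array import array

# Array-of-structs: one (start, end, value) record per interval.
aos = [(0, 10, 1.5), (10, 25, 0.5), (3, 8, 2.0)]

# offsets[i]:offsets[i + 1] delimits the intervals that belong to group i.
offsets = array("q", [0, 2, 3])

# Struct-of-arrays: one contiguous, homogeneous buffer per field,
# all three indexed by the same offsets.
starts = array("i", (s for s, _, _ in aos))
ends = array("i", (e for _, e, _ in aos))
values = array("f", (v for _, _, v in aos))


def intervals_for(i):
    """Return the intervals of group i by slicing each field buffer."""
    lo, hi = offsets[i], offsets[i + 1]
    return list(zip(starts[lo:hi], ends[lo:hi], values[lo:hi]))
```

Because each field is its own flat buffer of a single type, it can be handed across a language boundary as a pointer and a length, whereas interleaved records typically need repacking first.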

Datasets written by gvl < 0.36 must be migrated. Opening one now raises:

ValueError: Dataset at <path> uses format version 1.x but this genvarloader
expects 2.0.0. Run `genvarloader.migrate('<path>')` to upgrade it in place.

Migration

import genvarloader as gvl

gvl.migrate("path/to/dataset")   # in place, then open as usual

gvl.migrate is in-place, streaming, and crash-safe (peak extra disk is a single track's interval store; metadata is bumped last via atomic replace). It's idempotent — a no-op on an already-2.0 dataset — and leaves genotypes, regions, and the reference untouched. No re-write() needed.
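
For pipelines that open many datasets, a migrate-on-demand wrapper may be convenient. The sketch below is hypothetical: `open_with_migration` is not part of GVL, and matching on the phrase "format version" in the error text is an assumption based on the message shown above. The open and migrate callables are passed in so the logic can be exercised without a dataset on disk.

```python
def open_with_migration(path, open_fn, migrate_fn):
    """Open a dataset, migrating it in place once if its format is too old."""
    try:
        return open_fn(path)
    except ValueError as err:
        # Assumption: old-format errors can be recognized by this phrase.
        if "format version" not in str(err):
            raise
        migrate_fn(path)
        return open_fn(path)
```

In real use this would be called as `open_with_migration("path/to/dataset", gvl.Dataset.open, gvl.migrate)`. Since the migration is idempotent, a second process racing through the same wrapper should find nothing left to do.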

Other breaking changes

  • GVL_BACKEND removed. The numba/rust runtime backend switch and _dispatch.py are gone — Python now calls Rust directly. If you set GVL_BACKEND anywhere, drop it (it's a no-op now).
  • Unsupported track types now raise TypeError at write time (the dead custom IntervalTrack write path was removed).

✨ New

  • gvl.migrate(path) — the in-place 1.x → 2.0 dataset migrator described above (new public API).
  • Self-contained Rust crate. The core data structures and algorithms are now a cargo test-able Rust crate (114 Rust tests), usable from Rust directly and built on the pyo3-free seqpro-core Ragged layout — no reimplementation in GVL.
  • Rust-accelerated variant-windows assembly — an entire Python assembly pass for windowed/bare-allele variant output now happens in a single Rust call.

⚡ Performance

Final single-thread rust-vs-numba Dataset.__getitem__ A/B (the last apples-to-apples comparison before numba was deleted; chr22_geuv, single thread). Rust is parity-or-better on every output mode:

Output mode                 rust ÷ numba
tracks-only                 1.07×
haplotypes / tracks+seqs    1.66×
annotated                   1.43×
variants                    1.38×
variant-windows             4.58×

Highlights:

  • Five fused __getitem__ kernels (plain / annotated / spliced / annotated-spliced haplotypes, plus fused tracks) each cross the FFI boundary exactly once.
  • gvl.write() / update() are fully Rust-backed: single-pass streaming bigWig writer and a COITrees-based table/annotation overlap engine. The bigWig write slice is ~1.88× faster with ~28% less allocation than the legacy path.
  • Rayon batch parallelism on every read kernel, verified byte-identical to the serial result and engaged only above a size threshold (production-scale batches, e.g. SEQLEN ≥ 131072 or BATCH ≥ 256).
  • GVL no longer contributes the ~3 GB llvmlite JIT cost it used to (note: seqpro still pulls numba transitively for now — its own numba removal is upstream).
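
The size-threshold gating in the highlights can be sketched in Python. This shows only the shape of the dispatch: the thresholds are the example values quoted above, while the exact predicate, `reverse_row`, and `process_batch` are assumptions. Python threads will not reproduce Rayon's speedup for this kind of work.

```python
from concurrent.futures import ThreadPoolExecutor

SEQLEN_THRESHOLD = 131072  # example value from the notes above
BATCH_THRESHOLD = 256      # example value from the notes above


def should_parallelize(batch_size, seqlen):
    """Engage the parallel path only for production-scale batches."""
    return seqlen >= SEQLEN_THRESHOLD or batch_size >= BATCH_THRESHOLD


def reverse_row(row):
    """Stand-in per-row kernel (hypothetical)."""
    return row[::-1]


def process_batch(rows, seqlen, max_workers=4):
    """Run the serial loop for small batches, a worker pool for large ones."""
    if not should_parallelize(len(rows), seqlen):
        return [reverse_row(r) for r in rows]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order, so both paths return equal results.
        return list(pool.map(reverse_row, rows))
```

Keeping both paths behind a single function makes it straightforward to assert that the parallel output equals the serial output on the same batch.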

📦 Dependencies

  • Requires seqpro ≥ 0.20 (Rust-backed Ragged) and the seqpro-core 0.1.0 crate.
  • awkward dropped from the hot paths.

🔬 Correctness notes

  • The Rust port is byte-identical to the previous numba implementation across the differential-parity suite. Two narrow sub-domains are intentionally excluded from the parity oracle because numba was the buggy side there (the #242-family start >= contig_len interval clip, and a reconstruct trailing-fill under-write) — Rust is correct in both, and these are now covered by direct oracle tests.
  • Verification at release: 973 pytest passed / 44 skipped / 5 xfailed / 0 failed; 114 cargo tests passed; ruff, pyrefly, and clippy clean.

Full changelog (conventional commits)

Feat

  • rayon: parallelize get_diffs_sparse + intervals_to_tracks (C3)
  • rayon: parallelize shift_and_realign_tracks_sparse and tracks_to_intervals (Task C2)
  • rayon: parallelize reconstruct_haplotypes_from_sparse with rayon batch parallelism
  • delete numba backend — rust-only read path (Phase 5 W5)
  • rust: fuse annotated+spliced haplotype reconstruction into one FFI crossing (Phase 5 W3)
  • register rc_alleles dispatch (rust default, seqpro reference)
  • rust: rc_alleles PyO3 wrapper + registration
  • rust: rc_alleles_inplace core for variant-allele RC
  • rust: debug_assert to_rc mask length in kernel RC blocks
  • fold strand RC into rust kernels; numba post-pass retained as oracle
  • rust: optional in-kernel RC for spliced haplotype kernel
  • rust: optional in-kernel RC for annotated haplotype kernel
  • rust: optional in-kernel reverse for track realign kernel
  • rust: optional in-kernel RC for reconstruct_haplotypes_fused
  • rust: optional in-kernel RC for get_reference
  • rust: in-place reverse/reverse-complement primitives for read path
  • py: assemble_variant_buffers numba oracle, rust shim, and dict parity harness
  • ffi: assemble_variant_buffers_{u8,i32} pyfunctions
  • variants: assemble_windows_mode (token windows + bare alleles)
  • variants: assemble_variants_mode (alt/ref bytes + flank tokens)
  • variants: add fetch_windows reference-read helper
  • variants: add tokenize/slice_flanks/assemble_alt_window cores
  • ffi: zero-copy boundary guard for sample-scale memmaps
  • migrate: add gvl.migrate for 1.x AoS -> 2.0 SoA
  • open: gate dataset open on format_version major
  • format: store track intervals as struct-of-arrays (gvl 2.0)
  • intervals: route intervals_to_tracks through backend dispatch (default rust)
  • ffi: Rust intervals_to_tracks + ffi seam module
  • utils: route splits_sum_le_value through backend dispatch (default rust)
  • ffi: Rust splits_sum_le_value + ffi seam module
  • dispatch: backend registry for Rust migration strangler window
  • ragged: route to_padded through seqpro-core Rust bridge
  • rust: consume seqpro-core via rlib; add ragged_to_padded bridge

Fix

  • rayon: debug_assert offset monotonicity in C1 carve; correct test comment
  • env: keep conda numba pin (seqpro needs working libllvmlite); guard stays own-code
  • threads: remove conditional numba import; update thread tests for pure-OS detection
  • test: drop stale _recon_mod.intervals_to_tracks spy (B2 removed that import)
  • test: restore generate_goldens regeneration; clean dead GVL_BACKEND in bench conftest (W5 B1)
  • reconstruct,tracks: pad full tail in numba trailing-fill on ref overshoot
  • reconstruct: pad full tail when ref exhausted, not from index 0
  • test: add __init__.py to disambiguate test_write collision; ruff fmt
  • variants: drop unused ArrayView2 import
  • indexing: SpliceIndexer.parse_idx double-applies sample-subset map
  • intervals: clip sub-query interval starts in both kernels (#242)
  • tracks: clamp writable_ref when deletion extends past track end
  • reconstruct: guard non-annotated parity test against numba SystemError; correct rayon-deferral comment
  • reconstruct: strengthen SAFETY comments; rename batch test to match serial-only impl
  • reconstruct: clamp writable_ref when ref_idx past contig end; skip numba annotated flake
  • reference: revert padded_slice leniency — mirror numba's loud failure for start>=clen (parity twin)
  • test: update stale _gather_v_idxs_ss import after Task 5 rename; lint/docstring cleanup
  • variants: gather_rows must preserve data dtype (dosage/custom fields)
  • bench: profile variants variable-length (with_len is meaningless for variants)
  • stub: sync genvarloader.pyi — bigwig_intervals rename + intervals_to_tracks
  • test: drop stale rv._rag access in flat_mode_equivalence after Ragged subclass refactor
  • dispatch: guard isinstance(Ragged) sites for RaggedVariants subclass

Refactor

  • delete numba kernels; numpy fallbacks for #231 dtype paths
  • backend: B2 — collapse backend-conditional branches; delete GVL_BACKEND/_active_backend
  • dispatch: B1 — replace all get() call sites with direct rust calls, delete _dispatch
  • rust: extract reverse::rc_row shared helper
  • write: delete dead legacy track path + splits_sum_le_value
  • drop unreachable spliced variant-RC guard
  • route variant-allele RC through dispatched rc_alleles kernel
  • genotypes: delete dead filter_af kernel + its dead test (superseded by inline numpy)
  • variants: RaggedVariants subclasses Ragged, drop _rag composition

Perf

  • rust: fuse rc_alleles_inplace — 186→308 instrs (rc_row inlined), drop Vec alloc + rescan
  • rust: tune reconstruct_haplotypes_from_sparse — 2839→1279 instrs, 0.655→0.589 rust÷numba
  • rust: tune rc_flat_rows_inplace — 212→283 instrs (vectorized), 0.664→0.635 rust÷numba
  • rust: tune assemble_alt_window — 518→727 asm lines (memcpy-expanded), 1.146→0.835 ms/batch
  • rust: tune slice_flanks — 389→429 total instrs (hot-pa...

v0.35.0

@d-laub d-laub released this 23 Jun 16:09

Feat

  • promote gvl.Table to public API; remove experimental subpackage
  • route Table/annot writes through Rust; vectorize contig norm
  • back gvl.Table with Rust COITrees engine; drop polars-bio
  • rust: streaming Table writer + PyO3 RustTable methods
  • rust: materialize ordered intervals from offsets
  • rust: COITrees overlap count for RustTable
  • rust: table interval store + RustTable::build
  • warn when opening datasets with variant-truncated track windows (#233)
  • route annotation bigWig writes through rust behind switch
  • route per-sample bigWig writes through rust behind GVL_RUST_BIGWIG_WRITE
  • PyO3 binding for bigwig_write_track
  • rust single-pass streaming bigWig write_track

Fix

  • rag-variants: string-key field access + __getattr__ for record fields
  • rag-variants: __getitem__ preserves leading fixed axes for slice/array indexing
  • rag-variants: _share_offsets uses inner char offsets for opaque-string fields
  • adapt GenVarLoader to rust-backed seqpro _core.Ragged backend
  • un-skip annot e2e test; guard empty annot; prune dead Rust field
  • floor track-write window at the input region (#233 follow-up)
  • silence E402 for required sys.path bootstrap in bench corpus builder
  • surface bigWig write_track I/O and contig errors as PyRuntimeError instead of panic

Refactor

  • write: drop unreachable nomask branch; defensive max init; test empty-group region max
  • write: per-region max + concat without awkward
  • torch: to_nested_tensor accepts _core.Ragged only
  • chunked: detect RaggedVariants by type, not ak.Array
  • flat-variants: build RaggedVariants via _core.Ragged
  • haps: allele layout returns _core.Ragged; AF filter without awkward
  • splice: splice_map as _core.Ragged; drop awkward aggregations
  • ragged: prepend_pad_itv via seqpro.rag.concatenate; drop awkward RC helpers
  • shm: serialize RaggedVariants via _core buffers (no awkward)
  • rag-variants: nested-tensor batch from char-view buffers; drop awkward walker
  • rag-variants: pad via seqpro.rag.concatenate (empty-group sentinels)
  • rag-variants: rc_ via flat allele-level reverse_complement_masked
  • rag-variants: to_packed via record Ragged; drop awkward packing helpers
  • rag-variants: record-Ragged wrapper skeleton (construction, access, indexing)
  • make rust the default bigWig write path; delete legacy + switch

Perf

  • vectorize _ragged_stack_tracks (remove per-batch Python loop on interval hot path)

v0.34.0

@d-laub d-laub released this 19 Jun 16:17

Feat

  • gather custom FORMAT fields into RaggedVariants like dosage (#231)
  • haps: load custom FORMAT fields into Haps.var_field_data (#231)
  • discover genoray custom FORMAT fields in available_var_fields (#231)

Fix

  • off-by-one in bigWig interval start collapsed wide annot tracks (#233)
  • resolve svar override path in _lazy_load_custom_fields (#231)

v0.33.0

@d-laub d-laub released this 18 Jun 05:05

Feat

  • realign_tracks setting decouples track re-alignment from seq mode
  • flat interval reconstruction via FlatIntervals
  • add FlatIntervals flat-buffer interval container

Fix

  • update RefTracks→SeqsTracks unit tests; pin flat seqs; trim dead union

Refactor

  • generalize RefTracks into SeqsTracks (seqs + un-realigned tracks)

v0.32.0

@d-laub d-laub released this 17 Jun 23:54

Feat

  • add gvl.update for atomic post-hoc track addition
  • parallelize gvl.write over track categories (variants first)
  • gvl.write accepts annot_tracks
  • polars-bio annotation extraction + bigwig annot source

Fix

  • guard duplicate track names in update; flat_ends symmetry; test polish
  • restore DATASET_FORMAT_VERSION attribute docstring placement
  • pyrefly search-path resolves local source; drop stale ignore and unused import

Refactor

  • remove Dataset.write_annot_tracks and _annot_to_intervals
  • _write_track takes explicit out_dir; add _write_ragged_intervals

v0.31.0

@d-laub d-laub released this 15 Jun 11:30

Feat

  • table: ship polars-bio as the table extra
  • table: make gvl.Table an opt-in experimental feature

Fix

  • write: raise clear ValueError when variant/track sample intersection is empty (#225)

v0.30.0

@d-laub d-laub released this 14 Jun 02:32

Feat

  • dataset: fold haplotypes into ploidy-1 union in variant decode (#222)
  • dataset: reject haplotypes/annotated output under unphased_union (#222)
  • dataset: report ploidy=1 and fold n_variants under unphased_union (#222)
  • dataset: add unphased_union flag on Haps + with_settings (#222)

Fix

  • dataset: keep n_variants int32 under unphased_union fold (#222)

Refactor

  • variants: drop retired order-dependent germline-CCF inference (#222)

Perf

  • threads: tolerate malformed GVL_NUM_THREADS at import (#221)
  • variant-windows: single fused fetch for both-window decode (#221)
  • flanks: fuse 3 ref-window fetches into 1 via flank slicing (#221)
  • reference: dispatch get_reference kernel serial/parallel (#221)
  • reference: dispatch fetch kernel serial/parallel by per-thread bytes (#221)
  • threads: cap numba workers to cgroup cores + add dispatch predicate (#221)
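
The thread-count handling in this list (tolerating a malformed GVL_NUM_THREADS and capping workers to the cores actually available) might look roughly like the sketch below. `resolve_num_threads` is a hypothetical helper, and CPU affinity is used as a stand-in for the cgroup limit, which is an assumption: a real cgroup quota is read differently.

```python
import os


def resolve_num_threads(env=None):
    """Pick a worker count from GVL_NUM_THREADS, capped to usable cores."""
    env = os.environ if env is None else env
    # Cores this process may run on (Linux); otherwise fall back to all cores.
    if hasattr(os, "sched_getaffinity"):
        available = len(os.sched_getaffinity(0))
    else:
        available = os.cpu_count() or 1
    raw = env.get("GVL_NUM_THREADS")
    try:
        requested = int(raw) if raw is not None else available
    except ValueError:
        requested = available  # malformed value: ignore it instead of failing
    return max(1, min(requested, available))
```

Falling back to a default at import time keeps a typo in an environment variable from breaking every script that imports the package.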

v0.29.0

@d-laub d-laub released this 13 Jun 22:58

Feat

  • flat: dummy padding for flank_tokens and variant-windows in get_variants_flat
  • flat: thread unknown_token onto Haps for dummy token fill
  • flat: fill_empty_groups fills flank_tokens with unknown_token
  • flat: _FlatVariantWindows.fill_empty_groups (all-unknown_token dummy)
  • flat: _fill_empty_fixed kernel for flank_tokens empty-row dummy fill
  • flat: double_buffered carries flat buffers without awkward re-wrap
  • flat: slice_chunk handles flat containers
  • flat: shm write/read flat types (kind 1/2/3) without awkward
  • flat: instance-axis getitem on flat containers
  • flat: with_settings(dummy_variant=...) + export DummyVariant + non-variant guard
  • flat: Haps.dummy_variant + apply empty-group fill in get_variants_flat
  • flat: DummyVariant + empty-group fill kernels + fill_empty_groups
  • flat: export FlatVariantWindows and VarWindowOpt
  • flat: VarWindowOpt-driven per-allele variant-windows (window|allele matrix)
  • flat: VarWindowOpt + per-allele window/allele computation + optional window fields
  • flat: register variant-windows kind end-to-end
  • flat: attach flank_tokens / emit windows from get_variants_flat
  • flat: ref/alt window assembly + tokenization
  • flat: _FlatWindow + _FlatVariantWindows two-level token types
  • flat: compute ride-along flank tokens from reference
  • flat: thread flank_length + token LUT settings onto Haps
  • flat: byte->int token LUT builder for flank tokenization
  • flat: A flat variant decode (get_variants_flat) with no awkward
  • flat: A0 flat passthrough for seqs/haps/annot/reference outputs
  • flat: output_format field + with_output_format + QueryView.flat_output
  • flat: export FlatRagged/FlatAnnotatedHaps/FlatVariants/FlatAlleles
  • flat: _FlatVariants/_FlatAlleles types with to_ragged()
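
As an illustration of the byte->int token LUT idea in the list above, a lookup table for flank tokenization could be built as follows. The function names, the "ACGT" alphabet, and the choice of 4 as `unknown_token` are assumptions for this sketch, not GVL's actual builder or defaults.

```python
def build_token_lut(alphabet, unknown_token):
    """Map every possible byte value (0-255) to a token id."""
    lut = [unknown_token] * 256
    for token, byte in enumerate(alphabet):
        lut[byte] = token  # iterating bytes yields ints
    return lut


def tokenize(seq, lut):
    """Translate a byte sequence into token ids with one lookup per byte."""
    return [lut[b] for b in seq]


lut = build_token_lut(b"ACGT", unknown_token=4)
```

With this table, `tokenize(b"ACGTN", lut)` gives `[0, 1, 2, 3, 4]`, and the same unknown token can serve as the dummy fill for empty rows.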

Fix

  • torch: reject variant-windows output over buffered transport up front
  • flat: allow dummy_variant with variant-windows output kind
  • flat: _fill_empty_seq preserves input dtype for token windows
  • flat: reject flat+buffered variants with flank tokens at construction
  • flat: harden _FlatVariants slice for flank_tokens; test polish
  • flat: with_settings(dummy_variant=False) is a no-op on non-variant datasets
  • flat: narrow _seqs to Haps before flank-field replace (pyrefly)
  • flat: clear error for variant-windows + active tracks; drop dead alias
  • flat: clear error for VarWindowOpt(ref='allele') on REF-less dataset
  • flat: clearer splice-unsupported messages for variant-windows
  • flat: windows reshape must keep ploidy dim ([1:-1] not [1:-2])
  • flat: guard flank-without-LUT, document Haps fields, test validation paths
  • flat: shape-driven _FlatAlleles.to_ragged fixes scalar-scalar squeeze byte-identity
  • flat: _FlatAlleles.reshape re-appends ragged axis; docstrings + 2D-index test
  • flat: add _FlatAlleles.squeeze so _FlatVariants.squeeze works

Refactor

  • flat: clarify flat-variant reader names; annotated shm round-trip test
  • flat: retire awkward _get_variants; ragged variants decode via flat path
  • flat: normalize _FlatVariants.reshape shape arg; drop redundant import

v0.28.0

@d-laub d-laub released this 08 Jun 09:00

Feat

  • variants: numba allele-pack kernel + layout decomposition for non-canonical views
  • write: reject symbolic/breakend variants from SVAR inputs
  • write: reject symbolic/breakend variants from PGEN inputs
  • write: reject symbolic/breakend variants from VCF inputs
  • write: add consolidated unsupported-variant validator

Fix

  • variants: rc_() on sliced/reordered views via materialized copy
  • variants: to_packed() handles sliced/reordered alt/ref via numba kernel
  • open: default to variants when genotypes have no reference
  • write: validate VCF variants before creating output directory

v0.27.0

@d-laub d-laub released this 05 Jun 09:27

Highlights: robust on-disk artifacts — gvl.write now creates datasets atomically under an advisory file lock, and Dataset.open validates dataset format version + structural integrity before use. FASTA caches move to a self-describing .gvlfa format that builds atomically, auto-migrates legacy .gvl caches, and auto-rebuilds when stale. Plus correct drop_last handling across all dataloader modes.

✨ Features

Atomic, concurrency-safe dataset creation
gvl.write builds into a private sibling temp directory and publishes via an atomic os.replace, so a destination directory is never observed half-written. A best-effort filelock lets parallel jobs sharing one destination avoid redundant rebuilds — correctness rests on the atomic rename; the lock is advisory only (new filelock dependency).
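
The temp-build-then-rename pattern can be sketched with the standard library. This is not GVL's `atomic_dir` implementation, whose signature is not shown here; it is a minimal illustration that assumes the destination does not exist yet and that the temp directory sits on the same filesystem.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def atomic_dir(dest):
    """Build into a sibling temp directory, then publish with one rename."""
    parent = os.path.dirname(os.path.abspath(dest))
    tmp = tempfile.mkdtemp(prefix=".tmp-", dir=parent)
    try:
        yield tmp
        os.replace(tmp, dest)  # atomic when tmp and dest share a filesystem
    except BaseException:
        # Build or publish failed: leave no partial directory behind.
        shutil.rmtree(tmp, ignore_errors=True)
        raise
```

A caller writes all files under the yielded path inside a `with atomic_dir(dest) as tmp:` block. Readers either see no destination at all or a complete one.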

Dataset format versioning + integrity validation
Metadata now records a format_version, and Dataset.open validates both the version and structural integrity (file presence and sizes) before returning. An incompatible or corrupt dataset raises a clear ValueError instructing you to regenerate with gvl.write. Datasets do not auto-rebuild.

New .gvlfa FASTA cache
gvl.Reference.from_path now builds and reuses a self-describing, fingerprint-validated .gvlfa cache directory (and accepts a .gvlfa path directly). The cache is published atomically under a best-effort lock — concurrent builders sharing one reference are safe — and, unlike datasets, auto-rebuilds from its source when stale or missing. Legacy .fa.gvl caches are migrated in place by reusing their bytes.
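
A fingerprint-validated cache check might look like the sketch below. The notes do not say what the `.gvlfa` fingerprint contains, so everything here is an assumption for illustration: the fingerprint is derived from the source file's size and modification time, and it is stored in a hypothetical `meta.json` inside the cache directory.

```python
import hashlib
import json
import os


def fingerprint(path):
    """Cheap fingerprint of a source file (assumption: size plus mtime)."""
    st = os.stat(path)
    payload = f"{st.st_size}:{st.st_mtime_ns}".encode()
    return hashlib.sha256(payload).hexdigest()


def cache_is_valid(source, cache_dir):
    """Return True only if the cache records the source's current fingerprint."""
    meta_path = os.path.join(cache_dir, "meta.json")
    try:
        with open(meta_path) as f:
            meta = json.load(f)
    except (OSError, ValueError):
        return False  # missing or unreadable metadata counts as stale
    return meta.get("fingerprint") == fingerprint(source)
```

When the check fails, the cache is rebuilt from the source. That is safe for a FASTA cache because it is fully derivable from the source file.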

🐛 Fixes

  • to_dataloader(drop_last=...) now honored across all modes: drop_last is no longer forwarded to the underlying DataLoader in default mode (it was double-applied); buffered/double_buffered modes honor drop_last=False; and ChunkPlanner keeps the trailing partial batch.
  • BatchSampler conflict warning: to_dataloader now warns when a BatchSampler overrides an explicit batch_size.
  • .gvlfa migration guards against stale/truncated legacy bytes, and a format-too-new sibling cache now raises instead of silently downgrading.
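
The batch counts that `drop_last` should produce can be checked with a little arithmetic. The helpers below are illustrative only and are not part of GVL or PyTorch.

```python
import math


def num_batches(n_items, batch_size, drop_last):
    """Number of batches a loader should yield for n_items."""
    if drop_last:
        return n_items // batch_size
    return math.ceil(n_items / batch_size)


def batch_sizes(n_items, batch_size, drop_last):
    """Sizes of the yielded batches, including any trailing partial batch."""
    full, rem = divmod(n_items, batch_size)
    sizes = [batch_size] * full
    if rem and not drop_last:
        sizes.append(rem)
    return sizes
```

For 10 items and a batch size of 4, `drop_last=False` yields sizes `[4, 4, 2]` and `drop_last=True` yields `[4, 4]`. A loader that applies `drop_last` twice, or drops the partial batch when `drop_last=False`, will disagree with these counts.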

🔧 Internals

  • New atomic_dir directory-publish primitive (temp build + os.replace) underpins both dataset and FASTA-cache creation.
  • _fasta_cache module: FastaCache models + fingerprinting, three-way source resolution, build/load/validity guards, legacy migration, and an ensure_cache orchestrator.
  • Concurrency regression tests for atomic cache + dataset creation (closes #21); coverage for too-old format versions and the genotypes-without-ploidy branch.

Out of scope: genoray .gvi and pysam .fai/.gzi index files are created by those upstream libraries and are not covered by gvl's atomic/locked creation.


Feat

  • _open: validate dataset format version + integrity on open
  • _write: atomic dataset creation + format_version in Metadata
  • _fasta_cache: publish cache atomically via atomic_dir + locked double-check
  • _atomic: add atomic_dir directory-publish primitive
  • fasta: use .gvlfa cache module and accept .gvlfa input
  • _fasta_cache: add ensure_cache orchestrator and dispatch
  • _fasta_cache: migrate legacy .gvl caches by reusing bytes
  • _fasta_cache: add build, load, and validity guards
  • _fasta_cache: add source hints and three-way resolution
  • _fasta_cache: add FastaCache models and fingerprint

Fix

  • torch: warn when BatchSampler overrides explicit batch_size
  • torch: buffered modes honor drop_last=False
  • torch: do not forward drop_last to DataLoader in default mode
  • chunked: keep trailing partial batch in ChunkPlanner
  • test_fasta: move mid-file imports to top (E402, CI lint)
  • _fasta_cache: guard legacy migration against stale/truncated bytes
  • _fasta_cache: raise on format-too-new sibling cache instead of silent downgrade

Refactor

  • _write: use plain with-atomic_dir; restore warnings filter; add atomicity + format_version on-disk tests
  • reference: build cache via ensure_cache, accept .gvlfa
  • _fasta_cache: fix progress bar advance and tighten load status type