feat(index): read a column's min/max from ZoneMap without a scan #7463
Draft
Ali2Arslan wants to merge 2 commits into
Adds `ZoneMapIndex::value_range` / `value_range_over`, folding the per-zone min/max summaries of one or more ZoneMap segments into a single `(min, max)` without scanning data, and exposes it on the dataset via `DatasetIndexExt::zonemap_value_range`.

Semantics are conservative and sound for pruning:

- NaN bail: `ScalarValue`'s total order ranks NaN above every finite value, so a NaN-bearing zone records `max = NaN`, hiding its true finite max. Folding only the finite maxes would yield a subset that could prune live rows, so any NaN zone makes the range `None`.
- Coverage: the dataset accessor returns `None` unless the column's ZoneMap segments jointly cover every live fragment (e.g. fragments appended after the index was built leave it uncovered), so the fold never sees only a subset of the data. Multi-segment indices are folded together.

The result is a superset of live values (conservative under deletion vectors): safe to prune with, not guaranteed tight.

Reuses the existing `scalar_is_nan` helper rather than duplicating it.

Co-authored-by: Cursor <cursoragent@cursor.com>
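As a rough illustration of the fold and the NaN bail described in this commit message, here is a minimal sketch over plain `f64` summaries. `ZoneStats` and `fold_value_range` are hypothetical names for this example only; the actual code works on `ScalarValue` and may be structured differently. Skipping zones that carry no min/max (all-null) is also an assumption of the sketch.

```rust
/// Hypothetical per-zone summary for illustration; the real ZoneMap stores
/// `ScalarValue`s, not `f64`s.
#[derive(Debug, Clone, Copy)]
struct ZoneStats {
    /// `None` stands for a zone that holds only nulls (assumption).
    min: Option<f64>,
    max: Option<f64>,
}

/// Fold per-zone summaries into one `(min, max)`.
/// Returns `None` when there is nothing to fold or when any zone carries NaN.
fn fold_value_range(zones: &[ZoneStats]) -> Option<(f64, f64)> {
    let mut range: Option<(f64, f64)> = None;
    for zone in zones {
        // All-null zones contribute no values, so they are skipped.
        let (Some(lo), Some(hi)) = (zone.min, zone.max) else {
            continue;
        };
        // NaN bail: a NaN bound hides the zone's true finite bound, so no
        // sound range can be derived from the summaries alone.
        if lo.is_nan() || hi.is_nan() {
            return None;
        }
        // Widen the running range to include this zone.
        range = Some(match range {
            None => (lo, hi),
            Some((cur_lo, cur_hi)) => (cur_lo.min(lo), cur_hi.max(hi)),
        });
    }
    range
}
```

Returning `None` on the first NaN, rather than ignoring that zone, mirrors the conservative choice above: a caller that gets `None` has to fall back to a scan or skip pruning.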
Summary
A ZoneMap scalar index already stores per-zone `min`/`max` summaries. This PR folds those summaries into a single global `(min, max)` for a column without scanning any data, and exposes it on the dataset:

- `ZoneMapIndex::value_range()` / `ZoneMapIndex::value_range_over(segments)` (lance-index): fold one or more ZoneMap segments.
- `DatasetIndexExt::zonemap_value_range(column)` (lance): the dataset-level accessor.

This is useful for cheap min/max stats and as a planning input (e.g. range pruning) without paying for a scan.
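The dataset-level accessor only returns a range when the ZoneMap segments jointly cover every live fragment. A minimal sketch of that gate, assuming fragments are identified by `u32` ids and using a plain `HashSet` as a stand-in for whatever fragment-set representation the index metadata uses; `segments_cover_live_fragments` and `gated_range` are hypothetical names, not part of this PR's API:

```rust
use std::collections::HashSet;

/// True when every live fragment id appears in at least one segment's
/// covered set. Extra (dead) fragment ids in the union are harmless.
fn segments_cover_live_fragments(live: &[u32], segments: &[HashSet<u32>]) -> bool {
    // Union of the fragment ids covered by all segments.
    let covered: HashSet<u32> = segments.iter().flatten().copied().collect();
    live.iter().all(|id| covered.contains(id))
}

/// Pass the folded range through only when coverage is complete; otherwise
/// the fold saw a subset of the data and the range is not safe to prune with.
fn gated_range(
    live: &[u32],
    segments: &[HashSet<u32>],
    folded: Option<(f64, f64)>,
) -> Option<(f64, f64)> {
    if segments_cover_live_fragments(live, segments) {
        folded
    } else {
        None
    }
}
```

For example, with segments covering fragments `{0, 1}` and `{2, 9}`, live fragments `[0, 1, 2]` pass the gate, while appending a fragment `3` that no segment covers makes the accessor return `None`.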
Soundness
The folded range is designed to be safe to prune with: it never drops a matching row.

- NaN bail: `ScalarValue`'s total order ranks NaN above every finite value, so a NaN-bearing zone records `max = NaN`, hiding its true finite max. Folding only the finite maxes would yield a subset that could prune live rows, so any NaN zone makes the result `None`.
- Coverage: the dataset accessor returns `None` unless the column's ZoneMap segments jointly cover every live fragment. Fragments appended after the index was built (or a segment set that doesn't span the table) leave a live fragment uncovered, so the result is `None`. Extra dead fragments in the union are fine.

Reuses the existing `scalar_is_nan` helper rather than duplicating NaN logic.

Test plan
lance-index (`value_range` / `value_range_over`):

- … `None`; … `None`; … `None`; skips all-null segment

lance (`zonemap_value_range`):

- basic range; column without a ZoneMap → `None`
- `None` when an appended fragment isn't covered by the index
- folds multiple committed segments of one logical index

Checks:

- `cargo fmt --all`
- `cargo clippy -p lance-index -p lance --tests` clean

Follow-up
Python bindings (`Dataset`-level accessor) can follow in a separate PR to keep this one focused on the Rust core + dataset API.

Made with Cursor