feat(index): read a column's min/max from ZoneMap without a scan #7463

Draft
Ali2Arslan wants to merge 2 commits into lance-format:main from Ali2Arslan:feat/zonemap-value-range

Conversation

@Ali2Arslan

Summary

A ZoneMap scalar index already stores per-zone min/max summaries. This PR folds those summaries into a single global (min, max) for a column without scanning any data, and exposes it on the dataset:

  • ZoneMapIndex::value_range() / ZoneMapIndex::value_range_over(segments) (lance-index) — fold one or more ZoneMap segments.
  • DatasetIndexExt::zonemap_value_range(column) (lance) — the dataset-level accessor.

This is useful for cheap min/max stats and as a planning input (e.g. range pruning) without paying for a scan.
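
The fold described above can be pictured with a minimal sketch. It is not the PR's implementation: `ZoneStats` and `fold_zones` are hypothetical names, and `i64` stands in for the index's `ScalarValue` so the example stays self-contained. It only shows how per-zone summaries collapse into one global range while all-null zones are skipped.

```rust
/// Hypothetical per-zone summary (an assumption for illustration;
/// the real ZoneMap zone carries more state than this).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ZoneStats {
    /// `None` models a zone whose values are all null.
    pub range: Option<(i64, i64)>,
}

/// Fold per-zone `(min, max)` pairs into a single global `(min, max)`.
/// Returns `None` when there are no zones or every zone is all-null.
pub fn fold_zones(zones: &[ZoneStats]) -> Option<(i64, i64)> {
    zones
        .iter()
        .filter_map(|z| z.range) // skip all-null zones
        .reduce(|(lo, hi), (zlo, zhi)| (lo.min(zlo), hi.max(zhi)))
}
```

For example, folding zones `(3, 7)`, all-null, and `(-2, 5)` gives `Some((-2, 7))`, and no column data is read to get there.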

Soundness

The folded range is designed to be safe to prune with — it never drops a matching row:

  • NaN bail. ScalarValue's total order ranks NaN above every finite value, so a zone containing NaN records max = NaN, which hides its true finite max. Folding only the non-NaN maxes would yield a range narrower than the data, and pruning with it could drop matching rows, so any NaN zone makes the result None.
  • Joint fragment coverage. The dataset accessor returns None unless the column's ZoneMap segments jointly cover every live fragment. Fragments appended after the index was built (or a segment set that doesn't span the table) leave a live fragment uncovered → None. Extra dead fragments in the union are fine.
  • The disjoint segments of a multi-segment index are folded together.
  • The result bounds a superset of the live values. It stays conservative under deletion vectors, because a deleted extreme still counts toward its zone's min/max, so the range is safe to prune with but not guaranteed tight.
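
The joint-coverage rule can be sketched as a set check. This is an illustration under assumptions, not the PR's code: `covers_all_live` is a hypothetical helper, fragment ids are modelled as plain `u32` values, and each segment is modelled as the list of fragment ids it was built over.

```rust
use std::collections::HashSet;

/// Returns true when the union of fragment ids across all segments
/// contains every live fragment id. Extra ids in the union (for
/// example, fragments that are no longer live) do not matter.
pub fn covers_all_live(live: &[u32], segments: &[Vec<u32>]) -> bool {
    let covered: HashSet<u32> = segments.iter().flatten().copied().collect();
    live.iter().all(|id| covered.contains(id))
}
```

With live fragments `[0, 1, 2]` and segments `[0, 1]` and `[2, 9]` the check passes, since fragment 9 is simply an extra. Appending fragment 3 without re-indexing makes it fail, which is the case where the accessor is described as returning None.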

Reuses the existing scalar_is_nan helper rather than duplicating NaN logic.
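
To see why the NaN bail is needed, here is a small sketch on plain `f64`. It rests on two assumptions: `f64::total_cmp` from the standard library stands in for `ScalarValue`'s total order (both place a positive NaN above every finite value), and `is_nan` stands in for the `scalar_is_nan` helper, whose signature is not shown in this PR text. `zone_summary` and `fold_f64` are hypothetical names.

```rust
/// Summarize one zone under a total order that ranks (positive) NaN
/// above every finite value. Returns `None` for an empty zone.
pub fn zone_summary(values: &[f64]) -> Option<(f64, f64)> {
    let min = values.iter().copied().min_by(|a, b| a.total_cmp(b))?;
    let max = values.iter().copied().max_by(|a, b| a.total_cmp(b))?;
    Some((min, max))
}

/// Fold zone ranges into one range, bailing out with `None` as soon as
/// any bound is NaN, because a NaN max hides the zone's finite max.
pub fn fold_f64(zones: &[(f64, f64)]) -> Option<(f64, f64)> {
    let mut acc: Option<(f64, f64)> = None;
    for &(zlo, zhi) in zones {
        if zlo.is_nan() || zhi.is_nan() {
            return None; // cannot bound this zone soundly
        }
        acc = Some(match acc {
            None => (zlo, zhi),
            Some((lo, hi)) => (lo.min(zlo), hi.max(zhi)),
        });
    }
    acc
}
```

A zone holding `1.0`, `100.0`, and a NaN summarizes to `(1.0, NaN)`, so the value `100.0` is invisible in the summary. Dropping that zone's max and folding the rest could report a max below `100.0`, which is why the sketch returns `None` instead.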

Test plan

lance-index (value_range / value_range_over):

  • spans multiple fragments/zones
  • all-null → None
  • NaN max → None
  • folds across segments; NaN in any segment → None; skips all-null segment

lance (zonemap_value_range):

  • basic range; column without a ZoneMap → None

  • None when an appended fragment isn't covered by the index

  • folds multiple committed segments of one logical index

  • cargo fmt --all

  • cargo clippy -p lance-index -p lance --tests clean

Follow-up

Python bindings (Dataset-level accessor) can follow in a separate PR to keep this one focused on the Rust core + dataset API.

Made with Cursor

Adds `ZoneMapIndex::value_range` / `value_range_over`, folding the
per-zone min/max summaries of one or more ZoneMap segments into a single
`(min, max)` without scanning data, and exposes it on the dataset via
`DatasetIndexExt::zonemap_value_range`.

Semantics are conservative and sound for pruning:
- NaN bail: `ScalarValue`'s total order ranks NaN above every finite
  value, so a NaN-bearing zone records `max = NaN`, hiding its true
  finite max. Folding only the finite maxes would yield a subset that
  could prune live rows, so any NaN zone makes the range `None`.
- Coverage: the dataset accessor returns `None` unless the column's
  ZoneMap segments jointly cover every live fragment (e.g. fragments
  appended after the index was built leave it uncovered), so the fold
  never sees only a subset of the data. Multi-segment indices are folded
  together. The result is a superset of live values (conservative under
  deletion vectors): safe to prune with, not guaranteed tight.

Reuses the existing `scalar_is_nan` helper rather than duplicating it.

Co-authored-by: Cursor <cursoragent@cursor.com>
github-actions bot added the A-index (Vector index, linalg, tokenizer) and enhancement (New feature or request) labels on Jun 25, 2026
Co-authored-by: Cursor <cursoragent@cursor.com>