feat: mask data overlay files in scalar and vector index queries #7411

Draft
wjones127 wants to merge 9 commits into lance-format:main from wjones127:will/oss-1325-indexes-mask-data-overlay-files-correctly

@wjones127

Contributor

Implements OSS-1325 (index masking for data overlay files). Stacked on #7409 (OSS-1324): this branch also contains the OSS-1322/1323/1324 commits, so until the parent PRs merge, review only the final commit (feat: mask data overlay files in scalar and vector index queries).

Problem

An index built before an overlay does not reflect the overlay's values, so its entries for overlay-covered cells may be stale. For example, after an overlay sets a row's age to 26, `WHERE age = 25` must not return that row from the index, and `WHERE age = 26` must find it. Queries must stay correct while overlays remain.

Approach

For each index a query relies on, compute the fragments carrying an overlay that was committed after the index (`committed_version > index.dataset_version`) and that touches a field the index covers. The check is:

  • field-aware: an overlay touching only non-indexed fields excludes nothing;
  • version-gated: an overlay already incorporated by the index (`committed_version <= index.dataset_version`) is ignored.

This is the new `overlay_exclusion_offsets` helper in `dataset/overlay.rs`.
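The two conditions can be illustrated with a minimal sketch. The `Overlay` and `Fragment` structs and the `stale_fragments` function below are hypothetical simplifications written for this example; they are not the signature or return type of `overlay_exclusion_offsets`, which this page does not show.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified stand-in for an overlay attached to a fragment.
pub struct Overlay {
    pub committed_version: u64,
    /// Ids of the fields this overlay supplies values for.
    pub field_ids: Vec<u32>,
}

/// Hypothetical, simplified stand-in for a fragment.
pub struct Fragment {
    pub id: u64,
    pub overlays: Vec<Overlay>,
}

/// Returns the ids of fragments whose index entries may be stale.
pub fn stale_fragments(
    fragments: &[Fragment],
    index_dataset_version: u64,
    indexed_fields: &[u32],
) -> BTreeSet<u64> {
    fragments
        .iter()
        .filter(|frag| {
            frag.overlays.iter().any(|ov| {
                // Version gate: only overlays committed after the index count.
                ov.committed_version > index_dataset_version
                    // Field awareness: the overlay must touch an indexed field.
                    && ov.field_ids.iter().any(|f| indexed_fields.contains(f))
            })
        })
        .map(|frag| frag.id)
        .collect()
}
```

With an index at dataset version 3 over field 1, only a fragment whose overlay has `committed_version > 3` and covers field 1 is reported; an overlay at version 3, or one on field 2 only, excludes nothing.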

Such fragments are dropped from the index's covered set so they fall to the flat path — the same path already used for fragments the index never covered — which re-evaluates them against their current (overlay-merged) values:

  • Scalar (`FilteredReadExec`): stale fragments are removed from the `EvaluatedIndex` covered set, so the full filter is re-applied to them instead of trusting the index result.
  • Vector: stale fragments are removed from the index segments' coverage bitmaps, so the ANN prefilter blocks their (stale) rows; they are also added to the flat-KNN fallback so their current vectors are re-scored into the top-k.

This drops stale index hits and surfaces new matches the index never saw.
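As a rough illustration of the routing described above, the sketch below splits fragments between the index path and the flat path using plain sets of fragment ids. `QueryPlan` and `plan_fragments` are hypothetical names for this example; the actual change edits the `EvaluatedIndex` covered set and the vector segments' coverage bitmaps instead.

```rust
use std::collections::BTreeSet;

/// Hypothetical result type: which fragments trust the index and which are
/// re-evaluated by the flat path.
pub struct QueryPlan {
    pub index_fragments: BTreeSet<u64>,
    pub flat_fragments: BTreeSet<u64>,
}

pub fn plan_fragments(
    all_fragments: &BTreeSet<u64>,
    index_covered: &BTreeSet<u64>,
    stale: &BTreeSet<u64>,
) -> QueryPlan {
    // Stale fragments are dropped from the index's covered set...
    let index_fragments: BTreeSet<u64> =
        index_covered.difference(stale).copied().collect();
    // ...so they land on the flat path together with fragments the index
    // never covered.
    let flat_fragments: BTreeSet<u64> =
        all_fragments.difference(&index_fragments).copied().collect();
    QueryPlan { index_fragments, flat_fragments }
}
```

The point of the design is that a stale fragment needs no new code path: it is treated exactly like a fragment that was appended after the index was built.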

Scope note

Exclusion granularity is per fragment: a fragment with any qualifying overlay falls to the flat path in full. The spec describes row-level exclusion; per-fragment is correct and satisfies every acceptance criterion (it re-evaluates the complete predicate / re-scores the current vector for the affected rows), at the cost of flat-scanning the whole fragment rather than just its overlaid rows. Row-level granularity (preserving index benefit for non-overlaid rows within an overlaid fragment) is a natural follow-up.

Tests

`dataset::scanner::overlay_index_masking` (e2e) and `dataset::overlay` (unit):

  • BTree: stale-drop and new-match
  • overlay on an unrelated field excludes nothing
  • overlay with `committed_version <= index.dataset_version` is not excluded
  • NULL override
  • multi-fragment
  • vector index: stale row dropped, overlay-updated row re-scored back into the top-k
  • `overlay_exclusion_offsets` unit tests: version gate, field-awareness, sparse per-field coverage, multi-overlay union

🤖 Generated with Claude Code

wjones127 and others added 9 commits June 22, 2026 18:07
Add a specification for data overlay files: small files attached to a
fragment that supply new values for a subset of (row offset, field) cells
without rewriting the base data files, for cheap cell-level updates.

- protos/table.proto: rework DataOverlayFile with a dense/sparse coverage
  oneof (shared_offset_bitmap vs new FieldCoverage), rename read_version to
  committed_version (effective, commit-stamped), and document rank-based
  addressing with no offset column. Document reader feature flag 64.
- docs: add data_overlay_file.md (full spec, worked example, guidance stub)
  and link it from the table format overview.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add the `DataOverlay` operation (and `DataOverlayGroup`) to attach overlay
files to fragments without rewriting their base data. Mirrors the
`DataReplacement` batch shape, appends to each fragment's `overlays` list, and
documents permissive conflict semantics: concurrent overlays, appends, deletes,
and column rewrites are compatible; row-rewrites, compaction, and overlay->base
folds conflict.

`committed_version` is left 0 by the writer and stamped at commit time.

Proto only — Rust/Python bindings deferred.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The table/transaction proto changes generate new fields and an Operation
variant. This wires the minimum needed to compile without implementing overlay
support:

- Emit empty `overlays` when converting fragments to proto.
- Reject the `DataOverlay` transaction operation with NotSupported on read.

Datasets that use overlays set reader feature flag 64, which already falls in
the unknown-flag range rejected by `can_read_dataset`, so the library refuses
them at the feature-flag layer.
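A minimal sketch of how an unknown reader flag can cause a refusal, assuming flags are bits in a `u64` and the reader keeps a mask of the bits it understands. The `can_read` function and its `known_flags` parameter are hypothetical; this page does not show how `can_read_dataset` is implemented. Only the flag value 64 and the name `FLAG_DATA_OVERLAY_FILES` come from the commit messages.

```rust
/// Reader feature flag for data overlay files (value taken from this PR).
pub const FLAG_DATA_OVERLAY_FILES: u64 = 64;

/// Hypothetical check: a dataset is readable only if every reader flag it
/// sets is one the reader understands.
pub fn can_read(reader_flags: u64, known_flags: u64) -> bool {
    (reader_flags & !known_flags) == 0
}
```

Under this model, a reader whose known mask does not include bit 64 refuses any dataset that sets it, which matches the intent of refusing the dataset rather than returning stale base values.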

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds the in-memory + commit machinery for data overlay files (per the spec in
lance-format#7381), the foundation the scanner/take/index/compaction work builds on.

- `DataOverlayFile` / `OverlayCoverage` (dense `shared_offset_bitmap` and sparse
  per-field) with protobuf round-trip, attached to `Fragment.overlays`.
- Reader feature flag 64 (`FLAG_DATA_OVERLAY_FILES`): set whenever any fragment
  carries overlays, so a reader that does not understand them refuses the
  dataset instead of returning stale base values.
- `Operation::DataOverlay` transaction op: appends overlays to a fragment's
  list (preserving concurrently-written overlays) and stamps each overlay's
  `committed_version` to the new dataset version at commit time (re-stamped on
  retry). Conflict rules mirror DataReplacement — permissive against appends,
  deletes, column rewrites, index builds, and other overlays; conflicts only
  with row-rewriting compaction of the same fragment.
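The append-and-stamp behavior can be sketched as follows. `PendingOverlay` and `commit_overlays` are hypothetical names for this illustration and do not reflect the actual `Operation::DataOverlay` code; the sketch only assumes that the writer leaves `committed_version` at 0 and that each commit attempt stamps the version it is creating.

```rust
/// Hypothetical, simplified overlay record.
#[derive(Debug, Clone, PartialEq)]
pub struct PendingOverlay {
    pub path: String,
    /// Left at 0 by the writer; set when the transaction commits.
    pub committed_version: u64,
}

/// Appends `new_overlays` to a fragment's existing list, stamping each with
/// the dataset version this commit attempt creates. Existing entries,
/// including concurrently written ones, are left untouched.
pub fn commit_overlays(
    existing: &mut Vec<PendingOverlay>,
    new_overlays: &[PendingOverlay],
    new_dataset_version: u64,
) {
    for ov in new_overlays {
        let mut stamped = ov.clone();
        stamped.committed_version = new_dataset_version;
        existing.push(stamped);
    }
}
```

Because the input overlays are borrowed rather than mutated, a retry can call `commit_overlays` again against the refreshed fragment list with the next version number, which re-stamps the overlays.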

Scan-side merge, take, and end-to-end write+read tests follow in the same PR
branch.

Part of the Data Overlay Files feature (OSS-1322).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Now that overlays can be committed, a scan or take over a fragment that has
overlays would silently return stale base values, since the read-path merge is
not implemented yet. Refuse such reads at `FileFragment::open` with a clear
error instead of serving incorrect data. The guard is lifted once the
scan/take merge lands (rest of OSS-1322 / OSS-1324).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds `dataset::overlay`, the tested heart of reading overlays: given a base
column for a physical row range and the overlays covering a field (newest
first), `resolve_overlay_column` produces the merged column. An offset is
resolved to the newest covering overlay's value at the offset's rank in the
coverage bitmap; an uncovered offset falls through to the base; a covered
offset whose value is NULL overrides the cell to NULL. `overlay_indices_newest_first`
orders a fragment's overlays by `committed_version` then list position.
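The resolution rule above can be sketched with a sorted offset list standing in for the coverage bitmap. `FieldOverlay`, `resolve_cell`, and `resolve_column` are hypothetical simplifications over `Option<i64>` values; they are not the signature of `resolve_overlay_column`, which operates on real columns.

```rust
/// Hypothetical overlay for a single field.
pub struct FieldOverlay {
    /// Sorted, deduplicated row offsets this overlay covers.
    pub covered_offsets: Vec<u32>,
    /// One value per covered offset, in offset order. `None` is an explicit
    /// NULL override, not "missing".
    pub values: Vec<Option<i64>>,
}

/// Resolves one cell. `overlays_newest_first` must already be ordered
/// newest first.
pub fn resolve_cell(
    base: Option<i64>,
    offset: u32,
    overlays_newest_first: &[FieldOverlay],
) -> Option<i64> {
    for ov in overlays_newest_first {
        // The position of `offset` in the sorted coverage list is its rank,
        // which addresses the value column directly (no offset column).
        if let Ok(rank) = ov.covered_offsets.binary_search(&offset) {
            return ov.values[rank];
        }
    }
    // No overlay covers this offset: fall through to the base value.
    base
}

/// Resolves a column, given the physical offset of each base row.
pub fn resolve_column(
    base: &[Option<i64>],
    offsets: &[u32],
    overlays_newest_first: &[FieldOverlay],
) -> Vec<Option<i64>> {
    base.iter()
        .zip(offsets)
        .map(|(b, &off)| resolve_cell(*b, off, overlays_newest_first))
        .collect()
}
```

Returning `ov.values[rank]` even when it is `None` is what distinguishes a NULL override from fall-through: the loop stops at the first covering overlay regardless of the value it holds.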

Deletion precedence needs no handling here: the merge runs before the deletion
filter, so an overlay value for a deleted offset is computed and dropped with
the row. Wiring this into the scan stream and `take` follows on this branch.

Unit tests cover rank addressing, multi-overlay precedence, NULL override vs.
fall-through, physical-offset base, string columns, and ordering.

Part of OSS-1322.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The v2 file writer advanced every column from a single global row counter,
so a single file could only hold columns of equal length. Sparse data
overlay files need columns whose item counts differ within one file (each
field covers a different set of rows).

Add `FileWriter::write_columns`, which writes a set of `(field, array)`
pairs and advances each field's row counter independently, leaving other
fields untouched. A field never written ends up as a zero-length column.
`write_batch` is unchanged: it still advances all fields together, so
ordinary rectangular files round-trip exactly as before.
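The difference between the two write paths comes down to row-counter bookkeeping, which the toy model below illustrates. `ColumnCounters` is a hypothetical type that tracks only lengths; it is not the real `FileWriter`, and its method signatures are assumptions chosen to mirror the names `write_batch`, `write_columns`, and `column_num_rows` from the commit message.

```rust
use std::collections::BTreeMap;

/// Toy model of per-column row bookkeeping (lengths only, no data).
pub struct ColumnCounters {
    rows: BTreeMap<u32, u64>,
}

impl ColumnCounters {
    /// Every field in the schema starts as a zero-length column.
    pub fn new(field_ids: &[u32]) -> Self {
        Self { rows: field_ids.iter().map(|&id| (id, 0)).collect() }
    }

    /// Rectangular write: all fields advance together.
    pub fn write_batch(&mut self, num_rows: u64) {
        for count in self.rows.values_mut() {
            *count += num_rows;
        }
    }

    /// Sparse write: only the listed `(field_id, len)` pairs advance.
    pub fn write_columns(&mut self, columns: &[(u32, u64)]) {
        for &(field_id, len) in columns {
            *self.rows.entry(field_id).or_insert(0) += len;
        }
    }

    /// Length of one column, if the field is known.
    pub fn column_num_rows(&self, field_id: u32) -> Option<u64> {
        self.rows.get(&field_id).copied()
    }
}
```

In this model a field that is never passed to `write_columns` stays at length zero, while a file written only through `write_batch` keeps all columns at the same length.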

Per-column lengths were already derivable from page metadata; expose them
via `FileReader::column_num_rows`. The reader already schedules each column
from its own pages, so reading a column at its own length and random access
within it work without further changes.

Part of the Data Overlay Files feature (OSS-1323).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Wire the overlay cell-resolution core into reads so `take` (and scan)
return merged values. `FileFragment::open` loads each projected field's
overlay value columns newest-first, and `new_read_impl`/`read_ranges`
merge them into base batches by physical offset before deletion
filtering — so resolution is identical for take and scan, NULL overrides
apply, deletions take precedence (an overlay value for a deleted row is
dropped with the row), and fields resolve independently.
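The ordering of the two steps, merge first and deletion filter second, can be sketched as below. `read_rows` is a hypothetical function over a single already-resolved overlay map; it is not `new_read_impl` or `read_ranges`, and it assumes one field and row offsets that equal the base slice indices.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Merges overlay values into base rows by physical offset, then drops
/// deleted rows. Returns `(offset, value)` pairs for surviving rows.
pub fn read_rows(
    base: &[Option<i64>],
    overlay: &BTreeMap<usize, Option<i64>>,
    deleted: &BTreeSet<usize>,
) -> Vec<(usize, Option<i64>)> {
    base.iter()
        .enumerate()
        // Step 1: merge. A covered offset takes the overlay value (which may
        // be an explicit NULL); an uncovered offset keeps the base value.
        .map(|(off, b)| (off, overlay.get(&off).copied().unwrap_or(*b)))
        // Step 2: deletion filter. An overlay value computed for a deleted
        // offset is discarded together with its row.
        .filter(|(off, _)| !deleted.contains(off))
        .collect()
}
```

Because the filter runs last, the merge needs no knowledge of deletions, which is why an overlay on a deleted row is inert.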

The merge addresses each row by its physical offset (via
`ReadBatchParams::to_offsets_total`), so the resolution core now takes
explicit per-row offsets instead of a contiguous start — a single code
path for the contiguous scan range and arbitrary take indices. Sparse
per-field overlays read each field's value column independently, so
unequal-length columns (OSS-1323) are handled without materializing a
rectangular batch.

Removes the temporary "overlays not supported" guard from OSS-1322.

Tests: take of covered/uncovered offsets, multiple overlays
(newest-wins), per-field coverage with unequal-length columns, NULL
override, overlay on a deleted row (inert), and multi-fragment scan —
all over v2.0 and v2.1 files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A scalar or vector index built before an overlay does not reflect the
overlay's values, so its entries for overlay-covered cells may be stale.
Queries must stay correct while overlays remain.

For each index a query relies on, compute the set of fragments carrying an
overlay that was committed after the index (`committed_version >
index.dataset_version`) and that touches a field the index covers. The check
is field-aware (an overlay on an unrelated field excludes nothing) and
version-gated (an overlay already incorporated by the index is ignored), via
the new `overlay_exclusion_offsets` helper.

Such fragments are dropped from the index's covered set so they fall to the
flat path, which re-evaluates them against their current (overlay-merged)
values:

- Scalar (`FilteredReadExec`): stale fragments are removed from the
  `EvaluatedIndex` covered set, so the full filter is re-applied to them.
- Vector: stale fragments are removed from the index segments' coverage
  bitmaps (so the ANN prefilter blocks their stale rows) and added to the
  flat-KNN fallback so their current vectors are re-scored into the top-k.

This drops stale index hits and surfaces new matches the index never saw.
Granularity is per fragment; row-level exclusion is a future optimization.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the A-encoding (Encoding, IO, file reader/writer) and A-format (On-disk format: protos and format spec docs) labels Jun 23, 2026
@github-actions

Contributor

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

  • Start a vote following the Lance community voting process.
    Format specification modifications need 3 binding +1 votes (excluding the
    proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
  • Once the vote passes, link the completed vote in this PR. It should not be
    merged until the vote is linked.

@github-actions github-actions Bot added the enhancement (New feature or request) label Jun 23, 2026