1 change: 1 addition & 0 deletions docs/src/format/table/.pages
@@ -6,4 +6,5 @@ nav:
- Layout: layout.md
- Branch & Tag: branch_tag.md
- Row ID & Lineage: row_id_lineage.md
- Data Overlay Files: data_overlay_file.md
- MemTable & WAL: mem_wal.md
390 changes: 390 additions & 0 deletions docs/src/format/table/data_overlay_file.md

Large diffs are not rendered by default.

29 changes: 29 additions & 0 deletions docs/src/format/table/index.md
@@ -168,6 +168,35 @@ However, this invalidates row addresses and requires rebuilding indices, which c

</details>

## Data Overlay Files

!!! note "Overlay files require feature flag 64 (data overlay files)"

Overlay files supply new values for a subset of `(row offset, field)` cells within
a fragment without rewriting the base data files. They make updates cheap when only
a small percentage of rows and/or columns change: a writer appends a small file
carrying just the changed cells instead of rewriting whole columns or moving rows
to a new fragment.

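On read, each cell is resolved by consulting the fragment's overlays from newest to
oldest: the first overlay covering that `(offset, field)` wins; if none covers it, the
value falls through to the base data file. Existing indices keep covering the fragment
and reconcile overlays at query time through a field-aware exclusion set: cells covered
by overlays newer than the index are excluded from index results for the affected
fields and re-evaluated against their current values.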
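
The read-path rule above can be sketched as follows. This is a minimal illustration, not the Lance reader: the `Overlay` struct and `resolve_cell` function are hypothetical names, a `BTreeMap` stands in for the coverage bitmap plus value column of a single field, and the overlay slice is assumed to already be ordered oldest to newest.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory stand-in for one overlay restricted to one field:
/// maps each covered row offset to its new value. The real format stores a
/// Roaring bitmap of offsets and a separate value column instead.
struct Overlay {
    cells: BTreeMap<u32, i64>,
}

/// Resolve the value of one cell of a single field.
///
/// `overlays` is assumed to be ordered oldest to newest, so it is scanned in
/// reverse: the first overlay that covers `offset` wins. If no overlay covers
/// the offset, the value falls through to the base data.
fn resolve_cell(base: &[i64], overlays: &[Overlay], offset: u32) -> i64 {
    overlays
        .iter()
        .rev()
        .find_map(|overlay| overlay.cells.get(&offset).copied())
        .unwrap_or_else(|| base[offset as usize])
}
```

With a base column `[10, 20, 30, 40]`, an older overlay covering offsets 1 and 2, and a newer overlay covering only offset 2, offset 0 resolves from the base, offset 1 from the older overlay, and offset 2 from the newer one.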

For the full specification — coverage and resolution rules, dense vs. sparse layout,
versioning, index integration, compaction, and a worked example — see the
[Data Overlay Files Specification](data_overlay_file.md).

<details>
<summary>DataOverlayFile protobuf message</summary>

```protobuf
%%% proto.message.DataOverlayFile %%%
```

</details>


## Related Specifications

### Storage Layout
74 changes: 74 additions & 0 deletions protos/table.proto
@@ -113,6 +113,11 @@ message Manifest {
// * 2: row ids are stable and stored as part of the fragment metadata.
// * 4: use v2 format (deprecated)
// * 8: table config is present
// * 16: data files use multiple base paths (shallow clone / multi-base)
// * 32: the transaction file under _transactions is not written (inline only)
// * 64: data overlay files are present (see DataOverlayFile). Readers that do
// not understand overlays must refuse the dataset, since ignoring an overlay
// would silently return stale base values.
uint64 reader_feature_flags = 9;

// Feature flags for writers.
@@ -311,6 +316,15 @@ message DataFragment {

repeated DataFile files = 2;

// Optional overlay files for this fragment, which supply new values for a
// subset of cells without rewriting the base data files. This MUST be empty
// if the data overlay files feature flag (64) is not set in the manifest.
//
// Order is significant: a later entry is newer than an earlier one. When two
// overlays cover the same (offset, field) and share a `committed_version`, the
// later entry wins. See DataOverlayFile for the full resolution rules.
repeated DataOverlayFile overlays = 11;

// File that indicates which rows, if any, should be considered deleted.
DeletionFile deletion_file = 3;

@@ -433,6 +447,66 @@ message DataFile {
optional uint32 base_id = 7;
} // DataFile

// An overlay file supplies new values for a subset of (row offset, field) cells
// within a fragment, without rewriting the fragment's base data files. It is
// used for efficient updates when only a small fraction of rows and/or columns
// change.
//
// On read, a cell is resolved by consulting the fragment's overlays from newest
// to oldest: the first overlay that covers that (offset, field) wins; if none
// cover it, the value falls through to the base data file. Because deletions
// take precedence over overlays, an overlay value for an offset that is also
// marked deleted is dead and is ignored.
//
// The overlay's data file does NOT store a row-offset key column. Within a value
// column, the position of a covered offset's value is the rank (0-based count of
// set bits below it) of that offset within the field's coverage bitmap. Because
// fields may cover different offset sets, the value columns of a single overlay
// data file may have different lengths (which the Lance file format permits).
message DataOverlayFile {
// The data file storing the overlay's new cell values, one value column per
// field in `data_file.fields`. No row-offset key column is stored.
DataFile data_file = 1;

// Which (offset, field) cells this overlay provides values for.
oneof coverage {
// A single 32-bit Roaring bitmap of physical row offsets that applies to
// every field in `data_file.fields` (a "dense" / rectangular overlay).
// Every covered offset has a value for every field. This is the common case
// for a plain UPDATE, where one SET list is applied to one set of rows.
bytes shared_offset_bitmap = 2;
// Per-field coverage for a "sparse" overlay, used when different fields cover
// different offset sets (e.g. a MERGE with multiple WHEN MATCHED branches).
FieldCoverage field_coverage = 4;
}

// The dataset version at which this overlay became effective: the version of
// the commit that introduced it, NOT the version it was read from. It is
// stamped at commit time and re-stamped if the commit is retried, in the same
// way as the created-at / last-updated-at version sequences.
//
// This drives two orderings:
// * Versus index builds: an index whose `dataset_version` >= this value
// already incorporates this overlay. Otherwise the overlay's covered cells
// are excluded from index results for the affected fields and re-evaluated
// against their current values (see the Data Overlay Files specification).
// * Versus other overlays: when two overlays cover the same (offset, field),
// the one with the higher `committed_version` wins. Overlays that share a
// `committed_version` are ordered by their position in
// `DataFragment.overlays`, where a later entry is newer and wins.
uint64 committed_version = 3;
}
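
The two orderings driven by `committed_version` can be sketched in Rust as below. This is an illustrative reading of the comments above, not code from this change: `OverlayMeta`, `index_incorporates`, and `winner` are hypothetical names, and `position` is assumed to be the overlay's index within `DataFragment.overlays`.

```rust
/// Hypothetical subset of DataOverlayFile metadata needed for ordering.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct OverlayMeta {
    /// Dataset version of the commit that introduced the overlay.
    committed_version: u64,
    /// Index of the overlay within `DataFragment.overlays` (assumed here).
    position: usize,
}

/// Versus index builds: an index built at `index_dataset_version` already
/// incorporates the overlay when that version is >= `committed_version`.
/// Otherwise the overlay's covered cells must be excluded from index results
/// and re-evaluated against their current values.
fn index_incorporates(index_dataset_version: u64, overlay: &OverlayMeta) -> bool {
    index_dataset_version >= overlay.committed_version
}

/// Versus other overlays: among overlays covering the same (offset, field),
/// the highest `committed_version` wins, and ties are broken by the later
/// position in the fragment's overlay list.
fn winner(candidates: &[OverlayMeta]) -> Option<OverlayMeta> {
    candidates
        .iter()
        .copied()
        .max_by_key(|o| (o.committed_version, o.position))
}
```

Comparing on the tuple `(committed_version, position)` keeps both rules in one key, so the tie-break only matters when two overlays were stamped by the same commit.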

// Per-field coverage for a sparse overlay.
message FieldCoverage {
// One entry per field in the overlay's `data_file.fields`, in the same order.
// Each is a 32-bit Roaring bitmap of the physical row offsets covered for that
// field. An offset present in a field's bitmap but mapped to a NULL value
// means the cell is overridden to NULL (distinct from an offset that is absent,
// which falls through to the base data file).
repeated bytes offset_bitmaps = 1;
}
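
Because no row-offset key column is stored, a reader has to derive a value's position from the coverage bitmap. The sketch below shows that rank-based lookup under stated assumptions: a `BTreeSet<u32>` stands in for the 32-bit Roaring bitmap, `Option<i64>` models a nullable value column, and `value_position` and `lookup` are hypothetical helper names.

```rust
use std::collections::BTreeSet;

/// Position of `offset`'s value within a field's value column: the 0-based
/// count of covered offsets strictly below it (its rank in the coverage set).
/// Returns `None` when the offset is not covered by this overlay.
fn value_position(coverage: &BTreeSet<u32>, offset: u32) -> Option<usize> {
    if !coverage.contains(&offset) {
        return None;
    }
    Some(coverage.range(..offset).count())
}

/// Look up a cell in one overlay field.
///
/// * `None`          -> offset absent from coverage; fall through to the base.
/// * `Some(None)`    -> offset covered and overridden to NULL.
/// * `Some(Some(v))` -> offset covered with value `v`.
///
/// `values` is assumed to hold exactly one entry per covered offset, in
/// ascending offset order.
fn lookup(coverage: &BTreeSet<u32>, values: &[Option<i64>], offset: u32) -> Option<Option<i64>> {
    value_position(coverage, offset).map(|pos| values[pos])
}
```

For a field covering offsets `{3, 7, 42}`, the value column has three entries, and offset 42 reads entry 2. The nested `Option` keeps "covered but NULL" distinct from "not covered".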

// Deletion File
//
// The path of the deletion file is constructed as:
39 changes: 39 additions & 0 deletions protos/transaction.proto
@@ -315,6 +315,44 @@ message Transaction {
repeated DataReplacementGroup replacements = 1;
}

// Overlay files to append to a single fragment, in order (the last entry is
// newest). The overlays are appended to the fragment's existing `overlays`
// list; they do not replace it, so overlays written by concurrent commits are
// preserved.
message DataOverlayGroup {
uint64 fragment_id = 1;
// Each DataOverlayFile.committed_version is left 0 by the writer and stamped
// to the new dataset version at commit time (re-stamped on retry), in the
// same way as the created-at / last-updated-at version sequences. The fields
// touched are read from each overlay's `data_file.fields`.
repeated DataOverlayFile overlays = 2;
}

// Attach overlay files to fragments, supplying new values for a subset of
// (row offset, field) cells without rewriting the fragments' base data files.
// See the DataOverlayFile message in table.proto and the Data Overlay Files
// specification for resolution, coverage, and versioning rules.
//
// Conflict semantics are intentionally permissive, like DataReplacement. Against
// a concurrent operation that touches one of the same fragments:
// * Another DataOverlay (any fields): COMPATIBLE. Overlays stack; when two
// overlays cover the same (offset, field) the one with the higher
// `committed_version` wins, so independent backfills never conflict.
// * Append / new fragments: COMPATIBLE.
// * Delete: COMPATIBLE. A deletion takes precedence over an overlay, so an
// overlay value for a deleted offset is inert (no special handling needed).
// * DataReplacement or column-rewrite (Update with REWRITE_COLUMNS) of the
// same field: COMPATIBLE. Both preserve physical row addresses, so overlay
// offsets stay valid; the overlay is newer and wins its covered cells, and
// the version gate excludes those cells from any rebuilt index.
// * Row-rewrite, compaction, or an overlay->base fold of the fragment:
// CONFLICT. These change physical row addresses or consume the overlays, so
// the overlay's offsets are no longer valid. The writer must re-read the new
// fragment, recompute, and retry.
message DataOverlay {
repeated DataOverlayGroup groups = 1;
}

// Update the merged generations in MemWAL index.
// This operation is used during merge-insert to atomically record which
// generations have been merged to the base table.
@@ -346,6 +384,7 @@ message Transaction {
UpdateMemWalState update_mem_wal_state = 112;
Clone clone = 113;
UpdateBases update_bases = 114;
DataOverlay data_overlay = 115;
}

// Fields 200/202 (`blob_append` / `blob_overwrite`) previously represented blob dataset ops.
15 changes: 15 additions & 0 deletions rust/lance-file/src/reader.rs
@@ -491,6 +491,21 @@ impl FileReader {
self.num_rows
}

/// The number of rows stored in a single physical column.
///
/// For ordinary (rectangular) files every column has the same length, equal
/// to [`num_rows`](Self::num_rows). Files written with
/// [`FileWriter::write_columns`](crate::writer::FileWriter::write_columns)
/// may have columns of differing lengths; this returns the length of one
/// such column, derived by summing its pages' row counts. Returns `None` if
/// `column_index` is out of bounds.
pub fn column_num_rows(&self, column_index: usize) -> Option<u64> {
self.metadata
.column_metadatas
.get(column_index)
.map(|col| col.pages.iter().map(|page| page.length).sum())
}

pub fn metadata(&self) -> &Arc<CachedFileMetadata> {
&self.metadata
}