diff --git a/docs/src/format/table/.pages b/docs/src/format/table/.pages
index 16c20058608..5b0cb0e95e6 100644
--- a/docs/src/format/table/.pages
+++ b/docs/src/format/table/.pages
@@ -6,4 +6,5 @@ nav:
- Layout: layout.md
- Branch & Tag: branch_tag.md
- Row ID & Lineage: row_id_lineage.md
+ - Data Overlay Files: data_overlay_file.md
- MemTable & WAL: mem_wal.md
diff --git a/docs/src/format/table/data_overlay_file.md b/docs/src/format/table/data_overlay_file.md
new file mode 100644
index 00000000000..5540f860018
--- /dev/null
+++ b/docs/src/format/table/data_overlay_file.md
@@ -0,0 +1,390 @@
+# Data Overlay Files
+
+!!! note "Overlay files require feature flag 64 (data overlay files)"
+
+ A reader or writer that does not understand overlay files must refuse a
+ dataset that uses them. Silently ignoring an overlay would return stale base
+ values, which is a correctness bug rather than a degraded experience.
+
+Overlay files supply new values for a subset of `(row offset, field)` cells
+within a fragment **without rewriting the fragment's base data files**. They make
+updates cheap when only a small fraction of rows and/or columns change: instead
+of rewriting whole columns or moving rows to a new fragment, a writer appends a
+small file carrying just the changed cells.
+
+This is Lance's third mechanism for changing data in place, alongside
+[deletion files](index.md#deletion-files) (which remove rows) and
+[data evolution](index.md#data-evolution) (which adds or rewrites whole columns).
+An overlay changes individual cells.
+
+## Concepts
+
+### Coverage and resolution
+
+Each overlay declares which cells it provides through a **coverage** bitmap (or,
+for sparse overlays, one bitmap per field). The bitmaps index **physical row
+offsets** — positions in the base data files, counting deleted rows — so they are
+stable across deletions, exactly like deletion vectors.
+
+To resolve a cell `(offset, field)` on read, walk the fragment's overlays from
+**newest to oldest**. The first overlay that covers `(offset, field)` wins; its
+value is used. If no overlay covers the cell, the value falls through to the base
+data file (or is `NULL` if no base data file holds that field).
+
+Precedence among overlays is determined by:
+
+1. `committed_version` — higher wins (see [Versioning](#versioning-and-ordering)).
+2. Position in `DataFragment.overlays` as a tiebreaker — a later entry is newer.
+
+A covered offset whose value is `NULL` overrides the cell **to** `NULL`. This is
+distinct from an offset that is simply absent from the bitmap, which falls
+through to the base. Coverage, not value-nullness, decides whether an overlay
+applies.
+
+### Interaction with deletions
+
+Deletions take precedence over overlays. If a row offset is marked deleted in the
+fragment's deletion file, any overlay value for that offset is dead and is
+ignored, regardless of commit order. This keeps the invariant simple: a deletion
+is the final word on a row, so a concurrent overlay against a row that was
+deleted needs no special conflict handling — its values are merely inert.
+
+### Physical layout
+
+An overlay's data file stores **one value column per field**, in the order of
+`data_file.fields`. It does **not** store a row-offset key column. The position of
+a covered offset's value within its column is the **rank** of that offset in the
+field's coverage bitmap — the number of set bits below it. For a Roaring bitmap
+this is an O(1) operation, so random access to any cell is a rank computation
+followed by a single value fetch, with no offset column to read and no binary
+search.
+
+Because different fields may cover different offset sets, the value columns of a
+single sparse overlay may have **different lengths**. The Lance file format
+permits columns of differing item counts within one file, so a sparse overlay is
+representable as a single file. (See [Writer support](#writer-support) for the
+current implementation status.)
+
+### Dense vs. sparse overlays
+
+A single overlay is one of two shapes:
+
+- **Dense (rectangular).** One `shared_offset_bitmap` applies to every field. Every
+ covered offset has a value for every field. This is the common case for a plain
+ `UPDATE`, where one `SET` list is applied to one set of rows.
+- **Sparse.** A `FieldCoverage` carries one bitmap per field, used when different
+ fields cover different offset sets — for example a `MERGE` with multiple
+ `WHEN MATCHED` branches, where different rows update different columns. A dense
+ overlay would have to widen to the bounding rectangle and fill the untouched
+ cells with their current values (post-images), which for wide columns such as
+ embeddings means re-storing data that did not change. A sparse overlay stores
+ exactly the changed cells.
+
+A writer may always express a non-rectangular update as **multiple dense overlays
+in one transaction** (one per coverage group) instead of a single sparse overlay.
+
+## Protobuf
+
+
+DataOverlayFile protobuf message
+
+```protobuf
+%%% proto.message.DataOverlayFile %%%
+```
+
+
+
+
+FieldCoverage protobuf message
+
+```protobuf
+%%% proto.message.FieldCoverage %%%
+```
+
+
+
+## Versioning and ordering
+
+Overlays reuse the dataset version as their ordering clock rather than
+introducing a separate generation counter.
+
+`committed_version` is the dataset version at which an overlay **became
+effective** — the version of the commit that introduced it, **not** the version
+it was read from. It is stamped at commit time and re-stamped if the commit is
+retried, in the same way as the created-at / last-updated-at version sequences.
+
+This single value drives every ordering decision:
+
+- **Overlay vs. overlay** (read precedence): higher `committed_version` wins.
+- **Overlay vs. index** (query correctness): an index records the
+ `dataset_version` it was built from. An index whose `dataset_version >=
+ committed_version` already incorporates the overlay. An overlay whose
+ `committed_version > index.dataset_version` is newer than the index and its
+ cells must be excluded from index results and re-evaluated.
+- **Scheduler signal**: the gap between an overlay's `committed_version` and an
+ index's `dataset_version`, or between an overlay and the base, is a staleness
+ measure the compaction scheduler can use.
+
+!!! note "Why effective version, not read version"
+
+ Suppose an overlay reads version 5 and commits at version 6, while an index
+ is built reading version 5 (before the overlay) and commits at version 7 with
+ `dataset_version = 5`. If the overlay stored its *read* version (5), the test
+ `5 > 5` is false, the row would not be excluded, and the index — which never
+ saw the overlay — would return a stale result. Storing the *effective*
+ version (6) makes `6 > 5` true, the cell is excluded and re-evaluated, and the
+ result is correct.
+
+## Index integration
+
+Building an index over a fragment that has overlays does **not** require dropping
+the fragment from the index's coverage. The fragment stays indexed, and the query
+path reconciles overlays at query time using an **exclusion set**.
+
+The exclusion set for an index on field `F` is the union of the coverage bitmaps,
+restricted to field `F`, of every overlay whose `committed_version >
+index.dataset_version`. The exclusion is **field-aware**: an overlay that touches
+only unrelated columns does not exclude anything from the index on `F`.
+
+The query then proceeds as:
+
+1. Run the index search as usual, producing candidate rows.
+2. Remove any candidate in the exclusion set. (Its indexed value may be stale.)
+3. **Re-evaluate** the excluded rows against their current values — the same flat
+ path already used for the unindexed tail of fragments. For a scalar predicate
+ this re-applies the filter; for a vector query it re-scores the row's current
+ vector. Rows that still match are added back to the result.
+
+Step 3 is what makes exclusion correct rather than merely safe: removing a row
+from index candidates without re-evaluating it would silently drop a row that
+should match under its new value.
+
+### Correctness invariant
+
+> For every indexed field `F` and every row offset `o` in a fragment the index
+> covers, the index's entry for `(o, F)` is trusted unless `o` is excluded.
+> `o` is excluded iff some overlay with `committed_version > index.dataset_version`
+> covers `(o, F)`.
+
+The write and compaction paths together preserve this:
+
+- **Writes** change a cell only by adding an overlay, and that overlay's
+ `committed_version` exceeds the version of any pre-existing index — so the
+ change is always covered by an exclusion.
+- **Compaction** may remove an overlay only if the index no longer relies on it
+ (see below).
+
+## Compaction
+
+Overlays accumulate read cost — every overlay is a bitmap to test and a possible
+file to open. Compaction bounds that cost in two modes:
+
+- **Overlay → overlay.** Merge several overlays into fewer, computing the
+ post-image per `(offset, field)` by walking the merged overlays newest-first.
+ The merged overlay takes the **maximum** `committed_version` of its inputs, so
+ the exclusion semantics are preserved. Indexes are unaffected. This is cheap and
+ does not touch the base.
+- **Overlay → base.** Fold overlays into a fresh base data file, computing the
+ post-image for every covered cell, then clear the overlays. The base is
+ complete, so every post-image is well defined. Overlay offsets are physical, so
+ they cannot survive a rewrite that reorders rows; folding therefore materializes
+ values rather than carrying overlays forward.
+
+!!! warning "Folding an indexed field must update its index"
+
+ An overlay→base fold removes the overlay, which removes the exclusion signal
+ that kept an index correct. Folding an overlay that covers an indexed field
+ `F` is therefore equivalent to a column rewrite of `F` and must, in the same
+ commit, either rebuild the index to a `dataset_version` at least the folded
+ overlay's `committed_version`, or remove the fragment from the index's
+ coverage so the rows fall to the flat path. Otherwise the index would serve
+ stale values with no overlay to exclude them. This is the same rule that
+ already governs rewriting a column that an index is built on.
+
+When a fragment with overlays is compacted by a row-rewriting operation
+(`RewriteRows`, which produces new fragments with new row addresses), the
+overlays are folded into the new base as part of the rewrite, and existing
+[fragment-reuse remapping](row_id_lineage.md) handles the row-address changes as
+it does today.
+
+## Row lineage
+
+An overlay write updates the `last_updated_at_version` of every covered row, so
+change-data-feed and time-travel queries observe the update. Because overlays are
+addressed by physical offset, they do **not** require stable row IDs to be
+enabled; lineage updates apply only when those features are on.
+
+## Worked example
+
+A table `users` with stable row IDs enabled and these fields:
+
+| field id | name | type |
+|----------|-----------|-------------------------|
+| 1 | id | `int32` (primary key) |
+| 2 | name | `utf8` |
+| 3 | age | `int32` |
+| 4 | embedding | `fixed_size_list`|
+
+Created at version 1 as a single fragment `0` with one base data file
+`data/file0.lance` holding all four columns. `physical_rows = 4`:
+
+| offset | id | name | age | embedding |
+|--------|----|-------|-----|------------------|
+| 0 | 1 | Alice | 30 | … |
+| 1 | 2 | Bob | 25 | … |
+| 2 | 3 | Carol | 40 | … |
+| 3 | 4 | Dave | 22 | … |
+
+A BTree scalar index on `age` is built at version 1, covering fragment `0`
+(`dataset_version = 1`).
+
+### Step 1 — write an overlay
+
+```sql
+UPDATE users SET age = 26 WHERE id = 2; -- Bob, offset 1
+```
+
+This touches one field (`age`) for one row, so the writer emits a dense overlay
+and commits it as version 2. Fragment `0` gains:
+
+```text
+DataOverlayFile {
+ data_file: { path: "data/overlay-.lance", fields: [3], column_indices: [0] }
+ coverage: shared_offset_bitmap = {1}
+ committed_version: 2
+}
+```
+
+The overlay file stores a single `age` column with one value, `[26]`, at
+rank `{1}.rank(1) = 0`. `last_updated_at_version[1]` is set to 2.
+
+### Step 2 — read
+
+`SELECT id, age FROM users` reads base ages `[30, 25, 40, 22]`. For `age`
+(field 3), the overlay covers offset 1, so `age[1]` is replaced with the overlay
+value at position `{1}.rank(1) = 0` → `26`. Result ages: `[30, 26, 40, 22]`.
+
+### Step 3 — index query
+
+```sql
+SELECT * FROM users WHERE age = 26;
+```
+
+The `age` index was built at `dataset_version = 1`; the overlay's
+`committed_version` is 2. Since `2 > 1`, the overlay's coverage for `age`, `{1}`,
+is the exclusion set for this query.
+
+- The index (built at v1) holds Bob's *old* `age = 25`, so a lookup for `26`
+ returns nothing from the index.
+- Offset 1 is in the exclusion set, so it is re-evaluated on the flat path. Its
+ current `age` (26, via the overlay) matches `age = 26`, so Bob is returned.
+
+The mirror case `WHERE age = 25` shows exclusion preventing a stale hit: the index
+returns offset 1 (stale `25`), but offset 1 is excluded, re-evaluated to `26`, and
+correctly dropped.
+
+### Step 4 — a second, non-rectangular write
+
+```sql
+MERGE INTO users USING staged ON users.id = staged.id
+WHEN MATCHED AND staged.kind = 'rename' THEN UPDATE SET name = staged.name -- Carol(2), Dave(3)
+WHEN MATCHED AND staged.kind = 'revec' THEN UPDATE SET embedding = staged.embedding -- Bob(1)
+```
+
+`name` is updated for offsets `{2, 3}` and `embedding` for offset `{1}` — different
+fields over different rows. This is a sparse overlay, committed as version 3:
+
+```text
+DataOverlayFile {
+ data_file: { path: "data/overlay-.lance", fields: [2, 4], column_indices: [0, 1] }
+ coverage: field_coverage { offset_bitmaps: [ {2,3}, {1} ] }
+ // name (field 2) ^ ^ embedding (field 4)
+ committed_version: 3
+}
+```
+
+The file's `name` column has **two** values (`["Caroline", "David"]`, at
+ranks 0 and 1 of `{2,3}`) and its `embedding` column has **one** value (at rank 0
+of `{1}`) — columns of different lengths in one file.
+
+### Step 5 — read after the second write
+
+`SELECT name, age, embedding FROM users` resolves each field independently,
+newest overlay first:
+
+- `name`: the v3 overlay covers `{2,3}` → `["Alice", "Bob", "Caroline", "David"]`.
+- `age`: the v3 overlay does not cover `age`; the v2 overlay still applies at
+ offset 1 → `[30, 26, 40, 22]`.
+- `embedding`: the v3 overlay covers `{1}` → Bob's vector is the new one, others
+ from base.
+
+Overlays from different versions coexist and apply per field.
+
+### Step 6 — compaction (overlay → base)
+
+The scheduler folds both overlays into fragment `0` at version 4, computing
+post-images for `age`, `name`, and `embedding`, and writing a new base data file
+`data/file1.lance` with those columns. In the old file, fields 2, 3, and 4 are
+tombstoned (`-2`); field 1 (`id`) remains. The fragment's `overlays` list is
+cleared. Row addresses are preserved (a column rewrite, not a row rewrite), so
+stable row IDs and the deletion vector are untouched.
+
+Because the fold removed the overlay that was excluding offset 1 from the `age`
+index, the same commit must reconcile that index: either rebuild it at
+`dataset_version >= 2`, or drop fragment `0` from its coverage so `age` queries
+fall to the flat path. After a rebuild at version 4, no overlay remains and the
+`age` index directly returns `26` for Bob with no exclusion needed.
+
+## Guidance
+
+!!! note "This section is a stub."
+
+ The following are implementation considerations, not part of the on-disk
+ specification.
+
+### When to overlay vs. rewrite a column vs. move rows
+
+*(To be expanded.)* The choice between appending an overlay, rewriting a full
+column (data evolution), and moving updated rows to a new fragment depends on the
+fraction of rows changed, the fraction of columns changed, column width, the
+presence of indexes on the changed columns, and the accumulated overlay read
+cost. Roughly: few rows changed favors overlays; most rows in a few columns
+favors a column rewrite; most columns changed favors moving rows to a new
+fragment.
+
+### Writer support
+
+*(To be expanded.)* Dense (rectangular) overlays write with the existing
+equal-length file writer today. Sparse overlays stored as a **single** file
+require the writer to emit columns of independent lengths, which the current v2
+writer does not yet do (it advances all columns from one global row counter).
+Until that support lands, a writer can express a sparse update as multiple dense
+overlays in one transaction.
+
+### Scheduling compaction
+
+*(To be expanded.)* The overlay→overlay and overlay→base modes have very
+different costs; a cost/benefit scheduler decides when each is worthwhile, using
+the version gap as a staleness signal.
+
+### Open questions
+
+*(To be resolved.)*
+
+- **Per-fragment vs. per-table overlays.** Overlays are attached per fragment.
+ Should there be a table-level overlay concept, and how would it interact with
+ fragment-level row addressing?
+- **Relationship to LSM.** Overlays plus compaction resemble an LSM tree (newest
+ layer wins, periodic merge). How far should that analogy be taken, and what do
+ we deliberately do differently given Lance's random-access requirements?
+- **Coverage bitmap spill.** Coverage bitmaps live inline in the manifest. Very
+ large coverage (an overlay touching many rows) may warrant external spill, as
+ the row-ID and last-updated-at sequences already do above a size threshold.
+
+## Related specifications
+
+- [Table format overview](index.md)
+- [Transactions](transaction.md)
+- [Row ID & Lineage](row_id_lineage.md)
+- [Index Formats](../index/index.md)
+- [Format Versioning](versioning.md)
diff --git a/docs/src/format/table/index.md b/docs/src/format/table/index.md
index 94ea4b90dc9..ce4d0b26613 100644
--- a/docs/src/format/table/index.md
+++ b/docs/src/format/table/index.md
@@ -168,6 +168,35 @@ However, this invalidates row addresses and requires rebuilding indices, which c
+## Data Overlay Files
+
+!!! note "Overlay files require feature flag 64 (data overlay files)"
+
+Overlay files supply new values for a subset of `(row offset, field)` cells within
+a fragment without rewriting the base data files. They make updates cheap when only
+a small percentage of rows and/or columns change: a writer appends a small file
+carrying just the changed cells instead of rewriting whole columns or moving rows
+to a new fragment.
+
+On read, each cell is resolved by consulting the fragment's overlays from newest to
+oldest; the first overlay covering that `(offset, field)` wins, otherwise the value
+falls through to the base data file. Indices keep covering the fragment and reconcile
+overlays at query time through a field-aware exclusion set.
+
+For the full specification — coverage and resolution rules, dense vs. sparse layout,
+versioning, index integration, compaction, and a worked example — see the
+[Data Overlay Files Specification](data_overlay_file.md).
+
+
+DataOverlayFile protobuf message
+
+```protobuf
+%%% proto.message.DataOverlayFile %%%
+```
+
+
+
+
## Related Specifications
### Storage Layout
diff --git a/protos/table.proto b/protos/table.proto
index d298809d5d8..cc8b477a6a6 100644
--- a/protos/table.proto
+++ b/protos/table.proto
@@ -113,6 +113,11 @@ message Manifest {
// * 2: row ids are stable and stored as part of the fragment metadata.
// * 4: use v2 format (deprecated)
// * 8: table config is present
+ // * 16: data files use multiple base paths (shallow clone / multi-base)
+ // * 32: the transaction file under _transactions is not written (inline only)
+ // * 64: data overlay files are present (see DataOverlayFile). Readers that do
+ // not understand overlays must refuse the dataset, since ignoring an overlay
+ // would silently return stale base values.
uint64 reader_feature_flags = 9;
// Feature flags for writers.
@@ -311,6 +316,15 @@ message DataFragment {
repeated DataFile files = 2;
+ // Optional overlay files for this fragment, which supply new values for a
+ // subset of cells without rewriting the base data files. This MUST be empty
+ // if the data overlay files feature flag (64) is not set in the manifest.
+ //
+ // Order is significant: a later entry is newer than an earlier one. When two
+ // overlays cover the same (offset, field) and share a `committed_version`, the
+ // later entry wins. See DataOverlayFile for the full resolution rules.
+ repeated DataOverlayFile overlays = 11;
+
// File that indicates which rows, if any, should be considered deleted.
DeletionFile deletion_file = 3;
@@ -433,6 +447,66 @@ message DataFile {
optional uint32 base_id = 7;
} // DataFile
+// An overlay file supplies new values for a subset of (row offset, field) cells
+// within a fragment, without rewriting the fragment's base data files. It is
+// used for efficient updates when only a small fraction of rows and/or columns
+// change.
+//
+// On read, a cell is resolved by consulting the fragment's overlays from newest
+// to oldest: the first overlay that covers that (offset, field) wins; if none
+// cover it, the value falls through to the base data file. Because deletions
+// take precedence over overlays, an overlay value for an offset that is also
+// marked deleted is dead and is ignored.
+//
+// The overlay's data file does NOT store a row-offset key column. Within a value
+// column, the position of a covered offset's value is the rank (0-based count of
+// set bits below it) of that offset within the field's coverage bitmap. Because
+// fields may cover different offset sets, the value columns of a single overlay
+// data file may have different lengths (which the Lance file format permits).
+message DataOverlayFile {
+ // The data file storing the overlay's new cell values, one value column per
+ // field in `data_file.fields`. No row-offset key column is stored.
+ DataFile data_file = 1;
+
+ // Which (offset, field) cells this overlay provides values for.
+ oneof coverage {
+ // A single 32-bit Roaring bitmap of physical row offsets that applies to
+ // every field in `data_file.fields` (a "dense" / rectangular overlay).
+ // Every covered offset has a value for every field. This is the common case
+ // for a plain UPDATE, where one SET list is applied to one set of rows.
+ bytes shared_offset_bitmap = 2;
+ // Per-field coverage for a "sparse" overlay, used when different fields cover
+ // different offset sets (e.g. a MERGE with multiple WHEN MATCHED branches).
+ FieldCoverage field_coverage = 4;
+ }
+
+ // The dataset version at which this overlay became effective: the version of
+ // the commit that introduced it, NOT the version it was read from. It is
+ // stamped at commit time and re-stamped if the commit is retried, in the same
+ // way as the created-at / last-updated-at version sequences.
+ //
+ // This drives two orderings:
+ // * Versus index builds: an index whose `dataset_version` >= this value
+ // already incorporates this overlay. Otherwise the overlay's covered cells
+ // are excluded from index results for the affected fields and re-evaluated
+ // against their current values (see the Data Overlay Files specification).
+ // * Versus other overlays: when two overlays cover the same (offset, field),
+ // the one with the higher `committed_version` wins. Overlays that share a
+ // `committed_version` are ordered by their position in
+ // `DataFragment.overlays`, where a later entry is newer and wins.
+ uint64 committed_version = 3;
+}
+
+// Per-field coverage for a sparse overlay.
+message FieldCoverage {
+ // One entry per field in the overlay's `data_file.fields`, in the same order.
+ // Each is a 32-bit Roaring bitmap of the physical row offsets covered for that
+ // field. An offset present in a field's bitmap but mapped to a NULL value
+ // means the cell is overridden to NULL (distinct from an offset that is absent,
+ // which falls through to the base data file).
+ repeated bytes offset_bitmaps = 1;
+}
+
// Deletion File
//
// The path of the deletion file is constructed as:
diff --git a/protos/transaction.proto b/protos/transaction.proto
index e72e95025a4..bfc0eee354b 100644
--- a/protos/transaction.proto
+++ b/protos/transaction.proto
@@ -315,6 +315,44 @@ message Transaction {
repeated DataReplacementGroup replacements = 1;
}
+ // Overlay files to append to a single fragment, in order (the last entry is
+ // newest). The overlays are appended to the fragment's existing `overlays`
+ // list; they do not replace it, so overlays written by concurrent commits are
+ // preserved.
+ message DataOverlayGroup {
+ uint64 fragment_id = 1;
+ // Each DataOverlayFile.committed_version is left 0 by the writer and stamped
+ // to the new dataset version at commit time (re-stamped on retry), in the
+ // same way as the created-at / last-updated-at version sequences. The fields
+ // touched are read from each overlay's `data_file.fields`.
+ repeated DataOverlayFile overlays = 2;
+ }
+
+ // Attach overlay files to fragments, supplying new values for a subset of
+ // (row offset, field) cells without rewriting the fragments' base data files.
+ // See the DataOverlayFile message in table.proto and the Data Overlay Files
+ // specification for resolution, coverage, and versioning rules.
+ //
+ // Conflict semantics (intentionally permissive, like DataReplacement). Against
+ // a concurrent operation that touches one of the same fragments:
+ // * Another DataOverlay (any fields): COMPATIBLE. Overlays stack; when two
+ // overlays cover the same (offset, field) the one with the higher
+ // `committed_version` wins, so independent backfills never conflict.
+ // * Append / new fragments: COMPATIBLE.
+ // * Delete: COMPATIBLE. A deletion takes precedence over an overlay, so an
+ // overlay value for a deleted offset is inert (no special handling needed).
+ // * DataReplacement or column-rewrite (Update with REWRITE_COLUMNS) of the
+ // same field: COMPATIBLE. Both preserve physical row addresses, so overlay
+ // offsets stay valid; the overlay is newer and wins its covered cells, and
+ // the version gate excludes those cells from any rebuilt index.
+ // * Row-rewrite, compaction, or an overlay->base fold of the fragment:
+ // CONFLICT. These change physical row addresses or consume the overlays, so
+ // the overlay's offsets are no longer valid. The writer must re-read the new
+ // fragment, recompute, and retry.
+ message DataOverlay {
+ repeated DataOverlayGroup groups = 1;
+ }
+
// Update the merged generations in MemWAL index.
// This operation is used during merge-insert to atomically record which
// generations have been merged to the base table.
@@ -346,6 +384,7 @@ message Transaction {
UpdateMemWalState update_mem_wal_state = 112;
Clone clone = 113;
UpdateBases update_bases = 114;
+ DataOverlay data_overlay = 115;
}
// Fields 200/202 (`blob_append` / `blob_overwrite`) previously represented blob dataset ops.
diff --git a/rust/lance-file/src/reader.rs b/rust/lance-file/src/reader.rs
index c454f73819e..45d50541879 100644
--- a/rust/lance-file/src/reader.rs
+++ b/rust/lance-file/src/reader.rs
@@ -491,6 +491,21 @@ impl FileReader {
self.num_rows
}
+ /// The number of rows stored in a single physical column.
+ ///
+ /// For ordinary (rectangular) files every column has the same length, equal
+ /// to [`num_rows`](Self::num_rows). Files written with
+ /// [`FileWriter::write_columns`](crate::writer::FileWriter::write_columns)
+ /// may have columns of differing lengths; this returns the length of one
+ /// such column, derived by summing its pages' row counts. Returns `None` if
+ /// `column_index` is out of bounds.
+ pub fn column_num_rows(&self, column_index: usize) -> Option {
+ self.metadata
+ .column_metadatas
+ .get(column_index)
+ .map(|col| col.pages.iter().map(|page| page.length).sum())
+ }
+
pub fn metadata(&self) -> &Arc {
&self.metadata
}
diff --git a/rust/lance-file/src/writer.rs b/rust/lance-file/src/writer.rs
index 12bd50df6fe..63ff7a95314 100644
--- a/rust/lance-file/src/writer.rs
+++ b/rust/lance-file/src/writer.rs
@@ -6,7 +6,7 @@ use std::collections::HashMap;
use std::sync::Arc;
use std::sync::atomic::AtomicBool;
-use arrow_array::RecordBatch;
+use arrow_array::{ArrayRef, RecordBatch};
use arrow_data::ArrayData;
use bytes::{Buf, BufMut, Bytes, BytesMut};
@@ -221,6 +221,11 @@ pub struct FileWriter {
field_id_to_column_indices: Vec<(u32, u32)>,
num_columns: u32,
rows_written: u64,
+ // The number of rows written for each top-level field (i.e. each entry in
+ // `column_writers`). With `write_batch` every field advances together and
+ // these are all equal, but `write_columns` advances fields independently, so
+ // a single file may end up with columns of differing item counts.
+ field_rows_written: Vec,
global_buffers: Vec<(u64, u64)>,
schema_metadata: HashMap,
options: FileWriterOptions,
@@ -277,6 +282,7 @@ impl FileWriter {
column_metadata: Vec::new(),
num_columns: 0,
rows_written: 0,
+ field_rows_written: Vec::new(),
field_id_to_column_indices: Vec::new(),
global_buffers: Vec::new(),
schema_metadata: HashMap::new(),
@@ -467,6 +473,7 @@ impl FileWriter {
BatchEncoder::try_new(&schema, encoding_strategy.as_ref(), &encoding_options)?;
self.num_columns = encoder.num_columns();
+ self.field_rows_written = vec![0; encoder.field_encoders.len()];
self.column_writers = encoder.field_encoders;
self.column_metadata = vec![initial_column_metadata(); self.num_columns as usize];
self.field_id_to_column_indices = encoder.field_id_to_column_index;
@@ -490,13 +497,14 @@ impl FileWriter {
batch: &RecordBatch,
external_buffers: &mut OutOfLineBuffers,
) -> Result>> {
- self.schema
+ let items = self
+ .schema
.as_ref()
.unwrap()
.fields
.iter()
- .zip(self.column_writers.iter_mut())
- .map(|(field, column_writer)| {
+ .enumerate()
+ .map(|(field_idx, field)| {
let array =
batch
.column_by_name(&field.name)
@@ -507,19 +515,53 @@ impl FileWriter {
)
.into(),
))?;
+ Ok((field_idx, array.clone()))
+ })
+ .collect::>>()?;
+ self.encode_columns(&items, external_buffers)
+ }
+
+ /// Encode a set of `(field index, array)` pairs, each advancing only its own
+ /// column. The returned tasks must be written before the per-field row
+ /// counters are advanced (see `advance_columns`).
+ fn encode_columns(
+ &mut self,
+ items: &[(usize, ArrayRef)],
+ external_buffers: &mut OutOfLineBuffers,
+ ) -> Result>> {
+ // Snapshot the starting row number of each field before borrowing the
+ // column writers mutably below.
+ let row_numbers = items
+ .iter()
+ .map(|(field_idx, _)| self.field_rows_written[*field_idx])
+ .collect::>();
+ items
+ .iter()
+ .zip(row_numbers)
+ .map(|((field_idx, array), row_number)| {
let repdef = RepDefBuilder::default();
let num_rows = array.len() as u64;
- column_writer.maybe_encode(
+ self.column_writers[*field_idx].maybe_encode(
array.clone(),
external_buffers,
repdef,
- self.rows_written,
+ row_number,
num_rows,
)
})
.collect::>>()
}
+ /// Advance the per-field row counters after a set of columns has been
+ /// written, keeping `rows_written` (the file's logical length) in sync as the
+ /// maximum column length.
+ fn advance_columns(&mut self, items: &[(usize, ArrayRef)]) {
+ for (field_idx, array) in items {
+ self.field_rows_written[*field_idx] += array.len() as u64;
+ }
+ self.rows_written = self.field_rows_written.iter().copied().max().unwrap_or(0);
+ }
+
/// Schedule a batch of data to be written to the file
///
/// Note: the future returned by this method may complete before the data has been fully
@@ -557,18 +599,100 @@ impl FileWriter {
.flatten()
.collect::>();
- self.rows_written = match self.rows_written.checked_add(batch.num_rows() as u64) {
- Some(rows_written) => rows_written,
- None => {
- return Err(Error::invalid_input_source(format!("cannot write batch with {} rows because {} rows have already been written and Lance files cannot contain more than 2^64 rows", num_rows, self.rows_written).into()));
- }
- };
+ // `write_batch` advances every field by the same amount, keeping all
+ // columns equal length. Guard against overflowing the row counter.
+ if self.rows_written.checked_add(num_rows).is_none() {
+ return Err(Error::invalid_input_source(format!("cannot write batch with {} rows because {} rows have already been written and Lance files cannot contain more than 2^64 rows", num_rows, self.rows_written).into()));
+ }
+ for field_rows in self.field_rows_written.iter_mut() {
+ *field_rows += num_rows;
+ }
+ self.rows_written = self.field_rows_written.iter().copied().max().unwrap_or(0);
self.write_pages(encoding_tasks).await?;
Ok(())
}
+ /// Write a set of columns whose lengths may differ from one another.
+ ///
+ /// Unlike [`write_batch`](Self::write_batch), which advances every column
+ /// from a single shared row counter, this method advances each column
+ /// independently. The result is a single file whose columns may have
+ /// different item counts — the physical layout used by sparse data overlay
+ /// files, where each field covers a different set of rows.
+ ///
+ /// `columns` is a list of `(field index, array)` pairs, where the field
+ /// index refers to a top-level field in the writer's schema (the same order
+ /// as the schema's fields). A field may be written across multiple calls;
+ /// its values are appended. A field that is never written ends up as a
+ /// zero-length column. The writer must have been created with an explicit
+ /// schema (via [`try_new`](Self::try_new)); a lazy schema cannot be inferred
+ /// here because individual calls need not cover every field.
+ ///
+ /// ```
+ /// # use arrow_array::{ArrayRef, Int32Array};
+ /// # use std::sync::Arc;
+ /// # use lance_file::writer::FileWriter;
+ /// # async fn example(writer: &mut FileWriter) -> lance_core::Result<()> {
+ /// // Field 0 gets three values, field 1 gets one — a non-rectangular file.
+ /// let a: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
+ /// let b: ArrayRef = Arc::new(Int32Array::from(vec![10]));
+ /// writer.write_columns(vec![(0, a), (1, b)]).await?;
+ /// # Ok(())
+ /// # }
+ /// ```
+ pub async fn write_columns(&mut self, columns: Vec<(usize, ArrayRef)>) -> Result<()> {
+ let schema = self.schema.as_ref().ok_or_else(|| {
+ Error::invalid_input_source(
+ "write_columns requires the writer to be created with an explicit schema".into(),
+ )
+ })?;
+ // Validate field indices, lengths, and nullability up front.
+ for (field_idx, array) in &columns {
+ let field = schema.fields.get(*field_idx).ok_or_else(|| {
+ Error::invalid_input_source(
+ format!(
+ "write_columns: field index {} is out of bounds (schema has {} fields)",
+ field_idx,
+ schema.fields.len()
+ )
+ .into(),
+ )
+ })?;
+ if array.len() as u64 > u32::MAX as u64 {
+ return Err(Error::invalid_input_source(
+ "cannot write Lance files with more than 2^32 rows".into(),
+ ));
+ }
+ Self::verify_field_nullability(&array.to_data(), field)?;
+ }
+ // Skip empty arrays: a never-advanced field simply remains a zero-length
+ // column, which the encoders handle at `finish` time.
+ let columns = columns
+ .into_iter()
+ .filter(|(_, array)| !array.is_empty())
+ .collect::>();
+ if columns.is_empty() {
+ return Ok(());
+ }
+
+ let mut external_buffers =
+ OutOfLineBuffers::new(self.tell().await?, PAGE_BUFFER_ALIGNMENT as u64);
+ let encoding_tasks = self.encode_columns(&columns, &mut external_buffers)?;
+ for external_buffer in external_buffers.take_buffers() {
+ Self::do_write_buffer(&mut self.writer, &external_buffer).await?;
+ }
+ let encoding_tasks = encoding_tasks
+ .into_iter()
+ .flatten()
+ .collect::>();
+
+ self.advance_columns(&columns);
+ self.write_pages(encoding_tasks).await?;
+ Ok(())
+ }
+
async fn write_column_metadata(
&mut self,
metadata: pbfile::ColumnMetadata,
@@ -974,11 +1098,11 @@ mod tests {
use std::collections::HashMap;
use std::sync::Arc;
- use crate::reader::{FileReader, FileReaderOptions, describe_encoding};
+ use crate::reader::{FileReader, FileReaderOptions, ReaderProjection, describe_encoding};
use crate::testing::FsFixture;
use crate::writer::{ENV_LANCE_FILE_WRITER_MAX_PAGE_BYTES, FileWriter, FileWriterOptions};
use arrow_array::builder::{Float32Builder, Int32Builder};
- use arrow_array::{Int32Array, RecordBatch, UInt64Array};
+ use arrow_array::{ArrayRef, Int32Array, RecordBatch, UInt64Array};
use arrow_array::{RecordBatchReader, StringArray, types::Float64Type};
use arrow_schema::{DataType, Field, Field as ArrowField, Schema, Schema as ArrowSchema};
use lance_core::cache::LanceCache;
@@ -990,6 +1114,7 @@ mod tests {
use lance_encoding::version::LanceFileVersion;
use lance_io::object_store::ObjectStore;
use lance_io::utils::CachedFileSize;
+ use rstest::rstest;
#[tokio::test]
async fn test_basic_write() {
@@ -1040,6 +1165,196 @@ mod tests {
file_writer.finish().await.unwrap();
}
+ /// Read a single column back at an explicit range/index set, returning its
+ /// `Int32` values. Reading one column at a time is how unequal-length files
+ /// are consumed: a global full-scan would conflate columns of different
+ /// lengths into one (impossible) rectangular batch.
+ async fn read_int32_column(
+ reader: &FileReader,
+ schema: &LanceSchema,
+ version: LanceFileVersion,
+ name: &str,
+ params: lance_io::ReadBatchParams,
+ ) -> Vec