feat(index): write zone map seeds into data file footers during append by westonpace · Pull Request #7427 · lance-format/lance

westonpace · 2026-06-23T17:16:16Z

This PR introduces the concept of "index seeds". Indexes can opt in to planting seeds during ingestion. The seeds are placed into the data files as global buffers. Later, when updating the index to include these data files, we can harvest the seeds instead of scanning the data itself.

This is primarily intended to avoid a potentially expensive data scan to update the index. As an example, this PR adds index seeds for wide (binary, fixed-size-list, string) columns when creating a zone map index. Now we calculate the min/max/nulls during ingestion, when the data is already present and flowing through the system. Then, when we go to update the index, all we are doing is reading back those statistics and adding them to the index (instead of scanning the large column all over again).
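To make the idea concrete, here is a minimal sketch of accumulating per-zone min/max/null statistics while string values stream through a writer. It is not the code in this PR: `ZoneStats`, `ZoneAccumulator`, and their methods are hypothetical names, and the real implementation works on Arrow arrays rather than `Option<&str>`.

```rust
/// Hypothetical per-zone summary; the field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct ZoneStats {
    pub min: Option<String>,
    pub max: Option<String>,
    pub null_count: u64,
    pub num_rows: u64,
}

/// Accumulates zone statistics as values pass by during a write,
/// so no second scan of the column is needed later.
pub struct ZoneAccumulator {
    rows_per_zone: u64,
    current: ZoneStats,
    finished: Vec<ZoneStats>,
}

impl ZoneAccumulator {
    pub fn new(rows_per_zone: u64) -> Self {
        assert!(rows_per_zone > 0, "rows_per_zone must be positive");
        Self {
            rows_per_zone,
            current: Self::empty(),
            finished: Vec::new(),
        }
    }

    fn empty() -> ZoneStats {
        ZoneStats { min: None, max: None, null_count: 0, num_rows: 0 }
    }

    /// Observe one value; `None` represents a null.
    pub fn observe(&mut self, value: Option<&str>) {
        match value {
            None => self.current.null_count += 1,
            Some(v) => {
                if self.current.min.as_deref().map_or(true, |m| v < m) {
                    self.current.min = Some(v.to_string());
                }
                if self.current.max.as_deref().map_or(true, |m| v > m) {
                    self.current.max = Some(v.to_string());
                }
            }
        }
        self.current.num_rows += 1;
        // Close the zone once it holds `rows_per_zone` rows.
        if self.current.num_rows == self.rows_per_zone {
            let done = std::mem::replace(&mut self.current, Self::empty());
            self.finished.push(done);
        }
    }

    /// Flush the trailing partial zone (if any) and return all zones.
    pub fn finish(mut self) -> Vec<ZoneStats> {
        if self.current.num_rows > 0 {
            self.finished.push(self.current);
        }
        self.finished
    }
}
```

In this sketch the vector returned by `finish` is what would be serialized into a global buffer; the serialization format itself is left out because the description above does not specify it.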

github-actions · 2026-06-25T12:58:33Z

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

Start a vote following the Lance community voting process.
Format specification modifications need 3 binding +1 votes (excluding the
proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
Once the vote passes, link the completed vote in this PR. It should not be
merged until the vote is linked.

westonpace · 2026-06-24T12:58:26Z

+//! Index seed writers — compact per-fragment summaries embedded in data files.
+//!
+//! A seed writer observes column values as they are written to a data file,
+//! accumulates compact statistics in memory, and serializes them to a byte


At the moment we assume the entire seed can comfortably fit in memory, but in theory we could split it across multiple global buffers in the future if we needed to.

Introduces "index seeds" — compact per-fragment zone map statistics embedded as global buffers in data file footers at write time. During index update the seeds are harvested directly from the file footer, skipping the full column scan that would otherwise be required. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…on IndexSeedWriter Move seed writer creation out of WriteParams (a config struct shouldn't hold mutable stateful objects) into write_fragments_internal, passing them as an explicit parameter to do_write_fragments. Switch IndexSeedWriter trait methods to &mut self, eliminating the need for Arc + Mutex. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
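A rough sketch of what this refactor implies, under assumptions: the commit only says that `IndexSeedWriter` methods take `&mut self` and that writers are passed as an explicit parameter. The method names `observe` and `finish`, the `NullCountSeed` type, and the `write_batches` function below are all hypothetical stand-ins.

```rust
/// Hypothetical shape of a seed writer. Only the `&mut self` receivers
/// are taken from the commit message; the method names are assumptions.
pub trait IndexSeedWriter {
    fn observe(&mut self, values: &[Option<i64>]);
    fn finish(&mut self) -> Vec<u8>;
}

/// Toy writer that only counts nulls.
pub struct NullCountSeed {
    pub nulls: u64,
}

impl IndexSeedWriter for NullCountSeed {
    fn observe(&mut self, values: &[Option<i64>]) {
        self.nulls += values.iter().filter(|v| v.is_none()).count() as u64;
    }

    fn finish(&mut self) -> Vec<u8> {
        self.nulls.to_le_bytes().to_vec()
    }
}

/// Stand-in for the write path: the writers arrive as an explicit
/// `&mut` parameter, so exclusive access is guaranteed by the borrow
/// checker and no `Arc<Mutex<..>>` is required.
pub fn write_batches(
    batches: &[Vec<Option<i64>>],
    writers: &mut [Box<dyn IndexSeedWriter>],
) -> Vec<Vec<u8>> {
    for batch in batches {
        for writer in writers.iter_mut() {
            writer.observe(batch);
        }
    }
    writers.iter_mut().map(|w| w.finish()).collect()
}
```

Keeping the stateful writers out of the config struct also means the config stays cheap to clone and safe to reuse across writes.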

…n capability Add `create_seed_writer` to `ScalarIndexPlugin` with a default `Ok(None)` implementation so any index type can opt in. Implement it on `ZoneMapIndexPlugin` by reading `rows_per_zone` from the index file metadata — no full index load required. Replace the hardcoded `create_zone_map_seed_writers` in the write path with a generic `create_seed_writers` that iterates all single-field indices, resolves the plugin via `IndexDetails::get_plugin`, and delegates to `plugin.create_seed_writer`. Also move `ZONEMAP_SEED_META_KEY_PREFIX` to `seed.rs` as the shared `SEED_META_KEY_PREFIX`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…/O from create_seed_writer Add `optional uint64 rows_per_zone` to the `ZoneMapIndexDetails` proto and populate it at all four CreatedIndex emission sites. This lets `ZoneMapIndexPlugin::create_seed_writer` read rows_per_zone directly from the decoded proto with no file I/O. Old datasets that predate the field fall back to ROWS_PER_ZONE_DEFAULT via the optional/None path. Also remove the `index_store` parameter from `ScalarIndexPlugin::create_seed_writer` and document that the method must not perform I/O — all needed parameters must come from `index_details`. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…dexDetails Old datasets that predate the rows_per_zone proto field should not fall back to a default value — the seed buffer would silently use the wrong zone size. Return Ok(None) instead so no seed writer is created. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
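The opt-in behavior described in these commits can be sketched as a trait method with a default `Ok(None)` body. This is a simplified illustration, not the real `ScalarIndexPlugin`: the `Plugin` trait, `ZoneMapDetails`, `SeedWriter`, and the `String` error type are hypothetical, and rejecting a zero `rows_per_zone` is an added assumption used only to show an error path.

```rust
/// Hypothetical decoded index details; `None` models an old dataset
/// written before the `rows_per_zone` field existed.
pub struct ZoneMapDetails {
    pub rows_per_zone: Option<u64>,
}

#[derive(Debug, PartialEq)]
pub struct SeedWriter {
    pub rows_per_zone: u64,
}

pub trait Plugin {
    /// Default: the index type does not plant seeds. Must not do I/O;
    /// everything needed comes from `details`.
    fn create_seed_writer(
        &self,
        _details: &ZoneMapDetails,
    ) -> Result<Option<SeedWriter>, String> {
        Ok(None)
    }
}

pub struct ZoneMapPlugin;

impl Plugin for ZoneMapPlugin {
    fn create_seed_writer(
        &self,
        details: &ZoneMapDetails,
    ) -> Result<Option<SeedWriter>, String> {
        match details.rows_per_zone {
            // Unknown zone size: skip seeds rather than guess a default.
            None => Ok(None),
            // Assumption for illustration: a zero zone size is an error.
            Some(0) => Err("rows_per_zone must be positive".to_string()),
            Some(n) => Ok(Some(SeedWriter { rows_per_zone: n })),
        }
    }
}

/// An index type that never opts in simply keeps the default.
pub struct OtherPlugin;

impl Plugin for OtherPlugin {}
```

Returning `Ok(None)` for the missing field means the later index update falls back to scanning the column, which is slower but cannot produce zones of the wrong size.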

…tly ignoring them Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ugin capability Add `update_from_seeds` to `ScalarIndexPlugin` (default `Ok(None)`) so any index type can opt into seed-based incremental updates without coupling the update path to a specific index type. Add `metadata_value` to `FragmentSeed` so plugins can validate seed compatibility (e.g. confirming `rows_per_zone`). Implement `ZoneMapIndexPlugin::update_from_seeds` with `rows_per_zone` validation and rename `try_harvest_zonemap_seeds` → `try_harvest_seeds` in `append.rs`. Remove the hardcoded `IndexType::ZoneMap` branch; the merge path now tries seeds generically before falling back to a full column scan. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
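The "try seeds first, then fall back to a scan" flow might look roughly like the following. This is a sketch under assumptions: `FragmentSeed` here has only two illustrative fields, the metadata value is assumed to be `rows_per_zone` rendered as a decimal string, and a single missing or incompatible seed is assumed to force a full scan. `UpdateSource` and `update_index` are hypothetical names.

```rust
/// Simplified seed as read back from one data file footer.
pub struct FragmentSeed {
    /// Assumed here to carry `rows_per_zone` as a decimal string.
    pub metadata_value: String,
    pub buffer: Vec<u8>,
}

#[derive(Debug, PartialEq)]
pub enum UpdateSource {
    Seeds,
    FullScan,
}

/// Returns the concatenated seed bytes only if every fragment has a
/// seed whose metadata matches the index's `rows_per_zone`.
pub fn try_harvest_seeds(
    seeds: &[Option<FragmentSeed>],
    rows_per_zone: u64,
) -> Option<Vec<u8>> {
    let expected = rows_per_zone.to_string();
    let mut merged = Vec::new();
    for seed in seeds {
        let seed = seed.as_ref()?; // fragment written without a seed
        if seed.metadata_value != expected {
            return None; // seed built with a different zone size
        }
        merged.extend_from_slice(&seed.buffer);
    }
    Some(merged)
}

/// Generic update path: harvest seeds when possible, otherwise run the
/// (expensive) column scan supplied by the caller.
pub fn update_index(
    seeds: &[Option<FragmentSeed>],
    rows_per_zone: u64,
    scan: impl FnOnce() -> Vec<u8>,
) -> (UpdateSource, Vec<u8>) {
    match try_harvest_seeds(seeds, rows_per_zone) {
        Some(bytes) => (UpdateSource::Seeds, bytes),
        None => (UpdateSource::FullScan, scan()),
    }
}
```

Because the scan is passed as a closure, it only runs when harvesting fails, which mirrors the intent of skipping the column read whenever compatible seeds are available.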

Add `use_seeds: bool` to `ZoneMapIndexDetails` proto and propagate it through the zone map plugin stack. The field defaults to `true` for variable-length (string, binary) and wide primitive types (≥ 8 bytes) where skipping a column scan has the most impact, and `false` for narrow fixed-width types (Int32, Float32, …) that are fast enough to scan directly. `ScalarIndexPlugin` gains a `might_use_seeds` hook (default `false`) that lets the update path in `append.rs` skip opening data files entirely when the index configuration is known to never write seeds. `ZoneMapIndexPlugin` implements the hook by reading `use_seeds` from the proto details. `create_seed_writer` now also respects `use_seeds`, so no seed buffer is embedded in the data file for indexes where seeds are disabled. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

8-byte primitives (Int64, Float64, Timestamp, …) scan fast enough that seed overhead is not worthwhile. `default_use_seeds` now returns true only for variable-length types (strings, binary) and fixed-width types strictly wider than 8 bytes (Decimal128, Decimal256, FixedSizeBinary(n > 8)). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
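The resulting rule can be sketched as a small function over column types. The `ColumnType` enum below is a hypothetical stand-in for the Arrow data type the real `default_use_seeds` inspects; only the rule itself (variable-length types, or fixed-width types strictly wider than 8 bytes) comes from the commit message.

```rust
/// Hypothetical, heavily reduced set of column types.
pub enum ColumnType {
    Int32,
    Float32,
    Int64,
    Float64,
    Timestamp,
    Decimal128,
    Decimal256,
    FixedSizeBinary(usize),
    Utf8,
    Binary,
}

impl ColumnType {
    /// Width in bytes for fixed-width types, `None` for variable-length.
    fn fixed_width(&self) -> Option<usize> {
        match self {
            ColumnType::Int32 | ColumnType::Float32 => Some(4),
            ColumnType::Int64 | ColumnType::Float64 | ColumnType::Timestamp => Some(8),
            ColumnType::Decimal128 => Some(16),
            ColumnType::Decimal256 => Some(32),
            ColumnType::FixedSizeBinary(n) => Some(*n),
            ColumnType::Utf8 | ColumnType::Binary => None,
        }
    }
}

/// Seeds default to on for variable-length types and for fixed-width
/// types strictly wider than 8 bytes; everything else is scanned.
pub fn default_use_seeds(ty: &ColumnType) -> bool {
    match ty.fixed_width() {
        None => true,
        Some(width) => width > 8,
    }
}
```

Under this rule `FixedSizeBinary(8)` is scanned directly while `FixedSizeBinary(9)` gets seeds, which matches the "strictly wider than 8 bytes" wording.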

`build_per_segment_filters` and `open_and_merge_segments` take `&[&IndexMetadata]` after an upstream signature change in main; pass `&selected_old_indices` to match. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Xuanwo · 2026-06-25T14:45:07Z

+  // Number of rows per zone. Optional for backwards compatibility: absent on
+  // datasets written before this field was added. When absent, no seed writer
+  // is created for the index.
+  optional uint64 rows_per_zone = 1;


We will need a vote for this change.

Xuanwo · 2026-06-25T14:45:26Z

Since we are going to touch this file, I think we should find a better name for it.

github-actions Bot added the A-index (Vector index, linalg, tokenizer), A-deps (Dependency updates), and enhancement (New feature or request) labels Jun 23, 2026

github-actions Bot added the A-format On-disk format: protos and format spec docs label Jun 25, 2026


westonpace marked this pull request as ready for review June 25, 2026 14:09

westonpace and others added 10 commits June 25, 2026 14:13

fix(index): propagate errors from create_seed_writer instead of silen…

46311f2

…tly ignoring them Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

westonpace force-pushed the feat-index-write-breadcrumbs branch from f93e5cc to 338199a Compare June 25, 2026 14:20

westonpace wants to merge 10 commits into lance-format:main from westonpace:feat-index-write-breadcrumbs


2 participants
