feat(index): write zone map seeds into data file footers during append #7427

Open
westonpace wants to merge 10 commits into lance-format:main from westonpace:feat-index-write-breadcrumbs

westonpace commented Jun 23, 2026

Member

This PR introduces the concept of "index seeds". Indexes can opt in to planting seeds during ingestion. The seeds are stored in the data files as global buffers. Later, when updating the index to include these data files, we can harvest the seeds instead of scanning the data itself.

This is primarily intended to avoid a potentially expensive data scan when updating the index. As an example, this PR adds index seeds for wide (binary, fixed-size-list, string) columns when creating a zone map index. We now calculate the min/max/null counts during ingestion, while the data is already present and flowing through the system. Then, when we update the index, all we do is read back those statistics and add them to the index (instead of scanning the large column all over again).
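
To make the idea concrete, here is a minimal sketch of per-zone min/max/null accumulation over a stream of string values. It is not the code in this PR: `ZoneStats`, `ZoneSeedAccumulator`, and their methods are hypothetical names chosen for illustration, and serialization into a global buffer is left out.

```rust
/// Min/max/null-count summary for one zone (a fixed-size run of rows).
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct ZoneStats {
    pub min: Option<String>,
    pub max: Option<String>,
    pub null_count: u64,
    pub num_rows: u64,
}

impl ZoneStats {
    fn empty() -> Self {
        Self { min: None, max: None, null_count: 0, num_rows: 0 }
    }
}

/// Hypothetical accumulator that observes values while they are being written.
pub struct ZoneSeedAccumulator {
    rows_per_zone: u64,
    current: ZoneStats,
    finished: Vec<ZoneStats>,
}

impl ZoneSeedAccumulator {
    pub fn new(rows_per_zone: u64) -> Self {
        assert!(rows_per_zone > 0, "rows_per_zone must be positive");
        Self { rows_per_zone, current: ZoneStats::empty(), finished: Vec::new() }
    }

    /// Fold one value into the current zone and close the zone once it is full.
    pub fn observe(&mut self, value: Option<&str>) {
        match value {
            None => self.current.null_count += 1,
            Some(v) => {
                if self.current.min.as_deref().map_or(true, |m| v < m) {
                    self.current.min = Some(v.to_string());
                }
                if self.current.max.as_deref().map_or(true, |m| v > m) {
                    self.current.max = Some(v.to_string());
                }
            }
        }
        self.current.num_rows += 1;
        if self.current.num_rows == self.rows_per_zone {
            let full = std::mem::replace(&mut self.current, ZoneStats::empty());
            self.finished.push(full);
        }
    }

    /// Flush the trailing partial zone (if any) and return every zone in order.
    pub fn finish(mut self) -> Vec<ZoneStats> {
        if self.current.num_rows > 0 {
            self.finished.push(self.current);
        }
        self.finished
    }
}
```

The point of the sketch is that each value is touched once, while it is already in memory for the write, so the later index update only needs the small `Vec<ZoneStats>` rather than the column itself.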

github-actions bot added the A-index (Vector index, linalg, tokenizer), A-deps (Dependency updates), and enhancement (New feature or request) labels Jun 23, 2026
@github-actions

Contributor

Important

This PR touches the Lance format specification.

Substantive changes to the format specification — the .proto definitions
and the spec docs under docs/src/format/ — require a PMC vote before merge.
Minor edits such as typo fixes, wording, or formatting are excluded; use your
judgment.

If this is a meaningful format change:

  • Start a vote following the Lance community voting process.
    Format specification modifications need 3 binding +1 votes (excluding the
    proposer), held on GitHub Discussions, with a minimum voting period of 1 week.
  • Once the vote passes, link the completed vote in this PR. It should not be
    merged until the vote is linked.

github-actions bot added the A-format (On-disk format: protos and format spec docs) label Jun 25, 2026
//! Index seed writers — compact per-fragment summaries embedded in data files.
//!
//! A seed writer observes column values as they are written to a data file,
//! accumulates compact statistics in memory, and serializes them to a byte

Member Author

At the moment we assume the entire seed can comfortably fit in memory, but in theory we could split it into multiple global buffers in the future if needed.

westonpace marked this pull request as ready for review June 25, 2026 14:09
westonpace and others added 10 commits June 25, 2026 14:13
Introduces "index seeds" — compact per-fragment zone map statistics
embedded as global buffers in data file footers at write time. During
index update the seeds are harvested directly from the file footer,
skipping the full column scan that would otherwise be required.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…on IndexSeedWriter

Move seed writer creation out of WriteParams (a config struct shouldn't
hold mutable stateful objects) into write_fragments_internal, passing
them as an explicit parameter to do_write_fragments. Switch IndexSeedWriter
trait methods to &mut self, eliminating the need for Arc + Mutex.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n capability

Add `create_seed_writer` to `ScalarIndexPlugin` with a default `Ok(None)`
implementation so any index type can opt in. Implement it on
`ZoneMapIndexPlugin` by reading `rows_per_zone` from the index file
metadata — no full index load required. Replace the hardcoded
`create_zone_map_seed_writers` in the write path with a generic
`create_seed_writers` that iterates all single-field indices, resolves
the plugin via `IndexDetails::get_plugin`, and delegates to
`plugin.create_seed_writer`. Also move `ZONEMAP_SEED_META_KEY_PREFIX`
to `seed.rs` as the shared `SEED_META_KEY_PREFIX`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…/O from create_seed_writer

Add `optional uint64 rows_per_zone` to the `ZoneMapIndexDetails` proto
and populate it at all four CreatedIndex emission sites. This lets
`ZoneMapIndexPlugin::create_seed_writer` read rows_per_zone directly
from the decoded proto with no file I/O. Old datasets that predate
the field fall back to ROWS_PER_ZONE_DEFAULT via the optional/None
path. Also remove the `index_store` parameter from
`ScalarIndexPlugin::create_seed_writer` and document that the method
must not perform I/O — all needed parameters must come from
`index_details`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…dexDetails

Old datasets that predate the rows_per_zone proto field should not fall
back to a default value — the seed buffer would silently use the wrong
zone size. Return Ok(None) instead so no seed writer is created.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
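
The opt-in hook and the "absent field means no seed writer" rule described in the commits above can be sketched as follows. This is a simplified stand-in, not the PR's code: the real `ScalarIndexPlugin::create_seed_writer` signature is not shown in this conversation, so the parameter and return types here, along with `SeedWriter` and `NoSeedPlugin`, are assumptions.

```rust
/// Simplified stand-in for the decoded `ZoneMapIndexDetails` proto.
pub struct ZoneMapIndexDetails {
    /// `None` on datasets written before the field existed.
    pub rows_per_zone: Option<u64>,
}

/// Placeholder seed writer (hypothetical); it only carries the configuration.
#[derive(Debug, PartialEq)]
pub struct SeedWriter {
    pub rows_per_zone: u64,
}

/// Simplified stand-in for the plugin trait; the signature is assumed.
pub trait ScalarIndexPlugin {
    /// Default: this index type does not plant seeds.
    fn create_seed_writer(
        &self,
        _details: &ZoneMapIndexDetails,
    ) -> Result<Option<SeedWriter>, String> {
        Ok(None)
    }
}

/// Hypothetical plugin that does not opt in and keeps the default.
pub struct NoSeedPlugin;
impl ScalarIndexPlugin for NoSeedPlugin {}

/// Zone map plugin that opts in.
pub struct ZoneMapPlugin;
impl ScalarIndexPlugin for ZoneMapPlugin {
    fn create_seed_writer(
        &self,
        details: &ZoneMapIndexDetails,
    ) -> Result<Option<SeedWriter>, String> {
        // No I/O: everything comes from the decoded details.
        // An absent field yields no writer instead of guessing a zone size.
        Ok(details
            .rows_per_zone
            .map(|rows_per_zone| SeedWriter { rows_per_zone }))
    }
}
```

Mapping `None` to `Ok(None)` keeps old datasets on the full-scan path, which avoids embedding a seed built with a zone size that may not match the existing index.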
…tly ignoring them

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ugin capability

Add `update_from_seeds` to `ScalarIndexPlugin` (default `Ok(None)`) so any
index type can opt into seed-based incremental updates without coupling the
update path to a specific index type. Add `metadata_value` to `FragmentSeed`
so plugins can validate seed compatibility (e.g. confirming `rows_per_zone`).

Implement `ZoneMapIndexPlugin::update_from_seeds` with `rows_per_zone`
validation and rename `try_harvest_zonemap_seeds` → `try_harvest_seeds` in
`append.rs`. Remove the hardcoded `IndexType::ZoneMap` branch; the merge path
now tries seeds generically before falling back to a full column scan.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
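
A rough sketch of the "try seeds first, then fall back to a full column scan" flow from the commit above. The types and function signatures are assumptions for illustration. In particular, treating a `rows_per_zone` mismatch as an error (rather than as a reason to fall back) is a choice made for this sketch; the conversation does not show how the PR handles that case.

```rust
/// Hypothetical per-fragment seed as harvested from a data file footer.
pub struct FragmentSeed {
    pub fragment_id: u64,
    /// Assumed here to hold the `rows_per_zone` the seed was written with.
    pub metadata_value: String,
    pub payload: Vec<u8>,
}

/// Which path the index update ends up taking.
#[derive(Debug, PartialEq)]
pub enum UpdatePath {
    FromSeeds { fragments: usize },
    FullScan,
}

/// Returns `Ok(Some(n))` when all `n` new fragments carry a compatible seed,
/// `Ok(None)` when any fragment has no seed, and `Err` on a zone-size mismatch.
pub fn update_from_seeds(
    index_rows_per_zone: u64,
    seeds: &[Option<FragmentSeed>],
) -> Result<Option<usize>, String> {
    let mut harvested = 0;
    for seed in seeds {
        // A fragment without a seed means the caller has to scan instead.
        let Some(seed) = seed else {
            return Ok(None);
        };
        let seed_rows: u64 = seed
            .metadata_value
            .parse()
            .map_err(|e| format!("fragment {}: bad seed metadata: {e}", seed.fragment_id))?;
        if seed_rows != index_rows_per_zone {
            return Err(format!(
                "fragment {}: seed rows_per_zone {} != index rows_per_zone {}",
                seed.fragment_id, seed_rows, index_rows_per_zone
            ));
        }
        harvested += 1;
    }
    Ok(Some(harvested))
}

/// Generic merge path: try seeds first, otherwise fall back to a full scan.
pub fn plan_update(
    index_rows_per_zone: u64,
    seeds: &[Option<FragmentSeed>],
) -> Result<UpdatePath, String> {
    Ok(match update_from_seeds(index_rows_per_zone, seeds)? {
        Some(fragments) => UpdatePath::FromSeeds { fragments },
        None => UpdatePath::FullScan,
    })
}
```

Because the caller only looks at `Option`, it needs no knowledge of the index type, which is what lets the hardcoded zone map branch go away.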
Add `use_seeds: bool` to `ZoneMapIndexDetails` proto and propagate it
through the zone map plugin stack. The field defaults to `true` for
variable-length (string, binary) and wide primitive types (≥ 8 bytes)
where skipping a column scan has the most impact, and `false` for narrow
fixed-width types (Int32, Float32, …) that are fast enough to scan
directly.

`ScalarIndexPlugin` gains a `might_use_seeds` hook (default `false`) that
lets the update path in `append.rs` skip opening data files entirely when
the index configuration is known to never write seeds. `ZoneMapIndexPlugin`
implements the hook by reading `use_seeds` from the proto details.

`create_seed_writer` now also respects `use_seeds`, so no seed buffer is
embedded in the data file for indexes where seeds are disabled.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
8-byte primitives (Int64, Float64, Timestamp, …) scan fast enough that
seed overhead is not worthwhile. `default_use_seeds` now returns true only
for variable-length types (strings, binary) and fixed-width types strictly
wider than 8 bytes (Decimal128, Decimal256, FixedSizeBinary(n > 8)).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
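
The width rule in the commit above can be sketched like this. `ColumnType` is a hypothetical stand-in for the real data type enum (the actual `default_use_seeds` signature is not shown here), and only a handful of types are modelled.

```rust
/// Hypothetical, reduced set of column types for illustration.
pub enum ColumnType {
    Int32,
    Float32,
    Int64,
    Float64,
    Timestamp,
    Decimal128,
    Decimal256,
    FixedSizeBinary(usize),
    Utf8,
    Binary,
}

impl ColumnType {
    /// Byte width of one value, or `None` for variable-length types.
    fn fixed_width_bytes(&self) -> Option<usize> {
        match self {
            ColumnType::Int32 | ColumnType::Float32 => Some(4),
            ColumnType::Int64 | ColumnType::Float64 | ColumnType::Timestamp => Some(8),
            ColumnType::Decimal128 => Some(16),
            ColumnType::Decimal256 => Some(32),
            ColumnType::FixedSizeBinary(n) => Some(*n),
            ColumnType::Utf8 | ColumnType::Binary => None,
        }
    }
}

/// Seeds default to on for variable-length types and for fixed-width types
/// strictly wider than 8 bytes; everything else is scanned directly.
pub fn default_use_seeds(ty: &ColumnType) -> bool {
    match ty.fixed_width_bytes() {
        None => true,
        Some(width) => width > 8,
    }
}
```

With this rule, `Int64` and `FixedSizeBinary(8)` stay on the scan path, while `Decimal128`, `FixedSizeBinary(9)`, and strings get seeds by default.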
`build_per_segment_filters` and `open_and_merge_segments` take `&[&IndexMetadata]`
after an upstream signature change in main; pass `&selected_old_indices` to match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
westonpace force-pushed the feat-index-write-breadcrumbs branch from f93e5cc to 338199a June 25, 2026 14:20
Comment thread protos/index_old.proto
// Number of rows per zone. Optional for backwards compatibility: absent on
// datasets written before this field was added. When absent, no seed writer
// is created for the index.
optional uint64 rows_per_zone = 1;

Collaborator

We will need a vote for this change.

Comment thread protos/index_old.proto

Collaborator

Since we are going to touch this file, I think we need a better name for it.
