feat: add blob v2 source id sharing #7442
Draft
Xuanwo wants to merge 2 commits into
…-layer # Conflicts: # rust/lance-core/src/datatypes/field.rs # rust/lance/src/blob.rs # rust/lance/src/dataset/blob.rs
This PR implements Blob v2 object-layer plumbing and explicit write-side source sharing without changing the on-disk descriptor format.
Blob values can now carry an optional `source_id` as a write-only hint. The writer consumes this hint inside `BlobPreprocessor`, reuses the first materialized packed or dedicated descriptor within the same data file, and keeps `source_id` out of manifest and read schemas.

API examples
Python:
External ingest can also use explicit source identity:
Rust:
Benefits
This enables exact descriptor sharing for Lance-owned blob payloads. Within a single data file, repeated packed or dedicated blobs with the same `source_id` are written once, and later rows reuse the same descriptor.

This is useful when multiple rows refer to the same logical object, such as repeated images, documents, embedding payloads, or externally ingested objects. It reduces duplicated blob bytes and avoids repeated external reads in ingest mode.
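As a rough illustration of the write-side behavior, here is a toy model in plain Python. It is not the Lance writer or its API: `Descriptor`, `DataFileWriter`, and the `write` method are hypothetical names invented for this sketch, and the size check only approximates the validation described in this PR.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical stand-in for a blob descriptor: a byte range in a blob file."""
    blob_id: int
    position: int
    size: int


class DataFileWriter:
    """Toy model of the packed sidecar for ONE data file (not Lance code)."""

    def __init__(self, blob_id: int) -> None:
        self.blob_id = blob_id
        self.sidecar = bytearray()
        # The sharing table lives and dies with this data file.
        self._by_source: Dict[str, Descriptor] = {}

    def write(self, payload: bytes, source_id: Optional[str] = None) -> Descriptor:
        if source_id is not None and source_id in self._by_source:
            shared = self._by_source[source_id]
            # Only the size is checked; payload bytes are never hashed or compared.
            if shared.size != len(payload):
                raise ValueError(f"size mismatch for source_id {source_id!r}")
            return shared
        # First occurrence (or no source_id): materialize the bytes.
        desc = Descriptor(self.blob_id, len(self.sidecar), len(payload))
        self.sidecar += payload
        if source_id is not None:
            self._by_source[source_id] = desc
        return desc


writer = DataFileWriter(blob_id=1)
a = writer.write(b"cat.png bytes", source_id="s3://bucket/cat.png")
b = writer.write(b"dog.png bytes!", source_id="s3://bucket/dog.png")
c = writer.write(b"cat.png bytes", source_id="s3://bucket/cat.png")
```

In this model `a` and `c` are the same descriptor and the sidecar holds each payload once. A second `DataFileWriter` would start with an empty table, which mirrors the data-file-local scope.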
The sharing is explicit rather than content-based. Lance does not hash or compare payload bytes; `source_id` is the user's declaration that rows refer to the same logical source. The writer validates declared size when available, but same-size different-content inputs remain the user's responsibility.

The scope is intentionally data-file-local. Cross-fragment or cross-data-file physical sharing is not part of this contract. Inline blobs ignore `source_id`, and external reference mode does not materialize or deduplicate bytes.

Compatibility
The on-disk Blob v2 descriptor and raw packed sidecar layout are unchanged. Shared rows are represented by multiple descriptors pointing at the same `(blob_id, position, size)` range.

`source_id` is a write-schema field only. It is consumed before encoding and is filtered from the manifest schema. Default reads still return the descriptor struct and never expose `source_id`.

Rust keeps the historical default `blob_field()` shape. Python upgrades its default write storage type to include `source_id`, while deserialization preserves older 4-field IPC storage types.

Validation covered Rust blob/object-layer tests, Lance schema conversion tests, Python blob tests, formatting, and diff whitespace checks.
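To see why shared rows need no reader-side change, consider this small sketch of range-based reads. It assumes only that a descriptor resolves to a `(blob_id, position, size)` byte range. The `Descriptor` tuple, the `read_blob` helper, and the in-memory `sidecars` mapping are hypothetical and do not reflect Lance's actual read path.

```python
from typing import Dict, NamedTuple


class Descriptor(NamedTuple):
    """Hypothetical descriptor: a byte range inside a blob file. No source_id here."""
    blob_id: int
    position: int
    size: int


def read_blob(sidecars: Dict[int, bytes], desc: Descriptor) -> bytes:
    """Resolve a descriptor by slicing its range out of the referenced blob file."""
    data = sidecars[desc.blob_id]
    return data[desc.position : desc.position + desc.size]


# One packed sidecar holding two distinct payloads back to back.
sidecars = {7: b"AAAABBBBBB"}

# Three rows: the first and third share the same range.
rows = [Descriptor(7, 0, 4), Descriptor(7, 4, 6), Descriptor(7, 0, 4)]
payloads = [read_blob(sidecars, d) for d in rows]
```

Each row is resolved independently, so a reader cannot tell, and does not need to know, that two descriptors were produced by sharing.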