feat: add blob v2 source id sharing #7442

Draft: Xuanwo wants to merge 2 commits into main from xuanwo/blob-v2-object-layer

@Xuanwo (Collaborator) commented Jun 24, 2026

This PR implements Blob v2 object-layer plumbing and explicit write-side source sharing without changing the on-disk descriptor format.

Blob values can now carry an optional source_id as a write-only hint. The writer consumes this hint inside BlobPreprocessor, reuses the first materialized packed or dedicated descriptor within the same data file, and keeps source_id out of manifest and read schemas.

API examples

Python:

import lance
import pyarrow as pa
from lance import Blob

payload = b"x" * (1024 * 1024)

table = pa.table({
    "image": lance.blob_array([
        Blob.from_bytes(payload, source_id="image:001"),
        Blob.from_bytes(payload, source_id="image:001"),
    ])
})

lance.write_dataset(table, "dataset", data_storage_version="2.2")

External ingest can also use explicit source identity:

import lance
import pyarrow as pa
from lance import Blob

table = pa.table({
    "image": lance.blob_array([
        Blob.from_uri("s3://bucket/image.jpg", source_id="image:001"),
        Blob.from_uri("s3://bucket/image.jpg", source_id="image:001"),
    ])
})

lance.write_dataset(
    table,
    "dataset",
    data_storage_version="2.2",
    external_blob_mode="ingest",
)

Rust:

use lance::blob::BlobArrayBuilder;

let payload = vec![0_u8; 1024 * 1024];

let mut builder = BlobArrayBuilder::new(2);
builder.push_bytes_with_source_id("image:001", &payload)?;
builder.push_bytes_with_source_id("image:001", &payload)?;

let field = builder.field("image", true);
let array = builder.finish()?;

Benefits

This enables exact descriptor sharing for Lance-owned blob payloads. Within a single data file, repeated packed or dedicated blobs with the same source_id are written once, and later rows reuse the same descriptor.
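The reuse rule described above can be sketched as a toy model in plain Python. This is not the real BlobPreprocessor: `Descriptor`, `DataFileWriter`, and `write_blob` are hypothetical names invented for illustration, and the sketch only assumes that the writer keeps a per-data-file map from source_id to the first materialized descriptor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical stand-in for a (blob_id, position, size) descriptor."""
    blob_id: int
    position: int
    size: int


class DataFileWriter:
    """Toy model of one data file; not the real BlobPreprocessor."""

    def __init__(self, blob_id: int = 1) -> None:
        self.blob_id = blob_id
        self.sidecar = bytearray()  # bytes actually written for this file
        self._by_source = {}        # source_id -> first materialized descriptor

    def write_blob(self, payload: bytes, source_id=None) -> Descriptor:
        # Later rows with a known source_id reuse the first descriptor.
        if source_id is not None and source_id in self._by_source:
            return self._by_source[source_id]
        desc = Descriptor(self.blob_id, len(self.sidecar), len(payload))
        self.sidecar.extend(payload)
        if source_id is not None:
            self._by_source[source_id] = desc
        return desc


writer = DataFileWriter()
payload = b"x" * 1024
first = writer.write_blob(payload, source_id="image:001")
second = writer.write_blob(payload, source_id="image:001")
third = writer.write_blob(payload)  # no source_id, so the bytes are written again

print(first == second)      # True
print(len(writer.sidecar))  # 2048
```

In this model the two rows tagged "image:001" cost 1024 bytes rather than 2048, while the untagged row is always materialized.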

This is useful when multiple rows refer to the same logical object, such as repeated images, documents, embedding payloads, or externally ingested objects. It reduces duplicated blob bytes and avoids repeated external reads in ingest mode.

The sharing is explicit rather than content-based. Lance does not hash or compare payload bytes; source_id is the user's declaration that rows refer to the same logical source. The writer validates declared size when available, but same-size different-content inputs remain the user's responsibility.
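A minimal sketch of the size check follows, assuming that a mismatch in declared size for an already-seen source_id is reported as an error (the PR text does not say how the failure surfaces). `SourceSizeMismatch` and `check_declared_size` are hypothetical names, not part of the Lance API.

```python
class SourceSizeMismatch(ValueError):
    """Hypothetical error type used only in this sketch."""


def check_declared_size(seen_sizes: dict, source_id: str, declared_size) -> None:
    """Remember the first declared size per source_id; reject later mismatches."""
    if declared_size is None:
        return  # size not available, nothing to validate
    known = seen_sizes.setdefault(source_id, declared_size)
    if known != declared_size:
        raise SourceSizeMismatch(
            f"{source_id!r}: declared size {declared_size} != first seen {known}"
        )


seen = {}
check_declared_size(seen, "image:001", len(b"aaaa"))
# Same size, different content: passes, because bytes are never compared.
check_declared_size(seen, "image:001", len(b"bbbb"))

try:
    check_declared_size(seen, "image:001", len(b"ccccc"))
except SourceSizeMismatch as err:
    print("rejected:", err)
```

The second call shows the limit of the check: equal sizes are accepted even though the payloads differ, which is why correctness of source_id stays with the caller.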

The scope is intentionally data-file-local. Cross-fragment or cross-data-file physical sharing is not part of this contract. Inline blobs ignore source_id, and external reference mode does not materialize or deduplicate bytes.
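The scope rules can be summarized as a key function: two rows may share a descriptor only if they map to the same key. This is an illustrative sketch; `BlobKind`, `SHAREABLE`, and `sharing_key` are hypothetical names, and the data file names are made up.

```python
from enum import Enum


class BlobKind(Enum):
    """Hypothetical labels for how a blob value ends up being stored."""
    INLINE = "inline"
    PACKED = "packed"
    DEDICATED = "dedicated"
    EXTERNAL_REFERENCE = "external_reference"


# Only Lance-owned packed or dedicated payloads take part in sharing.
SHAREABLE = {BlobKind.PACKED, BlobKind.DEDICATED}


def sharing_key(data_file: str, kind: BlobKind, source_id):
    """Return the key under which a row may reuse a descriptor, or None."""
    if source_id is None or kind not in SHAREABLE:
        return None
    # Including the data file in the key keeps sharing data-file-local.
    return (data_file, source_id)


print(sharing_key("file-a", BlobKind.PACKED, "image:001"))
print(sharing_key("file-b", BlobKind.PACKED, "image:001"))
print(sharing_key("file-a", BlobKind.INLINE, "image:001"))
```

The first two keys differ because the data file differs, so no sharing happens across files; the inline row gets no key at all.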

Compatibility

The on-disk Blob v2 descriptor and raw packed sidecar layout are unchanged. Shared rows are represented by multiple descriptors pointing at the same (blob_id, position, size) range.
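To make the representation concrete, here is a small sketch that treats descriptors as plain (blob_id, position, size) tuples and compares logical bytes (summed over rows) with physical bytes (summed over distinct ranges). The values are made up, and the sketch assumes ranges are either identical or disjoint.

```python
# One tuple per row: (blob_id, position, size). Values are illustrative only.
descriptors = [
    (7, 0, 1_048_576),      # row 0: first materialization
    (7, 0, 1_048_576),      # row 1: shares row 0's range
    (7, 1_048_576, 2_048),  # row 2: a different blob in the same file
]

logical_bytes = sum(size for _, _, size in descriptors)
physical_bytes = sum(size for _, _, size in set(descriptors))

print(logical_bytes)   # 2099200
print(physical_bytes)  # 1050624
```

A reader needs no new logic here: each row resolves its own descriptor exactly as before, and shared rows simply resolve to the same byte range.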

source_id is a write-schema field only. It is consumed before encoding and is filtered from the manifest schema. Default reads still return the descriptor struct and never expose source_id.
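The filtering step can be sketched as dropping write-only fields from a list of field names. The names `f0` through `f3` are placeholders for the existing storage fields, which this description does not enumerate; `WRITE_ONLY_FIELDS` and `manifest_fields` are hypothetical helpers.

```python
# Fields that exist only on the write side and must not reach the manifest.
WRITE_ONLY_FIELDS = {"source_id"}


def manifest_fields(write_fields):
    """Return the field names that survive into the manifest schema."""
    return [name for name in write_fields if name not in WRITE_ONLY_FIELDS]


# "f0".."f3" are placeholder names, not the real Blob v2 field names.
write_fields = ["f0", "f1", "f2", "f3", "source_id"]
print(manifest_fields(write_fields))  # ['f0', 'f1', 'f2', 'f3']
```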

Rust keeps the historical default blob_field() shape. Python upgrades its default write storage type to include source_id, while deserialization preserves older 4-field IPC storage types.
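One way to picture the Python-side compatibility rule is a deserializer that accepts both storage shapes and returns whichever one it was given, rather than upgrading the older one. This is a sketch under assumptions: the field names are placeholders, and `accept_storage` is a hypothetical helper, not the actual extension-type code.

```python
# Placeholder field names; the real storage fields are not listed in this PR text.
LEGACY_FIELDS = ("f0", "f1", "f2", "f3")
CURRENT_FIELDS = LEGACY_FIELDS + ("source_id",)


def accept_storage(fields):
    """Accept either known storage shape and return it unchanged."""
    fields = tuple(fields)
    if fields in (LEGACY_FIELDS, CURRENT_FIELDS):
        return fields  # older 4-field storage is preserved, not upgraded
    raise ValueError(f"unsupported blob storage fields: {fields}")


print(len(accept_storage(LEGACY_FIELDS)))   # 4
print(len(accept_storage(CURRENT_FIELDS)))  # 5
```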

Validation covered Rust blob/object-layer tests, Lance schema conversion tests, Python blob tests, formatting, and diff whitespace checks.

github-actions bot added the A-python (Python bindings) and enhancement (New feature or request) labels on Jun 24, 2026
…-layer

# Conflicts:
#	rust/lance-core/src/datatypes/field.rs
#	rust/lance/src/blob.rs
#	rust/lance/src/dataset/blob.rs
codecov bot commented Jun 24, 2026

Codecov Report

❌ Patch coverage is 95.23026% with 29 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset/blob.rs | 95.83% | 9 Missing and 6 partials ⚠️ |
| rust/lance/src/blob.rs | 93.54% | 8 Missing and 6 partials ⚠️ |
