perf: narrow scan projection in merge_insert partial schema to avoid heavy column I/O by hfutatzhanghb · Pull Request #7443 · lance-format/lance

hfutatzhanghb · 2026-06-24T10:29:54Z

Summary

When performing `merge_insert` with a partial schema (e.g. only key + embedding columns for updates), the table scan in the join phase exposes all dataset columns. DataFusion's optimizer can normally push projections down, but the `target."column_name"` synthetic column references in the post-join loop effectively block this optimization: the planner sees references to every dataset column and must scan them all.

For large tables (1B rows, 20TB) where only 50K embedding values need updating, this means the join scans 20TB of data unnecessarily.

Fix

When a partial schema is detected (`source_field_names.len() < self.dataset.schema().fields.len()`), narrow the target scan to only `_rowid`, `_rowaddr`, and the on-columns before the join via `DataFrame::select_columns`. DataFusion pushes this projection down to the table scan, avoiding I/O on heavy data columns during matching.

The `target."column_name"` synthetic column references in the post-join loop are unaffected: DataFusion's projection pushdown operates at the physical-plan level and does not invalidate logical column references.
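The column selection described above can be sketched in plain Rust, independent of DataFusion. This is a minimal illustration, not the code in this PR: `narrowed_target_projection` and its parameters are hypothetical names, and only the length comparison and the `_rowid` / `_rowaddr` / on-column list come from the description.

```rust
/// Hypothetical helper: decide which target columns the join-phase scan needs.
/// Returns `None` for a full-schema merge (leave the scan unchanged) and
/// `Some(columns)` when the source covers only a subset of the dataset fields.
pub fn narrowed_target_projection(
    dataset_fields: &[&str],
    source_fields: &[&str],
    on_columns: &[&str],
) -> Option<Vec<String>> {
    // Partial-schema check, mirroring the length comparison in the description.
    if source_fields.len() >= dataset_fields.len() {
        return None;
    }
    // Row identity columns first, then the join keys, skipping duplicates.
    let mut projection = vec!["_rowid".to_string(), "_rowaddr".to_string()];
    for col in on_columns {
        if !projection.iter().any(|c| c.as_str() == *col) {
            projection.push((*col).to_string());
        }
    }
    Some(projection)
}
```

In the real code path the resulting list would presumably be what gets passed to `DataFrame::select_columns` on the target side before the join.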

Performance Impact

| | Before | After |
|---|---|---|
| Scan columns during join | All (20TB for 1B-row table) | `_rowid` + `_rowaddr` + on-columns (~16GB) |
| Reduction | n/a | ~1000x less I/O for the join phase |
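The ~16GB figure can be reproduced with a quick back-of-the-envelope calculation. The sketch below assumes `_rowid` and `_rowaddr` are each 8-byte unsigned integers and ignores the on-columns and any encoding or compression overhead; the function names are illustrative only.

```rust
/// Bytes read by the join-phase scan if only `_rowid` and `_rowaddr` are projected.
/// Assumption: both columns are 8 bytes per row; on-columns are not counted.
pub fn row_identity_scan_bytes(num_rows: u64) -> u64 {
    const ROW_ID_BYTES: u64 = 8;
    const ROW_ADDR_BYTES: u64 = 8;
    num_rows * (ROW_ID_BYTES + ROW_ADDR_BYTES)
}

/// Ratio of full-table bytes to narrowed-scan bytes.
pub fn io_reduction_factor(full_table_bytes: u64, narrowed_bytes: u64) -> f64 {
    full_table_bytes as f64 / narrowed_bytes as f64
}
```

For 1B rows this gives 16 GB (decimal units), and 20 TB / 16 GB is a factor of 1250, consistent with the "~1000x" estimate once the on-columns are added back in.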

Testing

- `cargo check -p lance --lib` passes
- `cargo fmt --all` passes

github-actions Bot added the performance label Jun 24, 2026

perf: skip target column synthetic loop for partial schema, fill missing cols in write exec

2cae88e

hfutatzhanghb force-pushed the codex/merge-insert-optimize-v2 branch from d6a4984 to 2cae88e Compare June 24, 2026 12:26

hfutatzhanghb wants to merge 1 commit into lance-format:main from hfutatzhanghb:codex/merge-insert-optimize-v2
