perf: narrow scan projection in merge_insert partial schema to avoid heavy column I/O #7443

Open: hfutatzhanghb wants to merge 1 commit into lance-format:main from hfutatzhanghb:codex/merge-insert-optimize-v2

@hfutatzhanghb

Summary

When performing merge_insert with a partial schema (e.g. only key + embedding columns for updates), the table scan in the join phase exposes all dataset columns. DataFusion's optimizer can normally push projections down, but the target."column_name" synthetic column references in the post-join loop effectively block this optimization: the planner sees references to every dataset column and must scan them all.

For large tables (1B rows, 20TB) where only 50K embedding values need updating, this means the join scans 20TB of data unnecessarily.

Fix

When a partial schema is detected (source_field_names.len() < self.dataset.schema().fields.len()), narrow the target scan to only _rowid, _rowaddr, and on-columns before the join via DataFrame::select_columns. DataFusion pushes this projection down to the table scan, avoiding I/O on heavy data columns during matching.
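The column selection described above can be sketched in plain Rust. This is a minimal illustration and not the PR's code: the function name `narrowed_target_columns` and its string-slice parameters are hypothetical, and the real change works on the Lance dataset schema and passes the resulting names to DataFrame::select_columns.

```rust
/// Hypothetical helper: returns the columns the target scan should keep,
/// or `None` when the source covers the full schema and no narrowing applies.
fn narrowed_target_columns(
    dataset_fields: &[&str],
    source_fields: &[&str],
    on_columns: &[&str],
) -> Option<Vec<String>> {
    // Same partial-schema test as in the description:
    // the source carries fewer fields than the dataset schema.
    if source_fields.len() >= dataset_fields.len() {
        return None;
    }

    // Row identity columns are always needed to locate matched rows.
    let mut cols = vec!["_rowid".to_string(), "_rowaddr".to_string()];

    // Add the join keys, skipping any name that is already present.
    for c in on_columns {
        if !cols.iter().any(|existing| existing.as_str() == *c) {
            cols.push(c.to_string());
        }
    }
    Some(cols)
}

fn main() {
    // Assumed example schema: a key, two heavy columns, and an embedding.
    let dataset = ["id", "text", "image", "embedding"];
    let source = ["id", "embedding"];

    let cols = narrowed_target_columns(&dataset, &source, &["id"]);
    println!("{:?}", cols); // prints Some(["_rowid", "_rowaddr", "id"])
}
```

In this sketch a full-schema source returns `None`, so the existing scan path would be left untouched in that case.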

The target."column_name" synthetic column references in the post-join loop are unaffected: DataFusion's projection pushdown operates at the physical-plan level and does not invalidate logical column references.

Performance Impact

| | Before | After |
|---|---|---|
| Scan columns during join | All (20TB for 1B-row table) | _rowid + _rowaddr + on-columns (~16GB) |
| Reduction | | ~1000x less I/O for the join phase |
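The figures above can be checked with a short back-of-envelope calculation. This sketch assumes that _rowid and _rowaddr are 8-byte values each and uses decimal units (1 GB = 10^9 bytes); the helper `scan_bytes` is hypothetical, and the width of the on-columns depends on the key type, so it is left out of the baseline.

```rust
/// Hypothetical helper: bytes read when scanning fixed-width columns for every row.
fn scan_bytes(rows: u64, bytes_per_row: u64) -> u64 {
    rows * bytes_per_row
}

fn main() {
    let rows: u64 = 1_000_000_000; // 1B-row table from the description
    let full_scan: u64 = 20_000_000_000_000; // 20 TB, decimal units

    // Assumption: _rowid and _rowaddr are 8 bytes each, on-columns excluded.
    let narrowed = scan_bytes(rows, 8 + 8);

    println!("narrowed scan = {} GB", narrowed / 1_000_000_000); // prints 16
    println!("reduction     = {}x", full_scan / narrowed); // prints 1250
}
```

Under these assumptions the two row-identity columns alone account for the ~16GB estimate. On-columns add to that: an 8-byte key, for example, would bring the narrowed scan to about 24GB, which is still within the same order of magnitude as the stated ~1000x reduction.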

Testing

  • cargo check -p lance --lib passes
  • cargo fmt --all passes

hfutatzhanghb force-pushed the codex/merge-insert-optimize-v2 branch from d6a4984 to 2cae88e on June 24, 2026 12:26