[python] Add nested-field projection on primary-key merge-read path#7801
Merged
JingsongLi merged 5 commits intoapache:masterfrom May 10, 2026
Merged
Conversation
Introduce a wrapper RecordReader that walks each configured nested name path on the inner InternalRow and emits a flat OffsetRow whose slots are the resolved sub-field values. Reuses the same per-batch row-reuse contract as InternalRowWrapperIterator. The reader is unused on its own; the next commit wires it into MergeFileSplitRead so primary-key tables can serve nested projections without the merge function losing its full ROW input.
The merge reader previously looked DataFields up by their merge-internal aliases (``_KEY_id``), but the file stores PK columns under their bare user-facing name (``id``). When ``with_projection`` narrowed the value schema enough that the bare PK field was no longer present in ``self.read_fields``, the lookup silently dropped the PK column from the format-reader request, producing rows with a column count that did not match the merge ``KeyValue`` layout and raising ``ValueError: Offset N plus arity M is out of row length L``. Build the lookup map from both the merge-internal names and the trimmed user-side names so the PK column stays reachable regardless of how value projection is narrowed.
Wire OuterProjectionRecordReader into the merge-read path. When the user projects nested sub-paths on a PK table (e.g. ``with_projection(['mv.latest_version'])``), the reader now widens the inner read_type back to the unique top-level fields touched by the nested paths so the merge function sees full ROW sub-structures, then extracts the requested sub-paths out the back via the wrapper. The previous NotImplementedError guard in ``TableRead._create_split_read`` is removed for the merge path. DataEvolution + nested still raises because that path needs a similar widen-then-extract treatment that hasn't been wired yet. Replaces the e2e test that pinned the guard with three positive cases covering single sub-path, multiple sub-paths under the same struct, and mixed top-level / nested projection on PK tables.
The detailed reasoning lives in the PR description; the file just needs to say what each block does.
12182ff to
369b2be
Compare
leaves12138
requested changes
May 10, 2026
Contributor
leaves12138
left a comment
There was a problem hiding this comment.
Thanks for the PR. The merge-read widening/extraction approach looks reasonable, and the new Parquet/ORC-style paths passed my targeted checks. I found one blocking gap in the Avro path: the alias-safe PK field lookup added for the other format readers is not applied to FormatAvroReader, so PK merge reads still fail when the user projection drops the PK column (and the new nested PK projection fails for Avro as well).
The Avro branch in ``file_reader_supplier`` still passed ``self.read_fields`` as ``full_fields`` to ``FormatAvroReader``, so a PK merge read whose value projection drops the bare PK column raised ``KeyError`` when the format reader tried to resolve the trimmed user-side name (``id`` / ``user_id``) against a map that only contained merge-internal aliases (``_KEY_id``). Pass the alias-safe union (``name_to_field.values()``) instead so the same fix already applied to Lance / Vortex / Parquet / ORC also covers Avro. Add Avro regressions for both PK value-only projection and PK nested projection.
Contributor
|
+1 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Purpose
Complete #7796: PK tables now serve nested-field projections through the merge reader.
Before this PR,
with_projection(['mv.latest_version'])on a PK table whose splitwent through the merge path raised
NotImplementedError.Approach
Inner reader keeps the full ROW sub-structure (so deduplicate / partial-update /
aggregation merge functions still see the original row). The new
OuterProjectionRecordReadersits above the merge unwrap and walks eachconfigured name path on the inner row to emit a flat OffsetRow matching the
user-visible read schema. This mirrors Java's NestedProjectedRowData split.
Commits
Add OuterProjectionRecordReader for nested-field PK reads— the wrapper +16-case unit test.
Fix primary-key read when value projection drops PK columns— pre-existingbug surfaced by this work: the PARQUET/ORC/Lance/Vortex format readers used
merge-internal aliases (
_KEY_id) to look up DataFields, but the filestores PK columns under their bare name (
id). When projection narrowedvalue fields enough to drop the bare PK from
self.read_fields, theformat reader silently skipped the PK column and the merge KeyValue layout
raised
Offset N plus arity M is out of row length L. Build the lookupfrom both name spaces.
Support nested-field projection on primary-key tables— wire commit 1into
MergeFileSplitReadand remove the guard inTableRead.Tests
OuterProjectionRecordReader(None handling, pathwalks, batch reuse contract, row-kind inheritance, empty-input rejection).
same struct, mixed top-level and nested ordering.
projection-utility + read-builder tests still pass.
Out of scope
NotImplementedError);needs the same widen-then-extract treatment, follow-up PR.
on ROW only).
PR-MERGE-AGGREGATION landing first.