Skip to content

[python] Add nested-field projection on primary-key merge-read path#7801

Merged
JingsongLi merged 5 commits intoapache:masterfrom
TheR1sing3un:py-pk-nested-projection-outer
May 10, 2026
Merged

[python] Add nested-field projection on primary-key merge-read path#7801
JingsongLi merged 5 commits intoapache:masterfrom
TheR1sing3un:py-pk-nested-projection-outer

Conversation

@TheR1sing3un
Copy link
Copy Markdown
Member

Purpose

Complete #7796: PK tables now serve nested-field projections through the merge reader.
Before this PR, with_projection(['mv.latest_version']) on a PK table whose split
went through the merge path raised NotImplementedError.

Approach

Inner reader keeps the full ROW sub-structure (so deduplicate / partial-update /
aggregation merge functions still see the original row). The new
OuterProjectionRecordReader sits above the merge unwrap and walks each
configured name path on the inner row to emit a flat OffsetRow matching the
user-visible read schema. This mirrors Java's NestedProjectedRowData split.

Commits

  1. Add OuterProjectionRecordReader for nested-field PK reads — the wrapper +
    16-case unit test.
  2. Fix primary-key read when value projection drops PK columns — pre-existing
    bug surfaced by this work: the PARQUET/ORC/Lance/Vortex format readers used
    merge-internal aliases (_KEY_id) to look up DataFields, but the file
    stores PK columns under their bare name (id). When projection narrowed
    value fields enough to drop the bare PK from self.read_fields, the
    format reader silently skipped the PK column and the merge KeyValue layout
    raised Offset N plus arity M is out of row length L. Build the lookup
    from both name spaces.
  3. Support nested-field projection on primary-key tables — wire commit 1
    into MergeFileSplitRead and remove the guard in TableRead.

Tests

  • 16 new unit tests for OuterProjectionRecordReader (None handling, path
    walks, batch reuse contract, row-kind inheritance, empty-input rejection).
  • 3 new e2e tests on PK tables: single sub-path, multiple sub-paths under the
    same struct, mixed top-level and nested ordering.
  • 1 new regression test on PK + value-only projection.
  • All existing append-only nested + reader_primary_key + reader_append_only +
    projection-utility + read-builder tests still pass.

Out of scope

  • DataEvolution + nested projection (still raises NotImplementedError);
    needs the same widen-then-extract treatment, follow-up PR.
  • ARRAY / MAP<X, ROW> nested projection (Java side also keeps NestedProjection
    on ROW only).
  • Aggregation merge engine + nested combo regression — depends on
    PR-MERGE-AGGREGATION landing first.

Introduce a wrapper RecordReader that walks each configured nested name
path on the inner InternalRow and emits a flat OffsetRow whose slots are
the resolved sub-field values. Reuses the same per-batch row-reuse
contract as InternalRowWrapperIterator.

The reader is unused on its own; the next commit wires it into
MergeFileSplitRead so primary-key tables can serve nested projections
without the merge function losing its full ROW input.
The merge reader previously looked DataFields up by their merge-internal
aliases (``_KEY_id``), but the file stores PK columns under their bare
user-facing name (``id``). When ``with_projection`` narrowed the value
schema enough that the bare PK field was no longer present in
``self.read_fields``, the lookup silently dropped the PK column from
the format-reader request, producing rows with a column count that
did not match the merge ``KeyValue`` layout and raising
``ValueError: Offset N plus arity M is out of row length L``.

Build the lookup map from both the merge-internal names and the
trimmed user-side names so the PK column stays reachable regardless
of how value projection is narrowed.
Wire OuterProjectionRecordReader into the merge-read path. When the
user projects nested sub-paths on a PK table (e.g.
``with_projection(['mv.latest_version'])``), the reader now widens the
inner read_type back to the unique top-level fields touched by the
nested paths so the merge function sees full ROW sub-structures, then
extracts the requested sub-paths out the back via the wrapper.

The previous NotImplementedError guard in ``TableRead._create_split_read``
is removed for the merge path. DataEvolution + nested still raises
because that path needs a similar widen-then-extract treatment that
hasn't been wired yet.

Replaces the e2e test that pinned the guard with three positive cases
covering single sub-path, multiple sub-paths under the same struct,
and mixed top-level / nested projection on PK tables.
The detailed reasoning lives in the PR description; the file just
needs to say what each block does.
@TheR1sing3un TheR1sing3un force-pushed the py-pk-nested-projection-outer branch from 12182ff to 369b2be Compare May 9, 2026 17:08
Copy link
Copy Markdown
Contributor

@leaves12138 leaves12138 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. The merge-read widening/extraction approach looks reasonable, and the new Parquet/ORC-style paths passed my targeted checks. I found one blocking gap in the Avro path: the alias-safe PK field lookup added for the other format readers is not applied to FormatAvroReader, so PK merge reads still fail when the user projection drops the PK column (and the new nested PK projection fails for Avro as well).

Comment thread paimon-python/pypaimon/read/split_read.py
The Avro branch in ``file_reader_supplier`` still passed
``self.read_fields`` as ``full_fields`` to ``FormatAvroReader``, so a
PK merge read whose value projection drops the bare PK column raised
``KeyError`` when the format reader tried to resolve the trimmed
user-side name (``id`` / ``user_id``) against a map that only
contained merge-internal aliases (``_KEY_id``).

Pass the alias-safe union (``name_to_field.values()``) instead so
the same fix already applied to Lance / Vortex / Parquet / ORC also
covers Avro. Add Avro regressions for both PK value-only projection
and PK nested projection.
@TheR1sing3un TheR1sing3un requested a review from leaves12138 May 10, 2026 03:24
@JingsongLi
Copy link
Copy Markdown
Contributor

+1

@JingsongLi JingsongLi merged commit fd60a8a into apache:master May 10, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants