
[ntuple] Support multiple column representations in the merger#22017

Open
silverweed wants to merge 13 commits into root-project:master from silverweed:ntuple_merge_colrep2


@silverweed
Contributor

@silverweed silverweed commented Apr 22, 2026

This pull request:

Significantly reworks the internals of the RNTupleMerger to support fast merging of fields with different but compatible column representations.
In short, it does two things:

  • turns all L3 merging cases into L2/L1.
  • no longer rejects merging fields with different column representations (previously this was only supported for representations that were the split/unsplit version of each other, and only via L3 merging).

A potentially negative consequence that we might want to revisit is that now the merger won't ever adapt the columns' splitness to the output compression (e.g. if merging changes the source compression from 0 to 505 it will still encode the columns as unsplit, and vice-versa). This will probably be readded in a future PR.

In order to achieve this, some new internal functionality had to be added, most notably RPagePersistentSink::AddColumnRepresentation.

Note that this PR is independent of #21740, which in fact might not be needed at all.

IMPORTANT

This PR introduces our first feature flag and thus the first bump to the specs' major version (1.1.0.0). This means we can now start producing RNTuples which cannot be read by older ROOT versions.

TODO

  • check if we need a feature flag for the changes in AddExtendedColumnRanges
  • add a test for merging of Real32Trunc/Quant columns with different bit width/value range
  • properly split the big merger commit
  • update Merging.md

Checklist:

  • tested changes locally
  • updated the docs (if necessary)

@silverweed silverweed requested a review from jblomer as a code owner April 22, 2026 15:07
@silverweed silverweed marked this pull request as draft April 22, 2026 15:07
@silverweed silverweed changed the title from "Ntuple merge colrep2" to "[ntuple] Support multiple column representations in the merger" Apr 22, 2026
@silverweed silverweed self-assigned this Apr 22, 2026
@silverweed silverweed force-pushed the ntuple_merge_colrep2 branch 3 times, most recently from b2ae5fc to 1db6b5e, April 22, 2026 15:22
@github-actions

github-actions Bot commented Apr 22, 2026

Test Results

    21 files      21 suites   3d 3h 27m 32s ⏱️
 3 862 tests  3 862 ✅ 0 💤 0 ❌
73 374 runs  73 374 ✅ 0 💤 0 ❌

Results for commit ab08930.

♻️ This comment has been updated with latest results.

@silverweed silverweed force-pushed the ntuple_merge_colrep2 branch 9 times, most recently from 82e5299 to d3efcfc, April 30, 2026 09:45
@silverweed silverweed force-pushed the ntuple_merge_colrep2 branch 2 times, most recently from 857d425 to 60b1bde, May 4, 2026 13:36
@silverweed silverweed force-pushed the ntuple_merge_colrep2 branch 3 times, most recently from 1d462ee to a249301, May 5, 2026 07:42
Instead of calling continue multiple times in the AddColumnFromField
loop, just return early in case of projected fields.

We are currently serializing columns per-field, but in case of late
column extension this might result in inconsistent sorting of the columns
in the serialized footer.

e.g. assume you have fields "A" and "B", both late model extended, both
with a single column:
    - col 0 -> field A, repr 0
    - col 1 -> field B, repr 0

Now you add a new column representation to field "A"; this new column
has id 2:
    - col 2 -> field A, repr 1

When serializing this RNTuple, all columns are written in the footer by
RNTupleSerialize::SerializeColumnsForFields(). Before this change, they
would end up on disk in order: [0, 2, 1].
This would corrupt the data by swapping the pages for columns 2 and 1.

After this change, they get written as [0, 1, 2] which is the correct
order.
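
The two orderings described above can be reproduced with a small standalone sketch. Everything in it is a hypothetical stand-in (the `ColumnInfo` struct, `OrderPerField`, `OrderByColumnId`, `CheckOrdering`); it is not the ROOT serializer code and only mirrors the field A / field B example from this commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified column record (NOT the ROOT descriptor type).
struct ColumnInfo {
   std::uint64_t fColumnId;  // id assigned when the column was created
   std::uint64_t fFieldId;   // field the column belongs to
   std::uint16_t fReprIndex; // index of the column representation
};

// Per-field traversal: emits every column of field 0, then of field 1, etc.
std::vector<std::uint64_t> OrderPerField(const std::vector<ColumnInfo> &cols, std::uint64_t nFields)
{
   std::vector<std::uint64_t> order;
   for (std::uint64_t fieldId = 0; fieldId < nFields; ++fieldId) {
      for (const auto &c : cols) {
         if (c.fFieldId == fieldId)
            order.push_back(c.fColumnId);
      }
   }
   return order;
}

// Id-ordered traversal: emits the columns sorted by their column id.
std::vector<std::uint64_t> OrderByColumnId(std::vector<ColumnInfo> cols)
{
   std::sort(cols.begin(), cols.end(),
             [](const ColumnInfo &a, const ColumnInfo &b) { return a.fColumnId < b.fColumnId; });
   std::vector<std::uint64_t> order;
   for (const auto &c : cols)
      order.push_back(c.fColumnId);
   return order;
}

// Field A has id 0, field B has id 1; column 2 is the representation added later to A.
void CheckOrdering()
{
   const std::vector<ColumnInfo> cols = {{0, 0, 0}, {1, 1, 0}, {2, 0, 1}};
   assert((OrderPerField(cols, 2) == std::vector<std::uint64_t>{0, 2, 1}));
   assert((OrderByColumnId(cols) == std::vector<std::uint64_t>{0, 1, 2}));
}
```

Grouping by field places column 2 ahead of column 1, while sorting by column id keeps the order in which the columns were created.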

Note that this exact case is tested in ntuple_merger in the unit test
MergeDeferredAdvanced.

Internal functionality to be used by the Merger.

This entails 2 additional changes:

- AddExtendedColumnRanges needs to be updated to handle the case where
a column representation is added to a field during writing after some
clusters have already been written;
- ShiftAliasColumns needs to properly shift the ids of extended alias
columns when called, otherwise a mismatch may happen when serializing
the footer.
@silverweed silverweed force-pushed the ntuple_merge_colrep2 branch from a249301 to ab08930, May 13, 2026 15:05
@silverweed silverweed marked this pull request as ready for review May 15, 2026 06:48
@pcanal
Member

pcanal commented May 23, 2026

> now the merger won't ever adapt the columns' splitness to the output compression (e.g. if merging changes the source compression from 0 to 505 it will still encode the columns as unsplit, and vice-versa). This will probably be readded in a future PR.

Indeed, we do need to provide a way for the user to request L3-type merging (this is less urgent if we already have a way to do L4-type merging).

Addendum: L4 is currently not supported, so re-adding L3 would be helpful, in particular to allow re-selecting the compression algorithm used.

@pcanal pcanal closed this May 23, 2026

| Flag Bit | Introduced in | Name | Meaning |
|----------|---------------|-------------------------|----------------------------------------------|
| 0 | 1.1.0.0 | Nested Deferred Columns | Signals that the RNTuple contains at least one deferred column that is part of a collection and was extended<br>(i.e. it appears in the footer). This can happen when merging two RNTuples that have the same collection field<br>backed by columns with different encoding, e.g. a `vector<float>` whose elements are represented by SplitReal32<br>in the first ntuple and by Real32 in the second. |
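
As a rough illustration of how a reader might act on such a flag, here is a minimal sketch. It assumes the flags fit in a single 64-bit word and that a reader should refuse files that set a flag it does not know. The names `kFlagNestedDeferredColumns`, `CanRead` and `CheckFlags` are hypothetical and are not part of ROOT; only the bit position comes from the table row above.

```cpp
#include <cassert>
#include <cstdint>

// Bit 0 as listed in the table; the constant name is hypothetical.
constexpr std::uint64_t kFlagNestedDeferredColumns = std::uint64_t(1) << 0;

// Returns true if every flag set in the file is also known to the reader.
// Assumption: a single 64-bit flag word, no continuation words.
constexpr bool CanRead(std::uint64_t fileFlags, std::uint64_t supportedFlags)
{
   return (fileFlags & ~supportedFlags) == 0;
}

void CheckFlags()
{
   // A reader that knows no flags can still read a file with no flags set.
   assert(CanRead(0, 0));
   // A reader that knows no flags must reject a file that sets bit 0.
   assert(!CanRead(kFlagNestedDeferredColumns, 0));
   // A reader that knows bit 0 accepts the same file.
   assert(CanRead(kFlagNestedDeferredColumns, kFlagNestedDeferredColumns));
}
```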
Member


This explicitly mentions collections. Is the feature (merging RNTuples whose 'same' column has different representations) not supported for simple types (i.e. just a `float` instead of a `vector<float>`)? If not, why not?

@pcanal
Member

pcanal commented May 23, 2026

I did not mean to close this.

@pcanal pcanal reopened this May 23, 2026
2 participants