[ntuple] Support multiple column representations in the merger #22017
silverweed wants to merge 13 commits into
Test Results: 21 files, 21 suites, 3d 3h 27m 32s. Results for commit ab08930.
Instead of calling continue multiple times in the AddColumnFromField loop, just return early for projected fields.
We are currently serializing columns per-field, but in case of late
column extension this might result in inconsistent sorting of the columns
in the serialized footer.
e.g. assume you have fields "A" and "B", both late model extended, both
with a single column:
- col 0 -> field A, repr 0
- col 1 -> field B, repr 0
Now you add a new column representation to field "A"; this new column
has id 2:
- col 2 -> field A, repr 1
When serializing this RNTuple, all columns are written in the footer by
RNTupleSerialize::SerializeColumnsForFields(). Before this change, they
would end up on disk in order: [0, 2, 1].
This would corrupt the data by swapping the pages for columns 2 and 1.
After this change, they get written as [0, 1, 2] which is the correct
order.
Note that this exact case is tested in ntuple_merger in the unit test
MergeDeferredAdvanced.
Also fix the type of result
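The ordering problem described in this commit message can be reproduced with a small stand-alone sketch. `ColumnInfo`, `ExampleColumns`, `OrderPerField` and `OrderByColumnId` are hypothetical names used only for illustration; they are not ROOT APIs, and the actual serializer code is more involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a column descriptor (not the ROOT class).
struct ColumnInfo {
   std::uint64_t fColumnId;  // on-disk column id, assigned in creation order
   std::uint64_t fFieldId;   // id of the field that owns the column
   std::uint32_t fReprIndex; // column representation index within the field
};

// The example from the commit message: fields A (id 0) and B (id 1),
// then a second representation added to A afterwards.
std::vector<ColumnInfo> ExampleColumns()
{
   return {{0, 0, 0}, {1, 1, 0}, {2, 0, 1}};
}

// Per-field traversal: all columns of field 0 first, then field 1, and so on.
std::vector<std::uint64_t> OrderPerField(const std::vector<ColumnInfo> &columns, std::uint64_t nFields)
{
   std::vector<std::uint64_t> order;
   for (std::uint64_t fieldId = 0; fieldId < nFields; ++fieldId) {
      for (const auto &col : columns) {
         if (col.fFieldId == fieldId)
            order.push_back(col.fColumnId);
      }
   }
   return order;
}

// Traversal sorted by column id, independent of field ownership.
std::vector<std::uint64_t> OrderByColumnId(std::vector<ColumnInfo> columns)
{
   std::sort(columns.begin(), columns.end(),
             [](const ColumnInfo &a, const ColumnInfo &b) { return a.fColumnId < b.fColumnId; });
   std::vector<std::uint64_t> order;
   for (const auto &col : columns) {
      // Column ids are expected to be unique.
      assert(order.empty() || order.back() != col.fColumnId);
      order.push_back(col.fColumnId);
   }
   return order;
}
```

With the three example columns, the per-field traversal yields the ids in the order 0, 2, 1, while the id-sorted traversal yields 0, 1, 2, matching the before and after described in the message.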
Internal functionality to be used by the Merger. This entails 2 additional changes:
- AddExtendedColumnRanges needs to be updated to handle the case where a column representation is added to a field during writing after some clusters have already been written;
- ShiftAliasColumns needs to properly shift the ids of extended alias columns when called, otherwise a mismatch may happen when serializing the footer.
Indeed, we do need to provide a way for the user to request the L3 type of merging (this is less urgent if we already have a way to request the L4 type of merging). Addendum: L4 is currently not supported, so re-adding L3 would be helpful, in particular to allow re-selecting the compression algorithm used.
| Flag Bit | Introduced in | Name                    | Meaning                                      |
|----------|---------------|-------------------------|----------------------------------------------|
| 0        | 1.1.0.0       | Nested Deferred Columns | Signals that the RNTuple contains at least one deferred column that is part of a collection and was extended<br>(i.e. it appears in the footer). This can happen when merging two RNTuples that have the same collection field<br>backed by columns with different encoding, e.g. a `vector<float>` whose elements are represented by SplitReal32<br>in the first ntuple and by Real32 in the second. |
This explicitly mentions collections. Is the feature (merging RNTuples that have the 'same' column with different representations) not supported for simple types (i.e. just a `float` instead of a `vector<float>`)? If not, why not?
I did not mean to close this.
This Pull request:
Significantly reworks the innards of the RNTupleMerger to support fast merging of fields with different but compatible column representations.
Basically it does two things:
A potentially negative consequence that we might want to revisit is that the merger will no longer adapt the columns' splitness to the output compression (e.g. if merging changes the compression from 0 to 505, the columns will still be encoded as unsplit, and vice versa). This will probably be re-added in a future PR.
In order to achieve this, some new internal functionality had to be added, most notably `RPagePersistentSink::AddColumnRepresentation`.
Note that this PR is independent of #21740, which in fact might not be needed at all.
IMPORTANT
This PR introduces our first feature flag and thus the first bump to the specs' major version (1.1.0.0). This means we can now start producing RNTuples which cannot be read by older ROOT versions.
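As a rough illustration of what a feature flag means for readers, the sketch below tests bit 0 (the position listed in the flag table of this PR) in a 64-bit flag word and rejects files carrying flags the reader does not implement. `HasFeatureFlag`, `CanRead`, `kNestedDeferredColumnsBit` and `supportedMask` are hypothetical names, and the single-word layout is a simplifying assumption; this is not the ROOT implementation.

```cpp
#include <cstdint>

// Bit position taken from the flag table in this PR; the constant name is hypothetical.
constexpr unsigned kNestedDeferredColumnsBit = 0;

// Returns true if the given bit is set in a 64-bit feature flag word.
constexpr bool HasFeatureFlag(std::uint64_t flags, unsigned bit)
{
   return bit < 64 && ((flags >> bit) & 1u) != 0;
}

// Assumption: a reader refuses a file if any set flag is outside the set it
// implements. `supportedMask` has a 1 for every flag bit the reader supports.
constexpr bool CanRead(std::uint64_t flags, std::uint64_t supportedMask)
{
   return (flags & ~supportedMask) == 0;
}
```

Under these assumptions, an older reader with an empty `supportedMask` would reject any file whose flag word has bit 0 set, which is the incompatibility this note warns about.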
TODO
AddExtendedColumnRanges
Checklist: