Fix MERGE_M2 for extreme finite partial means#22393
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes numerical edge cases in the MERGE_M2 groupby aggregation when merging partial states with extreme (but finite) means, preventing NaN production when the accumulator is still empty and aligning the merge update with a numerically safer formulation.
Changes:
- Treat merging the first non-empty partial into an empty accumulator as an identity operation to avoid
inf * 0 -> NaN. - Update non-empty merge math to use a central-moment form (
delta_n = delta / new_n) for improved numerical stability. - Add groupby regression tests covering extreme finite partials (identity case) and extreme+finite merges (expected
m2 = +inf).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
cpp/src/groupby/sort/group_merge_m2.cu |
Adds an early identity-path for empty accumulators and updates the merge formula to a safer central-moment form. |
cpp/tests/groupby/merge_m2_tests.cpp |
Adds regression tests for extreme finite means to ensure MERGE_M2 does not produce NaN and behaves as expected. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: Allen Xu <allxu@nvidia.com>
ad5ba04 to
c12348e
Compare
…l struct Cover both int64_t and double count columns for the MERGE_M2 extreme-finite cases. Spark stores the count as FLOAT64, which the original two TEST_F variants did not exercise. Strengthen the assertion to compare the full result struct (counts, means, m2) rather than only the m2 child column. Signed-off-by: Allen Xu <allxu@nvidia.com>
|
/ok to test 8b781d9 |
|
/ok to test c331016 |
|
@wjxiz1992 Are you planning to resolve the Copilot review comments? |
| // Merging an empty accumulator with a non-empty partial is an identity operation. Running | ||
| // the generic formula for this case can evaluate inf * 0 and turn extreme finite partials | ||
| // into NaN. | ||
| if (n == 0) { |
There was a problem hiding this comment.
yes but what if the input mean is literally infinity? or it's a NaN? then it should return NaN right? You should also check std::isfinite() here. Or am I misunderstanding what merge m2 is trying to do.
There was a problem hiding this comment.
Walking through the cases:
partial_avg = +Inf (with partial_n > 0): the identity branch propagates avg = +Inf, m2 = partial_m2 as-is. The old generic path produced NaN here via delta * delta_n * n * partial_n = +Inf * +Inf * 0 * partial_n = inf*0 — same inf*0=NaN side effect this PR is fixing. Propagating +Inf preserves the upstream "overflowed" signal; coercing to NaN would discard it.
partial_avg = NaN: identity sets avg = NaN; any subsequent merge step propagates NaN through the generic formula (NaN ⊕ anything = NaN). Final result is NaN regardless of partial position, as expected.
In practice Spark's CentralMomentAgg doesn't emit (count, +Inf, m2_finite) partials — Welford hits +Inf - +Inf = NaN on the first overflowing row, so the partial becomes (count, NaN, NaN). So the "+Inf avg" case really only shows up for direct callers of MERGE_M2 with hand-crafted partials, and for those propagation is strictly more informative than coercion.
I pushed 5d917711 (now 071266d after rebase) with regression tests pinning these semantics: NanMeanFirstPartial, InfMeanFirstPartial, and NanMeanMergedWithFinite for both INT64 and FLOAT64 count types. Let me know if there's a Spark scenario where NaN coercion is actually wanted — I'm not seeing one.
Add regression tests showing the identity branch propagates non-finite partial means as-is, instead of coercing them to NaN. Covers single NaN-mean partial, single +Inf-mean partial, and NaN-mean merged with a finite partial. Both INT64 and FLOAT64 count types are covered. Signed-off-by: Allen Xu <allxu@nvidia.com>
|
@davidwendt the two Copilot review comments are now resolved (replied inline) — both were already addressed in 8b781d9 (full struct comparison, plus FLOAT64-count variants for the Spark path). Also pushed 071266d with regression tests for NaN/Inf partial means in response to @pmattione-nvidia. |
|
/ok to test 071266d |
@wjxiz1992, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/ |
|
ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughSummary by CodeRabbit
WalkthroughThe PR fixes a NaN propagation bug in MERGE_M2 aggregation for M2 partial states. A special case was added to the merge logic to directly assign the first non-empty partial's values when the accumulator is empty, preventing inf * 0 overflow. Tests validate extreme finite means, NaN, and Inf propagation across single and merged partials. ChangesM2 Merge Extreme Values
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
|
/ok to test f49c069 |
|
/merge |
Closes rapidsai#22391. `MERGE_M2` now treats merging the first non-empty partial into an empty accumulator as an identity operation. This avoids evaluating the generic merge formula with `n == 0`, where an extreme finite mean can make `delta * delta` overflow to `inf` and then produce `NaN` via `inf * 0`. For non-empty merges, the update now uses the central-moment form with `delta_n = delta / new_n`, matching the numerically safer order used by Spark's CPU implementation. Added groupby tests for: - a single extreme finite partial, which should preserve `m2 = 0.0` - merging an extreme finite partial with another finite partial, which should produce `m2 = +inf` Local validation: ```text cmake -S cpp -B cpp/build \ -DCMAKE_INSTALL_PREFIX=/home/allxu/work/spark-set/cudf-14681-merge-m2/cpp/build/install \ -DCMAKE_CUDA_ARCHITECTURES=NATIVE \ -DUSE_NVTX=ON \ -DBUILD_TESTS=ON \ -DBUILD_BENCHMARKS=OFF \ -DDISABLE_DEPRECATION_WARNINGS=ON \ -DCMAKE_BUILD_TYPE=Release \ -DZLIB_INCLUDE_DIR=/home/allxu/.local/lib/python3.12/site-packages/lxml/includes/extlibs \ -DZLIB_LIBRARY=/usr/lib/x86_64-linux-gnu/libz.so.1 ``` ```text cmake --build cpp/build --target GROUPBY_TEST -j12 [100%] Built target GROUPBY_TEST ``` ```text ./cpp/build/gtests/GROUPBY_TEST --gtest_filter='GroupbyMergeM2*' --gtest_color=no [==========] 44 tests from 7 test suites ran. (134 ms total) [ PASSED ] 44 tests. ``` ```text cmake --build cpp/build --target generate_ctest_json -j12 cmake --build cpp/build --target cudf_identify_stream_usage_mode_cudf -j12 ctest --test-dir cpp/build -R '^GROUPBY_TEST$' --output-on-failure 100% tests passed, 0 tests failed out of 2 ``` Authors: - Allen Xu (https://github.com/wjxiz1992) Approvers: - Paul Mattione (https://github.com/pmattione-nvidia) - David Wendt (https://github.com/davidwendt) URL: rapidsai#22393
Description
Closes #22391.
MERGE_M2now treats merging the first non-empty partial into an empty accumulator as an identity operation. This avoids evaluating the generic merge formula withn == 0, where an extreme finite mean can makedelta * deltaoverflow toinfand then produceNaNviainf * 0.For non-empty merges, the update now uses the central-moment form with
delta_n = delta / new_n, matching the numerically safer order used by Spark's CPU implementation.Added groupby tests for:
m2 = 0.0m2 = +infLocal validation:
Checklist