
Fix MERGE_M2 for extreme finite partial means #22393

Merged
rapids-bot[bot] merged 5 commits into main from fix/14681-merge-m2-extreme on May 8, 2026

Conversation

@wjxiz1992
Member

Description

Closes #22391.

MERGE_M2 now treats merging the first non-empty partial into an empty accumulator as an identity operation. This avoids evaluating the generic merge formula with n == 0, where an extreme finite mean can make delta * delta overflow to inf and then produce NaN via inf * 0.

For non-empty merges, the update now uses the central-moment form with delta_n = delta / new_n, matching the numerically safer order used by Spark's CPU implementation.
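To make the two changes concrete, here is a minimal host-side C++ sketch of the cross term computed in each order. The function names and the exact shape of the pre-fix expression are assumptions for illustration (only the `delta * delta` and `delta_n = delta / new_n` orderings come from the description above), and this is not the libcudf kernel itself.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical reconstruction of the pre-fix cross term: delta is squared
// first, so |delta| above roughly 1.34e154 overflows to +inf before scaling.
inline double legacy_cross_term(double delta, double n, double partial_n)
{
  double const new_n = n + partial_n;
  assert(new_n > 0);  // sketch precondition: at least one side is non-empty
  return delta * delta * n * partial_n / new_n;
}

// Central-moment order named in the description: divide by new_n before the
// second multiplication, which keeps the intermediate product smaller.
inline double central_moment_cross_term(double delta, double n, double partial_n)
{
  double const new_n = n + partial_n;
  assert(new_n > 0);  // sketch precondition: at least one side is non-empty
  double const delta_n = delta / new_n;
  return delta * delta_n * n * partial_n;
}
```

Under these assumptions, with `n == 0` and `delta = 1e200` both orders evaluate `inf * 0` and return NaN, which is why the empty-accumulator case needs its own identity branch. With `n == partial_n == 1` and `delta = 1.5e154`, the legacy order overflows to `+inf` while the central-moment order stays finite.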

Added groupby tests for:

  • a single extreme finite partial, which should preserve m2 = 0.0
  • merging an extreme finite partial with another finite partial, which should produce m2 = +inf
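The two expectations listed above can be reproduced with a small host-side sketch of the merge. The `M2Partial` struct, the `merge_m2` and `merge_all` names, the skip of empty partials, and the mean update are assumptions of this sketch, modeled on the central-moment form the description mentions; the actual kernel and the values used in the real tests may differ.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical partial state (count, mean, M2); field names are illustrative.
struct M2Partial {
  double n;
  double avg;
  double m2;
};

// Sketch of one merge step under the assumptions stated in the lead-in.
inline M2Partial merge_m2(M2Partial acc, M2Partial const& part)
{
  if (part.n == 0) { return acc; }  // assumption: empty partials contribute nothing
  if (acc.n == 0) { return part; }  // identity branch: copy the first non-empty partial
  double const new_n   = acc.n + part.n;
  double const delta   = part.avg - acc.avg;
  double const delta_n = delta / new_n;
  return {new_n,
          acc.avg + delta_n * part.n,
          acc.m2 + part.m2 + delta * delta_n * acc.n * part.n};
}

// Fold a sequence of partials into an initially empty accumulator.
inline M2Partial merge_all(std::vector<M2Partial> const& partials)
{
  M2Partial acc{0.0, 0.0, 0.0};
  for (auto const& p : partials) { acc = merge_m2(acc, p); }
  return acc;
}
```

In this sketch, a single partial such as `(1, DBL_MAX, 0.0)` comes back unchanged with `m2 = 0.0`, while merging the illustrative pair `(1, 1e200, 0.0)` and `(1, 0.0, 0.0)` gives `m2 = +inf`, because the true cross term exceeds the range of a double.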

Local validation:

cmake -S cpp -B cpp/build \
  -DCMAKE_INSTALL_PREFIX=/home/allxu/work/spark-set/cudf-14681-merge-m2/cpp/build/install \
  -DCMAKE_CUDA_ARCHITECTURES=NATIVE \
  -DUSE_NVTX=ON \
  -DBUILD_TESTS=ON \
  -DBUILD_BENCHMARKS=OFF \
  -DDISABLE_DEPRECATION_WARNINGS=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DZLIB_INCLUDE_DIR=/home/allxu/.local/lib/python3.12/site-packages/lxml/includes/extlibs \
  -DZLIB_LIBRARY=/usr/lib/x86_64-linux-gnu/libz.so.1
cmake --build cpp/build --target GROUPBY_TEST -j12
[100%] Built target GROUPBY_TEST
./cpp/build/gtests/GROUPBY_TEST --gtest_filter='GroupbyMergeM2*' --gtest_color=no
[==========] 44 tests from 7 test suites ran. (134 ms total)
[  PASSED  ] 44 tests.
cmake --build cpp/build --target generate_ctest_json -j12
cmake --build cpp/build --target cudf_identify_stream_usage_mode_cudf -j12
ctest --test-dir cpp/build -R '^GROUPBY_TEST$' --output-on-failure
100% tests passed, 0 tests failed out of 2

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Copilot AI review requested due to automatic review settings May 6, 2026 04:21
@wjxiz1992 wjxiz1992 requested a review from a team as a code owner May 6, 2026 04:21
@wjxiz1992 wjxiz1992 requested review from davidwendt and mythrocks May 6, 2026 04:21
@copy-pr-bot

copy-pr-bot Bot commented May 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the libcudf Affects libcudf (C++/CUDA) code. label May 6, 2026

Copilot AI left a comment

Pull request overview

This PR fixes numerical edge cases in the MERGE_M2 groupby aggregation when merging partial states with extreme (but finite) means, preventing NaN production when the accumulator is still empty and aligning the merge update with a numerically safer formulation.

Changes:

  • Treat merging the first non-empty partial into an empty accumulator as an identity operation to avoid inf * 0 -> NaN.
  • Update non-empty merge math to use a central-moment form (delta_n = delta / new_n) for improved numerical stability.
  • Add groupby regression tests covering extreme finite partials (identity case) and extreme+finite merges (expected m2 = +inf).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| cpp/src/groupby/sort/group_merge_m2.cu | Adds an early identity path for empty accumulators and updates the merge formula to a safer central-moment form. |
| cpp/tests/groupby/merge_m2_tests.cpp | Adds regression tests for extreme finite means to ensure MERGE_M2 does not produce NaN and behaves as expected. |

Comment thread cpp/tests/groupby/merge_m2_tests.cpp Outdated
Comment thread cpp/tests/groupby/merge_m2_tests.cpp Outdated
@wjxiz1992 wjxiz1992 added bug Something isn't working non-breaking Non-breaking change labels May 6, 2026
Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992 wjxiz1992 force-pushed the fix/14681-merge-m2-extreme branch from ad5ba04 to c12348e Compare May 6, 2026 05:01
…l struct

Cover both int64_t and double count columns for the MERGE_M2 extreme-finite cases. Spark stores the count as FLOAT64, which the original two TEST_F variants did not exercise. Strengthen the assertion to compare the full result struct (counts, means, m2) rather than only the m2 child column.

Signed-off-by: Allen Xu <allxu@nvidia.com>
@davidwendt
Contributor

/ok to test 8b781d9

@wjxiz1992
Member Author

/ok to test c331016

@davidwendt
Contributor

@wjxiz1992 Are you planning to resolve the Copilot review comments?

// Merging an empty accumulator with a non-empty partial is an identity operation. Running
// the generic formula for this case can evaluate inf * 0 and turn extreme finite partials
// into NaN.
if (n == 0) {
Contributor

@pmattione-nvidia pmattione-nvidia May 7, 2026

Yes, but what if the input mean is literally infinity, or NaN? Then it should return NaN, right? You should also check std::isfinite() here. Or am I misunderstanding what MERGE_M2 is trying to do?

Member Author

Walking through the cases:

partial_avg = +Inf (with partial_n > 0): the identity branch propagates avg = +Inf, m2 = partial_m2 as-is. The old generic path produced NaN here: delta * delta_n * n * partial_n = +Inf * +Inf * 0 * partial_n, which contains inf * 0 = NaN, the same side effect this PR is fixing. Propagating +Inf preserves the upstream "overflowed" signal; coercing to NaN would discard it.

partial_avg = NaN: identity sets avg = NaN; any subsequent merge step propagates NaN through the generic formula (NaN ⊕ anything = NaN). Final result is NaN regardless of partial position, as expected.

In practice Spark's CentralMomentAgg doesn't emit (count, +Inf, m2_finite) partials — Welford hits +Inf - +Inf = NaN on the first overflowing row, so the partial becomes (count, NaN, NaN). So the "+Inf avg" case really only shows up for direct callers of MERGE_M2 with hand-crafted partials, and for those propagation is strictly more informative than coercion.

I pushed 5d917711 (now 071266d after rebase) with regression tests pinning these semantics: NanMeanFirstPartial, InfMeanFirstPartial, and NanMeanMergedWithFinite for both INT64 and FLOAT64 count types. Let me know if there's a Spark scenario where NaN coercion is actually wanted — I'm not seeing one.
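The propagation semantics described in this reply can be illustrated with a host-side sketch templated on the count type. The `Partial` struct and this `merge_m2` are hypothetical stand-ins, not the libcudf kernel or its test helpers, and the skip of empty partials and the mean update are assumptions of the sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical partial state; CountT stands in for an INT64 or FLOAT64 count.
template <typename CountT>
struct Partial {
  CountT n;
  double avg;
  double m2;
};

// Sketch of one merge step: the identity branch copies the partial verbatim,
// so a non-finite mean is propagated rather than coerced.
template <typename CountT>
Partial<CountT> merge_m2(Partial<CountT> acc, Partial<CountT> const& part)
{
  if (part.n == 0) { return acc; }  // assumption: empty partials contribute nothing
  if (acc.n == 0) { return part; }  // identity branch
  CountT const new_n   = acc.n + part.n;
  double const n_a     = static_cast<double>(acc.n);
  double const n_b     = static_cast<double>(part.n);
  double const delta   = part.avg - acc.avg;
  double const delta_n = delta / static_cast<double>(new_n);
  return {new_n, acc.avg + delta_n * n_b, acc.m2 + part.m2 + delta * delta_n * n_a * n_b};
}
```

In this sketch, merging `(2, +Inf, 1.5)` into an empty accumulator returns it unchanged, and a NaN mean yields a NaN mean and NaN `m2` after any later merge with a finite partial, whichever of the two arrives first.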

Add regression tests showing the identity branch propagates non-finite partial means as-is, instead of coercing them to NaN. Covers single NaN-mean partial, single +Inf-mean partial, and NaN-mean merged with a finite partial. Both INT64 and FLOAT64 count types are covered.

Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992
Member Author

@davidwendt the two Copilot review comments are now resolved (replied inline) — both were already addressed in 8b781d9 (full struct comparison, plus FLOAT64-count variants for the Spark path). Also pushed 071266d with regression tests for NaN/Inf partial means in response to @pmattione-nvidia.

@wjxiz1992
Member Author

/ok to test 071266d

@copy-pr-bot

copy-pr-bot Bot commented May 8, 2026

/ok to test 071266d

@wjxiz1992, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@coderabbitai

coderabbitai Bot commented May 8, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
📥 Commits

Reviewing files that changed from the base of the PR and between 65df106 and 071266d.

📒 Files selected for processing (2)
  • cpp/src/groupby/sort/group_merge_m2.cu
  • cpp/tests/groupby/merge_m2_tests.cpp

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes

    • Fixed merge logic for group-by aggregations to correctly handle empty accumulators. Previously, merging an empty accumulator with a non-empty partial could produce NaN when the partial's mean was extreme; the first non-empty partial's values are now assigned directly.
  • Tests

    • Added test coverage for extreme and non-finite (NaN/Inf) values in M2 aggregation merge operations.

Walkthrough

The PR fixes a bug in which MERGE_M2 could produce NaN when merging M2 partial states. A special case in the merge logic now assigns the first non-empty partial's values directly when the accumulator is empty, so the generic formula never evaluates inf * 0 (an invalid operation that yields NaN, not an overflow). Tests validate extreme finite means, NaN, and Inf propagation across single and merged partials.

Changes

M2 Merge Extreme Values

| Layer / File(s) | Summary |
|---|---|
| Core Implementation: cpp/src/groupby/sort/group_merge_m2.cu | Added a special-case branch in merge_fn::operator() that directly copies partial_n, partial_avg, and partial_m2 into the accumulator when the running count n is zero, preventing the generic formula from computing inf * 0 = NaN. |
| Test Validation: cpp/tests/groupby/merge_m2_tests.cpp | Added #include <limits>, templated helper functions (test_extreme_finite_first_partial, test_extreme_finite_merged_partials, test_nan_mean_first_partial, test_inf_mean_first_partial, test_nan_mean_merged_with_finite), new fixture GroupbyMergeM2ExtremeTest, and concrete TEST_F cases for int64_t and double types covering extreme finite, NaN, and Inf mean propagation. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically identifies the main fix: addressing MERGE_M2 behavior for extreme finite partial means, which is the core issue from #22391. |
| Description check | ✅ Passed | The description provides comprehensive context including the issue link, technical explanation of the problem and solution, and validation evidence through test execution. |
| Linked Issues check | ✅ Passed | The PR implements all key requirements from #22391: short-circuiting when n==0 to avoid NaN, using central-moment form for safety, and adding comprehensive tests for extreme finite values. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing MERGE_M2 behavior and adding corresponding tests; no unrelated modifications are present. |

@wjxiz1992
Member Author

/ok to test f49c069

@davidwendt
Contributor

/merge

@rapids-bot rapids-bot Bot merged commit 0a1620e into main May 8, 2026
465 of 495 checks passed
shrshi pushed a commit to shrshi/cudf that referenced this pull request May 12, 2026
Closes rapidsai#22391.

Authors:
  - Allen Xu (https://github.com/wjxiz1992)

Approvers:
  - Paul Mattione (https://github.com/pmattione-nvidia)
  - David Wendt (https://github.com/davidwendt)

URL: rapidsai#22393
Labels

bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change

Development

Successfully merging this pull request may close these issues.

MERGE_M2 returns NaN when first partial has an extreme finite mean

4 participants