[AutoSparkUT] Fix std variance floating overflow coverage by wjxiz1992 · Pull Request #14762 · NVIDIA/spark-rapids

wjxiz1992 · 2026-05-09T03:22:35Z

This follows the discussion in this issue comment to keep strict std/variance coverage for common floating-point inputs and split the overflow-oriented NaN/Inf cases into a smaller documented test set.

Upstream dependency:

cuDF MERGE_M2 fix: rapidsai/cudf#22393 (0a1620e5b3), which adds the first non-empty partial identity path for extreme finite means. spark-rapids-jni origin/main already points thirdparty/cudf at this fix.
Scope split: the cuDF fix covers the native GroupByAggregation.mergeM2() path. The Scala change in this PR is still needed for CudfMergeM2.reductionAggregate, the host-side/global reduction merge path, because that path copies partial buffers to the JVM and does not call cuDF MERGE_M2. Without the Scala identity branch, the first non-empty partial can still evaluate the generic formula with mergeN == 0, causing delta * delta to overflow to Inf and then Inf * 0 to become NaN.
The Scala merge formula is also reordered to match Spark CPU CentralMomentAgg (deltaN = delta / newN; m2 += delta * deltaN * n1 * n2), reducing remaining CPU/GPU differences that are purely caused by merge-order arithmetic.
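
To illustrate the failure mode described above, here is a minimal Python sketch of the pairwise (n, mean, M2) merge. It is not the plugin's Scala code: the function names `merge_generic` and `merge_with_identity` are hypothetical, and the accumulator is assumed to start at (0, 0, 0). Only the merge order `deltaN = delta / newN; m2 += delta * deltaN * n1 * n2` is taken from the description.

```python
import math

def merge_generic(n1, mean1, m2_1, n2, mean2, m2_2):
    """Pairwise merge in the order quoted above (hypothetical helper)."""
    new_n = n1 + n2
    delta = mean2 - mean1
    delta_n = delta / new_n
    new_mean = mean1 + delta_n * n2
    # With n1 == 0 and a huge mean, delta * delta_n overflows to Inf,
    # and Inf * 0.0 is NaN under IEEE 754.
    new_m2 = m2_1 + m2_2 + delta * delta_n * n1 * n2
    return new_n, new_mean, new_m2

def merge_with_identity(acc, partial):
    """Same merge, but empty sides are passed through unchanged."""
    if partial[0] == 0.0:
        return acc
    if acc[0] == 0.0:
        # First non-empty partial: take it as-is, no arithmetic at all.
        return partial
    return merge_generic(*acc, *partial)

empty = (0.0, 0.0, 0.0)
first = (2.0, 1e200, 0.0)  # extreme but finite mean

broken = merge_generic(*empty, *first)
fixed = merge_with_identity(empty, first)
print(math.isnan(broken[2]), fixed)
```

For ordinary inputs both functions agree, so the identity branch only changes results where the generic formula would otherwise manufacture a NaN from finite partials.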

Changes:

Match Spark's central-moment merge order in the host-side M2 reduction merge, including an identity path for the first non-empty partial.
Split std/variance integration coverage into common finite floating-point cases with strict comparison and extreme floating-point cases with a scoped NaN/Inf overflow allowance.
Pass the existing generic result_canonicalize_func_before_compare hook through the SQL assertion helper and keep std/variance overflow canonicalization local to the extreme-input test.
Tighten post-review test plumbing by making the extreme FP generators finite-only with no_nans=True, so NaN/Inf output is attributable to overflow rather than input special cases. The broader dict/map comparison cleanup was kept out of this PR after Blossom exposed unrelated existing map/JSON/ORC differences.

Performance impact:
The Scala change runs only in the host-side reductionAggregate merge over partial aggregate rows. Complexity and allocation behavior are unchanged; the first non-empty partial now takes an identity branch and later partials use scalar double arithmetic equivalent to Spark's merge order. No GPU kernel or per-row expression hot path is changed.

Validation:

python -m py_compile integration_tests/src/main/python/asserts.py integration_tests/src/main/python/hash_aggregate_test.py
git diff --check
mvn package -DskipTests -pl dist,integration_tests -am -Dbuildver=330 -Dmaven.repo.local=./.mvn-repo -s jenkins/settings.xml -P mirror-apache-to-urm: BUILD SUCCESS
TESTS='hash_aggregate_test.py::test_std_variance hash_aggregate_test.py::test_std_variance_extreme_floating_point' TEST_PARALLEL=1 DATAGEN_SEED=1777180076 ./integration_tests/run_pyspark_from_build.sh --tb=short: 75 passed, 3 warnings in 985.02s (0:16:25)
Confirmed spark-rapids-jni origin/main has the cuDF fix via thirdparty/cudf -> 0a1620e5b3 Fix MERGE_M2 for extreme finite partial means (#22393).
Rebuilt local spark-rapids dist with a JNI jar containing the equivalent cuDF fix revision c12348ed68e05227139a54e19d35374530c23e7b; confirmed the resulting rapids-4-spark_2.12-26.06.0-SNAPSHOT-cuda12.jar embeds that cuDF revision.
Re-ran targeted IT with that rebuilt dist jar: TESTS='hash_aggregate_test.py::test_std_variance hash_aggregate_test.py::test_std_variance_extreme_floating_point' TEST_PARALLEL=1 DATAGEN_SEED=1777180076 ./integration_tests/run_pyspark_from_build.sh --tb=short: 75 passed, 3 warnings in 670.56s (0:11:10)
Post-review fix validation: TESTS="hash_aggregate_test.py::test_std_variance_extreme_floating_point" TEST_PARALLEL=1 ./integration_tests/run_pyspark_from_build.sh -s: 10 passed, 3 warnings in 101.80s (0:01:41)
Blossom blocker fix validation: TESTS="hash_aggregate_test.py::test_std_variance_extreme_floating_point json_matrix_test.py::test_from_json_map_string_string[int_formatted.json]" TEST_PARALLEL=1 ./integration_tests/run_pyspark_from_build.sh --tb=short -q: 11 passed, 3 warnings in 95.10s (0:01:35)
New review-comment follow-up validation after removing this PR's dict/map assertion diff: TESTS="hash_aggregate_test.py::test_std_variance_extreme_floating_point json_matrix_test.py::test_from_json_map_string_string[int_formatted.json]" TEST_PARALLEL=1 ./integration_tests/run_pyspark_from_build.sh --tb=short -q: 11 passed, 3 warnings in 116.69s (0:01:56)
Review fix validation after moving overflow handling out of the generic assert framework, using Python 3.10 and Spark 3.3.0: TESTS='hash_aggregate_test.py::test_std_variance hash_aggregate_test.py::test_std_variance_extreme_floating_point' TEST_PARALLEL=1 DATAGEN_SEED=1777180076 ./integration_tests/run_pyspark_from_build.sh --tb=short -q: 75 passed, 3 warnings in 656.13s (0:10:56)
Review nit follow-up (drop -Infinity from overflow sentinels; revert two unrelated wraps in asserts.py), Python 3.10 + Spark 3.3.0, f402c5fb5: TEST='test_std_variance_extreme_floating_point or (test_std_variance and not extreme)' TEST_PARALLEL=0 DATAGEN_SEED=1777180076 ./integration_tests/run_pyspark_from_build.sh --tb=short -q: 265 passed, 39510 deselected, 8 warnings in 1425.14s (0:23:45).

Documentation

Updated for new or modified user-facing features or behaviors
No user-facing change

Testing

Added or modified tests to cover new code paths
Covered by existing tests
(Please provide the names of the existing tests in the PR description.)
Not required

Performance

Tests ran and results are added in the PR description
Issue filed with a link in the PR description
Not required

Signed-off-by: Allen Xu <allxu@nvidia.com>

greptile-apps · 2026-05-09T03:26:29Z

Greptile Summary

This PR fixes a NaN-producing overflow bug in the host-side M2 reduction merge path (CudfMergeM2.reductionAggregate) and splits the std/variance integration test coverage into a strict common-case suite and a loosely-compared extreme-input suite.

Scala fix: Adds an identity branch for mergeN == 0.0 in the host-side merge loop. Without it, the first non-empty partial evaluates the generic formula with mergeN == 0, producing mean * mean → Inf (overflow), then Inf * 0 → NaN per IEEE 754. The merge order is also reordered to match Spark CPU's CentralMomentAgg (deltaN = delta / newN; m2 += delta * deltaN * n1 * n2), narrowing remaining CPU/GPU merge-arithmetic differences.
Python test split: test_std_variance now uses DoubleGen(min_exp=-200, max_exp=200, no_nans=True) for strict equality, while test_std_variance_extreme_floating_point uses the full-exponent generator with a canonicalize hook that maps NaN/+Inf overflow sentinels to a common placeholder.
assert_gpu_and_cpu_are_equal_sql plumbing: The result_canonicalize_func_before_compare parameter is now threaded through, keeping the canonicalization local to the extreme-input test.
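
As a rough illustration of why a bounded exponent range leaves headroom for strict comparison, the sketch below squares powers of two. It assumes that `min_exp`/`max_exp` are binary exponents, which this thread does not state, and `square_overflows` is a hypothetical helper. A real variance also sums terms and multiplies by counts, so this shows only the dominant squaring step.

```python
import math

def square_overflows(binary_exp):
    """Return True if (2**binary_exp)**2 overflows a double to +Inf."""
    x = math.ldexp(1.0, binary_exp)  # exactly 2**binary_exp
    # Float multiplication saturates to inf instead of raising.
    return math.isinf(x * x)

for exp in (200, 511, 512, 1023):
    print(exp, square_overflows(exp))
```

Under that assumption, a magnitude near 2**200 squares to about 2**400, far below the double limit near 2**1024, while full-range inputs can overflow on the first squared delta.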

Confidence Score: 5/5

Safe to merge — the host-side merge fix is minimal, well-scoped, and the test suite was run end-to-end with 75 passing tests.

The Scala change is a targeted fix adding an identity branch that short-circuits the first non-empty partial, eliminating the Inf×0→NaN path. The code path is only reached in the host-side global reduction merge, not in GPU kernels or per-row expressions.

No files require special attention.

Important Files Changed

Filename	Overview
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/aggregate/aggregateFunctions.scala	Adds identity branch for first non-empty partial in host-side M2 merge, and reorders merge arithmetic to match Spark CPU CentralMomentAgg.
integration_tests/src/main/python/hash_aggregate_test.py	Splits std/variance coverage into common-finite (strict) and extreme-FP (loose sentinel) suites.
integration_tests/src/main/python/asserts.py	Threads result_canonicalize_func_before_compare through assert_gpu_and_cpu_are_equal_sql.
docs/compatibility.md	Adds documentation for stddev/variance overflow behavior on extreme finite inputs.

Reviews (7): Last reviewed commit: "Drop -Infinity sentinel and revert unrel..."

Copilot

Pull request overview

This PR addresses intermittent stddev/variance test failures caused by floating-point overflow differences (NaN vs ±Inf) by aligning the host-side M2 merge logic with Spark’s merge order and refining integration test coverage/compare semantics.

Changes:

Update CudfMergeM2 host-side reductionAggregate merge to follow Spark’s central-moment merge order (including an identity path for the first non-empty partial).
Split std/variance integration tests into (a) common finite floating-point coverage with strict comparisons and (b) extreme floating-point coverage with a scoped NaN/Inf overflow equivalence allowance.
Add an assertion-option plumb-through in the integration test framework to treat NaN vs ±Inf as equivalent for documented overflow cases.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/aggregate/aggregateFunctions.scala	Adjust host-side merge of partial (n, mean, m2) to mirror Spark merge order and reduce order-dependent overflow artifacts.
integration_tests/src/main/python/hash_aggregate_test.py	Split std/variance coverage into common vs extreme floating-point inputs; extreme path enables NaN/Inf-overflow-equivalence comparisons.
integration_tests/src/main/python/asserts.py	Plumb a new comparison option through equality helpers to allow NaN vs ±Inf equivalence for documented overflow tests.


wjxiz1992 · 2026-05-09T04:21:11Z

build


greptile-apps · 2026-05-09T08:54:36Z

        cpu_items = list(cpu.items()).sort(key=_RowCmp)
        gpu_items = list(gpu.items()).sort(key=_RowCmp)
-        _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])
+        _assert_equal(cpu_items, gpu_items, float_check, path + ["map"],
+            nan_inf_equivalent_for_overflow)


The sorted(...) fix mentioned in the PR description was not applied. list.sort() is an in-place method that returns None, so cpu_items and gpu_items are both None. When _assert_equal(None, None, ...) is called, it falls through to the cpu == None branch (elif (cpu == None): assert cpu == gpu) which always passes. Any dict comparison therefore silently succeeds regardless of content — a false green that masks real CPU/GPU divergence in map-type columns.

Suggested change

-        cpu_items = list(cpu.items()).sort(key=_RowCmp)
-        gpu_items = list(gpu.items()).sort(key=_RowCmp)
-        _assert_equal(cpu_items, gpu_items, float_check, path + ["map"])
-        _assert_equal(cpu_items, gpu_items, float_check, path + ["map"],
-            nan_inf_equivalent_for_overflow)
+        cpu_items = sorted(list(cpu.items()), key=_RowCmp)
+        gpu_items = sorted(list(gpu.items()), key=_RowCmp)
+        _assert_equal(cpu_items, gpu_items, float_check, path + ["map"],
+            nan_inf_equivalent_for_overflow)
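
The false green described in this comment can be reproduced in isolation. The snippet below is a standalone sketch with made-up dictionaries, not code from asserts.py, and it omits the `_RowCmp` key for brevity.

```python
cpu = {"a": 1, "b": 2}
gpu = {"a": 999}  # deliberately different from cpu

# list.sort() sorts in place and returns None, so both names are None.
cpu_broken = list(cpu.items()).sort()
gpu_broken = list(gpu.items()).sort()
false_green = cpu_broken == gpu_broken  # None == None

# sorted() returns a new sorted list, so the comparison sees real data.
cpu_sorted = sorted(cpu.items())
gpu_sorted = sorted(gpu.items())
real_result = cpu_sorted == gpu_sorted

print(false_green, real_result)
```

The first comparison passes even though the maps differ, which is why switching to `sorted(...)` can surface differences that were previously hidden.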

Addressed in e964fed: this PR no longer changes the dict/map comparison branch. The list.sort() behavior is a pre-existing issue, but applying the suggested sorted(...) cleanup in this PR exposed unrelated existing map/JSON/ORC differences in Blossom #13006, so that cleanup should be handled separately instead of being bundled with the std/variance fix.

Local validation after removing this PR's dict-branch change passed:
hash_aggregate_test.py::test_std_variance_extreme_floating_point plus json_matrix_test.py::test_from_json_map_string_string[int_formatted.json]: 11 passed, 3 warnings in 116.69s.


wjxiz1992 · 2026-05-09T09:17:09Z

build

wjxiz1992 · 2026-05-11T03:08:11Z

build


wjxiz1992 · 2026-05-11T08:37:00Z

build

wjxiz1992 · 2026-05-12T04:41:29Z

build

wjxiz1992 · 2026-05-12T06:55:26Z

build


thirtiseven · 2026-05-13T09:33:49Z

+}
+
+def _canonicalize_std_variance_overflow_value(value):
+    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):


nit: should we avoid treating -Infinity as an acceptable overflow sentinel here? For stddev/variance over finite inputs, NaN vs +Infinity can be explained by overflow and merge order, but -Infinity would likely indicate a different issue, such as a negative M2/sign bug. It may be safer to canonicalize only NaN and +Infinity, and update the compatibility note accordingly.

Addressed in f402c5f. The canonicalizer now treats only NaN and +Infinity as accepted overflow sentinels; -Infinity is left as-is so a negative-M2/sign bug would surface as a real diff. The compatibility note in docs/compatibility.md and the inline test comment are updated to match.

Re-ran hash_aggregate_test.py::test_std_variance and ::test_std_variance_extreme_floating_point locally with Python 3.10 + Spark 3.3.0: 265 passed, 39510 deselected, 8 warnings in 1425.14s.
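
The rule agreed in this thread can be sketched as follows. This is an illustrative stand-in, not the helper in hash_aggregate_test.py: the names `canonicalize_overflow_value` and `canonicalize_rows`, the placeholder string, and the row shape (a list of tuples) are all assumptions.

```python
import math

OVERFLOW_PLACEHOLDER = "<overflow>"  # hypothetical sentinel value

def canonicalize_overflow_value(value):
    """Map NaN and +Infinity to one placeholder; leave everything else."""
    if isinstance(value, float) and (math.isnan(value) or value == math.inf):
        return OVERFLOW_PLACEHOLDER
    # -Infinity is intentionally not canonicalized, so a sign bug
    # still shows up as a CPU/GPU difference.
    return value

def canonicalize_rows(rows):
    """Apply the canonicalizer to every column of every row."""
    return [tuple(canonicalize_overflow_value(v) for v in row) for row in rows]

cpu_rows = [(1.0, math.inf)]
gpu_rows = [(1.0, math.nan)]
neg_rows = [(1.0, -math.inf)]

print(canonicalize_rows(cpu_rows) == canonicalize_rows(gpu_rows))
print(canonicalize_rows(cpu_rows) == canonicalize_rows(neg_rows))
```

Mapping to a placeholder rather than to NaN avoids relying on NaN equality, since NaN never compares equal to itself.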

thirtiseven · 2026-05-13T09:34:22Z

    return (from_cpu, from_gpu)

-def assert_gpu_and_cpu_are_equal_collect(func, conf={}, is_cpu_first=True, result_canonicalize_func_before_compare=None):
+def assert_gpu_and_cpu_are_equal_collect(func, conf={}, is_cpu_first=True,


nit: unnecessary change?

Reverted in f402c5f. The assert_gpu_and_cpu_are_equal_collect signature is back to a single line; this PR no longer changes that line.

thirtiseven · 2026-05-13T09:34:36Z

    """
-    _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first, result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)
+    _assert_gpu_and_cpu_are_equal(func, 'COLLECT', conf=conf, is_cpu_first=is_cpu_first,
+        result_canonicalize_func_before_compare=result_canonicalize_func_before_compare)


nit: unnecessary change?

Reverted in f402c5f. The _assert_gpu_and_cpu_are_equal call is back to a single line; this PR no longer changes that line.

Address PR review nits on NVIDIA#14762:
- hash_aggregate_test.py: the std/variance overflow canonicalizer no longer accepts -Infinity. Over finite inputs, NaN and +Infinity can be explained by overflow plus partial-merge order; -Infinity would indicate a different issue (e.g. negative M2 / sign bug) and must not be hidden by sentinel canonicalization.
- docs/compatibility.md: align the std/variance overflow note with the test scope (NaN and +Infinity only).
- asserts.py: revert two unnecessary line wraps on assert_gpu_and_cpu_are_equal_collect signature and its call to _assert_gpu_and_cpu_are_equal; the new parameter plumbing through assert_gpu_and_cpu_are_equal_sql is kept.

Validation (Python 3.10, Spark 3.3.0): TESTS='hash_aggregate_test.py::test_std_variance hash_aggregate_test.py::test_std_variance_extreme_floating_point': 265 passed, 39510 deselected, 8 warnings in 1425.14s.

Signed-off-by: Allen Xu <allxu@nvidia.com>

wjxiz1992 · 2026-05-14T04:06:54Z

build

thirtiseven

LGTM

Fix std variance floating overflow coverage

cdd2c3a

Signed-off-by: Allen Xu <allxu@nvidia.com>

Copilot AI review requested due to automatic review settings May 9, 2026 03:22

Copilot started reviewing on behalf of wjxiz1992 May 9, 2026 03:23

Copilot AI reviewed May 9, 2026


Comment thread integration_tests/src/main/python/asserts.py

Comment thread integration_tests/src/main/python/hash_aggregate_test.py Outdated

Fix std variance review comments

b9b2a69

Signed-off-by: Allen Xu <allxu@nvidia.com>

wjxiz1992 requested a review from revans2 May 9, 2026 04:10

Keep std variance assertion change scoped

b86f3c8

Signed-off-by: Allen Xu <allxu@nvidia.com>

greptile-apps Bot reviewed May 9, 2026


Keep map assertion behavior unchanged

e964fed

Signed-off-by: Allen Xu <allxu@nvidia.com>

wjxiz1992 mentioned this pull request May 9, 2026

[BUG] test_std_variance fails with GPU nan vs CPU inf on Double data with small batchSizeBytes intermittently #14681

Closed

docs: document std variance floating overflow incompatibility

55d9918

Signed-off-by: Allen Xu <allxu@nvidia.com>

wjxiz1992 requested a review from thirtiseven May 13, 2026 06:36

thirtiseven reviewed May 13, 2026


Comment thread integration_tests/src/main/python/asserts.py Outdated

Address std variance overflow review

5f18b08

Signed-off-by: Allen Xu <allxu@nvidia.com>

thirtiseven reviewed May 13, 2026


wjxiz1992 requested a review from thirtiseven May 14, 2026 09:34

thirtiseven approved these changes May 14, 2026


wjxiz1992 merged commit ca3c8f1 into NVIDIA:main May 15, 2026
49 checks passed
