Fix assertion failures in `assert_tpch_result_equal` due to float sort ambiguity by Matt711 · Pull Request #22378 · rapidsai/cudf

Matt711 · 2026-05-05T13:43:29Z

Description

When comparing results, sorting by non-float columns alone can leave rows with equal non-float keys in an arbitrary order, causing assert_frame_equal to fail on valid results. This PR retries the comparison using float columns as a secondary sort key before raising a validation error.

Closes [BUG] Validate TPC-DS Q64 #22129

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

TomAugspurger · 2026-05-05T14:23:55Z

So my original thinking here was that sorting on the float columns should be unnecessary. Suppose you have a sequence of (key, value) pairs, sorted by key:

Key, Value
0, x
0, x-e
0, x+e

Assuming e (or more precisely, 2e) is negligible, any permutation of those rows should be considered equal to any other. So my hope was that by sorting by Key and then validating with

- assert_frame_equal on Key without any abs_tol
- assert_frame_equal on all the columns with abs_tol

we'd correctly implement that logic. It would pass as long as abs_tol >= 2e, and fail otherwise (on the second stage).
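The two-stage check described in this comment can be sketched in plain Python. This is only an illustration, not the cudf-polars code: `two_stage_equal` and the `(key, value)` tuple layout are hypothetical, and `math.isclose` stands in for `assert_frame_equal` with an absolute tolerance. `e` is chosen as a power of two so that the arithmetic is exact.

```python
import math


def two_stage_equal(left, right, abs_tol):
    """Compare two lists of (key, value) rows (hypothetical helper)."""
    if len(left) != len(right):
        return False
    # Sort on the key alone. sorted() is stable, so rows that share a key
    # keep whatever relative order they arrived in.
    left_sorted = sorted(left, key=lambda row: row[0])
    right_sorted = sorted(right, key=lambda row: row[0])

    # Stage 1: keys must match exactly, with no tolerance.
    if [k for k, _ in left_sorted] != [k for k, _ in right_sorted]:
        return False

    # Stage 2: values are compared row by row with an absolute tolerance.
    return all(
        math.isclose(lv, rv, rel_tol=0.0, abs_tol=abs_tol)
        for (_, lv), (_, rv) in zip(left_sorted, right_sorted)
    )


x, e = 1.0, 2.0**-20  # power of two, so x - e and x + e are exact
left = [(0, x), (0, x - e), (0, x + e)]
right = [(0, x + e), (0, x), (0, x - e)]  # same rows, permuted

passes_at_2e = two_stage_equal(left, right, abs_tol=2 * e)
fails_at_e = two_stage_equal(left, right, abs_tol=e)
```

In this permutation the worst-aligned pair is `x + e` against `x - e`, which differ by 2e, so the check passes at `abs_tol = 2 * e` and fails on the second stage at `abs_tol = e`.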

TomAugspurger · 2026-05-12T12:59:53Z

I retract all my concerns about this change :) This was a bug in how we handled the float columns.

At the time of the assertion error, we've already validated that the sort_by columns match. Now we're just dealing with non-sort_by columns.

This stage works by

1. Sorting on all non-float columns.
2. Doing an assert_frame_equal on all columns with a tolerance.

So IIUC, the issue is when you have a pair of tables that are equal on all non-float columns, but for some reason the float columns are in a different order (but have equal values). For example, table 1:

A B
1 1.0
1 2.0
1 3.0

and table 2:

A B
1 2.0
1 3.0
1 1.0

We want these two to compare equal, but they currently don't because the float columns are in a different order.

The simplest fix seems to be to sort by all columns, but in a specific order: non-float first, then float. I've done that in 65827dc, along with a test that was previously failing.
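A minimal plain-Python sketch of the two tables above, assuming rows are `(A, B)` tuples. The helper names (`rows_close`, `by_a`, `by_a_then_b`) and `table_3` are hypothetical and only illustrate why the sort order matters; the actual change lives in `assert_tpch_result_equal` and uses polars.

```python
import math

# Columns: A (integer), B (float), as in the two tables above.
table_1 = [(1, 1.0), (1, 2.0), (1, 3.0)]
table_2 = [(1, 2.0), (1, 3.0), (1, 1.0)]


def rows_close(left, right, abs_tol=1e-9):
    """Row-by-row comparison: A must match exactly, B within abs_tol."""
    return len(left) == len(right) and all(
        la == ra and math.isclose(lb, rb, rel_tol=0.0, abs_tol=abs_tol)
        for (la, lb), (ra, rb) in zip(left, right)
    )


def by_a(rows):
    # Old behaviour: sort on the non-float column only (stable sort).
    return sorted(rows, key=lambda r: r[0])


def by_a_then_b(rows):
    # New behaviour: non-float column first, float column last.
    return sorted(rows, key=lambda r: (r[0], r[1]))


old_result = rows_close(by_a(table_1), by_a(table_2))
new_result = rows_close(by_a_then_b(table_1), by_a_then_b(table_2))

# A genuinely different table must still be rejected.
table_3 = [(1, 2.0), (1, 3.0), (1, 1.5)]
still_fails = rows_close(by_a_then_b(table_1), by_a_then_b(table_3))
```

Sorting on `A` alone leaves both tables in their input order, so the row-wise comparison fails; adding `B` as the last sort key aligns the rows, while a table with a different value is still rejected.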

coderabbitai · 2026-05-12T13:01:58Z

📝 Walkthrough

Summary by CodeRabbit

Tests
- Improved assertion logic for comparisons: results are now deterministically sorted with floating-point columns placed last to avoid nondeterministic ordering and reduce floating-point fuzziness.
- Added coverage for grouped floating-point comparisons, ensuring tolerance-aware equality within groups and clear failure messages when differences exceed thresholds.

Walkthrough

The assert_tpch_result_equal function now sorts columns by type (non-float first, float last) via sort_for_comparison and applies this grouped sort across all comparison branches (sort_by non-ties, sort_by ties, and no sort_by). A new parametrized test verifies correct behavior for permuted float rows within non-float groups and rejects out-of-tolerance differences.
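The column grouping described in this walkthrough can be sketched without polars. Everything here is an assumption made for illustration: the schema is a plain dict of column name to dtype name, `FLOAT_TYPES` stands in for `pl.Float32` / `pl.Float64`, `sort_rows` is a hypothetical helper, and null handling (`nulls_last`) is left out.

```python
FLOAT_TYPES = {"Float32", "Float64"}  # stand-ins for pl.Float32 / pl.Float64


def grouped_sort_columns(schema):
    """Return column names with non-float columns first and float columns last."""
    non_float = [name for name, dtype in schema.items() if dtype not in FLOAT_TYPES]
    floats = [name for name, dtype in schema.items() if dtype in FLOAT_TYPES]
    return [*non_float, *floats]


def sort_rows(rows, schema):
    """Sort a list of row dicts by the grouped column order."""
    columns = grouped_sort_columns(schema)
    if not columns:
        return rows  # nothing to sort by
    return sorted(rows, key=lambda row: tuple(row[c] for c in columns))


schema = {"price": "Float64", "region": "String", "qty": "Int64"}
order = grouped_sort_columns(schema)

rows = [
    {"price": 2.5, "region": "b", "qty": 1},
    {"price": 9.0, "region": "a", "qty": 2},
    {"price": 1.0, "region": "a", "qty": 2},
]
sorted_rows = sort_rows(rows, schema)
prices = [row["price"] for row in sorted_rows]
```

The float column `price` only breaks ties between rows whose non-float columns are equal, which is what makes the order deterministic without letting float noise decide the primary grouping.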

Changes

Grouped float column sorting for result comparison

| Layer / File(s) | Summary |
|---|---|
| Sort-for-comparison helper definition `python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py` | Splits columns into non-float and float groups and defines `sort_for_comparison` helper that sorts both groups in sequence, with float columns placed last. |
| Apply grouped sorting to sort_by comparison paths `python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py` | In the sort_by-driven `.head(n)` logic, applies `sort_for_comparison` to both non-ties and ties frame pairs before equality checks instead of sorting only by `non_float_columns`. |
| Apply grouped sorting to unsorted frames `python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py` | When `sort_by` is not provided, applies grouped sorting (non-float then float) to both frames before asserting equality, replacing direct frame comparison to ignore nondeterministic row ordering. |
| Test grouped float sorting behavior `python/cudf_polars/tests/testing/test_asserts.py` | Parametrized test verifies `assert_tpch_result_equal` succeeds when float rows are permuted within non-float groups under tolerance, and fails with ValidationError when differences exceed tolerance. |

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main fix: addressing assertion failures in assert_tpch_result_equal caused by float sort ambiguity, which directly matches the core changes in both modified files. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the problem (sorting by non-float columns alone leaves rows in arbitrary order), the solution (use float columns as secondary sort key), and references the closed issue. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue `#22129` by implementing deterministic sorting with non-float columns first and float columns second to prevent validation failures when row ordering differs for rows with equal non-float key values. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the validation issue: modifying assert_tpch_result_equal's sorting logic and adding a comprehensive test case for the grouped float sort behavior. |

coderabbitai

🧹 Nitpick comments (1)

python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py (1)

375-394: 💤 Low value

Consider extracting common sorting logic to reduce duplication.

The column classification (non_float_columns, float_columns, grouped_sort_columns) and sorting logic is duplicated between the if sort_by: branch (lines 263-278) and this else branch. Extracting the helper function and column lists before the if sort_by: block would reduce maintenance burden.

Proposed refactor

Move the column classification and helper before the if sort_by: block:

     left = left.with_columns(*float_casts)
 
+    non_float_columns = [
+        col
+        for col in left.columns
+        if left.schema[col] not in (pl.Float32, pl.Float64)
+    ]
+    float_columns = [
+        col for col in left.columns if left.schema[col] in (pl.Float32, pl.Float64)
+    ]
+    grouped_sort_columns = [*non_float_columns, *float_columns]
+
+    def sort_for_comparison(df: pl.DataFrame) -> pl.DataFrame:
+        return (
+            df.sort(by=grouped_sort_columns, nulls_last=nulls_last)
+            if grouped_sort_columns
+            else df
+        )
+
     if sort_by:
         by, descending = list(zip(*sort_by, strict=True))
         # ... sortedness checks ...
-        non_float_columns = [
-            col
-            for col in left.columns
-            if left.schema[col] not in (pl.Float32, pl.Float64)
-        ]
-        float_columns = [...]
-        grouped_sort_columns = [...]
-        def sort_for_comparison(df: pl.DataFrame) -> pl.DataFrame:
-            ...
         left_sorted = sort_for_comparison(left)
         right_sorted = sort_for_comparison(right)
         # ...
     else:
-        non_float_columns = [...]
-        float_columns = [...]
-        grouped_sort_columns = [...]
-        left_sorted = (
-            left.sort(by=grouped_sort_columns, nulls_last=nulls_last)
-            if grouped_sort_columns
-            else left
-        )
-        right_sorted = (...)
+        left_sorted = sort_for_comparison(left)
+        right_sorted = sort_for_comparison(right)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py` around
lines 375 - 394, Extract the duplicated column-classification and sorting logic
into a small helper and run it once before the if sort_by: branch: compute
non_float_columns (columns where schema not in pl.Float32/Float64),
float_columns (columns where schema in pl.Float32/Float64), and
grouped_sort_columns (concatenate the two), then create a helper function (e.g.,
sort_by_grouped_columns(left, right, grouped_sort_columns, nulls_last)) that
returns left_sorted and right_sorted using left.sort(...) and right.sort(...)
when grouped_sort_columns is non-empty; call this helper from both the existing
if sort_by: branch and the else branch so the column lists and sorting code are
not duplicated.

📥 Commits

Reviewing files that changed from the base of the PR and between d09d10d and 65827dc.

📒 Files selected for processing (2)

python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py
python/cudf_polars/tests/testing/test_asserts.py

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py`:
- Around line 218-233: The helper sort_for_comparison currently closes over the
full original grouped_sort_columns which lets payload columns influence
ordering; change it to derive the actual sort keys from the frame being compared
by inspecting df.schema/df.columns (e.g. build a local_grouped list containing
only the grouped_sort_columns present in df), then call
df.sort(by=local_grouped, nulls_last=nulls_last) (or return df when
local_grouped is empty) so tie partitions are aligned using only the shared
sort_by keys (this makes comparisons like
sort_for_comparison(result_ties.select(by)) vs
sort_for_comparison(expected_ties.select(by)) stable).
- Around line 369-376: The current normalization via sort_for_comparison masks
row-order differences even when check_row_order is True; change the logic so
canonicalization only happens when not check_row_order: keep original left/right
when check_row_order is True and only call sort_for_comparison for the not
check_row_order path, then call polars.testing.assert_frame_equal on the
appropriately chosen left/right and preserve the check_row_order flag (or
explicitly set check_row_order=False only in the canonicalized branch with a
comment explaining order is intentionally ignored). Reference symbols:
sort_for_comparison, left_sorted, right_sorted,
polars.testing.assert_frame_equal, and the check_row_order parameter.
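The second inline comment (canonicalize only when row order is not being checked) can be sketched as follows. This is a plain-Python illustration under stated assumptions: `frames_equal` and its `sort_key` parameter are hypothetical, rows are tuples, and the comparison is exact rather than tolerance-aware.

```python
def frames_equal(left, right, *, check_row_order, sort_key):
    """Compare two lists of rows (hypothetical helper).

    Rows are only sorted into a canonical order when the caller has said
    that row order does not matter.
    """
    if not check_row_order:
        # Order is intentionally ignored: sort both sides the same way first.
        left = sorted(left, key=sort_key)
        right = sorted(right, key=sort_key)
    # With check_row_order=True the inputs are compared exactly as given,
    # so a difference in row order is reported instead of being masked.
    return left == right


a = [(1, "x"), (2, "y")]
b = [(2, "y"), (1, "x")]  # same rows, different order

ordered = frames_equal(a, b, check_row_order=True, sort_key=lambda r: r)
unordered = frames_equal(a, b, check_row_order=False, sort_key=lambda r: r)
```

Keeping the sort inside the `not check_row_order` branch means a caller who asks for an order-sensitive comparison still sees ordering differences.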

📥 Commits

Reviewing files that changed from the base of the PR and between 9c66981 and 75a5e60.

📒 Files selected for processing (1)

python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py

coderabbitai · 2026-05-12T14:13:23Z

+    def sort_for_comparison(df: pl.DataFrame) -> pl.DataFrame:
+        # We know that each dataframe is sorted on `sort_by` according to itself.
+        # Now we have some freedom to reorder the rows. We'll use this freedom to avoid
+        # any kind of fuzziness from sorting on floating-point columns.
+        #
+        # As long as we sort by the non-float columns *first*, we'll avoid any
+        # false positives / false negatives from comparing two tables that have the
+        # same values but happen to be in a different order. Sorting by floating-point
+        # columns *last* ensures that records that are close to each other appear in
+        # (roughly) the same order, such that polars' approximate equality checks
+        # will allow them to be considered equal (or not, if they aren't actually close).
+        return (
+            df.sort(by=grouped_sort_columns, nulls_last=nulls_last)
+            if grouped_sort_columns
+            else df
+        )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Derive the grouped sort keys from the frame being compared.

sort_for_comparison() currently closes over the full original column set, so the ties branch has to sort on payload columns before dropping to by. That means two tie partitions with the same by values can still be aligned differently by non-sort_by columns and fail the approximate comparison. Making this helper inspect df.schema would let the ties path compare sort_for_comparison(result_ties.select(by)) against sort_for_comparison(expected_ties.select(by)) instead of letting extra columns influence the order. As per coding guidelines, python/**/*.{py,pyx}: Logic errors producing wrong results - Verify algorithm correctness and data integrity in operations.
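A rough sketch of the suggestion in this comment, written in plain Python rather than polars. The row-dict representation, the `select` helper, and the column names are all hypothetical; the point is only that the helper looks at the columns present in the frame it receives instead of closing over the full column list.

```python
def sort_for_comparison(rows, grouped_sort_columns):
    """Sort row dicts using only the grouped columns the rows actually contain."""
    if not rows:
        return rows
    present = rows[0].keys()
    # Keep the grouped order (non-float first, float last) but drop any
    # column that is missing from this particular frame.
    local_grouped = [c for c in grouped_sort_columns if c in present]
    if not local_grouped:
        return rows
    return sorted(rows, key=lambda row: tuple(row[c] for c in local_grouped))


def select(rows, columns):
    """Project row dicts onto a subset of columns (stand-in for df.select)."""
    return [{c: row[c] for c in columns} for row in rows]


grouped = ["key", "label", "payload"]  # non-float columns first, float last
result_ties = [
    {"key": 2, "label": "a", "payload": 0.5},
    {"key": 1, "label": "b", "payload": 0.7},
]

# Only "key" survives the projection, so only "key" is used for sorting.
keys_only = sort_for_comparison(select(result_ties, ["key"]), grouped)
full = sort_for_comparison(result_ties, grouped)
```

Because the sort keys are derived per frame, a frame that has been reduced to the `sort_by` columns can be passed through the same helper without payload columns influencing the order.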

Matt711 added 2 commits May 5, 2026 06:38

Fix assertion failures in assert_tpch_result_equal due to float sort …

dc28164

…ambiguity

remove comment

c270bb1

Matt711 requested a review from a team as a code owner May 5, 2026 13:43

Matt711 requested a review from vyasr May 5, 2026 13:43

Matt711 added the bug (Something isn't working) and non-breaking (Non-breaking change) labels May 5, 2026

github-actions Bot assigned Matt711 May 5, 2026

github-actions Bot added the Python (Affects Python cuDF API.) and cudf-polars (Issues specific to cudf-polars) labels May 5, 2026

github-project-automation Bot added this to cuDF Python May 5, 2026

GPUtester moved this to In Progress in cuDF Python May 5, 2026

quasiben mentioned this pull request May 11, 2026

[DO NOT MERGE] PDS DS ALL #22469

Open

TomAugspurger added 3 commits May 12, 2026 05:51

Revert "remove comment"

c671992

This reverts commit c270bb1.

Revert "Fix assertion failures in assert_tpch_result_equal due to flo…

3c7fda9

…at sort ambiguity" This reverts commit dc28164.

Update floating-point handling

65827dc

Always sort by non-float columns, but do it after sorting by float columns.

Merge branch 'main' into bug/pdsds/q64-validation

9c66981

TomAugspurger approved these changes May 12, 2026

coderabbitai Bot reviewed May 12, 2026

Deduplicate sort handling

75a5e60

quasiben added a commit to quasiben/cudf that referenced this pull request May 12, 2026

Merge PR rapidsai#22378 updates

4f69818

coderabbitai Bot reviewed May 12, 2026

Merge branch 'main' into bug/pdsds/q64-validation

ea7c9ca

Matt711 wants to merge 8 commits into rapidsai:main from Matt711:bug/pdsds/q64-validation

3 participants
