Fix assertion failures in assert_tpch_result_equal due to float sort ambiguity#22378

Open
Matt711 wants to merge 8 commits into rapidsai:main from Matt711:bug/pdsds/q64-validation

Conversation

@Matt711
Contributor

@Matt711 Matt711 commented May 5, 2026

Description

When comparing results, sorting by non-float columns alone can leave rows with equal non-float keys in an arbitrary order, causing assert_frame_equal to fail on valid results. This PR retries the comparison using float columns as a secondary sort key before raising a validation error.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@Matt711 Matt711 requested a review from a team as a code owner May 5, 2026 13:43
@Matt711 Matt711 requested a review from vyasr May 5, 2026 13:43
@Matt711 Matt711 added bug Something isn't working non-breaking Non-breaking change labels May 5, 2026
@github-actions github-actions Bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels May 5, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python May 5, 2026
@TomAugspurger
Contributor

TomAugspurger commented May 5, 2026

So my original thinking here was that sorting on the float columns should be unnecessary. Suppose you have a sequence of (key, value) pairs, sorted by key:

Key, Value
0, x
0, x-e
0, x+e

Assuming e (or more precisely, 2e) is negligible, then any permutation of those rows should be considered equal to any other. And so my hope was that we could sort by Key and then validate with

  1. assert_frame_equal on Key without any abs_tol
  2. assert_frame_equal on all the columns with abs_tol

That would correctly implement this logic: it passes as long as abs_tol >= 2e and fails otherwise (at the second stage).
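The two-stage idea above can be modeled in plain Python. This is a minimal sketch under stated assumptions, not the Polars-based code in asserts.py: rows are (key, value) tuples, `two_stage_equal` is a hypothetical helper, and `math.isclose` stands in for the tolerance check that assert_frame_equal would perform.

```python
import math


def two_stage_equal(left, right, abs_tol):
    """Hypothetical model of the two-stage check on lists of (key, value) rows."""
    # Sort by key only. Python's sort is stable, so rows with equal keys
    # keep whatever relative order they arrived in.
    left = sorted(left, key=lambda row: row[0])
    right = sorted(right, key=lambda row: row[0])
    if len(left) != len(right):
        return False
    # Stage 1: keys must match exactly (no tolerance).
    if [key for key, _ in left] != [key for key, _ in right]:
        return False
    # Stage 2: values must match within an absolute tolerance.
    return all(
        math.isclose(a, b, rel_tol=0.0, abs_tol=abs_tol)
        for (_, a), (_, b) in zip(left, right)
    )


# Values chosen to be exactly representable in binary floating point.
x, e = 100.0, 0.25
left = [(0, x), (0, x - e), (0, x + e)]
right = [(0, x + e), (0, x), (0, x - e)]  # same rows, permuted

passes_with_2e = two_stage_equal(left, right, abs_tol=2 * e)
passes_with_e = two_stage_equal(left, right, abs_tol=e)
```

In this model the permuted rows pair x + e against x - e, so the largest gap is 2e. The check therefore only passes when the tolerance covers 2e, which matches the reasoning above.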

This reverts commit c270bb1.
Always sort by non-float columns, but do it after sorting by float
columns.
@TomAugspurger
Contributor

I retract all my concerns about this change :) This was a bug in how we handled the float columns.

At the time of the assertion error, we've already validated that the sort_by columns match. Now we're just dealing with non-sort_by columns.

This stage works by

  1. Sorting on all non-float columns.
  2. Doing an assert_frame_equal on all columns with a tolerance.

So IIUC, the issue is when you have a pair of tables that are equal on all non-float columns, but for some reason the float columns are in a different order (but have equal values). For example, table 1:

A B
1 1.0
1 2.0
1 3.0

and table 2:

A B
1 2.0
1 3.0
1 1.0

We want these two to compare equal, but they currently don't because the float columns are in a different row order.

The simplest fix seems to be to sort by all columns, but in a specific order: non-float first, then float. I've done that in 65827dc, along with a test that was previously failing.
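The table 1 / table 2 example above can be reproduced with a small plain-Python model. This is only a sketch: rows are (A, B) tuples, and `rows_close`, `sort_non_float_only`, and `sort_non_float_then_float` are hypothetical names standing in for the Polars sort and assert_frame_equal calls, which are not shown in this thread.

```python
import math

table_1 = [(1, 1.0), (1, 2.0), (1, 3.0)]
table_2 = [(1, 2.0), (1, 3.0), (1, 1.0)]


def rows_close(left, right, abs_tol=1e-9):
    """Row-by-row comparison: exact on A, within abs_tol on B."""
    return len(left) == len(right) and all(
        a == c and math.isclose(b, d, rel_tol=0.0, abs_tol=abs_tol)
        for (a, b), (c, d) in zip(left, right)
    )


def sort_non_float_only(rows):
    # Previous behavior in this model: only the non-float column is a sort key,
    # so rows with equal A keep their incoming order.
    return sorted(rows, key=lambda row: row[0])


def sort_non_float_then_float(rows):
    # Fixed behavior in this model: non-float column first, float column last.
    return sorted(rows, key=lambda row: (row[0], row[1]))


old_result = rows_close(sort_non_float_only(table_1), sort_non_float_only(table_2))
new_result = rows_close(
    sort_non_float_then_float(table_1), sort_non_float_then_float(table_2)
)

# A table whose float values really differ should still be rejected.
table_3 = [(1, 2.0), (1, 3.0), (1, 1.5)]
mismatch_result = rows_close(
    sort_non_float_then_float(table_1), sort_non_float_then_float(table_3)
)
```

Here `old_result` is False and `new_result` is True, while `mismatch_result` stays False, so adding the float column as the last sort key removes the ordering ambiguity without hiding real differences.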

@coderabbitai

coderabbitai Bot commented May 12, 2026

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

  • Tests
    • Improved assertion logic for comparisons: results are now deterministically sorted with floating-point columns placed last to avoid nondeterministic ordering and reduce floating-point fuzziness.
    • Added coverage for grouped floating-point comparisons, ensuring tolerance-aware equality within groups and clear failure messages when differences exceed thresholds.

Walkthrough

The assert_tpch_result_equal function now sorts columns by type (non-float first, float last) via sort_for_comparison and applies this grouped sort across all comparison branches (sort_by non-ties, sort_by ties, and no sort_by). A new parametrized test verifies correct behavior for permuted float rows within non-float groups and rejects out-of-tolerance differences.

Changes

Grouped float column sorting for result comparison

| Layer / File(s) | Summary |
| --- | --- |
| Sort-for-comparison helper definition (python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py) | Splits columns into non-float and float groups and defines sort_for_comparison helper that sorts both groups in sequence, with float columns placed last. |
| Apply grouped sorting to sort_by comparison paths (python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py) | In the sort_by-driven .head(n) logic, applies sort_for_comparison to both non-ties and ties frame pairs before equality checks instead of sorting only by non_float_columns. |
| Apply grouped sorting to unsorted frames (python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py) | When sort_by is not provided, applies grouped sorting (non-float then float) to both frames before asserting equality, replacing direct frame comparison to ignore nondeterministic row ordering. |
| Test grouped float sorting behavior (python/cudf_polars/tests/testing/test_asserts.py) | Parametrized test verifies assert_tpch_result_equal succeeds when float rows are permuted within non-float groups under tolerance, and fails with ValidationError when differences exceed tolerance. |
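The column grouping summarized above can be sketched without Polars. In this simplified model the schema is a plain dict from column name to a dtype name string, rows are dicts, and null handling (the nulls_last argument in the real helper) is left out. The names grouped_sort_columns and sort_for_comparison come from the PR. The function signatures and the `FLOAT_DTYPES` constant are assumptions made for illustration.

```python
# Assumed stand-in for the (pl.Float32, pl.Float64) check in the PR.
FLOAT_DTYPES = {"Float32", "Float64"}


def grouped_sort_columns(schema):
    """Return column names with non-float columns first and float columns last."""
    non_float_columns = [col for col, dtype in schema.items() if dtype not in FLOAT_DTYPES]
    float_columns = [col for col, dtype in schema.items() if dtype in FLOAT_DTYPES]
    return [*non_float_columns, *float_columns]


def sort_for_comparison(rows, schema):
    """Sort dict rows by the grouped key order; return a copy if there are no keys."""
    keys = grouped_sort_columns(schema)
    if not keys:
        return list(rows)
    return sorted(rows, key=lambda row: tuple(row[col] for col in keys))


schema = {"price": "Float64", "region": "String", "qty": "Int64"}
rows = [
    {"price": 2.5, "region": "b", "qty": 1},
    {"price": 1.5, "region": "a", "qty": 2},
    {"price": 0.5, "region": "a", "qty": 2},
]

key_order = grouped_sort_columns(schema)
sorted_prices = [row["price"] for row in sort_for_comparison(rows, schema)]
```

The float column `price` appears first in the schema but ends up last in `key_order`, so it only breaks ties among rows that already agree on `region` and `qty`.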

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main fix: addressing assertion failures in assert_tpch_result_equal caused by float sort ambiguity, which directly matches the core changes in both modified files. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the problem (sorting by non-float columns alone leaves rows in arbitrary order), the solution (use float columns as secondary sort key), and references the closed issue. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #22129 by implementing deterministic sorting with non-float columns first and float columns second to prevent validation failures when row ordering differs for rows with equal non-float key values. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the validation issue: modifying assert_tpch_result_equal's sorting logic and adding a comprehensive test case for the grouped float sort behavior. |



@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py (1)

375-394: 💤 Low value

Consider extracting common sorting logic to reduce duplication.

The column classification (non_float_columns, float_columns, grouped_sort_columns) and sorting logic is duplicated between the if sort_by: branch (lines 263-278) and this else branch. Extracting the helper function and column lists before the if sort_by: block would reduce maintenance burden.

Proposed refactor

Move the column classification and helper before the if sort_by: block:

     left = left.with_columns(*float_casts)
 
+    non_float_columns = [
+        col
+        for col in left.columns
+        if left.schema[col] not in (pl.Float32, pl.Float64)
+    ]
+    float_columns = [
+        col for col in left.columns if left.schema[col] in (pl.Float32, pl.Float64)
+    ]
+    grouped_sort_columns = [*non_float_columns, *float_columns]
+
+    def sort_for_comparison(df: pl.DataFrame) -> pl.DataFrame:
+        return (
+            df.sort(by=grouped_sort_columns, nulls_last=nulls_last)
+            if grouped_sort_columns
+            else df
+        )
+
     if sort_by:
         by, descending = list(zip(*sort_by, strict=True))
         # ... sortedness checks ...
-        non_float_columns = [
-            col
-            for col in left.columns
-            if left.schema[col] not in (pl.Float32, pl.Float64)
-        ]
-        float_columns = [...]
-        grouped_sort_columns = [...]
-        def sort_for_comparison(df: pl.DataFrame) -> pl.DataFrame:
-            ...
         left_sorted = sort_for_comparison(left)
         right_sorted = sort_for_comparison(right)
         # ...
     else:
-        non_float_columns = [...]
-        float_columns = [...]
-        grouped_sort_columns = [...]
-        left_sorted = (
-            left.sort(by=grouped_sort_columns, nulls_last=nulls_last)
-            if grouped_sort_columns
-            else left
-        )
-        right_sorted = (...)
+        left_sorted = sort_for_comparison(left)
+        right_sorted = sort_for_comparison(right)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py` around
lines 375 - 394, Extract the duplicated column-classification and sorting logic
into a small helper and run it once before the if sort_by: branch: compute
non_float_columns (columns where schema not in pl.Float32/Float64),
float_columns (columns where schema in pl.Float32/Float64), and
grouped_sort_columns (concatenate the two), then create a helper function (e.g.,
sort_by_grouped_columns(left, right, grouped_sort_columns, nulls_last)) that
returns left_sorted and right_sorted using left.sort(...) and right.sort(...)
when grouped_sort_columns is non-empty; call this helper from both the existing
if sort_by: branch and the else branch so the column lists and sorting code are
not duplicated.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between d09d10d and 65827dc.

📒 Files selected for processing (2)
  • python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py
  • python/cudf_polars/tests/testing/test_asserts.py

quasiben added a commit to quasiben/cudf that referenced this pull request May 12, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py`:
- Around line 218-233: The helper sort_for_comparison currently closes over the
full original grouped_sort_columns which lets payload columns influence
ordering; change it to derive the actual sort keys from the frame being compared
by inspecting df.schema/df.columns (e.g. build a local_grouped list containing
only the grouped_sort_columns present in df), then call
df.sort(by=local_grouped, nulls_last=nulls_last) (or return df when
local_grouped is empty) so tie partitions are aligned using only the shared
sort_by keys (this makes comparisons like
sort_for_comparison(result_ties.select(by)) vs
sort_for_comparison(expected_ties.select(by)) stable).
- Around line 369-376: The current normalization via sort_for_comparison masks
row-order differences even when check_row_order is True; change the logic so
canonicalization only happens when not check_row_order: keep original left/right
when check_row_order is True and only call sort_for_comparison for the not
check_row_order path, then call polars.testing.assert_frame_equal on the
appropriately chosen left/right and preserve the check_row_order flag (or
explicitly set check_row_order=False only in the canonicalized branch with a
comment explaining order is intentionally ignored). Reference symbols:
sort_for_comparison, left_sorted, right_sorted,
polars.testing.assert_frame_equal, and the check_row_order parameter.
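The second suggestion, canonicalizing only when row order is not being checked, can be sketched as follows. This is a plain-Python model with a hypothetical `frames_match` helper. The real code would call sort_for_comparison and polars.testing.assert_frame_equal, neither of which is reproduced here.

```python
def frames_match(left, right, *, check_row_order):
    """Hypothetical model: rows are tuples, frames are lists of rows."""
    if not check_row_order:
        # Order is intentionally ignored, so bring both sides into a
        # canonical order before comparing.
        left = sorted(left)
        right = sorted(right)
    # With check_row_order=True the original order is preserved, so a
    # permutation of the same rows is reported as a mismatch.
    return left == right


left = [(1, "a"), (2, "b")]
right = [(2, "b"), (1, "a")]

ignoring_order = frames_match(left, right, check_row_order=False)
respecting_order = frames_match(left, right, check_row_order=True)
```

The point of the branch is that sorting both inputs unconditionally would make `respecting_order` come out True as well, which is the masking behavior the comment warns about.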

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between 9c66981 and 75a5e60.

📒 Files selected for processing (1)
  • python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py

Comment on lines +218 to +233
    def sort_for_comparison(df: pl.DataFrame) -> pl.DataFrame:
        # We know that each dataframe is sorted on `sort_by` according to itself.
        # Now we have some freedom to reorder the rows. We'll use this freedom to avoid
        # any kind of fuzziness from sorting on floating-point columns.
        #
        # As long as we sort by the non-float columns *first*, we'll avoid any
        # false positives / false negatives from comparing two tables that have the
        # same values but happen to be in a different order. Sorting by floating-point
        # columns *last* ensures that records that are close to each other appear in
        # (roughly) the same order, such that Polars' approximate equality checks
        # will allow them to be considered equal (or not, if they aren't actually close).
        return (
            df.sort(by=grouped_sort_columns, nulls_last=nulls_last)
            if grouped_sort_columns
            else df
        )

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Derive the grouped sort keys from the frame being compared.

sort_for_comparison() currently closes over the full original column set, so the ties branch has to sort on payload columns before dropping to by. That means two tie partitions with the same by values can still be aligned differently by non-sort_by columns and fail the approximate comparison. Making this helper inspect df.schema would let the ties path compare sort_for_comparison(result_ties.select(by)) against sort_for_comparison(expected_ties.select(by)) instead of letting extra columns influence the order. As per coding guidelines, python/**/*.{py,pyx}: Logic errors producing wrong results - Verify algorithm correctness and data integrity in operations.
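The suggestion to derive sort keys from the frame being compared could look roughly like this in a plain-Python model. Rows are dicts, the signature is an assumption, and `local_grouped` is the illustrative name used in the review prompt. Whether this is the right change for the ties path is for the PR authors to decide.

```python
def sort_for_comparison(rows, grouped_sort_columns):
    """Sort dict rows using only the grouped sort columns present in the rows."""
    if not rows:
        return []
    # Keep the non-float-first ordering, but drop columns this frame lacks.
    local_grouped = [col for col in grouped_sort_columns if col in rows[0]]
    if not local_grouped:
        return list(rows)
    return sorted(rows, key=lambda row: tuple(row[col] for col in local_grouped))


grouped_sort_columns = ["k", "payload", "score"]  # non-float first, float last
full_rows = [
    {"k": 2, "payload": "a", "score": 0.5},
    {"k": 1, "payload": "b", "score": 1.5},
]

# Project down to the sort_by keys first, as in result_ties.select(by).
by = ["k"]
projected = [{col: row[col] for col in by} for row in full_rows]
sorted_projected = sort_for_comparison(projected, grouped_sort_columns)
```

Because the helper filters the key list against the columns actually present, it can be applied to a projected frame without the payload columns influencing, or being required for, the ordering.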


Comment thread python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py

Labels

bug Something isn't working cudf-polars Issues specific to cudf-polars non-breaking Non-breaking change Python Affects Python cuDF API.

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

[BUG] Validate TPC-DS Q64

3 participants