Fix assertion failures in assert_tpch_result_equal due to float sort ambiguity#22378
Fix assertion failures in assert_tpch_result_equal due to float sort ambiguity#22378Matt711 wants to merge 8 commits into
assert_tpch_result_equal due to float sort ambiguity#22378Conversation
|
So my original thinking here was that sorting on the float columns should be unnecessary. Suppose you have a sequence of Assuming
we'd correctly implement that logic. It would pass as long as |
|
I retract all my concerns about this change :) This was a bug in how we handled the float columns. At the time of the assertion error, we've already validated that the This stage works by
So IIUC, the issue is when you have a pair of tables that are equal on all non-float columns, but for some and table 2: We want these two to compare equal, but they currently don't because the float columns. The simplest fix seems to be to sort by all columns, but in a specific order: non-float first, then float. I've done that in 65827dc, along with a test that was previously failing. |
📝 WalkthroughSummary by CodeRabbit
WalkthroughThe ChangesGrouped float column sorting for result comparison
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py (1)
375-394: 💤 Low valueConsider extracting common sorting logic to reduce duplication.
The column classification (
non_float_columns,float_columns,grouped_sort_columns) and sorting logic is duplicated between theif sort_by:branch (lines 263-278) and thiselsebranch. Extracting the helper function and column lists before theif sort_by:block would reduce maintenance burden.Proposed refactor
Move the column classification and helper before the
if sort_by:block:left = left.with_columns(*float_casts) + non_float_columns = [ + col + for col in left.columns + if left.schema[col] not in (pl.Float32, pl.Float64) + ] + float_columns = [ + col for col in left.columns if left.schema[col] in (pl.Float32, pl.Float64) + ] + grouped_sort_columns = [*non_float_columns, *float_columns] + + def sort_for_comparison(df: pl.DataFrame) -> pl.DataFrame: + return ( + df.sort(by=grouped_sort_columns, nulls_last=nulls_last) + if grouped_sort_columns + else df + ) + if sort_by: by, descending = list(zip(*sort_by, strict=True)) # ... sortedness checks ... - non_float_columns = [ - col - for col in left.columns - if left.schema[col] not in (pl.Float32, pl.Float64) - ] - float_columns = [...] - grouped_sort_columns = [...] - def sort_for_comparison(df: pl.DataFrame) -> pl.DataFrame: - ... left_sorted = sort_for_comparison(left) right_sorted = sort_for_comparison(right) # ... else: - non_float_columns = [...] - float_columns = [...] - grouped_sort_columns = [...] - left_sorted = ( - left.sort(by=grouped_sort_columns, nulls_last=nulls_last) - if grouped_sort_columns - else left - ) - right_sorted = (...) + left_sorted = sort_for_comparison(left) + right_sorted = sort_for_comparison(right)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py` around lines 375 - 394, Extract the duplicated column-classification and sorting logic into a small helper and run it once before the if sort_by: branch: compute non_float_columns (columns where schema not in pl.Float32/Float64), float_columns (columns where schema in pl.Float32/Float64), and grouped_sort_columns (concatenate the two), then create a helper function (e.g., sort_by_grouped_columns(left, right, grouped_sort_columns, nulls_last)) that returns left_sorted and right_sorted using left.sort(...) and right.sort(...) when grouped_sort_columns is non-empty; call this helper from both the existing if sort_by: branch and the else branch so the column lists and sorting code are not duplicated.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py`:
- Around line 375-394: Extract the duplicated column-classification and sorting
logic into a small helper and run it once before the if sort_by: branch: compute
non_float_columns (columns where schema not in pl.Float32/Float64),
float_columns (columns where schema in pl.Float32/Float64), and
grouped_sort_columns (concatenate the two), then create a helper function (e.g.,
sort_by_grouped_columns(left, right, grouped_sort_columns, nulls_last)) that
returns left_sorted and right_sorted using left.sort(...) and right.sort(...)
when grouped_sort_columns is non-empty; call this helper from both the existing
if sort_by: branch and the else branch so the column lists and sorting code are
not duplicated.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 19b7a51b-8bbf-4c4b-ae8c-c26003a6422e
📒 Files selected for processing (2)
python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.pypython/cudf_polars/tests/testing/test_asserts.py
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py`:
- Around line 218-233: The helper sort_for_comparison currently closes over the
full original grouped_sort_columns which lets payload columns influence
ordering; change it to derive the actual sort keys from the frame being compared
by inspecting df.schema/df.columns (e.g. build a local_grouped list containing
only the grouped_sort_columns present in df), then call
df.sort(by=local_grouped, nulls_last=nulls_last) (or return df when
local_grouped is empty) so tie partitions are aligned using only the shared
sort_by keys (this makes comparisons like
sort_for_comparison(result_ties.select(by)) vs
sort_for_comparison(expected_ties.select(by)) stable).
- Around line 369-376: The current normalization via sort_for_comparison masks
row-order differences even when check_row_order is True; change the logic so
canonicalization only happens when not check_row_order: keep original left/right
when check_row_order is True and only call sort_for_comparison for the not
check_row_order path, then call polars.testing.assert_frame_equal on the
appropriately chosen left/right and preserve the check_row_order flag (or
explicitly set check_row_order=False only in the canonicalized branch with a
comment explaining order is intentionally ignored). Reference symbols:
sort_for_comparison, left_sorted, right_sorted,
polars.testing.assert_frame_equal, and the check_row_order parameter.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c961925a-22b8-4cc7-8069-a9d604ec2507
📒 Files selected for processing (1)
python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py
| def sort_for_comparison(df: pl.DataFrame) -> pl.DataFrame: | ||
| # We know that each dataframe is sorted on `sort_by` according to itself. | ||
| # Now we have some freedom to reorder the rows. We'll use this freedom to avoid | ||
| # any kind of fuzziness from sorting on floating-point columns. | ||
| # | ||
| # As long as we sort by the non-float columns *first*, we'll avoid any | ||
| # false positives / false negatives from comparing two tables that have the | ||
| # same values but happen to be in a different order. Sorting by floating-point | ||
| # columns *last* ensures that records that are close to each other appear in | ||
| # (roughly) the same order, such that polar's approximate equality checks | ||
| # will allow them to be considered equal (or not, if the aren't actually close). | ||
| return ( | ||
| df.sort(by=grouped_sort_columns, nulls_last=nulls_last) | ||
| if grouped_sort_columns | ||
| else df | ||
| ) |
There was a problem hiding this comment.
Derive the grouped sort keys from the frame being compared.
sort_for_comparison() currently closes over the full original column set, so the ties branch has to sort on payload columns before dropping to by. That means two tie partitions with the same by values can still be aligned differently by non-sort_by columns and fail the approximate comparison. Making this helper inspect df.schema would let the ties path compare sort_for_comparison(result_ties.select(by)) against sort_for_comparison(expected_ties.select(by)) instead of letting extra columns influence the order. As per coding guidelines, python/**/*.{py,pyx}: Logic errors producing wrong results - Verify algorithm correctness and data integrity in operations.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@python/cudf_polars/cudf_polars/experimental/benchmarks/asserts.py` around
lines 218 - 233, The helper sort_for_comparison currently closes over the full
original grouped_sort_columns which lets payload columns influence ordering;
change it to derive the actual sort keys from the frame being compared by
inspecting df.schema/df.columns (e.g. build a local_grouped list containing only
the grouped_sort_columns present in df), then call df.sort(by=local_grouped,
nulls_last=nulls_last) (or return df when local_grouped is empty) so tie
partitions are aligned using only the shared sort_by keys (this makes
comparisons like sort_for_comparison(result_ties.select(by)) vs
sort_for_comparison(expected_ties.select(by)) stable).
Description
When comparing results, sorting by non-float columns alone can leave rows with equal non-float keys in an arbitrary order, causing assert_frame_equal to fail on valid results. This PR retries the comparison using float columns as a secondary sort key before raising a validation error.
Checklist