PERF: short-circuit sentinel scans on integer indexers#65298
Draft
jbrockmendel wants to merge 1 commit intopandas-dev:mainfrom
Draft
PERF: short-circuit sentinel scans on integer indexers#65298jbrockmendel wants to merge 1 commit intopandas-dev:mainfrom
jbrockmendel wants to merge 1 commit intopandas-dev:mainfrom
Conversation
Add `lib.has_sentinel(arr, sentinel)` — a short-circuiting Cython helper for the common `(arr == sentinel).any()` pattern on integer indexers, with an 8x-unrolled inner loop over a fused int8/16/32/64 memoryview. Wire up the clearest `(indexer == -1).any()` call sites: - `merge._MergeOperation._maybe_add_join_keys` (outer/left/right merges) - `reshape._Unstacker.new_index` (single-level unstack) - `sorting.get_group_index` / `decons_obs_group_ids` (groupby) - `MultiIndex._get_indexer_strict` NaN-key path - `DataFrame.__setitem__` non-unique columns path User-visible impact (best-of-9 repeats × ~200 iters, ARM64): - groupby.sum / groupby.count at n >= 10K: -5% to -16% - DataFrame[cols] = value with non-unique columns: ~-30% - unstack with NaN-introducing gaps (100x500): -14% - crosstab: -4% to -7% - multi_loc_nan_key: -3% - merge outer with no overlap (short-circuit fires immediately): -3% to -5% One known regression: merge(how='outer') with ~50% overlap and n ~100K sees ~+2% because the scan (length ~150K, first sentinel near the middle) fits in L2 cache, where numpy's SIMD (arr == -1).any() beats our scalar unroll. Unchanged at 1M+ (memory-bound) and small sizes (Python overhead dominates). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add
lib.has_sentinel(arr, sentinel)— a short-circuiting Cython helper for the common(arr == sentinel).any()pattern on integer indexers, with an 8x-unrolled inner loop over a fusedint8/16/32/64memoryview.Wire up the clearest
(indexer == -1).any()call sites:_MergeOperation._maybe_add_join_keys(non-inner merges)_Unstacker.new_index(single-level unstack)sorting.get_group_index/decons_obs_group_ids(groupby key lifting)MultiIndex._get_indexer_strictNaN-key pathDataFrame.__setitem__non-unique columns pathThe
idxmin/idxmaxsites were intentionally left alone — they can receive anExtensionArray(e.g.int64[pyarrow]) rather than a numpy array, and the fused memoryview helper doesn't accept those.Benchmarks
Best-of-9 repeats × ~200 iters each, ARM64 (Apple clang). Full 39-case sweep; significant signals only (|z|>2 and |Δmean|>3%):
DataFrame.__setitem__non-unique colsgroupby.countn=100Kunstacksparse 100×500 (gaps)groupby.sumn=10Kgroupby.sumn=100Kgroupby.sum2 keys n=10Kcrosstabn=100Kgroupby.sumn=1Munstackdense int16 codes 500×500crosstabn=10KMultiIndex.locNaN keymerge(how='outer')50% overlap n=100KOnly one significant regression:
merge(how='outer')50% overlap n=100K ≈ +2%. Root cause is the known mid-size SIMD gap — numpy's(arr == -1).any()does 2 int64 lanes/cycle; our 8-wide scalar unroll can't match when the array fits in L2 and the first sentinel isn't near the start. Disappears at 1M+ (memory-bound) and small sizes (Python overhead dominates).Test plan
pandas/tests/libs/test_lib.py+ targeted correctness tests across int8/16/32/64, including all tail-positioning edge casesframe/,series/,indexing/,reshape/,indexes/,libs/,groupby/(~78K tests)Notes
(ilocs < 0).any()form where values can also be< -1, and the(indices == -1).any()form on EA-typedres._valuespaths..all()analog would require a companionall_sentinelhelper; could follow up if useful, but the majority cluster in the codebase is.any().🤖 Generated with Claude Code