PERF: short-circuit sentinel scans on integer indexers #65298

Draft
jbrockmendel wants to merge 1 commit into pandas-dev:main from jbrockmendel:perf-has-sentinel

@jbrockmendel
Member

Summary

Add lib.has_sentinel(arr, sentinel) — a short-circuiting Cython helper for the common (arr == sentinel).any() pattern on integer indexers, with an 8x-unrolled inner loop over a fused int8/16/32/64 memoryview.
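
To make the intended semantics concrete, here is a minimal pure-Python model of a short-circuiting scan with an 8-wide main loop and a scalar tail. It is only a sketch: the name `has_sentinel_reference` is hypothetical, it is not the Cython code in this PR, and it will be far slower than either the helper or numpy. The only property it illustrates is that the result should match `(arr == sentinel).any()` while returning as soon as a match is seen.

```python
import numpy as np


def has_sentinel_reference(arr: np.ndarray, sentinel: int) -> bool:
    """Illustrative pure-Python model; NOT the Cython helper from this PR."""
    if arr.ndim != 1 or arr.dtype.kind != "i":
        raise TypeError("expected a 1-D signed-integer ndarray")
    n = arr.shape[0]
    i = 0
    # Main loop: inspect 8 elements per iteration, mirroring the 8x unroll.
    while i + 8 <= n:
        if (
            arr[i] == sentinel
            or arr[i + 1] == sentinel
            or arr[i + 2] == sentinel
            or arr[i + 3] == sentinel
            or arr[i + 4] == sentinel
            or arr[i + 5] == sentinel
            or arr[i + 6] == sentinel
            or arr[i + 7] == sentinel
        ):
            return True  # short-circuit: later elements are never read
        i += 8
    # Tail loop: the remaining 0-7 elements.
    while i < n:
        if arr[i] == sentinel:
            return True
        i += 1
    return False


indexer = np.array([0, 3, -1, 2], dtype=np.int64)
# Both forms should agree; only the amount of work done differs.
print(has_sentinel_reference(indexer, -1), bool((indexer == -1).any()))
```

Checking sentinels at every position for lengths around multiples of 8 is one way to exercise the tail-positioning edge cases mentioned in the test plan.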

Wire up the clearest (indexer == -1).any() call sites:

  • _MergeOperation._maybe_add_join_keys (non-inner merges)
  • _Unstacker.new_index (single-level unstack)
  • sorting.get_group_index / decons_obs_group_ids (groupby key lifting)
  • MultiIndex._get_indexer_strict NaN-key path
  • DataFrame.__setitem__ non-unique columns path

The idxmin/idxmax sites were intentionally left alone — they can receive an ExtensionArray (e.g. int64[pyarrow]) rather than a numpy array, and the fused memoryview helper doesn't accept those.

Benchmarks

Best-of-9 repeats × ~200 iters each, ARM64 (Apple clang). Full 39-case sweep; significant signals only (|z|>2 and |Δmean|>3%):

| Case | Δbest | Δmean |
| --- | --- | --- |
| `DataFrame.__setitem__` non-unique cols | −5% | −30% |
| groupby.count n=100K | −22% | −16% |
| unstack sparse 100×500 (gaps) | −21% | −14% |
| groupby.sum n=10K | −12% | −13% |
| groupby.sum n=100K | −12% | −13% |
| groupby.sum 2 keys n=10K | −13% | −10% |
| crosstab n=100K | +1% | −7% |
| groupby.sum n=1M | −5% | −6% |
| unstack dense int16 codes 500×500 | −5% | −4% |
| crosstab n=10K | −2% | −4% |
| MultiIndex.loc NaN key | −2% | −3% |
| merge(how='outer') 50% overlap n=100K | +1% | +2% |
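
For readers who want to reproduce this kind of filter, the sketch below shows one way the Δbest, Δmean and significance columns could be computed from per-repeat timings. The PR does not say how z was calculated, so the formula here (difference of means over the combined standard error) is an assumption. The helper names `collect` and `summarize` and the sample timings are made up for illustration.

```python
import math
import statistics
import timeit


def collect(func, repeat=9, number=200):
    """Per-call timings in seconds: `repeat` samples of `number` calls each."""
    return [t / number for t in timeit.repeat(func, repeat=repeat, number=number)]


def summarize(before, after):
    """Return (delta_best, delta_mean, z, significant) for two timing samples."""
    delta_best = min(after) / min(before) - 1.0
    mean_b, mean_a = statistics.mean(before), statistics.mean(after)
    delta_mean = mean_a / mean_b - 1.0
    # Assumed z-score: difference of means over the combined standard error.
    se = math.sqrt(
        statistics.variance(before) / len(before)
        + statistics.variance(after) / len(after)
    )
    diff = mean_a - mean_b
    if se > 0:
        z = diff / se
    else:
        z = math.copysign(math.inf, diff) if diff else 0.0
    # Thresholds quoted in the description: |z| > 2 and |delta_mean| > 3%.
    significant = abs(z) > 2 and abs(delta_mean) > 0.03
    return delta_best, delta_mean, z, significant


# Made-up timings, for illustration only (not measurements from this PR).
before = [1.00, 1.02, 0.99, 1.01, 1.00, 1.03, 0.98, 1.00, 1.01]
after = [0.86, 0.88, 0.85, 0.87, 0.86, 0.89, 0.85, 0.86, 0.87]
print(summarize(before, after))
```

In practice `before` and `after` would come from calling `collect` on the same workload on the base branch and on this branch.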

Only one significant regression: merge(how='outer') 50% overlap n=100K ≈ +2%. The root cause is the known mid-size SIMD gap: numpy's (arr == -1).any() does 2 int64 lanes/cycle, which our 8-wide scalar unroll can't match when the array fits in L2 and the first sentinel isn't near the start. The regression disappears at 1M+ (memory-bound) and at small sizes (where Python overhead dominates).

Test plan

  • pandas/tests/libs/test_lib.py + targeted correctness tests across int8/16/32/64, including all tail-positioning edge cases
  • Full regression sweep across frame/, series/, indexing/, reshape/, indexes/, libs/, groupby/ (~78K tests)

Notes

  • Kept the existing (ilocs < 0).any() form where values can also be < -1, and the (indices == -1).any() form on EA-typed res._values paths.
  • The .all() analog would require a companion all_sentinel helper; could follow up if useful, but the majority cluster in the codebase is .any().
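
The first note matters because the two checks are not interchangeable once an array can hold negatives other than -1. A small numpy illustration, with array contents invented for the example:

```python
import numpy as np

# Hypothetical indexer containing a negative value that is not the -1 sentinel.
ilocs = np.array([0, 4, -2, 7], dtype=np.intp)

has_minus_one = bool((ilocs == -1).any())  # equality test against the sentinel
has_negative = bool((ilocs < 0).any())     # range test, also catches -2

print(has_minus_one, has_negative)
```

An equality-based helper would miss the -2 here, which is why call sites using the `< 0` form cannot be switched to a sentinel-equality scan without changing behaviour.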

🤖 Generated with Claude Code

Add `lib.has_sentinel(arr, sentinel)` — a short-circuiting Cython helper
for the common `(arr == sentinel).any()` pattern on integer indexers,
with an 8x-unrolled inner loop over a fused int8/16/32/64 memoryview.

Wire up the clearest `(indexer == -1).any()` call sites:

- `merge._MergeOperation._maybe_add_join_keys` (outer/left/right merges)
- `reshape._Unstacker.new_index` (single-level unstack)
- `sorting.get_group_index` / `decons_obs_group_ids` (groupby)
- `MultiIndex._get_indexer_strict` NaN-key path
- `DataFrame.__setitem__` non-unique columns path

User-visible impact (best-of-9 repeats × ~200 iters, ARM64):

- groupby.sum / groupby.count at n >= 10K: -5% to -16%
- DataFrame[cols] = value with non-unique columns: ~-30%
- unstack with NaN-introducing gaps (100x500): -14%
- crosstab: -4% to -7%
- multi_loc_nan_key: -3%
- merge outer with no overlap (short-circuit fires immediately): -3% to -5%

One known regression: merge(how='outer') with ~50% overlap and n ~100K
sees ~+2% because the scan (length ~150K, first sentinel near the
middle) fits in L2 cache, where numpy's SIMD (arr == -1).any() beats
our scalar unroll. Unchanged at 1M+ (memory-bound) and small sizes
(Python overhead dominates).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jbrockmendel added the Performance label (Memory or execution speed performance) on Apr 19, 2026