perf: short-circuit COO.reshape when -1 resolves to self.shape#935

Merged
hameerabbasi merged 2 commits into pydata:main from thodson-usgs:perf/reshape-resolve-neg1-before-shortcircuit
Apr 22, 2026

Conversation

Contributor

@thodson-usgs thodson-usgs commented Apr 21, 2026

Summary

Two small changes to COO.reshape:

  1. Resolve any -1 in the target shape before the self.shape == shape short-circuit, rather than after.
  2. Replace any(d == -1 for d in shape) with -1 in shape (C-level tuple containment).
-        if self.shape == shape:
-            return self
-        if any(d == -1 for d in shape):
+        if -1 in shape:
             extra = int(self.size / np.prod([d for d in shape if d != -1]))
             shape = tuple([d if d != -1 else extra for d in shape])
+        if self.shape == shape:
+            return self

Why

sparse.tensordot reshapes its operands to 2D with calls like a.reshape((-1, K)). When a is already 2D with trailing dim K, the target shape equals a.shape — but the equality check runs before -1 is resolved, so it doesn't match. The full reshape path (linear_loc(), coord rebuild, new COO allocation) then runs and produces a copy identical to the input.

Resolving -1 first lets the short-circuit catch this case and return self. The -1 in shape swap is a readability / micro-perf nit that matters here because unconditionally checking for -1 before the equality short-circuit (which the first change requires) would otherwise add ~200 ns to every reshape — including the exact-shape case that was already hitting the short-circuit.
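The resolution step itself is small; here is a sketch in plain Python (`resolve_neg1` is a hypothetical standalone helper mirroring the resolution logic in the diff above, not the library's actual method):

```python
import numpy as np

def resolve_neg1(shape, size):
    """Hypothetical helper: replace a single -1 with the dim implied by size."""
    if -1 in shape:
        extra = int(size / np.prod([d for d in shape if d != -1]))
        shape = tuple(d if d != -1 else extra for d in shape)
    return shape

# The hot tensordot pattern: for a 2D array with trailing dim K,
# (-1, K) resolves back to the original shape, so an equality check
# placed *after* resolution can return self.
print(resolve_neg1((-1, 300), 200 * 300))   # (200, 300)
print(resolve_neg1((200, 300), 200 * 300))  # no -1 present: unchanged
```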

Semantics

Strictly fewer copies:

  • Any call that previously returned self still does (exact-shape input has no -1, so the resolution step is a no-op and the equality check sees the same shape).
  • reshape((-1, ...)) that resolves to the current shape now also returns self. This is consistent with the documented behavior (test_reshape_same codifies s.reshape(s.shape) is s) and matches NumPy, which similarly does not promise a copy from ndarray.reshape when the target is equivalent.
  • Error paths are unchanged — the size-mismatch ValueError below still runs with the resolved shape.
  • -1 in shape is equivalent to any(d == -1 for d in shape) because shape is already a tuple of ints by this point.
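The NumPy comparison in the second bullet can be checked directly: for a contiguous array, ndarray.reshape with a target that resolves to the current shape hands back a view, not a copy, so callers relying on a fresh buffer were already on shaky ground. A minimal check:

```python
import numpy as np

a = np.zeros((200, 300))
b = a.reshape((-1, 300))  # (-1, 300) resolves to a.shape

# NumPy returns a view here: writes through b are visible in a.
b[0, 0] = 1.0
assert a[0, 0] == 1.0
assert b.base is a
```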

Minimal repro

Run with the project's pixi env (`pixi run -e test python <file>`), or standalone with `uv run --with sparse --with numpy python <file>`:

import time
import numpy as np
import sparse

# Moderately sparse matrix; shape picked so (-1, 300) resolves to self.shape
a = sparse.random((200, 300), density=0.02, random_state=0)
N = 20_000

t0 = time.perf_counter()
for _ in range(N):
    a.reshape((-1, 300))          # the hot tensordot pattern
dt_neg1 = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(N):
    a.reshape(a.shape)            # already-short-circuited path for comparison
dt_exact = time.perf_counter() - t0

print(f"reshape((-1, 300)): {dt_neg1*1e6/N:6.1f} us/call")
print(f"reshape(a.shape):   {dt_exact*1e6/N:6.1f} us/call")

# Correctness: -1 resolution returning self is consistent with exact-shape
assert a.reshape((-1, 300)) is a
assert a.reshape(a.shape) is a

Measured on this script (macOS / Python 3.13, median of 3 runs):

             reshape((-1, 300))   reshape(a.shape)
main         22.7 μs/call         0.2 μs/call
this PR       2.3 μs/call         0.2 μs/call

~10× faster on the (-1, K) pattern, exact-shape path unchanged. All 6050 existing numba-backend tests pass.

COO.reshape returns self when self.shape equals the requested shape,
but only checks before resolving any -1 in the target. sparse.tensordot
passes shapes like (-1, N) even for 2D x 2D matmul that doesn't actually
change shape, so the short-circuit never fires and a full reshape runs
(linear_loc + coord rebuild + new COO allocation).

Moving the -1 resolution before the equality check avoids that work for
callers that pass a -1 factorization of the current shape. Behavior is
a strict subset ("fewer copies"): any reshape that already returned
self before still does, and reshape((-1, ...)) that resolves to the
current shape now also returns self, matching the documented contract.

Measured ~16% speedup on a warm conservative-regrid loop (xarray-regrid
ConservativeRegridder.regrid) whose tensordot call sits on the hot path;
bit-identical output. All 6050 existing numba-backend tests pass.
hameerabbasi previously approved these changes Apr 21, 2026
@hameerabbasi
Collaborator

Thanks, @thodson-usgs!

Replace `any(d == -1 for d in shape)` with `-1 in shape`. The latter is
a C-level tuple containment, the former a Python-level generator.

On this machine (micro): 221 ns -> 45 ns per check.

End-to-end on the PR's repro (median of 3):
  reshape(a.shape):  0.4 us -> 0.2 us  (matches main; erases the regression
                                        introduced by running the -1 check
                                        unconditionally)
  reshape((-1, K)):  2.7 us -> 2.3 us  (small incremental win)

Pure readability / perf nit; semantics are identical since shape is
already a tuple of ints by the line above.
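The micro-numbers above are machine-dependent, but the check is easy to reproduce with the stdlib (`timeit`; the equivalence assertion holds because shape is a plain tuple of ints at this point):

```python
import timeit

shape = (200, 300)  # a plain tuple of ints, as in COO.reshape at this point

# Both checks give the same answer; only the dispatch differs:
# `in` is a C-level tuple containment scan, any(...) drives a Python generator.
assert (-1 in shape) == any(d == -1 for d in shape)

n = 1_000_000
t_any = timeit.timeit(lambda: any(d == -1 for d in shape), number=n)
t_in = timeit.timeit(lambda: -1 in shape, number=n)
print(f"any(d == -1 ...): {t_any / n * 1e9:5.0f} ns/check")
print(f"-1 in shape:      {t_in / n * 1e9:5.0f} ns/check")
```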
@thodson-usgs
Contributor Author

Sorry @hameerabbasi, I noticed a tiny performance regression for reshape(a.shape): 0.2 -> 0.4 μs/call. Claude found a simplification that got this back to 0.2 μs/call.

@thodson-usgs thodson-usgs marked this pull request as ready for review April 21, 2026 20:24

codspeed-hq Bot commented Apr 21, 2026

Merging this PR will degrade performance by 18.04%

⚡ 20 improved benchmarks
❌ 2 regressed benchmarks
✅ 318 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Benchmark BASE HEAD Efficiency
test_gcxs_dot_ndarray[coo-m=200-n=200-p=200] 1.8 ms 1.6 ms +15.33%
test_gcxs_dot_ndarray[coo-m=200-n=500-p=200] 2.4 ms 2.2 ms +11.14%
test_gcxs_dot_ndarray[coo-m=500-n=200-p=200] 2.5 ms 2.2 ms +10.82%
test_index_fancy[side=100-rank=1-format='coo'] 1.2 ms 1.4 ms -16.97%
test_index_slice[side=100-rank=2-format='gcxs'] 2.2 ms 2.7 ms -18.04%
test_matmul[m=1000-n=1000-p=200-format='coo'] 9.6 ms 8.3 ms +15.34%
test_matmul[m=200-n=1000-p=200-format='coo'] 3.7 ms 3 ms +22.93%
test_matmul[m=1000-n=200-p=200-format='coo'] 3.6 ms 3 ms +19.29%
test_matmul[m=200-n=1000-p=1000-format='coo'] 10.4 ms 9.2 ms +13.17%
test_matmul[m=1000-n=500-p=200-format='coo'] 5.8 ms 5 ms +16.51%
test_matmul[m=200-n=200-p=1000-format='coo'] 4.4 ms 3.8 ms +14.74%
test_matmul[m=200-n=200-p=500-format='coo'] 3.2 ms 2.7 ms +18.49%
test_matmul[m=200-n=500-p=200-format='coo'] 2.8 ms 2.3 ms +22.87%
test_matmul[m=200-n=1000-p=500-format='coo'] 6.3 ms 5.4 ms +16.9%
test_matmul[m=200-n=200-p=200-format='coo'] 2.3 ms 1.9 ms +24.31%
test_matmul[m=200-n=500-p=1000-format='coo'] 6.7 ms 5.9 ms +14.02%
test_matmul[m=200-n=500-p=500-format='coo'] 4.4 ms 3.7 ms +17.54%
test_matmul[m=500-n=1000-p=200-format='coo'] 5.9 ms 5 ms +18.39%
test_matmul[m=500-n=200-p=200-format='coo'] 2.8 ms 2.3 ms +21.29%
test_matmul[m=500-n=500-p=200-format='coo'] 4 ms 3.3 ms +19.23%
... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.


Comparing thodson-usgs:perf/reshape-resolve-neg1-before-shortcircuit (487809e) with main (f65a764)

Open in CodSpeed

@hameerabbasi hameerabbasi enabled auto-merge (squash) April 22, 2026 07:20
@hameerabbasi
Collaborator

> Sorry @hameerabbasi, I noticed a tiny performance regression for reshape(a.shape): 0.2 -> 0.4 μs/call. Claude found a simplification that got this back to 0.2 μs/call.

Thanks for being thorough! I've kicked off CI and auto-merge once more.

@hameerabbasi hameerabbasi merged commit fa58c7e into pydata:main Apr 22, 2026
14 of 16 checks passed
