cranelift: fold ctz/clz comparisons against zero into direct LSB / sign-bit tests #13332
…gn-bit tests

When the result of a count-trailing/leading-zeros instruction is fed into a comparison against zero (the only thing the consumer cares about is whether the count is zero, not its numeric value), rewrite to test the corresponding bit of X directly:

ctz(X) == 0 iff LSB of X is set iff (X & 1) != 0
clz(X) == 0 iff MSB of X is set iff X is signed-negative

The bit-counting instruction can then be DCE'd. Backend emits a single-cycle `test reg, imm` (LSB case) or `test reg, reg; js` (sign case) instead of TZCNT/BSF/LZCNT/BSR + TEST + JCC, which saves ~3 cycles of latency on Intel x86_64 per occurrence and removes the false GPR dependency. JIT-less backends benefit even more: their bit-counting paths are typically loops.

Motivated by the converse wasm-side peephole in WebAssembly/binaryen#8562 (LSB→ctz fold under -Os for byte savings). With these mid-end rules in place, that fold is cycle-neutral on cranelift JITs even when fed unconditionally.

Filetest covers i32/i64 ctz and clz in both eq and ne forms plus a negative case (ctz(X) == 4 must NOT trigger; that's a numeric-value test on the count, a different rewrite family).
1734a52 to
30531ed
Sketched an extension to also catch the wasm-emitted shape Scope creep: the natural place is each backend's would lower That's 4× backend files + filetests, different reviewers per arch, and a different review audience from this egraph PR. Punting on the amendment and filing a separate follow-up instead. |
Concrete real-world workload for the So the JIT-side fold here is the natural meeting point for classical-Motoko output: clz directly into brif, no icmp, no ireduce. The rules in this PR don't yet catch that shape (it's |
This issue or pull request has been labeled: "cranelift", "isle"
Thanks for this! Are the backend rules necessary with the mid-end egraph rules? I'd expect that the egraph rewrites would be sufficient and the backends largely wouldn't need to change, unless they need to emit a new instruction pattern which isn't currently recognized. One thing you may also want to do is to add something in |
Exercises three consumers (if, select, eqz) over the icmp-mediated shapes the egraph rewrites in `cranelift/codegen/src/opts/icmp.isle` target: `(ctz X) == 0`, `(ctz X) != 0`, and the analogous clz forms, across i32/i64.

The blessed disassembly shows:

- icmp-mediated cases collapse to a single bit test (`testl $1, %edx; jne` for ctz, `testl %edx, %edx; jl` for clz).
- a bare `if (ctz X)` / `if (clz X)` form (no icmp interposed, i.e. the wasm-natural shape produced by frontends like Motoko's `moc`) compiles to full bsf+cmov+test or bsr+cmov+sub+test, since the brif's implicit zero-test is not visible to the value-level egraph rules.
- `(ctz X) == 4` (numeric, not boolean) correctly stays as bsf+cmp+je: the rules don't over-fire.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Added tests/disas/ctz-clz-bool-condition.wat (commit 0519796) per your suggestion; covers

Empirical answer to your first question: the egraph rules in this PR are complete for the icmp-mediated case but don't catch the wasm-natural bare form.

icmp-mediated (this PR's target, collapses correctly):

;; if_ctz_eq0_i32: testl $1, %edx; jne
;; select_ctz_eq0_i32: testl $1, %edx; cmovne
;; eqz_ctz_eq0_i32: testl $1, %edx; sete
;; if_clz_eq0_i32: testl %edx, %edx; jl

bare:

;; if_ctz_bare_i32: bsfl %edx, %r9d; cmovel; testl %r9d, %r9d; jne
;; if_clz_bare_i32: bsrl + cmovel + 0x1f-sub + test ;; 5 insns

negative test (correctly not collapsed, which confirms the rules don't over-fire on numeric comparisons):

;; if_ctz_eq4_i32: bsfl %edx, %r9d; cmpl $4, %r9d; je

So the egraph rules cover their intended shape end-to-end. The bare form would need backend
alexcrichton
left a comment
Ah yes good point, and yeah sounds good to me. Thanks for your work here! Happy to review backend-specific changes as well
Four mid-end ISLE rules in `opts/icmp.isle` for the boolean-context cases — when `ctz`/`clz` flows into a comparison against zero (the consumer cares only "is it zero?", not the numeric value):
```
ctz(X) == 0 iff (X & 1) != 0 ; LSB of X set
ctz(X) != 0 iff (X & 1) == 0 ; LSB of X clear
clz(X) == 0 iff X <signed 0 ; MSB of X set (X is negative)
clz(X) != 0 iff X >=signed 0 ; MSB of X clear (X is non-negative)
```
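The four equivalences can be sanity-checked outside the compiler. The following is a minimal Rust sketch, not code from this PR: the function names are hypothetical, and it uses `u32::trailing_zeros` / `u32::leading_zeros`, which return 32 for an input of zero, matching the wasm `i32.ctz` / `i32.clz` semantics the rules assume.

```rust
/// Reference forms: a bit count compared against zero.
pub fn ctz_is_zero(x: u32) -> bool {
    x.trailing_zeros() == 0
}

pub fn clz_is_zero(x: u32) -> bool {
    x.leading_zeros() == 0
}

/// Rewritten forms: direct bit tests with no bit counting.
pub fn lsb_set(x: u32) -> bool {
    (x & 1) != 0
}

pub fn sign_set(x: u32) -> bool {
    (x as i32) < 0
}

/// Returns true when all four equivalences from the table hold for `x`.
pub fn identities_hold(x: u32) -> bool {
    ctz_is_zero(x) == lsb_set(x)
        && (x.trailing_zeros() != 0) == ((x & 1) == 0)
        && clz_is_zero(x) == sign_set(x)
        && (x.leading_zeros() != 0) == ((x as i32) >= 0)
}
```

Zero is the edge case worth checking by hand: both counts are 32 there, so both `== 0` comparisons are false, which agrees with a clear LSB and a non-negative value.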
The bit-counting instruction is DCE'd; backend emits a single-cycle `test reg, imm` (LSB case) or `test reg, reg; js` (sign case) instead of `TZCNT/BSF/LZCNT/BSR` + `TEST` + `JCC` — saves ~3 cycles per occurrence on Intel x86_64 (TZCNT/LZCNT are 3-cycle latency with a false GPR dep), proportionally more on JIT-less backends.
Why this matters in practice
The poster-child workload is the Motoko runtime's discriminator test on every `Nat`/`Int` operation:
Every arithmetic op begins with this LSB test. The Motoko codegen (`src/codegen/instrList.ml:97-100`) already emits the LSB-test-of-AND-1 pattern as `(ctz X)` — unconditionally, no flag gate — so every moc-compiled wasm running on wasmtime today does TZCNT + TEST + JCC on the hot path of every numeric op. The Rust RTS / GC paths that work on the same tagged pointer scheme see the same pattern.
With these rules in place, cranelift collapses the comparison back to a single `test r, 1` — restoring the original cost of the discriminator and unlocking measurable speed-ups for every Motoko canister on a wasmtime-based IC subnet (and any other wasm that produces this shape).
The clz / sign-bit half exists for the same reason on the rare paths that test sign before dispatching; structurally parallel rewrite, ships in the same patch.
The converse fold on the wasm-byte-savings side is in WebAssembly/binaryen#8562 (LSB→ctz under `-Os`); landing it there together with this in cranelift gives byte savings without cycle cost.
Filetest covers i32/i64 ctz and clz in both eq and ne forms plus a negative case (`ctz(X) == 4` must not trigger — that's a numeric-value test on the count, a different rewrite family).
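To illustrate why `ctz(X) == 4` belongs to a different rewrite family, here is a hedged Rust sketch (not part of this PR; `ctz_equals` is a hypothetical helper). Comparing the count against a nonzero constant `k` constrains the low `k + 1` bits of X, so an equivalent direct form is a masked compare rather than a single-bit test.

```rust
/// Assumed equivalence for 0 <= k <= 31:
/// ctz(x) == k  iff  bits 0..k are clear and bit k is set,
/// i.e. the low k+1 bits of x equal 1 << k.
pub fn ctz_equals(x: u32, k: u32) -> bool {
    assert!(k < 32, "k must be a valid bit index for u32");
    // Mask covering bits 0..=k, built without overflowing at k == 31.
    let mask = u32::MAX >> (31 - k);
    (x & mask) == (1u32 << k)
}
```

For `k == 0` the mask is `1` and the compare degenerates to the LSB test that the rules in this PR produce; any larger `k` needs the wider mask, which is why the negative case must keep its `bsf` + `cmp` shape under the current rules.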