[X86] Use shift+add/sub for vXi8 splat multiplies by grodranlorth · Pull Request #174110 · llvm/llvm-project

grodranlorth · 2025-12-31T20:47:29Z

~~I will create a separate PR to the llvm-test-suite repo for the microbenchmark for this change.~~ The benchmark is in llvm/llvm-test-suite#316

In my experiments on an EC2 c6i.4xl, the change gives a small improvement for the x86-64, x86-64-v2, and x86-64-v3 targets. It regresses performance on x86-64-v4 (in particular, when the constant decomposes into two shifts). The performance summary follows:

$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v1.json results-opt-generic-v1.json  |tail -n1
OVERALL_GEOMEAN                         -0.2846         -0.2846             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v2.json results-opt-generic-v2.json  |tail -n1
OVERALL_GEOMEAN                         -0.0907         -0.0907             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v3.json results-opt-generic-v3.json  |tail -n1
OVERALL_GEOMEAN                         -0.1821         -0.1821             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v4.json results-opt-generic-v4.json  |tail -n1
OVERALL_GEOMEAN                         +0.0190         +0.0190             0             0             0             0

And avoid extending to 16 bits

github-actions · 2025-12-31T20:47:48Z

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

llvmbot · 2025-12-31T20:48:15Z

@llvm/pr-subscribers-backend-x86

Author: Cody Cutler (grodranlorth)

Changes

Issue #164200

I will create a separate PR to the llvm-test-suite repo for the microbenchmark for this change.

In my experiments on an EC2 c6i.4xl, the change gives a small improvement for the x86-64, x86-64-v2, and x86-64-v3 targets. It regresses performance on x86-64-v4 (in particular, when the constant decomposes into two shifts). The performance summary follows:

$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v1.json results-opt-generic-v1.json  |tail -n1
OVERALL_GEOMEAN                         -0.2846         -0.2846             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v2.json results-opt-generic-v2.json  |tail -n1
OVERALL_GEOMEAN                         -0.0907         -0.0907             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v3.json results-opt-generic-v3.json  |tail -n1
OVERALL_GEOMEAN                         -0.1821         -0.1821             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v4.json results-opt-generic-v4.json  |tail -n1
OVERALL_GEOMEAN                         +0.0190         +0.0190             0             0             0             0

Patch is 81.69 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/174110.diff

6 Files Affected:

(modified) llvm/include/llvm/CodeGen/TargetLowering.h (+17)
(modified) llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (+24)
(modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+59)
(modified) llvm/lib/Target/X86/X86ISelLowering.h (+2)
(added) llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll (+1231)
(added) llvm/test/CodeGen/X86/vector-mul-i8-negative.ll (+466)

diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 8ad64a852b74d..7594b487b9666 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -2538,6 +2538,23 @@ class LLVM_ABI TargetLoweringBase {
     return false;
   }
 
+  /// Structure to hold detailed decomposition of multiply by constant.
+  struct MulByConstInfo {
+    bool IsDecomposable = false;
+    bool Negate = false;    // True if result should be negated
+    unsigned NumShifts = 0; // 1 or 2
+    unsigned Shift1 = 0;    // Primary shift amount
+    unsigned Shift2 = 0;    // Secondary shift amount (for 2-shift case)
+    bool IsSub = false;     // True for SUB, false for ADD (for 2-shift case)
+  };
+
+  /// Get detailed decomposition of multiply by constant if available.
+  /// Returns decomposition info if the target has a custom decomposition
+  /// for this multiply-by-constant, otherwise returns IsDecomposable = false.
+  virtual MulByConstInfo getMulByConstInfo(EVT VT, const APInt &C) const {
+    return MulByConstInfo();
+  }
+
   /// Return true if it may be profitable to transform
   /// (mul (add x, c1), c2) -> (add (mul x, c2), c1*c2).
   /// This may not be true if c1 and c2 can be represented as immediates but
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 74d00317c3649..26ff36a768167 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -4836,6 +4836,30 @@ template <class MatchContextClass> SDValue DAGCombiner::visitMUL(SDNode *N) {
   //           x * -0xf800 --> -((x << 16) - (x << 11)) ; (x << 11) - (x << 16)
   if (!UseVP && N1IsConst &&
       TLI.decomposeMulByConstant(*DAG.getContext(), VT, N1)) {
+    // First check if target has custom decomposition info
+    TargetLowering::MulByConstInfo Info =
+        TLI.getMulByConstInfo(VT, ConstValue1);
+    if (Info.IsDecomposable) {
+      // Emit custom decomposition based on target's info
+      SDValue Result;
+      if (Info.NumShifts == 1) {
+        // Single shift: result = N0 << Shift1
+        Result = DAG.getNode(ISD::SHL, DL, VT, N0,
+                             DAG.getConstant(Info.Shift1, DL, VT));
+      } else if (Info.NumShifts == 2) {
+        // Two shifts with add or sub
+        SDValue Shl1 = DAG.getNode(ISD::SHL, DL, VT, N0,
+                                   DAG.getConstant(Info.Shift1, DL, VT));
+        SDValue Shl2 = DAG.getNode(ISD::SHL, DL, VT, N0,
+                                   DAG.getConstant(Info.Shift2, DL, VT));
+        Result =
+            DAG.getNode(Info.IsSub ? ISD::SUB : ISD::ADD, DL, VT, Shl1, Shl2);
+      }
+      if (Info.Negate)
+        Result = DAG.getNegative(Result, DL, VT);
+      return Result;
+    }
+
     // TODO: We could handle more general decomposition of any constant by
     //       having the target set a limit on number of ops and making a
     //       callback to determine that sequence (similar to sqrt expansion).
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 20136ade7c317..f75ce5a53188f 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -3389,6 +3389,12 @@ bool X86TargetLowering::decomposeMulByConstant(LLVMContext &Context, EVT VT,
   if (!ISD::isConstantSplatVector(C.getNode(), MulC))
     return false;
 
+  // Check if this is an 8-bit vector multiply that can be decomposed to shifts.
+  if (VT.isVector() && VT.getScalarSizeInBits() == 8) {
+    if (getMulByConstInfo(VT, MulC).IsDecomposable)
+      return true;
+  }
+
   // Find the type this will be legalized too. Otherwise we might prematurely
   // convert this to shl+add/sub and then still have to type legalize those ops.
   // Another choice would be to defer the decision for illegal types until
@@ -3413,6 +3419,59 @@ bool X86TargetLowering::decomposeMulByConstant(LLVMContext &Context, EVT VT,
          (1 - MulC).isPowerOf2() || (-(MulC + 1)).isPowerOf2();
 }
 
+TargetLowering::MulByConstInfo
+X86TargetLowering::getMulByConstInfo(EVT VT, const APInt &Constant) const {
+  // Only handle 8-bit vector multiplies
+  if (!VT.isVector() || VT.getScalarSizeInBits() != 8 ||
+      Constant.getBitWidth() < 8)
+    return MulByConstInfo();
+
+  TargetLowering::MulByConstInfo Info;
+  int8_t SignedC = static_cast<int8_t>(Constant.getZExtValue());
+  Info.Negate = SignedC < 0;
+
+  uint32_t U = static_cast<uint8_t>(Info.Negate ? -SignedC : SignedC);
+  if (U == 0 || U == 1)
+    return Info;
+
+  // Power of 2.
+  if (isPowerOf2_32(U)) {
+    Info.Shift1 = llvm::countr_zero(U);
+    Info.NumShifts = 1;
+    Info.IsDecomposable = true;
+    return Info;
+  }
+
+  // Decomposition logic:
+  //   m = 2^x + 2^y  => (shl x, x) + (shl x, y)
+  //   m = 2^x - 2^y  => (shl x, x) - (shl x, y)
+  // where 2^y is the lowest set bit.
+  uint32_t LowBit = U & (0U - U);
+  unsigned Shift2 = llvm::countr_zero(LowBit);
+
+  uint32_t Rem = U - LowBit;
+  if (isPowerOf2_32(Rem)) {
+    Info.Shift1 = llvm::countr_zero(Rem);
+    Info.Shift2 = Shift2;
+    Info.IsSub = false;
+    Info.NumShifts = 2;
+    Info.IsDecomposable = true;
+    return Info;
+  }
+
+  uint32_t Sum = U + LowBit;
+  if (Sum <= 0xFF && isPowerOf2_32(Sum)) {
+    Info.Shift1 = llvm::countr_zero(Sum);
+    Info.Shift2 = Shift2;
+    Info.IsSub = true;
+    Info.NumShifts = 2;
+    Info.IsDecomposable = true;
+    return Info;
+  }
+
+  return Info;
+}
+
 bool X86TargetLowering::isExtractSubvectorCheap(EVT ResVT, EVT SrcVT,
                                                 unsigned Index) const {
   if (!isOperationLegalOrCustom(ISD::EXTRACT_SUBVECTOR, ResVT))
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index a528c311975d8..24372598aaf53 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1537,6 +1537,8 @@ namespace llvm {
     bool decomposeMulByConstant(LLVMContext &Context, EVT VT,
                                 SDValue C) const override;
 
+    MulByConstInfo getMulByConstInfo(EVT VT, const APInt &C) const override;
+
     /// Return true if EXTRACT_SUBVECTOR is cheap for this result type
     /// with this index.
     bool isExtractSubvectorCheap(EVT ResVT, EVT SrcVT,
diff --git a/llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll b/llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll
new file mode 100644
index 0000000000000..9648352ebc2a9
--- /dev/null
+++ b/llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll
@@ -0,0 +1,1231 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-tuning-fast-imm-vector-shift | FileCheck %s --check-prefixes=CHECK,SSE2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-tuning-fast-imm-vector-shift,+avx2 | FileCheck %s --check-prefixes=CHECK,AVX2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-tuning-fast-imm-vector-shift,+avx512f,+avx512bw | FileCheck %s --check-prefixes=CHECK,AVX512
+
+;; Tests vXi8 constant-multiply decomposition into shift/add/sub sequences.
+;;
+;; Examples:
+;;   6 = 2^2 + 2^1 = 4 + 2  (or 8 - 2)
+;;   10 = 2^3 + 2^1 = 8 + 2
+;;   12 = 2^3 + 2^2 = 8 + 4  (or 16 - 4)
+;;   18 = 2^4 + 2^1 = 16 + 2
+;;   20 = 2^4 + 2^2 = 16 + 4
+;;   24 = 2^4 + 2^3 = 16 + 8  (or 32 - 8)
+;;
+;; To run this test:
+;;   llvm-lit llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll
+;;
+;; To regenerate CHECK lines:
+;;   python llvm/utils/update_llc_test_checks.py llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll
+
+;; ============================================================================
+;; v16i8 Tests (128-bit vectors) - Sum of two powers of 2
+;; ============================================================================
+
+define <16 x i8> @mul_v16i8_const6(<16 x i8> %a) nounwind {
+; Test multiply by 6 = 4 + 2 = (1 << 2) + (1 << 1)
+; SSE2-LABEL: mul_v16i8_const6:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    paddb %xmm0, %xmm1
+; SSE2-NEXT:    psllw $2, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const6:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX2-NEXT:    vpsllw $2, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const6:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX512-NEXT:    vpsllw $2, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const10(<16 x i8> %a) nounwind {
+; Test multiply by 10 = 8 + 2 = (1 << 3) + (1 << 1)
+; SSE2-LABEL: mul_v16i8_const10:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    paddb %xmm0, %xmm1
+; SSE2-NEXT:    psllw $3, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const10:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX2-NEXT:    vpsllw $3, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const10:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX512-NEXT:    vpsllw $3, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const12(<16 x i8> %a) nounwind {
+; Test multiply by 12 = 8 + 4 = (1 << 3) + (1 << 2)
+; SSE2-LABEL: mul_v16i8_const12:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $2, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $3, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const12:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $3, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const12:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $3, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const18(<16 x i8> %a) nounwind {
+; Test multiply by 18 = 16 + 2 = (1 << 4) + (1 << 1)
+; SSE2-LABEL: mul_v16i8_const18:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    paddb %xmm0, %xmm1
+; SSE2-NEXT:    psllw $4, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const18:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX2-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const18:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX512-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const20(<16 x i8> %a) nounwind {
+; Test multiply by 20 = 16 + 4 = (1 << 4) + (1 << 2)
+; SSE2-LABEL: mul_v16i8_const20:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $2, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $4, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const20:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const20:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const24(<16 x i8> %a) nounwind {
+; Test multiply by 24 = 16 + 8 = (1 << 4) + (1 << 3)
+; SSE2-LABEL: mul_v16i8_const24:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $3, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $4, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const24:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $3, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const24:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $3, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const34(<16 x i8> %a) nounwind {
+; Test multiply by 34 = 32 + 2 = (1 << 5) + (1 << 1)
+; SSE2-LABEL: mul_v16i8_const34:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    paddb %xmm0, %xmm1
+; SSE2-NEXT:    psllw $5, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const34:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX2-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const34:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX512-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const36(<16 x i8> %a) nounwind {
+; Test multiply by 36 = 32 + 4 = (1 << 5) + (1 << 2)
+; SSE2-LABEL: mul_v16i8_const36:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $2, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $5, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const36:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const36:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const40(<16 x i8> %a) nounwind {
+; Test multiply by 40 = 32 + 8 = (1 << 5) + (1 << 3)
+; SSE2-LABEL: mul_v16i8_const40:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $3, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $5, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const40:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $3, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const40:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $3, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const48(<16 x i8> %a) nounwind {
+; Test multiply by 48 = 32 + 16 = (1 << 5) + (1 << 4)
+; SSE2-LABEL: mul_v16i8_const48:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $4, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $5, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const48:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $4, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const48:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $4, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48>
+  ret <16 x i...
[truncated]

llvmbot · 2025-12-31T20:48:16Z

@llvm/pr-subscribers-llvm-selectiondag

Author: Cody Cutler (grodranlorth)

Changes

Issue #164200

I will create a separate PR to the llvm-test-suite repo for the microbenchmark for this change.

In my experiments on an EC2 c6i.4xl, the change gives a small improvement for the x86-64, x86-64-v2, and x86-64-v3 targets. It regresses performance on x86-64-v4 (in particular, when the constant decomposes into two shifts). The performance summary follows:

$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v1.json results-opt-generic-v1.json  |tail -n1
OVERALL_GEOMEAN                         -0.2846         -0.2846             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v2.json results-opt-generic-v2.json  |tail -n1
OVERALL_GEOMEAN                         -0.0907         -0.0907             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v3.json results-opt-generic-v3.json  |tail -n1
OVERALL_GEOMEAN                         -0.1821         -0.1821             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v4.json results-opt-generic-v4.json  |tail -n1
OVERALL_GEOMEAN                         +0.0190         +0.0190             0             0             0             0

Patch is 81.69 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/174110.diff

6 Files Affected:

(modified) llvm/include/llvm/CodeGen/TargetLowering.h (+17)
(modified) llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (+24)
(modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+59)
(modified) llvm/lib/Target/X86/X86ISelLowering.h (+2)
(added) llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll (+1231)
(added) llvm/test/CodeGen/X86/vector-mul-i8-negative.ll (+466)

diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 8ad64a852b74d..7594b487b9666 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -2538,6 +2538,23 @@ class LLVM_ABI TargetLoweringBase {
     return false;
   }
 
+  /// Structure to hold detailed decomposition of multiply by constant.
+  struct MulByConstInfo {
+    bool IsDecomposable = false;
+    bool Negate = false;    // True if result should be negated
+    unsigned NumShifts = 0; // 1 or 2
+    unsigned Shift1 = 0;    // Primary shift amount
+    unsigned Shift2 = 0;    // Secondary shift amount (for 2-shift case)
+    bool IsSub = false;     // True for SUB, false for ADD (for 2-shift case)
+  };
+
+  /// Get detailed decomposition of multiply by constant if available.
+  /// Returns decomposition info if the target has a custom decomposition
+  /// for this multiply-by-constant, otherwise returns IsDecomposable = false.
+  virtual MulByConstInfo getMulByConstInfo(EVT VT, const APInt &C) const {
+    return MulByConstInfo();
+  }
+
   /// Return true if it may be profitable to transform
   /// (mul (add x, c1), c2) -> (add (mul x, c2), c1*c2).
   /// This may not be true if c1 and c2 can be represented as immediates but
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 74d00317c3649..26ff36a768167 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -4836,6 +4836,30 @@ template <class MatchContextClass> SDValue DAGCombiner::visitMUL(SDNode *N) {
   //           x * -0xf800 --> -((x << 16) - (x << 11)) ; (x << 11) - (x << 16)
   if (!UseVP && N1IsConst &&
       TLI.decomposeMulByConstant(*DAG.getContext(), VT, N1)) {
+    // First check if target has custom decomposition info
+    TargetLowering::MulByConstInfo Info =
+        TLI.getMulByConstInfo(VT, ConstValue1);
+    if (Info.IsDecomposable) {
+      // Emit custom decomposition based on target's info
+      SDValue Result;
+      if (Info.NumShifts == 1) {
+        // Single shift: result = N0 << Shift1
+        Result = DAG.getNode(ISD::SHL, DL, VT, N0,
+                             DAG.getConstant(Info.Shift1, DL, VT));
+      } else if (Info.NumShifts == 2) {
+        // Two shifts with add or sub
+        SDValue Shl1 = DAG.getNode(ISD::SHL, DL, VT, N0,
+                                   DAG.getConstant(Info.Shift1, DL, VT));
+        SDValue Shl2 = DAG.getNode(ISD::SHL, DL, VT, N0,
+                                   DAG.getConstant(Info.Shift2, DL, VT));
+        Result =
+            DAG.getNode(Info.IsSub ? ISD::SUB : ISD::ADD, DL, VT, Shl1, Shl2);
+      }
+      if (Info.Negate)
+        Result = DAG.getNegative(Result, DL, VT);
+      return Result;
+    }
+
     // TODO: We could handle more general decomposition of any constant by
     //       having the target set a limit on number of ops and making a
     //       callback to determine that sequence (similar to sqrt expansion).
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 20136ade7c317..f75ce5a53188f 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -3389,6 +3389,12 @@ bool X86TargetLowering::decomposeMulByConstant(LLVMContext &Context, EVT VT,
   if (!ISD::isConstantSplatVector(C.getNode(), MulC))
     return false;
 
+  // Check if this is an 8-bit vector multiply that can be decomposed to shifts.
+  if (VT.isVector() && VT.getScalarSizeInBits() == 8) {
+    if (getMulByConstInfo(VT, MulC).IsDecomposable)
+      return true;
+  }
+
   // Find the type this will be legalized too. Otherwise we might prematurely
   // convert this to shl+add/sub and then still have to type legalize those ops.
   // Another choice would be to defer the decision for illegal types until
@@ -3413,6 +3419,59 @@ bool X86TargetLowering::decomposeMulByConstant(LLVMContext &Context, EVT VT,
          (1 - MulC).isPowerOf2() || (-(MulC + 1)).isPowerOf2();
 }
 
+TargetLowering::MulByConstInfo
+X86TargetLowering::getMulByConstInfo(EVT VT, const APInt &Constant) const {
+  // Only handle 8-bit vector multiplies
+  if (!VT.isVector() || VT.getScalarSizeInBits() != 8 ||
+      Constant.getBitWidth() < 8)
+    return MulByConstInfo();
+
+  TargetLowering::MulByConstInfo Info;
+  int8_t SignedC = static_cast<int8_t>(Constant.getZExtValue());
+  Info.Negate = SignedC < 0;
+
+  uint32_t U = static_cast<uint8_t>(Info.Negate ? -SignedC : SignedC);
+  if (U == 0 || U == 1)
+    return Info;
+
+  // Power of 2.
+  if (isPowerOf2_32(U)) {
+    Info.Shift1 = llvm::countr_zero(U);
+    Info.NumShifts = 1;
+    Info.IsDecomposable = true;
+    return Info;
+  }
+
+  // Decomposition logic:
+  //   m = 2^x + 2^y  => (shl x, x) + (shl x, y)
+  //   m = 2^x - 2^y  => (shl x, x) - (shl x, y)
+  // where 2^y is the lowest set bit.
+  uint32_t LowBit = U & (0U - U);
+  unsigned Shift2 = llvm::countr_zero(LowBit);
+
+  uint32_t Rem = U - LowBit;
+  if (isPowerOf2_32(Rem)) {
+    Info.Shift1 = llvm::countr_zero(Rem);
+    Info.Shift2 = Shift2;
+    Info.IsSub = false;
+    Info.NumShifts = 2;
+    Info.IsDecomposable = true;
+    return Info;
+  }
+
+  uint32_t Sum = U + LowBit;
+  if (Sum <= 0xFF && isPowerOf2_32(Sum)) {
+    Info.Shift1 = llvm::countr_zero(Sum);
+    Info.Shift2 = Shift2;
+    Info.IsSub = true;
+    Info.NumShifts = 2;
+    Info.IsDecomposable = true;
+    return Info;
+  }
+
+  return Info;
+}
+
 bool X86TargetLowering::isExtractSubvectorCheap(EVT ResVT, EVT SrcVT,
                                                 unsigned Index) const {
   if (!isOperationLegalOrCustom(ISD::EXTRACT_SUBVECTOR, ResVT))
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index a528c311975d8..24372598aaf53 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1537,6 +1537,8 @@ namespace llvm {
     bool decomposeMulByConstant(LLVMContext &Context, EVT VT,
                                 SDValue C) const override;
 
+    MulByConstInfo getMulByConstInfo(EVT VT, const APInt &C) const override;
+
     /// Return true if EXTRACT_SUBVECTOR is cheap for this result type
     /// with this index.
     bool isExtractSubvectorCheap(EVT ResVT, EVT SrcVT,
diff --git a/llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll b/llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll
new file mode 100644
index 0000000000000..9648352ebc2a9
--- /dev/null
+++ b/llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll
@@ -0,0 +1,1231 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-tuning-fast-imm-vector-shift | FileCheck %s --check-prefixes=CHECK,SSE2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-tuning-fast-imm-vector-shift,+avx2 | FileCheck %s --check-prefixes=CHECK,AVX2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-tuning-fast-imm-vector-shift,+avx512f,+avx512bw | FileCheck %s --check-prefixes=CHECK,AVX512
+
+;; Tests vXi8 constant-multiply decomposition into shift/add/sub sequences.
+;;
+;; Examples:
+;;   6 = 2^2 + 2^1 = 4 + 2  (or 8 - 2)
+;;   10 = 2^3 + 2^1 = 8 + 2
+;;   12 = 2^3 + 2^2 = 8 + 4  (or 16 - 4)
+;;   18 = 2^4 + 2^1 = 16 + 2
+;;   20 = 2^4 + 2^2 = 16 + 4
+;;   24 = 2^4 + 2^3 = 16 + 8  (or 32 - 8)
+;;
+;; To run this test:
+;;   llvm-lit llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll
+;;
+;; To regenerate CHECK lines:
+;;   python llvm/utils/update_llc_test_checks.py llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll
+
+;; ============================================================================
+;; v16i8 Tests (128-bit vectors) - Sum of two powers of 2
+;; ============================================================================
+
+define <16 x i8> @mul_v16i8_const6(<16 x i8> %a) nounwind {
+; Test multiply by 6 = 4 + 2 = (1 << 2) + (1 << 1)
+; SSE2-LABEL: mul_v16i8_const6:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    paddb %xmm0, %xmm1
+; SSE2-NEXT:    psllw $2, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const6:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX2-NEXT:    vpsllw $2, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const6:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX512-NEXT:    vpsllw $2, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const10(<16 x i8> %a) nounwind {
+; Test multiply by 10 = 8 + 2 = (1 << 3) + (1 << 1)
+; SSE2-LABEL: mul_v16i8_const10:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    paddb %xmm0, %xmm1
+; SSE2-NEXT:    psllw $3, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const10:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX2-NEXT:    vpsllw $3, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const10:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX512-NEXT:    vpsllw $3, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const12(<16 x i8> %a) nounwind {
+; Test multiply by 12 = 8 + 4 = (1 << 3) + (1 << 2)
+; SSE2-LABEL: mul_v16i8_const12:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $2, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $3, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const12:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $3, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const12:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $3, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const18(<16 x i8> %a) nounwind {
+; Test multiply by 18 = 16 + 2 = (1 << 4) + (1 << 1)
+; SSE2-LABEL: mul_v16i8_const18:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    paddb %xmm0, %xmm1
+; SSE2-NEXT:    psllw $4, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const18:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX2-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const18:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX512-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const20(<16 x i8> %a) nounwind {
+; Test multiply by 20 = 16 + 4 = (1 << 4) + (1 << 2)
+; SSE2-LABEL: mul_v16i8_const20:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $2, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $4, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const20:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const20:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const24(<16 x i8> %a) nounwind {
+; Test multiply by 24 = 16 + 8 = (1 << 4) + (1 << 3)
+; SSE2-LABEL: mul_v16i8_const24:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $3, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $4, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const24:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $3, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const24:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $3, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const34(<16 x i8> %a) nounwind {
+; Test multiply by 34 = 32 + 2 = (1 << 5) + (1 << 1)
+; SSE2-LABEL: mul_v16i8_const34:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    paddb %xmm0, %xmm1
+; SSE2-NEXT:    psllw $5, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const34:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX2-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const34:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX512-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const36(<16 x i8> %a) nounwind {
+; Test multiply by 36 = 32 + 4 = (1 << 5) + (1 << 2)
+; SSE2-LABEL: mul_v16i8_const36:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $2, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $5, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const36:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const36:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const40(<16 x i8> %a) nounwind {
+; Test multiply by 40 = 32 + 8 = (1 << 5) + (1 << 3)
+; SSE2-LABEL: mul_v16i8_const40:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $3, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $5, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const40:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $3, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const40:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $3, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const48(<16 x i8> %a) nounwind {
+; Test multiply by 48 = 32 + 16 = (1 << 5) + (1 << 4)
+; SSE2-LABEL: mul_v16i8_const48:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $4, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $5, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const48:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $4, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const48:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $4, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48>
+  ret <16 x i...
[truncated]

github-actions · 2026-01-01T21:03:56Z

⚠️ We detected that you are using a GitHub private e-mail address to contribute to the repo.
Please turn off Keep my email addresses private setting in your account.
See LLVM Developer Policy and LLVM Discourse for more information.

diffray-bot · 2026-01-02T22:02:25Z

Changes Summary

This PR adds a specialized optimization for X86 vector multiplication by small constants (vXi8). It introduces a new virtual method getMulByConstInfo() in the TargetLowering interface to allow targets to provide detailed decomposition information for multiply-by-constant operations, enabling X86 to decompose 8-bit vector multiplies into more efficient shift/add/sub sequences instead of using generic multiplication instructions.

Type: feature

Components Affected: X86 Code Generation (X86ISelLowering), DAG Combiner (SelectionDAG), Target Lowering Interface, LLVM CodeGen Optimization

Files Changed

File	Summary	Change	Impact
`llvm/include/llvm/CodeGen/TargetLowering.h`	Added MulByConstInfo structure and virtual getMulByConstInfo() method to the TargetLowering interface for targets to provide custom multiply-by-constant decomposition info	✏️	🟡
`llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp`	Enhanced MUL visitor to use target-specific getMulByConstInfo() to emit optimized shift/add/sub sequences instead of generic multiplication	✏️	🟡
`llvm/lib/Target/X86/X86ISelLowering.h`	Added getMulByConstInfo() override declaration for X86TargetLowering class	✏️	🟢
`llvm/lib/Target/X86/X86ISelLowering.cpp`	Implemented getMulByConstInfo() with decomposition logic for 8-bit vector multiplies; enhanced decomposeMulByConstant() to recognize decomposable 8-bit vector patterns	✏️	🔴
`llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll`	Comprehensive test suite for vXi8 multiplication decomposition with ~1200 lines covering SSE2, AVX2, and AVX512 code generation	➕	🟡
`llvm/test/CodeGen/X86/vector-mul-i8-negative.ll`	Test suite for edge cases and non-decomposable constants to ensure fallback to standard multiply instructions works correctly	➕	🟡

Architecture Impact

New Patterns: Target-specific decomposition callback pattern via virtual method override, Shift+add and shift-subtract patterns for constant multiplication
Coupling: Adds new virtual method to TargetLowering base class, allowing targets to override default behavior. DAGCombiner now queries targets for decomposition info before falling back to generic logic. This maintains backward compatibility (default implementation returns empty/disabled info).

Risk Areas: The decomposition logic for 8-bit constants (getMulByConstInfo in X86ISelLowering.cpp) involves bitwise manipulation and must handle negative constants correctly - requires careful testing of edge cases, Performance regression noted on x86-64-v4 targets when constants decompose into two shifts (+0.019% geomean), indicating potential suboptimality for newer instruction sets with better multiplication capabilities, The optimization targets only vXi8 types; correctness across vector widths (v16i8, v32i8, v64i8) and different ISA extensions (SSE2, AVX2, AVX512) must be validated, DAGCombiner changes introduce new code path that runs before existing generic decomposition - subtle interactions possible with other optimization passes

Suggestions

Add comments explaining the decomposition algorithm's mathematical basis in getMulByConstInfo() (e.g., why the formula (U - LowBit) and (U + LowBit) patterns are valid)
Consider adding a target feature flag to conditionally enable/disable this optimization for different ISA levels to mitigate x86-64-v4 regression
Expand the implementation to handle other vector types (16-bit, 32-bit) if performance benefits are demonstrated
The PR description mentions microbenchmarks in llvm-test-suite - ensure those are included/linked in CI validation

_{Full review in progress... | Powered by diffray}

diffray-bot · 2026-01-02T22:08:56Z

Review Summary

Free public review - Want AI code reviews on your PRs? Check out diffray.ai

Validated 10 issues: 3 kept (testing gaps, legitimate documentation concern), 7 filtered (low value style/docs, speculative performance claims)

Issues Found: 3

💬 See 3 individual line comment(s) for details.

📋 Full issue list (click to expand)

🟡 MEDIUM - New virtual method getMulByConstInfo() lacks unit tests

Agent: testing

Category: testing

Why this matters: Comprehensive test coverage prevents regressions, builds confidence in changes, and reduces debugging time. Manual verification is unreliable and doesn't scale.

File: llvm/include/llvm/CodeGen/TargetLowering.h:2542-2556

Description: The new virtual method getMulByConstInfo() added to TargetLowering is tested only indirectly through CodeGen integration tests, but there are no unit tests for the method's logic, edge cases, or different input parameters.

Suggestion: Add unit tests in llvm/unittests/CodeGen/ that directly test getMulByConstInfo() for power-of-2 constants, sum/difference decomposable constants, non-decomposable constants, negative values, and boundary conditions (0, 1, -1, 255).

Confidence: 75%

Rule: testing_coverage_gaps

🔵 LOW - Power-of-2 multipliers not explicitly tested in decompose file

Agent: testing

Category: testing

Why this matters: Prevents regressions when adding features, ensures all code paths are tested, catches bugs in edge cases and configuration combinations that would otherwise only surface in production.

File: llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll:1-30

Description: The decompose test file focuses on sum/difference of powers of 2 (const6, const10, const12, etc.) but doesn't have explicit test cases for pure power-of-2 multipliers (2, 4, 8, 16, etc.) that should use single shift decomposition.

Suggestion: Add explicit test cases for power-of-2 multipliers (@mul_v16i8_const2, @mul_v16i8_const4, @mul_v16i8_const8, etc.) to verify they emit a single shift instruction rather than add/sub operations.

Confidence: 70%

Rule: test_all_variations_comprehensive

🔵 LOW - Test file RUN line missing documentation of attribute choice

Agent: testing

Category: docs

Why this matters: Minimal tests are easier to understand, debug, and maintain. They make test failures clearer by isolating exactly what broke, and reduce test execution time by avoiding unnecessary operations.

File: llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll:1-10

Description: The test uses -mattr=-tuning-fast-imm-vector-shift but doesn't document in comments why this attribute needs to be disabled. This makes the test intent unclear.

Suggestion: Add a comment explaining why the attribute is disabled: '; NOTE: -tuning-fast-imm-vector-shift is disabled to test shift+add/sub decomposition optimizations' or similar explanation.

Confidence: 60%

Rule: test_minimal_complexity

🔗 View full review details

_{Review ID: cd5c2c96-97d5-41f9-a889-452fd9bf5030}
_{Rate it 👍 or 👎 to improve future reviews | Powered by diffray}

- Remove unnecessary target API - Use existing decomposition logic

- Second round of PR feedback

RKSimon

One last thing - please can you move the negative tests to the bottom of the vector-mul-i8-decompose.ll - no need to split them

- remove redundant comment

RKSimon · 2026-02-17T15:48:25Z

+  ret <16 x i8> %result
+}
+
+; Test multiply by 160 = 128 + 32 = (1 << 7) + (1 << 5)


This DOES use (v)psubb but your comment says its an add?

Thanks, this was a surprising one. The optimization does decompose this case as an add, but it looks like some later pass converts it to $0 - (2^5 + 2^6) = -96 \equiv 160$ .

- Fixed tests to actually exercise the negative branch - Use "splat" - Fix test comments

RKSimon

LGTM - cheers

github-actions · 2026-03-11T18:46:29Z

@grodranlorth Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

Fixes llvm#164200 ~~I will create a separate PR to the `llvm-test-suite` repo for the microbenchmark for this change.~~ The benchmark is in llvm/llvm-test-suite#316 In my experiments on an EC2 `c6i.4xl`, the change gives a small improvement for the `x86-64`, `x86-64-v2`, and `x86-64-v3` targets. It regresses performance on `x86-64-v4` (in particular, when the constant decomposes into two shifts). The performance summary follows: ``` $ ../MicroBenchmarks/libs/benchmark/tools/compare.py benchmarks results-baseline-generic-v1.json results-opt-generic-v1.json |tail -n1 OVERALL_GEOMEAN -0.2846 -0.2846 0 0 0 0 $ ../MicroBenchmarks/libs/benchmark/tools/compare.py benchmarks results-baseline-generic-v2.json results-opt-generic-v2.json |tail -n1 OVERALL_GEOMEAN -0.0907 -0.0907 0 0 0 0 $ ../MicroBenchmarks/libs/benchmark/tools/compare.py benchmarks results-baseline-generic-v3.json results-opt-generic-v3.json |tail -n1 OVERALL_GEOMEAN -0.1821 -0.1821 0 0 0 0 $ ../MicroBenchmarks/libs/benchmark/tools/compare.py benchmarks results-baseline-generic-v4.json results-opt-generic-v4.json |tail -n1 OVERALL_GEOMEAN +0.0190 +0.0190 0 0 0 0 ```

grodranlorth added 2 commits December 30, 2025 21:25

[X86] lit tests for vXi8 shift+add/sub without width extension

bb437eb

[X86] Use shift+add/sub for vXi8 splat multiply

e418ad1

And avoid extending to 16 bits

llvmbot added backend:X86 llvm:SelectionDAG SelectionDAGISel as well labels Dec 31, 2025

grodranlorth changed the title ~~Use shift+add/sub for vXi8 splat multiplies #164200~~ [X86] Use shift+add/sub for vXi8 splat multiplies #164200 Dec 31, 2025

grodranlorth changed the title ~~[X86] Use shift+add/sub for vXi8 splat multiplies #164200~~ [X86] Use shift+add/sub for vXi8 splat multiplies Dec 31, 2025

RKSimon self-requested a review December 31, 2025 21:48

arsenm reviewed Jan 2, 2026

View reviewed changes

Comment thread llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp Outdated

Comment thread llvm/lib/Target/X86/X86ISelLowering.cpp Outdated

diffray-bot reviewed Jan 2, 2026

View reviewed changes

Comment thread llvm/include/llvm/CodeGen/TargetLowering.h Outdated

Comment thread llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll

Comment thread llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll

RKSimon requested changes Jan 6, 2026

View reviewed changes

Comment thread llvm/include/llvm/CodeGen/TargetLowering.h Outdated

fixup! [X86] Use shift+add/sub for vXi8 splat multiply

625a6f2

- Remove unnecessary target API - Use existing decomposition logic

grodranlorth requested a review from RKSimon January 28, 2026 02:01

RKSimon requested changes Jan 28, 2026

View reviewed changes

Comment thread llvm/test/CodeGen/X86/vector-mul-i8-negative.ll Outdated

Comment thread llvm/lib/Target/X86/X86ISelLowering.cpp

Comment thread llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp Outdated

RKSimon reviewed Jan 28, 2026

View reviewed changes

Comment thread llvm/test/CodeGen/X86/vector-mul-i8-negative.ll Outdated

RKSimon reviewed Jan 28, 2026

View reviewed changes

Comment thread llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll Outdated

fixup! [X86] Use shift+add/sub for vXi8 splat multiply

ce38724

- Second round of PR feedback

grodranlorth requested a review from RKSimon February 1, 2026 16:30

RKSimon requested changes Feb 4, 2026

View reviewed changes

grodranlorth added 2 commits February 7, 2026 19:43

fixup! [X86] Use shift+add/sub for vXi8 splat multiply

d645d4e

- remove redundant comment

Move vXi8 tests into a single file

4d22476

grodranlorth requested a review from RKSimon February 7, 2026 19:51

RKSimon requested changes Feb 17, 2026

View reviewed changes

RKSimon's test feedback

0b4d080

- Fixed tests to actually exercise the negative branch - Use "splat" - Fix test comments

grodranlorth requested a review from RKSimon March 8, 2026 19:29

RKSimon approved these changes Mar 11, 2026

View reviewed changes

Merge branch 'main' into shiftopt

3e483de

RKSimon enabled auto-merge (squash) March 11, 2026 18:10

RKSimon merged commit ed42ac3 into llvm:main Mar 11, 2026
9 of 10 checks passed

grodranlorth deleted the shiftopt branch March 21, 2026 18:22

Conversation

grodranlorth commented Dec 31, 2025 • edited by RKSimon Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Dec 31, 2025

Uh oh!

llvmbot commented Dec 31, 2025

Uh oh!

llvmbot commented Dec 31, 2025

Uh oh!

github-actions Bot commented Jan 1, 2026

Uh oh!

Uh oh!

Uh oh!

diffray-bot commented Jan 2, 2026

Changes Summary

Uh oh!

Uh oh!

Uh oh!

Uh oh!

diffray-bot commented Jan 2, 2026

Review Summary

Issues Found: 3

🟡 MEDIUM - New virtual method getMulByConstInfo() lacks unit tests

🔵 LOW - Power-of-2 multipliers not explicitly tested in decompose file

🔵 LOW - Test file RUN line missing documentation of attribute choice

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

RKSimon left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

RKSimon Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

grodranlorth Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

RKSimon left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions Bot commented Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

grodranlorth commented Dec 31, 2025 •

edited by RKSimon

Loading