
[X86] Use shift+add/sub for vXi8 splat multiplies #174110

Merged
RKSimon merged 8 commits into llvm:main from grodranlorth:shiftopt
Mar 11, 2026

Conversation

@grodranlorth
Contributor

@grodranlorth grodranlorth commented Dec 31, 2025

Fixes #164200

The microbenchmark for this change is in a separate PR to the llvm-test-suite repo: llvm/llvm-test-suite#316

In my experiments on an EC2 c6i.4xl, the change gives a small improvement for the x86-64, x86-64-v2, and x86-64-v3 targets. It regresses performance on x86-64-v4 (in particular, when the constant decomposes into two shifts). The performance summary follows:

$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v1.json results-opt-generic-v1.json  |tail -n1
OVERALL_GEOMEAN                         -0.2846         -0.2846             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v2.json results-opt-generic-v2.json  |tail -n1
OVERALL_GEOMEAN                         -0.0907         -0.0907             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v3.json results-opt-generic-v3.json  |tail -n1
OVERALL_GEOMEAN                         -0.1821         -0.1821             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v4.json results-opt-generic-v4.json  |tail -n1
OVERALL_GEOMEAN                         +0.0190         +0.0190             0             0             0             0
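The shift decompositions this patch emits can be sketched in plain Python. This is an illustrative model of the selection logic in the patch's `getMulByConstInfo` hook, not the LLVM implementation itself; the function name and return shape are invented for the sketch:

```python
def decompose_mul_i8(c):
    """Decompose an 8-bit splat multiplier into shift/add/sub form.

    Returns (negate, shifts, is_sub), where the multiply becomes
    (x << shifts[0]) op (x << shifts[1]) (or a single shift),
    optionally negated; returns None if no 1- or 2-shift form exists.
    """
    c &= 0xFF
    signed = c - 256 if c >= 128 else c
    negate = signed < 0
    u = -signed if negate else signed
    if u in (0, 1):
        return None
    if u & (u - 1) == 0:                 # power of two: single shift
        return (negate, [u.bit_length() - 1], False)
    low = u & -u                         # lowest set bit, 2^y
    rem = u - low
    if rem & (rem - 1) == 0:             # m = 2^x + 2^y
        return (negate, [rem.bit_length() - 1, low.bit_length() - 1], False)
    s = u + low
    if s <= 0xFF and s & (s - 1) == 0:   # m = 2^x - 2^y
        return (negate, [s.bit_length() - 1, low.bit_length() - 1], True)
    return None
```

For example, 6 decomposes as (x << 2) + (x << 1), while 7 decomposes as (x << 3) - (x << 0).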

@github-actions

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository, in which case you can instead tag reviewers by name in a comment, using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "pinging" it: add a comment that says "Ping". The common courtesy ping rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

@llvmbot added the backend:X86 and llvm:SelectionDAG labels Dec 31, 2025
@llvmbot
Member

llvmbot commented Dec 31, 2025

@llvm/pr-subscribers-backend-x86

Author: Cody Cutler (grodranlorth)

Changes

Issue #164200

I will create a separate PR to the llvm-test-suite repo for the microbenchmark for this change.

In my experiments on an EC2 c6i.4xl, the change gives a small improvement for the x86-64, x86-64-v2, and x86-64-v3 targets. It regresses performance on x86-64-v4 (in particular, when the constant decomposes into two shifts). The performance summary follows:

$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v1.json results-opt-generic-v1.json  |tail -n1
OVERALL_GEOMEAN                         -0.2846         -0.2846             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v2.json results-opt-generic-v2.json  |tail -n1
OVERALL_GEOMEAN                         -0.0907         -0.0907             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v3.json results-opt-generic-v3.json  |tail -n1
OVERALL_GEOMEAN                         -0.1821         -0.1821             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v4.json results-opt-generic-v4.json  |tail -n1
OVERALL_GEOMEAN                         +0.0190         +0.0190             0             0             0             0

Patch is 81.69 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/174110.diff

6 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/TargetLowering.h (+17)
  • (modified) llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (+24)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+59)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.h (+2)
  • (added) llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll (+1231)
  • (added) llvm/test/CodeGen/X86/vector-mul-i8-negative.ll (+466)
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 8ad64a852b74d..7594b487b9666 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -2538,6 +2538,23 @@ class LLVM_ABI TargetLoweringBase {
     return false;
   }
 
+  /// Structure to hold detailed decomposition of multiply by constant.
+  struct MulByConstInfo {
+    bool IsDecomposable = false;
+    bool Negate = false;    // True if result should be negated
+    unsigned NumShifts = 0; // 1 or 2
+    unsigned Shift1 = 0;    // Primary shift amount
+    unsigned Shift2 = 0;    // Secondary shift amount (for 2-shift case)
+    bool IsSub = false;     // True for SUB, false for ADD (for 2-shift case)
+  };
+
+  /// Get detailed decomposition of multiply by constant if available.
+  /// Returns decomposition info if the target has a custom decomposition
+  /// for this multiply-by-constant, otherwise returns IsDecomposable = false.
+  virtual MulByConstInfo getMulByConstInfo(EVT VT, const APInt &C) const {
+    return MulByConstInfo();
+  }
+
   /// Return true if it may be profitable to transform
   /// (mul (add x, c1), c2) -> (add (mul x, c2), c1*c2).
   /// This may not be true if c1 and c2 can be represented as immediates but
diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 74d00317c3649..26ff36a768167 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -4836,6 +4836,30 @@ template <class MatchContextClass> SDValue DAGCombiner::visitMUL(SDNode *N) {
   //           x * -0xf800 --> -((x << 16) - (x << 11)) ; (x << 11) - (x << 16)
   if (!UseVP && N1IsConst &&
       TLI.decomposeMulByConstant(*DAG.getContext(), VT, N1)) {
+    // First check if target has custom decomposition info
+    TargetLowering::MulByConstInfo Info =
+        TLI.getMulByConstInfo(VT, ConstValue1);
+    if (Info.IsDecomposable) {
+      // Emit custom decomposition based on target's info
+      SDValue Result;
+      if (Info.NumShifts == 1) {
+        // Single shift: result = N0 << Shift1
+        Result = DAG.getNode(ISD::SHL, DL, VT, N0,
+                             DAG.getConstant(Info.Shift1, DL, VT));
+      } else if (Info.NumShifts == 2) {
+        // Two shifts with add or sub
+        SDValue Shl1 = DAG.getNode(ISD::SHL, DL, VT, N0,
+                                   DAG.getConstant(Info.Shift1, DL, VT));
+        SDValue Shl2 = DAG.getNode(ISD::SHL, DL, VT, N0,
+                                   DAG.getConstant(Info.Shift2, DL, VT));
+        Result =
+            DAG.getNode(Info.IsSub ? ISD::SUB : ISD::ADD, DL, VT, Shl1, Shl2);
+      }
+      if (Info.Negate)
+        Result = DAG.getNegative(Result, DL, VT);
+      return Result;
+    }
+
     // TODO: We could handle more general decomposition of any constant by
     //       having the target set a limit on number of ops and making a
     //       callback to determine that sequence (similar to sqrt expansion).
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 20136ade7c317..f75ce5a53188f 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -3389,6 +3389,12 @@ bool X86TargetLowering::decomposeMulByConstant(LLVMContext &Context, EVT VT,
   if (!ISD::isConstantSplatVector(C.getNode(), MulC))
     return false;
 
+  // Check if this is an 8-bit vector multiply that can be decomposed to shifts.
+  if (VT.isVector() && VT.getScalarSizeInBits() == 8) {
+    if (getMulByConstInfo(VT, MulC).IsDecomposable)
+      return true;
+  }
+
   // Find the type this will be legalized too. Otherwise we might prematurely
   // convert this to shl+add/sub and then still have to type legalize those ops.
   // Another choice would be to defer the decision for illegal types until
@@ -3413,6 +3419,59 @@ bool X86TargetLowering::decomposeMulByConstant(LLVMContext &Context, EVT VT,
          (1 - MulC).isPowerOf2() || (-(MulC + 1)).isPowerOf2();
 }
 
+TargetLowering::MulByConstInfo
+X86TargetLowering::getMulByConstInfo(EVT VT, const APInt &Constant) const {
+  // Only handle 8-bit vector multiplies
+  if (!VT.isVector() || VT.getScalarSizeInBits() != 8 ||
+      Constant.getBitWidth() < 8)
+    return MulByConstInfo();
+
+  TargetLowering::MulByConstInfo Info;
+  int8_t SignedC = static_cast<int8_t>(Constant.getZExtValue());
+  Info.Negate = SignedC < 0;
+
+  uint32_t U = static_cast<uint8_t>(Info.Negate ? -SignedC : SignedC);
+  if (U == 0 || U == 1)
+    return Info;
+
+  // Power of 2.
+  if (isPowerOf2_32(U)) {
+    Info.Shift1 = llvm::countr_zero(U);
+    Info.NumShifts = 1;
+    Info.IsDecomposable = true;
+    return Info;
+  }
+
+  // Decomposition logic (v is the multiplicand):
+  //   m = 2^x + 2^y  => (shl v, x) + (shl v, y)
+  //   m = 2^x - 2^y  => (shl v, x) - (shl v, y)
+  // where 2^y is the lowest set bit of m.
+  uint32_t LowBit = U & (0U - U);
+  unsigned Shift2 = llvm::countr_zero(LowBit);
+
+  uint32_t Rem = U - LowBit;
+  if (isPowerOf2_32(Rem)) {
+    Info.Shift1 = llvm::countr_zero(Rem);
+    Info.Shift2 = Shift2;
+    Info.IsSub = false;
+    Info.NumShifts = 2;
+    Info.IsDecomposable = true;
+    return Info;
+  }
+
+  uint32_t Sum = U + LowBit;
+  if (Sum <= 0xFF && isPowerOf2_32(Sum)) {
+    Info.Shift1 = llvm::countr_zero(Sum);
+    Info.Shift2 = Shift2;
+    Info.IsSub = true;
+    Info.NumShifts = 2;
+    Info.IsDecomposable = true;
+    return Info;
+  }
+
+  return Info;
+}
+
 bool X86TargetLowering::isExtractSubvectorCheap(EVT ResVT, EVT SrcVT,
                                                 unsigned Index) const {
   if (!isOperationLegalOrCustom(ISD::EXTRACT_SUBVECTOR, ResVT))
diff --git a/llvm/lib/Target/X86/X86ISelLowering.h b/llvm/lib/Target/X86/X86ISelLowering.h
index a528c311975d8..24372598aaf53 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.h
+++ b/llvm/lib/Target/X86/X86ISelLowering.h
@@ -1537,6 +1537,8 @@ namespace llvm {
     bool decomposeMulByConstant(LLVMContext &Context, EVT VT,
                                 SDValue C) const override;
 
+    MulByConstInfo getMulByConstInfo(EVT VT, const APInt &C) const override;
+
     /// Return true if EXTRACT_SUBVECTOR is cheap for this result type
     /// with this index.
     bool isExtractSubvectorCheap(EVT ResVT, EVT SrcVT,
diff --git a/llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll b/llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll
new file mode 100644
index 0000000000000..9648352ebc2a9
--- /dev/null
+++ b/llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll
@@ -0,0 +1,1231 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-tuning-fast-imm-vector-shift | FileCheck %s --check-prefixes=CHECK,SSE2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-tuning-fast-imm-vector-shift,+avx2 | FileCheck %s --check-prefixes=CHECK,AVX2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-tuning-fast-imm-vector-shift,+avx512f,+avx512bw | FileCheck %s --check-prefixes=CHECK,AVX512
+
+;; Tests vXi8 constant-multiply decomposition into shift/add/sub sequences.
+;;
+;; Examples:
+;;   6 = 2^2 + 2^1 = 4 + 2  (or 8 - 2)
+;;   10 = 2^3 + 2^1 = 8 + 2
+;;   12 = 2^3 + 2^2 = 8 + 4  (or 16 - 4)
+;;   18 = 2^4 + 2^1 = 16 + 2
+;;   20 = 2^4 + 2^2 = 16 + 4
+;;   24 = 2^4 + 2^3 = 16 + 8  (or 32 - 8)
+;;
+;; To run this test:
+;;   llvm-lit llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll
+;;
+;; To regenerate CHECK lines:
+;;   python llvm/utils/update_llc_test_checks.py llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll
+
+;; ============================================================================
+;; v16i8 Tests (128-bit vectors) - Sum of two powers of 2
+;; ============================================================================
+
+define <16 x i8> @mul_v16i8_const6(<16 x i8> %a) nounwind {
+; Test multiply by 6 = 4 + 2 = (1 << 2) + (1 << 1)
+; SSE2-LABEL: mul_v16i8_const6:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    paddb %xmm0, %xmm1
+; SSE2-NEXT:    psllw $2, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const6:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX2-NEXT:    vpsllw $2, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const6:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX512-NEXT:    vpsllw $2, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const10(<16 x i8> %a) nounwind {
+; Test multiply by 10 = 8 + 2 = (1 << 3) + (1 << 1)
+; SSE2-LABEL: mul_v16i8_const10:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    paddb %xmm0, %xmm1
+; SSE2-NEXT:    psllw $3, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const10:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX2-NEXT:    vpsllw $3, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const10:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX512-NEXT:    vpsllw $3, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const12(<16 x i8> %a) nounwind {
+; Test multiply by 12 = 8 + 4 = (1 << 3) + (1 << 2)
+; SSE2-LABEL: mul_v16i8_const12:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $2, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $3, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const12:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $3, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const12:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $3, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const18(<16 x i8> %a) nounwind {
+; Test multiply by 18 = 16 + 2 = (1 << 4) + (1 << 1)
+; SSE2-LABEL: mul_v16i8_const18:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    paddb %xmm0, %xmm1
+; SSE2-NEXT:    psllw $4, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const18:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX2-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const18:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX512-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const20(<16 x i8> %a) nounwind {
+; Test multiply by 20 = 16 + 4 = (1 << 4) + (1 << 2)
+; SSE2-LABEL: mul_v16i8_const20:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $2, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $4, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const20:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const20:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const24(<16 x i8> %a) nounwind {
+; Test multiply by 24 = 16 + 8 = (1 << 4) + (1 << 3)
+; SSE2-LABEL: mul_v16i8_const24:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $3, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $4, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const24:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $3, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const24:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $3, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const34(<16 x i8> %a) nounwind {
+; Test multiply by 34 = 32 + 2 = (1 << 5) + (1 << 1)
+; SSE2-LABEL: mul_v16i8_const34:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    paddb %xmm0, %xmm1
+; SSE2-NEXT:    psllw $5, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const34:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX2-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const34:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX512-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const36(<16 x i8> %a) nounwind {
+; Test multiply by 36 = 32 + 4 = (1 << 5) + (1 << 2)
+; SSE2-LABEL: mul_v16i8_const36:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $2, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $5, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const36:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const36:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const40(<16 x i8> %a) nounwind {
+; Test multiply by 40 = 32 + 8 = (1 << 5) + (1 << 3)
+; SSE2-LABEL: mul_v16i8_const40:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $3, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $5, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const40:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $3, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const40:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $3, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const48(<16 x i8> %a) nounwind {
+; Test multiply by 48 = 32 + 16 = (1 << 5) + (1 << 4)
+; SSE2-LABEL: mul_v16i8_const48:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $4, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $5, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const48:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $4, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const48:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $4, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48>
+  ret <16 x i...
[truncated]
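As a sanity check on what these CHECK lines encode, the two-shift form can be verified exhaustively against an 8-bit wrapping multiply. This standalone Python sketch is independent of LLVM; the function name and the hand-picked constants are illustrative:

```python
def mul_via_shifts(x, sh1, sh2, is_sub, negate):
    # Emulate the emitted shift/add-or-sub sequence with 8-bit wraparound.
    a = (x << sh1) & 0xFF
    b = (x << sh2) & 0xFF
    r = (a - b) & 0xFF if is_sub else (a + b) & 0xFF
    return (-r) & 0xFF if negate else r

# 6 = (1<<2) + (1<<1); 24 = (1<<4) + (1<<3); 7 = (1<<3) - (1<<0)
cases = {6: (2, 1, False, False), 24: (4, 3, False, False), 7: (3, 0, True, False)}
for c, (sh1, sh2, is_sub, neg) in cases.items():
    for x in range(256):  # every i8 lane value
        assert mul_via_shifts(x, sh1, sh2, is_sub, neg) == (x * c) & 0xFF
```

Because each lane wraps modulo 256, the equality holds for all 256 byte values, matching the splat-multiply semantics the tests exercise.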

@llvmbot
Member

llvmbot commented Dec 31, 2025

@llvm/pr-subscribers-llvm-selectiondag

Author: Cody Cutler (grodranlorth)

new file mode 100644
index 0000000000000..9648352ebc2a9
--- /dev/null
+++ b/llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll
@@ -0,0 +1,1231 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-tuning-fast-imm-vector-shift | FileCheck %s --check-prefixes=CHECK,SSE2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-tuning-fast-imm-vector-shift,+avx2 | FileCheck %s --check-prefixes=CHECK,AVX2
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=-tuning-fast-imm-vector-shift,+avx512f,+avx512bw | FileCheck %s --check-prefixes=CHECK,AVX512
+
+;; Tests vXi8 constant-multiply decomposition into shift/add/sub sequences.
+;;
+;; Examples:
+;;   6 = 2^2 + 2^1 = 4 + 2  (or 8 - 2)
+;;   10 = 2^3 + 2^1 = 8 + 2
+;;   12 = 2^3 + 2^2 = 8 + 4  (or 16 - 4)
+;;   18 = 2^4 + 2^1 = 16 + 2
+;;   20 = 2^4 + 2^2 = 16 + 4
+;;   24 = 2^4 + 2^3 = 16 + 8  (or 32 - 8)
+;;
+;; To run this test:
+;;   llvm-lit llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll
+;;
+;; To regenerate CHECK lines:
+;;   python llvm/utils/update_llc_test_checks.py llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll
+
+;; ============================================================================
+;; v16i8 Tests (128-bit vectors) - Sum of two powers of 2
+;; ============================================================================
+
+define <16 x i8> @mul_v16i8_const6(<16 x i8> %a) nounwind {
+; Test multiply by 6 = 4 + 2 = (1 << 2) + (1 << 1)
+; SSE2-LABEL: mul_v16i8_const6:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    paddb %xmm0, %xmm1
+; SSE2-NEXT:    psllw $2, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const6:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX2-NEXT:    vpsllw $2, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const6:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX512-NEXT:    vpsllw $2, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6, i8 6>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const10(<16 x i8> %a) nounwind {
+; Test multiply by 10 = 8 + 2 = (1 << 3) + (1 << 1)
+; SSE2-LABEL: mul_v16i8_const10:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    paddb %xmm0, %xmm1
+; SSE2-NEXT:    psllw $3, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const10:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX2-NEXT:    vpsllw $3, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const10:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX512-NEXT:    vpsllw $3, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10, i8 10>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const12(<16 x i8> %a) nounwind {
+; Test multiply by 12 = 8 + 4 = (1 << 3) + (1 << 2)
+; SSE2-LABEL: mul_v16i8_const12:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $2, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $3, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const12:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $3, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const12:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $3, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12, i8 12>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const18(<16 x i8> %a) nounwind {
+; Test multiply by 18 = 16 + 2 = (1 << 4) + (1 << 1)
+; SSE2-LABEL: mul_v16i8_const18:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    paddb %xmm0, %xmm1
+; SSE2-NEXT:    psllw $4, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const18:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX2-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const18:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX512-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18, i8 18>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const20(<16 x i8> %a) nounwind {
+; Test multiply by 20 = 16 + 4 = (1 << 4) + (1 << 2)
+; SSE2-LABEL: mul_v16i8_const20:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $2, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $4, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const20:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const20:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20, i8 20>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const24(<16 x i8> %a) nounwind {
+; Test multiply by 24 = 16 + 8 = (1 << 4) + (1 << 3)
+; SSE2-LABEL: mul_v16i8_const24:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $3, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $4, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const24:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $3, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const24:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $3, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $4, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24, i8 24>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const34(<16 x i8> %a) nounwind {
+; Test multiply by 34 = 32 + 2 = (1 << 5) + (1 << 1)
+; SSE2-LABEL: mul_v16i8_const34:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    paddb %xmm0, %xmm1
+; SSE2-NEXT:    psllw $5, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const34:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX2-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const34:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpaddb %xmm0, %xmm0, %xmm1
+; AVX512-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34, i8 34>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const36(<16 x i8> %a) nounwind {
+; Test multiply by 36 = 32 + 4 = (1 << 5) + (1 << 2)
+; SSE2-LABEL: mul_v16i8_const36:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $2, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $5, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const36:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const36:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $2, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36, i8 36>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const40(<16 x i8> %a) nounwind {
+; Test multiply by 40 = 32 + 8 = (1 << 5) + (1 << 3)
+; SSE2-LABEL: mul_v16i8_const40:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $3, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $5, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const40:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $3, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const40:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $3, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40, i8 40>
+  ret <16 x i8> %result
+}
+
+define <16 x i8> @mul_v16i8_const48(<16 x i8> %a) nounwind {
+; Test multiply by 48 = 32 + 16 = (1 << 5) + (1 << 4)
+; SSE2-LABEL: mul_v16i8_const48:
+; SSE2:       # %bb.0:
+; SSE2-NEXT:    movdqa %xmm0, %xmm1
+; SSE2-NEXT:    psllw $4, %xmm1
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT:    psllw $5, %xmm0
+; SSE2-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE2-NEXT:    paddb %xmm1, %xmm0
+; SSE2-NEXT:    retq
+;
+; AVX2-LABEL: mul_v16i8_const48:
+; AVX2:       # %bb.0:
+; AVX2-NEXT:    vpsllw $4, %xmm0, %xmm1
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX2-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX2-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX2-NEXT:    retq
+;
+; AVX512-LABEL: mul_v16i8_const48:
+; AVX512:       # %bb.0:
+; AVX512-NEXT:    vpsllw $4, %xmm0, %xmm1
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
+; AVX512-NEXT:    vpsllw $5, %xmm0, %xmm0
+; AVX512-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; AVX512-NEXT:    vpaddb %xmm1, %xmm0, %xmm0
+; AVX512-NEXT:    retq
+  %result = mul <16 x i8> %a, <i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48, i8 48>
+  ret <16 x i...
[truncated]

@grodranlorth grodranlorth changed the title Use shift+add/sub for vXi8 splat multiplies #164200 [X86] Use shift+add/sub for vXi8 splat multiplies #164200 Dec 31, 2025
@grodranlorth grodranlorth changed the title [X86] Use shift+add/sub for vXi8 splat multiplies #164200 [X86] Use shift+add/sub for vXi8 splat multiplies Dec 31, 2025
@RKSimon RKSimon self-requested a review December 31, 2025 21:48
@github-actions

github-actions Bot commented Jan 1, 2026

⚠️ We detected that you are using a GitHub private e-mail address to contribute to the repo.
Please turn off Keep my email addresses private setting in your account.
See LLVM Developer Policy and LLVM Discourse for more information.

Comment thread llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp Outdated
Comment thread llvm/lib/Target/X86/X86ISelLowering.cpp Outdated
@diffray-bot

Changes Summary

This PR adds a specialized optimization for X86 vector multiplication by small constants (vXi8). It introduces a new virtual method getMulByConstInfo() in the TargetLowering interface to allow targets to provide detailed decomposition information for multiply-by-constant operations, enabling X86 to decompose 8-bit vector multiplies into more efficient shift/add/sub sequences instead of using generic multiplication instructions.
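The decomposition this summary describes can be sketched in a few lines of Python (a model of the idea only, not the actual LLVM implementation; the function name and return shape are illustrative):

```python
def decompose_mul_by_const(u: int):
    """Model of the vXi8 rewrite: describe how an unsigned 8-bit
    constant multiply can become shifts plus one add/sub."""
    assert 0 <= u <= 255
    if u in (0, 1):
        return None                       # trivial, nothing to decompose
    if u & (u - 1) == 0:                  # power of two: single shift
        return ("shl", u.bit_length() - 1)
    low = u & -u                          # lowest set bit, i.e. 2^shift2
    shift2 = low.bit_length() - 1
    rem = u - low
    if rem & (rem - 1) == 0:              # u = 2^shift1 + 2^shift2
        return ("add", rem.bit_length() - 1, shift2)
    s = u + low
    if s <= 255 and s & (s - 1) == 0:     # u = 2^shift1 - 2^shift2
        return ("sub", s.bit_length() - 1, shift2)
    return None                           # not decomposable; keep the mul

# Constants from the test file:
print(decompose_mul_by_const(6))   # ('add', 2, 1): 6 = 4 + 2
print(decompose_mul_by_const(24))  # ('add', 4, 3): 24 = 16 + 8
print(decompose_mul_by_const(7))   # ('sub', 3, 0): 7 = 8 - 1
```

Constants such as 100 fall through both checks and return `None`, matching the patch's fallback to a regular multiply.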

Type: feature

Components Affected: X86 Code Generation (X86ISelLowering), DAG Combiner (SelectionDAG), Target Lowering Interface, LLVM CodeGen Optimization

Files Changed
File Summary Change Impact
llvm/include/llvm/CodeGen/TargetLowering.h Added MulByConstInfo structure and virtual getMulByConstInfo() method to the TargetLowering interface for targets to provide custom multiply-by-constant decomposition info ✏️ 🟡
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp Enhanced MUL visitor to use target-specific getMulByConstInfo() to emit optimized shift/add/sub sequences instead of generic multiplication ✏️ 🟡
llvm/lib/Target/X86/X86ISelLowering.h Added getMulByConstInfo() override declaration for X86TargetLowering class ✏️ 🟢
llvm/lib/Target/X86/X86ISelLowering.cpp Implemented getMulByConstInfo() with decomposition logic for 8-bit vector multiplies; enhanced decomposeMulByConstant() to recognize decomposable 8-bit vector patterns ✏️ 🔴
llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll Comprehensive test suite for vXi8 multiplication decomposition with ~1200 lines covering SSE2, AVX2, and AVX512 code generation 🟡
llvm/test/CodeGen/X86/vector-mul-i8-negative.ll Test suite for edge cases and non-decomposable constants to ensure fallback to standard multiply instructions works correctly 🟡
Architecture Impact
  • New Patterns: Target-specific decomposition callback pattern via virtual method override, Shift+add and shift-subtract patterns for constant multiplication
  • Coupling: Adds new virtual method to TargetLowering base class, allowing targets to override default behavior. DAGCombiner now queries targets for decomposition info before falling back to generic logic. This maintains backward compatibility (default implementation returns empty/disabled info).

Risk Areas:
  • The decomposition logic for 8-bit constants (getMulByConstInfo in X86ISelLowering.cpp) involves bitwise manipulation and must handle negative constants correctly; edge cases require careful testing.
  • A performance regression was noted on x86-64-v4 targets when constants decompose into two shifts (+0.0190 geomean), indicating potential suboptimality on newer instruction sets with stronger multiplication support.
  • The optimization targets only vXi8 types; correctness must be validated across vector widths (v16i8, v32i8, v64i8) and ISA extensions (SSE2, AVX2, AVX512).
  • The DAGCombiner changes introduce a new code path that runs before the existing generic decomposition, so subtle interactions with other optimization passes are possible.

Suggestions
  • Add comments explaining the decomposition algorithm's mathematical basis in getMulByConstInfo() (e.g., why the formula (U - LowBit) and (U + LowBit) patterns are valid)
  • Consider adding a target feature flag to conditionally enable/disable this optimization for different ISA levels to mitigate x86-64-v4 regression
  • Expand the implementation to handle other vector types (16-bit, 32-bit) if performance benefits are demonstrated
  • The PR description mentions microbenchmarks in llvm-test-suite - ensure those are included/linked in CI validation
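On the first suggestion, the mathematical basis is that with LowBit the lowest set bit of U, U - LowBit being a power of two means U is a sum of two powers of two, and U + LowBit being a power of two means U is a difference of two. The rewrite also stays correct under i8 wraparound, which is easy to check exhaustively (a sketch; variable names follow the patch):

```python
# Verify x*U == (x<<a) +/- (x<<b) for every byte x and every
# decomposable U, all modulo 256 (i8 lanes wrap around).
for u in range(2, 256):
    low = u & -u                         # LowBit in the patch
    rem, s = u - low, u + low
    if rem and rem & (rem - 1) == 0:     # U = 2^a + 2^b
        a, b, sub = rem.bit_length() - 1, low.bit_length() - 1, False
    elif s <= 255 and s & (s - 1) == 0:  # U = 2^a - 2^b
        a, b, sub = s.bit_length() - 1, low.bit_length() - 1, True
    else:
        continue
    for x in range(256):
        shifted = (x << a) - (x << b) if sub else (x << a) + (x << b)
        assert (x * u) % 256 == shifted % 256, (u, x)
print("all decomposable constants verified")
```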

Full review in progress... | Powered by diffray

Comment thread llvm/include/llvm/CodeGen/TargetLowering.h Outdated
Comment thread llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll
Comment thread llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll
@diffray-bot

Review Summary

Free public review - Want AI code reviews on your PRs? Check out diffray.ai

Validated 10 issues: 3 kept (testing gaps, legitimate documentation concern), 7 filtered (low value style/docs, speculative performance claims)

Issues Found: 3

💬 See 3 individual line comment(s) for details.

📋 Full issue list

🟡 MEDIUM - New virtual method getMulByConstInfo() lacks unit tests

Agent: testing

Category: testing

Why this matters: Comprehensive test coverage prevents regressions, builds confidence in changes, and reduces debugging time. Manual verification is unreliable and doesn't scale.

File: llvm/include/llvm/CodeGen/TargetLowering.h:2542-2556

Description: The new virtual method getMulByConstInfo() added to TargetLowering is tested only indirectly through CodeGen integration tests, but there are no unit tests for the method's logic, edge cases, or different input parameters.

Suggestion: Add unit tests in llvm/unittests/CodeGen/ that directly test getMulByConstInfo() for power-of-2 constants, sum/difference decomposable constants, non-decomposable constants, negative values, and boundary conditions (0, 1, -1, 255).

Confidence: 75%

Rule: testing_coverage_gaps


🔵 LOW - Power-of-2 multipliers not explicitly tested in decompose file

Agent: testing

Category: testing

Why this matters: Prevents regressions when adding features, ensures all code paths are tested, catches bugs in edge cases and configuration combinations that would otherwise only surface in production.

File: llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll:1-30

Description: The decompose test file focuses on sum/difference of powers of 2 (const6, const10, const12, etc.) but doesn't have explicit test cases for pure power-of-2 multipliers (2, 4, 8, 16, etc.) that should use single shift decomposition.

Suggestion: Add explicit test cases for power-of-2 multipliers (@mul_v16i8_const2, @mul_v16i8_const4, @mul_v16i8_const8, etc.) to verify they emit a single shift instruction rather than add/sub operations.

Confidence: 70%

Rule: test_all_variations_comprehensive


🔵 LOW - Test file RUN line missing documentation of attribute choice

Agent: testing

Category: docs

Why this matters: Minimal tests are easier to understand, debug, and maintain. They make test failures clearer by isolating exactly what broke, and reduce test execution time by avoiding unnecessary operations.

File: llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll:1-10

Description: The test uses -mattr=-tuning-fast-imm-vector-shift but doesn't document in comments why this attribute needs to be disabled. This makes the test intent unclear.

Suggestion: Add a comment explaining why the attribute is disabled: '; NOTE: -tuning-fast-imm-vector-shift is disabled to test shift+add/sub decomposition optimizations' or similar explanation.

Confidence: 60%

Rule: test_minimal_complexity




Review ID: cd5c2c96-97d5-41f9-a889-452fd9bf5030
Rate it 👍 or 👎 to improve future reviews | Powered by diffray

Comment thread llvm/include/llvm/CodeGen/TargetLowering.h Outdated
- Remove unnecessary target API
- Use existing decomposition logic
@grodranlorth grodranlorth requested a review from RKSimon January 28, 2026 02:01
Comment thread llvm/test/CodeGen/X86/vector-mul-i8-negative.ll Outdated
Comment thread llvm/lib/Target/X86/X86ISelLowering.cpp
Comment thread llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp Outdated
Comment thread llvm/test/CodeGen/X86/vector-mul-i8-negative.ll Outdated
Comment thread llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll Outdated
@grodranlorth grodranlorth requested a review from RKSimon February 1, 2026 16:30
Collaborator

@RKSimon RKSimon left a comment


One last thing - please can you move the negative tests to the bottom of the vector-mul-i8-decompose.ll - no need to split them

@grodranlorth grodranlorth requested a review from RKSimon February 7, 2026 19:51
Comment thread llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll Outdated
Comment thread llvm/test/CodeGen/X86/vector-mul-i8-decompose.ll Outdated
ret <16 x i8> %result
}

; Test multiply by 160 = 128 + 32 = (1 << 7) + (1 << 5)
Collaborator


This DOES use (v)psubb but your comment says it's an add?

Contributor Author


Thanks, this was a surprising one. The optimization does decompose this case as an add, but it looks like some later pass converts it to $0 - (2^5 + 2^6) = -96 \equiv 160 \pmod{256}$.
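The wraparound identity behind this is quick to check (assuming 8-bit modular arithmetic for the i8 lanes):

```python
# 160 = 2^7 + 2^5, but negating 96 = 2^5 + 2^6 gives the same
# value modulo 256, so the psubb form is equivalent for i8 lanes.
assert 2**7 + 2**5 == 160
assert (-(2**5 + 2**6)) % 256 == 160
```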

- Fixed tests to actually exercise the negative branch
- Use "splat"
- Fix test comments
@grodranlorth grodranlorth requested a review from RKSimon March 8, 2026 19:29
Collaborator

@RKSimon RKSimon left a comment


LGTM - cheers

@RKSimon RKSimon enabled auto-merge (squash) March 11, 2026 18:10
@RKSimon RKSimon merged commit ed42ac3 into llvm:main Mar 11, 2026
9 of 10 checks passed
@github-actions

@grodranlorth Congratulations on having your first Pull Request (PR) merged into the LLVM Project!

Your changes will be combined with recent changes from other authors, then tested by our build bots. If there is a problem with a build, you may receive a report in an email or a comment on this PR.

Please check whether problems have been caused by your change specifically, as the builds can include changes from many authors. It is not uncommon for your change to be included in a build that fails due to someone else's changes, or infrastructure issues.

How to do this, and the rest of the post-merge process, is covered in detail here.

If your change does cause a problem, it may be reverted, or you can revert it yourself. This is a normal part of LLVM development. You can fix your changes and open a new PR to merge them again.

If you don't get any reports, no action is required from you. Your changes are working as expected, well done!

@grodranlorth grodranlorth deleted the shiftopt branch March 21, 2026 18:22
ayasin-a pushed a commit to ayasin-a/llvm-project that referenced this pull request Apr 19, 2026
Fixes llvm#164200

~~I will create a separate PR to the `llvm-test-suite` repo for the
microbenchmark for this change.~~ The benchmark is in
llvm/llvm-test-suite#316

In my experiments on an EC2 `c6i.4xl`, the change gives a small
improvement for the `x86-64`, `x86-64-v2`, and `x86-64-v3` targets. It
regresses performance on `x86-64-v4` (in particular, when the constant
decomposes into two shifts). The performance summary follows:

```
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v1.json results-opt-generic-v1.json  |tail -n1
OVERALL_GEOMEAN                         -0.2846         -0.2846             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v2.json results-opt-generic-v2.json  |tail -n1
OVERALL_GEOMEAN                         -0.0907         -0.0907             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v3.json results-opt-generic-v3.json  |tail -n1
OVERALL_GEOMEAN                         -0.1821         -0.1821             0             0             0             0
$ ../MicroBenchmarks/libs/benchmark/tools/compare.py  benchmarks results-baseline-generic-v4.json results-opt-generic-v4.json  |tail -n1
OVERALL_GEOMEAN                         +0.0190         +0.0190             0             0             0             0
```

Labels

backend:X86 llvm:SelectionDAG SelectionDAGISel as well

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[X86] 8-bit vector multiplication should use shift and add method for more constants

5 participants