[X86] Lower vXi8 multiplies by constant using PMADDUBSW on SSSE3+ targets #95403

Merged
RKSimon merged 3 commits into llvm:main from the x86-pmaddubsw branch on Jun 15, 2024

Conversation

RKSimon
Collaborator

@RKSimon RKSimon commented Jun 13, 2024

As discussed on #90748 - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back together.

This patch limits the lowering to multiplication by constants, but most targets would benefit from performing this for non-constant cases as well - it's just Intel Core/SandyBridge era CPUs that might experience additional Port0/5 contention.
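
For readers unfamiliar with PMADDUBSW, here is a minimal scalar sketch of one 16-bit lane of this lowering (the helper names are hypothetical and not from the patch); it shows why two pmaddubsw ops plus a mask/shift/or reproduce the per-byte multiply:

```cpp
#include <cstdint>

// Model of one i16 lane of PMADDUBSW: the first operand's bytes are treated
// as unsigned, the second's as signed, and a0*b0 + a1*b1 is signed-saturated
// to i16.
static int16_t pmaddubsw_lane(uint8_t a0, uint8_t a1, int8_t b0, int8_t b1) {
  int32_t Sum = int32_t(a0) * b0 + int32_t(a1) * b1;
  if (Sum > INT16_MAX) Sum = INT16_MAX;
  if (Sum < INT16_MIN) Sum = INT16_MIN;
  return int16_t(Sum);
}

// One i16 lane of the lowering: zero the odd constant byte for the "lo" madd
// and the even constant byte for the "hi" madd, so each lane holds a single
// a*b product (|product| <= 255*128, so saturation never triggers), then
// recombine with pand / psllw $8 / por. The low 8 bits of each product equal
// the wrapped i8 multiply regardless of how b's byte is sign-interpreted.
static uint16_t mul_v2i8_lane(uint8_t a0, uint8_t a1, uint8_t b0, uint8_t b1) {
  uint16_t RLo = uint16_t(pmaddubsw_lane(a0, a1, int8_t(b0), 0)); // a0*b0
  uint16_t RHi = uint16_t(pmaddubsw_lane(a0, a1, 0, int8_t(b1))); // a1*b1
  return uint16_t((RLo & 0x00FF) | uint16_t(RHi << 8)); // {lo byte, hi byte}
}
```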

@llvmbot
Member

llvmbot commented Jun 13, 2024

@llvm/pr-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)

Changes

As discussed on #90748 - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back together.

This patch limits the lowering to multiplication by constants, but most targets would benefit from performing this for non-constant cases as well - it's just Intel Core/SandyBridge era CPUs that might experience additional Port0/5 contention.


Patch is 141.10 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/95403.diff

18 Files Affected:

  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+27)
  • (modified) llvm/test/CodeGen/X86/combine-mul.ll (+6-9)
  • (modified) llvm/test/CodeGen/X86/gfni-shifts.ll (+121-173)
  • (modified) llvm/test/CodeGen/X86/pmul.ll (+83-119)
  • (modified) llvm/test/CodeGen/X86/srem-seteq-vec-nonsplat.ll (+69-77)
  • (modified) llvm/test/CodeGen/X86/vector-fshr-128.ll (+20-25)
  • (modified) llvm/test/CodeGen/X86/vector-fshr-256.ll (+26-40)
  • (modified) llvm/test/CodeGen/X86/vector-fshr-512.ll (+32-46)
  • (modified) llvm/test/CodeGen/X86/vector-idiv-sdiv-128.ll (+12-17)
  • (modified) llvm/test/CodeGen/X86/vector-idiv-sdiv-256.ll (+42-49)
  • (modified) llvm/test/CodeGen/X86/vector-idiv-sdiv-512.ll (+41-49)
  • (modified) llvm/test/CodeGen/X86/vector-idiv-udiv-128.ll (+22-21)
  • (modified) llvm/test/CodeGen/X86/vector-idiv-udiv-256.ll (+36-34)
  • (modified) llvm/test/CodeGen/X86/vector-idiv-udiv-512.ll (+32-31)
  • (modified) llvm/test/CodeGen/X86/vector-mul.ll (+56-55)
  • (modified) llvm/test/CodeGen/X86/vector-shift-shl-128.ll (+11-17)
  • (modified) llvm/test/CodeGen/X86/vector-shift-shl-256.ll (+41-62)
  • (modified) llvm/test/CodeGen/X86/vector-shift-shl-512.ll (+15-26)
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 88c7a4159856a..a59cd712c4f9e 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -28493,6 +28493,8 @@ static SDValue LowerMUL(SDValue Op, const X86Subtarget &Subtarget,
   // vector pairs, multiply and truncate.
   if (VT == MVT::v16i8 || VT == MVT::v32i8 || VT == MVT::v64i8) {
     unsigned NumElts = VT.getVectorNumElements();
+    unsigned NumLanes = VT.getSizeInBits() / 128;
+    unsigned NumEltsPerLane = NumElts / NumLanes;
 
     if ((VT == MVT::v16i8 && Subtarget.hasInt256()) ||
         (VT == MVT::v32i8 && Subtarget.canExtendTo512BW())) {
@@ -28506,6 +28508,31 @@ static SDValue LowerMUL(SDValue Op, const X86Subtarget &Subtarget,
 
     MVT ExVT = MVT::getVectorVT(MVT::i16, NumElts / 2);
 
+    // For vXi8 mul-by-constant, try PMADDUBSW to avoid the need for extension.
+    // Don't do this if we only need to unpack one half.
+    if (Subtarget.hasSSSE3() &&
+        ISD::isBuildVectorOfConstantSDNodes(B.getNode())) {
+      bool IsLoLaneZeroOrUndef = true;
+      bool IsHiLaneZeroOrUndef = true;
+      for (auto [Idx, Val] : enumerate(B->ops())) {
+        if ((Idx % NumEltsPerLane) >= (NumEltsPerLane / 2))
+          IsHiLaneZeroOrUndef &= isNullConstantOrUndef(Val);
+        else
+          IsLoLaneZeroOrUndef &= isNullConstantOrUndef(Val);
+      }
+      if (!(IsLoLaneZeroOrUndef || IsHiLaneZeroOrUndef)) {
+        SDValue Mask = DAG.getBitcast(VT, DAG.getConstant(0x00FF, dl, ExVT));
+        SDValue BLo = DAG.getNode(ISD::AND, dl, VT, Mask, B);
+        SDValue BHi = DAG.getNode(X86ISD::ANDNP, dl, VT, Mask, B);
+        SDValue RLo = DAG.getNode(X86ISD::VPMADDUBSW, dl, ExVT, A, BLo);
+        SDValue RHi = DAG.getNode(X86ISD::VPMADDUBSW, dl, ExVT, A, BHi);
+        RLo = DAG.getNode(ISD::AND, dl, VT, DAG.getBitcast(VT, RLo), Mask);
+        RHi = DAG.getNode(X86ISD::VSHLI, dl, ExVT, RHi,
+                          DAG.getTargetConstant(8, dl, MVT::i8));
+        return DAG.getNode(ISD::OR, dl, VT, RLo, DAG.getBitcast(VT, RHi));
+      }
+    }
+
     // Extract the lo/hi parts to any extend to i16.
     // We're going to mask off the low byte of each result element of the
     // pmullw, so it doesn't matter what's in the high byte of each 16-bit
diff --git a/llvm/test/CodeGen/X86/combine-mul.ll b/llvm/test/CodeGen/X86/combine-mul.ll
index 5d7bf4a2c9788..18e9cf7da110e 100644
--- a/llvm/test/CodeGen/X86/combine-mul.ll
+++ b/llvm/test/CodeGen/X86/combine-mul.ll
@@ -541,15 +541,12 @@ define i64 @combine_mul_smul_lohi_const_i64(i64 %h) {
 define <16 x i8> @PR35579(<16 x i8> %x) {
 ; SSE-LABEL: PR35579:
 ; SSE:       # %bb.0:
-; SSE-NEXT:    pmovzxbw {{.*#+}} xmm1 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
-; SSE-NEXT:    punpckhbw {{.*#+}} xmm0 = xmm0[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; SSE-NEXT:    pmullw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
-; SSE-NEXT:    pmovzxbw {{.*#+}} xmm2 = [255,255,255,255,255,255,255,255]
-; SSE-NEXT:    pand %xmm2, %xmm0
-; SSE-NEXT:    pmullw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
-; SSE-NEXT:    pand %xmm2, %xmm1
-; SSE-NEXT:    packuswb %xmm0, %xmm1
-; SSE-NEXT:    movdqa %xmm1, %xmm0
+; SSE-NEXT:    movdqa %xmm0, %xmm1
+; SSE-NEXT:    pmaddubsw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE-NEXT:    psllw $8, %xmm1
+; SSE-NEXT:    pmaddubsw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE-NEXT:    por %xmm1, %xmm0
 ; SSE-NEXT:    retq
 ;
 ; AVX-LABEL: PR35579:
diff --git a/llvm/test/CodeGen/X86/gfni-shifts.ll b/llvm/test/CodeGen/X86/gfni-shifts.ll
index 6232488bea71b..c2048bccc82e1 100644
--- a/llvm/test/CodeGen/X86/gfni-shifts.ll
+++ b/llvm/test/CodeGen/X86/gfni-shifts.ll
@@ -385,27 +385,21 @@ define <16 x i8> @splatvar_ashr_v16i8(<16 x i8> %a, <16 x i8> %b) nounwind {
 define <16 x i8> @constant_shl_v16i8(<16 x i8> %a) nounwind {
 ; GFNISSE-LABEL: constant_shl_v16i8:
 ; GFNISSE:       # %bb.0:
-; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm1 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
-; GFNISSE-NEXT:    punpckhbw {{.*#+}} xmm0 = xmm0[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; GFNISSE-NEXT:    pmullw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
-; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm2 = [255,255,255,255,255,255,255,255]
-; GFNISSE-NEXT:    pand %xmm2, %xmm0
-; GFNISSE-NEXT:    pmullw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
-; GFNISSE-NEXT:    pand %xmm2, %xmm1
-; GFNISSE-NEXT:    packuswb %xmm0, %xmm1
-; GFNISSE-NEXT:    movdqa %xmm1, %xmm0
+; GFNISSE-NEXT:    movdqa %xmm0, %xmm1
+; GFNISSE-NEXT:    pmaddubsw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; GFNISSE-NEXT:    psllw $8, %xmm1
+; GFNISSE-NEXT:    pmaddubsw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; GFNISSE-NEXT:    pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; GFNISSE-NEXT:    por %xmm1, %xmm0
 ; GFNISSE-NEXT:    retq
 ;
 ; GFNIAVX1-LABEL: constant_shl_v16i8:
 ; GFNIAVX1:       # %bb.0:
-; GFNIAVX1-NEXT:    vpunpckhbw {{.*#+}} xmm1 = xmm0[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; GFNIAVX1-NEXT:    vpmullw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
-; GFNIAVX1-NEXT:    vbroadcastss {{.*#+}} xmm2 = [255,255,255,255,255,255,255,255]
-; GFNIAVX1-NEXT:    vpand %xmm2, %xmm1, %xmm1
-; GFNIAVX1-NEXT:    vpmovzxbw {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
-; GFNIAVX1-NEXT:    vpmullw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
-; GFNIAVX1-NEXT:    vpand %xmm2, %xmm0, %xmm0
-; GFNIAVX1-NEXT:    vpackuswb %xmm1, %xmm0, %xmm0
+; GFNIAVX1-NEXT:    vpmaddubsw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm1
+; GFNIAVX1-NEXT:    vpsllw $8, %xmm1, %xmm1
+; GFNIAVX1-NEXT:    vpmaddubsw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; GFNIAVX1-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm0
+; GFNIAVX1-NEXT:    vpor %xmm1, %xmm0, %xmm0
 ; GFNIAVX1-NEXT:    retq
 ;
 ; GFNIAVX2-LABEL: constant_shl_v16i8:
@@ -1224,72 +1218,57 @@ define <32 x i8> @splatvar_ashr_v32i8(<32 x i8> %a, <32 x i8> %b) nounwind {
 define <32 x i8> @constant_shl_v32i8(<32 x i8> %a) nounwind {
 ; GFNISSE-LABEL: constant_shl_v32i8:
 ; GFNISSE:       # %bb.0:
-; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm2 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
-; GFNISSE-NEXT:    punpckhbw {{.*#+}} xmm0 = xmm0[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm4 = [128,64,32,16,8,4,2,1]
-; GFNISSE-NEXT:    pmullw %xmm4, %xmm0
-; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm5 = [255,255,255,255,255,255,255,255]
-; GFNISSE-NEXT:    pand %xmm5, %xmm0
-; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm6 = [1,2,4,8,16,32,64,128]
-; GFNISSE-NEXT:    pmullw %xmm6, %xmm2
-; GFNISSE-NEXT:    pand %xmm5, %xmm2
-; GFNISSE-NEXT:    packuswb %xmm0, %xmm2
-; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm3 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
-; GFNISSE-NEXT:    punpckhbw {{.*#+}} xmm1 = xmm1[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; GFNISSE-NEXT:    pmullw %xmm4, %xmm1
-; GFNISSE-NEXT:    pand %xmm5, %xmm1
-; GFNISSE-NEXT:    pmullw %xmm6, %xmm3
-; GFNISSE-NEXT:    pand %xmm5, %xmm3
-; GFNISSE-NEXT:    packuswb %xmm1, %xmm3
-; GFNISSE-NEXT:    movdqa %xmm2, %xmm0
-; GFNISSE-NEXT:    movdqa %xmm3, %xmm1
+; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm2 = [1,4,16,64,128,32,8,2]
+; GFNISSE-NEXT:    movdqa %xmm0, %xmm3
+; GFNISSE-NEXT:    pmaddubsw %xmm2, %xmm3
+; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm4 = [255,255,255,255,255,255,255,255]
+; GFNISSE-NEXT:    pand %xmm4, %xmm3
+; GFNISSE-NEXT:    movdqa {{.*#+}} xmm5 = [0,2,0,8,0,32,0,128,0,64,0,16,0,4,0,1]
+; GFNISSE-NEXT:    pmaddubsw %xmm5, %xmm0
+; GFNISSE-NEXT:    psllw $8, %xmm0
+; GFNISSE-NEXT:    por %xmm3, %xmm0
+; GFNISSE-NEXT:    movdqa %xmm1, %xmm3
+; GFNISSE-NEXT:    pmaddubsw %xmm2, %xmm3
+; GFNISSE-NEXT:    pand %xmm4, %xmm3
+; GFNISSE-NEXT:    pmaddubsw %xmm5, %xmm1
+; GFNISSE-NEXT:    psllw $8, %xmm1
+; GFNISSE-NEXT:    por %xmm3, %xmm1
 ; GFNISSE-NEXT:    retq
 ;
 ; GFNIAVX1-LABEL: constant_shl_v32i8:
 ; GFNIAVX1:       # %bb.0:
 ; GFNIAVX1-NEXT:    vextractf128 $1, %ymm0, %xmm1
-; GFNIAVX1-NEXT:    vpunpckhbw {{.*#+}} xmm2 = xmm1[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; GFNIAVX1-NEXT:    vpmovzxbw {{.*#+}} xmm3 = [128,64,32,16,8,4,2,1]
-; GFNIAVX1-NEXT:    vpmullw %xmm3, %xmm2, %xmm2
+; GFNIAVX1-NEXT:    vpmovzxbw {{.*#+}} xmm2 = [1,4,16,64,128,32,8,2]
+; GFNIAVX1-NEXT:    vpmaddubsw %xmm2, %xmm1, %xmm3
 ; GFNIAVX1-NEXT:    vbroadcastss {{.*#+}} xmm4 = [255,255,255,255,255,255,255,255]
+; GFNIAVX1-NEXT:    vpand %xmm4, %xmm3, %xmm3
+; GFNIAVX1-NEXT:    vmovdqa {{.*#+}} xmm5 = [0,2,0,8,0,32,0,128,0,64,0,16,0,4,0,1]
+; GFNIAVX1-NEXT:    vpmaddubsw %xmm5, %xmm1, %xmm1
+; GFNIAVX1-NEXT:    vpsllw $8, %xmm1, %xmm1
+; GFNIAVX1-NEXT:    vpor %xmm1, %xmm3, %xmm1
+; GFNIAVX1-NEXT:    vpmaddubsw %xmm2, %xmm0, %xmm2
 ; GFNIAVX1-NEXT:    vpand %xmm4, %xmm2, %xmm2
-; GFNIAVX1-NEXT:    vpmovzxbw {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
-; GFNIAVX1-NEXT:    vpmovzxbw {{.*#+}} xmm5 = [1,2,4,8,16,32,64,128]
-; GFNIAVX1-NEXT:    vpmullw %xmm5, %xmm1, %xmm1
-; GFNIAVX1-NEXT:    vpand %xmm4, %xmm1, %xmm1
-; GFNIAVX1-NEXT:    vpackuswb %xmm2, %xmm1, %xmm1
-; GFNIAVX1-NEXT:    vpunpckhbw {{.*#+}} xmm2 = xmm0[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; GFNIAVX1-NEXT:    vpmullw %xmm3, %xmm2, %xmm2
-; GFNIAVX1-NEXT:    vpand %xmm4, %xmm2, %xmm2
-; GFNIAVX1-NEXT:    vpmovzxbw {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
-; GFNIAVX1-NEXT:    vpmullw %xmm5, %xmm0, %xmm0
-; GFNIAVX1-NEXT:    vpand %xmm4, %xmm0, %xmm0
-; GFNIAVX1-NEXT:    vpackuswb %xmm2, %xmm0, %xmm0
+; GFNIAVX1-NEXT:    vpmaddubsw %xmm5, %xmm0, %xmm0
+; GFNIAVX1-NEXT:    vpsllw $8, %xmm0, %xmm0
+; GFNIAVX1-NEXT:    vpor %xmm0, %xmm2, %xmm0
 ; GFNIAVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
 ; GFNIAVX1-NEXT:    retq
 ;
 ; GFNIAVX2-LABEL: constant_shl_v32i8:
 ; GFNIAVX2:       # %bb.0:
-; GFNIAVX2-NEXT:    vpunpckhbw {{.*#+}} ymm1 = ymm0[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,24,24,25,25,26,26,27,27,28,28,29,29,30,30,31,31]
-; GFNIAVX2-NEXT:    vpmullw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm1, %ymm1
-; GFNIAVX2-NEXT:    vpbroadcastw {{.*#+}} ymm2 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]
-; GFNIAVX2-NEXT:    vpand %ymm2, %ymm1, %ymm1
-; GFNIAVX2-NEXT:    vpunpcklbw {{.*#+}} ymm0 = ymm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,16,16,17,17,18,18,19,19,20,20,21,21,22,22,23,23]
-; GFNIAVX2-NEXT:    vpmullw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
-; GFNIAVX2-NEXT:    vpand %ymm2, %ymm0, %ymm0
-; GFNIAVX2-NEXT:    vpackuswb %ymm1, %ymm0, %ymm0
+; GFNIAVX2-NEXT:    vpmaddubsw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm1
+; GFNIAVX2-NEXT:    vpsllw $8, %ymm1, %ymm1
+; GFNIAVX2-NEXT:    vpmaddubsw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
+; GFNIAVX2-NEXT:    vpand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
+; GFNIAVX2-NEXT:    vpor %ymm1, %ymm0, %ymm0
 ; GFNIAVX2-NEXT:    retq
 ;
 ; GFNIAVX512VL-LABEL: constant_shl_v32i8:
 ; GFNIAVX512VL:       # %bb.0:
-; GFNIAVX512VL-NEXT:    vpunpckhbw {{.*#+}} ymm1 = ymm0[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,24,24,25,25,26,26,27,27,28,28,29,29,30,30,31,31]
-; GFNIAVX512VL-NEXT:    vpmullw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm1, %ymm1
-; GFNIAVX512VL-NEXT:    vpbroadcastd {{.*#+}} ymm2 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]
-; GFNIAVX512VL-NEXT:    vpand %ymm2, %ymm1, %ymm1
-; GFNIAVX512VL-NEXT:    vpunpcklbw {{.*#+}} ymm0 = ymm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,16,16,17,17,18,18,19,19,20,20,21,21,22,22,23,23]
-; GFNIAVX512VL-NEXT:    vpmullw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
-; GFNIAVX512VL-NEXT:    vpand %ymm2, %ymm0, %ymm0
-; GFNIAVX512VL-NEXT:    vpackuswb %ymm1, %ymm0, %ymm0
+; GFNIAVX512VL-NEXT:    vpmaddubsw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm1
+; GFNIAVX512VL-NEXT:    vpmaddubsw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
+; GFNIAVX512VL-NEXT:    vpsllw $8, %ymm0, %ymm0
+; GFNIAVX512VL-NEXT:    vpternlogd $248, {{\.?LCPI[0-9]+_[0-9]+}}(%rip){1to8}, %ymm1, %ymm0
 ; GFNIAVX512VL-NEXT:    retq
 ;
 ; GFNIAVX512BW-LABEL: constant_shl_v32i8:
@@ -2588,140 +2567,109 @@ define <64 x i8> @splatvar_ashr_v64i8(<64 x i8> %a, <64 x i8> %b) nounwind {
 define <64 x i8> @constant_shl_v64i8(<64 x i8> %a) nounwind {
 ; GFNISSE-LABEL: constant_shl_v64i8:
 ; GFNISSE:       # %bb.0:
-; GFNISSE-NEXT:    movdqa %xmm1, %xmm4
-; GFNISSE-NEXT:    movdqa %xmm0, %xmm1
-; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
-; GFNISSE-NEXT:    punpckhbw {{.*#+}} xmm1 = xmm1[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm7 = [128,64,32,16,8,4,2,1]
-; GFNISSE-NEXT:    pmullw %xmm7, %xmm1
-; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm6 = [255,255,255,255,255,255,255,255]
-; GFNISSE-NEXT:    pand %xmm6, %xmm1
-; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm8 = [1,2,4,8,16,32,64,128]
-; GFNISSE-NEXT:    pmullw %xmm8, %xmm0
-; GFNISSE-NEXT:    pand %xmm6, %xmm0
-; GFNISSE-NEXT:    packuswb %xmm1, %xmm0
-; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm1 = xmm4[0],zero,xmm4[1],zero,xmm4[2],zero,xmm4[3],zero,xmm4[4],zero,xmm4[5],zero,xmm4[6],zero,xmm4[7],zero
-; GFNISSE-NEXT:    punpckhbw {{.*#+}} xmm4 = xmm4[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; GFNISSE-NEXT:    pmullw %xmm7, %xmm4
-; GFNISSE-NEXT:    pand %xmm6, %xmm4
-; GFNISSE-NEXT:    pmullw %xmm8, %xmm1
-; GFNISSE-NEXT:    pand %xmm6, %xmm1
-; GFNISSE-NEXT:    packuswb %xmm4, %xmm1
-; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm4 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
-; GFNISSE-NEXT:    punpckhbw {{.*#+}} xmm2 = xmm2[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; GFNISSE-NEXT:    pmullw %xmm7, %xmm2
-; GFNISSE-NEXT:    pand %xmm6, %xmm2
-; GFNISSE-NEXT:    pmullw %xmm8, %xmm4
-; GFNISSE-NEXT:    pand %xmm6, %xmm4
-; GFNISSE-NEXT:    packuswb %xmm2, %xmm4
-; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm5 = xmm3[0],zero,xmm3[1],zero,xmm3[2],zero,xmm3[3],zero,xmm3[4],zero,xmm3[5],zero,xmm3[6],zero,xmm3[7],zero
-; GFNISSE-NEXT:    punpckhbw {{.*#+}} xmm3 = xmm3[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; GFNISSE-NEXT:    pmullw %xmm7, %xmm3
-; GFNISSE-NEXT:    pand %xmm6, %xmm3
-; GFNISSE-NEXT:    pmullw %xmm8, %xmm5
-; GFNISSE-NEXT:    pand %xmm6, %xmm5
-; GFNISSE-NEXT:    packuswb %xmm3, %xmm5
-; GFNISSE-NEXT:    movdqa %xmm4, %xmm2
-; GFNISSE-NEXT:    movdqa %xmm5, %xmm3
+; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm4 = [1,4,16,64,128,32,8,2]
+; GFNISSE-NEXT:    movdqa %xmm0, %xmm6
+; GFNISSE-NEXT:    pmaddubsw %xmm4, %xmm6
+; GFNISSE-NEXT:    pmovzxbw {{.*#+}} xmm5 = [255,255,255,255,255,255,255,255]
+; GFNISSE-NEXT:    pand %xmm5, %xmm6
+; GFNISSE-NEXT:    movdqa {{.*#+}} xmm7 = [0,2,0,8,0,32,0,128,0,64,0,16,0,4,0,1]
+; GFNISSE-NEXT:    pmaddubsw %xmm7, %xmm0
+; GFNISSE-NEXT:    psllw $8, %xmm0
+; GFNISSE-NEXT:    por %xmm6, %xmm0
+; GFNISSE-NEXT:    movdqa %xmm1, %xmm6
+; GFNISSE-NEXT:    pmaddubsw %xmm4, %xmm6
+; GFNISSE-NEXT:    pand %xmm5, %xmm6
+; GFNISSE-NEXT:    pmaddubsw %xmm7, %xmm1
+; GFNISSE-NEXT:    psllw $8, %xmm1
+; GFNISSE-NEXT:    por %xmm6, %xmm1
+; GFNISSE-NEXT:    movdqa %xmm2, %xmm6
+; GFNISSE-NEXT:    pmaddubsw %xmm4, %xmm6
+; GFNISSE-NEXT:    pand %xmm5, %xmm6
+; GFNISSE-NEXT:    pmaddubsw %xmm7, %xmm2
+; GFNISSE-NEXT:    psllw $8, %xmm2
+; GFNISSE-NEXT:    por %xmm6, %xmm2
+; GFNISSE-NEXT:    movdqa %xmm3, %xmm6
+; GFNISSE-NEXT:    pmaddubsw %xmm4, %xmm6
+; GFNISSE-NEXT:    pand %xmm5, %xmm6
+; GFNISSE-NEXT:    pmaddubsw %xmm7, %xmm3
+; GFNISSE-NEXT:    psllw $8, %xmm3
+; GFNISSE-NEXT:    por %xmm6, %xmm3
 ; GFNISSE-NEXT:    retq
 ;
 ; GFNIAVX1-LABEL: constant_shl_v64i8:
 ; GFNIAVX1:       # %bb.0:
 ; GFNIAVX1-NEXT:    vextractf128 $1, %ymm0, %xmm2
-; GFNIAVX1-NEXT:    vpunpckhbw {{.*#+}} xmm3 = xmm2[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; GFNIAVX1-NEXT:    vpmovzxbw {{.*#+}} xmm4 = [128,64,32,16,8,4,2,1]
-; GFNIAVX1-NEXT:    vpmullw %xmm4, %xmm3, %xmm3
+; GFNIAVX1-NEXT:    vpmovzxbw {{.*#+}} xmm3 = [1,4,16,64,128,32,8,2]
+; GFNIAVX1-NEXT:    vpmaddubsw %xmm3, %xmm2, %xmm4
 ; GFNIAVX1-NEXT:    vbroadcastss {{.*#+}} xmm5 = [255,255,255,255,255,255,255,255]
-; GFNIAVX1-NEXT:    vpand %xmm5, %xmm3, %xmm3
-; GFNIAVX1-NEXT:    vpmovzxbw {{.*#+}} xmm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
-; GFNIAVX1-NEXT:    vpmovzxbw {{.*#+}} xmm6 = [1,2,4,8,16,32,64,128]
-; GFNIAVX1-NEXT:    vpmullw %xmm6, %xmm2, %xmm2
-; GFNIAVX1-NEXT:    vpand %xmm5, %xmm2, %xmm2
-; GFNIAVX1-NEXT:    vpackuswb %xmm3, %xmm2, %xmm2
-; GFNIAVX1-NEXT:    vpunpckhbw {{.*#+}} xmm3 = xmm0[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; GFNIAVX1-NEXT:    vpmullw %xmm4, %xmm3, %xmm3
-; GFNIAVX1-NEXT:    vpand %xmm5, %xmm3, %xmm3
-; GFNIAVX1-NEXT:    vpmovzxbw {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
-; GFNIAVX1-NEXT:    vpmullw %xmm6, %xmm0, %xmm0
-; GFNIAVX1-NEXT:    vpand %xmm5, %xmm0, %xmm0
-; GFNIAVX1-NEXT:    vpackuswb %xmm3, %xmm0, %xmm0
+; GFNIAVX1-NEXT:    vpand %xmm5, %xmm4, %xmm4
+; GFNIAVX1-NEXT:    vmovdqa {{.*#+}} xmm6 = [0,2,0,8,0,32,0,128,0,64,0,16,0,4,0,1]
+; GFNIAVX1-NEXT:    vpmaddubsw %xmm6, %xmm2, %xmm2
+; GFNIAVX1-NEXT:    vpsllw $8, %xmm2, %xmm2
+; GFNIAVX1-NEXT:    vpor %xmm2, %xmm4, %xmm2
+; GFNIAVX1-NEXT:    vpmaddubsw %xmm3, %xmm0, %xmm4
+; GFNIAVX1-NEXT:    vpand %xmm5, %xmm4, %xmm4
+; GFNIAVX1-NEXT:    vpmaddubsw %xmm6, %xmm0, %xmm0
+; GFNIAVX1-NEXT:    vpsllw $8, %xmm0, %xmm0
+; GFNIAVX1-NEXT:    vpor %xmm0, %xmm4, %xmm0
 ; GFNIAVX1-NEXT:    vinsertf128 $1, %xmm2, %ymm0, %ymm0
 ; GFNIAVX1-NEXT:    vextractf128 $1, %ymm1, %xmm2
-; GFNIAVX1-NEXT:    vpunpckhbw {{.*#+}} xmm3 = xmm2[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; GFNIAVX1-NEXT:    vpmullw %xmm4, %xmm3, %xmm3
+; GFNIAVX1-NEXT:    vpmaddubsw %xmm3, %xmm2, %xmm4
+; GFNIAVX1-NEXT:    vpand %xmm5, %xmm4, %xmm4
+; GFNIAVX1-NEXT:    vpmaddubsw %xmm6, %xmm2, %xmm2
+; GFNIAVX1-NEXT:    vpsllw $8, %xmm2, %xmm2
+; GFNIAVX1-NEXT:    vpor %xmm2, %xmm4, %xmm2
+; GFNIAVX1-NEXT:    vpmaddubsw %xmm3, %xmm1, %xmm3
 ; GFNIAVX1-NEXT:    vpand %xmm5, %xmm3, %xmm3
-; GFNIAVX1-NEXT:    vpmovzxbw {{.*#+}} xmm2 = xmm2[0],zero,xmm2[1],zero,xmm2[2],zero,xmm2[3],zero,xmm2[4],zero,xmm2[5],zero,xmm2[6],zero,xmm2[7],zero
-; GFNIAVX1-NEXT:    vpmullw %xmm6, %xmm2, %xmm2
-; GFNIAVX1-NEXT:    vpand %xmm5, %xmm2, %xmm2
-; GFNIAVX1-NEXT:    vpackuswb %xmm3, %xmm2, %xmm2
-; GFNIAVX1-NEXT:    vpunpckhbw {{.*#+}} xmm3 = xmm1[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
-; GFNIAVX1-NEXT:    vpmullw %xmm4, %xmm3, %xmm3
-; GFNIAVX1-NEXT:    vpand %xmm5, %xmm3, %xmm3
-; GFNIAVX1-NEXT:    vpmovzxbw {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
-; GFNIAVX1-NEXT:    vpmullw %xmm6, %xmm1, %xmm1
-; GFNIAVX1-NEXT:    vpand %xmm5, %xmm1, %xmm1
-; GFNIAVX1-NEXT:    vpackuswb %xmm3, %xmm1, %xmm1
+; GFNIAVX1-NEXT:    vpmaddubsw %xmm6, %xmm1, %xmm1
+; GFNIAVX1-NEXT:    vpsllw $8, %xmm1, %xmm1
+; GFNIAVX1-NEXT:    vpor %xmm1, %xmm3, %xmm1
 ; GFNIAVX1-NEXT:    vinsertf128 $1, %xmm2, %ymm1, %ymm1
 ; GFNIAVX1-NEXT:    retq
 ;
 ; GFNIAVX2-LABEL: constant_shl_v64i8:
 ; GFNIAVX2:       # %bb.0:
-; GFNIAVX2-NEXT:    vpunpckhbw {{.*#+}} ymm2 = ymm0[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,24,24,25,25,26,26,27,27,28,28,29,29,30,30,31,31]
-; GFNIAVX2-NEXT:    vbroadcasti128 {{.*#+}} ymm3 = [128,64,32,16,8,4,2,1,128,64,32,16,8,4,2,1]
-; GFNIAVX2-NEXT:    # ymm3 = mem[0,1,0,1]
-; GFNIAVX2-NEXT:    vpmullw %ymm3, %ymm2, %ymm2
+; GFNIAVX2-NEXT:    vbroadcasti128 {{.*#+}} ymm2 = [1,0,4,0,16,0,64,0,128,0,32,0,8,0,2,0,1,0,4,0,16,0,64,0,128,0,32,0,8,0,2,0]
+; GFNIAVX2-NEXT:    # ymm2 = mem[0,1,0,1]
+; GFNIAVX2-NEXT:    vpmaddubsw %ymm2, %ymm0, %ymm3
 ; GFNIAVX2-NEXT:    vpbroadcastw {{.*#+}} ymm4 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]
-; GFNIAVX2-NEXT:    vpand %ymm4, %y...
[truncated]

; SSE-NEXT: packuswb %xmm0, %xmm1
; SSE-NEXT: movdqa %xmm1, %xmm0
; SSE-NEXT: movdqa %xmm0, %xmm1
; SSE-NEXT: pmaddubsw {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
Contributor

It took me a few seconds to guess the data. It would be helpful if we could show it in a comment, like
constant = 0, 1, 0, 1...

Collaborator Author

Yes, I'd like to get that done - let me see what I can put together in AsmPrinter. I did a lot of the heavy lifting for X86FixupVectorConstants.

I think we considered adding asm comments like this a few years ago but it never went anywhere - the last I remember was a discussion about whether float numbers should always be fixed-width hex... @topperc might remember better.

@RKSimon
Collaborator Author

RKSimon commented Jun 13, 2024

@phoebewang Do you have any thoughts on non-constants? I could add an IsPMADDUBSWSlow tuning tag for SandyBridge, but it'd be annoying if we had to set it for any of the generic x86-64-v levels as well.

Contributor

@KanRobert KanRobert left a comment

TEST changes LGTM

@phoebewang
Contributor

This patch limits the lowering to multiplication by constants, but most targets would benefit from performing this for non-constant cases as well - it's just Intel Core/SandyBridge era CPUs that might experience additional Port0/5 contention.

Do you mean the problem that early Core CPUs' PMADDUBSW TP = 1? https://uops.info/table.html?search=PMADDUBSW&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SNB=on&cb_IVB=on&cb_SKL=on&cb_AMT=on&cb_ZEN3=on&cb_measurements=on&cb_doc=on&cb_base=on&cb_avx=on&cb_avx2=on&cb_sse=on
Or that pand/pandn's TP = 0.33? https://uops.info/table.html?search=pand&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SNB=on&cb_IVB=on&cb_SKL=on&cb_ADLP=on&cb_ZEN3=on&cb_measurements=on&cb_doc=on&cb_base=on&cb_avx=on&cb_avx2=on&cb_sse=on

The latter seems to apply to the latest CPUs as well.

@RKSimon
Collaborator Author

RKSimon commented Jun 13, 2024

AFAICT this used to put all the PMADDUBSW and PSLLW instructions on Port0, while the older PMULLW+UNPCK/PACK sequence is spread between Port0 and Port15.

@phoebewang
Contributor

Oh, I forgot PSLLW. Do you take PAND/PANDN into account? I think non-constant cases need them, right?

@RKSimon
Collaborator Author

RKSimon commented Jun 13, 2024

This is the breakdown I created for #90748: https://llvm.godbolt.org/z/9361GKrds

@phoebewang
Contributor

phoebewang commented Jun 14, 2024

@phoebewang Do you have any thoughts on non-constants? I could add an IsPMADDUBSWSlow tuning tag for SandyBridge, but it'd be annoying if we had to set it for any of the generic x86-64-v levels as well.

I see the cycle count is still slightly better on SandyBridge. I'm OK to start without a tuning tag. We can revisit it if we notice a notable performance drop.

@RKSimon
Collaborator Author

RKSimon commented Jun 14, 2024

Thanks - I'm going to get the constant comment code done first, then get this patch committed, then I'll raise a new PR for non-constant multiplies.

RKSimon added a commit that referenced this pull request Jun 14, 2024
Based on feedback from #95403 - we use multiply-by-constant for various lowerings (shifts, division, etc.), so it's very useful to print out the constants to help understand the transform involved.

vXi16 multiplies are the easiest to add for this initial commit, but we can add other arithmetic instructions as follow-ups when the need arises (I intend to add PMADDUBSW handling for #95403 next).

I've done my best to update all the test checks, but there are bound to be ones that got missed that will only appear when the file is regenerated.
Contributor

@phoebewang phoebewang left a comment

LGTM.

@RKSimon RKSimon merged commit 9476671 into llvm:main Jun 15, 2024
4 of 7 checks passed
@RKSimon RKSimon deleted the x86-pmaddubsw branch June 15, 2024 13:07
RKSimon added a commit to RKSimon/llvm-project that referenced this pull request Jun 16, 2024
Extends llvm#95403 to handle non-constant cases - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back together.

Most targets would benefit from performing this for non-constant cases as well - it's just Intel Core/SandyBridge era CPUs that might experience additional Port0/15 contention.

Fixes llvm#90748
RKSimon added a commit to RKSimon/llvm-project that referenced this pull request Jun 25, 2024
Extends llvm#95403 to handle non-constant cases - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back together.

Most targets would benefit from performing this for non-constant cases as well - it's just Intel Core/SandyBridge era CPUs that might experience additional Port0/15 contention.

Fixes llvm#90748
RKSimon added a commit that referenced this pull request Jun 25, 2024
Extends #95403 to handle non-constant cases - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back together.

Most targets benefit from performing this for non-constant cases - it's just Intel Core/SandyBridge era CPUs that might experience additional Port0/15 contention (but a lower instruction count).

Fixes #90748
AlexisPerry pushed a commit to llvm-project-tlp/llvm-project that referenced this pull request Jul 9, 2024
…5690)

Extends llvm#95403 to handle non-constant cases - we can avoid unpacks/extensions from vXi8 to vXi16 by using PMADDUBSW instead and truncating the vXi16 results back together.

Most targets benefit from performing this for non-constant cases - it's just Intel Core/SandyBridge era CPUs that might experience additional Port0/15 contention (but a lower instruction count).

Fixes llvm#90748