[SelectionDAG] Fix and improve TargetLowering::SimplifySetCC #87646

Merged
merged 1 commit on Apr 12, 2024

Conversation

Collaborator

@bjope bjope commented Apr 4, 2024

The load narrowing part of TargetLowering::SimplifySetCC is updated
according to this:

  1. The offset calculation (for big endian) did not work properly for
    non byte-sized types. This is basically solved by an early exit
    if the memory type isn't byte-sized. But the code is also corrected
    to use the store size when calculating the offset.
  2. To still allow some optimizations for non-byte-sized types the
    TargetLowering::isPaddedAtMostSignificantBitsWhenStored hook is
    added. By default it assumes that scalar integer types are padded
    starting at the most significant bits, if the type needs padding
    when being stored to memory.
  3. Allow optimizing when isPaddedAtMostSignificantBitsWhenStored is
    true, as that hook makes it possible for TargetLowering to know
    how the non byte-sized value is aligned in memory.
  4. Update the algorithm to always search for a narrowed load with
    a power-of-2 byte-sized type. In the past the algorithm started
    with the width of the original load, and then divided it by
    two for each iteration. But for a type such as i48 that would
    just end up trying to narrow the load into an i24 or i12 load,
    and then we would fail sooner or later due to not finding a
    newVT that fulfilled newVT.isRound().
    With this new approach we can narrow the i48 load into either
    an i8, i16 or i32 load. By checking whether such a load is allowed
    (e.g. alignment-wise) at any multiple-of-8 offset, we can find
    more opportunities for the optimization to trigger. So even for a
    byte-sized type such as i32 we may now end up narrowing the load
    into loading the 16 bits starting at offset 8 (if that is allowed
    by the target). The old algorithm did not even consider that case.
    A sketch of the new search order is included below the list.
  5. Also start using getObjectPtrOffset instead of getMemBasePlusOffset
    when creating the new ptr. This way we get "nsw" on the add.
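
Below is a rough standalone sketch of the new search order. It is a
simplified illustration (assuming values of at most 64 bits) and not the
actual patch: the real code additionally rejects candidates via the
shouldReduceLoadWidth and allowsMemoryAccess (including "Fast") hooks.

// Sketch of the new width/offset search order: try power-of-2 widths
// starting at one byte, and for each width scan all multiple-of-8 bit
// offsets within the original width, accepting the first window that
// covers all bits set in the mask.
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <optional>

struct Candidate {
  unsigned Width;  // narrowed load width in bits
  unsigned Offset; // bit offset of the narrowed load within the value
};

static std::optional<Candidate> findNarrowedLoad(unsigned OrigWidth,
                                                 uint64_t Mask) {
  assert(OrigWidth <= 64 && "sketch limited to values of at most 64 bits");
  for (unsigned Width = 8; Width < OrigWidth; Width *= 2) {
    uint64_t Window = (1ull << Width) - 1;
    for (unsigned Offset = 0; Offset + Width <= OrigWidth; Offset += 8)
      if ((Mask & ~(Window << Offset)) == 0)
        return Candidate{Width, Offset};
  }
  return std::nullopt;
}

int main() {
  // i48 load masked with 0xffff00: the old algorithm only tried i24/i12 and
  // gave up; the new order finds a 16-bit load at bit offset 8.
  if (auto C = findNarrowedLoad(48, 0xffff00))
    std::printf("narrow to i%u at bit offset %u\n", C->Width, C->Offset);
  return 0;
}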

Member

llvmbot commented Apr 4, 2024

@llvm/pr-subscribers-backend-arm

@llvm/pr-subscribers-backend-powerpc

Author: Björn Pettersson (bjope)

Patch is 27.71 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/87646.diff

4 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/TargetLowering.h (+7)
  • (modified) llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp (+42-27)
  • (added) llvm/test/CodeGen/ARM/simplifysetcc_narrow_load.ll (+401)
  • (added) llvm/test/CodeGen/PowerPC/simplifysetcc_narrow_load.ll (+382)
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index a4dc097446186a..980d75ad91a0de 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -1804,6 +1804,13 @@ class TargetLoweringBase {
   /// where the sext is redundant, and use x directly.
   virtual bool shouldRemoveRedundantExtend(SDValue Op) const { return true; }
 
+  /// Indicates if any padding is guaranteed to go at the most significant bits
+  /// when storing the type to memory and the type size isn't equal to the store
+  /// size.
+  bool isPaddedAtMostSignificantBitsWhenStored(EVT VT) const {
+    return VT.isScalarInteger() && !VT.isByteSized();
+  }
+
   /// When splitting a value of the specified type into parts, does the Lo
   /// or Hi part come first?  This usually follows the endianness, except
   /// for ppcf128, where the Hi part always comes first.
diff --git a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
index 409d66adfd67d1..d36837f98beae4 100644
--- a/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -4621,7 +4621,10 @@ SDValue TargetLowering::SimplifySetCC(EVT VT, SDValue N0, SDValue N1,
       LoadSDNode *Lod = cast<LoadSDNode>(N0.getOperand(0));
       APInt bestMask;
       unsigned bestWidth = 0, bestOffset = 0;
-      if (Lod->isSimple() && Lod->isUnindexed()) {
+      if (Lod->isSimple() && Lod->isUnindexed() &&
+          (Lod->getMemoryVT().isByteSized() ||
+           isPaddedAtMostSignificantBitsWhenStored(Lod->getMemoryVT()))) {
+        unsigned memWidth = Lod->getMemoryVT().getStoreSizeInBits();
         unsigned origWidth = N0.getValueSizeInBits();
         unsigned maskWidth = origWidth;
         // We can narrow (e.g.) 16-bit extending loads on 32-bit target to
@@ -4629,40 +4632,52 @@ SDValue TargetLowering::SimplifySetCC(EVT VT, SDValue N0, SDValue N1,
         if (Lod->getExtensionType() != ISD::NON_EXTLOAD)
           origWidth = Lod->getMemoryVT().getSizeInBits();
         const APInt &Mask = N0.getConstantOperandAPInt(1);
-        for (unsigned width = origWidth / 2; width>=8; width /= 2) {
+        // Only consider power-of-2 widths (and at least one byte) as candidates
+        // for the narrowed load.
+        for (unsigned width = 8; width < origWidth; width *= 2) {
+          EVT newVT = EVT::getIntegerVT(*DAG.getContext(), width);
+          if (!shouldReduceLoadWidth(Lod, ISD::NON_EXTLOAD, newVT))
+            continue;
           APInt newMask = APInt::getLowBitsSet(maskWidth, width);
-          for (unsigned offset=0; offset<origWidth/width; offset++) {
+          // Avoid accessing any padding here for now (we could use memWidth
+          // instead of origWidth here otherwise).
+          unsigned maxOffset = origWidth - width;
+          for (unsigned offset = 0; offset <= maxOffset; offset += 8) {
             if (Mask.isSubsetOf(newMask)) {
-              if (Layout.isLittleEndian())
-                bestOffset = (uint64_t)offset * (width/8);
-              else
-                bestOffset = (origWidth/width - offset - 1) * (width/8);
-              bestMask = Mask.lshr(offset * (width/8) * 8);
-              bestWidth = width;
-              break;
+              unsigned ptrOffset =
+                  Layout.isLittleEndian() ? offset : memWidth - width - offset;
+              unsigned IsFast = 0;
+              Align NewAlign = commonAlignment(Lod->getAlign(), ptrOffset / 8);
+              if (allowsMemoryAccess(*DAG.getContext(), Layout, newVT,
+                                     Lod->getAddressSpace(), NewAlign,
+                                     Lod->getMemOperand()->getFlags(),
+                                     &IsFast) &&
+                  IsFast) {
+                bestOffset = ptrOffset / 8;
+                bestMask = Mask.lshr(offset);
+                bestWidth = width;
+                break;
+              }
             }
-            newMask <<= width;
+            newMask <<= 8;
           }
+          if (bestWidth)
+            break;
         }
       }
       if (bestWidth) {
         EVT newVT = EVT::getIntegerVT(*DAG.getContext(), bestWidth);
-        if (newVT.isRound() &&
-            shouldReduceLoadWidth(Lod, ISD::NON_EXTLOAD, newVT)) {
-          SDValue Ptr = Lod->getBasePtr();
-          if (bestOffset != 0)
-            Ptr = DAG.getMemBasePlusOffset(Ptr, TypeSize::getFixed(bestOffset),
-                                           dl);
-          SDValue NewLoad =
-              DAG.getLoad(newVT, dl, Lod->getChain(), Ptr,
-                          Lod->getPointerInfo().getWithOffset(bestOffset),
-                          Lod->getOriginalAlign());
-          return DAG.getSetCC(dl, VT,
-                              DAG.getNode(ISD::AND, dl, newVT, NewLoad,
-                                      DAG.getConstant(bestMask.trunc(bestWidth),
-                                                      dl, newVT)),
-                              DAG.getConstant(0LL, dl, newVT), Cond);
-        }
+        SDValue Ptr = Lod->getBasePtr();
+        if (bestOffset != 0)
+          Ptr = DAG.getObjectPtrOffset(dl, Ptr, TypeSize::getFixed(bestOffset));
+        SDValue NewLoad =
+            DAG.getLoad(newVT, dl, Lod->getChain(), Ptr,
+                        Lod->getPointerInfo().getWithOffset(bestOffset),
+                        Lod->getOriginalAlign());
+        SDValue And = DAG.getNode(ISD::AND, dl, newVT, NewLoad,
+                                  DAG.getConstant(bestMask.trunc(bestWidth),
+                                                  dl, newVT));
+        return DAG.getSetCC(dl, VT, And, DAG.getConstant(0LL, dl, newVT), Cond);
       }
     }
 
diff --git a/llvm/test/CodeGen/ARM/simplifysetcc_narrow_load.ll b/llvm/test/CodeGen/ARM/simplifysetcc_narrow_load.ll
new file mode 100644
index 00000000000000..72a90c54823d8f
--- /dev/null
+++ b/llvm/test/CodeGen/ARM/simplifysetcc_narrow_load.ll
@@ -0,0 +1,401 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -O1 -mtriple arm -o - %s | FileCheck --check-prefix CHECK-LE %s
+; RUN: llc -O1 -mtriple armeb -o - %s | FileCheck --check-prefix CHECK-BE %s
+
+; A collection of regression tests to verify the load-narrowing part of
+; TargetLowering::SimplifySetCC (and/or other similar rewrites such as
+; combining AND+LOAD into ZEXTLOAD).
+
+
+;--------------------------------------------------------------------------
+; Test non byte-sized types.
+;
+; As long as LLVM IR isn't defining where the padding goes we can't really
+; optimize these (without adding a target lowering hook that can inform
+; ISel about which bits are padding).
+; --------------------------------------------------------------------------
+
+define i1 @test_129_15_0(ptr %y) {
+; CHECK-LE-LABEL: test_129_15_0:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldrh r0, [r0]
+; CHECK-LE-NEXT:    mov r1, #255
+; CHECK-LE-NEXT:    orr r1, r1, #32512
+; CHECK-LE-NEXT:    ands r0, r0, r1
+; CHECK-LE-NEXT:    movne r0, #1
+; CHECK-LE-NEXT:    mov pc, lr
+;
+; CHECK-BE-LABEL: test_129_15_0:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldr r1, [r0, #12]
+; CHECK-BE-NEXT:    ldrb r0, [r0, #16]
+; CHECK-BE-NEXT:    orr r0, r0, r1, lsl #8
+; CHECK-BE-NEXT:    mov r1, #255
+; CHECK-BE-NEXT:    orr r1, r1, #32512
+; CHECK-BE-NEXT:    ands r0, r0, r1
+; CHECK-BE-NEXT:    movne r0, #1
+; CHECK-BE-NEXT:    mov pc, lr
+  %a = load i129, ptr %y
+  %b = and i129 %a, u0x7fff
+  %cmp = icmp ne i129 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_126_20_4(ptr %y) {
+; CHECK-LE-LABEL: test_126_20_4:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldr r0, [r0]
+; CHECK-LE-NEXT:    mvn r1, #15
+; CHECK-LE-NEXT:    sub r1, r1, #-16777216
+; CHECK-LE-NEXT:    ands r0, r0, r1
+; CHECK-LE-NEXT:    movne r0, #1
+; CHECK-LE-NEXT:    mov pc, lr
+;
+; CHECK-BE-LABEL: test_126_20_4:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldr r0, [r0, #12]
+; CHECK-BE-NEXT:    mvn r1, #15
+; CHECK-BE-NEXT:    sub r1, r1, #-16777216
+; CHECK-BE-NEXT:    ands r0, r0, r1
+; CHECK-BE-NEXT:    movne r0, #1
+; CHECK-BE-NEXT:    mov pc, lr
+  %a = load i126, ptr %y
+  %b = and i126 %a, u0xfffff0
+  %cmp = icmp ne i126 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_33_8_0(ptr %y) {
+; CHECK-LE-LABEL: test_33_8_0:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldrb r0, [r0]
+; CHECK-LE-NEXT:    cmp r0, #0
+; CHECK-LE-NEXT:    movne r0, #1
+; CHECK-LE-NEXT:    mov pc, lr
+;
+; CHECK-BE-LABEL: test_33_8_0:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldrb r0, [r0, #4]
+; CHECK-BE-NEXT:    cmp r0, #0
+; CHECK-BE-NEXT:    movne r0, #1
+; CHECK-BE-NEXT:    mov pc, lr
+  %a = load i33, ptr %y
+  %b = and i33 %a, u0xff
+  %cmp = icmp ne i33 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_33_1_32(ptr %y) {
+; CHECK-LE-LABEL: test_33_1_32:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldrb r0, [r0, #4]
+; CHECK-LE-NEXT:    mov pc, lr
+;
+; CHECK-BE-LABEL: test_33_1_32:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldr r0, [r0]
+; CHECK-BE-NEXT:    lsr r0, r0, #24
+; CHECK-BE-NEXT:    mov pc, lr
+  %a = load i33, ptr %y
+  %b = and i33 %a, u0x100000000
+  %cmp = icmp ne i33 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_33_1_31(ptr %y) {
+; CHECK-LE-LABEL: test_33_1_31:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldrb r0, [r0, #3]
+; CHECK-LE-NEXT:    lsr r0, r0, #7
+; CHECK-LE-NEXT:    mov pc, lr
+;
+; CHECK-BE-LABEL: test_33_1_31:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldrb r0, [r0, #1]
+; CHECK-BE-NEXT:    lsr r0, r0, #7
+; CHECK-BE-NEXT:    mov pc, lr
+  %a = load i33, ptr %y
+  %b = and i33 %a, u0x80000000
+  %cmp = icmp ne i33 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_33_1_0(ptr %y) {
+; CHECK-LE-LABEL: test_33_1_0:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldrb r0, [r0]
+; CHECK-LE-NEXT:    and r0, r0, #1
+; CHECK-LE-NEXT:    mov pc, lr
+;
+; CHECK-BE-LABEL: test_33_1_0:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldrb r0, [r0, #4]
+; CHECK-BE-NEXT:    and r0, r0, #1
+; CHECK-BE-NEXT:    mov pc, lr
+  %a = load i33, ptr %y
+  %b = and i33 %a, u0x1
+  %cmp = icmp ne i33 %b, 0
+  ret i1 %cmp
+}
+
+;--------------------------------------------------------------------------
+; Test byte-sized types.
+;--------------------------------------------------------------------------
+
+
+define i1 @test_128_20_4(ptr %y) {
+; CHECK-LE-LABEL: test_128_20_4:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldr r0, [r0]
+; CHECK-LE-NEXT:    mvn r1, #15
+; CHECK-LE-NEXT:    sub r1, r1, #-16777216
+; CHECK-LE-NEXT:    ands r0, r0, r1
+; CHECK-LE-NEXT:    movne r0, #1
+; CHECK-LE-NEXT:    mov pc, lr
+;
+; CHECK-BE-LABEL: test_128_20_4:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldr r0, [r0, #12]
+; CHECK-BE-NEXT:    mvn r1, #15
+; CHECK-BE-NEXT:    sub r1, r1, #-16777216
+; CHECK-BE-NEXT:    ands r0, r0, r1
+; CHECK-BE-NEXT:    movne r0, #1
+; CHECK-BE-NEXT:    mov pc, lr
+  %a = load i128, ptr %y
+  %b = and i128 %a, u0xfffff0
+  %cmp = icmp ne i128 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_48_16_0(ptr %y) {
+; CHECK-LE-LABEL: test_48_16_0:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldrh r0, [r0]
+; CHECK-LE-NEXT:    cmp r0, #0
+; CHECK-LE-NEXT:    movne r0, #1
+; CHECK-LE-NEXT:    mov pc, lr
+;
+; CHECK-BE-LABEL: test_48_16_0:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldrh r0, [r0, #4]
+; CHECK-BE-NEXT:    cmp r0, #0
+; CHECK-BE-NEXT:    movne r0, #1
+; CHECK-BE-NEXT:    mov pc, lr
+  %a = load i48, ptr %y
+  %b = and i48 %a, u0xffff
+  %cmp = icmp ne i48 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_48_16_8(ptr %y) {
+; CHECK-LE-LABEL: test_48_16_8:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldrh r0, [r0, #1]
+; CHECK-LE-NEXT:    lsls r0, r0, #8
+; CHECK-LE-NEXT:    movne r0, #1
+; CHECK-LE-NEXT:    mov pc, lr
+;
+; CHECK-BE-LABEL: test_48_16_8:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldrh r0, [r0, #3]
+; CHECK-BE-NEXT:    cmp r0, #0
+; CHECK-BE-NEXT:    movne r0, #1
+; CHECK-BE-NEXT:    mov pc, lr
+  %a = load i48, ptr %y
+  %b = and i48 %a, u0xffff00
+  %cmp = icmp ne i48 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_48_16_16(ptr %y) {
+; CHECK-LE-LABEL: test_48_16_16:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldrh r0, [r0, #2]
+; CHECK-LE-NEXT:    cmp r0, #0
+; CHECK-LE-NEXT:    movne r0, #1
+; CHECK-LE-NEXT:    mov pc, lr
+;
+; CHECK-BE-LABEL: test_48_16_16:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldrh r0, [r0, #2]
+; CHECK-BE-NEXT:    cmp r0, #0
+; CHECK-BE-NEXT:    movne r0, #1
+; CHECK-BE-NEXT:    mov pc, lr
+  %a = load i48, ptr %y
+  %b = and i48 %a, u0xffff0000
+  %cmp = icmp ne i48 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_48_16_32(ptr %y) {
+; CHECK-LE-LABEL: test_48_16_32:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldrh r0, [r0, #4]
+; CHECK-LE-NEXT:    cmp r0, #0
+; CHECK-LE-NEXT:    movne r0, #1
+; CHECK-LE-NEXT:    mov pc, lr
+;
+; CHECK-BE-LABEL: test_48_16_32:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldrh r0, [r0]
+; CHECK-BE-NEXT:    cmp r0, #0
+; CHECK-BE-NEXT:    movne r0, #1
+; CHECK-BE-NEXT:    mov pc, lr
+  %a = load i48, ptr %y
+  %b = and i48 %a, u0xffff00000000
+  %cmp = icmp ne i48 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_48_17_0(ptr %y) {
+; CHECK-LE-LABEL: test_48_17_0:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldr r0, [r0]
+; CHECK-LE-NEXT:    ldr r1, .LCPI11_0
+; CHECK-LE-NEXT:    ands r0, r0, r1
+; CHECK-LE-NEXT:    movne r0, #1
+; CHECK-LE-NEXT:    mov pc, lr
+; CHECK-LE-NEXT:    .p2align 2
+; CHECK-LE-NEXT:  @ %bb.1:
+; CHECK-LE-NEXT:  .LCPI11_0:
+; CHECK-LE-NEXT:    .long 131071 @ 0x1ffff
+;
+; CHECK-BE-LABEL: test_48_17_0:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldr r1, [r0]
+; CHECK-BE-NEXT:    ldrh r0, [r0, #4]
+; CHECK-BE-NEXT:    orr r0, r0, r1, lsl #16
+; CHECK-BE-NEXT:    ldr r1, .LCPI11_0
+; CHECK-BE-NEXT:    ands r0, r0, r1
+; CHECK-BE-NEXT:    movne r0, #1
+; CHECK-BE-NEXT:    mov pc, lr
+; CHECK-BE-NEXT:    .p2align 2
+; CHECK-BE-NEXT:  @ %bb.1:
+; CHECK-BE-NEXT:  .LCPI11_0:
+; CHECK-BE-NEXT:    .long 131071 @ 0x1ffff
+  %a = load i48, ptr %y
+  %b = and i48 %a, u0x1ffff
+  %cmp = icmp ne i48 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_40_16_0(ptr %y) {
+; CHECK-LE-LABEL: test_40_16_0:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldrh r0, [r0]
+; CHECK-LE-NEXT:    cmp r0, #0
+; CHECK-LE-NEXT:    movne r0, #1
+; CHECK-LE-NEXT:    mov pc, lr
+;
+; CHECK-BE-LABEL: test_40_16_0:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldrh r0, [r0, #3]
+; CHECK-BE-NEXT:    cmp r0, #0
+; CHECK-BE-NEXT:    movne r0, #1
+; CHECK-BE-NEXT:    mov pc, lr
+  %a = load i40, ptr %y
+  %b = and i40 %a, u0xffff
+  %cmp = icmp ne i40 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_40_1_32(ptr %y) {
+; CHECK-LE-LABEL: test_40_1_32:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldrb r0, [r0, #4]
+; CHECK-LE-NEXT:    and r0, r0, #1
+; CHECK-LE-NEXT:    mov pc, lr
+;
+; CHECK-BE-LABEL: test_40_1_32:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldrb r0, [r0]
+; CHECK-BE-NEXT:    and r0, r0, #1
+; CHECK-BE-NEXT:    mov pc, lr
+  %a = load i40, ptr %y
+  %b = and i40 %a, u0x100000000
+  %cmp = icmp ne i40 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_24_16_0(ptr %y) {
+; CHECK-LE-LABEL: test_24_16_0:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldrh r0, [r0]
+; CHECK-LE-NEXT:    cmp r0, #0
+; CHECK-LE-NEXT:    movne r0, #1
+; CHECK-LE-NEXT:    mov pc, lr
+;
+; CHECK-BE-LABEL: test_24_16_0:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldrh r0, [r0, #1]
+; CHECK-BE-NEXT:    cmp r0, #0
+; CHECK-BE-NEXT:    movne r0, #1
+; CHECK-BE-NEXT:    mov pc, lr
+  %a = load i24, ptr %y
+  %b = and i24 %a, u0xffff
+  %cmp = icmp ne i24 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_24_8_8(ptr %y) {
+; CHECK-LE-LABEL: test_24_8_8:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldrb r0, [r0, #1]
+; CHECK-LE-NEXT:    lsls r0, r0, #8
+; CHECK-LE-NEXT:    movne r0, #1
+; CHECK-LE-NEXT:    mov pc, lr
+;
+; CHECK-BE-LABEL: test_24_8_8:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldrb r0, [r0, #1]
+; CHECK-BE-NEXT:    lsls r0, r0, #8
+; CHECK-BE-NEXT:    movne r0, #1
+; CHECK-BE-NEXT:    mov pc, lr
+  %a = load i24, ptr %y
+  %b = and i24 %a, u0xff00
+  %cmp = icmp ne i24 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_24_8_12(ptr %y) {
+; CHECK-LE-LABEL: test_24_8_12:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldrb r1, [r0, #2]
+; CHECK-LE-NEXT:    ldrh r0, [r0]
+; CHECK-LE-NEXT:    orr r0, r0, r1, lsl #16
+; CHECK-LE-NEXT:    ands r0, r0, #1044480
+; CHECK-LE-NEXT:    movne r0, #1
+; CHECK-LE-NEXT:    mov pc, lr
+;
+; CHECK-BE-LABEL: test_24_8_12:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldrh r0, [r0]
+; CHECK-BE-NEXT:    mov r1, #1044480
+; CHECK-BE-NEXT:    ands r0, r1, r0, lsl #8
+; CHECK-BE-NEXT:    movne r0, #1
+; CHECK-BE-NEXT:    mov pc, lr
+  %a = load i24, ptr %y
+  %b = and i24 %a, u0xff000
+  %cmp = icmp ne i24 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_24_8_16(ptr %y) {
+; CHECK-LE-LABEL: test_24_8_16:
+; CHECK-LE:       @ %bb.0:
+; CHECK-LE-NEXT:    ldrb r0, [r0, #2]
+; CHECK-LE-NEXT:    lsls r0, r0, #16
+; CHECK-LE-NEXT:    movne r0, #1
+; CHECK-LE-NEXT:    mov pc, lr
+;
+; CHECK-BE-LABEL: test_24_8_16:
+; CHECK-BE:       @ %bb.0:
+; CHECK-BE-NEXT:    ldrb r0, [r0]
+; CHECK-BE-NEXT:    lsls r0, r0, #16
+; CHECK-BE-NEXT:    movne r0, #1
+; CHECK-BE-NEXT:    mov pc, lr
+  %a = load i24, ptr %y
+  %b = and i24 %a, u0xff0000
+  %cmp = icmp ne i24 %b, 0
+  ret i1 %cmp
+}
diff --git a/llvm/test/CodeGen/PowerPC/simplifysetcc_narrow_load.ll b/llvm/test/CodeGen/PowerPC/simplifysetcc_narrow_load.ll
new file mode 100644
index 00000000000000..49b8e2bc2f7b48
--- /dev/null
+++ b/llvm/test/CodeGen/PowerPC/simplifysetcc_narrow_load.ll
@@ -0,0 +1,382 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -O1 -mtriple ppc32le -o - %s | FileCheck --check-prefix CHECK-LE %s
+; RUN: llc -O1 -mtriple ppc32 -o - %s | FileCheck --check-prefix CHECK-BE %s
+
+; A collection of regression tests to verify the load-narrowing part of
+; TargetLowering::SimplifySetCC (and/or other similar rewrites such as
+; combining AND+LOAD into ZEXTLOAD).
+
+
+;--------------------------------------------------------------------------
+; Test non byte-sized types.
+;
+; As long as LLVM IR isn't defining where the padding goes we can't really
+; optimize these (without adding a target lowering hook that can inform
+; ISel about which bits are padding).
+; --------------------------------------------------------------------------
+
+define i1 @test_129_15_0(ptr %y) {
+; CHECK-LE-LABEL: test_129_15_0:
+; CHECK-LE:       # %bb.0:
+; CHECK-LE-NEXT:    lhz 3, 0(3)
+; CHECK-LE-NEXT:    clrlwi 3, 3, 17
+; CHECK-LE-NEXT:    addic 4, 3, -1
+; CHECK-LE-NEXT:    subfe 3, 4, 3
+; CHECK-LE-NEXT:    blr
+;
+; CHECK-BE-LABEL: test_129_15_0:
+; CHECK-BE:       # %bb.0:
+; CHECK-BE-NEXT:    lhz 3, 15(3)
+; CHECK-BE-NEXT:    clrlwi 3, 3, 17
+; CHECK-BE-NEXT:    addic 4, 3, -1
+; CHECK-BE-NEXT:    subfe 3, 4, 3
+; CHECK-BE-NEXT:    blr
+  %a = load i129, ptr %y
+  %b = and i129 %a, u0x7fff
+  %cmp = icmp ne i129 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_126_20_4(ptr %y) {
+; CHECK-LE-LABEL: test_126_20_4:
+; CHECK-LE:       # %bb.0:
+; CHECK-LE-NEXT:    lwz 3, 0(3)
+; CHECK-LE-NEXT:    rlwinm 3, 3, 0, 8, 27
+; CHECK-LE-NEXT:    addic 4, 3, -1
+; CHECK-LE-NEXT:    subfe 3, 4, 3
+; CHECK-LE-NEXT:    blr
+;
+; CHECK-BE-LABEL: test_126_20_4:
+; CHECK-BE:       # %bb.0:
+; CHECK-BE-NEXT:    lwz 3, 12(3)
+; CHECK-BE-NEXT:    rlwinm 3, 3, 0, 8, 27
+; CHECK-BE-NEXT:    addic 4, 3, -1
+; CHECK-BE-NEXT:    subfe 3, 4, 3
+; CHECK-BE-NEXT:    blr
+  %a = load i126, ptr %y
+  %b = and i126 %a, u0xfffff0
+  %cmp = icmp ne i126 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_33_8_0(ptr %y) {
+; CHECK-LE-LABEL: test_33_8_0:
+; CHECK-LE:       # %bb.0:
+; CHECK-LE-NEXT:    lbz 3, 0(3)
+; CHECK-LE-NEXT:    addic 4, 3, -1
+; CHECK-LE-NEXT:    subfe 3, 4, 3
+; CHECK-LE-NEXT:    blr
+;
+; CHECK-BE-LABEL: test_33_8_0:
+; CHECK-BE:       # %bb.0:
+; CHECK-BE-NEXT:    lbz 3, 4(3)
+; CHECK-BE-NEXT:    addic 4, 3, -1
+; CHECK-BE-NEXT:    subfe 3, 4, 3
+; CHECK-BE-NEXT:    blr
+  %a = load i33, ptr %y
+  %b = and i33 %a, u0xff
+  %cmp = icmp ne i33 %b, 0
+  ret i1 %cmp
+}
+
+define i1 @test_33_1_32(ptr %y) {
+; CHECK-...
[truncated]


github-actions bot commented Apr 4, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Contributor

@AZero13 AZero13 left a comment


Good start, but there appear to be a couple of regressions for some of the tests here.

@@ -27,7 +27,9 @@ define i1 @test_129_15_0(ptr %y) {
;
; CHECK-BE-LABEL: test_129_15_0:
; CHECK-BE: @ %bb.0:
; CHECK-BE-NEXT: ldrh r0, [r0, #14]
; CHECK-BE-NEXT: ldr r1, [r0, #12]
Collaborator Author


This is not a regression. It is supposed to be fixing a miscompile.

The 15 bits masked out by the AND are at [r0, #15] and [r0, #16]. I think the optimization to narrow to a 16-bit load is blocked because loading 16 bits starting at [r0, #15] would be an unaligned load.

@@ -189,7 +191,7 @@ define i1 @test_48_16_8(ptr %y) {
; CHECK-LE-LABEL: test_48_16_8:
; CHECK-LE: @ %bb.0:
; CHECK-LE-NEXT: ldrh r0, [r0, #1]
; CHECK-LE-NEXT: cmp r0, #0
; CHECK-LE-NEXT: lsls r0, r0, #8
Collaborator Author


I don't know whether lsls or cmp is better here for ARM.

What happens is that after load narrowing we get:

          t17: i32,ch = load<(load (s32) from %ir.y, align 8)> t0, t2, undef:i32
        t19: i32 = and t17, Constant:i32<16776960>
      t21: i1 = setcc t19, Constant:i32<0>, setne:ch

and then the DAG combiner triggers on the AND and changes it into

            t23: i32 = add nuw t2, Constant:i32<1>
          t24: i32,ch = load<(load (s16) from %ir.y + 1, align 1, basealign 8), zext from i16> t0, t23, undef:i32
        t26: i32 = shl t24, Constant:i32<8>
      t21: i1 = setcc t26, Constant:i32<0>, setne:ch

I think the optimization in this patch first avoids introducing a misaligned 16-bit load from %ir.y + 1 and instead uses a 32-bit load. But then some other DAG combine is narrowing the load a second time, resulting in the unaligned load, but also introducing an SHL that isn't needed for the comparison with 0.

Makes me wonder if the 16-bit load with align 1 is a bad thing here?

It also seems like we lack some optimization that removes the redundant SHL :-(
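
Something along these lines could perhaps remove that shift (a rough,
untested sketch of a possible setcc combine; not part of this patch, and
the variable names just follow the usual N0/N1/Cond/dl/VT convention):

// Sketch: when comparing against zero for equality, a left shift whose
// shifted-out bits are known zero does not change whether the value is
// zero, so the shift can be looked through.
if ((Cond == ISD::SETEQ || Cond == ISD::SETNE) && isNullConstant(N1) &&
    N0.getOpcode() == ISD::SHL && N0.hasOneUse()) {
  if (auto *ShAmtC = dyn_cast<ConstantSDNode>(N0.getOperand(1))) {
    unsigned BitWidth = N0.getValueSizeInBits();
    uint64_t ShAmt = ShAmtC->getZExtValue();
    if (ShAmt < BitWidth &&
        DAG.MaskedValueIsZero(N0.getOperand(0),
                              APInt::getHighBitsSet(BitWidth, ShAmt)))
      return DAG.getSetCC(dl, VT, N0.getOperand(0), N1, Cond);
  }
}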

Collaborator Author


Maybe the difference is that this patch checks the Fast result from the allowsMemoryAccess check, while DAGCombiner::isLegalNarrowLdSt only checks if the access is legal (not whether the narrowed, not naturally aligned, access is also considered Fast).
So the new optimization in this patch is a bit more strict, and then we suffer from DAGCombiner::reduceLoadWidth introducing an SHL that isn't really needed.

Collaborator


Have you looked into this any further?

Collaborator Author


Have you looked into this any further?

See 5e9f960#r1552445909

After that I also made sure the test is run both with -mtriple arm and -mtriple armv7 to show the difference in ARMTargetLowering::allowsMisalignedMemoryAccesses depending on the actual triple.
My new code adheres to the "Fast" settings in ARMTargetLowering, while DAGCombiner::isLegalNarrowLdSt doesn't. But maybe we want to do these transforms regardless of whether the unaligned accesses are considered "Fast"? It felt wrong to me to make it that aggressive.

I'm also not sure why this code in DAGCombiner.cpp

  // Ensure that this isn't going to produce an unsupported memory access.
  if (ShAmt) {
    assert(ShAmt % 8 == 0 && "ShAmt is byte offset");
    const unsigned ByteShAmt = ShAmt / 8;
    const Align LDSTAlign = LDST->getAlign();
    const Align NarrowAlign = commonAlignment(LDSTAlign, ByteShAmt);
    if (!TLI.allowsMemoryAccess(*DAG.getContext(), DAG.getDataLayout(), MemVT,
                                LDST->getAddressSpace(), NarrowAlign,
                                LDST->getMemOperand()->getFlags()))
      return false;
  }

skips the "Fast" result that allowsMemoryAccess would provide if its extra out-argument was used.

I kind of think the new behavior is more correct in some sense (taking the "Fast" part of the TLI hook into account). But then I guess we want to do that in DAGCombiner::isLegalNarrowLdSt as well (potentially getting even more diffs that look like "regressions", but that in fact might be optimizations in cases where unaligned accesses are costly).
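
For reference, making that DAGCombiner check take "Fast" into account as
well would be something like this (sketch only, mirroring what this patch
does in SimplifySetCC; not proposed as part of this PR):

  // Ensure that this isn't going to produce an unsupported (or slow)
  // memory access.
  if (ShAmt) {
    assert(ShAmt % 8 == 0 && "ShAmt is byte offset");
    const unsigned ByteShAmt = ShAmt / 8;
    const Align LDSTAlign = LDST->getAlign();
    const Align NarrowAlign = commonAlignment(LDSTAlign, ByteShAmt);
    unsigned IsFast = 0;
    if (!TLI.allowsMemoryAccess(*DAG.getContext(), DAG.getDataLayout(), MemVT,
                                LDST->getAddressSpace(), NarrowAlign,
                                LDST->getMemOperand()->getFlags(), &IsFast) ||
        !IsFast)
      return false;
  }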

@bjope bjope requested a review from AZero13 April 5, 2024 08:23
@RKSimon RKSimon requested review from RKSimon and phoebewang and removed request for AZero13 April 5, 2024 10:02
Collaborator Author

bjope commented Apr 11, 2024

@RKSimon, I see that you added yourself as a reviewer for this one. Did you take a look already?

Since this is currently miscompiling for big-endian it would be nice to find some way forward. Maybe it would have been better to split it up into a quick fix and a follow-up with new optimizations (but sometimes my miscompile fixes get stuck because they cause regressions, so this time I tried to avoid the regressions directly).
I also happened to notice that there seems to be some partial overlap between TargetLowering::SimplifySetCC and some of the reduceLoadWidth optimizations in DAGCombiner. Not sure if there is a long-term plan related to that. At this point I was mostly interested in getting rid of the miscompiles.

Btw, I think the miscompiles only happen when there are loads/stores with sizes that aren't a multiple of the byte size, which probably makes them unlikely for most targets. Downstream we do get such loads/stores, especially with the introduction of _BitInt. The problem was detected when using _BitInt in combination with a big-endian target.
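
As a hypothetical illustration (not taken from the actual bug report) of the kind of _BitInt source where this matters, something like the following ends up as the load + and + icmp-ne-0 pattern that the load narrowing operates on (compare with test_33_8_0 in the new tests); the exact IR depends on the frontend and target:

// Hypothetical example; Clang accepts _BitInt in C23 and, as an extension,
// in C++. Loading a non byte-sized value and testing a few masked bits is
// the pattern SimplifySetCC tries to narrow.
bool any_low_byte_set(unsigned _BitInt(33) *p) {
  return (*p & 0xff) != 0;
}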

Collaborator

@RKSimon RKSimon left a comment


No objections from me, although nailing down the ARM regression would be nice.

Collaborator

@RKSimon RKSimon left a comment


LGTM

bjope added a commit that referenced this pull request Apr 12, 2024
…ing::SimplifySetCC

These test cases show some miscompiles for big-endian when dealing
with non byte-sized loads. One part of the problem is that LLVM IR
isn't really telling where the padding goes for non byte-sized
loads/stores. So currently TargetLowering::SimplifySetCC can't assume
anything about it. But the implementation also does not consider that
the TypeStoreSize could be larger than the TypeSize, resulting in
the offset calculation being wrong for big-endian.

Pre-commit for #87646
@bjope bjope merged commit 33e6b48 into llvm:main Apr 12, 2024
Contributor

AZero13 commented Apr 12, 2024

For the record, the ARM regression is correct codegen, right?

Collaborator Author

bjope commented Apr 12, 2024

For the record, the ARM regression is correct codegen, right?

Could you be more specific about which diff you are talking about?

Contributor

AZero13 commented Apr 12, 2024

For the record, the ARM regression is correct codegen, right?

Could you be more specific about which diff you are talking about?

The lsls r0, r0, #8 one mentioned previously.

Collaborator Author

bjope commented Apr 12, 2024

For the record, the ARM regression is correct codegen, right?

Could you be more specific about which diff you are talking about?

The lsls r0, r0, #8 one mentioned previously.

I described a bit regarding why that happens here: #87646 (comment)

The test case shows that there is a difference between -mtriple=arm and -mtriple=armv7 regarding whether they consider misaligned accesses to be "Fast" or not. The improved TargetLowering::SimplifySetCC now takes into account that misaligned accesses are considered slow for -mtriple=arm and avoids introducing the narrowed load for that triple. But then other optimizations trigger later in DAGCombine that do not check whether the misaligned access is considered "Fast". So now there is a different DAG combine that introduces the misaligned narrowed load. That DAG combine also introduces an SHL operation that isn't really needed.
I don't know enough about ARM to say whether lsls is much worse than the cmp. But you say that it is a regression, so in what way? Is it a cycle regression or a code-size regression? Maybe it is even worse that the load narrowing results in a possibly misaligned load, compared to using lsls instead of cmp?

Contributor

AZero13 commented Apr 12, 2024

I'm just asking if this is a miscompile or not, just to make sure.

Collaborator Author

bjope commented Apr 12, 2024

I'm just asking if this is a miscompile or not just to make sure

No, that particular diff is not due to any miscompile.
The DAG used to be something like this:

    t24: i32,ch = load<(load (s16) from %ir.y + 1, align 1, basealign 8), zext from i16> t0, t23, undef:i32
    t21: i1 = setcc t24, Constant:i32<0>, setne:ch

and now we get

    t24: i32,ch = load<(load (s16) from %ir.y + 1, align 1, basealign 8), zext from i16> t0, t23, undef:i32
    t26: i32 = shl t24, Constant:i32<8>
    t21: i1 = setcc t26, Constant:i32<0>, setne:ch

So it just shifts out bits that are known to be zero before the compare. Without deeper knowledge about ARM, I'd expect the two to be equivalent here.
