[SIFoldOperands] Folding immediate into a copy invalidates candidates in the fold list #148187

jmmartinez · 2025-07-11T10:24:32Z

When we fold a constant operand of a binary operator into a copy, both operands of the operator may be in the fold-list.

This change works around this by removing the folds of the use converted into copy from the list.

The pattern that triggers the issue is:

%22:sreg_32 = S_MOV_B32 0
%23:sreg_64 = REG_SEQUENCE %22, %subreg.sub0, %22, %subreg.sub1
%2:sreg_64 = COPY %23
%26:sreg_32 = COPY %2.sub0
%27:sreg_32 = COPY %2.sub1
%28:sreg_32 = S_OR_B32 killed %26, killed %27, implicit-def dead $scc

%28:sreg_32 = S_OR_B32 is folded into a constant, and the fold attempt for the second operand triggers a segfault.

Things missing:

I'm working on reducing the test case into MIR, but for some reason I cannot trigger it with the MIR obtained with --stop-before=si-fold-operands.
The final generated code misses some other optimizations. We have s_mov_b32 s0, 0; s_cmp_lg_u32 s0, 0; and the result of this is still unused.

Fixes SWDEV-542372

llvmbot · 2025-07-11T10:25:03Z

@llvm/pr-subscribers-backend-amdgpu

Author: Juan Manuel Martinez Caamaño (jmmartinez)

Changes

When we fold a constant operand of a binary operator into a copy, both operands of the operator may be in the fold-list.

This change works around this by removing the folds of the use converted into copy from the list.

The pattern that triggers the issue is:

%22:sreg_32 = S_MOV_B32 0
%23:sreg_64 = REG_SEQUENCE %22, %subreg.sub0, %22, %subreg.sub1
%2:sreg_64 = COPY %23
%26:sreg_32 = COPY %2.sub0
%27:sreg_32 = COPY %2.sub1
%28:sreg_32 = S_OR_B32 killed %26, killed %27, implicit-def dead $scc

%28:sreg_32 = S_OR_B32 is folded into a constant, and the fold attempt for the second operand triggers a segfault.

Things missing:

I'm working on reducing the test case into MIR, but for some reason I cannot trigger it with the MIR obtained with --stop-before=si-fold-operands.
The final generated code misses some other optimizations. We have s_mov_b32 s0, 0; s_cmp_lg_u32 s0, 0; and the result of this is still unused.

Fixes SWDEV-542372

Full diff: https://github.com/llvm/llvm-project/pull/148187.diff

2 Files Affected:

(modified) llvm/lib/Target/AMDGPU/SIFoldOperands.cpp (+8-2)
(added) llvm/test/CodeGen/AMDGPU/si-fold-operands-swdev-542372.ll (+60)

diff --git a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
index 0ed06c37507af..2675e4e9d02e4 100644
--- a/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
+++ b/llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
@@ -1761,7 +1761,9 @@ bool SIFoldOperandsImpl::foldInstOperand(MachineInstr &MI,
   for (MachineInstr *Copy : CopiesToReplace)
     Copy->addImplicitDefUseOperands(*MF);
 
-  for (FoldCandidate &Fold : FoldList) {
+  for (auto FoldIt = FoldList.begin(), End = FoldList.end(); FoldIt != End;
+       ++FoldIt) {
+    FoldCandidate &Fold = *FoldIt;
     assert(!Fold.isReg() || Fold.Def.OpToFold);
     if (Fold.isReg() && Fold.getReg().isVirtual()) {
       Register Reg = Fold.getReg();
@@ -1785,9 +1787,13 @@ bool SIFoldOperandsImpl::foldInstOperand(MachineInstr &MI,
 
       if (Fold.isImm() && tryConstantFoldOp(Fold.UseMI)) {
         LLVM_DEBUG(dbgs() << "Constant folded " << *Fold.UseMI);
+        // The instruction was folded into a copy, we have to skip any other
+        // occurence of UseMI in the fold list
+        End = std::remove_if(FoldIt + 1, End, [&](const FoldCandidate &F) {
+          return F.UseMI == Fold.UseMI;
+        });
         Changed = true;
       }
-
     } else if (Fold.Commuted) {
       // Restoring instruction's original operand order if fold has failed.
       TII->commuteInstruction(*Fold.UseMI, false);
diff --git a/llvm/test/CodeGen/AMDGPU/si-fold-operands-swdev-542372.ll b/llvm/test/CodeGen/AMDGPU/si-fold-operands-swdev-542372.ll
new file mode 100644
index 0000000000000..2b6f6ba6a4de8
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/si-fold-operands-swdev-542372.ll
@@ -0,0 +1,60 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -O3 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a %s -o - | FileCheck %s
+
+define amdgpu_kernel void @kernel() {
+; CHECK-LABEL: kernel:
+; CHECK:       ; %bb.0: ; %entry
+; CHECK-NEXT:    v_mov_b32_e32 v0, 0
+; CHECK-NEXT:    ds_read_b64 v[2:3], v0
+; CHECK-NEXT:    ds_write_b32 v0, v0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(1)
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 2, v2
+; CHECK-NEXT:    s_cbranch_vccnz .LBB0_3
+; CHECK-NEXT:  ; %bb.1: ; %land.rhs49
+; CHECK-NEXT:    ds_read_b64 v[0:1], v0
+; CHECK-NEXT:    s_waitcnt lgkmcnt(0)
+; CHECK-NEXT:    v_cmp_ne_u32_e32 vcc, 1, v0
+; CHECK-NEXT:    s_cbranch_vccnz .LBB0_3
+; CHECK-NEXT:  ; %bb.2: ; %land.rhs57
+; CHECK-NEXT:    s_mov_b32 s0, 0
+; CHECK-NEXT:    s_mov_b32 s1, s0
+; CHECK-NEXT:    v_mov_b32_e32 v2, 0
+; CHECK-NEXT:    v_pk_mov_b32 v[0:1], s[0:1], s[0:1] op_sel:[0,1]
+; CHECK-NEXT:    s_cmp_lg_u32 s0, 0
+; CHECK-NEXT:    ds_write_b64 v2, v[0:1]
+; CHECK-NEXT:  .LBB0_3: ; %land.end59
+; CHECK-NEXT:    s_endpgm
+entry:
+  %0 = load <2 x i32>, ptr addrspace(3) null, align 8
+  %vecext.i55 = extractelement <2 x i32> %0, i64 0
+  %cmp3.i57 = icmp eq i32 %vecext.i55, 2
+  store i32 0, ptr addrspace(3) null, align 8
+  br i1 %cmp3.i57, label %land.rhs49, label %land.end59
+
+land.rhs49:                                       ; preds = %entry
+  %1 = load <2 x i32>, ptr addrspace(3) null, align 8
+  %vecext.i67 = extractelement <2 x i32> %1, i64 0
+  %cmp3.i69 = icmp eq i32 %vecext.i67, 1
+  br i1 %cmp3.i69, label %land.rhs57, label %land.end59
+
+land.rhs57:                                       ; preds = %land.rhs49
+  %rem.i.i.i = srem <2 x i32> %0, %1
+  %ref.tmp.sroa.0.0.vec.extract.i.i = extractelement <2 x i32> %rem.i.i.i, i64 0
+  store i32 %ref.tmp.sroa.0.0.vec.extract.i.i, ptr addrspace(3) null, align 8
+  store i32 %ref.tmp.sroa.0.0.vec.extract.i.i, ptr addrspace(3) getelementptr inbounds nuw (i8, ptr addrspace(3) null, i32 4), align 4
+  %2 = load <2 x i32>, ptr addrspace(3) null, align 8
+  %vecext.i.i.i = extractelement <2 x i32> %2, i64 0
+  %cmp3.i.i.i = icmp ne i32 %vecext.i.i.i, 0
+  %vecext.1.i.i.i = extractelement <2 x i32> %2, i64 1
+  %cmp3.1.i.i.i = icmp ne i32 %vecext.1.i.i.i, 0
+  %.not.i.i = select i1 %cmp3.i.i.i, i1 true, i1 %cmp3.1.i.i.i
+  br i1 %.not.i.i, label %land.end59, label %if.end.i
+
+if.end.i:                                         ; preds = %land.rhs57
+  %and.i.i.i = and <2 x i32> %2, splat (i32 1)
+  %ref.tmp.sroa.0.0.vec.extract.i20.i = extractelement <2 x i32> %and.i.i.i, i64 0
+  br label %land.end59
+
+land.end59:                                       ; preds = %if.end.i, %land.rhs57, %land.rhs49, %entry
+  ret void
+}

jmmartinez · 2025-07-11T10:26:06Z

I'm not convinced about this change. I'm mostly looking for advice. There seems to be an issue with the iteration order of this pass and how/which instructions get modified.

This reverts commit 80064b6. The patch triggers a crash when the folded use can have 2 operands in the fold list. See llvm#148187 for more info. SWDEV-542372

llvm/test/CodeGen/AMDGPU/si-fold-operands-swdev-542372.ll

arsenm · 2025-07-11T10:50:22Z

llvm/test/CodeGen/AMDGPU/si-fold-operands-swdev-542372.ll

@@ -0,0 +1,60 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -O3 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a %s -o - | FileCheck %s


Does this really need -O3

It seems. I cannot trigger the bug with O0/1/2. And I was not able on the mir.

arsenm · 2025-07-11T10:51:33Z

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp

@@ -1785,9 +1787,13 @@ bool SIFoldOperandsImpl::foldInstOperand(MachineInstr &MI,

      if (Fold.isImm() && tryConstantFoldOp(Fold.UseMI)) {
        LLVM_DEBUG(dbgs() << "Constant folded " << *Fold.UseMI);
+        // The instruction was folded into a copy, we have to skip any other
+        // occurence of UseMI in the fold list
+        End = std::remove_if(FoldIt + 1, End, [&](const FoldCandidate &F) {


We really need to rewrite this pass. The fold list is garbage. It would be a lot simpler to just track a map of foldable sources like peephole opt does

This reverts commit 80064b6. The patch triggers a crash when the folded use can have 2 operands in the fold list. See llvm#148187 for more info. SWDEV-542372

…)" This reverts commit 063b053.

… in the fold list Fixes SWDEV-542372

jmmartinez · 2025-07-15T14:48:02Z

Not needed anymore once #148205 lands

jmmartinez requested a review from arsenm July 11, 2025 10:24

jmmartinez self-assigned this Jul 11, 2025

llvmbot added the backend:AMDGPU label Jul 11, 2025

jmmartinez changed the title ~~[SIFoldOperands] Folding immediate into a copy invalidates candidates in the fold list~~ [WIP][SIFoldOperands] Folding immediate into a copy invalidates candidates in the fold list Jul 11, 2025

jmmartinez mentioned this pull request Jul 11, 2025

Revert "AMDGPU: Try constant fold after folding immediate (#141862)" #148197

Closed

jmmartinez force-pushed the swdev-542372 branch from dfca436 to 062d197 Compare July 11, 2025 10:36

jmmartinez requested a review from shiltian July 11, 2025 10:37

jmmartinez changed the title ~~[WIP][SIFoldOperands] Folding immediate into a copy invalidates candidates in the fold list~~ [SIFoldOperands] Folding immediate into a copy invalidates candidates in the fold list Jul 11, 2025

arsenm reviewed Jul 11, 2025

View reviewed changes

jmmartinez mentioned this pull request Jul 11, 2025

AMDGPU: Fix assert when multi operands to update after folding imm #148205

Merged

jmmartinez force-pushed the swdev-542372 branch from 062d197 to 1da5c51 Compare July 11, 2025 12:33

Pre-Commit test: SWDEV-542372: crash in si-fold-operands

e8f1125

jmmartinez added 3 commits July 11, 2025 15:18

Revert "AMDGPU: Try constant fold after folding immediate (llvm#141862)"

2f8a888

This reverts commit 80064b6. The patch triggers a crash when the folded use can have 2 operands in the fold list. See llvm#148187 for more info. SWDEV-542372

Reapply "AMDGPU: Try constant fold after folding immediate (llvm#141862…

b919f2e

…)" This reverts commit 063b053.

[SIFoldOperands] Folding immediate into a copy invalidates candidates…

6de3d1d

… in the fold list Fixes SWDEV-542372

jmmartinez force-pushed the swdev-542372 branch from 1da5c51 to 6de3d1d Compare July 11, 2025 13:26

jmmartinez requested a review from arsenm July 11, 2025 13:26

ronlieb self-requested a review July 11, 2025 16:14

jmmartinez closed this Jul 15, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SIFoldOperands] Folding immediate into a copy invalidates candidates in the fold list #148187

[SIFoldOperands] Folding immediate into a copy invalidates candidates in the fold list #148187

Uh oh!

jmmartinez commented Jul 11, 2025

Uh oh!

llvmbot commented Jul 11, 2025

Uh oh!

jmmartinez commented Jul 11, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

arsenm Jul 11, 2025

Uh oh!

jmmartinez Jul 11, 2025

Uh oh!

arsenm Jul 11, 2025

Uh oh!

jmmartinez commented Jul 15, 2025

Uh oh!

Uh oh!

		@@ -0,0 +1,60 @@
		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
		; RUN: llc -O3 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx90a %s -o - \| FileCheck %s

[SIFoldOperands] Folding immediate into a copy invalidates candidates in the fold list #148187

[SIFoldOperands] Folding immediate into a copy invalidates candidates in the fold list #148187

Uh oh!

Conversation

jmmartinez commented Jul 11, 2025

Uh oh!

llvmbot commented Jul 11, 2025

Uh oh!

jmmartinez commented Jul 11, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

arsenm Jul 11, 2025

Choose a reason for hiding this comment

Uh oh!

jmmartinez Jul 11, 2025

Choose a reason for hiding this comment

Uh oh!

arsenm Jul 11, 2025

Choose a reason for hiding this comment

Uh oh!

jmmartinez commented Jul 15, 2025

Uh oh!

Uh oh!