
[CodeGen] [AMDGPU] Attempt DAGCombine for fmul with select to ldexp #111109


Merged — 4 commits merged into llvm:main on Dec 9, 2024

Conversation

@vg0204 (Contributor) commented Oct 4, 2024

The materialization cost of a 32-bit non-inline constant for an fmul is relatively high; where possible, it is cheaper to combine the fmul into an ldexp instruction for specific scenarios (for types such as f64, f32, and f16). That is what is handled here:

The DAG combine applies to any pair of select values that are exact powers of 2:

```
fmul x, select(y, A, B)    -> ldexp (x, select i32 (y, a, b))
fmul x, select(y, -A, -B)  -> ldexp ((fneg x), select i32 (y, a, b))

where A = 2^a and B = 2^b; a and b are integers.
```

This DAG combine is handled separately in performFMulCombine (newly defined in SIISelLowering), which targets an fmul and fuses it with its select-type operand into an ldexp.

Fixes #104900.
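The identity behind this combine is easy to sanity-check outside the compiler. Below is a minimal standalone C++ sketch (names are illustrative, not the LLVM implementation; std::ldexp stands in for the ldexp instruction):

```cpp
#include <cassert>
#include <cmath>

// Reference form: fmul x, select(y, A, B), where A = 2^a and B = 2^b.
double fmulSelect(double x, bool y, double A, double B) {
  return x * (y ? A : B);
}

// Combined form: ldexp(x, select i32 (y, a, b)). std::ldexp scales x by
// 2^n via an exponent adjustment instead of materializing an FP constant.
double ldexpSelect(double x, bool y, int a, int b) {
  return std::ldexp(x, y ? a : b);
}

// Negated-constant form: fmul x, select(y, -A, -B)
// -> ldexp((fneg x), select i32 (y, a, b)).
double ldexpSelectNeg(double x, bool y, int a, int b) {
  return std::ldexp(-x, y ? a : b);
}
```

Because A and B are exact powers of two, the scaling is exact, so both forms produce identical results away from the exponent range limits.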

@llvmbot (Member) commented Oct 4, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Vikash Gupta (vg0204)

Changes

For f32/f16 this combine yields no improvement, but for f64 this specific fmul-with-select pattern is more costly to materialize than an ldexp, so the following DAG combine performs the rewrite:

fmul x, select(y, 2.0, 1.0) -> ldexp x, zext(i1 y)
fmul x, select(y, 0.5, 1.0) -> ldexp x, sext(i1 y)

Fixes #104900.
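The extension choice encodes the select directly as the exponent: zext(i1 y) yields 0 or 1 (multiply by 1.0 or 2.0), while sext(i1 y) yields 0 or -1 (multiply by 1.0 or 0.5). A small standalone check of the identity (hypothetical helper names; std::ldexp stands in for the ldexp instruction):

```cpp
#include <cassert>
#include <cmath>

// fmul x, (select y, 2.0, 1.0) -> ldexp(x, zext(i1 y)); zext gives 0 or 1.
double mulViaZext(double x, bool y) { return std::ldexp(x, y ? 1 : 0); }

// fmul x, (select y, 0.5, 1.0) -> ldexp(x, sext(i1 y)); sext gives 0 or -1.
double mulViaSext(double x, bool y) { return std::ldexp(x, y ? -1 : 0); }
```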


Full diff: https://github.com/llvm/llvm-project/pull/111109.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+55)
  • (modified) llvm/lib/Target/AMDGPU/SIISelLowering.h (+1)
  • (added) llvm/test/CodeGen/AMDGPU/combine-fmul-sel.ll (+280)
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 5e4cf705cc9e47..ec707fde24e20c 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -899,6 +899,7 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
                        ISD::FADD,
                        ISD::FSUB,
                        ISD::FDIV,
+                       ISD::FMUL,
                        ISD::FMINNUM,
                        ISD::FMAXNUM,
                        ISD::FMINNUM_IEEE,
@@ -14476,6 +14477,58 @@ SDValue SITargetLowering::performFDivCombine(SDNode *N,
   return SDValue();
 }
 
+SDValue SITargetLowering::performFMulCombine(SDNode *N,
+                                             DAGCombinerInfo &DCI) const {
+  SelectionDAG &DAG = DCI.DAG;
+  EVT VT = N->getValueType(0);
+
+  SDLoc SL(N);
+  SDValue LHS = N->getOperand(0);
+  SDValue RHS = N->getOperand(1);
+
+  // ldexp(x, zext(i1 y)) -> fmul x, (select y, 2.0, 1.0)
+  // ldexp(x, sext(i1 y)) -> fmul x, (select y, 0.5, 1.0)
+  //
+  // The ldexp folding above works fine for
+  // f16/f32, but for f64 it creates an f64 select, which
+  // is costly to materialize compared to an f64 ldexp,
+  // so here we undo the transform for f64 as follows:
+  //
+  // fmul x, (select y, 2.0, 1.0) -> ldexp(x, zext(i1 y))
+  // fmul x, (select y, 0.5, 1.0) -> ldexp(x, sext(i1 y))
+  // TODO: Handle vectors of f64.
+  if (VT == MVT::f64) {
+    if (RHS.hasOneUse() && RHS.getOpcode() == ISD::SELECT) {
+      const ConstantFPSDNode *TrueNode =
+          dyn_cast<ConstantFPSDNode>(RHS.getOperand(1));
+      const ConstantFPSDNode *FalseNode =
+          dyn_cast<ConstantFPSDNode>(RHS.getOperand(2));
+
+      if (!TrueNode || !FalseNode)
+        return SDValue();
+
+      const double TrueVal = TrueNode->getValueAPF().convertToDouble();
+      const double FalseVal = FalseNode->getValueAPF().convertToDouble();
+      unsigned ExtOp;
+
+      if (FalseVal == 1.0) {
+        if (TrueVal == 2.0)
+          ExtOp = ISD::ZERO_EXTEND;
+        else if (TrueVal == 0.5)
+          ExtOp = ISD::SIGN_EXTEND;
+        else
+          return SDValue();
+
+        SDValue ExtNode =
+            DAG.getNode(ExtOp, SL, MVT::i32, RHS.getOperand(0));
+        return DAG.getNode(ISD::FLDEXP, SL, MVT::f64, LHS, ExtNode);
+      }
+    }
+  }
+
+  return SDValue();
+}
+
 SDValue SITargetLowering::performFMACombine(SDNode *N,
                                             DAGCombinerInfo &DCI) const {
   SelectionDAG &DAG = DCI.DAG;
@@ -14765,6 +14818,8 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
     return performFSubCombine(N, DCI);
   case ISD::FDIV:
     return performFDivCombine(N, DCI);
+  case ISD::FMUL:
+    return performFMulCombine(N, DCI);
   case ISD::SETCC:
     return performSetCCCombine(N, DCI);
   case ISD::FMAXNUM:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h
index 6c3edf37945e24..1ead2e4fc916bb 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -218,6 +218,7 @@ class SITargetLowering final : public AMDGPUTargetLowering {
   SDValue performFAddCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performFSubCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performFDivCombine(SDNode *N, DAGCombinerInfo &DCI) const;
+  SDValue performFMulCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performFMACombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performSetCCCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performCvtF32UByteNCombine(SDNode *N, DAGCombinerInfo &DCI) const;
diff --git a/llvm/test/CodeGen/AMDGPU/combine-fmul-sel.ll b/llvm/test/CodeGen/AMDGPU/combine-fmul-sel.ll
new file mode 100644
index 00000000000000..0748aa0a0abec0
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/combine-fmul-sel.ll
@@ -0,0 +1,280 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+;RUN: llc < %s -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs | FileCheck -check-prefix=GFX9 %s
+;RUN: llc < %s -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1030 -verify-machineinstrs | FileCheck -check-prefix=GFX1030 %s
+;RUN: llc < %s -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 -verify-machineinstrs | FileCheck -check-prefix=GFX1100 %s
+
+define float @fmul_select_f32_test1(float %x, i1 %bool) {
+; GFX9-LABEL: fmul_select_f32_test1:
+; GFX9:       ; %bb.0:
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    v_and_b32_e32 v1, 1, v1
+; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v1
+; GFX9-NEXT:    v_cndmask_b32_e64 v1, 1.0, 2.0, vcc
+; GFX9-NEXT:    v_mul_f32_e32 v0, v0, v1
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1030-LABEL: fmul_select_f32_test1:
+; GFX1030:       ; %bb.0:
+; GFX1030-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1030-NEXT:    v_and_b32_e32 v1, 1, v1
+; GFX1030-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v1
+; GFX1030-NEXT:    v_cndmask_b32_e64 v1, 1.0, 2.0, vcc_lo
+; GFX1030-NEXT:    v_mul_f32_e32 v0, v0, v1
+; GFX1030-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1100-LABEL: fmul_select_f32_test1:
+; GFX1100:       ; %bb.0:
+; GFX1100-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1100-NEXT:    v_and_b32_e32 v1, 1, v1
+; GFX1100-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX1100-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v1
+; GFX1100-NEXT:    v_cndmask_b32_e64 v1, 1.0, 2.0, vcc_lo
+; GFX1100-NEXT:    v_mul_f32_e32 v0, v0, v1
+; GFX1100-NEXT:    s_setpc_b64 s[30:31]
+  %1 = select i1 %bool, float 2.000000e+00, float 1.000000e+00
+  %ldexp = fmul float %x, %1
+  ret float %ldexp
+}
+
+define float @fmul_select_f32_test2(float %x, i1 %bool) {
+; GFX9-LABEL: fmul_select_f32_test2:
+; GFX9:       ; %bb.0:
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    v_and_b32_e32 v1, 1, v1
+; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v1
+; GFX9-NEXT:    v_cndmask_b32_e64 v1, 1.0, 0.5, vcc
+; GFX9-NEXT:    v_mul_f32_e32 v0, v0, v1
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1030-LABEL: fmul_select_f32_test2:
+; GFX1030:       ; %bb.0:
+; GFX1030-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1030-NEXT:    v_and_b32_e32 v1, 1, v1
+; GFX1030-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v1
+; GFX1030-NEXT:    v_cndmask_b32_e64 v1, 1.0, 0.5, vcc_lo
+; GFX1030-NEXT:    v_mul_f32_e32 v0, v0, v1
+; GFX1030-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1100-LABEL: fmul_select_f32_test2:
+; GFX1100:       ; %bb.0:
+; GFX1100-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1100-NEXT:    v_and_b32_e32 v1, 1, v1
+; GFX1100-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX1100-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v1
+; GFX1100-NEXT:    v_cndmask_b32_e64 v1, 1.0, 0.5, vcc_lo
+; GFX1100-NEXT:    v_mul_f32_e32 v0, v0, v1
+; GFX1100-NEXT:    s_setpc_b64 s[30:31]
+  %1 = select i1 %bool, float 0.500000e+00, float 1.000000e+00
+  %ldexp = fmul float %x, %1
+  ret float %ldexp
+}
+
+define <2 x float> @fmul_select_v2f32_test1(<2 x float> %x, <2 x i1> %bool) {
+; GFX9-LABEL: fmul_select_v2f32_test1:
+; GFX9:       ; %bb.0:
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    v_and_b32_e32 v3, 1, v3
+; GFX9-NEXT:    v_and_b32_e32 v2, 1, v2
+; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v3
+; GFX9-NEXT:    v_cndmask_b32_e64 v3, 1.0, 2.0, vcc
+; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v2
+; GFX9-NEXT:    v_cndmask_b32_e64 v2, 1.0, 2.0, vcc
+; GFX9-NEXT:    v_mul_f32_e32 v0, v0, v2
+; GFX9-NEXT:    v_mul_f32_e32 v1, v1, v3
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1030-LABEL: fmul_select_v2f32_test1:
+; GFX1030:       ; %bb.0:
+; GFX1030-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1030-NEXT:    v_and_b32_e32 v2, 1, v2
+; GFX1030-NEXT:    v_and_b32_e32 v3, 1, v3
+; GFX1030-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v2
+; GFX1030-NEXT:    v_cndmask_b32_e64 v2, 1.0, 2.0, vcc_lo
+; GFX1030-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v3
+; GFX1030-NEXT:    v_mul_f32_e32 v0, v0, v2
+; GFX1030-NEXT:    v_cndmask_b32_e64 v3, 1.0, 2.0, vcc_lo
+; GFX1030-NEXT:    v_mul_f32_e32 v1, v1, v3
+; GFX1030-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1100-LABEL: fmul_select_v2f32_test1:
+; GFX1100:       ; %bb.0:
+; GFX1100-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1100-NEXT:    v_and_b32_e32 v2, 1, v2
+; GFX1100-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX1100-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v2
+; GFX1100-NEXT:    v_cndmask_b32_e64 v2, 1.0, 2.0, vcc_lo
+; GFX1100-NEXT:    v_dual_mul_f32 v0, v0, v2 :: v_dual_and_b32 v3, 1, v3
+; GFX1100-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX1100-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v3
+; GFX1100-NEXT:    v_cndmask_b32_e64 v3, 1.0, 2.0, vcc_lo
+; GFX1100-NEXT:    v_mul_f32_e32 v1, v1, v3
+; GFX1100-NEXT:    s_setpc_b64 s[30:31]
+  %1 = select <2 x i1> %bool, <2 x float> <float 2.000000e+00, float 2.000000e+00>, <2 x float> <float 1.000000e+00, float 1.000000e+00>
+  %ldexp = fmul <2 x float> %x, %1
+  ret <2 x float> %ldexp
+}
+
+define <2 x float> @fmul_select_v2f32_test2(<2 x float> %x, <2 x i1> %bool) {
+; GFX9-LABEL: fmul_select_v2f32_test2:
+; GFX9:       ; %bb.0:
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    v_and_b32_e32 v3, 1, v3
+; GFX9-NEXT:    v_and_b32_e32 v2, 1, v2
+; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v3
+; GFX9-NEXT:    v_cndmask_b32_e64 v3, 1.0, 0.5, vcc
+; GFX9-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v2
+; GFX9-NEXT:    v_cndmask_b32_e64 v2, 1.0, 0.5, vcc
+; GFX9-NEXT:    v_mul_f32_e32 v0, v0, v2
+; GFX9-NEXT:    v_mul_f32_e32 v1, v1, v3
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1030-LABEL: fmul_select_v2f32_test2:
+; GFX1030:       ; %bb.0:
+; GFX1030-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1030-NEXT:    v_and_b32_e32 v2, 1, v2
+; GFX1030-NEXT:    v_and_b32_e32 v3, 1, v3
+; GFX1030-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v2
+; GFX1030-NEXT:    v_cndmask_b32_e64 v2, 1.0, 0.5, vcc_lo
+; GFX1030-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v3
+; GFX1030-NEXT:    v_mul_f32_e32 v0, v0, v2
+; GFX1030-NEXT:    v_cndmask_b32_e64 v3, 1.0, 0.5, vcc_lo
+; GFX1030-NEXT:    v_mul_f32_e32 v1, v1, v3
+; GFX1030-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1100-LABEL: fmul_select_v2f32_test2:
+; GFX1100:       ; %bb.0:
+; GFX1100-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1100-NEXT:    v_and_b32_e32 v2, 1, v2
+; GFX1100-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX1100-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v2
+; GFX1100-NEXT:    v_cndmask_b32_e64 v2, 1.0, 0.5, vcc_lo
+; GFX1100-NEXT:    v_dual_mul_f32 v0, v0, v2 :: v_dual_and_b32 v3, 1, v3
+; GFX1100-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX1100-NEXT:    v_cmp_eq_u32_e32 vcc_lo, 1, v3
+; GFX1100-NEXT:    v_cndmask_b32_e64 v3, 1.0, 0.5, vcc_lo
+; GFX1100-NEXT:    v_mul_f32_e32 v1, v1, v3
+; GFX1100-NEXT:    s_setpc_b64 s[30:31]
+  %1 = select <2 x i1> %bool, <2 x float> <float 0.500000e+00, float 0.500000e+00>, <2 x float> <float 1.000000e+00, float 1.000000e+00>
+  %ldexp = fmul <2 x float> %x, %1
+  ret <2 x float> %ldexp
+}
+
+define double @fmul_select_f64_test1(double %x, i1 %bool) {
+; GFX9-LABEL: fmul_select_f64_test1:
+; GFX9:       ; %bb.0:
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    v_and_b32_e32 v2, 1, v2
+; GFX9-NEXT:    v_ldexp_f64 v[0:1], v[0:1], v2
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1030-LABEL: fmul_select_f64_test1:
+; GFX1030:       ; %bb.0:
+; GFX1030-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1030-NEXT:    v_and_b32_e32 v2, 1, v2
+; GFX1030-NEXT:    v_ldexp_f64 v[0:1], v[0:1], v2
+; GFX1030-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1100-LABEL: fmul_select_f64_test1:
+; GFX1100:       ; %bb.0:
+; GFX1100-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1100-NEXT:    v_and_b32_e32 v2, 1, v2
+; GFX1100-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1100-NEXT:    v_ldexp_f64 v[0:1], v[0:1], v2
+; GFX1100-NEXT:    s_setpc_b64 s[30:31]
+  %1 = select i1 %bool, double 2.000000e+00, double 1.000000e+00
+  %ldexp = fmul double %x, %1
+  ret double %ldexp
+}
+
+define double @fmul_select_f64_test2(double %x, i1 %bool) {
+; GFX9-LABEL: fmul_select_f64_test2:
+; GFX9:       ; %bb.0:
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    v_bfe_i32 v2, v2, 0, 1
+; GFX9-NEXT:    v_ldexp_f64 v[0:1], v[0:1], v2
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1030-LABEL: fmul_select_f64_test2:
+; GFX1030:       ; %bb.0:
+; GFX1030-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1030-NEXT:    v_bfe_i32 v2, v2, 0, 1
+; GFX1030-NEXT:    v_ldexp_f64 v[0:1], v[0:1], v2
+; GFX1030-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1100-LABEL: fmul_select_f64_test2:
+; GFX1100:       ; %bb.0:
+; GFX1100-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1100-NEXT:    v_bfe_i32 v2, v2, 0, 1
+; GFX1100-NEXT:    s_delay_alu instid0(VALU_DEP_1)
+; GFX1100-NEXT:    v_ldexp_f64 v[0:1], v[0:1], v2
+; GFX1100-NEXT:    s_setpc_b64 s[30:31]
+  %1 = select i1 %bool, double 0.500000e+00, double 1.000000e+00
+  %ldexp = fmul double %x, %1
+  ret double %ldexp
+}
+
+define <2 x double> @fmul_select_v2f64_test1(<2 x double> %x, <2 x i1> %bool) {
+; GFX9-LABEL: fmul_select_v2f64_test1:
+; GFX9:       ; %bb.0:
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    v_and_b32_e32 v4, 1, v4
+; GFX9-NEXT:    v_ldexp_f64 v[0:1], v[0:1], v4
+; GFX9-NEXT:    v_and_b32_e32 v4, 1, v5
+; GFX9-NEXT:    v_ldexp_f64 v[2:3], v[2:3], v4
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1030-LABEL: fmul_select_v2f64_test1:
+; GFX1030:       ; %bb.0:
+; GFX1030-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1030-NEXT:    v_and_b32_e32 v4, 1, v4
+; GFX1030-NEXT:    v_and_b32_e32 v5, 1, v5
+; GFX1030-NEXT:    v_ldexp_f64 v[0:1], v[0:1], v4
+; GFX1030-NEXT:    v_ldexp_f64 v[2:3], v[2:3], v5
+; GFX1030-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1100-LABEL: fmul_select_v2f64_test1:
+; GFX1100:       ; %bb.0:
+; GFX1100-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1100-NEXT:    v_and_b32_e32 v4, 1, v4
+; GFX1100-NEXT:    v_and_b32_e32 v5, 1, v5
+; GFX1100-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX1100-NEXT:    v_ldexp_f64 v[0:1], v[0:1], v4
+; GFX1100-NEXT:    v_ldexp_f64 v[2:3], v[2:3], v5
+; GFX1100-NEXT:    s_setpc_b64 s[30:31]
+  %1 = select <2 x i1> %bool, <2 x double> <double 2.000000e+00, double 2.000000e+00>, <2 x double> <double 1.000000e+00, double 1.000000e+00>
+  %ldexp = fmul <2 x double> %x, %1
+  ret <2 x double> %ldexp
+}
+
+define <2 x double> @fmul_select_v2f64_test2(<2 x double> %x, <2 x i1> %bool) {
+; GFX9-LABEL: fmul_select_v2f64_test2:
+; GFX9:       ; %bb.0:
+; GFX9-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT:    v_bfe_i32 v4, v4, 0, 1
+; GFX9-NEXT:    v_ldexp_f64 v[0:1], v[0:1], v4
+; GFX9-NEXT:    v_bfe_i32 v4, v5, 0, 1
+; GFX9-NEXT:    v_ldexp_f64 v[2:3], v[2:3], v4
+; GFX9-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1030-LABEL: fmul_select_v2f64_test2:
+; GFX1030:       ; %bb.0:
+; GFX1030-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1030-NEXT:    v_bfe_i32 v4, v4, 0, 1
+; GFX1030-NEXT:    v_bfe_i32 v5, v5, 0, 1
+; GFX1030-NEXT:    v_ldexp_f64 v[0:1], v[0:1], v4
+; GFX1030-NEXT:    v_ldexp_f64 v[2:3], v[2:3], v5
+; GFX1030-NEXT:    s_setpc_b64 s[30:31]
+;
+; GFX1100-LABEL: fmul_select_v2f64_test2:
+; GFX1100:       ; %bb.0:
+; GFX1100-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1100-NEXT:    v_bfe_i32 v4, v4, 0, 1
+; GFX1100-NEXT:    v_bfe_i32 v5, v5, 0, 1
+; GFX1100-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX1100-NEXT:    v_ldexp_f64 v[0:1], v[0:1], v4
+; GFX1100-NEXT:    v_ldexp_f64 v[2:3], v[2:3], v5
+; GFX1100-NEXT:    s_setpc_b64 s[30:31]
+  %1 = select <2 x i1> %bool, <2 x double> <double 0.500000e+00, double 0.500000e+00>, <2 x double> <double 1.000000e+00, double 1.000000e+00>
+  %ldexp = fmul <2 x double> %x, %1
+  ret <2 x double> %ldexp
+}

@vg0204 vg0204 force-pushed the vg0204/fmul-dag-combine-to-ldexp branch from 86a3023 to 4b0885d Compare October 4, 2024 07:34
@vg0204 vg0204 self-assigned this Oct 4, 2024
@vg0204 vg0204 requested a review from arsenm October 4, 2024 07:36
isConstOrConstSplatFP(RHS.getOperand(2));
bool isNeg;

if (!TrueNode || !FalseNode)

Early exit after first one fails before checking the second. Also check RHS first


Didn't do the early exit

else
return SDValue();

SDValue ExtNode = DAG.getNode(ExtOp, SL, MVT::i32, RHS.getOperand(0));

You can generalize this to select of integer, and handle any pair of powers of 2


if (TrueNode->isNegative() != FalseNode->isNegative())
return SDValue();
bool isNeg = TrueNode->isNegative();

Don't need isNeg variable

Comment on lines 14515 to 14526
if (FalseNode->isExactlyValue(1.0) || FalseNode->isExactlyValue(-1.0)) {
if (TrueNode->isExactlyValue(2.0) || TrueNode->isExactlyValue(-2.0))
ExtOp = ISD::ZERO_EXTEND;
else if (TrueNode->isExactlyValue(0.5) ||
TrueNode->isExactlyValue(-0.5))
ExtOp = ISD::SIGN_EXTEND;
else
return SDValue();
if (TrueNode->isExactlyValue(2.0) || TrueNode->isExactlyValue(-2.0)) {
SDValue ZExtNode =
DAG.getNode(ISD::ZERO_EXTEND, SL, i32VT, RHS.getOperand(0));
return DAG.getNode(ISD::FLDEXP, SL, VT, LHS, ZExtNode);
} else if (TrueNode->isExactlyValue(0.5) ||
TrueNode->isExactlyValue(-0.5)) {
SDValue SExtNode =
DAG.getNode(ISD::SIGN_EXTEND, SL, i32VT, RHS.getOperand(0));
return DAG.getNode(ISD::FLDEXP, SL, VT, LHS, SExtNode);
}
}

You don't need the special case for the extensions up here. We select the extensions into a select anyway

@vg0204 vg0204 force-pushed the vg0204/fmul-dag-combine-to-ldexp branch from 6906e39 to de6e306 Compare October 15, 2024 07:07
isConstOrConstSplatFP(RHS.getOperand(2));
bool isNeg;

if (!TrueNode || !FalseNode)

Didn't do the early exit

Comment on lines 14503 to 14504
// Note : It takes care of generic scenario which covers undoing
// of special case(zext/sext) as mentioned.

Suggested change
// Note : It takes care of generic scenario which covers undoing
// of special case(zext/sext) as mentioned.

Comment on lines 14493 to 14498
// The above mentioned ldexp folding works fine for
// bf16/f32, but as for f64 it creates f64 select which
// is costly to materialize as compared to f64 ldexp
// so here we undo the transform for f64 datatype.
// Also in case of f16, its cheaper to materialize inline
// 32 bit-constant (via ldexp use) rather than using fmul.

Comment is verbose. Just state that it's cheaper to use i32 inline constants than to materialize f16 or f64 values

// fmul x, (select y, -A, -B) -> ldexp( (fneg x), (select i32 y, a, b) )
// Note : It takes care of generic scenario which covers undoing
// of special case(zext/sext) as mentioned.
if (VT.getScalarType() == MVT::f64 || VT.getScalarType() == MVT::f16) {

For the non-inline constant cases, we should do this for f32 too.

Comment on lines 14490 to 14491
// ldexp(x, zext(i1 y)) -> fmul x, (select y, 2.0, 1.0)
// ldexp(x, sext(i1 y)) -> fmul x, (select y, 0.5, 1.0)

Explain where these transformations are done. Is it another DAG combine somewhere? If so, what stops your DAG combine from fighting with the other one?

@jayfoad (Contributor) commented Oct 15, 2024

Thus, it solves the issue #104900.

If you write this as just "Fixes #104900" then github will recognize it and close the issue for you when you merge this patch: https://docs.github.com/en/issues/tracking-your-work-with-issues/using-issues/linking-a-pull-request-to-an-issue

DAGCombinerInfo &DCI) const {
SelectionDAG &DAG = DCI.DAG;
EVT VT = N->getValueType(0);
EVT intVT = VT.changeElementType(MVT::i32);

Repeating myself, variable names should start with a capital letter.

Comment on lines 14514 to 14516
if (VT.getScalarType() == MVT::f32)
if (TII->isInlineConstant(TrueNode->getValueAPF()) &&
TII->isInlineConstant(FalseNode->getValueAPF()))

Can merge into one if. Also sink this below the isNegative early exit below

@arsenm (Contributor) commented Oct 16, 2024

Missing test updates?

@vg0204 vg0204 force-pushed the vg0204/fmul-dag-combine-to-ldexp branch from e2fd76a to 4b9d3a0 Compare October 21, 2024 06:37
@vg0204 (Contributor, Author) commented Oct 21, 2024

Missing test updates?

Addressed!

Comment on lines 14503 to 14629
const ConstantFPSDNode *TrueNode =
isConstOrConstSplatFP(RHS.getOperand(1));
const ConstantFPSDNode *FalseNode =
isConstOrConstSplatFP(RHS.getOperand(2));

bool AreNodesFP = TrueNode && FalseNode;
if (!AreNodesFP)
return SDValue();

Suggested change
const ConstantFPSDNode *TrueNode =
isConstOrConstSplatFP(RHS.getOperand(1));
const ConstantFPSDNode *FalseNode =
isConstOrConstSplatFP(RHS.getOperand(2));
bool AreNodesFP = TrueNode && FalseNode;
if (!AreNodesFP)
return SDValue();
const ConstantFPSDNode *FalseNode =
isConstOrConstSplatFP(RHS.getOperand(2));
if (!FalseNode)
return SDValue();
const ConstantFPSDNode *TrueNode =
isConstOrConstSplatFP(RHS.getOperand(1));
if (!TrueNode)
return SDValue();


ping here

Comment on lines 14526 to 14527
int FalseNodeExpVal = FalseNode->getValueAPF().getExactLog2Abs();
if (TrueNodeExpVal != INT_MIN && FalseNodeExpVal != INT_MIN) {

Early exit after one fails
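getExactLog2Abs() returns INT_MIN when |value| is not an exact power of two, which is the sentinel these checks test against. A rough standalone equivalent using std::frexp (illustrative only, not the APFloat implementation):

```cpp
#include <cassert>
#include <climits>
#include <cmath>

// Returns log2(|v|) when |v| is exactly a power of two, else INT_MIN
// (the same sentinel convention as APFloat::getExactLog2Abs).
int exactLog2Abs(double v) {
  if (v == 0.0 || std::isinf(v) || std::isnan(v))
    return INT_MIN;
  int exp;
  // std::frexp decomposes |v| as m * 2^exp with m in [0.5, 1);
  // |v| is a power of two exactly when the mantissa m is 0.5.
  double m = std::frexp(std::fabs(v), &exp);
  return m == 0.5 ? exp - 1 : INT_MIN;
}
```

With this shape, the early-exit style asked for above is natural: compute one operand's exponent, bail out on INT_MIN, then compute the other.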

;RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1030 < %s | FileCheck -check-prefix=GFX1030 %s
;RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX1100 %s

define float @fmul_select_f32_test1(float %x, i32 %bool.arg1, i32 %bool.arg2) {

Can you rebase this to be on top of the test add

@vg0204 (Contributor, Author) commented Oct 28, 2024

Kind Ping!

@arsenm (Contributor) left a comment

Would be easier to see diff on top of the baseline tests

Comment on lines 14503 to 14629

ping here

Comment on lines 14500 to 14501
if (VT.getScalarType() == MVT::f64 || VT.getScalarType() == MVT::f32 ||
VT.getScalarType() == MVT::f16) {

Factor out repeated getScalarType call

@vg0204 vg0204 force-pushed the vg0204/fmul-dag-combine-to-ldexp branch from c27b98e to 832eb0f Compare November 4, 2024 08:18
@@ -0,0 +1,2616 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
;RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx700 < %s | FileCheck -check-prefix=GFX7 %s

Should rebase on top of the test base. Patch lgtm with nit but I can't easily check the effect in the output here

arsenm pushed a commit that referenced this pull request Nov 25, 2024
This adds f32/f64/f16/bf16 test cases for the pattern below:

`fmul x, select(y, A, B)`
with just one use of the select instruction above.

It acts as pre-commit tests for DAG-combining the above pattern into a cheaper
ldexp in the case of non-inline 32-bit constants. (#111109)
@vg0204 vg0204 force-pushed the vg0204/fmul-dag-combine-to-ldexp branch 2 times, most recently from 75176be to b250aba Compare November 26, 2024 07:38
For the f32/f16, this combine does no improvements, but for f64 this
specific case of fmul with select is more costly to materialize as
compared to ldexp, so the following dag combine does the magic.

fmul x, select(y, 2.0, 1.0) -> ldexp x, zext(i1 y)
fmul x, select(y, 0.5, 1.0) -> ldexp x, sext(i1 y)
@vg0204 vg0204 force-pushed the vg0204/fmul-dag-combine-to-ldexp branch from b250aba to b1fdd89 Compare November 26, 2024 07:39
@vg0204 (Contributor, Author) commented Nov 26, 2024

I updated the new test cases as well as the old ones that needed changes, but for some reason one test case,

llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll

gives an invalid bitcast error ("from i16 to i32") when I try to update it with the update_llc_test_checks.py script. Bizarrely, when I run the test's command sequence directly in a terminal it works fine, generating the assembly as expected with the DAG combine applied. I am perplexed!

@arsenm (Contributor) commented Nov 27, 2024

I updated the new test cases as well as the old ones that needed changes, but for some reason one test case, llvm/test/CodeGen/AMDGPU/amdgpu-simplify-libcall-pow-codegen.ll, gives an invalid bitcast error ("from i16 to i32") when updated with the update_llc_test_checks.py script, even though running the test's command sequence directly in a terminal works fine.

I don't know how you could hit this bitcast issue, but it should be fixed however it is arising.

This test is odd in that it has an llc line with an opt invocation, which update_llc_test_checks does not understand. In any case, there should be no path to hitting an invalid bitcast.

@vg0204 (Contributor, Author) commented Nov 28, 2024

@arsenm, still figuring it out; the test feeds opt's generated output to llc via a pipe.

Just to point out: when executing the commands in the lit test directly, it works fine, but it gets messed up when updating via the update_llc_test_checks script. Could you comment on that?

@vg0204 (Contributor, Author) commented Nov 28, 2024

@arsenm, as you said, there is no issue with the bitcast itself; it came from a bizarre Docker issue. On a bare-metal machine, as you suspected, the entire opt line (a preprocessing command invoked by the update_llc_test_checks.py script) is the problem: it is executed verbatim as on a terminal, so the opt binary cannot be located, since no path handling is done for it as for the llc tool. I verified this by hardcoding the correct opt path.

@vg0204 vg0204 force-pushed the vg0204/fmul-dag-combine-to-ldexp branch from 1af8afc to be435f8 Compare November 28, 2024 16:37
@vg0204 (Contributor, Author) commented Nov 28, 2024

@arsenm, as you said, there is no issue with the bitcast itself; it came from a bizarre Docker issue. On a bare-metal machine, as you suspected, the entire opt line (a preprocessing command invoked by the update_llc_test_checks.py script) is the problem: it is executed verbatim as on a terminal, so the opt binary path cannot be located, unlike for the llc tool.

I handled it by hardcoding the opt path, updating the checks via the script, and then reverting the opt path back to the original in the test. But in the future this will be a problem if someone runs the update-llc script again and has to repeat the same hack.

@vg0204 vg0204 requested review from jayfoad and arsenm November 29, 2024 09:23
@arsenm (Contributor) left a comment

LGTM. Please implement the same for globalisel in a follow up

Comment on lines +3321 to +3324
; GFX9-GISEL-NEXT: v_mov_b32_e32 v2, 0x42800000
; GFX9-GISEL-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0
; GFX9-GISEL-NEXT: v_cndmask_b32_e32 v0, 1.0, v2, vcc
; GFX9-GISEL-NEXT: v_mul_f32_e32 v0, v1, v0

TODO: Implement this for globalisel

@vg0204 vg0204 merged commit 0b0d9a3 into llvm:main Dec 9, 2024
5 of 8 checks passed
vg0204 added a commit that referenced this pull request Jan 6, 2025
#120104)

This combine pattern performs the transformation below.

fmul x, select(y, A, B)      -> fldexp (x, select i32 (y, a, b))
fmul x, select(y, -A, -B)   -> fldexp ((fneg x), select i32 (y, a, b))

where, A=2^a & B=2^b ; a and b are integers.

It is a follow-up PR to implement the above combine for globalIsel, as
the corresponding DAG combine has been done for SelectionDAG Isel
(#111109)
paulhuggett pushed a commit to paulhuggett/llvm-project that referenced this pull request Jan 7, 2025
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Mar 13, 2025
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Mar 13, 2025

Successfully merging this pull request may close these issues.

[AMDGPU] InstCombine results in performance drop in ROCM's rocRAND library in MI100
4 participants