-
Notifications
You must be signed in to change notification settings - Fork 14.5k
[CodeGen] [AMDGPU] Attempt DAGCombine for fmul with select to ldexp #111109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@llvm/pr-subscribers-backend-amdgpu Author: Vikash Gupta (vg0204) ChangesFor the f32/f16, this combine does no improvements, but for f64 this specific case of fmul with select is more costly to materialize as compared to ldexp, so the following dag combine does the magic.
Thus, it solves the issue #104900. Full diff: https://github.com/llvm/llvm-project/pull/111109.diff 3 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 5e4cf705cc9e47..ec707fde24e20c 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -899,6 +899,7 @@ SITargetLowering::SITargetLowering(const TargetMachine &TM,
ISD::FADD,
ISD::FSUB,
ISD::FDIV,
+ ISD::FMUL,
ISD::FMINNUM,
ISD::FMAXNUM,
ISD::FMINNUM_IEEE,
@@ -14476,6 +14477,58 @@ SDValue SITargetLowering::performFDivCombine(SDNode *N,
return SDValue();
}
+SDValue SITargetLowering::performFMulCombine(SDNode *N,
+ DAGCombinerInfo &DCI) const {
+ SelectionDAG &DAG = DCI.DAG;
+ EVT VT = N->getValueType(0);
+
+ SDLoc SL(N);
+ SDValue LHS = N->getOperand(0);
+ SDValue RHS = N->getOperand(1);
+
+ // ldexp(x, zext(i1 y)) -> fmul x, (select y, 2.0, 1.0)
+ // ldexp(x, sext(i1 y)) -> fmul x, (select y, 0.5, 1.0)
+ //
+ // The above mentioned ldexp folding works fine for
+ // f16/f32, but as for f64 it creates f64 select which
+ // is costly to materealize as compared to f64 ldexp
+ // so here we undo the transform for f64 as follows :
+ //
+ // fmul x, (select y, 2.0, 1.0) -> ldexp(x, zext(i1 y))
+ // fmul x, (select y, 0.5, 1.0) -> ldexp(x, sext(i1 y))
+ // TODO : Need to handle vector of f64 type.
+ if (VT == MVT::f64) {
+ if (RHS.hasOneUse() && RHS.getOpcode() == ISD::SELECT) {
+ const ConstantFPSDNode *TrueNode =
+ dyn_cast<ConstantFPSDNode>(RHS.getOperand(1));
+ const ConstantFPSDNode *FalseNode =
+ dyn_cast<ConstantFPSDNode>(RHS.getOperand(2));
+
+ if (!TrueNode || !FalseNode)
+ return SDValue();
+
+ const double TrueVal = TrueNode->getValueAPF().convertToDouble();
+ const double FalseVal = FalseNode->getValueAPF().convertToDouble();
+ unsigned ExtOp;
+
+ if (FalseVal == 1.0) {
+ if (TrueVal == 2.0)
+ ExtOp = ISD::ZERO_EXTEND;
+ else if (TrueVal == 0.5)
+ ExtOp = ISD::SIGN_EXTEND;
+ else
+ return SDValue();
+
+ SDValue ExtNode =
+ DAG.getNode(ExtOp, SL, MVT::i32, RHS.getOperand(0));
+ return DAG.getNode(ISD::FLDEXP, SL, MVT::f64, LHS, ExtNode);
+ }
+ }
+ }
+
+ return SDValue();
+}
+
SDValue SITargetLowering::performFMACombine(SDNode *N,
DAGCombinerInfo &DCI) const {
SelectionDAG &DAG = DCI.DAG;
@@ -14765,6 +14818,8 @@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
return performFSubCombine(N, DCI);
case ISD::FDIV:
return performFDivCombine(N, DCI);
+ case ISD::FMUL:
+ return performFMulCombine(N, DCI);
case ISD::SETCC:
return performSetCCCombine(N, DCI);
case ISD::FMAXNUM:
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.h b/llvm/lib/Target/AMDGPU/SIISelLowering.h
index 6c3edf37945e24..1ead2e4fc916bb 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.h
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.h
@@ -218,6 +218,7 @@ class SITargetLowering final : public AMDGPUTargetLowering {
SDValue performFAddCombine(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue performFSubCombine(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue performFDivCombine(SDNode *N, DAGCombinerInfo &DCI) const;
+ SDValue performFMulCombine(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue performFMACombine(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue performSetCCCombine(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue performCvtF32UByteNCombine(SDNode *N, DAGCombinerInfo &DCI) const;
diff --git a/llvm/test/CodeGen/AMDGPU/combine-fmul-sel.ll b/llvm/test/CodeGen/AMDGPU/combine-fmul-sel.ll
new file mode 100644
index 00000000000000..0748aa0a0abec0
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/combine-fmul-sel.ll
@@ -0,0 +1,280 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+;RUN: llc < %s -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs | FileCheck -check-prefix=GFX9 %s
+;RUN: llc < %s -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1030 -verify-machineinstrs | FileCheck -check-prefix=GFX1030 %s
+;RUN: llc < %s -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 -verify-machineinstrs | FileCheck -check-prefix=GFX1100 %s
+
+define float @fmul_select_f32_test1(float %x, i1 %bool) {
+; GFX9-LABEL: fmul_select_f32_test1:
+; GFX9: ; %bb.0:
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: v_and_b32_e32 v1, 1, v1
+; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v1
+; GFX9-NEXT: v_cndmask_b32_e64 v1, 1.0, 2.0, vcc
+; GFX9-NEXT: v_mul_f32_e32 v0, v0, v1
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX1030-LABEL: fmul_select_f32_test1:
+; GFX1030: ; %bb.0:
+; GFX1030-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1030-NEXT: v_and_b32_e32 v1, 1, v1
+; GFX1030-NEXT: v_cmp_eq_u32_e32 vcc_lo, 1, v1
+; GFX1030-NEXT: v_cndmask_b32_e64 v1, 1.0, 2.0, vcc_lo
+; GFX1030-NEXT: v_mul_f32_e32 v0, v0, v1
+; GFX1030-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX1100-LABEL: fmul_select_f32_test1:
+; GFX1100: ; %bb.0:
+; GFX1100-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1100-NEXT: v_and_b32_e32 v1, 1, v1
+; GFX1100-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX1100-NEXT: v_cmp_eq_u32_e32 vcc_lo, 1, v1
+; GFX1100-NEXT: v_cndmask_b32_e64 v1, 1.0, 2.0, vcc_lo
+; GFX1100-NEXT: v_mul_f32_e32 v0, v0, v1
+; GFX1100-NEXT: s_setpc_b64 s[30:31]
+ %1 = select i1 %bool, float 2.000000e+00, float 1.000000e+00
+ %ldexp = fmul float %x, %1
+ ret float %ldexp
+}
+
+define float @fmul_select_f32_test2(float %x, i1 %bool) {
+; GFX9-LABEL: fmul_select_f32_test2:
+; GFX9: ; %bb.0:
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: v_and_b32_e32 v1, 1, v1
+; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v1
+; GFX9-NEXT: v_cndmask_b32_e64 v1, 1.0, 0.5, vcc
+; GFX9-NEXT: v_mul_f32_e32 v0, v0, v1
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX1030-LABEL: fmul_select_f32_test2:
+; GFX1030: ; %bb.0:
+; GFX1030-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1030-NEXT: v_and_b32_e32 v1, 1, v1
+; GFX1030-NEXT: v_cmp_eq_u32_e32 vcc_lo, 1, v1
+; GFX1030-NEXT: v_cndmask_b32_e64 v1, 1.0, 0.5, vcc_lo
+; GFX1030-NEXT: v_mul_f32_e32 v0, v0, v1
+; GFX1030-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX1100-LABEL: fmul_select_f32_test2:
+; GFX1100: ; %bb.0:
+; GFX1100-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1100-NEXT: v_and_b32_e32 v1, 1, v1
+; GFX1100-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX1100-NEXT: v_cmp_eq_u32_e32 vcc_lo, 1, v1
+; GFX1100-NEXT: v_cndmask_b32_e64 v1, 1.0, 0.5, vcc_lo
+; GFX1100-NEXT: v_mul_f32_e32 v0, v0, v1
+; GFX1100-NEXT: s_setpc_b64 s[30:31]
+ %1 = select i1 %bool, float 0.500000e+00, float 1.000000e+00
+ %ldexp = fmul float %x, %1
+ ret float %ldexp
+}
+
+define <2 x float> @fmul_select_v2f32_test1(<2 x float> %x, <2 x i1> %bool) {
+; GFX9-LABEL: fmul_select_v2f32_test1:
+; GFX9: ; %bb.0:
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: v_and_b32_e32 v3, 1, v3
+; GFX9-NEXT: v_and_b32_e32 v2, 1, v2
+; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v3
+; GFX9-NEXT: v_cndmask_b32_e64 v3, 1.0, 2.0, vcc
+; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v2
+; GFX9-NEXT: v_cndmask_b32_e64 v2, 1.0, 2.0, vcc
+; GFX9-NEXT: v_mul_f32_e32 v0, v0, v2
+; GFX9-NEXT: v_mul_f32_e32 v1, v1, v3
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX1030-LABEL: fmul_select_v2f32_test1:
+; GFX1030: ; %bb.0:
+; GFX1030-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1030-NEXT: v_and_b32_e32 v2, 1, v2
+; GFX1030-NEXT: v_and_b32_e32 v3, 1, v3
+; GFX1030-NEXT: v_cmp_eq_u32_e32 vcc_lo, 1, v2
+; GFX1030-NEXT: v_cndmask_b32_e64 v2, 1.0, 2.0, vcc_lo
+; GFX1030-NEXT: v_cmp_eq_u32_e32 vcc_lo, 1, v3
+; GFX1030-NEXT: v_mul_f32_e32 v0, v0, v2
+; GFX1030-NEXT: v_cndmask_b32_e64 v3, 1.0, 2.0, vcc_lo
+; GFX1030-NEXT: v_mul_f32_e32 v1, v1, v3
+; GFX1030-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX1100-LABEL: fmul_select_v2f32_test1:
+; GFX1100: ; %bb.0:
+; GFX1100-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1100-NEXT: v_and_b32_e32 v2, 1, v2
+; GFX1100-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX1100-NEXT: v_cmp_eq_u32_e32 vcc_lo, 1, v2
+; GFX1100-NEXT: v_cndmask_b32_e64 v2, 1.0, 2.0, vcc_lo
+; GFX1100-NEXT: v_dual_mul_f32 v0, v0, v2 :: v_dual_and_b32 v3, 1, v3
+; GFX1100-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX1100-NEXT: v_cmp_eq_u32_e32 vcc_lo, 1, v3
+; GFX1100-NEXT: v_cndmask_b32_e64 v3, 1.0, 2.0, vcc_lo
+; GFX1100-NEXT: v_mul_f32_e32 v1, v1, v3
+; GFX1100-NEXT: s_setpc_b64 s[30:31]
+ %1 = select <2 x i1> %bool, <2 x float> <float 2.000000e+00, float 2.000000e+00>, <2 x float> <float 1.000000e+00, float 1.000000e+00>
+ %ldexp = fmul <2 x float> %x, %1
+ ret <2 x float> %ldexp
+}
+
+define <2 x float> @fmul_select_v2f32_test2(<2 x float> %x, <2 x i1> %bool) {
+; GFX9-LABEL: fmul_select_v2f32_test2:
+; GFX9: ; %bb.0:
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: v_and_b32_e32 v3, 1, v3
+; GFX9-NEXT: v_and_b32_e32 v2, 1, v2
+; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v3
+; GFX9-NEXT: v_cndmask_b32_e64 v3, 1.0, 0.5, vcc
+; GFX9-NEXT: v_cmp_eq_u32_e32 vcc, 1, v2
+; GFX9-NEXT: v_cndmask_b32_e64 v2, 1.0, 0.5, vcc
+; GFX9-NEXT: v_mul_f32_e32 v0, v0, v2
+; GFX9-NEXT: v_mul_f32_e32 v1, v1, v3
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX1030-LABEL: fmul_select_v2f32_test2:
+; GFX1030: ; %bb.0:
+; GFX1030-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1030-NEXT: v_and_b32_e32 v2, 1, v2
+; GFX1030-NEXT: v_and_b32_e32 v3, 1, v3
+; GFX1030-NEXT: v_cmp_eq_u32_e32 vcc_lo, 1, v2
+; GFX1030-NEXT: v_cndmask_b32_e64 v2, 1.0, 0.5, vcc_lo
+; GFX1030-NEXT: v_cmp_eq_u32_e32 vcc_lo, 1, v3
+; GFX1030-NEXT: v_mul_f32_e32 v0, v0, v2
+; GFX1030-NEXT: v_cndmask_b32_e64 v3, 1.0, 0.5, vcc_lo
+; GFX1030-NEXT: v_mul_f32_e32 v1, v1, v3
+; GFX1030-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX1100-LABEL: fmul_select_v2f32_test2:
+; GFX1100: ; %bb.0:
+; GFX1100-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1100-NEXT: v_and_b32_e32 v2, 1, v2
+; GFX1100-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX1100-NEXT: v_cmp_eq_u32_e32 vcc_lo, 1, v2
+; GFX1100-NEXT: v_cndmask_b32_e64 v2, 1.0, 0.5, vcc_lo
+; GFX1100-NEXT: v_dual_mul_f32 v0, v0, v2 :: v_dual_and_b32 v3, 1, v3
+; GFX1100-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
+; GFX1100-NEXT: v_cmp_eq_u32_e32 vcc_lo, 1, v3
+; GFX1100-NEXT: v_cndmask_b32_e64 v3, 1.0, 0.5, vcc_lo
+; GFX1100-NEXT: v_mul_f32_e32 v1, v1, v3
+; GFX1100-NEXT: s_setpc_b64 s[30:31]
+ %1 = select <2 x i1> %bool, <2 x float> <float 0.500000e+00, float 0.500000e+00>, <2 x float> <float 1.000000e+00, float 1.000000e+00>
+ %ldexp = fmul <2 x float> %x, %1
+ ret <2 x float> %ldexp
+}
+
+define double @fmul_select_f64_test1(double %x, i1 %bool) {
+; GFX9-LABEL: fmul_select_f64_test1:
+; GFX9: ; %bb.0:
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: v_and_b32_e32 v2, 1, v2
+; GFX9-NEXT: v_ldexp_f64 v[0:1], v[0:1], v2
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX1030-LABEL: fmul_select_f64_test1:
+; GFX1030: ; %bb.0:
+; GFX1030-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1030-NEXT: v_and_b32_e32 v2, 1, v2
+; GFX1030-NEXT: v_ldexp_f64 v[0:1], v[0:1], v2
+; GFX1030-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX1100-LABEL: fmul_select_f64_test1:
+; GFX1100: ; %bb.0:
+; GFX1100-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1100-NEXT: v_and_b32_e32 v2, 1, v2
+; GFX1100-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1100-NEXT: v_ldexp_f64 v[0:1], v[0:1], v2
+; GFX1100-NEXT: s_setpc_b64 s[30:31]
+ %1 = select i1 %bool, double 2.000000e+00, double 1.000000e+00
+ %ldexp = fmul double %x, %1
+ ret double %ldexp
+}
+
+define double @fmul_select_f64_test2(double %x, i1 %bool) {
+; GFX9-LABEL: fmul_select_f64_test2:
+; GFX9: ; %bb.0:
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: v_bfe_i32 v2, v2, 0, 1
+; GFX9-NEXT: v_ldexp_f64 v[0:1], v[0:1], v2
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX1030-LABEL: fmul_select_f64_test2:
+; GFX1030: ; %bb.0:
+; GFX1030-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1030-NEXT: v_bfe_i32 v2, v2, 0, 1
+; GFX1030-NEXT: v_ldexp_f64 v[0:1], v[0:1], v2
+; GFX1030-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX1100-LABEL: fmul_select_f64_test2:
+; GFX1100: ; %bb.0:
+; GFX1100-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1100-NEXT: v_bfe_i32 v2, v2, 0, 1
+; GFX1100-NEXT: s_delay_alu instid0(VALU_DEP_1)
+; GFX1100-NEXT: v_ldexp_f64 v[0:1], v[0:1], v2
+; GFX1100-NEXT: s_setpc_b64 s[30:31]
+ %1 = select i1 %bool, double 0.500000e+00, double 1.000000e+00
+ %ldexp = fmul double %x, %1
+ ret double %ldexp
+}
+
+define <2 x double> @fmul_select_v2f64_test1(<2 x double> %x, <2 x i1> %bool) {
+; GFX9-LABEL: fmul_select_v2f64_test1:
+; GFX9: ; %bb.0:
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: v_and_b32_e32 v4, 1, v4
+; GFX9-NEXT: v_ldexp_f64 v[0:1], v[0:1], v4
+; GFX9-NEXT: v_and_b32_e32 v4, 1, v5
+; GFX9-NEXT: v_ldexp_f64 v[2:3], v[2:3], v4
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX1030-LABEL: fmul_select_v2f64_test1:
+; GFX1030: ; %bb.0:
+; GFX1030-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1030-NEXT: v_and_b32_e32 v4, 1, v4
+; GFX1030-NEXT: v_and_b32_e32 v5, 1, v5
+; GFX1030-NEXT: v_ldexp_f64 v[0:1], v[0:1], v4
+; GFX1030-NEXT: v_ldexp_f64 v[2:3], v[2:3], v5
+; GFX1030-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX1100-LABEL: fmul_select_v2f64_test1:
+; GFX1100: ; %bb.0:
+; GFX1100-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1100-NEXT: v_and_b32_e32 v4, 1, v4
+; GFX1100-NEXT: v_and_b32_e32 v5, 1, v5
+; GFX1100-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX1100-NEXT: v_ldexp_f64 v[0:1], v[0:1], v4
+; GFX1100-NEXT: v_ldexp_f64 v[2:3], v[2:3], v5
+; GFX1100-NEXT: s_setpc_b64 s[30:31]
+ %1 = select <2 x i1> %bool, <2 x double> <double 2.000000e+00, double 2.000000e+00>, <2 x double> <double 1.000000e+00, double 1.000000e+00>
+ %ldexp = fmul <2 x double> %x, %1
+ ret <2 x double> %ldexp
+}
+
+define <2 x double> @fmul_select_v2f64_test2(<2 x double> %x, <2 x i1> %bool) {
+; GFX9-LABEL: fmul_select_v2f64_test2:
+; GFX9: ; %bb.0:
+; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX9-NEXT: v_bfe_i32 v4, v4, 0, 1
+; GFX9-NEXT: v_ldexp_f64 v[0:1], v[0:1], v4
+; GFX9-NEXT: v_bfe_i32 v4, v5, 0, 1
+; GFX9-NEXT: v_ldexp_f64 v[2:3], v[2:3], v4
+; GFX9-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX1030-LABEL: fmul_select_v2f64_test2:
+; GFX1030: ; %bb.0:
+; GFX1030-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1030-NEXT: v_bfe_i32 v4, v4, 0, 1
+; GFX1030-NEXT: v_bfe_i32 v5, v5, 0, 1
+; GFX1030-NEXT: v_ldexp_f64 v[0:1], v[0:1], v4
+; GFX1030-NEXT: v_ldexp_f64 v[2:3], v[2:3], v5
+; GFX1030-NEXT: s_setpc_b64 s[30:31]
+;
+; GFX1100-LABEL: fmul_select_v2f64_test2:
+; GFX1100: ; %bb.0:
+; GFX1100-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
+; GFX1100-NEXT: v_bfe_i32 v4, v4, 0, 1
+; GFX1100-NEXT: v_bfe_i32 v5, v5, 0, 1
+; GFX1100-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX1100-NEXT: v_ldexp_f64 v[0:1], v[0:1], v4
+; GFX1100-NEXT: v_ldexp_f64 v[2:3], v[2:3], v5
+; GFX1100-NEXT: s_setpc_b64 s[30:31]
+ %1 = select <2 x i1> %bool, <2 x double> <double 0.500000e+00, double 0.500000e+00>, <2 x double> <double 1.000000e+00, double 1.000000e+00>
+ %ldexp = fmul <2 x double> %x, %1
+ ret <2 x double> %ldexp
+}
|
86a3023
to
4b0885d
Compare
isConstOrConstSplatFP(RHS.getOperand(2)); | ||
bool isNeg; | ||
|
||
if (!TrueNode || !FalseNode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Early exit after first one fails before checking the second. Also check RHS first
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't do the early exit
else | ||
return SDValue(); | ||
|
||
SDValue ExtNode = DAG.getNode(ExtOp, SL, MVT::i32, RHS.getOperand(0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can generalize this to select of integer, and handle any pair of powers of 2
|
||
if (TrueNode->isNegative() != FalseNode->isNegative()) | ||
return SDValue(); | ||
bool isNeg = TrueNode->isNegative(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need isNeg variable
if (FalseNode->isExactlyValue(1.0) || FalseNode->isExactlyValue(-1.0)) { | ||
if (TrueNode->isExactlyValue(2.0) || TrueNode->isExactlyValue(-2.0)) | ||
ExtOp = ISD::ZERO_EXTEND; | ||
else if (TrueNode->isExactlyValue(0.5) || | ||
TrueNode->isExactlyValue(-0.5)) | ||
ExtOp = ISD::SIGN_EXTEND; | ||
else | ||
return SDValue(); | ||
if (TrueNode->isExactlyValue(2.0) || TrueNode->isExactlyValue(-2.0)) { | ||
SDValue ZExtNode = | ||
DAG.getNode(ISD::ZERO_EXTEND, SL, i32VT, RHS.getOperand(0)); | ||
return DAG.getNode(ISD::FLDEXP, SL, VT, LHS, ZExtNode); | ||
} else if (TrueNode->isExactlyValue(0.5) || | ||
TrueNode->isExactlyValue(-0.5)) { | ||
SDValue SExtNode = | ||
DAG.getNode(ISD::SIGN_EXTEND, SL, i32VT, RHS.getOperand(0)); | ||
return DAG.getNode(ISD::FLDEXP, SL, VT, LHS, SExtNode); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need the special case for the extensions up here. We select the extensions into a select anyway
6906e39
to
de6e306
Compare
isConstOrConstSplatFP(RHS.getOperand(2)); | ||
bool isNeg; | ||
|
||
if (!TrueNode || !FalseNode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't do the early exit
// Note : It takes care of generic scenario which covers undoing | ||
// of special case(zext/sext) as mentioned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Note : It takes care of generic scenario which covers undoing | |
// of special case(zext/sext) as mentioned. |
// The above mentioned ldexp folding works fine for | ||
// bf16/f32, but as for f64 it creates f64 select which | ||
// is costly to materialize as compared to f64 ldexp | ||
// so here we undo the transform for f64 datatype. | ||
// Also in case of f16, its cheaper to materialize inline | ||
// 32 bit-constant (via ldexp use) rather than using fmul. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment is verbose. Just state that it's cheaper to use i32 inline constants than to materialize f16 or f64 values
// fmul x, (select y, -A, -B) -> ldexp( (fneg x), (select i32 y, a, b) ) | ||
// Note : It takes care of generic scenario which covers undoing | ||
// of special case(zext/sext) as mentioned. | ||
if (VT.getScalarType() == MVT::f64 || VT.getScalarType() == MVT::f16) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the non-inline constant cases, we should do this for f32 too.
// ldexp(x, zext(i1 y)) -> fmul x, (select y, 2.0, 1.0) | ||
// ldexp(x, sext(i1 y)) -> fmul x, (select y, 0.5, 1.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain where these transformations are done. Is it another DAG combine somewhere? If so, what stops your DAG combine from fighting with the other one?
If you write this as just "Fixes #104900" then github will recognize it and close the issue for you when you merge this patch: https://docs.github.com/en/issues/tracking-your-work-with-issues/using-issues/linking-a-pull-request-to-an-issue |
DAGCombinerInfo &DCI) const { | ||
SelectionDAG &DAG = DCI.DAG; | ||
EVT VT = N->getValueType(0); | ||
EVT intVT = VT.changeElementType(MVT::i32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Repeating myself, variable names should start with a capital letter.
if (VT.getScalarType() == MVT::f32) | ||
if (TII->isInlineConstant(TrueNode->getValueAPF()) && | ||
TII->isInlineConstant(FalseNode->getValueAPF())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can merge into one if. Also sink this below the isNegative early exit below
Missing test updates? |
e2fd76a
to
4b9d3a0
Compare
Addressed! |
const ConstantFPSDNode *TrueNode = | ||
isConstOrConstSplatFP(RHS.getOperand(1)); | ||
const ConstantFPSDNode *FalseNode = | ||
isConstOrConstSplatFP(RHS.getOperand(2)); | ||
|
||
bool AreNodesFP = TrueNode && FalseNode; | ||
if (!AreNodesFP) | ||
return SDValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const ConstantFPSDNode *TrueNode = | |
isConstOrConstSplatFP(RHS.getOperand(1)); | |
const ConstantFPSDNode *FalseNode = | |
isConstOrConstSplatFP(RHS.getOperand(2)); | |
bool AreNodesFP = TrueNode && FalseNode; | |
if (!AreNodesFP) | |
return SDValue(); | |
const ConstantFPSDNode *FalseNode = | |
isConstOrConstSplatFP(RHS.getOperand(2)); | |
if (!FalseNode) | |
return SDValue(); | |
const ConstantFPSDNode *TrueNode = | |
isConstOrConstSplatFP(RHS.getOperand(1)); | |
if (!TrueNode) | |
return SDValue(); | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping here
int FalseNodeExpVal = FalseNode->getValueAPF().getExactLog2Abs(); | ||
if (TrueNodeExpVal != INT_MIN && FalseNodeExpVal != INT_MIN) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Early exit after one fails
;RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1030 < %s | FileCheck -check-prefix=GFX1030 %s | ||
;RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1100 < %s | FileCheck -check-prefix=GFX1100 %s | ||
|
||
define float @fmul_select_f32_test1(float %x, i32 %bool.arg1, i32 %bool.arg2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rebase this to be on top of the test add
Kind Ping! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be easier to see diff on top of the baseline tests
const ConstantFPSDNode *TrueNode = | ||
isConstOrConstSplatFP(RHS.getOperand(1)); | ||
const ConstantFPSDNode *FalseNode = | ||
isConstOrConstSplatFP(RHS.getOperand(2)); | ||
|
||
bool AreNodesFP = TrueNode && FalseNode; | ||
if (!AreNodesFP) | ||
return SDValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping here
if (VT.getScalarType() == MVT::f64 || VT.getScalarType() == MVT::f32 || | ||
VT.getScalarType() == MVT::f16) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Factor out repeated getScalarType call
c27b98e
to
832eb0f
Compare
@@ -0,0 +1,2616 @@ | |||
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5 | |||
;RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx700 < %s | FileCheck -check-prefix=GFX7 %s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should rebase on top of the test base. Patch lgtm with nit but I can't easily check the effect in the output here
This adds the f32/f64/f16/bf16 test cases for below pattern : `fmul x, select(y, A, B)` with just one use of select Inst above. It acts as pre-commit tests for dagCombining above pattern into cheaper ldexp in case of non-inlline 32 bit-constants. (#111109)
75176be
to
b250aba
Compare
For the f32/f16, this combine does no improvements, but for f64 this specific case of fmul with select is more costly to materialize as compared to ldexp, so the following dag combine does the magic. fmul x, select(y, 2.0, 1.0) -> ldexp x, zext(i1 y) fmul x, selcet(y, 0.5, 1.0) -> ldexp x, sext(i1 y)
b250aba
to
b1fdd89
Compare
I updated with new test cases as well as the old ones that needs changes, but for some reason, the only below testcase
It is giving invalid bitcast error "from i16 to i32" while trying to update with update_llc_tests_check.py script. But, bizarelly when running the command sequence of the test in terminal its working fine, generating assembly as supposed to be with dagcombine happening. I am perplexed!! |
I don't know how you could hit this bitcast issue, but should fix however it is getting it. This test is odd in that it has an llc line with an opt invocation, which update_llc_test_checks does not understand. In any case, there should be no path to hitting an invalid bitcast |
@arsenm, Still figuring out; they used opt's generated output file o be fed to llc using pipe command Just to point out that when executing the commmands directly mentioned in lit test, its working fine but getting messed up via updation using llc_check_tests script. Could you comment on that !! |
@arsenm, as you were saying there is no issue with bitcast, but owing to some bizarre docker issue, that came. On baremetal machine, as you suspected opt entir line(a preprocess command invoked by lllc-update python script) is creating problem, as it is entirely executed as it is on terminal, thus unable to locate opt binary, as no stripping has been done for it like llc tool! Tested this by hardcoding the right opt path |
1af8afc
to
be435f8
Compare
I handled it, by letting hardcoding the opt path, followed by updating it via script & reverting the opt path back to original as was in test. But, in future it would be a problem if someone else tries to run update-llc script again & need to again do the same hack as I did now! |
…reating ldexp dagnode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Please implement the same for globalisel in a follow up
; GFX9-GISEL-NEXT: v_mov_b32_e32 v2, 0x42800000 | ||
; GFX9-GISEL-NEXT: v_cmp_eq_u32_e32 vcc, 0, v0 | ||
; GFX9-GISEL-NEXT: v_cndmask_b32_e32 v0, 1.0, v2, vcc | ||
; GFX9-GISEL-NEXT: v_mul_f32_e32 v0, v1, v0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: Implement this for globalisel
#120104) This combine pattern perform the below transformation. fmul x, select(y, A, B) -> fldexp (x, select i32 (y, a, b)) fmul x, select(y, -A, -B) -> fldexp ((fneg x), select i32 (y, a, b)) where, A=2^a & B=2^b ; a and b are integers. It is a follow-up PR to implement the above combine for globalIsel, as the corresponding DAG combine has been done for SelectionDAG Isel (#111109)
llvm#120104) This combine pattern perform the below transformation. fmul x, select(y, A, B) -> fldexp (x, select i32 (y, a, b)) fmul x, select(y, -A, -B) -> fldexp ((fneg x), select i32 (y, a, b)) where, A=2^a & B=2^b ; a and b are integers. It is a follow-up PR to implement the above combine for globalIsel, as the corresponding DAG combine has been done for SelectionDAG Isel (llvm#111109)
This adds the f32/f64/f16/bf16 test cases for below pattern : `fmul x, select(y, A, B)` with just one use of select Inst above. It acts as pre-commit tests for dagCombining above pattern into cheaper ldexp in case of non-inlline 32 bit-constants. (llvm#111109) Change-Id: Ia6a3bb41b25ca8fb3d3f5bc67c183c168d8f4ba8
…lvm#111109) The materialization cost of 32-bit non-inline in case of fmul is quite relatively more, rather than if possible to combine it into ldexp instruction for specific scenarios (for datatypes like f64, f32 and f16) as this is being handled here : The dag combine for any pair of select values which are exact exponent of 2. ``` fmul x, select(y, A, B) -> ldexp (x, select i32 (y, a, b)) fmul x, select(y, -A, -B) -> ldexp ((fneg x), select i32 (y, a, b)) where, A=2^a & B=2^b ; a and b are integers. ``` This dagCombine is handled separately in fmulCombine (newly defined in SIIselLowering), targeting fmul fusing it with select type operand into ldexp. Thus, it fixes llvm#104900. Change-Id: I1e76049d8de218329efac4c62ee3c52cd824258c
The materialization cost of 32-bit non-inline in case of fmul is quite relatively more, rather than if possible to combine it into ldexp instruction for specific scenarios (for datatypes like f64, f32 and f16) as this is being handled here :
The dag combine for any pair of select values which are exact exponent of 2.
This dagCombine is handled separately in fmulCombine (newly defined in SIIselLowering), targeting fmul fusing it with select type operand into ldexp.
Thus, it fixes #104900.