[AMDGPU] Move S_BFE lowering into RegBankCombiner #141589


Open
wants to merge 3 commits into base: users/pierre-vh/gi-kb-sbfe
14 changes: 13 additions & 1 deletion llvm/lib/Target/AMDGPU/AMDGPUCombine.td
@@ -151,6 +151,17 @@ def zext_of_shift_amount_combines : GICombineGroup<[
canonicalize_zext_lshr, canonicalize_zext_ashr, canonicalize_zext_shl
]>;

// Early select of uniform BFX into S_BFE instructions.
// These instructions encode the offset/width in a way that requires using
bitwise operations. Selecting these instructions early allows the combiner
// to potentially fold these.
class lower_uniform_bfx<Instruction bfx> : GICombineRule<
(defs root:$bfx),
(combine (bfx $dst, $src, $o, $w):$bfx, [{ return lowerUniformBFX(*${bfx}); }])>;

def lower_uniform_sbfx : lower_uniform_bfx<G_SBFX>;
def lower_uniform_ubfx : lower_uniform_bfx<G_UBFX>;
Comment on lines +154 to +163
Contributor


This needs more elaboration; needs to be clear that this can't be a mandatory lowering performed in a combiner

Contributor Author


> needs to be clear that this can't be a mandatory lowering performed in a combiner

What do you mean? I don't understand
Do you mean the combine name should avoid "lower" and be named something like bfx_to_s_bfe instead?

I just thought about this but should I also make sure ISel doesn't crash on scalar BFXs by making it select the vector version all the time & inserting copies if the inputs are SGPRs? It's not a good result but at least our ISel wouldn't crash if that combine is skipped for some reason.
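
For context on the encoding the new TableGen comment refers to: the S_BFE_* instructions take a packed second source with the offset in bits [5:0] and the width in bits [22:16]. The sketch below is an illustration only (the helper name and the driver are not code from this patch); when the offset and width are constants, the AND/SHL/OR sequence the combine emits folds down to this single immediate, which is the motivation for doing the lowering where the combiner can still see it.

#include <cstdint>
#include <cstdio>

// Packed S_BFE_* source #2: offset in bits [5:0], width in bits [22:16].
static uint32_t packBFEOperand(uint32_t Offset, uint32_t Width) {
  return (Offset & 0x3F) | (Width << 16);
}

int main() {
  // Same constants as the MIR tests in this PR: offset 7, width 5.
  std::printf("0x%X\n", (unsigned)packBFEOperand(7, 5)); // prints 0x50007
  return 0;
}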


let Predicates = [Has16BitInsts, NotHasMed3_16] in {
// For gfx8, expand f16-fmed3-as-f32 into a min/max f16 sequence. This
// saves one instruction compared to the promotion.
@@ -198,5 +209,6 @@ def AMDGPURegBankCombiner : GICombiner<
zext_trunc_fold, int_minmax_to_med3, ptr_add_immed_chain,
fp_minmax_to_clamp, fp_minmax_to_med3, fmed3_intrinsic_to_clamp,
identity_combines, redundant_and, constant_fold_cast_op,
cast_of_cast_combines, sext_trunc, zext_of_shift_amount_combines]> {
cast_of_cast_combines, sext_trunc, zext_of_shift_amount_combines,
lower_uniform_sbfx, lower_uniform_ubfx]> {
}
52 changes: 52 additions & 0 deletions llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
@@ -89,6 +89,8 @@ class AMDGPURegBankCombinerImpl : public Combiner {

void applyCanonicalizeZextShiftAmt(MachineInstr &MI, MachineInstr &Ext) const;

bool lowerUniformBFX(MachineInstr &MI) const;

private:
SIModeRegisterDefaults getMode() const;
bool getIEEE() const;
@@ -392,6 +394,56 @@ void AMDGPURegBankCombinerImpl::applyCanonicalizeZextShiftAmt(
MI.eraseFromParent();
}

bool AMDGPURegBankCombinerImpl::lowerUniformBFX(MachineInstr &MI) const {
assert(MI.getOpcode() == TargetOpcode::G_UBFX ||
MI.getOpcode() == TargetOpcode::G_SBFX);
const bool Signed = (MI.getOpcode() == TargetOpcode::G_SBFX);

Register DstReg = MI.getOperand(0).getReg();
const RegisterBank *RB = RBI.getRegBank(DstReg, MRI, TRI);
assert(RB && "No RB?");
if (RB->getID() != AMDGPU::SGPRRegBankID)
return false;

Register SrcReg = MI.getOperand(1).getReg();
Register OffsetReg = MI.getOperand(2).getReg();
Register WidthReg = MI.getOperand(3).getReg();

const LLT S32 = LLT::scalar(32);
Contributor

@shiltian, May 27, 2025


nit: is it necessary to use a const local scalar variable?

Contributor Author


Do you mean why is the const necessary? I tend to favor putting const whenever a variable won't change, even though we don't always do it

Or why using a S32 variable at all? It's to avoid the repetition of LLT::scalar(32) all over the place as I'm using that type in multiple places here. Note that this is a type (i32) and not a value.

Contributor


Having S32 is fine. My preference is not to use const on variables, but I don't know if that's actually in the coding standards.

LLT Ty = MRI.getType(DstReg);

const unsigned Opc = Ty == S32
? (Signed ? AMDGPU::S_BFE_I32 : AMDGPU::S_BFE_U32)
: (Signed ? AMDGPU::S_BFE_I64 : AMDGPU::S_BFE_U64);

// Pack the offset and width of a BFE into
// the format expected by the S_BFE_I32 / S_BFE_U32. In the second
// source, bits [5:0] contain the offset and bits [22:16] the width.

// Ensure the high bits are clear to insert the offset.
auto OffsetMask = B.buildConstant(S32, maskTrailingOnes<unsigned>(6));
auto ClampOffset = B.buildAnd(S32, OffsetReg, OffsetMask);
Comment on lines +423 to +425
Contributor


I don't think you need to do this, since the original G_*BFX node would have undefined result if the offset was out of range. The valid ranges are mentioned here:

// 0 <= lsb < lsb + width <= src bitwidth, where all values are unsigned.

Contributor Author


I'm doing it so the high bits are zero and the width doesn't get overwritten, not really to sanitize the BFX's operand
Though I don't feel strongly about it, it's indeed UB if the value is out of range so we don't really have to promise anything

Do you prefer if I remove the AND?

Contributor


Yes I'd prefer removing the AND.
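
A quick worked example of the AND-masking question in this thread (illustration only; these helpers are not part of the patch or the review): because the width is OR'd into bits [22:16], an offset with bits set above bit 5 would corrupt the width field, which is what the AND guards against. Since such offsets are already outside the defined range of G_SBFX/G_UBFX (0 <= lsb < lsb + width <= src bitwidth), dropping the AND only changes results for inputs that were undefined anyway.

#include <cstdint>

static uint32_t packNoMask(uint32_t Offset, uint32_t Width) {
  return Offset | (Width << 16);
}
static uint32_t packWithMask(uint32_t Offset, uint32_t Width) {
  return (Offset & 0x3F) | (Width << 16);
}
// packNoMask(0x10007, 5)   == 0x60007 -> the width field now reads 6, not 5
// packWithMask(0x10007, 5) == 0x50007 -> the width field is preserved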


// Zeros out the low bits, so don't bother clamping the input value.
auto ShiftAmt = B.buildConstant(S32, 16);
auto ShiftWidth = B.buildShl(S32, WidthReg, ShiftAmt);

auto MergedInputs = B.buildOr(S32, ClampOffset, ShiftWidth);

MRI.setRegBank(OffsetMask.getReg(0), *RB);
MRI.setRegBank(ClampOffset.getReg(0), *RB);
MRI.setRegBank(ShiftAmt.getReg(0), *RB);
MRI.setRegBank(ShiftWidth.getReg(0), *RB);
MRI.setRegBank(MergedInputs.getReg(0), *RB);

auto MIB = B.buildInstr(Opc, {DstReg}, {SrcReg, MergedInputs});
if (!constrainSelectedInstRegOperands(*MIB, TII, TRI, RBI))
llvm_unreachable("failed to constrain BFE");

MI.eraseFromParent();
return true;
}

SIModeRegisterDefaults AMDGPURegBankCombinerImpl::getMode() const {
return MF.getInfo<SIMachineFunctionInfo>()->getMode();
}
125 changes: 55 additions & 70 deletions llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -1492,88 +1492,73 @@ bool AMDGPURegisterBankInfo::applyMappingBFE(MachineIRBuilder &B,
Register WidthReg = MI.getOperand(FirstOpnd + 2).getReg();

const RegisterBank *DstBank =
OpdMapper.getInstrMapping().getOperandMapping(0).BreakDown[0].RegBank;
if (DstBank == &AMDGPU::VGPRRegBank) {
if (Ty == S32)
return true;

// There is no 64-bit vgpr bitfield extract instructions so the operation
// is expanded to a sequence of instructions that implement the operation.
ApplyRegBankMapping ApplyBank(B, *this, MRI, &AMDGPU::VGPRRegBank);
OpdMapper.getInstrMapping().getOperandMapping(0).BreakDown[0].RegBank;

const LLT S64 = LLT::scalar(64);
// Shift the source operand so that extracted bits start at bit 0.
auto ShiftOffset = Signed ? B.buildAShr(S64, SrcReg, OffsetReg)
: B.buildLShr(S64, SrcReg, OffsetReg);
auto UnmergeSOffset = B.buildUnmerge({S32, S32}, ShiftOffset);

// A 64-bit bitfield extract uses the 32-bit bitfield extract instructions
// if the width is a constant.
if (auto ConstWidth = getIConstantVRegValWithLookThrough(WidthReg, MRI)) {
// Use the 32-bit bitfield extract instruction if the width is a constant.
// Depending on the width size, use either the low or high 32-bits.
auto Zero = B.buildConstant(S32, 0);
auto WidthImm = ConstWidth->Value.getZExtValue();
if (WidthImm <= 32) {
// Use bitfield extract on the lower 32-bit source, and then sign-extend
// or clear the upper 32-bits.
auto Extract =
Signed ? B.buildSbfx(S32, UnmergeSOffset.getReg(0), Zero, WidthReg)
: B.buildUbfx(S32, UnmergeSOffset.getReg(0), Zero, WidthReg);
auto Extend =
Signed ? B.buildAShr(S32, Extract, B.buildConstant(S32, 31)) : Zero;
B.buildMergeLikeInstr(DstReg, {Extract, Extend});
} else {
// Use bitfield extract on upper 32-bit source, and combine with lower
// 32-bit source.
auto UpperWidth = B.buildConstant(S32, WidthImm - 32);
auto Extract =
Signed
? B.buildSbfx(S32, UnmergeSOffset.getReg(1), Zero, UpperWidth)
: B.buildUbfx(S32, UnmergeSOffset.getReg(1), Zero, UpperWidth);
B.buildMergeLikeInstr(DstReg, {UnmergeSOffset.getReg(0), Extract});
}
MI.eraseFromParent();
if (DstBank != &AMDGPU::VGPRRegBank) {
// SGPR: Canonicalize to a G_S/UBFX
if (!isa<GIntrinsic>(MI))
return true;
}

// Expand to Src >> Offset << (64 - Width) >> (64 - Width) using 64-bit
// operations.
auto ExtShift = B.buildSub(S32, B.buildConstant(S32, 64), WidthReg);
auto SignBit = B.buildShl(S64, ShiftOffset, ExtShift);
ApplyRegBankMapping ApplyBank(B, *this, MRI, &AMDGPU::SGPRRegBank);
if (Signed)
B.buildAShr(S64, SignBit, ExtShift);
B.buildSbfx(DstReg, SrcReg, OffsetReg, WidthReg);
else
B.buildLShr(S64, SignBit, ExtShift);
B.buildUbfx(DstReg, SrcReg, OffsetReg, WidthReg);
MI.eraseFromParent();
return true;
}

// The scalar form packs the offset and width in a single operand.

ApplyRegBankMapping ApplyBank(B, *this, MRI, &AMDGPU::SGPRRegBank);

// Ensure the high bits are clear to insert the offset.
auto OffsetMask = B.buildConstant(S32, maskTrailingOnes<unsigned>(6));
auto ClampOffset = B.buildAnd(S32, OffsetReg, OffsetMask);

// Zeros out the low bits, so don't bother clamping the input value.
auto ShiftWidth = B.buildShl(S32, WidthReg, B.buildConstant(S32, 16));

// Transformation function, pack the offset and width of a BFE into
// the format expected by the S_BFE_I32 / S_BFE_U32. In the second
// source, bits [5:0] contain the offset and bits [22:16] the width.
auto MergedInputs = B.buildOr(S32, ClampOffset, ShiftWidth);
// VGPR
if (Ty == S32)
return true;

// TODO: It might be worth using a pseudo here to avoid scc clobber and
// register class constraints.
unsigned Opc = Ty == S32 ? (Signed ? AMDGPU::S_BFE_I32 : AMDGPU::S_BFE_U32) :
(Signed ? AMDGPU::S_BFE_I64 : AMDGPU::S_BFE_U64);
// There is no 64-bit vgpr bitfield extract instructions so the operation
// is expanded to a sequence of instructions that implement the operation.
ApplyRegBankMapping ApplyBank(B, *this, MRI, &AMDGPU::VGPRRegBank);

auto MIB = B.buildInstr(Opc, {DstReg}, {SrcReg, MergedInputs});
if (!constrainSelectedInstRegOperands(*MIB, *TII, *TRI, *this))
llvm_unreachable("failed to constrain BFE");
const LLT S64 = LLT::scalar(64);
// Shift the source operand so that extracted bits start at bit 0.
auto ShiftOffset = Signed ? B.buildAShr(S64, SrcReg, OffsetReg)
: B.buildLShr(S64, SrcReg, OffsetReg);
auto UnmergeSOffset = B.buildUnmerge({S32, S32}, ShiftOffset);

// A 64-bit bitfield extract uses the 32-bit bitfield extract instructions
// if the width is a constant.
if (auto ConstWidth = getIConstantVRegValWithLookThrough(WidthReg, MRI)) {
// Use the 32-bit bitfield extract instruction if the width is a constant.
// Depending on the width size, use either the low or high 32-bits.
auto Zero = B.buildConstant(S32, 0);
auto WidthImm = ConstWidth->Value.getZExtValue();
if (WidthImm <= 32) {
// Use bitfield extract on the lower 32-bit source, and then sign-extend
// or clear the upper 32-bits.
auto Extract =
Signed ? B.buildSbfx(S32, UnmergeSOffset.getReg(0), Zero, WidthReg)
: B.buildUbfx(S32, UnmergeSOffset.getReg(0), Zero, WidthReg);
auto Extend =
Signed ? B.buildAShr(S32, Extract, B.buildConstant(S32, 31)) : Zero;
B.buildMergeLikeInstr(DstReg, {Extract, Extend});
} else {
// Use bitfield extract on upper 32-bit source, and combine with lower
// 32-bit source.
auto UpperWidth = B.buildConstant(S32, WidthImm - 32);
auto Extract =
Signed ? B.buildSbfx(S32, UnmergeSOffset.getReg(1), Zero, UpperWidth)
: B.buildUbfx(S32, UnmergeSOffset.getReg(1), Zero, UpperWidth);
B.buildMergeLikeInstr(DstReg, {UnmergeSOffset.getReg(0), Extract});
}
MI.eraseFromParent();
return true;
}

// Expand to Src >> Offset << (64 - Width) >> (64 - Width) using 64-bit
// operations.
auto ExtShift = B.buildSub(S32, B.buildConstant(S32, 64), WidthReg);
auto SignBit = B.buildShl(S64, ShiftOffset, ExtShift);
if (Signed)
B.buildAShr(S64, SignBit, ExtShift);
else
B.buildLShr(S64, SignBit, ExtShift);
MI.eraseFromParent();
return true;
}
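
As a reference for the variable-width fallback kept at the end of applyMappingBFE, the expansion Src >> Offset << (64 - Width) >> (64 - Width) corresponds to the scalar model below. This is a sketch under the assumptions 0 < Width and Offset + Width <= 64 (the defined range of G_SBFX/G_UBFX) and the usual two's-complement arithmetic right shift for signed types; it is not code from the patch.

#include <cstdint>

// Signed form: the final shift is arithmetic, so it sign-extends the field.
static int64_t sbfx64Model(int64_t Src, unsigned Offset, unsigned Width) {
  int64_t Shifted = Src >> Offset;                     // field now starts at bit 0
  unsigned Inv = 64 - Width;
  uint64_t Up = static_cast<uint64_t>(Shifted) << Inv; // shift left as unsigned to avoid UB
  return static_cast<int64_t>(Up) >> Inv;              // arithmetic shift back down
}

// Unsigned form: logical shifts throughout, so the field is zero-extended.
static uint64_t ubfx64Model(uint64_t Src, unsigned Offset, unsigned Width) {
  unsigned Inv = 64 - Width;
  return ((Src >> Offset) << Inv) >> Inv;
}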
107 changes: 107 additions & 0 deletions llvm/test/CodeGen/AMDGPU/GlobalISel/regbankcombiner-lower-bfx.mir
@@ -0,0 +1,107 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
# RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1010 -run-pass=amdgpu-regbank-combiner -verify-machineinstrs %s -o - | FileCheck %s

---
name: test_s_bfe_i32__constants
legalized: true
regBankSelected: true
tracksRegLiveness: true
body: |
bb.1:
liveins: $sgpr0

; CHECK-LABEL: name: test_s_bfe_i32__constants
; CHECK: liveins: $sgpr0
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: %reg:sreg_32(s32) = COPY $sgpr0
; CHECK-NEXT: %width:sgpr(s32) = G_CONSTANT i32 5
; CHECK-NEXT: %offset:sgpr(s32) = G_CONSTANT i32 7
; CHECK-NEXT: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 16
; CHECK-NEXT: [[SHL:%[0-9]+]]:sgpr(s32) = G_SHL %width, [[C]](s32)
; CHECK-NEXT: [[OR:%[0-9]+]]:sreg_32(s32) = G_OR %offset, [[SHL]]
; CHECK-NEXT: %bfx:sreg_32(s32) = S_BFE_I32 %reg(s32), [[OR]](s32), implicit-def $scc
; CHECK-NEXT: $sgpr0 = COPY %bfx(s32)
%reg:sgpr(s32) = COPY $sgpr0
%width:sgpr(s32) = G_CONSTANT i32 5
%offset:sgpr(s32) = G_CONSTANT i32 7
%bfx:sgpr(s32) = G_SBFX %reg, %offset, %width
$sgpr0 = COPY %bfx
...
---
name: test_s_bfe_u32__constants
legalized: true
regBankSelected: true
tracksRegLiveness: true
body: |
bb.1:
liveins: $sgpr0

; CHECK-LABEL: name: test_s_bfe_u32__constants
; CHECK: liveins: $sgpr0
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: %reg:sreg_32(s32) = COPY $sgpr0
; CHECK-NEXT: %width:sgpr(s32) = G_CONSTANT i32 5
; CHECK-NEXT: %offset:sgpr(s32) = G_CONSTANT i32 7
; CHECK-NEXT: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 16
; CHECK-NEXT: [[SHL:%[0-9]+]]:sgpr(s32) = G_SHL %width, [[C]](s32)
; CHECK-NEXT: [[OR:%[0-9]+]]:sreg_32(s32) = G_OR %offset, [[SHL]]
; CHECK-NEXT: %bfx:sreg_32(s32) = S_BFE_U32 %reg(s32), [[OR]](s32), implicit-def $scc
; CHECK-NEXT: $sgpr0 = COPY %bfx(s32)
%reg:sgpr(s32) = COPY $sgpr0
%width:sgpr(s32) = G_CONSTANT i32 5
%offset:sgpr(s32) = G_CONSTANT i32 7
%bfx:sgpr(s32) = G_UBFX %reg, %offset, %width
$sgpr0 = COPY %bfx
...
---
name: test_s_bfe_i64__constants
legalized: true
regBankSelected: true
tracksRegLiveness: true
body: |
bb.1:
liveins: $sgpr0_sgpr1

; CHECK-LABEL: name: test_s_bfe_i64__constants
; CHECK: liveins: $sgpr0_sgpr1
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: %reg:sreg_64(s64) = COPY $sgpr0_sgpr1
; CHECK-NEXT: %width:sgpr(s32) = G_CONSTANT i32 5
; CHECK-NEXT: %offset:sgpr(s32) = G_CONSTANT i32 7
; CHECK-NEXT: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 16
; CHECK-NEXT: [[SHL:%[0-9]+]]:sgpr(s32) = G_SHL %width, [[C]](s32)
; CHECK-NEXT: [[OR:%[0-9]+]]:sreg_32(s32) = G_OR %offset, [[SHL]]
; CHECK-NEXT: %bfx:sreg_64(s64) = S_BFE_I64 %reg(s64), [[OR]](s32), implicit-def $scc
; CHECK-NEXT: $sgpr0_sgpr1 = COPY %bfx(s64)
%reg:sgpr(s64) = COPY $sgpr0_sgpr1
%width:sgpr(s32) = G_CONSTANT i32 5
%offset:sgpr(s32) = G_CONSTANT i32 7
%bfx:sgpr(s64) = G_SBFX %reg, %offset, %width
$sgpr0_sgpr1 = COPY %bfx
...
---
name: test_s_bfe_u64__constants
legalized: true
regBankSelected: true
tracksRegLiveness: true
body: |
bb.1:
liveins: $sgpr0_sgpr1

; CHECK-LABEL: name: test_s_bfe_u64__constants
; CHECK: liveins: $sgpr0_sgpr1
; CHECK-NEXT: {{ $}}
; CHECK-NEXT: %reg:sreg_64(s64) = COPY $sgpr0_sgpr1
; CHECK-NEXT: %width:sgpr(s32) = G_CONSTANT i32 5
; CHECK-NEXT: %offset:sgpr(s32) = G_CONSTANT i32 7
; CHECK-NEXT: [[C:%[0-9]+]]:sgpr(s32) = G_CONSTANT i32 16
; CHECK-NEXT: [[SHL:%[0-9]+]]:sgpr(s32) = G_SHL %width, [[C]](s32)
; CHECK-NEXT: [[OR:%[0-9]+]]:sreg_32(s32) = G_OR %offset, [[SHL]]
; CHECK-NEXT: %bfx:sreg_64(s64) = S_BFE_U64 %reg(s64), [[OR]](s32), implicit-def $scc
; CHECK-NEXT: $sgpr0_sgpr1 = COPY %bfx(s64)
%reg:sgpr(s64) = COPY $sgpr0_sgpr1
%width:sgpr(s32) = G_CONSTANT i32 5
%offset:sgpr(s32) = G_CONSTANT i32 7
%bfx:sgpr(s64) = G_UBFX %reg, %offset, %width
$sgpr0_sgpr1 = COPY %bfx
...