Commit 8665929

[LV, VP] VP intrinsics support for the Loop Vectorizer
This patch introduces generating VP intrinsics in the Loop Vectorizer.

Currently the Loop Vectorizer supports vector predication in a very limited capacity, via tail-folding and masked load/store/gather/scatter intrinsics. However, this does not let architectures with active vector length predication support take advantage of their capabilities, and architectures with general masked predication support can only take advantage of predication on memory operations. By giving the Loop Vectorizer a way to generate Vector Predication intrinsics, which (will) provide a target-independent way to model predicated vector instructions, these architectures can make better use of their predication capabilities.

Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV, but instead of generating masked intrinsics for memory operations it generates VP intrinsics for load/store instructions. The other important part of this approach is how the Explicit Vector Length is computed. (We use "active vector length" and "explicit vector length" interchangeably; VP intrinsics define this vector length parameter as the Explicit Vector Length (EVL).) We considered the following three ways to compute the EVL parameter for the VP intrinsics:

- The simplest way is to use the VF as the EVL and rely solely on the mask parameter to control predication. The mask parameter is the same as the one computed for the current tail-folding implementation.
- The second way is to insert instructions that compute min(VF, trip_count - index) for each vector iteration.
- For architectures like RISC-V, which have a special instruction to compute/set an explicit vector length, we also introduce an experimental intrinsic, get_vector_length, that can be lowered to architecture-specific instruction(s) to compute the EVL (see the IR sketch after this message).

We also added a new recipe to emit the instructions for computing the EVL. Using VPlan in this way will eventually help build and compare VPlans corresponding to different strategies and alternatives.

=== Tentative Development Roadmap ===

* Use VP intrinsics for all possible vector operations. That work has two possible implementations:
  1. Introduce a new pass which transforms the emitted vector instructions into VP intrinsics if the loop was transformed to use predication for loads/stores. The advantage of this approach is that it does not require many changes in the loop vectorizer itself. The disadvantage is that it may require copying some existing functionality from the loop vectorizer into a separate pass, keeping similar code in different passes, and performing the same analysis at least twice.
  2. Extend the Loop Vectorizer using VectorBuilder and make it emit VP intrinsics automatically in the presence of an EVL value. The advantage is that it does not require a separate pass and may therefore reduce compile time, and we can avoid code duplication. It requires some extra work in the Loop Vectorizer to add VectorBuilder support and smart emission of vector instructions/VP intrinsics. To fully support the Loop Vectorizer it will also require adding a new PHI recipe to handle the EVL of the previous iteration, plus extending several existing recipes with new operands (depending on the design).
* Switch to VP intrinsics for memory operations for both VLS and VLA vectorization.

Differential Revision: https://reviews.llvm.org/D99750
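
To make the EVL strategy concrete, the following is a minimal IR sketch (illustrative only, not taken from this patch or its tests) of a tail-folded copy loop using the third strategy above, with a VF of vscale x 4 and an all-true mask so that predication is carried entirely by the EVL operand; the function and value names are assumptions made for this example:

declare i32 @llvm.experimental.get.vector.length.i64(i64, i32 immarg, i1 immarg)
declare <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr, <vscale x 4 x i1>, i32)
declare void @llvm.vp.store.nxv4i32.p0(<vscale x 4 x i32>, ptr, <vscale x 4 x i1>, i32)

define void @copy(ptr noalias %dst, ptr noalias %src, i64 %n) {
entry:
  ; All-true scalable mask; the EVL operand alone folds the tail.
  %head = insertelement <vscale x 4 x i1> poison, i1 true, i32 0
  %mask = shufflevector <vscale x 4 x i1> %head, <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer
  br label %loop

loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  ; EVL = number of lanes to process this iteration (at most vscale x 4);
  ; on RISC-V this can be lowered to a vsetvli.
  %remaining = sub i64 %n, %i
  %evl = call i32 @llvm.experimental.get.vector.length.i64(i64 %remaining, i32 4, i1 true)
  %src.gep = getelementptr inbounds i32, ptr %src, i64 %i
  %v = call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr %src.gep, <vscale x 4 x i1> %mask, i32 %evl)
  %dst.gep = getelementptr inbounds i32, ptr %dst, i64 %i
  call void @llvm.vp.store.nxv4i32.p0(<vscale x 4 x i32> %v, ptr %dst.gep, <vscale x 4 x i1> %mask, i32 %evl)
  ; The induction variable advances by EVL, not by VF, so no scalar
  ; epilogue is needed.
  %evl.zext = zext i32 %evl to i64
  %i.next = add nuw i64 %i, %evl.zext
  %cont = icmp ult i64 %i.next, %n
  br i1 %cont, label %loop, label %exit

exit:
  ret void
}
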
1 parent a08506e · commit 8665929

24 files changed: +1648 -32 lines

llvm/include/llvm/Analysis/TargetTransformInfo.h

Lines changed: 4 additions & 1 deletion
@@ -190,7 +190,10 @@ enum class TailFoldingStyle {
   /// Use predicate to control both data and control flow, but modify
   /// the trip count so that a runtime overflow check can be avoided
   /// and such that the scalar epilogue loop can always be removed.
-  DataAndControlFlowWithoutRuntimeCheck
+  DataAndControlFlowWithoutRuntimeCheck,
+  /// Use predicated EVL instructions for tail-folding.
+  /// Indicates that VP intrinsics should be used if tail-folding is enabled.
+  DataWithEVL,
 };

 struct TailFoldingInfo {

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

Lines changed: 4 additions & 0 deletions
@@ -228,6 +228,10 @@ RISCVTTIImpl::getIntImmCostIntrin(Intrinsic::ID IID, unsigned Idx,
   return TTI::TCC_Free;
 }

+bool RISCVTTIImpl::hasActiveVectorLength(unsigned, Type *DataTy, Align) const {
+  return ST->hasVInstructions();
+}
+
 TargetTransformInfo::PopcntSupportKind
 RISCVTTIImpl::getPopcntSupport(unsigned TyWidth) {
   assert(isPowerOf2_32(TyWidth) && "Ty width must be power of 2");

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h

Lines changed: 16 additions & 0 deletions
@@ -75,6 +75,22 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
                                   const APInt &Imm, Type *Ty,
                                   TTI::TargetCostKind CostKind);

+  /// \name Vector Predication Information
+  /// Whether the target supports the %evl parameter of VP intrinsics
+  /// efficiently in hardware, for the given opcode and type/alignment. (See
+  /// LLVM Language Reference - "Vector Predication Intrinsics",
+  /// https://llvm.org/docs/LangRef.html#vector-predication-intrinsics and
+  /// "IR-level VP intrinsics",
+  /// https://llvm.org/docs/Proposals/VectorPredication.html#ir-level-vp-intrinsics).
+  /// \param Opcode the opcode of the instruction checked for predicated
+  /// version support.
+  /// \param DataType the type of the instruction with the \p Opcode checked
+  /// for predication support.
+  /// \param Alignment the alignment for memory access operation checked for
+  /// predicated version support.
+  bool hasActiveVectorLength(unsigned Opcode, Type *DataType,
+                             Align Alignment) const;
+
   TargetTransformInfo::PopcntSupportKind getPopcntSupport(unsigned TyWidth);

   bool shouldExpandReduction(const IntrinsicInst *II) const;

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Lines changed: 144 additions & 9 deletions
@@ -123,6 +123,7 @@
 #include "llvm/IR/User.h"
 #include "llvm/IR/Value.h"
 #include "llvm/IR/ValueHandle.h"
+#include "llvm/IR/VectorBuilder.h"
 #include "llvm/IR/Verifier.h"
 #include "llvm/Support/Casting.h"
 #include "llvm/Support/CommandLine.h"
@@ -247,10 +248,12 @@ static cl::opt<TailFoldingStyle> ForceTailFoldingStyle(
         clEnumValN(TailFoldingStyle::DataAndControlFlow, "data-and-control",
                    "Create lane mask using active.lane.mask intrinsic, and use "
                    "it for both data and control flow"),
-        clEnumValN(
-            TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck,
-            "data-and-control-without-rt-check",
-            "Similar to data-and-control, but remove the runtime check")));
+        clEnumValN(TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck,
+                   "data-and-control-without-rt-check",
+                   "Similar to data-and-control, but remove the runtime check"),
+        clEnumValN(TailFoldingStyle::DataWithEVL, "data-with-evl",
+                   "Use predicated EVL instructions for tail folding if the "
+                   "target supports vector length predication")));

 static cl::opt<bool> MaximizeBandwidth(
     "vectorizer-maximize-bandwidth", cl::init(false), cl::Hidden,
@@ -1098,9 +1101,7 @@ void InnerLoopVectorizer::collectPoisonGeneratingRecipes(
       // handled.
       if (isa<VPWidenMemoryInstructionRecipe>(CurRec) ||
           isa<VPInterleaveRecipe>(CurRec) ||
-          isa<VPScalarIVStepsRecipe>(CurRec) ||
-          isa<VPCanonicalIVPHIRecipe>(CurRec) ||
-          isa<VPActiveLaneMaskPHIRecipe>(CurRec))
+          isa<VPScalarIVStepsRecipe>(CurRec) || isa<VPHeaderPHIRecipe>(CurRec))
         continue;

       // This recipe contributes to the address computation of a widen
@@ -1640,6 +1641,23 @@ class LoopVectorizationCostModel {
     return foldTailByMasking() || Legal->blockNeedsPredication(BB);
   }

+  /// Returns true if VP intrinsics with explicit vector length support should
+  /// be generated in the tail folded loop.
+  bool useVPIWithVPEVLVectorization() const {
+    return PreferEVL && !EnableVPlanNativePath &&
+           getTailFoldingStyle() == TailFoldingStyle::DataWithEVL &&
+           // FIXME: implement support for max safe dependency distance.
+           Legal->isSafeForAnyVectorWidth() &&
+           // FIXME: remove this once reductions are supported.
+           Legal->getReductionVars().empty() &&
+           // FIXME: remove this once vp_reverse is supported.
+           none_of(
+               WideningDecisions,
+               [](const std::pair<std::pair<Instruction *, ElementCount>,
+                                  std::pair<InstWidening, InstructionCost>>
+                      &Data) { return Data.second.first == CM_Widen_Reverse; });
+  }
+
   /// Returns true if the Phi is part of an inloop reduction.
   bool isInLoopReduction(PHINode *Phi) const {
     return InLoopReductions.contains(Phi);
@@ -1785,6 +1803,10 @@ class LoopVectorizationCostModel {
   /// All blocks of loop are to be masked to fold tail of scalar iterations.
   bool CanFoldTailByMasking = false;

+  /// Control whether to generate VP intrinsics with explicit-vector-length
+  /// support in vectorized code.
+  bool PreferEVL = false;
+
   /// A map holding scalar costs for different vectorization factors. The
   /// presence of a cost for an instruction in the mapping indicates that the
   /// instruction will be scalarized when vectorizing with the associated
@@ -4691,6 +4713,39 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
   // FIXME: look for a smaller MaxVF that does divide TC rather than masking.
   if (Legal->prepareToFoldTailByMasking()) {
     CanFoldTailByMasking = true;
+    if (getTailFoldingStyle() == TailFoldingStyle::None)
+      return MaxFactors;
+
+    if (UserIC > 1) {
+      LLVM_DEBUG(dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                           "not generate VP intrinsics since interleave count "
+                           "specified is greater than 1.\n");
+      return MaxFactors;
+    }
+
+    if (MaxFactors.ScalableVF.isVector()) {
+      assert(MaxFactors.ScalableVF.isScalable() &&
+             "Expected scalable vector factor.");
+      // FIXME: use actual opcode/data type for analysis here.
+      PreferEVL = getTailFoldingStyle() == TailFoldingStyle::DataWithEVL &&
+                  TTI.hasActiveVectorLength(0, nullptr, Align());
+#if !NDEBUG
+      if (getTailFoldingStyle() == TailFoldingStyle::DataWithEVL) {
+        if (PreferEVL)
+          dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                    "try to generate VP Intrinsics.\n";
+        else
+          dbgs() << "LV: Preference for VP intrinsics indicated. Will "
+                    "not try to generate VP Intrinsics since the target "
+                    "does not support vector length predication.\n";
+      }
+#endif // !NDEBUG
+
+      // Tail folded loop using VP intrinsics restricts the VF to be scalable.
+      if (PreferEVL)
+        MaxFactors.FixedVF = ElementCount::getFixed(1);
+    }
+
     return MaxFactors;
   }

@@ -5300,6 +5355,10 @@ LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
   if (!isScalarEpilogueAllowed())
     return 1;

+  // Do not interleave if EVL is preferred and no User IC is specified.
+  if (useVPIWithVPEVLVectorization())
+    return 1;
+
   // We used the distance for the interleave count.
   if (!Legal->isSafeForAnyVectorWidth())
     return 1;
@@ -8537,6 +8596,8 @@ void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,
     VPlanTransforms::truncateToMinimalBitwidths(
         *Plan, CM.getMinimalBitwidths(), PSE.getSE()->getContext());
     VPlanTransforms::optimize(*Plan, *PSE.getSE());
+    if (CM.useVPIWithVPEVLVectorization())
+      VPlanTransforms::addExplicitVectorLength(*Plan);
     assert(VPlanVerifier::verifyPlanIsValid(*Plan) && "VPlan is invalid");
     VPlans.push_back(std::move(Plan));
   }
@@ -9399,6 +9460,52 @@ void VPReplicateRecipe::execute(VPTransformState &State) {
     State.ILV->scalarizeInstruction(UI, this, VPIteration(Part, Lane), State);
 }

+/// Creates either vp_store or vp_scatter intrinsics calls to represent
+/// predicated store/scatter.
+static Instruction *
+lowerStoreUsingVectorIntrinsics(IRBuilderBase &Builder, Value *Addr,
+                                Value *StoredVal, bool IsScatter, Value *Mask,
+                                Value *EVLPart, const Align &Alignment) {
+  CallInst *Call;
+  if (IsScatter) {
+    Call = Builder.CreateIntrinsic(Type::getVoidTy(EVLPart->getContext()),
+                                   Intrinsic::vp_scatter,
+                                   {StoredVal, Addr, Mask, EVLPart});
+  } else {
+    VectorBuilder VBuilder(Builder);
+    VBuilder.setEVL(EVLPart).setMask(Mask);
+    Call = cast<CallInst>(VBuilder.createVectorInstruction(
+        Instruction::Store, Type::getVoidTy(EVLPart->getContext()),
+        {StoredVal, Addr}));
+  }
+  Call->addParamAttr(
+      1, Attribute::getWithAlignment(Call->getContext(), Alignment));
+  return Call;
+}
+
+/// Creates either vp_load or vp_gather intrinsics calls to represent
+/// predicated load/gather.
+static Instruction *lowerLoadUsingVectorIntrinsics(IRBuilderBase &Builder,
+                                                   VectorType *DataTy,
+                                                   Value *Addr, bool IsGather,
+                                                   Value *Mask, Value *EVLPart,
+                                                   const Align &Alignment) {
+  CallInst *Call;
+  if (IsGather) {
+    Call = Builder.CreateIntrinsic(DataTy, Intrinsic::vp_gather,
+                                   {Addr, Mask, EVLPart}, nullptr,
+                                   "wide.masked.gather");
+  } else {
+    VectorBuilder VBuilder(Builder);
+    VBuilder.setEVL(EVLPart).setMask(Mask);
+    Call = cast<CallInst>(VBuilder.createVectorInstruction(
+        Instruction::Load, DataTy, Addr, "vp.op.load"));
+  }
+  Call->addParamAttr(
+      0, Attribute::getWithAlignment(Call->getContext(), Alignment));
+  return Call;
+}
+
 void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
   VPValue *StoredValue = isStore() ? getStoredValue() : nullptr;

@@ -9430,14 +9537,31 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
     }
   }

+  auto MaskValue = [&](unsigned Part) -> Value * {
+    if (isMaskRequired)
+      return BlockInMaskParts[Part];
+    return nullptr;
+  };
+
   // Handle Stores:
   if (SI) {
     State.setDebugLocFrom(SI->getDebugLoc());

     for (unsigned Part = 0; Part < State.UF; ++Part) {
       Instruction *NewSI = nullptr;
       Value *StoredVal = State.get(StoredValue, Part);
-      if (CreateGatherScatter) {
+      if (State.EVL) {
+        Value *EVLPart = State.get(State.EVL, Part);
+        // If EVL is not nullptr, then EVL must be a valid value set during plan
+        // creation, possibly default value = whole vector register length. EVL
+        // is created only if TTI prefers predicated vectorization, thus if EVL
+        // is not nullptr it also implies preference for predicated
+        // vectorization.
+        // FIXME: Support reverse store after vp_reverse is added.
+        NewSI = lowerStoreUsingVectorIntrinsics(
+            Builder, State.get(getAddr(), Part), StoredVal, CreateGatherScatter,
+            MaskValue(Part), EVLPart, Alignment);
+      } else if (CreateGatherScatter) {
         Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
         Value *VectorGep = State.get(getAddr(), Part);
         NewSI = Builder.CreateMaskedScatter(StoredVal, VectorGep, Alignment,
@@ -9467,7 +9591,18 @@ void VPWidenMemoryInstructionRecipe::execute(VPTransformState &State) {
     State.setDebugLocFrom(LI->getDebugLoc());
     for (unsigned Part = 0; Part < State.UF; ++Part) {
       Value *NewLI;
-      if (CreateGatherScatter) {
+      if (State.EVL) {
+        Value *EVLPart = State.get(State.EVL, Part);
+        // If EVL is not nullptr, then EVL must be a valid value set during plan
+        // creation, possibly default value = whole vector register length. EVL
+        // is created only if TTI prefers predicated vectorization, thus if EVL
+        // is not nullptr it also implies preference for predicated
+        // vectorization.
+        // FIXME: Support reverse loading after vp_reverse is added.
+        NewLI = lowerLoadUsingVectorIntrinsics(
+            Builder, DataTy, State.get(getAddr(), Part), CreateGatherScatter,
+            MaskValue(Part), EVLPart, Alignment);
+      } else if (CreateGatherScatter) {
         Value *MaskPart = isMaskRequired ? BlockInMaskParts[Part] : nullptr;
         Value *VectorGep = State.get(getAddr(), Part);
         NewLI = Builder.CreateMaskedGather(DataTy, VectorGep, Alignment, MaskPart,
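
For reference, a hedged sketch of the IR these two helpers emit for the contiguous (non-gather/scatter) case; the value names and types are illustrative, but the "vp.op.load" name and the alignment parameter attribute on the pointer operand (parameter 0 for vp.load, parameter 1 for vp.store) follow from the addParamAttr calls above:

declare <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr, <vscale x 4 x i1>, i32)
declare void @llvm.vp.store.nxv4i32.p0(<vscale x 4 x i32>, ptr, <vscale x 4 x i1>, i32)

define <vscale x 4 x i32> @sketch(ptr %addr, <vscale x 4 x i32> %val, <vscale x 4 x i1> %mask, i32 %evl) {
  ; As emitted by lowerLoadUsingVectorIntrinsics for a unit-stride load.
  %vp.op.load = call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr align 4 %addr, <vscale x 4 x i1> %mask, i32 %evl)
  ; As emitted by lowerStoreUsingVectorIntrinsics for a unit-stride store.
  call void @llvm.vp.store.nxv4i32.p0(<vscale x 4 x i32> %val, ptr align 4 %addr, <vscale x 4 x i1> %mask, i32 %evl)
  ret <vscale x 4 x i32> %vp.op.load
}
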

llvm/lib/Transforms/Vectorize/VPlan.h

Lines changed: 43 additions & 0 deletions
@@ -242,6 +242,12 @@ struct VPTransformState {
   ElementCount VF;
   unsigned UF;

+  /// If EVL is not nullptr, then EVL must be a valid value set during plan
+  /// creation, possibly a default value = whole vector register length. EVL is
+  /// created only if TTI prefers predicated vectorization, thus if EVL is
+  /// not nullptr it also implies preference for predicated vectorization.
+  VPValue *EVL = nullptr;
+
   /// Hold the indices to generate specific scalar instructions. Null indicates
   /// that all instances are to be generated, using either scalar or vector
   /// instructions.
@@ -1069,6 +1075,8 @@ class VPInstruction : public VPRecipeWithIRFlags, public VPValue {
     SLPLoad,
     SLPStore,
     ActiveLaneMask,
+    ExplicitVectorLength,
+    ExplicitVectorLengthIVIncrement,
     CalculateTripCountMinusVF,
     // Increment the canonical IV separately for each unrolled part.
     CanonicalIVIncrementForPart,
@@ -1178,6 +1186,8 @@ class VPInstruction : public VPRecipeWithIRFlags, public VPValue {
     default:
       return false;
     case VPInstruction::ActiveLaneMask:
+    case VPInstruction::ExplicitVectorLength:
+    case VPInstruction::ExplicitVectorLengthIVIncrement:
     case VPInstruction::CalculateTripCountMinusVF:
     case VPInstruction::CanonicalIVIncrementForPart:
     case VPInstruction::BranchOnCount:
@@ -2224,6 +2234,39 @@ class VPActiveLaneMaskPHIRecipe : public VPHeaderPHIRecipe {
 #endif
 };

+/// A recipe for generating the phi node for the current index of elements,
+/// adjusted in accordance with EVL value. It starts at StartIV value and gets
+/// incremented by EVL in each iteration of the vector loop.
+class VPEVLBasedIVPHIRecipe : public VPHeaderPHIRecipe {
+public:
+  VPEVLBasedIVPHIRecipe(VPValue *StartMask, DebugLoc DL)
+      : VPHeaderPHIRecipe(VPDef::VPEVLBasedIVPHISC, nullptr, StartMask, DL) {}
+
+  ~VPEVLBasedIVPHIRecipe() override = default;
+
+  VP_CLASSOF_IMPL(VPDef::VPEVLBasedIVPHISC)
+
+  static inline bool classof(const VPHeaderPHIRecipe *D) {
+    return D->getVPDefID() == VPDef::VPEVLBasedIVPHISC;
+  }
+
+  /// Generate phi for handling IV based on EVL over iterations correctly.
+  void execute(VPTransformState &State) override;
+
+  /// Returns true if the recipe only uses the first lane of operand \p Op.
+  bool onlyFirstLaneUsed(const VPValue *Op) const override {
+    assert(is_contained(operands(), Op) &&
+           "Op must be an operand of the recipe");
+    return true;
+  }
+
+#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
+  /// Print the recipe.
+  void print(raw_ostream &O, const Twine &Indent,
+             VPSlotTracker &SlotTracker) const override;
+#endif
+};
+
 /// A Recipe for widening the canonical induction variable of the vector loop.
 class VPWidenCanonicalIVRecipe : public VPRecipeBase, public VPValue {
 public:
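
To map the new recipe and opcodes onto concrete IR, here is a hedged sketch (one possible expansion assumed for illustration, not code from this patch) of a single vector iteration's EVL bookkeeping, computing the EVL as min(trip_count - index, VF) with VF = vscale x 4; all names are illustrative:

declare i64 @llvm.vscale.i64()
declare i64 @llvm.umin.i64(i64, i64)

define i64 @evl_iv_step(i64 %evl.iv, i64 %trip.count) {
  ; VPEVLBasedIVPHIRecipe models the header phi that would carry %evl.iv
  ; from 0 upward across vector iterations.
  ; ExplicitVectorLength computes the per-iteration EVL.
  %remaining = sub i64 %trip.count, %evl.iv
  %vscale = call i64 @llvm.vscale.i64()
  %vf = shl i64 %vscale, 2                      ; VF = vscale x 4
  %evl = call i64 @llvm.umin.i64(i64 %remaining, i64 %vf)
  ; ExplicitVectorLengthIVIncrement advances the IV by EVL rather than by VF.
  %evl.iv.next = add i64 %evl.iv, %evl
  ret i64 %evl.iv.next
}
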

llvm/lib/Transforms/Vectorize/VPlanAnalysis.cpp

Lines changed: 8 additions & 8 deletions
@@ -207,14 +207,14 @@ Type *VPTypeAnalysis::inferScalarType(const VPValue *V) {
   Type *ResultTy =
       TypeSwitch<const VPRecipeBase *, Type *>(V->getDefiningRecipe())
           .Case<VPCanonicalIVPHIRecipe, VPFirstOrderRecurrencePHIRecipe,
-                VPReductionPHIRecipe, VPWidenPointerInductionRecipe>(
-              [this](const auto *R) {
-                // Handle header phi recipes, except VPWienIntOrFpInduction
-                // which needs special handling due it being possibly truncated.
-                // TODO: consider inferring/caching type of siblings, e.g.,
-                // backedge value, here and in cases below.
-                return inferScalarType(R->getStartValue());
-              })
+                VPReductionPHIRecipe, VPWidenPointerInductionRecipe,
+                VPEVLBasedIVPHIRecipe>([this](const auto *R) {
+            // Handle header phi recipes, except VPWidenIntOrFpInduction
+            // which needs special handling due to it being possibly truncated.
+            // TODO: consider inferring/caching type of siblings, e.g.,
+            // backedge value, here and in cases below.
+            return inferScalarType(R->getStartValue());
+          })
           .Case<VPWidenIntOrFpInductionRecipe, VPDerivedIVRecipe>(
               [](const auto *R) { return R->getScalarType(); })
           .Case<VPPredInstPHIRecipe, VPWidenPHIRecipe, VPScalarIVStepsRecipe,
