
[TTI][RISCV]Improve costs for whole vector reg extract/insert. #80164

alexey-bataev
Member

@alexey-bataev alexey-bataev commented Jan 31, 2024

If we can detect that a whole-register extract/insert is requested,
consider it free.

Created using spr 1.3.5
@llvmbot added the backend:RISC-V and llvm:analysis labels Jan 31, 2024
@llvmbot
Member

llvmbot commented Jan 31, 2024

@llvm/pr-subscribers-backend-risc-v

@llvm/pr-subscribers-llvm-analysis

Author: Alexey Bataev (alexey-bataev)

Changes

If we can detect that a whole-register extract/insert is requested, it
emits a VMV_V_V instruction or just a vsetvli.


Patch is 81.61 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/80164.diff

4 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+42)
  • (modified) llvm/test/Analysis/CostModel/RISCV/shuffle-extract_subvector.ll (+87-87)
  • (modified) llvm/test/Analysis/CostModel/RISCV/shuffle-insert_subvector.ll (+21-21)
  • (modified) llvm/test/Analysis/CostModel/RISCV/shuffle-interleave.ll (+3-3)
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index fe1cdb2dfa423..465a05b6497a2 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -326,6 +326,48 @@ InstructionCost RISCVTTIImpl::getShuffleCost(TTI::ShuffleKind Kind,
     switch (Kind) {
     default:
       break;
+    case TTI::SK_ExtractSubvector:
+      if (isa<FixedVectorType>(SubTp)) {
+        unsigned TpRegs = getRegUsageForType(Tp);
+        unsigned NumElems =
+            divideCeil(Tp->getElementCount().getFixedValue(), TpRegs);
+        // Whole vector extract - just the vector itself + (possible) vsetvli.
+        // TODO: consider adding the cost for vsetvli.
+        if (Index % NumElems == 0) {
+          std::pair<InstructionCost, MVT> SubLT =
+              getTypeLegalizationCost(SubTp);
+          return Index == 0
+                     ? TTI::TCC_Free
+                     : SubLT.first * getRISCVInstructionCost(RISCV::VMV_V_V,
+                                                             SubLT.second,
+                                                             CostKind);
+        }
+      }
+      break;
+    case TTI::SK_InsertSubvector:
+      if (auto *FSubTy = dyn_cast<FixedVectorType>(SubTp)) {
+        unsigned TpRegs = getRegUsageForType(Tp);
+        unsigned SubTpRegs = getRegUsageForType(SubTp);
+        unsigned NextSubTpRegs = getRegUsageForType(FixedVectorType::get(
+            Tp->getElementType(), FSubTy->getNumElements() + 1));
+        unsigned NumElems =
+            divideCeil(Tp->getElementCount().getFixedValue(), TpRegs);
+        // Whole vector insert - just the vector itself + (possible) vsetvli.
+        // TODO: consider adding the cost for vsetvli.
+        if (Index % NumElems == 0 &&
+            (any_of(Args, UndefValue::classof) ||
+             (SubTpRegs != 0 && SubTpRegs != NextSubTpRegs &&
+              TpRegs / SubTpRegs > 1))) {
+          std::pair<InstructionCost, MVT> SubLT =
+              getTypeLegalizationCost(SubTp);
+          return Index == 0
+                     ? TTI::TCC_Free
+                     : SubLT.first * getRISCVInstructionCost(RISCV::VMV_V_V,
+                                                             SubLT.second,
+                                                             CostKind);
+        }
+      }
+      break;
     case TTI::SK_PermuteSingleSrc: {
       if (Mask.size() >= 2 && LT.second.isFixedLengthVector()) {
         MVT EltTp = LT.second.getVectorElementType();
diff --git a/llvm/test/Analysis/CostModel/RISCV/shuffle-extract_subvector.ll b/llvm/test/Analysis/CostModel/RISCV/shuffle-extract_subvector.ll
index 76cb1955a2b37..901d66e1124d8 100644
--- a/llvm/test/Analysis/CostModel/RISCV/shuffle-extract_subvector.ll
+++ b/llvm/test/Analysis/CostModel/RISCV/shuffle-extract_subvector.ll
@@ -9,15 +9,15 @@
 
 define void @test_vXf64(<4 x double> %src256, <8 x double> %src512) {
 ; CHECK-LABEL: 'test_vXf64'
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_01 = shufflevector <4 x double> %src256, <4 x double> undef, <2 x i32> <i32 0, i32 1>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_23 = shufflevector <4 x double> %src256, <4 x double> undef, <2 x i32> <i32 2, i32 3>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_01 = shufflevector <8 x double> %src512, <8 x double> undef, <2 x i32> <i32 0, i32 1>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_23 = shufflevector <8 x double> %src512, <8 x double> undef, <2 x i32> <i32 2, i32 3>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_45 = shufflevector <8 x double> %src512, <8 x double> undef, <2 x i32> <i32 4, i32 5>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_67 = shufflevector <8 x double> %src512, <8 x double> undef, <2 x i32> <i32 6, i32 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_0123 = shufflevector <8 x double> %src512, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_2345 = shufflevector <8 x double> %src512, <8 x double> undef, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_4567 = shufflevector <8 x double> %src512, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V256_01 = shufflevector <4 x double> %src256, <4 x double> undef, <2 x i32> <i32 0, i32 1>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V256_23 = shufflevector <4 x double> %src256, <4 x double> undef, <2 x i32> <i32 2, i32 3>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V512_01 = shufflevector <8 x double> %src512, <8 x double> undef, <2 x i32> <i32 0, i32 1>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V512_23 = shufflevector <8 x double> %src512, <8 x double> undef, <2 x i32> <i32 2, i32 3>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V512_45 = shufflevector <8 x double> %src512, <8 x double> undef, <2 x i32> <i32 4, i32 5>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V512_67 = shufflevector <8 x double> %src512, <8 x double> undef, <2 x i32> <i32 6, i32 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V512_0123 = shufflevector <8 x double> %src512, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V512_2345 = shufflevector <8 x double> %src512, <8 x double> undef, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V512_4567 = shufflevector <8 x double> %src512, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of -1 for instruction: %V512_567u = shufflevector <8 x double> %src512, <8 x double> undef, <4 x i32> <i32 5, i32 6, i32 7, i32 poison>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
@@ -36,15 +36,15 @@ define void @test_vXf64(<4 x double> %src256, <8 x double> %src512) {
 
 define void @test_vXi64(<4 x i64> %src256, <8 x i64> %src512) {
 ; CHECK-LABEL: 'test_vXi64'
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_01 = shufflevector <4 x i64> %src256, <4 x i64> undef, <2 x i32> <i32 0, i32 1>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_23 = shufflevector <4 x i64> %src256, <4 x i64> undef, <2 x i32> <i32 2, i32 3>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_01 = shufflevector <8 x i64> %src512, <8 x i64> undef, <2 x i32> <i32 0, i32 1>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_23 = shufflevector <8 x i64> %src512, <8 x i64> undef, <2 x i32> <i32 2, i32 3>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_45 = shufflevector <8 x i64> %src512, <8 x i64> undef, <2 x i32> <i32 4, i32 5>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_67 = shufflevector <8 x i64> %src512, <8 x i64> undef, <2 x i32> <i32 6, i32 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_0123 = shufflevector <8 x i64> %src512, <8 x i64> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_2345 = shufflevector <8 x i64> %src512, <8 x i64> undef, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_4567 = shufflevector <8 x i64> %src512, <8 x i64> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V256_01 = shufflevector <4 x i64> %src256, <4 x i64> undef, <2 x i32> <i32 0, i32 1>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V256_23 = shufflevector <4 x i64> %src256, <4 x i64> undef, <2 x i32> <i32 2, i32 3>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V512_01 = shufflevector <8 x i64> %src512, <8 x i64> undef, <2 x i32> <i32 0, i32 1>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V512_23 = shufflevector <8 x i64> %src512, <8 x i64> undef, <2 x i32> <i32 2, i32 3>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V512_45 = shufflevector <8 x i64> %src512, <8 x i64> undef, <2 x i32> <i32 4, i32 5>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V512_67 = shufflevector <8 x i64> %src512, <8 x i64> undef, <2 x i32> <i32 6, i32 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V512_0123 = shufflevector <8 x i64> %src512, <8 x i64> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V512_2345 = shufflevector <8 x i64> %src512, <8 x i64> undef, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V512_4567 = shufflevector <8 x i64> %src512, <8 x i64> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
   %V256_01 = shufflevector <4 x i64> %src256, <4 x i64> undef, <2 x i32> <i32 0, i32 1>
@@ -61,28 +61,28 @@ define void @test_vXi64(<4 x i64> %src256, <8 x i64> %src512) {
 
 define void @test_vXi32(<4 x i32> %src128, <8 x i32> %src256, <16 x i32> %src512) {
 ; CHECK-LABEL: 'test_vXi32'
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V128_01 = shufflevector <4 x i32> %src128, <4 x i32> undef, <2 x i32> <i32 0, i32 1>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V128_01 = shufflevector <4 x i32> %src128, <4 x i32> undef, <2 x i32> <i32 0, i32 1>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V128_23 = shufflevector <4 x i32> %src128, <4 x i32> undef, <2 x i32> <i32 2, i32 3>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_01 = shufflevector <8 x i32> %src256, <8 x i32> undef, <2 x i32> <i32 0, i32 1>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V256_01 = shufflevector <8 x i32> %src256, <8 x i32> undef, <2 x i32> <i32 0, i32 1>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_23 = shufflevector <8 x i32> %src256, <8 x i32> undef, <2 x i32> <i32 2, i32 3>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_45 = shufflevector <8 x i32> %src256, <8 x i32> undef, <2 x i32> <i32 4, i32 5>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V256_45 = shufflevector <8 x i32> %src256, <8 x i32> undef, <2 x i32> <i32 4, i32 5>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_67 = shufflevector <8 x i32> %src256, <8 x i32> undef, <2 x i32> <i32 6, i32 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_0123 = shufflevector <8 x i32> %src256, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_4567 = shufflevector <8 x i32> %src256, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_01 = shufflevector <16 x i32> %src512, <16 x i32> undef, <2 x i32> <i32 0, i32 1>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V256_0123 = shufflevector <8 x i32> %src256, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V256_4567 = shufflevector <8 x i32> %src256, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V512_01 = shufflevector <16 x i32> %src512, <16 x i32> undef, <2 x i32> <i32 0, i32 1>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_23 = shufflevector <16 x i32> %src512, <16 x i32> undef, <2 x i32> <i32 2, i32 3>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_45 = shufflevector <16 x i32> %src512, <16 x i32> undef, <2 x i32> <i32 4, i32 5>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V512_45 = shufflevector <16 x i32> %src512, <16 x i32> undef, <2 x i32> <i32 4, i32 5>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_67 = shufflevector <16 x i32> %src512, <16 x i32> undef, <2 x i32> <i32 6, i32 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_89 = shufflevector <16 x i32> %src512, <16 x i32> undef, <2 x i32> <i32 8, i32 9>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V512_89 = shufflevector <16 x i32> %src512, <16 x i32> undef, <2 x i32> <i32 8, i32 9>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_AB = shufflevector <16 x i32> %src512, <16 x i32> undef, <2 x i32> <i32 10, i32 11>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_CD = shufflevector <16 x i32> %src512, <16 x i32> undef, <2 x i32> <i32 12, i32 13>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V512_CD = shufflevector <16 x i32> %src512, <16 x i32> undef, <2 x i32> <i32 12, i32 13>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_EF = shufflevector <16 x i32> %src512, <16 x i32> undef, <2 x i32> <i32 14, i32 15>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_0123 = shufflevector <16 x i32> %src512, <16 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_4567 = shufflevector <16 x i32> %src512, <16 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_89AB = shufflevector <16 x i32> %src512, <16 x i32> undef, <4 x i32> <i32 8, i32 9, i32 10, i32 11>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_CDEF = shufflevector <16 x i32> %src512, <16 x i32> undef, <4 x i32> <i32 12, i32 13, i32 14, i32 15>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_01234567 = shufflevector <16 x i32> %src512, <16 x i32> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 4 for instruction: %V512_89ABCDEF = shufflevector <16 x i32> %src512, <16 x i32> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V512_0123 = shufflevector <16 x i32> %src512, <16 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V512_4567 = shufflevector <16 x i32> %src512, <16 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V512_89AB = shufflevector <16 x i32> %src512, <16 x i32> undef, <4 x i32> <i32 8, i32 9, i32 10, i32 11>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V512_CDEF = shufflevector <16 x i32> %src512, <16 x i32> undef, <4 x i32> <i32 12, i32 13, i32 14, i32 15>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V512_01234567 = shufflevector <16 x i32> %src512, <16 x i32> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V512_89ABCDEF = shufflevector <16 x i32> %src512, <16 x i32> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
 ;
   %V128_01 = shufflevector <4 x i32> %src128, <4 x i32> undef, <2 x i32> <i32 0, i32 1>
@@ -112,62 +112,62 @@ define void @test_vXi32(<4 x i32> %src128, <8 x i32> %src256, <16 x i32> %src512
 
 define void @test_vXi16(<4 x i16> %src64, <8 x i16> %src128, <16 x i16> %src256, <32 x i16> %src512) {
 ; CHECK-LABEL: 'test_vXi16'
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V64_01 = shufflevector <4 x i16> %src64, <4 x i16> undef, <2 x i32> <i32 0, i32 1>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V64_01 = shufflevector <4 x i16> %src64, <4 x i16> undef, <2 x i32> <i32 0, i32 1>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V64_23 = shufflevector <4 x i16> %src64, <4 x i16> undef, <2 x i32> <i32 2, i32 3>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V128_01 = shufflevector <8 x i16> %src128, <8 x i16> undef, <2 x i32> <i32 0, i32 1>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V128_01 = shufflevector <8 x i16> %src128, <8 x i16> undef, <2 x i32> <i32 0, i32 1>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V128_23 = shufflevector <8 x i16> %src128, <8 x i16> undef, <2 x i32> <i32 2, i32 3>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V128_45 = shufflevector <8 x i16> %src128, <8 x i16> undef, <2 x i32> <i32 4, i32 5>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V128_67 = shufflevector <8 x i16> %src128, <8 x i16> undef, <2 x i32> <i32 6, i32 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V128_0123 = shufflevector <8 x i16> %src128, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V128_0123 = shufflevector <8 x i16> %src128, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V128_4567 = shufflevector <8 x i16> %src128, <8 x i16> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_01 = shufflevector <16 x i16> %src256, <16 x i16> undef, <2 x i32> <i32 0, i32 1>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V256_01 = shufflevector <16 x i16> %src256, <16 x i16> undef, <2 x i32> <i32 0, i32 1>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_23 = shufflevector <16 x i16> %src256, <16 x i16> undef, <2 x i32> <i32 2, i32 3>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_45 = shufflevector <16 x i16> %src256, <16 x i16> undef, <2 x i32> <i32 4, i32 5>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_67 = shufflevector <16 x i16> %src256, <16 x i16> undef, <2 x i32> <i32 6, i32 7>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_89 = shufflevector <16 x i16> %src256, <16 x i16> undef, <2 x i32> <i32 8, i32 9>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: %V256_89 = shufflevector <16 x i16> %src256, <16 x i16> undef, <2 x i32> <i32 8, i32 9>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_AB = shufflevector <16 x i16> %src256, <16 x i16> undef, <2 x i32> <i32 10, i32 11>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_CD = shufflevector <16 x i16> %src256, <16 x i16> undef, <2 x i32> <i32 12, i32 13>
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_EF = shufflevector <16 x i16> %src256, <16 x i16> undef, <2 x i32> <i32 14, i32 15>
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 2 for instruction: %V256_0123 = shufflevector <16 x i16> %src256, <16 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: %V256_0123 = shufflevector <16 x i16> %src256, <16 x i16> u...
[truncated]

@topperc
Collaborator

topperc commented Jan 31, 2024

Don't we need to know the exact vlen to know where register boundaries are?

@alexey-bataev
Member Author

> Don't we need to know the exact vlen to know where register boundaries are?

I use getRegUsageForType() to get this info.

@topperc
Collaborator

topperc commented Jan 31, 2024

> > Don't we need to know the exact vlen to know where register boundaries are?

> I use getRegUsageForType() to get this info.

That tells the maximum number of registers needed for the type assuming a minimum VLEN. If hardware VLEN is more than the minimum VLEN, we still use the extra registers but the elements in them are not used since they would be past VL. CodeGen has to use a slidedown unless we also know the maximum VLEN is the same as the minimum VLEN.
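
To make the point above concrete, here is a minimal sketch of why the register boundary depends on the actual hardware VLEN rather than the minimum. The helper `startsAtRegisterBoundary` is hypothetical and not part of LLVM; it only models the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not an LLVM API: reports whether element `index` of a
// fixed-length vector starts exactly at a vector register boundary, given the
// element width and the *actual* hardware VLEN, both in bits.
bool startsAtRegisterBoundary(uint64_t index, uint64_t eltBits,
                              uint64_t vlenBits) {
  assert(eltBits != 0 && vlenBits != 0 && "widths must be non-zero");
  return (index * eltBits) % vlenBits == 0;
}
```

For an `<8 x double>` source, index 2 sits on a register boundary when VLEN is 128 but falls in the middle of a register when VLEN is 256, which is the case where CodeGen would need a slidedown.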

@alexey-bataev
Member Author

> > > Don't we need to know the exact vlen to know where register boundaries are?

> > I use getRegUsageForType() to get this info.

> That tells the maximum number of registers needed for the type assuming a minimum VLEN. If hardware VLEN is more than the minimum VLEN, we still use the extra registers but the elements in them are not used since they would be past VL. CodeGen has to use a slidedown unless we also know the maximum VLEN is the same as the minimum VLEN.

Do we have anything in TTI that returns correct VLEN?

@topperc
Collaborator

topperc commented Jan 31, 2024

> > > > Don't we need to know the exact vlen to know where register boundaries are?

> > > I use getRegUsageForType() to get this info.

> > That tells the maximum number of registers needed for the type assuming a minimum VLEN. If hardware VLEN is more than the minimum VLEN, we still use the extra registers but the elements in them are not used since they would be past VL. CodeGen has to use a slidedown unless we also know the maximum VLEN is the same as the minimum VLEN.

> Do we have anything in TTI that returns correct VLEN?

You can check that ST->getRealMaxVLen() == ST->getRealMinVLen()

@alexey-bataev
Member Author

> > > > > Don't we need to know the exact vlen to know where register boundaries are?

> > > > I use getRegUsageForType() to get this info.

> > > That tells the maximum number of registers needed for the type assuming a minimum VLEN. If hardware VLEN is more than the minimum VLEN, we still use the extra registers but the elements in them are not used since they would be past VL. CodeGen has to use a slidedown unless we also know the maximum VLEN is the same as the minimum VLEN.

> > Do we have anything in TTI that returns correct VLEN?

> You can check that ST->getRealMaxVLen() == ST->getRealMinVLen()

Will add

Created using spr 1.3.5
            divideCeil(Tp->getElementCount().getFixedValue(), TpRegs);
        // Whole vector extract - just the vector itself + (possible) vsetvli.
        // TODO: consider adding the cost for vsetvli.
        if (Index == 0 || (ST->getRealMaxVLen() == ST->getRealMinVLen() &&
Collaborator

I think this check would be more clearly expressed as an AND of the following clauses:
a) ST->getRealMaxVLen() == ST->getRealMinVLen()
b) NumElems * ElementSizeInBits == VLEN
c) Index % NumElems == 0

Note that this only supports m1 full extracts. But starting there and extending it to m2, and m4 later seems entirely reasonable.
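
As a rough illustration of the three clauses above, the predicate could look like the sketch below. It is a standalone model, not the code from the patch: `isM1WholeRegExtract` and its parameter names are assumptions, with `minVLen` and `maxVLen` standing in for `ST->getRealMinVLen()` and `ST->getRealMaxVLen()`.

```cpp
#include <cassert>
#include <cstdint>

// Standalone model of clauses (a), (b) and (c); every name here is
// illustrative and none of this is an LLVM API.
bool isM1WholeRegExtract(uint64_t minVLen, uint64_t maxVLen,
                         uint64_t numElems, uint64_t eltBits,
                         uint64_t index) {
  assert(numElems != 0 && "element count must be non-zero");
  // (a) The exact VLEN is known: minimum and maximum agree.
  const bool exactVLen = minVLen == maxVLen;
  // (b) The elements fill exactly one vector register (m1).
  const bool fillsOneReg = numElems * eltBits == minVLen;
  // (c) The extract starts at a multiple of the element count.
  const bool aligned = index % numElems == 0;
  return exactVLen && fillsOneReg && aligned;
}
```

With VLEN fixed at 128, extracting 4 x i32 at index 4 satisfies all three clauses, while the same extract with a maximum VLEN of 256 fails (a), and a 2 x i32 extract fails (b).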

              getTypeLegalizationCost(SubTp);
          return Index == 0
                     ? TTI::TCC_Free
                     : SubLT.first * getRISCVInstructionCost(RISCV::VMV_V_V,
Collaborator

For a full VREG case, you never need the VMV_V_V. You only need the VMV_V_V if NumElems < VLMAX.

Extending this to sub-register extract with exact VLEN known would be reasonable, but let's do that in a separate patch.
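
The rule stated above can be sketched with the standard RVV relation VLMAX = LMUL * VLEN / SEW. The helpers `vlmax` and `needsVectorMove` are hypothetical names used only for illustration, and the sketch assumes an integer LMUL.

```cpp
#include <cassert>
#include <cstdint>

// VLMAX for an integer LMUL, following VLMAX = LMUL * VLEN / SEW.
// Hypothetical helper, not an LLVM API.
uint64_t vlmax(uint64_t vlenBits, uint64_t sewBits, uint64_t lmul) {
  assert(sewBits != 0 && "SEW must be non-zero");
  return lmul * vlenBits / sewBits;
}

// Models the reviewer's rule: a vector move is only needed when the
// extracted element count is smaller than VLMAX.
bool needsVectorMove(uint64_t numElems, uint64_t vlenBits, uint64_t sewBits,
                     uint64_t lmul) {
  return numElems < vlmax(vlenBits, sewBits, lmul);
}
```

At VLEN = 128 and SEW = 32 with LMUL 1, VLMAX is 4, so a 4-element extract covers the full register and needs no move, while a 2-element extract does.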

      break;
    case TTI::SK_InsertSubvector:
      if (auto *FSubTy = dyn_cast<FixedVectorType>(SubTp)) {
        unsigned TpRegs = getRegUsageForType(Tp);
Collaborator

Same basic style comments as above.

Created using spr 1.3.5
; RTBASE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V128_01 = shufflevector <4 x i32> %src128, <4 x i32> undef, <2 x i32> <i32 0, i32 1>
; RTBASE-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %V128_23 = shufflevector <4 x i32> %src128, <4 x i32> undef, <2 x i32> <i32 2, i32 3>
; RTBASE-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %V256_01 = shufflevector <8 x i32> %src256, <8 x i32> undef, <2 x i32> <i32 0, i32 1>
; RTBASE-NEXT: Cost Model: Found an estimated cost of 2 for instruction: %V256_23 = shufflevector <8 x i32> %src256, <8 x i32> undef, <2 x i32> <i32 2, i32 3>
Contributor

Just making a note, this is an LMUL 1 vslidedown:

	vsetivli	zero, 2, e32, m1, ta, ma
	vslidedown.vi	v8, v8, 2

So I think the cost should just be one here. This looks like it's coming from the scalable vector cost path. Is the type being passed in a <8 x i32> instead of a <2 x i32>? Something that could be looked at in a later patch.

Collaborator

@preames preames left a comment

I'd encourage you to split off a change to handle only the Index == 0 case. It should be simple, but that's the value. :)

Created using spr 1.3.5
@preames
Collaborator

preames commented Feb 14, 2024

> I'd encourage you to split off a change to handle only the Index == 0 case. It should be simple, but that's the value. :)

I posted this here: #81751

@alexey-bataev
Member Author

The latest version of the patch actually handles only index 0

@lukel97
Contributor

lukel97 commented Feb 15, 2024

#81751 handles both fixed and scalable vectors from the looks of things. I wonder if it's possible to have this patch handle the whole reg extract/insert case for scalable vectors too, if we move the logic into the scalable part below the fixed vector switch? Any scalable vector extract/insert should be free if both the vector and subvector are >= LMUL 1.

@alexey-bataev
Member Author

> #81751 handles both fixed and scalable vectors from the looks of things. I wonder if it's possible to have this patch handle the whole reg extract/insert case for scalable vectors too, if we move the logic into the scalable part below the fixed vector switch? Any scalable vector extract/insert should be free if both the vector and subvector are >= LMUL 1.

I'm not sure that insert subvector should be free. It can be free if either the second vector is undef or the whole vector is being inserted. LMUL >= 1 is not enough for the second case; we also need to check that the whole vector is inserted and not, say, half of it.

@lukel97
Contributor

lukel97 commented Feb 15, 2024

> LMUL >= 1 is not enough for the second case; we also need to check that the whole vector is inserted and not, say, half of it.

But for llvm.vector.insert there is the constraint that all the subvec elements must be within bounds of the vector, and for scalable vectors the index at which it is inserted is scaled by vscale.

So if the subvector is LMUL >= 1 it shouldn't be possible for only half of it to be inserted, since it won't be truncated and the index will be a multiple of an LMUL1 register boundary.

This is separate from the exact VLEN fixed vector case in the original version of this PR though, we can leave it for a future patch.
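
The scalable-vector argument above can be modelled with a small sketch. It assumes that one scalable block is 64 bits in LLVM's RISC-V backend (so `<vscale x 1 x i64>` is LMUL 1) and that the insert index is a multiple of the subvector's known minimum element count, as llvm.vector.insert requires. The constant `kBitsPerBlock` and both helpers are hypothetical names, not LLVM APIs.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: one scalable "block" is 64 bits, so a type whose known minimum
// size is at least 64 bits occupies at least one full register (LMUL >= 1).
constexpr uint64_t kBitsPerBlock = 64;

// True if <vscale x minElts x iN> with N = eltBits is at least LMUL 1.
bool isAtLeastLMUL1(uint64_t minElts, uint64_t eltBits) {
  return minElts * eltBits >= kBitsPerBlock;
}

// True if an insert at `idx` (a multiple of the subvector's minimum element
// count) starts on an LMUL 1 register boundary. Scaling by vscale multiplies
// both sides equally, so it does not affect the result.
bool insertStartsOnLMUL1Boundary(uint64_t idx, uint64_t subMinElts,
                                 uint64_t eltBits) {
  assert(subMinElts != 0 && idx % subMinElts == 0 &&
         "index must be a multiple of the subvector's minimum length");
  return (idx * eltBits) % kBitsPerBlock == 0;
}
```

For a `<vscale x 2 x i32>` subvector (LMUL 1) every legal index lands on a register boundary, whereas a `<vscale x 1 x i32>` subvector (fractional LMUL) inserted at index 1 does not.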

Created using spr 1.3.5
@alexey-bataev changed the title from "[TTI][RISCV]Improve costs for fixed vector whole reg extract/insert." to "[TTI][RISCV]Improve costs for whole vector reg extract/insert." on Feb 15, 2024
        unsigned NextSubTpRegs = getRegUsageForType(FixedVectorType::get(
            Tp->getElementType(), FSubTy->getNumElements() + 1));
        // Whole vector insert - just the vector itself.
        if (Index == 0 && SubTpRegs != 0 && SubTpRegs != NextSubTpRegs &&
Contributor

Just want to check, is SubTpRegs != NextSubTpRegs to check that SubTp isn't a fractional LMUL?
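
One way to read the check being asked about is sketched below, under a deliberately simplified model in which register usage is the bit size divided by the minimum VLEN, rounded up. The real `getRegUsageForType()` may round differently, so `regUsage` and `fillsWholeRegisters` are hypothetical stand-ins rather than the patch's logic.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model of register usage: ceil(total bits / minimum VLEN).
// This is an assumption for illustration, not the LLVM implementation.
uint64_t regUsage(uint64_t numElts, uint64_t eltBits, uint64_t minVLen) {
  assert(minVLen != 0 && "VLEN must be non-zero");
  return (numElts * eltBits + minVLen - 1) / minVLen;
}

// If adding one more element needs another register, the current element
// count ends exactly on a register boundary, i.e. it is not fractional.
bool fillsWholeRegisters(uint64_t numElts, uint64_t eltBits,
                         uint64_t minVLen) {
  const uint64_t regs = regUsage(numElts, eltBits, minVLen);
  const uint64_t nextRegs = regUsage(numElts + 1, eltBits, minVLen);
  return regs != 0 && regs != nextRegs;
}
```

Under this model with a minimum VLEN of 128, 4 x i32 fills one register (a fifth element would need a second), while 2 x i32 and 3 x i32 both fit in one register, so 2 x i32 is treated as fractional.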

Created using spr 1.3.5
@@ -442,6 +454,9 @@ InstructionCost RISCVTTIImpl::getShuffleCost(TTI::ShuffleKind Kind,
    return LT.first *
           getRISCVInstructionCost(RISCV::VSLIDEDOWN_VI, LT.second, CostKind);
  case TTI::SK_InsertSubvector:
    if (Index == 0 && any_of(Args, UndefValue::classof))
Contributor

Do we need to check that Args isn't empty?

Member Author

Added, thanks!

Created using spr 1.3.5
Contributor

@lukel97 lukel97 left a comment

Does the PR title need to be updated to reflect that this handles insert_subvectors only now?

@@ -326,6 +326,18 @@ InstructionCost RISCVTTIImpl::getShuffleCost(TTI::ShuffleKind Kind,
    switch (Kind) {
    default:
      break;
    case TTI::SK_InsertSubvector: {
      auto *FSubTy = dyn_cast<FixedVectorType>(SubTp);
Contributor

Should we use cast instead of dyn_cast here to get the assertion?

Created using spr 1.3.5
preames added a commit that referenced this pull request Feb 21, 2024
…ct vlen (#82405)

If we have exact vlen knowledge, we can figure out which indices
correspond to register boundaries. Our lowering uses this knowledge to
replace the vslidedown.vi with a sub-register extract. Our costs can
reflect that as well.

This is another piece split off
#80164

---------

Co-authored-by: Luke Lau <[email protected]>
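
The register-boundary reasoning in the commit message above can be sketched as follows. This is a rough model under stated assumptions, not the code from that commit: it assumes an extract needs no slide when the exact VLEN is known, the subvector fits in a single register, and its first element starts on a register boundary. `isSubRegisterExtract` is a hypothetical name.

```cpp
#include <cassert>
#include <optional>

// Hypothetical check. Assumption: the extract is a plain sub-register read
// when the exact VLEN is known, the subvector fits in one register, and its
// first element sits on a register boundary.
bool isSubRegisterExtract(std::optional<unsigned> ExactVLenBits,
                          unsigned EltBits, unsigned SubNumElts,
                          unsigned Index) {
  if (!ExactVLenBits)
    return false; // Without exact VLEN the boundaries are unknown.
  assert(*ExactVLenBits != 0 && "VLEN must be non-zero");
  unsigned SubBits = EltBits * SubNumElts;
  unsigned OffsetBits = EltBits * Index;
  return SubBits <= *ExactVLenBits && OffsetBits % *ExactVLenBits == 0;
}
```

With an exact VLEN of 128, extracting four i32 elements at index 4 starts at bit offset 128 and lands on a boundary, whereas index 2 starts in the middle of a register.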
@@ -442,6 +454,9 @@ InstructionCost RISCVTTIImpl::getShuffleCost(TTI::ShuffleKind Kind,
    return LT.first *
           getRISCVInstructionCost(RISCV::VSLIDEDOWN_VI, LT.second, CostKind);
  case TTI::SK_InsertSubvector:
    if (Index == 0 && !Args.empty() && any_of(Args, UndefValue::classof))
Collaborator

I tried to split off this piece - or more accurately something vaguely related - and stumbled into something interesting.

The InsertSubvector case with Index=0 is unreachable from everywhere except SLP. TTI::getInstructionCost contains a check for the identity shuffle and always returns 0 for it. improveShuffleKindFromMask will (correctly) recognize the insert-into-passthru case as a select, so that path doesn't hit this case either. Put together, this means that the Index=0 case never makes it to the backend, and thus we have no test coverage via cost model tests.

SLP hits a slightly different codepath here and directly calls getShuffleCost with a possible identity mask. It still can't hit the select case, but it can hit the insert into poison case. SLP appears to have a bunch of guards for this already in various cases.

I'm not really a fan of having untestable logic here. Anyone have any ideas how we can rework this API to ensure SLP can't reach a case which is untestable via costmodel tests?

Member Author

Don't the tests for the llvm.vector.insert intrinsics check this?

@alexey-bataev
Member Author

Ping!

unsigned NextSubTpRegs = getRegUsageForType(FixedVectorType::get(
    Tp->getElementType(), FSubTy->getNumElements() + 1));
// Whole vector insert - just the vector itself.
if (Index == 0 && SubTpRegs != 0 && SubTpRegs != NextSubTpRegs &&
Collaborator

@topperc Feb 28, 2024

I don't think this works. getRegUsageForType is returning the maximum number of registers needed given a minimum VLEN. If the runtime VLEN is larger the used number of registers could be less.

The backend must always use a vslideup or vmv.v.v for fixed vector insert unless we know both the maximum and minimum VLEN are the same. I think you have to check ST.getRealVLen().
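
To make the concern concrete, here is a small arithmetic sketch. It assumes the register count for a fixed vector is ceil(type bits / VLEN); `regsNeeded` is a hypothetical helper, not an LLVM function. The count computed from the minimum VLEN is only an upper bound on what a machine with a larger VLEN actually uses.

```cpp
#include <cassert>

// Hypothetical helper: registers a fixed vector of TypeBits bits occupies on
// a machine whose vector registers are VLenBits wide.
unsigned regsNeeded(unsigned TypeBits, unsigned VLenBits) {
  assert(VLenBits != 0 && "VLEN must be non-zero");
  return (TypeBits + VLenBits - 1) / VLenBits;
}
```

For a 256-bit vector such as <8 x i32>, this gives two registers at VLEN=128 but one register at VLEN=256, so a "whole register" conclusion drawn from the minimum VLEN may not hold at run time unless the minimum and maximum VLEN are equal.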

@alexey-bataev
Member Author

Ping!

Contributor

@lukel97 left a comment

I think the PR title still needs the reg extract part removed

Comment on lines +463 to +464
const unsigned MinVLen = ST->getRealMinVLen();
const unsigned MaxVLen = ST->getRealMaxVLen();
Contributor

You can use ST->getRealVLen() which was added recently
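
A rough sketch of what such a helper plausibly does, assuming it yields a value only when the minimum and maximum VLEN agree. `SubtargetSketch` is a hypothetical stand-in that models just the two bounds; it is not the real subtarget class.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for the subtarget; only the two bounds are modeled.
struct SubtargetSketch {
  unsigned RealMinVLen;
  unsigned RealMaxVLen;

  // Assumed behavior: the exact VLEN is known only when both bounds agree.
  std::optional<unsigned> getRealVLen() const {
    assert(RealMinVLen <= RealMaxVLen && "inconsistent VLEN bounds");
    if (RealMinVLen != RealMaxVLen)
      return std::nullopt;
    return RealMinVLen;
  }
};
```

A caller would then test the optional once instead of comparing the two bounds by hand.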

Comment on lines +468 to +472
unsigned TpRegs = getRegUsageForType(Tp);
unsigned SubTpRegs = getRegUsageForType(SubTp);
unsigned NextSubTpRegs = getRegUsageForType(FixedVectorType::get(
    Tp->getElementType(), FSubTy->getNumElements() + 1));
if (SubTpRegs != 0 && SubTpRegs != NextSubTpRegs && TpRegs >= SubTpRegs)
Contributor

Is it possible for TpRegs < SubTpRegs?

preames added a commit to preames/llvm-project that referenced this pull request Mar 14, 2024
…t vlen

If we have exact vlen knowledge, we can figure out which indices
correspond to register boundaries. Our lowering will use this knowledge
to replace the vslideup.vi with a sub-register insert when the subvec
passthru is undef.  One case where the subvec passthru is known undef
is when the subvec completely fills the subregister, and that's the
easiest case to recognize during costing.

Note: This is cost modeling a lowering which hasn't landed yet, see
llvm#84107.  This change will
not land until after that one does.

This is another piece split off
llvm#80164
Collaborator

@preames left a comment

I have posted an alternative patch for the remaining insertsubvector case here: #85240

When writing this, I discovered that this patch models a lowering which is not implemented. There is a patch on review, but it hasn't landed yet.
