
[VectorCombine] Expand vector_insert into shufflevector for earlier cost optimizations (#145512) #146479


Open: wants to merge 4 commits into main from fold-vector-insert-to-shuffle

Conversation

@laurenmchin commented Jul 1, 2025

Move folding logic from InstCombineCalls to VectorCombine to ensure vector_insert intrinsics are expanded into shufflevector instructions before cost-based shuffle optimizations run.

Canonicalizes fixed-width vectors only.

@RKSimon

Fixes #145512

@laurenmchin requested a review from nikic as a code owner on July 1, 2025 07:02

@llvmbot added the vectorizers, llvm:instcombine, llvm:transforms, and llvm:vectorcombine labels on Jul 1, 2025
@llvmbot (Member) commented Jul 1, 2025

@llvm/pr-subscribers-vectorizers

@llvm/pr-subscribers-llvm-transforms

Author: Lauren (laurenmchin)



Full diff: https://github.com/llvm/llvm-project/pull/146479.diff

3 Files Affected:

  • (modified) llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp (-46)
  • (modified) llvm/lib/Transforms/Vectorize/VectorCombine.cpp (+71)
  • (added) llvm/test/Transforms/VectorCombine/fold-vector-insert.ll (+71)
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp b/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
index 5b398d3b75f59..df29024a86f67 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
@@ -3462,52 +3462,6 @@ Instruction *InstCombinerImpl::visitCallInst(CallInst &CI) {
     }
     break;
   }
-  case Intrinsic::vector_insert: {
-    Value *Vec = II->getArgOperand(0);
-    Value *SubVec = II->getArgOperand(1);
-    Value *Idx = II->getArgOperand(2);
-    auto *DstTy = dyn_cast<FixedVectorType>(II->getType());
-    auto *VecTy = dyn_cast<FixedVectorType>(Vec->getType());
-    auto *SubVecTy = dyn_cast<FixedVectorType>(SubVec->getType());
-
-    // Only canonicalize if the destination vector, Vec, and SubVec are all
-    // fixed vectors.
-    if (DstTy && VecTy && SubVecTy) {
-      unsigned DstNumElts = DstTy->getNumElements();
-      unsigned VecNumElts = VecTy->getNumElements();
-      unsigned SubVecNumElts = SubVecTy->getNumElements();
-      unsigned IdxN = cast<ConstantInt>(Idx)->getZExtValue();
-
-      // An insert that entirely overwrites Vec with SubVec is a nop.
-      if (VecNumElts == SubVecNumElts)
-        return replaceInstUsesWith(CI, SubVec);
-
-      // Widen SubVec into a vector of the same width as Vec, since
-      // shufflevector requires the two input vectors to be the same width.
-      // Elements beyond the bounds of SubVec within the widened vector are
-      // undefined.
-      SmallVector<int, 8> WidenMask;
-      unsigned i;
-      for (i = 0; i != SubVecNumElts; ++i)
-        WidenMask.push_back(i);
-      for (; i != VecNumElts; ++i)
-        WidenMask.push_back(PoisonMaskElem);
-
-      Value *WidenShuffle = Builder.CreateShuffleVector(SubVec, WidenMask);
-
-      SmallVector<int, 8> Mask;
-      for (unsigned i = 0; i != IdxN; ++i)
-        Mask.push_back(i);
-      for (unsigned i = DstNumElts; i != DstNumElts + SubVecNumElts; ++i)
-        Mask.push_back(i);
-      for (unsigned i = IdxN + SubVecNumElts; i != DstNumElts; ++i)
-        Mask.push_back(i);
-
-      Value *Shuffle = Builder.CreateShuffleVector(Vec, WidenShuffle, Mask);
-      return replaceInstUsesWith(CI, Shuffle);
-    }
-    break;
-  }
   case Intrinsic::vector_extract: {
     Value *Vec = II->getArgOperand(0);
     Value *Idx = II->getArgOperand(1);
diff --git a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
index 19e82099e87f0..dbbc6c5a07ec8 100644
--- a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
+++ b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
@@ -112,6 +112,7 @@ class VectorCombine {
   bool foldExtractExtract(Instruction &I);
   bool foldInsExtFNeg(Instruction &I);
   bool foldInsExtBinop(Instruction &I);
+  bool foldVectorInsertToShuffle(Instruction &I);
   bool foldInsExtVectorToShuffle(Instruction &I);
   bool foldBitOpOfBitcasts(Instruction &I);
   bool foldBitcastShuffle(Instruction &I);
@@ -804,6 +805,73 @@ bool VectorCombine::foldInsExtBinop(Instruction &I) {
   return true;
 }
 
+/// Try to fold vector_insert intrinsics into shufflevector instructions.
+bool VectorCombine::foldVectorInsertToShuffle(Instruction &I) {
+  auto *II = dyn_cast<IntrinsicInst>(&I);
+  // This optimization only applies to vector_insert intrinsics.
+  if (!II || II->getIntrinsicID() != Intrinsic::vector_insert)
+    return false;
+
+  Value *Vec = II->getArgOperand(0);
+  Value *SubVec = II->getArgOperand(1);
+  Value *Idx = II->getArgOperand(2);
+
+  // Caller guarantees DstTy is a fixed vector.
+  auto *DstTy = cast<FixedVectorType>(II->getType());
+  auto *VecTy = dyn_cast<FixedVectorType>(Vec->getType());
+  auto *SubVecTy = dyn_cast<FixedVectorType>(SubVec->getType());
+
+  // Only canonicalize if Vec and SubVec are both fixed vectors.
+  if (!VecTy || !SubVecTy)
+    return false;
+
+  unsigned DstNumElts = DstTy->getNumElements();
+  unsigned VecNumElts = VecTy->getNumElements();
+  unsigned SubVecNumElts = SubVecTy->getNumElements();
+  auto *SubVecPtr = dyn_cast<ConstantInt>(Idx);
+  if (!SubVecPtr)
+    return false;
+
+  unsigned SubVecIdx = SubVecPtr->getZExtValue();
+
+  // Ensure insertion of SubVec doesn't exceed Dst bounds.
+  if (SubVecIdx % SubVecNumElts != 0 || SubVecIdx + SubVecNumElts > DstNumElts)
+    return false;
+
+  // An insert that entirely overwrites Vec with SubVec is a nop.
+  if (VecNumElts == SubVecNumElts) {
+    replaceValue(I, *SubVec);
+    return true;
+  }
+
+  // Widen SubVec into a vector of the same width as Vec, since
+  // shufflevector requires the two input vectors to be the same width.
+  // Elements beyond the bounds of SubVec within the widened vector are
+  // undefined.
+  SmallVector<int, 8> WidenMask;
+  unsigned int i = 0;
+  for (i = 0; i != SubVecNumElts; ++i)
+    WidenMask.push_back(i);
+  for (; i != VecNumElts; ++i)
+    WidenMask.push_back(PoisonMaskElem);
+
+  auto *WidenShuffle = Builder.CreateShuffleVector(SubVec, WidenMask);
+  Worklist.pushValue(WidenShuffle);
+
+  SmallVector<int, 8> Mask;
+  unsigned int j;
+  for (i = 0; i != SubVecIdx; ++i)
+    Mask.push_back(i);
+  for (j = 0; j != SubVecNumElts; ++j)
+    Mask.push_back(DstNumElts + j);
+  for (i = SubVecIdx + SubVecNumElts; i != DstNumElts; ++i)
+    Mask.push_back(i);
+
+  auto *Shuffle = Builder.CreateShuffleVector(Vec, WidenShuffle, Mask);
+  replaceValue(I, *Shuffle);
+  return true;
+}
+
 bool VectorCombine::foldBitOpOfBitcasts(Instruction &I) {
   // Match: bitop(bitcast(x), bitcast(y)) -> bitcast(bitop(x, y))
   Value *LHSSrc, *RHSSrc;
@@ -3639,6 +3707,9 @@ bool VectorCombine::run() {
     // dispatching to folding functions if there's no chance of matching.
     if (IsFixedVectorType) {
       switch (Opcode) {
+      case Instruction::Call:
+        MadeChange |= foldVectorInsertToShuffle(I);
+        break;
       case Instruction::InsertElement:
         MadeChange |= vectorizeLoadInsert(I);
         break;
diff --git a/llvm/test/Transforms/VectorCombine/fold-vector-insert.ll b/llvm/test/Transforms/VectorCombine/fold-vector-insert.ll
new file mode 100644
index 0000000000000..976fdb322005b
--- /dev/null
+++ b/llvm/test/Transforms/VectorCombine/fold-vector-insert.ll
@@ -0,0 +1,71 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -passes=vector-combine -S | FileCheck %s
+
+declare <8 x i32> @llvm.vector.insert.v8i32.v4i32(<8 x i32>, <4 x i32>, i64)
+declare <8 x i32> @llvm.vector.insert.v8i32.v8i32(<8 x i32>, <8 x i32>, i64)
+declare <8 x i32> @llvm.vector.insert.v8i32.v2i32(<8 x i32>, <2 x i32>, i64)
+declare <8 x i32> @llvm.vector.insert.v8i32.v1i32(<8 x i32>, <1 x i32>, i64)
+declare <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.v2i32(<vscale x 4 x i32>, <2 x i32>, i64)
+
+define <8 x i32> @vector_insert_begin(<8 x i32> %vec, <4 x i32> %subvec) {
+; CHECK-LABEL: define <8 x i32> @vector_insert_begin(
+; CHECK-SAME: <8 x i32> [[VEC:%.*]], <4 x i32> [[SUBVEC:%.*]]) {
+; CHECK-NEXT:    [[TMP1:%.*]] = shufflevector <4 x i32> [[SUBVEC]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>
+; CHECK-NEXT:    [[RESULT:%.*]] = shufflevector <8 x i32> [[VEC]], <8 x i32> [[TMP1]], <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT:    ret <8 x i32> [[RESULT]]
+;
+  %result = call <8 x i32> @llvm.vector.insert.v8i32.v4i32(<8 x i32> %vec, <4 x i32> %subvec, i64 0)
+  ret <8 x i32> %result
+}
+
+define <8 x i32> @vector_insert_middle(<8 x i32> %vec, <2 x i32> %subvec) {
+; CHECK-LABEL: define <8 x i32> @vector_insert_middle(
+; CHECK-SAME: <8 x i32> [[VEC:%.*]], <2 x i32> [[SUBVEC:%.*]]) {
+; CHECK-NEXT:    [[TMP1:%.*]] = shufflevector <2 x i32> [[SUBVEC]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; CHECK-NEXT:    [[RESULT:%.*]] = shufflevector <8 x i32> [[VEC]], <8 x i32> [[TMP1]], <8 x i32> <i32 0, i32 1, i32 8, i32 9, i32 4, i32 5, i32 6, i32 7>
+; CHECK-NEXT:    ret <8 x i32> [[RESULT]]
+;
+  %result = call <8 x i32> @llvm.vector.insert.v8i32.v2i32(<8 x i32> %vec, <2 x i32> %subvec, i64 2)
+  ret <8 x i32> %result
+}
+
+define <8 x i32> @vector_insert_end(<8 x i32> %vec, <4 x i32> %subvec) {
+; CHECK-LABEL: define <8 x i32> @vector_insert_end(
+; CHECK-SAME: <8 x i32> [[VEC:%.*]], <4 x i32> [[SUBVEC:%.*]]) {
+; CHECK-NEXT:    [[TMP1:%.*]] = shufflevector <4 x i32> [[SUBVEC]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>
+; CHECK-NEXT:    [[RESULT:%.*]] = shufflevector <8 x i32> [[VEC]], <8 x i32> [[TMP1]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 8, i32 9, i32 10, i32 11>
+; CHECK-NEXT:    ret <8 x i32> [[RESULT]]
+;
+  %result = call <8 x i32> @llvm.vector.insert.v8i32.v4i32(<8 x i32> %vec, <4 x i32> %subvec, i64 4)
+  ret <8 x i32> %result
+}
+
+define <8 x i32> @vector_insert_overwrite(<8 x i32> %vec, <8 x i32> %subvec) {
+; CHECK-LABEL: define <8 x i32> @vector_insert_overwrite(
+; CHECK-SAME: <8 x i32> [[VEC:%.*]], <8 x i32> [[SUBVEC:%.*]]) {
+; CHECK-NEXT:    ret <8 x i32> [[SUBVEC]]
+;
+  %result = call <8 x i32> @llvm.vector.insert.v8i32.v8i32(<8 x i32> %vec, <8 x i32> %subvec, i64 0)
+  ret <8 x i32> %result
+}
+
+define <8 x i32> @vector_insert_single_element_at_end(<8 x i32> %vec, <1 x i32> %subvec) {
+; CHECK-LABEL: define <8 x i32> @vector_insert_single_element_at_end(
+; CHECK-SAME: <8 x i32> [[VEC:%.*]], <1 x i32> [[SUBVEC:%.*]]) {
+; CHECK-NEXT:    [[TMP1:%.*]] = shufflevector <1 x i32> [[SUBVEC]], <1 x i32> poison, <8 x i32> <i32 0, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison, i32 poison>
+; CHECK-NEXT:    [[RESULT:%.*]] = shufflevector <8 x i32> [[VEC]], <8 x i32> [[TMP1]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 8>
+; CHECK-NEXT:    ret <8 x i32> [[RESULT]]
+;
+  %result = call <8 x i32> @llvm.vector.insert.v8i32.v1i32(<8 x i32> %vec, <1 x i32> %subvec, i64 7)
+  ret <8 x i32> %result
+}
+
+define <vscale x 4 x i32> @vector_insert_no_fold_scalable(<vscale x 4 x i32> %vec, <2 x i32> %subvec) {
+; CHECK-LABEL: define <vscale x 4 x i32> @vector_insert_no_fold_scalable(
+; CHECK-SAME: <vscale x 4 x i32> [[VEC:%.*]], <2 x i32> [[SUBVEC:%.*]]) {
+; CHECK-NEXT:    [[RESULT:%.*]] = call <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.v2i32(<vscale x 4 x i32> [[VEC]], <2 x i32> [[SUBVEC]], i64 0)
+; CHECK-NEXT:    ret <vscale x 4 x i32> [[RESULT]]
+;
+  %result = call <vscale x 4 x i32> @llvm.vector.insert.nxv4i32.v2i32(<vscale x 4 x i32> %vec, <2 x i32> %subvec, i64 0)
+  ret <vscale x 4 x i32> %result
+}

@RKSimon self-requested a review on July 1, 2025 12:53
for (i = 0; i != SubVecNumElts; ++i)
WidenMask.push_back(i);
for (; i != VecNumElts; ++i)
WidenMask.push_back(PoisonMaskElem);
Collaborator:

Use std::iota to populate sequential shuffle masks like this:

SmallVector<int, 8> WidenMask(VecNumElts, PoisonMaskElem);
std::iota(WidenMask.begin(), WidenMask.begin() + SubVecNumElts, 0);

Author:

Updated to use std::iota for mask construction, thank you!

for (j = 0; j != SubVecNumElts; ++j)
Mask.push_back(DstNumElts + j);
for (i = SubVecIdx + SubVecNumElts; i != DstNumElts; ++i)
Mask.push_back(i);
Collaborator:

SmallVector<int, 8> WidenMask(DstNumElts);
std::iota(WidenMask.begin(), WidenMask.end(), 0);
std::iota(WidenMask.begin() + SubVecIdx, WidenMask.begin() + SubVecIdx + SubVecNumElts, DstNumElts);

Author:

Updated as well, thanks!

@RKSimon requested a review from preames on July 1, 2025 13:00
@laurenmchin force-pushed the fold-vector-insert-to-shuffle branch from 07a560b to 78a18d0 on July 1, 2025 15:54
// undefined.
SmallVector<int, 8> WidenMask(VecNumElts, PoisonMaskElem);
std::iota(WidenMask.begin(), WidenMask.begin() + SubVecNumElts, 0);
std::fill(WidenMask.begin() + SubVecNumElts, WidenMask.end(), PoisonMaskElem);
Collaborator:

you shouldn't need the std::fill as the WidenMask will have splatted PoisonMaskElem to all elements, and the std::iota will have just replaced the subvector slice


github-actions bot commented Jul 1, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff HEAD~1 HEAD --extensions cpp -- llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp llvm/lib/Transforms/Vectorize/VectorCombine.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
index 55c320103..bf4167cce 100644
--- a/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
+++ b/llvm/lib/Transforms/Vectorize/VectorCombine.cpp
@@ -857,8 +857,10 @@ bool VectorCombine::foldVectorInsertToShuffle(Instruction &I) {
 
   SmallVector<int, 8> Mask(DstNumElts);
   std::iota(Mask.begin(), Mask.begin() + SubVecIdx, 0);
-  std::iota(Mask.begin() + SubVecIdx, Mask.begin() + SubVecIdx + SubVecNumElts, DstNumElts);
-  std::iota(Mask.begin() + SubVecIdx + SubVecNumElts, Mask.end(), SubVecIdx + SubVecNumElts);
+  std::iota(Mask.begin() + SubVecIdx, Mask.begin() + SubVecIdx + SubVecNumElts,
+            DstNumElts);
+  std::iota(Mask.begin() + SubVecIdx + SubVecNumElts, Mask.end(),
+            SubVecIdx + SubVecNumElts);
 
   auto *Shuffle = Builder.CreateShuffleVector(Vec, WidenShuffle, Mask);
   replaceValue(I, *Shuffle);

SmallVector<int, 8> Mask(DstNumElts);
std::iota(Mask.begin(), Mask.begin() + SubVecIdx, 0);
std::iota(Mask.begin() + SubVecIdx, Mask.begin() + SubVecIdx + SubVecNumElts, DstNumElts);
std::iota(Mask.begin() + SubVecIdx + SubVecNumElts, Mask.end(), SubVecIdx + SubVecNumElts);
Collaborator:

won't this just write the same values as the first std::iota call?

Author:

Yes, you're right, thanks for pointing that out: the extra call was redundant, rewriting values an earlier std::iota had already set. I've updated it to use two std::iota calls, with the second overwriting the relevant subrange in place.


SmallVector<int, 8> Mask(DstNumElts);
std::iota(Mask.begin(), Mask.end(), 0);
std::iota(Mask.begin() + SubVecIdx, Mask.begin() + SubVecIdx + SubVecNumElts, DstNumElts);
Collaborator:

clang-format (80col)

Author:

Just ran clang-format and amended the previous commit to fix the 80-column issue. Thanks again for catching that.

unsigned SubVecIdx = SubVecPtr->getZExtValue();

// Ensure insertion of SubVec doesn't exceed Dst bounds.
if (SubVecIdx % SubVecNumElts != 0 || SubVecIdx + SubVecNumElts > DstNumElts)
Collaborator:

Avoid -Wparentheses warnings:

if ((SubVecIdx % SubVecNumElts) != 0 ||
    (SubVecIdx + SubVecNumElts) > DstNumElts)

Author:

Done.

@laurenmchin force-pushed the fold-vector-insert-to-shuffle branch from 127b9e3 to 8e146ab on July 1, 2025 17:45
@laurenmchin force-pushed the fold-vector-insert-to-shuffle branch from 8e146ab to 2abc2e3 on July 1, 2025 23:38
Successfully merging this pull request may close these issues.

[VectorCombine] Expand Intrinsic::vector_insert calls into shufflevectors
3 participants