Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VPlan] Compute induction end values in VPlan. #112145

Merged
merged 36 commits into from
Dec 29, 2024

Conversation

fhahn
Copy link
Contributor

@fhahn fhahn commented Oct 13, 2024

Use createDerivedIV to compute IV end values directly in VPlan, instead
of creating them up-front.

This allows updating IV users outside the loop as follow-up.

Depends on #110004 and
#109975.

Copy link

github-actions bot commented Oct 13, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

fhahn added a commit to fhahn/llvm-project that referenced this pull request Oct 13, 2024
Model updating IV users directly in VPlan, replace fixupIVUsers.

Depends on llvm#110004,
llvm#109975 and
llvm#112145.
@fhahn fhahn force-pushed the vplan-compute-iv-end-values branch 2 times, most recently from 3b414ba to 90c7f53 Compare October 24, 2024 18:56
@fhahn fhahn force-pushed the vplan-compute-iv-end-values branch 3 times, most recently from 944283e to 8001f75 Compare November 10, 2024 10:32
@fhahn fhahn force-pushed the vplan-compute-iv-end-values branch from 8001f75 to 087668b Compare November 24, 2024 19:54
fhahn added a commit that referenced this pull request Nov 24, 2024
Allow setting the name to use for the generated IR value of the derived
IV in preparations for #112145.

This is analogous to VPInstruction::Name.
@fhahn fhahn force-pushed the vplan-compute-iv-end-values branch from 087668b to e9ad7c2 Compare December 7, 2024 12:21
Use createDerivedIV to compute IV end values directly in VPlan, instead
of creating them up-front.

This allows updating IV users outside the loop as follow-up.

Depends on llvm#110004 and
llvm#109975.
@llvmbot
Copy link
Member

llvmbot commented Dec 7, 2024

@llvm/pr-subscribers-backend-risc-v
@llvm/pr-subscribers-vectorizers
@llvm/pr-subscribers-backend-powerpc
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-systemz

Author: Florian Hahn (fhahn)

Changes

Use createDerivedIV to compute IV end values directly in VPlan, instead
of creating them up-front.

This allows updating IV users outside the loop as follow-up.

Depends on #110004 and
#109975.


Patch is 203.94 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/112145.diff

54 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h (+2-2)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+130-104)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/conditional-branches-cost.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/fixed-order-recurrence.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/induction-costs-sve.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/induction-costs.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/low_trip_count_predicates.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/mul-simplification.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/reduction-recurrence-costs-sve.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/scalable-avoid-scalarization.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-epilog-vect.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-interleaved-accesses.ll (+8-8)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-live-out-pointer-induction.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-widen-phi.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopVectorize/PowerPC/exit-branch-cost.ll (+31-29)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/blend-any-of-reduction-cost.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/blocks-with-dead-instructions.ll (+7-7)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/dead-ops-cost.ll (+16-16)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/first-order-recurrence-scalable-vf1.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/induction-costs.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/masked_gather_scatter.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/strided-accesses.ll (+7-9)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-reverse-load-store.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-uniform-store.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/SystemZ/predicated-first-order-recurrence.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/cost-model.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/fixed-order-recurrence.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/induction-costs.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/optsize.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/pr109581-unused-blend.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/pr72969.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/scatter_crash.ll (+22-22)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/small-size.ll (+29-29)
  • (modified) llvm/test/Transforms/LoopVectorize/branch-weights.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/epilog-vectorization-any-of-reductions.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-chains.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence-complex.ll (+17-17)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll (+36-36)
  • (modified) llvm/test/Transforms/LoopVectorize/induction.ll (+26-26)
  • (modified) llvm/test/Transforms/LoopVectorize/interleaved-accesses.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/invariant-store-vectorization.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/pr59319-loop-access-info-invalidation.ll (+17-17)
  • (modified) llvm/test/Transforms/LoopVectorize/pr66616.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/reduction-align.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/reduction-inloop-cond.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/scalable-inductions.ll (+4-4)
  • (modified) llvm/test/Transforms/LoopVectorize/select-reduction.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/store-reduction-results-in-tail-folded-loop.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing.ll (+32-15)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index fbcf181a45a664..0298ab523307db 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -233,8 +233,8 @@ class VPBuilder {
 
   VPDerivedIVRecipe *createDerivedIV(InductionDescriptor::InductionKind Kind,
                                      FPMathOperator *FPBinOp, VPValue *Start,
-                                     VPCanonicalIVPHIRecipe *CanonicalIV,
-                                     VPValue *Step, const Twine &Name = "") {
+                                     VPValue *CanonicalIV, VPValue *Step,
+                                     const Twine &Name = "") {
     return tryInsertInstruction(
         new VPDerivedIVRecipe(Kind, FPBinOp, Start, CanonicalIV, Step, Name));
   }
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 4a4553e4a8db8d..1ee596502f9d44 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -513,21 +513,14 @@ class InnerLoopVectorizer {
   /// Fix the non-induction PHIs in \p Plan.
   void fixNonInductionPHIs(VPTransformState &State);
 
-  /// Create a ResumePHI VPInstruction for the induction \p InductionPhiIRI to
-  /// resume iteration count in the scalar epilogue from where the vectorized
-  /// loop left off, and add it to the scalar preheader of VPlan. Also creates
-  /// the induction resume value, and the value for the bypass block, if needed.
-  /// \p Step is the SCEV-expanded induction step to use. In cases where the
-  /// loop skeleton is more complicated (i.e., epilogue vectorization) and the
-  /// resume values can come from an additional bypass block,
-  /// \p MainVectorTripCount provides the trip count of the main vector loop,
-  /// used to compute the resume value reaching the scalar loop preheader
-  /// directly from this additional bypass block.
-  void createInductionResumeVPValue(VPIRInstruction *InductionPhiIRI,
-                                    const InductionDescriptor &ID, Value *Step,
-                                    ArrayRef<BasicBlock *> BypassBlocks,
-                                    VPBuilder &ScalarPHBuilder,
-                                    Value *MainVectorTripCount = nullptr);
+  /// Create the bypass resume value coming from the additional bypass block. \p
+  /// Step is the SCEV-expanded induction step to use. \p MainVectorTripCount
+  /// provides the trip count of the main vector loop, used to compute the
+  /// resume value reaching the scalar loop preheader directly from this
+  /// additional bypass block.
+  void createInductionBypassValue(PHINode *OrigPhi,
+                                  const InductionDescriptor &ID, Value *Step,
+                                  Value *MainVectorTripCount);
 
   /// Returns the original loop trip count.
   Value *getTripCount() const { return TripCount; }
@@ -584,17 +577,10 @@ class InnerLoopVectorizer {
   /// vector loop preheader, middle block and scalar preheader.
   void createVectorLoopSkeleton(StringRef Prefix);
 
-  /// Create new phi nodes for the induction variables to resume iteration count
-  /// in the scalar epilogue, from where the vectorized loop left off.
-  /// In cases where the loop skeleton is more complicated (i.e. epilogue
-  /// vectorization), \p MainVectorTripCount provides the trip count of the main
-  /// loop, used to compute these resume values. If \p IVSubset is provided, it
-  /// contains the phi nodes for which resume values are needed, because they
-  /// will generate wide induction phis in the epilogue loop.
-  void
-  createInductionResumeVPValues(const SCEV2ValueTy &ExpandedSCEVs,
-                                Value *MainVectorTripCount = nullptr,
-                                SmallPtrSetImpl<PHINode *> *IVSubset = nullptr);
+  /// Create values for the induction variables to resume iteration count
+  /// in the bypass block.
+  void createInductionBypassValues(const SCEV2ValueTy &ExpandedSCEVs,
+                                   Value *MainVectorTripCount);
 
   /// Allow subclasses to override and print debug traces before/after vplan
   /// execution, when trace information is requested.
@@ -2613,21 +2599,11 @@ void InnerLoopVectorizer::createVectorLoopSkeleton(StringRef Prefix) {
                  nullptr, Twine(Prefix) + "scalar.ph");
 }
 
-void InnerLoopVectorizer::createInductionResumeVPValue(
-    VPIRInstruction *InductionPhiRI, const InductionDescriptor &II, Value *Step,
-    ArrayRef<BasicBlock *> BypassBlocks, VPBuilder &ScalarPHBuilder,
+void InnerLoopVectorizer::createInductionBypassValue(
+    PHINode *OrigPhi, const InductionDescriptor &II, Value *Step,
     Value *MainVectorTripCount) {
-  // TODO: Move to LVP or general VPlan construction, once no IR values are
-  // generated.
-  auto *OrigPhi = cast<PHINode>(&InductionPhiRI->getInstruction());
-  Value *VectorTripCount = getOrCreateVectorTripCount(LoopVectorPreHeader);
-  assert(VectorTripCount && "Expected valid arguments");
-
   Instruction *OldInduction = Legal->getPrimaryInduction();
-  // For the primary induction the end values are known.
-  Value *EndValue = VectorTripCount;
   Value *EndValueFromAdditionalBypass = MainVectorTripCount;
-  // Otherwise compute them accordingly.
   if (OrigPhi != OldInduction) {
     IRBuilder<> B(LoopVectorPreHeader->getTerminator());
 
@@ -2635,10 +2611,6 @@ void InnerLoopVectorizer::createInductionResumeVPValue(
     if (isa_and_nonnull<FPMathOperator>(II.getInductionBinOp()))
       B.setFastMathFlags(II.getInductionBinOp()->getFastMathFlags());
 
-    EndValue = emitTransformedIndex(B, VectorTripCount, II.getStartValue(),
-                                    Step, II.getKind(), II.getInductionBinOp());
-    EndValue->setName("ind.end");
-
     // Compute the end value for the additional bypass (if applicable).
     if (MainVectorTripCount) {
       B.SetInsertPoint(getAdditionalBypassBlock(),
@@ -2650,22 +2622,12 @@ void InnerLoopVectorizer::createInductionResumeVPValue(
     }
   }
 
-  auto *ResumePhiRecipe = ScalarPHBuilder.createNaryOp(
-      VPInstruction::ResumePhi,
-      {Plan.getOrAddLiveIn(EndValue), Plan.getOrAddLiveIn(II.getStartValue())},
-      OrigPhi->getDebugLoc(), "bc.resume.val");
-  assert(InductionPhiRI->getNumOperands() == 0 &&
-         "InductionPhiRI should not have any operands");
-  InductionPhiRI->addOperand(ResumePhiRecipe);
-
-  if (EndValueFromAdditionalBypass) {
-    // Store the bypass value here, as it needs to be added as operand to its
-    // scalar preheader phi node after the epilogue skeleton has been created.
-    // TODO: Directly add as extra operand to the VPResumePHI recipe.
-    assert(!Induction2AdditionalBypassValue.contains(OrigPhi) &&
-           "entry for OrigPhi already exits");
-    Induction2AdditionalBypassValue[OrigPhi] = EndValueFromAdditionalBypass;
-  }
+  // Store the bypass value here, as it needs to be added as operand to its
+  // scalar preheader phi node after the epilogue skeleton has been created.
+  // TODO: Directly add as extra operand to the VPResumePHI recipe.
+  assert(!Induction2AdditionalBypassValue.contains(OrigPhi) &&
+         "entry for OrigPhi already exits");
+  Induction2AdditionalBypassValue[OrigPhi] = EndValueFromAdditionalBypass;
 }
 
 /// Return the expanded step for \p ID using \p ExpandedSCEVs to look up SCEV
@@ -2682,29 +2644,15 @@ static Value *getExpandedStep(const InductionDescriptor &ID,
   return I->second;
 }
 
-void InnerLoopVectorizer::createInductionResumeVPValues(
-    const SCEV2ValueTy &ExpandedSCEVs, Value *MainVectorTripCount,
-    SmallPtrSetImpl<PHINode *> *IVSubset) {
-  // We are going to resume the execution of the scalar loop.
-  // Go over all of the induction variable PHIs of the scalar loop header and
-  // fix their starting values, which depend on the counter of the last
-  // iteration of the vectorized loop. If we come from one of the
-  // LoopBypassBlocks then we need to start from the original start value.
-  // Otherwise we provide the trip count from the main vector loop.
-  VPBasicBlock *ScalarPHVPBB = Plan.getScalarPreheader();
-  VPBuilder ScalarPHBuilder(ScalarPHVPBB, ScalarPHVPBB->begin());
-  for (VPRecipeBase &R : *Plan.getScalarHeader()) {
-    auto *PhiR = cast<VPIRInstruction>(&R);
-    auto *Phi = dyn_cast<PHINode>(&PhiR->getInstruction());
-    if (!Phi)
-      break;
-    if (!Legal->getInductionVars().contains(Phi) ||
-        (IVSubset && !IVSubset->contains(Phi)))
-      continue;
-    const InductionDescriptor &II = Legal->getInductionVars().find(Phi)->second;
-    createInductionResumeVPValue(PhiR, II, getExpandedStep(II, ExpandedSCEVs),
-                                 LoopBypassBlocks, ScalarPHBuilder,
-                                 MainVectorTripCount);
+void InnerLoopVectorizer::createInductionBypassValues(
+    const SCEV2ValueTy &ExpandedSCEVs, Value *MainVectorTripCount) {
+  assert(MainVectorTripCount && "Must have bypass information");
+
+  for (const auto &InductionEntry : Legal->getInductionVars()) {
+    PHINode *OrigPhi = InductionEntry.first;
+    const InductionDescriptor &II = InductionEntry.second;
+    createInductionBypassValue(OrigPhi, II, getExpandedStep(II, ExpandedSCEVs),
+                               MainVectorTripCount);
   }
 }
 
@@ -2766,8 +2714,8 @@ InnerLoopVectorizer::createVectorizedLoopSkeleton(
   // faster.
   emitMemRuntimeChecks(LoopScalarPreHeader);
 
-  // Emit phis for the new starting index of the scalar loop.
-  createInductionResumeVPValues(ExpandedSCEVs);
+  Value *VectorTripCount = getOrCreateVectorTripCount(LoopVectorPreHeader);
+  assert(VectorTripCount && "Expected valid arguments");
 
   return {LoopVectorPreHeader, nullptr};
 }
@@ -7848,19 +7796,6 @@ EpilogueVectorizerMainLoop::createEpilogueVectorizedLoopSkeleton(
   // Generate the induction variable.
   EPI.VectorTripCount = getOrCreateVectorTripCount(LoopVectorPreHeader);
 
-  // Generate VPValues and ResumePhi recipes for wide inductions in the epilogue
-  // plan only. Other inductions only need a resume value for the canonical
-  // induction, which will get created during epilogue skeleton construction.
-  SmallPtrSet<PHINode *, 4> WideIVs;
-  for (VPRecipeBase &H :
-       EPI.EpiloguePlan.getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
-    if (auto *WideIV = dyn_cast<VPWidenIntOrFpInductionRecipe>(&H))
-      WideIVs.insert(WideIV->getPHINode());
-    else if (auto *PtrIV = dyn_cast<VPWidenPointerInductionRecipe>(&H))
-      WideIVs.insert(cast<PHINode>(PtrIV->getUnderlyingValue()));
-  }
-  createInductionResumeVPValues(ExpandedSCEVs, nullptr, &WideIVs);
-
   return {LoopVectorPreHeader, nullptr};
 }
 
@@ -8049,13 +7984,14 @@ EpilogueVectorizerEpilogueLoop::createEpilogueVectorizedLoopSkeleton(
     EPResumeVal->addIncoming(Init, EPI.MainLoopIterationCountCheck);
   }
 
-  // Generate induction resume values. These variables save the new starting
-  // indexes for the scalar loop. They are used to test if there are any tail
-  // iterations left once the vector loop has completed.
-  // Note that when the vectorized epilogue is skipped due to iteration count
-  // check, then the resume value for the induction variable comes from
-  // the trip count of the main vector loop, passed as the second argument.
-  createInductionResumeVPValues(ExpandedSCEVs, EPI.VectorTripCount);
+  Value *VectorTripCount = getOrCreateVectorTripCount(LoopVectorPreHeader);
+  assert(VectorTripCount && "Expected valid arguments");
+
+  // Generate bypass values from the bypass blocks. Note that when the
+  // vectorized epilogue is skipped due to iteration count check, then the
+  // resume value for the induction variable comes from the trip count of the
+  // main vector loop, passed as the second argument.
+  createInductionBypassValues(ExpandedSCEVs, EPI.VectorTripCount);
 
   return {LoopVectorPreHeader, EPResumeVal};
 }
@@ -8858,13 +8794,64 @@ static void addCanonicalIVRecipes(VPlan &Plan, Type *IdxTy, bool HasNUW,
                        {CanonicalIVIncrement, &Plan.getVectorTripCount()}, DL);
 }
 
+static VPValue *addResumeValuesForInduction(VPHeaderPHIRecipe *PhiR,
+                                            VPBuilder &Builder,
+                                            VPBuilder &ScalarPHBuilder,
+                                            VPTypeAnalysis &TypeInfo,
+                                            VPValue *VectorTC) {
+  PHINode *OrigPhi;
+  const InductionDescriptor *ID;
+  VPValue *Start;
+  VPValue *Step;
+  Type *ScalarTy;
+  bool IsCanonical = false;
+  if (auto *WideIV = dyn_cast<VPWidenIntOrFpInductionRecipe>(PhiR)) {
+    if (WideIV->getTruncInst())
+      return nullptr;
+    OrigPhi = cast<PHINode>(WideIV->getUnderlyingValue());
+    ID = &WideIV->getInductionDescriptor();
+    Start = WideIV->getStartValue();
+    Step = WideIV->getStepValue();
+    ScalarTy = WideIV->getScalarType();
+    IsCanonical = WideIV->isCanonical();
+  } else if (auto *WideIV = dyn_cast<VPWidenPointerInductionRecipe>(PhiR)) {
+    OrigPhi = cast<PHINode>(WideIV->getUnderlyingValue());
+    ID = &WideIV->getInductionDescriptor();
+    Start = WideIV->getStartValue();
+    Step = WideIV->getOperand(1);
+    ScalarTy = Start->getLiveInIRValue()->getType();
+  } else {
+    return nullptr;
+  }
+
+  VPValue *EndValue = VectorTC;
+  if (!IsCanonical) {
+    EndValue = Builder.createDerivedIV(
+        ID->getKind(),
+        dyn_cast_or_null<FPMathOperator>(ID->getInductionBinOp()), Start,
+        VectorTC, Step);
+  }
+
+  if (ScalarTy != TypeInfo.inferScalarType(EndValue)) {
+    EndValue = Builder.createScalarCast(Instruction::Trunc, EndValue, ScalarTy);
+  }
+
+  auto *ResumePhiRecipe =
+      ScalarPHBuilder.createNaryOp(VPInstruction::ResumePhi, {EndValue, Start},
+                                   OrigPhi->getDebugLoc(), "bc.resume.val");
+  return ResumePhiRecipe;
+}
+
 /// Create resume phis in the scalar preheader for first-order recurrences and
 /// reductions and update the VPIRInstructions wrapping the original phis in the
 /// scalar header.
 static void addScalarResumePhis(VPRecipeBuilder &Builder, VPlan &Plan) {
+  VPTypeAnalysis TypeInfo(Plan.getCanonicalIV()->getScalarType());
   auto *ScalarPH = Plan.getScalarPreheader();
   auto *MiddleVPBB = cast<VPBasicBlock>(ScalarPH->getSinglePredecessor());
   VPBuilder ScalarPHBuilder(ScalarPH);
+  VPBuilder VectorPHBuilder(
+      cast<VPBasicBlock>(Plan.getVectorLoopRegion()->getSinglePredecessor()));
   VPBuilder MiddleBuilder(MiddleVPBB, MiddleVPBB->getFirstNonPhi());
   VPValue *OneVPV = Plan.getOrAddLiveIn(
       ConstantInt::get(Plan.getCanonicalIV()->getScalarType(), 1));
@@ -8874,6 +8861,13 @@ static void addScalarResumePhis(VPRecipeBuilder &Builder, VPlan &Plan) {
     if (!ScalarPhiI)
       break;
     auto *VectorPhiR = cast<VPHeaderPHIRecipe>(Builder.getRecipe(ScalarPhiI));
+
+    if (VPValue *ResumePhi = addResumeValuesForInduction(
+            VectorPhiR, VectorPHBuilder, ScalarPHBuilder, TypeInfo,
+            &Plan.getVectorTripCount())) {
+      ScalarPhiIRI->addOperand(ResumePhi);
+      continue;
+    }
     if (!isa<VPFirstOrderRecurrencePHIRecipe, VPReductionPHIRecipe>(VectorPhiR))
       continue;
     // The backedge value provides the value to resume coming out of a loop,
@@ -9635,7 +9629,6 @@ void VPDerivedIVRecipe::execute(VPTransformState &State) {
       State.Builder, CanonicalIV, getStartValue()->getLiveInIRValue(), Step,
       Kind, cast_if_present<BinaryOperator>(FPBinOp));
   DerivedIV->setName(Name);
-  assert(DerivedIV != CanonicalIV && "IV didn't need transforming?");
 
   State.set(this, DerivedIV, VPLane(0));
 }
@@ -10299,6 +10292,39 @@ bool LoopVectorizePass::processLoop(Loop *L) {
                                            EPI, &LVL, &CM, BFI, PSI, Checks,
                                            *BestMainPlan);
 
+        // Collect PHI nodes of wide inductions in the VPlan for the epilogue.
+        // Those will need their resume-values computed from the main vector
+        // loop. Others can be removed in the main VPlan.
+        SmallPtrSet<PHINode *, 2> WidenedPhis;
+        for (VPRecipeBase &R :
+             BestEpiPlan.getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
+          if (!isa<VPWidenIntOrFpInductionRecipe,
+                   VPWidenPointerInductionRecipe>(&R))
+            continue;
+          if (isa<VPWidenIntOrFpInductionRecipe>(&R))
+            WidenedPhis.insert(
+                cast<VPWidenIntOrFpInductionRecipe>(&R)->getPHINode());
+          else
+            WidenedPhis.insert(
+                cast<PHINode>(R.getVPSingleValue()->getUnderlyingValue()));
+        }
+        for (VPRecipeBase &R :
+             *cast<VPIRBasicBlock>(BestMainPlan->getScalarHeader())) {
+          auto *VPIRInst = cast<VPIRInstruction>(&R);
+          auto *IRI = dyn_cast<PHINode>(&VPIRInst->getInstruction());
+          if (!IRI)
+            break;
+          if (WidenedPhis.contains(IRI) ||
+              !LVL.getInductionVars().contains(IRI))
+            continue;
+          VPRecipeBase *ResumePhi =
+              VPIRInst->getOperand(0)->getDefiningRecipe();
+          VPIRInst->setOperand(0, BestMainPlan->getOrAddLiveIn(
+                                      Constant::getNullValue(IRI->getType())));
+          ResumePhi->eraseFromParent();
+        }
+        VPlanTransforms::removeDeadRecipes(*BestMainPlan);
+
         auto ExpandedSCEVs = LVP.executePlan(EPI.MainLoopVF, EPI.MainLoopUF,
                                              *BestMainPlan, MainILV, DT, false);
         ++LoopsVectorized;
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index b438700a3fe2ce..9e3567d04d2332 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -64,6 +64,7 @@ bool VPRecipeBase::mayWriteToMemory() const {
     case VPInstruction::FirstOrderRecurrenceSplice:
     case VPInstruction::LogicalAnd:
     case VPInstruction::PtrAdd:
+    case VPInstruction::ResumePhi:
       return false;
     default:
       return true;
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll b/llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll
index 1948211858d446..f1191fab8350c0 100644
--- a/llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/clamped-trip-count.ll
@@ -13,9 +13,9 @@ define void @clamped_tc_8(ptr nocapture %dst, i32 %n, i64 %val) vscale_range(1,1
 ; CHECK-NEXT:    [[N_RND_UP:%.*]] = add i64 8, [[TMP4]]
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
-; CHECK-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[DST]], i64 [[N_VEC]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 8
+; CHECK-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[DST]], i64 [[N_VEC]]
 ; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 8)
 ; CHECK-NEXT:    [[TMP8:%.*]] = call <vscale x 8 x i64> @llvm.stepvector.nxv8i64()
 ; CHECK-NEXT:    [[TMP9:%.*]] = add <vscale x 8 x i64> [[TMP8]], zeroinitializer
@@ -100,9 +100,9 @@ define void @clamped_tc_max_8(ptr nocapture %dst, i32 %n, i64 %val) vscale_range
 ; CHECK-NEXT:    [[N_RND_UP:%.*]] = add i64 [[WIDE_TRIP_COUNT]], [[TMP4]]
 ; CHECK-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
 ; CHECK-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
-; CHECK-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[DST]], i64 [[N_VEC]]
 ; CHECK-NEXT:    [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-NEXT:    [[TMP6:%.*]] = mul i64 [[TMP5]], 8
+; CHECK-NEXT:    [[IND_END:%.*]] = getelementptr i8, ptr [[DST]], i64 [[N_VEC]]
 ; CHECK-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[WIDE_TRIP_COUNT]])
 ; CHECK-NEXT:    [[TMP8:%.*]] = call <vscale x 8 x i64> @llvm.stepvector.nxv8i64()
 ; CHECK-NEXT:    [[TMP9:%.*]] = add <vscale x 8 x i64> [[TMP8]], zeroinitializer
diff...
[truncated]

@fhahn
Copy link
Contributor Author

fhahn commented Dec 7, 2024

Now that #110577 landed, this should be ready for review.

Some VPlan tests still need updating, but the relevant IR tests have been updated already. It still needs some updates for outer loop vectorization.

Copy link
Collaborator

@ayalz ayalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice step leveraging ResumePhi and DerivedIV recipes!
Patch should be marked NFC?

@@ -64,6 +64,7 @@ bool VPRecipeBase::mayWriteToMemory() const {
case VPInstruction::FirstOrderRecurrenceSplice:
case VPInstruction::LogicalAnd:
case VPInstruction::PtrAdd:
case VPInstruction::ResumePhi:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Independent fix, but testable only now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually had to undo this change; we have to create a ResumePhi for the canonical IV in the main plan, which won't have any users but cannot be cleaned up

Instruction *OldInduction = Legal->getPrimaryInduction();
// For the primary induction the end values are known.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment still applies, to additional end value only:

// For the primary induction the additional bypass end value is known. Otherwise it is computed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added back, thanks

Comment on lines 580 to 581
/// Create values for the induction variables to resume iteration count
/// in the bypass block.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Create values for the induction variables to resume iteration count
/// in the bypass block.
/// Create and record the values for induction variables to resume coming from the additional bypass block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks

/// provides the trip count of the main vector loop, used to compute the
/// resume value reaching the scalar loop preheader directly from this
/// additional bypass block.
void createInductionBypassValue(PHINode *OrigPhi,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void createInductionBypassValue(PHINode *OrigPhi,
void createInductionAdditionalBypassValue(PHINode *OrigPhi,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks

ArrayRef<BasicBlock *> BypassBlocks,
VPBuilder &ScalarPHBuilder,
Value *MainVectorTripCount = nullptr);
/// Create the bypass resume value coming from the additional bypass block. \p
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Create the bypass resume value coming from the additional bypass block. \p
/// Create and record the bypass resume value for an induction Phi coming from the additional bypass block. \p

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks

@@ -9635,7 +9629,6 @@ void VPDerivedIVRecipe::execute(VPTransformState &State) {
State.Builder, CanonicalIV, getStartValue()->getLiveInIRValue(), Step,
Kind, cast_if_present<BinaryOperator>(FPBinOp));
DerivedIV->setName(Name);
assert(DerivedIV != CanonicalIV && "IV didn't need transforming?");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such redundant recipes should be cleaned up, if generated at all?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned up in the latest version, except for the case where the vector trip count is constant null, which will only be set in prepareToExecute.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and when Index=VTC=0 it may result in DerivedIV==Index?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep

Comment on lines 10295 to 10327
// Collect PHI nodes of wide inductions in the VPlan for the epilogue.
// Those will need their resume-values computed from the main vector
// loop. Others can be removed in the main VPlan.
SmallPtrSet<PHINode *, 2> WidenedPhis;
for (VPRecipeBase &R :
BestEpiPlan.getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
if (!isa<VPWidenIntOrFpInductionRecipe,
VPWidenPointerInductionRecipe>(&R))
continue;
if (isa<VPWidenIntOrFpInductionRecipe>(&R))
WidenedPhis.insert(
cast<VPWidenIntOrFpInductionRecipe>(&R)->getPHINode());
else
WidenedPhis.insert(
cast<PHINode>(R.getVPSingleValue()->getUnderlyingValue()));
}
for (VPRecipeBase &R :
*cast<VPIRBasicBlock>(BestMainPlan->getScalarHeader())) {
auto *VPIRInst = cast<VPIRInstruction>(&R);
auto *IRI = dyn_cast<PHINode>(&VPIRInst->getInstruction());
if (!IRI)
break;
if (WidenedPhis.contains(IRI) ||
!LVL.getInductionVars().contains(IRI))
continue;
VPRecipeBase *ResumePhi =
VPIRInst->getOperand(0)->getDefiningRecipe();
VPIRInst->setOperand(0, BestMainPlan->getOrAddLiveIn(
Constant::getNullValue(IRI->getType())));
ResumePhi->eraseFromParent();
}
VPlanTransforms::removeDeadRecipes(*BestMainPlan);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be outlined and/or placed elsewhere, too detailed for processLoop() which is already too long and doing too much.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outlined, thanks!

@@ -10299,6 +10292,39 @@ bool LoopVectorizePass::processLoop(Loop *L) {
EPI, &LVL, &CM, BFI, PSI, Checks,
*BestMainPlan);

// Collect PHI nodes of wide inductions in the VPlan for the epilogue.
// Those will need their resume-values computed from the main vector
// loop. Others can be removed in the main VPlan.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we only need resume values from the main plan, if there is a corresponding wide induction in the epilogue plan.

Comment on lines 10320 to 10324
VPRecipeBase *ResumePhi =
VPIRInst->getOperand(0)->getDefiningRecipe();
VPIRInst->setOperand(0, BestMainPlan->getOrAddLiveIn(
Constant::getNullValue(IRI->getType())));
ResumePhi->eraseFromParent();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean up redundant resume Phi recipes created for inductions that were not widened? VPlan's DCE cannot collect them because of cross-VPlan def-use?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, unfortunately the latest version cannot use VPlan DCE, because we also need to add a ResumePHI for the canonical IV in the main vector loop which. won't have any uses in the main plan. Hence manual removal here.

Comment on lines 10322 to 10323
VPIRInst->setOperand(0, BestMainPlan->getOrAddLiveIn(
Constant::getNullValue(IRI->getType())));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deserves some explanation/comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added thanks!

@@ -74,6 +74,7 @@ define void @low_vf_ic_is_better(ptr nocapture noundef %p, i32 %tc, i16 noundef
; CHECK-VS1-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP3]], [[N_MOD_VF]]
; CHECK-VS1-NEXT: [[TMP18:%.*]] = call i64 @llvm.vscale.i64()
; CHECK-VS1-NEXT: [[TMP19:%.*]] = mul i64 [[TMP18]], 16
; CHECK-VS1-NEXT: [[TMP40:%.*]] = add i64 [[TMP0]], [[N_VEC]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused instruction?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes unfortunately now we create the end VPValues during VPlan construction and now during the legacy epilogue skeleton code generation there's no convenient way to remove them for the epilogue only case I think

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth recording a FIXME somewhere to handle when epilog and main are covered by one VPlan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added TODO, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be gone now, thanks

const InductionDescriptor &ID = WideIV->getInductionDescriptor();
Type *ScalarTy = TypeInfo.inferScalarType(WideIV);
bool IsCanonical = false;
if (auto *WideIV = dyn_cast<VPWidenIntOrFpInductionRecipe>(PhiR)) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (auto *WideIV = dyn_cast<VPWidenIntOrFpInductionRecipe>(PhiR)) {
if (auto *WideIntOrFP = dyn_cast<VPWidenIntOrFpInductionRecipe>(PhiR)) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks

Comment on lines 8857 to 8858
// Truncated wide inductions resume from the last lane of their vector value
// in the last vector iteration.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... which is handled elsewhere (so early return null)? Rather than precompute (which is done here).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks

}

VPValue *EndValue = VectorTC;
if (!IsCanonical) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better positioned next to setting IsCanonical above, and folded into if (WideIV->isCanonical())?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it can be moved, we need to compute the end value for all cases except wide canonical IVs

/// reductions and update the VPIRInstructions wrapping the original phis in the
/// scalar header.
static void addScalarResumePhis(VPRecipeBuilder &Builder, VPlan &Plan) {
/// Create a ResumePhi for \p PhiR, if it is wide induction recipe. If the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Create a ResumePhi for \p PhiR, if it is wide induction recipe. If the
/// Create and return a ResumePhi for \p PhiR, if it is wide induction recipe. If the

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks

Start, VectorTC, Step);
}

// EndValue is based on the vector trip count (which has the same type as the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// EndValue is based on the vector trip count (which has the same type as the
// EndValue is derived from the vector trip count (which has the same type as the

consistent with being a DerivedIV.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks

continue;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better position addResumeValuesForInduction() above under the condition for inductions below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks

if (!isa<VPFirstOrderRecurrencePHIRecipe, VPReductionPHIRecipe>(
VectorPhiR)) {
assert(cast<VPWidenIntOrFpInductionRecipe>(VectorPhiR)->getTruncInst() &&
"should only skip truncated wide inductions");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better also handle truncated wide inductions here? Possibly follow-up.

OTOH, better to first handle (legalize) all loop live-outs the same - using the last value computed in the loop, followed by a VPlan2VPlan pass (optimize) that converts non-truncating inductions to use precomputed values instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO, thanks

@@ -9635,7 +9629,6 @@ void VPDerivedIVRecipe::execute(VPTransformState &State) {
State.Builder, CanonicalIV, getStartValue()->getLiveInIRValue(), Step,
Kind, cast_if_present<BinaryOperator>(FPBinOp));
DerivedIV->setName(Name);
assert(DerivedIV != CanonicalIV && "IV didn't need transforming?");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and when Index=VTC=0 it may result in DerivedIV==Index?

DerivedIV->setName(Name);
assert(DerivedIV != CanonicalIV && "IV didn't need transforming?");

// Index may only be set to constant 0 in prepareToExecute.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Index may only be set to constant 0 in prepareToExecute.
// Index may only be set to constant 0 temporarily in prepareToExecute.

is this indeed temporarily?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, we only create the expression for the vector trip count late and set it there.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still seem to be missing something here. The fact that prepareToExecute() creates the vector trip count (typically to non zero?) implies that derived IVs that are based on it will miss the opportunity to be folded by simplifyRecipes(), because it's too late.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that is the issue; VectorTripCountV is only created during skeleton creation and set as underlying value here, so we miss this opportunity to fold. I adjusted the assert to check if the index is the vector trip count instead and expanded the comment.

Added a TODO to remove this once we construct VTC in VPlan, and can use it for simplifications.

@@ -9439,6 +9418,20 @@ VPlanPtr LoopVectorizationPlanner::buildVPlan(VFRange &Range) {
bool HasNUW = true;
addCanonicalIVRecipes(*Plan, Legal->getWidestInductionType(), HasNUW,
DebugLoc());

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to reduce the disparity between native and legacy VPlan constructions (ultimately converging). Perhaps have the former also record Ingredient2Recipe for widened inductions it creates during VPInstructionsToVPRecipes(), to then use getRecipe(P) as above, rather than searching linearly for each as below?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would something like this

  VPRecipeBuilder RecipeBuilder(...); // As in tryToBuildVPlanWithVPRecipes() following addCanonicalIVRecipes();
  for (auto &R : Plan->getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
    auto *HeaderR = cast<VPHeaderPHIRecipe>(&R);
    RecipeBuilder.setRecipe(HeaderR->getUnderlyingInstr(), HeaderR);
  }
  addScalarResumePhis(*Plan, RecipeBuilder);

work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, adjusted, thanks!

B.setFastMathFlags(II.getInductionBinOp()->getFastMathFlags());

// Compute the end value for the additional bypass.
B.SetInsertPoint(getAdditionalBypassBlock(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to set two builders at the outset (VectorPHBuilder, AdditionalBypassBuilder)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is only a single builder (in the bypass block), moved, thanks!

auto *WideIV = dyn_cast<VPWidenInductionRecipe>(PhiR);
if (!WideIV)
return nullptr;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto *WideIntOrFp = dyn_cast<VPWidenIntOrFpInductionRecipe>(WideIV));
// Truncated wide inductions resume from the last lane of their vector value
// in the last vector iteration which is handled elsewhere.
if (WidenIntOrFp && WideIntOrFp->getTruncInst())
return nullptr;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted, thanks!


VPValue *Start = WideIV->getStartValue();
VPValue *Step = WideIV->getStepValue();
const InductionDescriptor &ID = WideIV->getInductionDescriptor();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const InductionDescriptor &ID = WideIV->getInductionDescriptor();
const InductionDescriptor &ID = WideIV->getInductionDescriptor();
VPValue *EndValue = VectorTC;
if (!WidenIntOrFp || !WidenIntOrFp->IsCanonical()) {
EndValue = VectorPHBuilder.createDerivedIV(
ID.getKind(), dyn_cast_or_null<FPMathOperator>(ID.getInductionBinOp()),
Start, VectorTC, Step);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusted, thanks!

@@ -9481,7 +9445,9 @@ LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(VFRange &Range) {
VPlanTransforms::handleUncountableEarlyExit(
*Plan, *PSE.getSE(), OrigLoop, UncountableExitingBlock, RecipeBuilder);
}
addScalarResumePhis(RecipeBuilder, *Plan);
addScalarResumePhis(*Plan, [&RecipeBuilder](PHINode *P) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simpler to continue pass RecipeBuilder into addScalarResumePhis()? The latter is now also being called by native buildVPlan(), to support induction resumes (right?), so it would need to construct a RecipeBuilder and have it record the mapping from header phis to their recipes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

@@ -9439,6 +9418,20 @@ VPlanPtr LoopVectorizationPlanner::buildVPlan(VFRange &Range) {
bool HasNUW = true;
addCanonicalIVRecipes(*Plan, Legal->getWidestInductionType(), HasNUW,
DebugLoc());

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would something like this

  VPRecipeBuilder RecipeBuilder(...); // As in tryToBuildVPlanWithVPRecipes() following addCanonicalIVRecipes();
  for (auto &R : Plan->getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
    auto *HeaderR = cast<VPHeaderPHIRecipe>(&R);
    RecipeBuilder.setRecipe(HeaderR->getUnderlyingInstr(), HeaderR);
  }
  addScalarResumePhis(*Plan, RecipeBuilder);

work?

Comment on lines 195 to 196
; CHECK: attributes #[[ATTR0:[0-9]+]] = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
;.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, run the script with an unneeded argument, removed, thanks

@@ -348,7 +347,7 @@ define void @single_stride_int_iv(ptr %p, i64 %stride) {
; NOSTRIDED-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[SCALAR_PH]]
; NOSTRIDED: scalar.ph:
; NOSTRIDED-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[VECTOR_SCEVCHECK]] ], [ 0, [[ENTRY:%.*]] ]
; NOSTRIDED-NEXT: [[BC_RESUME_VAL1:%.*]] = phi i64 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 0, [[VECTOR_SCEVCHECK]] ], [ 0, [[ENTRY]] ]
; NOSTRIDED-NEXT: [[BC_RESUME_VAL1:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[VECTOR_SCEVCHECK]] ], [ 0, [[ENTRY]] ]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change to use N_VEC instead of N_VEC * STRIDE is based on scevcheck for unit stride, saving a mul instruction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep

@@ -231,12 +231,15 @@ class VPBuilder {
new VPInstruction(Ptr, Offset, GEPNoWrapFlags::inBounds(), DL, Name));
}

/// Convert the input value \p Current to the corresponding value of an
/// induction with different start and step values, using Start + Current *
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// induction with different start and step values, using Start + Current *
/// induction with \p Start and \p Step values, using Start + Current *

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks!

@@ -9209,9 +9173,9 @@ addUsersInExitBlocks(VPlan &Plan,
static void addExitUsersForFirstOrderRecurrences(
VPlan &Plan, SetVector<VPIRInstruction *> &ExitUsersToFix) {
VPRegionBlock *VectorRegion = Plan.getVectorLoopRegion();
auto *ScalarPHVPBB = Plan.getScalarPreheader();
auto *MainScalarPH = Plan.getScalarPreheader();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto *MainScalarPH = Plan.getScalarPreheader();
auto *ScalarPHVPBB = Plan.getScalarPreheader();

ahh, here there's only one loop/plan, rather than having both main and epilog.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Undone, thanks

DerivedIV->setName(Name);
assert(DerivedIV != CanonicalIV && "IV didn't need transforming?");

// Index may only be set to constant 0 in prepareToExecute.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still seem to be missing something here. The fact that prepareToExecute() creates the vector trip count (typically to non zero?) implies that derived IVs that are based on it will miss the opportunity to be folded by simplifyRecipes(), because it's too late.

fhahn added a commit that referenced this pull request Dec 22, 2024
Split DerivedIV simplification off from
#112145 and use to remove the
need for extra checks in createScalarIVSteps. Required an extra
simplification run after IV transforms.
Copy link
Contributor Author

@fhahn fhahn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split off the changes to simplify DerivedIV to e1833e3 and use it to simplify createScalarIVSteps

Comment on lines 2674 to 2676
if (isa_and_nonnull<FPMathOperator>(II.getInductionBinOp()))
BypassBuilder.setFastMathFlags(
II.getInductionBinOp()->getFastMathFlags());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
if (isa_and_nonnull<FPMathOperator>(II.getInductionBinOp()))
BypassBuilder.setFastMathFlags(
II.getInductionBinOp()->getFastMathFlags());
auto *BinOp = II.getInductionBinOp();
if (isa_and_nonnull<FPMathOperator>(BinOp))
BypassBuilder.setFastMathFlags(BinOp->getFastMathFlags());

plus additional use below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks

Comment on lines 8919 to 8921
if (isa<VPWidenInductionRecipe>(VectorPhiR)) {
if (VPValue *ResumePhi = addResumePhiRecipeForInduction(
VectorPhiR, VectorPHBuilder, ScalarPHBuilder, TypeInfo,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (isa<VPWidenInductionRecipe>(VectorPhiR)) {
if (VPValue *ResumePhi = addResumePhiRecipeForInduction(
VectorPhiR, VectorPHBuilder, ScalarPHBuilder, TypeInfo,
if (auto *WideIVR = dyn_cast<VPWidenInductionRecipe>(VectorPhiR)) {
if (VPValue *ResumePhi = addResumePhiRecipeForInduction(
WideIVR, VectorPHBuilder, ScalarPHBuilder, TypeInfo,

instead of the dyn_cast in addResumePhiRecipeForInduction().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done thanks


// If index is the vector trip count, the concrete value will only be set in
// prepareToExecute, leading to missed simplifications, e.g. if it is 0.
// TODO: Remove the special case for the vector trip count once it is computed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is clear.

Knowing VTC is zero at compile-time suggests more simplification than eliminating redundant DeriveIV's ... ideally VF's (and UF's) should be restricted so as to prevent this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is slightly counter-intuitive, VTC == 0 can only happen when the original BTC == -1, but should be handled correctly with checking the canonical IV increment

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So prepareToExecute() sets the value of VTC too late for any folding to take place. Treating overflow of VTC = max_uint BTC + 1 = 0 by skipping the vector loop, is clear. Question is if we can avoid vectorizing altogether, knowing this is the case at compile-time, considering the effort of computeMaxVF()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, will look into that separately, thanks

VPRecipeBase *ResumePhi = VPIRInst->getOperand(0)->getDefiningRecipe();
VPIRInst->setOperand(
0, MainPlan.getOrAddLiveIn(Constant::getNullValue(IRI->getType())));
ResumePhi->eraseFromParent();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps worth noting here (also) that this erasure should be followed by dce.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a run here; this cleans up the redundant IV computations discussed in one of the tests

@@ -11,6 +11,7 @@ target triple = "aarch64-unknown-linux-gnu"
; VPLANS-LABEL: Checking a loop in 'simple_memset'
; VPLANS: VPlan 'Initial VPlan for VF={vscale x 1,vscale x 2,vscale x 4},UF>=1' {
; VPLANS-NEXT: Live-in vp<[[VFxUF:%.+]]> = VF * UF
; VPLANS-NEXT: Live-in vp<[[VTC:%.+]]> = vector-trip-count
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless considered irrelevant for the test, in which case this check should (continue to) be dropped?
Checking only the def w/o any use is somewhat inconsistent, but there are many test changes...

Comment on lines 10083 to 10084
VPIRInst->setOperand(
0, MainPlan.getOrAddLiveIn(Constant::getNullValue(IRI->getType())));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VPIRinst wraps the header phi in the scalar loop, so it is probaby better to keep it here.

Is this operand needed, set artificially to zero (can convey a type... ;-), or should it best be removed, keeping VPIRInst only as a wrapper of the header phi?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment it is needed, otherwise it would complicate VPIRInstruction::execute; but in the future we can disconnect the scalar header for the main vector loop, thus removing the need to update them there. (There is code elsewhere that use the scalar phi to get the resume value from the main vector loop)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... but in the future we can disconnect the scalar header for the main vector loop ...

When can this scalar header be disconnected? Presumably it is needed for phi's that do have wide users in EpiPlan, so could be dropped altogether (or not constructed at all) if and only if all phi's are free of such users?

Can VPIRInst be erased as well (perhaps with make_early_inc_range)?

(thought I responded but cannot see the response here): VPIRinst wraps the header phi in the scalar loop, so it is probably better to keep it here.

A VPIRInst wraps an IRInst in order to (a) add operands to phi's and/or (b) support placing non-VPIRInst recipes between IRInst's, by bumping the builder. If neither is needed, VPIRInst can be dropped?
If an artificial operand is needed to appease VPIRInstruction::execute(), could undef/poison be used instead of zero?

(There is code elsewhere that use the scalar phi to get the resume value from the main vector loop)

So dropping it or using undef would require updating said code in VPIRInstruction::execute() and/or elsewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both can be removed in the latest version, updated, thanks!

Disconnection should be able after #120918

Comment on lines 2676 to 2677
BypassBuilder.setFastMathFlags(
II.getInductionBinOp()->getFastMathFlags());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
BypassBuilder.setFastMathFlags(
II.getInductionBinOp()->getFastMathFlags());
BypassBuilder.setFastMathFlags(BinOp->getFastMathFlags());

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DOne thanks

/// Create resume phis in the scalar preheader for first-order recurrences and
/// reductions and update the VPIRInstructions wrapping the original phis in the
/// scalar header.
/// Create and return a ResumePhi for \p PhiR, if it is wide induction recipe.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Create and return a ResumePhi for \p PhiR, if it is wide induction recipe.
/// Create and return a ResumePhi for \p WideIV, unless it is truncated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated thanks!

Copy link
Collaborator

@ayalz ayalz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, added a couple of final thoughts.

/// Create resume phis in the scalar preheader for first-order recurrences and
/// reductions and update the VPIRInstructions wrapping the original phis in the
/// scalar header.
/// Create and return a ResumePhi for \p WideIF, unless it is truncated. If the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Create and return a ResumePhi for \p WideIF, unless it is truncated. If the
/// Create and return a ResumePhi for \p WideIV, unless it is truncated. If the

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed, thanks!


// If index is the vector trip count, the concrete value will only be set in
// prepareToExecute, leading to missed simplifications, e.g. if it is 0.
// TODO: Remove the special case for the vector trip count once it is computed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So prepareToExecute() sets the value of VTC too late for any folding to take place. Treating overflow of VTC = max_uint BTC + 1 = 0 by skipping the vector loop, is clear. Question is if we can avoid vectorizing altogether, knowing this is the case at compile-time, considering the effort of computeMaxVF()?

Comment on lines 10083 to 10084
VPIRInst->setOperand(
0, MainPlan.getOrAddLiveIn(Constant::getNullValue(IRI->getType())));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... but in the future we can disconnect the scalar header for the main vector loop ...

When can this scalar header be disconnected? Presumably it is needed for phi's that do have wide users in EpiPlan, so could be dropped altogether (or not constructed at all) if and only if all phi's are free of such users?

Can VPIRInst be erased as well (perhaps with make_early_inc_range)?

(thought I responded but cannot see the response here): VPIRinst wraps the header phi in the scalar loop, so it is probably better to keep it here.

A VPIRInst wraps an IRInst in order to (a) add operands to phi's and/or (b) support placing non-VPIRInst recipes between IRInst's, by bumping the builder. If neither is needed, VPIRInst can be dropped?
If an artificial operand is needed to appease VPIRInstruction::execute(), could undef/poison be used instead of zero?

(There is code elsewhere that use the scalar phi to get the resume value from the main vector loop)

So dropping it or using undef would require updating said code in VPIRInstruction::execute() and/or elsewhere?

// There is no corresponding wide induction in the epilogue plan that would
// need a resume value. Set the operand in VPIRInst to zero, so ResumePhi
// can be removed. The resume values for the scalar loop will be created
// during execution of EpiPlan.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The resume values for the scalar loop will be created
// during execution of EpiPlan.

All resume values, or only those free of wide users in EpiPlan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be all resume values for the scalar loop, those free of wide users are not need to resume in the epilogue vector loop

if (EpiWidenedPhis.contains(IRI))
continue;
// There is no corresponding wide induction in the epilogue plan that would
// need a resume value. Set the operand in VPIRInst to zero, so ResumePhi
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emphasize that this zero is artificial, along with a FIXME to remove it (one way or another)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both can be removed in the latest version, updated, thanks!

@fhahn fhahn merged commit 7f3428d into llvm:main Dec 29, 2024
8 checks passed
@fhahn fhahn deleted the vplan-compute-iv-end-values branch December 29, 2024 19:05
fhahn added a commit to fhahn/llvm-project that referenced this pull request Dec 29, 2024
Model updating IV users directly in VPlan, replace fixupIVUsers.

Depends on llvm#110004,
llvm#109975 and
llvm#112145.
@llvm-ci
Copy link
Collaborator

llvm-ci commented Dec 29, 2024

LLVM Buildbot has detected a new failure on builder clang-aarch64-sve2-vla-2stage running on linaro-g4-01 while building llvm at step 12 "ninja check 2".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/199/builds/506

Here is the relevant piece of the build log for the reference
Step 12 (ninja check 2) failure: stage 2 checked (failure)
******************** TEST 'HWAddressSanitizer-aarch64 :: TestCases/Linux/fixed-shadow.c' FAILED ********************
Exit Code: 2

Command Output (stdout):
--
1 0x10000000000
2 0x20000000000
3 0x30000000000
4 0x40000000000
5 0x50000000000
6 0x60000000000
7 0x70000000000
8 0x80000000000
9 0x90000000000
10 0xa0000000000
11 0xb0000000000
12 0xc0000000000
13 0xd0000000000
14 0xe0000000000
15 0xf0000000000
16 0x100000000000
17 0x110000000000
18 0x120000000000
19 0x130000000000
20 0x140000000000
21 0x150000000000
22 0x160000000000
23 0x170000000000
24 0x180000000000
25 0x190000000000
26 0x1a0000000000
27 0x1b0000000000
28 0x1c0000000000
29 0x1d0000000000
30 0x1e0000000000
31 0x1f0000000000
32 0x200000000000
33 0x210000000000
34 0x220000000000
35 0x230000000000
36 0x240000000000
37 0x250000000000
38 0x260000000000
39 0x270000000000
40 0x280000000000
41 0x290000000000
42 0x2a0000000000
43 0x2b0000000000
44 0x2c0000000000
45 0x2d0000000000
...

sunshaoce pushed a commit to sunshaoce/llvm-public that referenced this pull request Dec 30, 2024
Use createDerivedIV to compute IV end values directly in VPlan, instead
of creating them up-front.

This allows updating IV users outside the loop as follow-up.

Depends on llvm#110004 and
llvm#109975.

PR: llvm#112145
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants