Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LV] Ignore some costs when loop gets fully unrolled #106699

Merged
merged 9 commits into from
Dec 9, 2024
41 changes: 40 additions & 1 deletion llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2682,6 +2682,25 @@ static Value *getExpandedStep(const InductionDescriptor &ID,
return I->second;
}

/// Knowing that loop \p L executes a single vector iteration, add instructions
/// that will get simplified and thus should not have any cost to \p
/// InstsToIgnore.
static void addFullyUnrolledInstructionsToIgnore(
Loop *L, const LoopVectorizationLegality::InductionList &IL,
SmallPtrSetImpl<Instruction *> &InstsToIgnore) {
auto *Cmp = L->getLatchCmpInst();
if (Cmp)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I just thought of something. What if there is an additional use of Cmp outside the loop? Should we also be checking that there is only a single use?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it is the only latch instruction, its value would be simplified to false when the control flow leaves the loop. So, we shouldn't worry about that

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may be right, but have you tried this out with real IR? There would be more than one use of the cmp so I wasn't sure if InstCombine would apply the fold in this case. In any case, adding an extra one use check wouldn't affect the loops you care about I think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean getLatchCmpInst ensures there is only a single user? Would be good to make sure we have a test (and maybe assert that there's a single user)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it doesn't check the number of users but checks that there's only one latch in the loop. The latch is something that makes a loop to be a loop. If there is a single latch and we don't jump to the loop header, then we are exiting the loop, and the value of the condition is constant.

See, it simplifies this condition outside away (pre LV):

https://godbolt.org/z/s9q1558zY

And if the compiler couldn't do it, we should've better taught him to do it rather than adding unnecessary checks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was actually thinking about something else @igogo-x86:

define i1 @foo(ptr nocapture noundef %dst, ptr nocapture noundef readonly %src) {
entry:
  br label %for.body

for.body:
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds nuw i32, ptr %src, i64 %indvars.iv
  %0 = load i32, ptr %arrayidx, align 4
  %arrayidx2 = getelementptr inbounds nuw i32, ptr %dst, i64 %indvars.iv
  %1 = load i32, ptr %arrayidx2, align 4
  %add = add nsw i32 %1, %0
  store i32 %add, ptr %arrayidx2, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, 16
  br i1 %exitcond.not, label %exit, label %for.body

exit:
  ret i1 %exitcond.not
}

This is a case where in theory the final comparison is used after the loop. That's what I was hoping you could try out. Anyway, I ran this with the following command:

opt -mcpu=neoverse-v1 -p loop-vectorize,loop-unroll -force-vector-width=4 -force-vector-interleave=1 -S -debug-only=loop-vectorize < foo.ll

and it looks like we do indeed unroll the loop and the comparison disappears. Similary we unroll this and the comparison disappears:

define i1 @foo(ptr nocapture noundef %dst, ptr nocapture noundef %p, ptr nocapture noundef readonly %src) {
entry:
  br label %for.body

for.body:
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds nuw i32, ptr %src, i64 %indvars.iv
  %0 = load i32, ptr %arrayidx, align 4
  %arrayidx2 = getelementptr inbounds nuw i32, ptr %dst, i64 %indvars.iv
  %1 = load i32, ptr %arrayidx2, align 4
  %add = add nsw i32 %1, %0
  store i32 %add, ptr %arrayidx2, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, 16
  %arrayidx3 = getelementptr inbounds nuw i1, ptr %p, i64 %indvars.iv
  store i1 %exitcond.not, ptr %arrayidx3, align 4
  br i1 %exitcond.not, label %exit, label %for.body

exit:
  ret i1 false
}

In this case I'm happy that the comparison disappears, although I don't think an extra use check would have done any harm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to add a test case ,if possible

InstsToIgnore.insert(Cmp);
for (const auto &[IV, IndDesc] : IL) {
// Get next iteration value of the induction variable.
Instruction *IVInst =
cast<Instruction>(IV->getIncomingValueForBlock(L->getLoopLatch()));
if (all_of(IVInst->users(),
[&](const User *U) { return U == IV || U == Cmp; }))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://lab.llvm.org/buildbot/#/builders/78/builds/11135:

/home/llvm-libc-buildbot/buildbot-worker/libc-x86_64-debian/libc-x86_64-debian-dbg-runtimes-build/llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:2699:49: error: reference to local binding 'IV' declared in enclosing function 'addFullyUnrolledInstructionsToIgnore'
               [&](const User *U) { return U == IV || U == Cmp; }))
                                                ^
/home/llvm-libc-buildbot/buildbot-worker/libc-x86_64-debian/libc-x86_64-debian-dbg-runtimes-build/llvm-project/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:2694:21: note: 'IV' declared here
  for (const auto &[IV, IndDesc] : IL) {
                    ^

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@kazutakahirata kazutakahirata Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've fixed the warning with: 9099d69

InstsToIgnore.insert(IVInst);
}
}

void InnerLoopVectorizer::createInductionResumeVPValues(
const SCEV2ValueTy &ExpandedSCEVs, Value *MainVectorTripCount,
SmallPtrSetImpl<PHINode *> *IVSubset) {
Expand Down Expand Up @@ -5592,14 +5611,23 @@ InstructionCost LoopVectorizationCostModel::computePredInstDiscount(
InstructionCost LoopVectorizationCostModel::expectedCost(ElementCount VF) {
InstructionCost Cost;

// If the vector loop gets executed exactly once with the given VF, ignore the
// costs of comparison and induction instructions, as they'll get simplified
// away.
SmallPtrSet<Instruction *, 2> ValuesToIgnoreForVF;
auto TC = PSE.getSE()->getSmallConstantTripCount(TheLoop);
if (VF.isFixed() && TC == VF.getFixedValue() && !foldTailByMasking())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If TC == VF, no tail should remain, so we shouldn't try to fold the tail by masking. Could you make this an assert instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course you're right! There is code in computeMaxVF that disables tail-folding. In that case an assert does make sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fhahn Well, if I make it as assert, these tests fail:

  LLVM :: Transforms/LoopVectorize/AArch64/clamped-trip-count.ll
  LLVM :: Transforms/LoopVectorize/RISCV/low-trip-count.ll
  LLVM :: Transforms/LoopVectorize/RISCV/short-trip-count.ll
  LLVM :: Transforms/LoopVectorize/RISCV/truncate-to-minimal-bitwidth-cost.ll

Without assert two RISCV tests show different output (that's why I added this check in the first place)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because in computeMaxVF the variable MaxPowerOf2RuntimeVF is bigger than the trip count and so the remainder of MaxPowerOf2RuntimeVF/TC is non-zero. Basically we only disable tail-folding if we can prove for every single possible VF there will not be a tail. So I believe @igogo-x86 is right and we need an explicit check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, suppose TC=8 and the max VF=16. In this case we could choose a VF of 16 and there will be a remainder, but the cost model may prefer VF=8 where there isn't a remainder.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking, that makes sense.

addFullyUnrolledInstructionsToIgnore(TheLoop, Legal->getInductionVars(),
ValuesToIgnoreForVF);

// For each block.
for (BasicBlock *BB : TheLoop->blocks()) {
InstructionCost BlockCost;

// For each instruction in the old loop.
for (Instruction &I : BB->instructionsWithoutDebug()) {
// Skip ignored values.
if (ValuesToIgnore.count(&I) ||
if (ValuesToIgnore.count(&I) || ValuesToIgnoreForVF.count(&I) ||
(VF.isVector() && VecValuesToIgnore.count(&I)))
continue;

Expand Down Expand Up @@ -7281,6 +7309,17 @@ LoopVectorizationPlanner::precomputeCosts(VPlan &Plan, ElementCount VF,
continue;
IVInsts.push_back(CI);
}

// If the vector loop gets executed exactly once with the given VF, ignore
// the costs of comparison and induction instructions, as they'll get
// simplified away.
// TODO: Remove this code after stepping away from the legacy cost model and
// adding code to simplify VPlans before calculating their costs.
auto TC = PSE.getSE()->getSmallConstantTripCount(OrigLoop);
if (VF.isFixed() && TC == VF.getFixedValue() && !CM.foldTailByMasking())
addFullyUnrolledInstructionsToIgnore(OrigLoop, Legal->getInductionVars(),
CostCtx.SkipCostComputation);

for (Instruction *IVInst : IVInsts) {
if (CostCtx.skipCostComputation(IVInst, VF.isVector()))
continue;
Expand Down
18 changes: 5 additions & 13 deletions llvm/test/Transforms/LoopVectorize/AArch64/fully-unrolled-cost.ll
Original file line number Diff line number Diff line change
Expand Up @@ -12,12 +12,10 @@ define i64 @test(ptr %a, ptr %b) #0 {
; CHECK-NEXT: Cost of 1 for VF 8: exit condition instruction %exitcond.not = icmp eq i64 %i.iv.next, 16
; CHECK-NEXT: Cost of 0 for VF 8: EMIT vp<%2> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
; CHECK: Cost for VF 8: 26
; CHECK-NEXT: Cost of 1 for VF 16: induction instruction %i.iv.next = add nuw nsw i64 %i.iv, 1
; CHECK-NEXT: Cost of 0 for VF 16: induction instruction %i.iv = phi i64 [ 0, %entry ], [ %i.iv.next, %for.body ]
; CHECK-NEXT: Cost of 1 for VF 16: exit condition instruction %exitcond.not = icmp eq i64 %i.iv.next, 16
; CHECK-NEXT: Cost of 0 for VF 16: EMIT vp<%2> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
; CHECK: Cost for VF 16: 50
; CHECK: LV: Selecting VF: vscale x 2
; CHECK: Cost for VF 16: 48
; CHECK: LV: Selecting VF: 16
entry:
br label %for.body

Expand Down Expand Up @@ -50,9 +48,8 @@ define i64 @test_external_iv_user(ptr %a, ptr %b) #0 {
; CHECK: Cost for VF 8: 26
; CHECK-NEXT: Cost of 1 for VF 16: induction instruction %i.iv.next = add nuw nsw i64 %i.iv, 1
; CHECK-NEXT: Cost of 0 for VF 16: induction instruction %i.iv = phi i64 [ 0, %entry ], [ %i.iv.next, %for.body ]
; CHECK-NEXT: Cost of 1 for VF 16: exit condition instruction %exitcond.not = icmp eq i64 %i.iv.next, 16
; CHECK-NEXT: Cost of 0 for VF 16: EMIT vp<%2> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
; CHECK: Cost for VF 16: 50
; CHECK: Cost for VF 16: 49
; CHECK: LV: Selecting VF: vscale x 2
entry:
br label %for.body
Expand Down Expand Up @@ -86,13 +83,10 @@ define i64 @test_two_ivs(ptr %a, ptr %b, i64 %start) #0 {
; CHECK-NEXT: Cost of 1 for VF 8: exit condition instruction %exitcond.not = icmp eq i64 %i.iv.next, 16
; CHECK-NEXT: Cost of 0 for VF 8: EMIT vp<%2> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
; CHECK: Cost for VF 8: 27
; CHECK-NEXT: Cost of 1 for VF 16: induction instruction %i.iv.next = add nuw nsw i64 %i.iv, 1
; CHECK-NEXT: Cost of 0 for VF 16: induction instruction %i.iv = phi i64 [ 0, %entry ], [ %i.iv.next, %for.body ]
; CHECK-NEXT: Cost of 1 for VF 16: induction instruction %j.iv.next = add nuw nsw i64 %j.iv, 1
; CHECK-NEXT: Cost of 0 for VF 16: induction instruction %j.iv = phi i64 [ %start, %entry ], [ %j.iv.next, %for.body ]
; CHECK-NEXT: Cost of 1 for VF 16: exit condition instruction %exitcond.not = icmp eq i64 %i.iv.next, 16
; CHECK-NEXT: Cost of 0 for VF 16: EMIT vp<%2> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
; CHECK: Cost for VF 16: 51
; CHECK: Cost for VF 16: 48
; CHECK: LV: Selecting VF: 16
entry:
br label %for.body
Expand Down Expand Up @@ -125,11 +119,9 @@ define i1 @test_extra_cmp_user(ptr nocapture noundef %dst, ptr nocapture noundef
; CHECK-NEXT: Cost of 4 for VF 8: exit condition instruction %exitcond.not = icmp eq i64 %indvars.iv.next, 16
; CHECK-NEXT: Cost of 0 for VF 8: EMIT vp<%3> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
; CHECK: Cost for VF 8: 12
; CHECK-NEXT: Cost of 8 for VF 16: induction instruction %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
; CHECK-NEXT: Cost of 0 for VF 16: induction instruction %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
; CHECK-NEXT: Cost of 8 for VF 16: exit condition instruction %exitcond.not = icmp eq i64 %indvars.iv.next, 16
; CHECK-NEXT: Cost of 0 for VF 16: EMIT vp<%3> = CANONICAL-INDUCTION ir<0>, vp<%index.next>
; CHECK: Cost for VF 16: 20
; CHECK: Cost for VF 16: 4
; CHECK: LV: Selecting VF: 16
entry:
br label %for.body
Expand Down
Loading