Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interleave loop pre-header with SWP prologue #236

Open
wants to merge 8 commits into
base: aie-public
Choose a base branch
from

Conversation

gbossu
Copy link
Collaborator

@gbossu gbossu commented Nov 14, 2024

Better review commit by commit, a lot of them are NFC and prepare for the final commit which enables interleaving.

QoR shows only improvements. Those will get bigger as we increase the scope of the post-pipeliner.

| Core_Compute_Cycle_Count       | ReduceProdAxis_1_aie2_bf16 | ReduceProdAxis_2_aie2_bf16 | ReduceProdAxis_4_aie2_bf16 | ReduceMeanAxis_2_aie2_bf16 | ReduceMeanAxis_4_aie2_bf16 | ReduceMeanAxis_1_aie2_bf16 | Sigmoid_bf16_0 | Requantize_0 | Softmax_bf16_0 | ReduceMeanAxis_3_aie2_bf16 | ReduceMeanAxis_6_aie2_bf16 | ReduceMeanAxis_5_aie2_bf16 | ReduceSumAxis_2_aie2_bf16 | Sigmoid_bf16_1 | Exp_bf16_0   | ReduceSumAxis_4_aie2_bf16 | ReduceSumAxis_1_aie2_bf16 | Log_bf16_0   | Neg_aie2_1   | EleMax_aie2_bfloat16 | EleMin_aie2_bfloat16 | ReduceMeanAxis_7_aie2_bf16 | Requantize_1 | MulAttributeBroadcasting_aie2_int8_0 | Tanh_1       | ReduceProdAxis_6_aie2_bf16 | ReduceSumAxis_5_aie2_bf16 | ReduceSumAxis_3_aie2_bf16 | ReduceSumAxis_6_aie2_bf16 | GELU_1       | HardSigmoid_bf16_0 | ReduceProdAxis_3_aie2_bf16 | ReduceProdAxis_5_aie2_bf16 | Exp_bf16_1   | ReduceProdAxis_7_aie2_bf16 | ReduceSumAxis_7_aie2_bf16 | Tanh_0       | GELU_0       | InstanceNormPart2_aie2_bf16_0 | MulBroadcasting_aie2_0 | HardSigmoid_bf16_1 | EleMax_aie2_int8 | EleMin_aie2_int8 | BatchNorm1d_aie2_bfloat16 | Mul_aie2_0   | HardSigmoid_int8_1 | HardSigmoid_int8_0 | BatchNorm1d_aie2_int8 | BitwiseNot_aie2_0 | BatchNorm2D_1 | GEMM_int8_0  | GEMM_int8_1  | Softmax_1    | BatchNorm2D_0 | Average diff |
| ------------------------------ | -------------------------- | -------------------------- | -------------------------- | -------------------------- | -------------------------- | -------------------------- | -------------- | ------------ | -------------- | -------------------------- | -------------------------- | -------------------------- | ------------------------- | -------------- | ------------ | ------------------------- | ------------------------- | ------------ | ------------ | -------------------- | -------------------- | -------------------------- | ------------ | ------------------------------------ | ------------ | -------------------------- | ------------------------- | ------------------------- | ------------------------- | ------------ | ------------------ | -------------------------- | -------------------------- | ------------ | -------------------------- | ------------------------- | ------------ | ------------ | ----------------------------- | ---------------------- | ------------------ | ---------------- | ---------------- | ------------------------- | ------------ | ------------------ | ------------------ | --------------------- | ----------------- | ------------- | ------------ | ------------ | ------------ | ------------- | ------------ |
| baseline                       | 35466                      | 17821                      | 35498                      | 13076                      | 13046                      | 13040                      | 2627           | 1421         | 8176           | 7233                       | 7219                       | 7212                       | 11900                     | 1727           | 7872         | 11920                     | 11884                     | 4149         | 455          | 227                  | 227                  | 6263                       | 781          | 517                                  | 2578         | 8709                       | 7055                      | 7052                      | 7038                      | 2811         | 937                | 8722                       | 8730                       | 1531         | 1794                       | 6212                      | 1970         | 2144         | 9536                          | 294                    | 649                | 164              | 164              | 390                       | 231          | 427                | 417                | 408                   | 135               | 416           | 2797         | 32931        | 570          | 308           | +0.00%       |
| this PR                        | 35405                      | 17789                      | 35433                      | 13051                      | 13016                      | 13010                      | 2620           | 1417         | 8144           | 7204                       | 7190                       | 7183                       | 11852                     | 1720           | 7840         | 11871                     | 11835                     | 4131         | 453          | 226                  | 226                  | 6231                       | 777          | 514                                  | 2560         | 8647                       | 7004                      | 7001                      | 6987                      | 2790         | 930                | 8656                       | 8662                       | 1519         | 1778                       | 6156                      | 1952         | 2123         | 9440                          | 291                    | 642                | 162              | 162              | 385                       | 228          | 421                | 411                | 402                   | 132               | 406           | 2701         | 31779        | 550          | 296           | -0.15%       |
| Total diff                     | IMPR(-0.17%)               | IMPR(-0.18%)               | IMPR(-0.18%)               | IMPR(-0.19%)               | IMPR(-0.23%)               | IMPR(-0.23%)               | IMPR(-0.27%)   | IMPR(-0.28%) | IMPR(-0.39%)   | IMPR(-0.40%)               | IMPR(-0.40%)               | IMPR(-0.40%)               | IMPR(-0.40%)              | IMPR(-0.41%)   | IMPR(-0.41%) | IMPR(-0.41%)              | IMPR(-0.41%)              | IMPR(-0.43%) | IMPR(-0.44%) | IMPR(-0.44%)         | IMPR(-0.44%)         | IMPR(-0.51%)               | IMPR(-0.51%) | IMPR(-0.58%)                         | IMPR(-0.70%) | IMPR(-0.71%)               | IMPR(-0.72%)              | IMPR(-0.72%)              | IMPR(-0.72%)              | IMPR(-0.75%) | IMPR(-0.75%)       | IMPR(-0.76%)               | IMPR(-0.78%)               | IMPR(-0.78%) | IMPR(-0.89%)               | IMPR(-0.90%)              | IMPR(-0.91%) | IMPR(-0.98%) | IMPR(-1.01%)                  | IMPR(-1.02%)           | IMPR(-1.08%)       | IMPR(-1.22%)     | IMPR(-1.22%)     | IMPR(-1.28%)              | IMPR(-1.30%) | IMPR(-1.41%)       | IMPR(-1.44%)       | IMPR(-1.47%)          | IMPR(-2.22%)      | IMPR(-2.40%)  | IMPR(-3.43%) | IMPR(-3.50%) | IMPR(-3.51%) | IMPR(-3.90%)  | -0.15%       |

This makes the option available for any user of getMinTripCount()
@@ -29,7 +37,11 @@ const MDNode *getLoopID(const MachineBasicBlock &LoopBlock) {
}

std::optional<int64_t> getMinTripCount(const MachineBasicBlock &LoopBlock) {
return getMinTripCount(getLoopID(LoopBlock));
std::optional<int64_t> MinTripCount = getMinTripCount(getLoopID(LoopBlock));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes a lot of sense ;-)

@@ -0,0 +1,422 @@
# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever! I like TDD. It also helps to see from where you start and how the work progresses with the sequence of commits.

for (MachineBasicBlock *SuccBB : MBB.successors()) {
auto &SBS = Sched.getInterBlock().getBlockState(SuccBB);
assert(SBS.isScheduled());
if (SBS.getRegions().empty()) {
Copy link
Collaborator

@andcarminati andcarminati Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious about how rare this can be. If this can be frequent, we can take the successor's successor to reduce the pessimism.

@andcarminati
Copy link
Collaborator

Hi @gbossu, as we did before, we could integrate the 1st and 3rd commits before, as they are standalone changes.

@gbossu gbossu reopened this Nov 15, 2024
@gbossu
Copy link
Collaborator Author

gbossu commented Nov 15, 2024

Hi @gbossu, as we did before, we could integrate the 1st and 3rd commits before, as they are standalone changes.

I'm actually tempted to drop the 3rd commit and do not have AIEMaxLatencyFinder/RegionEndEdges care about "timed"/"fixed" instructions. I think it makes sense if they keep focusing on successor MBBs instead. For dependencies between "fixed" and "free" instructions, I can add them in the EmitFixedSUnits mutator instead. Any opinion?

@andcarminati
Copy link
Collaborator

Hi @gbossu, as we did before, we could integrate the 1st and 3rd commits before, as they are standalone changes.

I'm actually tempted to drop the 3rd commit and do not have AIEMaxLatencyFinder/RegionEndEdges care about "timed"/"fixed" instructions. I think it makes sense if they keep focusing on successor MBBs instead. For dependencies between "fixed" and "free" instructions, I can add them in the EmitFixedSUnits mutator instead. Any opinion?

Is this case, I think it makes sense, as we are introducing this new mutator! In this way we can also restrict a bit more the changes.

}
size_t getNumTopFixedInstructions() const { return NumTopFixedInstructions; }

/// Iterate over the instructions that are fixed at the top. Typically those
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: at the bot

if (Zone.isTop())
return nullptr;
const Region &Reg = InterBlock.getBlockState(CurMBB).getCurrentRegion();
unsigned NumEmitted = getNumEmittedInstrs(DAG, /*IsTop=*/false);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: const unsigned

Those iterators are left with garbage values when a function has
multiple regions. This can confuse DAGMutators.
@gbossu gbossu force-pushed the gaetan.merge.preheader.swp-prologue branch from dd15264 to d4c8472 Compare November 18, 2024 10:38
@@ -543,11 +543,10 @@ class WAWEdges : public ScheduleDAGMutation {
LiveRegs.init(*TRI);
bool AddReservedRegs = true;
if (Scheduler) {
assert(!Scheduler->doMBBSchedRegionsTopDown());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CHECK: this assert is specific for the prologue handling, and the code needs to be extended for epilogue handling.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, WAW handling could run on any region. Here I'm making sure that BS.getBottom() actually represents the bottom region.

@@ -596,6 +596,10 @@ std::optional<unsigned> ScheduleDAGInstrs::initSUnit(MachineInstr &MI) {
if (MI.isDebugOrPseudoInstr())
return {};

// Ensure all instructions can be given a SUnit without re-allocation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking if we could do this in ScheduleDAGInstrs::enterRegion to remove this fixed cost from here. What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would also be fine, the reason I kept it here is to avoid distributing the logic about creating SUnits in too many places. But I'm open to discussion :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martien-de-jong Would you find this better if I do the initialization in ScheduleDAGInstrs::enterRegion then?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess so. I was happy that initSUnit didn't need to care. At the time I hadn't thought about adding SUnits after edge creation. If we can pre-reserve the total number of sunits with some interface I'm fine.

auto Region = BS.getCurrentRegion();
auto BottomRegion = BS.getBottom();
if (*Region.begin() == *BottomRegion.begin()) {
if (&BS.getCurrentRegion() == &BS.getBottom()) {
Copy link
Collaborator

@martien-de-jong martien-de-jong Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CHECK: this a relic of the pre-GatheringRegions era

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, the DAGMutator can run for all types of regions. For non-loop blocks, we do not pre-gather regions.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. now you have me worried about reallocation of the region array. pre-gather would imply pre-allocation, making the test 'obviously correct'


// Verify that all fixed instructions are at the right place in the MBB
assert(TopFixedBundles.empty() || Begin == BB->begin());
assert(TopFixedBundles.empty() ||
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it. && at top level in an assert should be split in different asserts.

return getBundleStart(MI->getIterator()) == FreeEnd;
}));

for (auto It = FreeBegin; It != FreeEnd; ++It) {
SemanticOrder.push_back(&*It);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe mention whether and where fixed instructions are added to the semantic order.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. They aren't. And I know why.

@@ -441,6 +461,12 @@ ScheduleHazardRecognizer::HazardType AIEHazardRecognizer::getHazardType(
iterator_range<const MachineOperand *> MIOperands,
const MachineRegisterInfo &MRI, int DeltaCycles) const {
const unsigned SchedClass = TII->getSchedClass(Desc, MIOperands, MRI);
if (!IsPreRA && SchedClass == 0 &&
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This verification could be refactored out to a helper function like check... or validate..., that can be used also below (same code).

}
size_t size() const { return SemanticOrder.size(); }
ArrayRef<MachineBundle> getBotFixedBundles() const { return BotFixedBundles; }
size_t getNumBotFixedInstructions() const { return BotFixedBundles.size(); }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: NumBotFixedBundles to be absolutely clear?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually thinking of deleting that function. One can just do Region.getBotFixedBundles().size(). Previously it was harder because I only had iterators, and std::distance isn't O(1)

@@ -318,6 +305,49 @@ class RegionEndEdges : public ScheduleDAGMutation {
};
};

class EmitFixedSUnits : public ScheduleDAGMutation {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could probably add a top-level comment for that mutator

@@ -318,6 +305,49 @@ class RegionEndEdges : public ScheduleDAGMutation {
};
};

class EmitFixedSUnits : public ScheduleDAGMutation {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for the future, we could have a high-level description of the mutator.

if (auto BundleStart = getBundleStart(Instr->getIterator());
BundleStart->isBundle()) {
assert(BundleStart != Instr->getIterator());
BundleStart->eraseFromBundle();
Copy link
Collaborator

@martien-de-jong martien-de-jong Nov 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering when this occurs, and whether we should not remove BUNDLEs as a precondition to calling this. Apart from that, I wonder whether there's anybody caring about the format order apart from the MClayer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • This happens when the prologue has VLIW instructions. They get materialized as BUNDLE. I could separate that "debundling" in a helper function if you want.
  • The ordering is mostly for stable ref updates at this point. Otherwise slight changes to the tryCandidate heuristics can be annoying, because they would change the order of instructions within bundles.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. ref order is a good reason. Maybe my main confusion is that this function should live somewhere else.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should. But I blame AIE1 ;)

assert(!(IsTop && FirstBotFixedSU) && "Top-fixed SUnits must be added first");
assert(DAG->SUnits.size() < DAG->SUnits.capacity() &&
"SUnits need to be re-allocated.");
unsigned SUNum = DAG->initSUnit(MI).value();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: ignoring has_value() here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be an error if the value is missing

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and we don't like asserts? :-b

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that I would get to the same behaviour: the error would be caught and not silently ignored.

SU.NodeNum < FirstBotFixedSU.value_or(DAG->SUnits.size());
}
return FirstBotFixedSU && SU.NodeNum >= *FirstBotFixedSU &&
SU.NodeNum <= LastBotFixedSU.value();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: SU.NodeNum <= *LastBotFixedSU (to be symmetric with the previous case) or FirstBotFixedSU.value().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is on purpose to have dynamic checks on LastBotFixedSU (and have a runtime failure if there is no value). For FirstBotFixedSU, I previously checked has_value(), so there is no need for checking again.

Fixed instructions will be handled specifically by the scheduler. They
cannot be reordered or inserted in different cycles.

Note this commit is just an "interface change". The list of "fixed"
instructions is still empty at the moment, and SWP prologues/epilogues
are inserted with a specific logic in commitBlockSchedule().
This updates the scheduler code base so it can handle SUnits that are
BUNDLE. This particularly focuses on/

 - Properly emitting BUNDLEs in the scoreboard
 - Rebundling instructions that are in the same cycle.

The commit also adds a bunch of asserts to verify "hidden"
pre-conditions. I had to add some Itineraries in AIE1/AIE2 to make sure
those conditions hold.
This means they will be placed by the scheduler, and once they are
placed, other "free" instructions can be bundled with them.
@gbossu gbossu force-pushed the gaetan.merge.preheader.swp-prologue branch from 3fc8e6b to 171faec Compare November 19, 2024 16:00
/// Those should not be re-ordered by the scheduler.
ArrayRef<MachineBundle> TopFixedBundles;

/// Instructions that are already scheduled at the bottom, e.g. an swp
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: a swp prologue.

@gbossu
Copy link
Collaborator Author

gbossu commented Nov 21, 2024

Any more comments @andcarminati @martien-de-jong ?

PatchCycles >= InsertedCycles ? PatchCycles - InsertedCycles : 0;
}
EdgeLatency = std::max(EdgeLatency, PatchCycles);
const unsigned ZOLDistance = 8;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the future: this magic 8 could be retrieved from TII as well.

CurRegion.getBotFixedBundles();
for (SUnit &FreeSU : make_filter_range(DAG->SUnits, IsFreeSU)) {
const MachineInstr &MI = *FreeSU.getInstr();
MachineInstr *FixedDepMI =
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit:

To help readability, we could extend with overloaded cast operators:

 struct InstrAndCycle {
   MachineInstr *MI = nullptr;
   int Cycle;
+  operator MachineInstr *() const { return MI; }
+  operator int() const { return Cycle; }
 };

Then we can just use:

MachineInstr *FixedDepMI = AIE::findEarliestRef(MI, BotFixedBundles, BotFixedBundles.size());

And also:

Earliest = findEarliestRef(MI, TopBundles, Earliest);


// Keep track of BUNDLE instructions. They need to be cleaned up when
// de-bundling before re-bundling. See applyBundles()
if (Instr->getOpcode() == TargetOpcode::BUNDLE) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

super nit: Instr->isBundle()

unsigned SUNum = DAG->initSUnit(MI).value();
SUnit &SU = DAG->SUnits[SUNum];

if (IsTop) {
Copy link
Collaborator

@andcarminati andcarminati Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is really smart that you are preparing the way for epilogue merging (in some small details) ;-)

@andcarminati
Copy link
Collaborator

Hi @gbossu, I finished my review and left few suggestions. Nice piece of engineering that will make a lot more difference for the post-swp as we increase its use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants