-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[LLVM] Optimize G_SHUFFLE_VECTOR into more efficient generic opcodes #41
base: aie-public
Are you sure you want to change the base?
[LLVM] Optimize G_SHUFFLE_VECTOR into more efficient generic opcodes #41
Conversation
10708e3
to
0417b1c
Compare
0417b1c
to
429acc5
Compare
2f752d1
to
3e692b4
Compare
429acc5
to
72a0a03
Compare
72a0a03
to
228e499
Compare
228e499
to
0e664aa
Compare
llvm/test/CodeGen/AArch64/GlobalISel/prelegalizercombiner-undef.mir
Outdated
Show resolved
Hide resolved
; CHECK-GI-DOT-NEXT: addv s0, v0.4s | ||
; CHECK-GI-DOT-NEXT: fmov w0, s0 | ||
; CHECK-GI-DOT-NEXT: ret | ||
; CHECK-LABEL: add_pair_v8i16_v4i32_double_sext_zext_shuffle: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay!
; CHECK-NEXT: mova r16, #13 | ||
; CHECK-NEXT: vextract.s32 r5, x2, r16 | ||
; CHECK-NEXT: j #.LBB0_3 | ||
; CHECK-NEXT: nopb ; nopa ; nops ; nopx ; vmov wl0, wh0; nopv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
3e692b4
to
79d7e96
Compare
f872cf4
to
73a92c2
Compare
You can test this locally with the following command:git-clang-format --diff ea89ed6223aa93251b9708c16b9d605ab30dccdb 07244e73f9104198bb9cb52273afc033367e0429 -- llvm/include/llvm/CodeGen/GlobalISel/CombinerHelper.h llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp llvm/lib/Target/AIE/AIE2PreLegalizerCombiner.cpp View the diff from clang-format here.diff --git a/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp b/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
index 70297e7d43..b42b0b7c29 100644
--- a/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
@@ -352,32 +352,31 @@ Register CombinerHelper::createUnmergeValue(
TargetReg = MRI.createGenericVirtualRegister(HalfSizeTy);
}
- // Each destination fits n times into the source and each iteration we
- // exactly half the source. Therefore we need to pick on which side we want
- // to iterate on.
- const uint32_t DstNumElements =
- DstTy.isVector() ? DstTy.getNumElements() : 1;
- const uint32_t HalfWay = Start + ((End - Start) / 2);
- const uint32_t Position = DestinationIndex * DstNumElements;
-
- uint32_t NextStart, NextEnd;
- if (Position < HalfWay) {
- Builder.buildInstr(TargetOpcode::G_UNMERGE_VALUES, {TargetReg, TmpReg},
- {SrcReg});
- NextStart = Start;
- NextEnd = HalfWay;
- } else {
- Builder.buildInstr(TargetOpcode::G_UNMERGE_VALUES, {TmpReg, TargetReg},
- {SrcReg});
- NextStart = HalfWay;
- NextEnd = End;
- }
+ // Each destination fits n times into the source and each iteration we
+ // exactly half the source. Therefore we need to pick on which side we want
+ // to iterate on.
+ const uint32_t DstNumElements = DstTy.isVector() ? DstTy.getNumElements() : 1;
+ const uint32_t HalfWay = Start + ((End - Start) / 2);
+ const uint32_t Position = DestinationIndex * DstNumElements;
+
+ uint32_t NextStart, NextEnd;
+ if (Position < HalfWay) {
+ Builder.buildInstr(TargetOpcode::G_UNMERGE_VALUES, {TargetReg, TmpReg},
+ {SrcReg});
+ NextStart = Start;
+ NextEnd = HalfWay;
+ } else {
+ Builder.buildInstr(TargetOpcode::G_UNMERGE_VALUES, {TmpReg, TargetReg},
+ {SrcReg});
+ NextStart = HalfWay;
+ NextEnd = End;
+ }
- if (HalfSizeTy.isVector() && DstTy != HalfSizeTy)
- return createUnmergeValue(MI, TargetReg, DstReg, DestinationIndex,
- NextStart, NextEnd);
+ if (HalfSizeTy.isVector() && DstTy != HalfSizeTy)
+ return createUnmergeValue(MI, TargetReg, DstReg, DestinationIndex,
+ NextStart, NextEnd);
- return DstReg;
+ return DstReg;
}
bool CombinerHelper::tryCombineShuffleVector(MachineInstr &MI) {
|
ea44d18
to
4c022df
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are plenty of upstream tests failing. Could you please take a look?
b49d34c
to
f855c29
Compare
llvm/test/CodeGen/AIE/aie2/GlobalISel/prelegalizercombiner-shufflevector.mir
Outdated
Show resolved
Hide resolved
f855c29
to
4d6af83
Compare
@@ -399,6 +399,54 @@ adderGenerator(const int32_t From, const int32_t To, const int32_t StepSize) { | |||
}; | |||
} | |||
|
|||
Register CombinerHelper::createUnmergeValue( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious: Maybe IRTranslator
already has a similar piece of code that we can re-use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not from what I can see, G_UNMERGE_VALUES
isn't meant for vectors to begin with. It only supports scalars, according to the documentation. Most backends just support this as an extension since there is no other opcode to do it. In theory, this will be ripped out in favour of a simple G_SUBREGISTER_EXTRACT
in the near future if that opcode gets proper support. Utils.cpp
does have something that is a bit similar, called llvm::extractParts
, but that splits into two parts while this code goes through it like a binary tree.
In any case, I would keep it like this since this code should be replaced before upstreaming and that is easier if we are the only ones depending on it.
4d6af83
to
13cc82a
Compare
Register createUnmergeValue(MachineInstr &MI, const Register SrcReg, | ||
const Register DstReg, uint8_t DestinationIndex, | ||
const uint32_t Start, const uint32_t End); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
G_UNMERGE_VALUE
isn't really meant to be used for vectors like this, it is actually defined as a scalar opcode. There is a nice drop-in replacement opcode called G_EXTRACT_SUBVECTOR
which has landed in our tree in early August, but there is no support in other backends for this yet. So this implementation is a necessary evil until the other opcode is able to replace it.
; CHECK-GISEL-NEXT: ext v0.16b, v0.16b, v1.16b, #8 | ||
; CHECK-GISEL-NEXT: // kill: def $d0 killed $d0 killed $q0 | ||
; CHECK-GISEL-NEXT: mov d0, v0.d[1] | ||
; CHECK-GISEL-NEXT: ret |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is:
define <8 x i8> @i8_off8(<16 x i8> %arg1, <16 x i8> %arg2) {
entry:
%shuffle = shufflevector <16 x i8> %arg1, <16 x i8> %arg2, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
ret <8 x i8> %shuffle
}
This turns into:
bb.1.entry:
liveins: $q0, $q1
%0:_(<16 x s8>) = COPY $q0
%3:_(<8 x s8>), %2:_(<8 x s8>) = G_UNMERGE_VALUES %0:_(<16 x s8>)
$d0 = COPY %2:_(<8 x s8>)
RET_ReallyLR implicit $d0
This is expected
; CHECK-GISEL-NEXT: movi v1.2d, #0000000000000000 | ||
; CHECK-GISEL-NEXT: ext v0.16b, v0.16b, v1.16b, #8 | ||
; CHECK-GISEL-NEXT: // kill: def $d0 killed $d0 killed $q0 | ||
; CHECK-GISEL-NEXT: mov d0, v0.d[1] | ||
; CHECK-GISEL-NEXT: ret | ||
entry: | ||
%shuffle = shufflevector <16 x i8> %arg1, <16 x i8> zeroinitializer, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is:
define <8 x i8> @i8_zero_off8(<16 x i8> %arg1) {
entry:
%shuffle = shufflevector <16 x i8> %arg1, <16 x i8> zeroinitializer, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
ret <8 x i8> %shuffle
}
And turns into:
bb.1.entry:
liveins: $q0, $q1
%0:_(<16 x s8>) = COPY $q0
%5:_(<8 x s8>), %4:_(<8 x s8>) = G_UNMERGE_VALUES %0:_(<16 x s8>)
$d0 = COPY %4:_(<8 x s8>)
RET_ReallyLR implicit $d0
this is expected
; CHECK-GI-NEXT: ushll2 v0.4s, v0.8h, #0 | ||
; CHECK-GI-NEXT: ushll v5.4s, v1.4h, #0 | ||
; CHECK-GI-NEXT: ushll2 v1.4s, v1.8h, #0 | ||
; CHECK-GI-NEXT: ushll v6.4s, v2.4h, #0 | ||
; CHECK-GI-NEXT: ushll2 v2.4s, v2.8h, #0 | ||
; CHECK-GI-NEXT: ushll v7.4s, v3.4h, #0 | ||
; CHECK-GI-NEXT: ushll2 v3.4s, v3.8h, #0 | ||
; CHECK-GI-NEXT: add v0.4s, v4.4s, v0.4s | ||
; CHECK-GI-NEXT: add v1.4s, v5.4s, v1.4s | ||
; CHECK-GI-NEXT: add v2.4s, v6.4s, v2.4s | ||
; CHECK-GI-NEXT: add v3.4s, v7.4s, v3.4s | ||
; CHECK-GI-NEXT: uaddw2 v0.4s, v4.4s, v0.8h | ||
; CHECK-GI-NEXT: uaddw2 v1.4s, v5.4s, v1.8h | ||
; CHECK-GI-NEXT: uaddw2 v2.4s, v6.4s, v2.8h | ||
; CHECK-GI-NEXT: uaddw2 v3.4s, v7.4s, v3.8h | ||
; CHECK-GI-NEXT: add v0.4s, v0.4s, v1.4s | ||
; CHECK-GI-NEXT: add v1.4s, v2.4s, v3.4s | ||
; CHECK-GI-NEXT: add v0.4s, v0.4s, v1.4s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is for me the most surprising test result and hardest to parse. The input LLVM IR looks like:
define i32 @add_pair_v8i16_v4i32_double_sext_zext_shuffle(<8 x i16> %ax, <8 x i16> %ay, <8 x i16> %bx, <8 x i16> %by) {
entry:
%axx = zext <8 x i16> %ax to <8 x i32>
%s1h = shufflevector <8 x i32> %axx, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%s1l = shufflevector <8 x i32> %axx, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%axs = add <4 x i32> %s1h, %s1l
%ayy = zext <8 x i16> %ay to <8 x i32>
%s2h = shufflevector <8 x i32> %ayy, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%s2l = shufflevector <8 x i32> %ayy, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%ays = add <4 x i32> %s2h, %s2l
%az = add <4 x i32> %axs, %ays
%bxx = zext <8 x i16> %bx to <8 x i32>
%s3h = shufflevector <8 x i32> %bxx, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%s3l = shufflevector <8 x i32> %bxx, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%bxs = add <4 x i32> %s3h, %s3l
%byy = zext <8 x i16> %by to <8 x i32>
%s4h = shufflevector <8 x i32> %byy, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
%s4l = shufflevector <8 x i32> %byy, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
%bys = add <4 x i32> %s4h, %s4l
%bz = add <4 x i32> %bxs, %bys
%z = add <4 x i32> %az, %bz
%z2 = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %z)
ret i32 %z2
}
This is translated into the following global isel IR:
bb.1.entry: liveins: $q0, $q1, $q2, $q3
%0:_(<8 x s16>) = COPY $q0
%1:_(<8 x s16>) = COPY $q1
%2:_(<8 x s16>) = COPY $q2
%3:_(<8 x s16>) = COPY $q3
%33:_(<4 x s16>), %34:_(<4 x s16>) = G_UNMERGE_VALUES %0:_(<8 x s16>)
%35:_(<4 x s32>) = G_ZEXT %33:_(<4 x s16>)
%36:_(<4 x s32>) = G_ZEXT %34:_(<4 x s16>)
%8:_(<4 x s32>) = G_ADD %35:_, %36:_
%37:_(<4 x s16>), %38:_(<4 x s16>) = G_UNMERGE_VALUES %1:_(<8 x s16>)
%39:_(<4 x s32>) = G_ZEXT %37:_(<4 x s16>)
%40:_(<4 x s32>) = G_ZEXT %38:_(<4 x s16>)
%12:_(<4 x s32>) = G_ADD %39:_, %40:_
%13:_(<4 x s32>) = G_ADD %8:_, %12:_
%41:_(<4 x s16>), %42:_(<4 x s16>) = G_UNMERGE_VALUES %2:_(<8 x s16>)
%43:_(<4 x s32>) = G_ZEXT %41:_(<4 x s16>)
%44:_(<4 x s32>) = G_ZEXT %42:_(<4 x s16>)
%17:_(<4 x s32>) = G_ADD %43:_, %44:_
%45:_(<4 x s16>), %46:_(<4 x s16>) = G_UNMERGE_VALUES %3:_(<8 x s16>)
%47:_(<4 x s32>) = G_ZEXT %45:_(<4 x s16>)
%48:_(<4 x s32>) = G_ZEXT %46:_(<4 x s16>)
%21:_(<4 x s32>) = G_ADD %47:_, %48:_
%22:_(<4 x s32>) = G_ADD %17:_, %21:_
%23:_(<4 x s32>) = G_ADD %13:_, %22:_
%24:_(s32) = G_VECREDUCE_ADD %23:_(<4 x s32>)
$w0 = COPY %24:_(s32)
RET_ReallyLR implicit $w0
This corresponds to the the shufflevectors perfectly and actually shows that they simplify the unmerges as well, which is a welcome bonus.
if (DstNumElts <= 2) | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe do these tests in a callback? Now, if one fails, it will exit the combiner completely rather than continuing on. For now, this is fine, but with more patterns that are close, this might be problems.
; CHECK-GI-NEXT: mov v0.b[2], v3.b[0] | ||
; CHECK-GI-NEXT: mov v0.b[3], v4.b[0] | ||
; CHECK-GI-NEXT: mov v0.b[4], v5.b[0] | ||
; CHECK-GI-NEXT: mov v0.b[5], v6.b[0] | ||
; CHECK-GI-NEXT: mov v0.b[6], v7.b[0] | ||
; CHECK-GI-NEXT: mov v0.b[7], v16.b[0] | ||
; CHECK-GI-NEXT: tbl v0.16b, { v0.16b, v1.16b }, v2.16b | ||
; CHECK-GI-NEXT: mov v0.d[1], v1.d[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The input LLVM IR is:
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────define <16 x i8> @test_concat_v16i8_v8i8_v16i8(<8 x i8> %x, <16 x i8> %y) #0 {
entry:
%vecext = extractelement <8 x i8> %x, i32 0
%vecinit = insertelement <16 x i8> undef, i8 %vecext, i32 0
%vecext1 = extractelement <8 x i8> %x, i32 1
%vecinit2 = insertelement <16 x i8> %vecinit, i8 %vecext1, i32 1
%vecext3 = extractelement <8 x i8> %x, i32 2
%vecinit4 = insertelement <16 x i8> %vecinit2, i8 %vecext3, i32 2
%vecext5 = extractelement <8 x i8> %x, i32 3
%vecinit6 = insertelement <16 x i8> %vecinit4, i8 %vecext5, i32 3
%vecext7 = extractelement <8 x i8> %x, i32 4
%vecinit8 = insertelement <16 x i8> %vecinit6, i8 %vecext7, i32 4
%vecext9 = extractelement <8 x i8> %x, i32 5
%vecinit10 = insertelement <16 x i8> %vecinit8, i8 %vecext9, i32 5
%vecext11 = extractelement <8 x i8> %x, i32 6
%vecinit12 = insertelement <16 x i8> %vecinit10, i8 %vecext11, i32 6
%vecext13 = extractelement <8 x i8> %x, i32 7
%vecinit14 = insertelement <16 x i8> %vecinit12, i8 %vecext13, i32 7
%vecinit30 = shufflevector <16 x i8> %vecinit14, <16 x i8> %y, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 2>
ret <16 x i8> %vecinit30
}
Our outputted post-combiner IR is:
bb.1.entry:
liveins: $d0, $q1
%0:_(<8 x s8>) = COPY $d0
%1:_(<16 x s8>) = COPY $q1
%3:_(s64) = G_CONSTANT i64 0
%2:_(s8) = G_EXTRACT_VECTOR_ELT %0:_(<8 x s8>), %3:_(s64)
%7:_(s64) = G_CONSTANT i64 1
%6:_(s8) = G_EXTRACT_VECTOR_ELT %0:_(<8 x s8>), %7:_(s64)
%10:_(s64) = G_CONSTANT i64 2
%9:_(s8) = G_EXTRACT_VECTOR_ELT %0:_(<8 x s8>), %10:_(s64)
%13:_(s64) = G_CONSTANT i64 3
%12:_(s8) = G_EXTRACT_VECTOR_ELT %0:_(<8 x s8>), %13:_(s64)
%16:_(s64) = G_CONSTANT i64 4
%15:_(s8) = G_EXTRACT_VECTOR_ELT %0:_(<8 x s8>), %16:_(s64)
%19:_(s64) = G_CONSTANT i64 5
%18:_(s8) = G_EXTRACT_VECTOR_ELT %0:_(<8 x s8>), %19:_(s64)
%22:_(s64) = G_CONSTANT i64 6
%21:_(s8) = G_EXTRACT_VECTOR_ELT %0:_(<8 x s8>), %22:_(s64)
%25:_(s64) = G_CONSTANT i64 7
%24:_(s8) = G_EXTRACT_VECTOR_ELT %0:_(<8 x s8>), %25:_(s64)
%34:_(<8 x s8>) = G_BUILD_VECTOR %2:_(s8), %6:_(s8), %9:_(s8), %12:_(s8), %15:_(s8), %18:_(s8), %21:_(s8), %24:_(s8)
%31:_(<8 x s8>), %32:_(<8 x s8>) = G_UNMERGE_VALUES %1:_(<16 x s8>)
%27:_(<16 x s8>) = G_CONCAT_VECTORS %34:_(<8 x s8>), %31:_(<8 x s8>)
$q0 = COPY %27:_(<16 x s8>)
RET_ReallyLR implicit $q0
The IR is alright, it correctly puts the vector together, however there is a combiner missing that takes the extractions, turns it into a shufflevector and then optimizes it out.
; CHECK-GI-NEXT: // kill: def $d0 killed $d0 def $q0 | ||
; CHECK-GI-NEXT: mov s2, v0.s[1] | ||
; CHECK-GI-NEXT: mov v0.s[1], v2.s[0] | ||
; CHECK-GI-NEXT: ldr q2, [x8, :lo12:.LCPI135_0] | ||
; CHECK-GI-NEXT: tbl v0.16b, { v0.16b, v1.16b }, v2.16b | ||
; CHECK-GI-NEXT: mov v0.d[1], v1.d[0] | ||
; CHECK-GI-NEXT: ret |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
To explain the lateness of the reviews: ASUS managed to lose my hawkpoint laptop and have been promising that they will ship it soon (tm) for the past three weeks. I will see if I can get some progress done on my old 4 core |
These generators are used to match onto shufflemask for optimizations. The idea is that each shufflemask essentially encodes a function that turns one vector into another. Generators are those functions and allow us to match shufflevectors by generating masks. Since masks are frequently very similar, this allows to define many masks in relatively few lines.
13cc82a
to
c38562a
Compare
We check for iterative shift masks which corresponds to the CONCAT_VECTOR instruction.
We check for iterative shift masks which corresponds to the CONCAT_VECTOR instruction.
… of vector A and B
… of two vectors together
c38562a
to
ebe6489
Compare
@konstantinschwarz @gbossu friendly review bump. Hopefully the tests don't fail (I forgot where the git-clang-format hook was again) :) Changes from the last time:
It is a bit late, so I hope I haven't missed anything. Good luck today! |
FYI: there will be a conflict issue whenever LLVM main is merged into this branch due to The relevant PR is: llvm/llvm-project#110545 @jsetoain the above PR is interesting for you since it sidesteps the need to convert into shufflevector explicitly. It will optimize sequences of inserts/extracts/buildvector automatically. You may want to cherry-pick the original version which turns into a Cheers! |
Hi @ValentijnvdBeek, I finally came back to this, sorry for the long delay. It uses the already available helper functions for creating contiguous shuffle masks. WDYT? |
It is fine, I am just happy if the work doesn't totally bit rot. I like the linked code, for the most part, it takes my approach and makes it less of a tech demo (which this sort of is a bit). You do cover a part of the approach though, which admittedly the one that this implements, but there are some worries that I have about the general cost of this combiner in the future. It is pretty expensive and putting it in many different combiners does add a lot of code (and causes it to check the same values a lot). But as an approach within AIE, that is probably fine. So as of this PR, what could be an idea is that I close it and you guys pick it over for those combiners. I will then take it back, be inspired by your approach and rewrite it with some of my crazier ideas to reduce the complexity so that I can push it upstream directly. Together with maybe #129, that way it is also out of your hair and review todo list. Is that an idea @konstantinschwarz? |
This merge requests implements an infrastructure based on streams that allows for the creation of shuffle vector pattern masks that can be used to replace inefficient scalarizations with faster generic opcodes. It implements five patterns that are useful for the AIE going forwards, namely one that matches the concatenation of vectors, the same in reverse, the insertion of a vector and the merging of two vectors.
Some todos in the merge request that can be done after the rest has been OK'ed