
[LLVM] Optimize G_SHUFFLE_VECTOR into more efficient generic opcodes #41

Open. Wants to merge 9 commits into base: aie-public.

Conversation

@ValentijnvdBeek (Collaborator) commented May 17, 2024

This merge request implements a stream-based infrastructure for building shuffle-vector pattern masks, which can be used to replace inefficient scalarizations with faster generic opcodes. It implements five patterns that are useful for the AIE going forward, among them one that matches the concatenation of vectors, the same in reverse, the insertion of a vector, and the merging of two vectors.

Some todos remain in the merge request that can be done after the rest has been OK'ed:

  • The assertions that check the initial state are done somewhat ad hoc; it might be nice to pass them as function pointers as well.
  • Matching is inefficient for larger groups of patterns. The patterns currently match the minimum of what they can; for example, only the first two parts of vectors are transformed. Most of these can be extended, but that quickly leads to a very large set of patterns.
  • There are two matching functions, since the concat version does a lot of additional work whose purpose I am not sure of. I should figure out whether the two can be fully merged or whether just the simple variant ought to survive.
  • Use llvm::SmallVector instead of std::vector.

@ValentijnvdBeek ValentijnvdBeek self-assigned this May 17, 2024
@ValentijnvdBeek ValentijnvdBeek added llvm:instcombine Code that modifies the combiner llvm:core Modifies non-AIE specific code backend:aie Code that modifies AIE code labels May 17, 2024
@ValentijnvdBeek ValentijnvdBeek force-pushed the vvandebe.shufflevector.pattern.optimization branch from 10708e3 to 0417b1c Compare May 22, 2024 09:15
@ValentijnvdBeek ValentijnvdBeek force-pushed the vvandebe.shufflevector.pattern.optimization branch from 0417b1c to 429acc5 Compare May 31, 2024 09:59
@ValentijnvdBeek ValentijnvdBeek force-pushed the vvandebe.shufflevector.pattern.optimization branch from 429acc5 to 72a0a03 Compare June 11, 2024 09:47
@ValentijnvdBeek ValentijnvdBeek changed the base branch from vvandebe.shufflevector.basic to aie-public June 11, 2024 09:49
@ValentijnvdBeek ValentijnvdBeek changed the base branch from aie-public to vvandebe.shufflevector.basic June 11, 2024 09:54
@ValentijnvdBeek ValentijnvdBeek force-pushed the vvandebe.shufflevector.pattern.optimization branch from 72a0a03 to 228e499 Compare June 14, 2024 12:31
@ValentijnvdBeek ValentijnvdBeek marked this pull request as ready for review June 14, 2024 12:34
@ValentijnvdBeek ValentijnvdBeek force-pushed the vvandebe.shufflevector.pattern.optimization branch from 228e499 to 0e664aa Compare June 14, 2024 12:56
llvm/test/CodeGen/AArch64/ext-narrow-index.ll (outdated review thread)
; CHECK-GI-DOT-NEXT: addv s0, v0.4s
; CHECK-GI-DOT-NEXT: fmov w0, s0
; CHECK-GI-DOT-NEXT: ret
; CHECK-LABEL: add_pair_v8i16_v4i32_double_sext_zext_shuffle:
Collaborator:

Yay!

; CHECK-NEXT: mova r16, #13
; CHECK-NEXT: vextract.s32 r5, x2, r16
; CHECK-NEXT: j #.LBB0_3
; CHECK-NEXT: nopb ; nopa ; nops ; nopx ; vmov wl0, wh0; nopv
Collaborator:

🚀

llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp (outdated review thread)
@ValentijnvdBeek ValentijnvdBeek force-pushed the vvandebe.shufflevector.pattern.optimization branch 2 times, most recently from f872cf4 to 73a92c2 Compare June 25, 2024 09:36

github-actions bot commented Jun 25, 2024

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff ea89ed6223aa93251b9708c16b9d605ab30dccdb 07244e73f9104198bb9cb52273afc033367e0429 -- llvm/include/llvm/CodeGen/GlobalISel/CombinerHelper.h llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp llvm/lib/Target/AIE/AIE2PreLegalizerCombiner.cpp
View the diff from clang-format here.
diff --git a/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp b/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
index 70297e7d43..b42b0b7c29 100644
--- a/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
@@ -352,32 +352,31 @@ Register CombinerHelper::createUnmergeValue(
     TargetReg = MRI.createGenericVirtualRegister(HalfSizeTy);
   }
 
-    // Each destination fits n times into the source and each iteration we
-    // exactly half the source. Therefore we need to pick on which side we want
-    // to iterate on.
-    const uint32_t DstNumElements =
-        DstTy.isVector() ? DstTy.getNumElements() : 1;
-    const uint32_t HalfWay = Start + ((End - Start) / 2);
-    const uint32_t Position = DestinationIndex * DstNumElements;
-
-    uint32_t NextStart, NextEnd;
-    if (Position < HalfWay) {
-      Builder.buildInstr(TargetOpcode::G_UNMERGE_VALUES, {TargetReg, TmpReg},
-                         {SrcReg});
-      NextStart = Start;
-      NextEnd = HalfWay;
-    } else {
-      Builder.buildInstr(TargetOpcode::G_UNMERGE_VALUES, {TmpReg, TargetReg},
-                         {SrcReg});
-      NextStart = HalfWay;
-      NextEnd = End;
-    }
+  // Each destination fits n times into the source and each iteration we
+  // exactly half the source. Therefore we need to pick on which side we want
+  // to iterate on.
+  const uint32_t DstNumElements = DstTy.isVector() ? DstTy.getNumElements() : 1;
+  const uint32_t HalfWay = Start + ((End - Start) / 2);
+  const uint32_t Position = DestinationIndex * DstNumElements;
+
+  uint32_t NextStart, NextEnd;
+  if (Position < HalfWay) {
+    Builder.buildInstr(TargetOpcode::G_UNMERGE_VALUES, {TargetReg, TmpReg},
+                       {SrcReg});
+    NextStart = Start;
+    NextEnd = HalfWay;
+  } else {
+    Builder.buildInstr(TargetOpcode::G_UNMERGE_VALUES, {TmpReg, TargetReg},
+                       {SrcReg});
+    NextStart = HalfWay;
+    NextEnd = End;
+  }
 
-    if (HalfSizeTy.isVector() && DstTy != HalfSizeTy)
-      return createUnmergeValue(MI, TargetReg, DstReg, DestinationIndex,
-                                NextStart, NextEnd);
+  if (HalfSizeTy.isVector() && DstTy != HalfSizeTy)
+    return createUnmergeValue(MI, TargetReg, DstReg, DestinationIndex,
+                              NextStart, NextEnd);
 
-    return DstReg;
+  return DstReg;
 }
 
 bool CombinerHelper::tryCombineShuffleVector(MachineInstr &MI) {

@ValentijnvdBeek ValentijnvdBeek force-pushed the vvandebe.shufflevector.pattern.optimization branch from ea44d18 to 4c022df Compare August 7, 2024 19:04
@konstantinschwarz (Collaborator) left a comment:

There are plenty of upstream tests failing. Could you please take a look?

llvm/lib/Target/AIE/AIECombine.td (outdated review thread)
llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp (outdated review thread)
llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp (outdated review thread)
@ValentijnvdBeek ValentijnvdBeek force-pushed the vvandebe.shufflevector.pattern.optimization branch 2 times, most recently from b49d34c to f855c29 Compare August 12, 2024 12:27
@ValentijnvdBeek ValentijnvdBeek force-pushed the vvandebe.shufflevector.pattern.optimization branch from f855c29 to 4d6af83 Compare August 13, 2024 14:46
@@ -399,6 +399,54 @@ adderGenerator(const int32_t From, const int32_t To, const int32_t StepSize) {
};
}

Register CombinerHelper::createUnmergeValue(
Collaborator:

Curious: Maybe IRTranslator already has a similar piece of code that we can re-use?

Collaborator Author:

Not from what I can see. G_UNMERGE_VALUES isn't meant for vectors to begin with; according to the documentation it only supports scalars. Most backends just support this as an extension, since there is no other opcode to do it. In theory, this will be ripped out in favour of a simple G_SUBREGISTER_EXTRACT in the near future, if that opcode gets proper support. Utils.cpp does have something a bit similar, llvm::extractParts, but that splits a value into two parts, while this code walks through it like a binary tree.

In any case, I would keep it like this since this code should be replaced before upstreaming and that is easier if we are the only ones depending on it.

@ValentijnvdBeek ValentijnvdBeek force-pushed the vvandebe.shufflevector.pattern.optimization branch from 4d6af83 to 13cc82a Compare August 15, 2024 08:52
llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp (outdated review thread)
Comment on lines +272 to +275
Register createUnmergeValue(MachineInstr &MI, const Register SrcReg,
const Register DstReg, uint8_t DestinationIndex,
const uint32_t Start, const uint32_t End);

Collaborator Author:

G_UNMERGE_VALUES isn't really meant to be used for vectors like this; it is actually defined as a scalar opcode. There is a nice drop-in replacement opcode, G_EXTRACT_SUBVECTOR, which landed in our tree in early August, but no other backend supports it yet. So this implementation is a necessary evil until that opcode can replace it.

Comment on lines -45 to 47
; CHECK-GISEL-NEXT: ext v0.16b, v0.16b, v1.16b, #8
; CHECK-GISEL-NEXT: // kill: def $d0 killed $d0 killed $q0
; CHECK-GISEL-NEXT: mov d0, v0.d[1]
; CHECK-GISEL-NEXT: ret
Collaborator Author:

This test is:

define <8 x i8> @i8_off8(<16 x i8> %arg1, <16 x i8> %arg2) {
entry:
  %shuffle = shufflevector <16 x i8> %arg1, <16 x i8> %arg2, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  ret <8 x i8> %shuffle
}

This turns into:

bb.1.entry:
  liveins: $q0, $q1
  %0:_(<16 x s8>) = COPY $q0
  %3:_(<8 x s8>), %2:_(<8 x s8>) = G_UNMERGE_VALUES %0:_(<16 x s8>)
  $d0 = COPY %2:_(<8 x s8>)
  RET_ReallyLR implicit $d0

This is expected.

Comment on lines -257 to 260
; CHECK-GISEL-NEXT: movi v1.2d, #0000000000000000
; CHECK-GISEL-NEXT: ext v0.16b, v0.16b, v1.16b, #8
; CHECK-GISEL-NEXT: // kill: def $d0 killed $d0 killed $q0
; CHECK-GISEL-NEXT: mov d0, v0.d[1]
; CHECK-GISEL-NEXT: ret
entry:
%shuffle = shufflevector <16 x i8> %arg1, <16 x i8> zeroinitializer, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
Collaborator Author:

This is:

define <8 x i8> @i8_zero_off8(<16 x i8> %arg1) {
entry:
  %shuffle = shufflevector <16 x i8> %arg1, <16 x i8> zeroinitializer, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  ret <8 x i8> %shuffle
}

And turns into:

bb.1.entry:
  liveins: $q0, $q1
  %0:_(<16 x s8>) = COPY $q0
  %5:_(<8 x s8>), %4:_(<8 x s8>) = G_UNMERGE_VALUES %0:_(<16 x s8>)
  $d0 = COPY %4:_(<8 x s8>)
  RET_ReallyLR implicit $d0

This is expected.

Comment on lines -3747 to 3757
; CHECK-GI-NEXT: ushll2 v0.4s, v0.8h, #0
; CHECK-GI-NEXT: ushll v5.4s, v1.4h, #0
; CHECK-GI-NEXT: ushll2 v1.4s, v1.8h, #0
; CHECK-GI-NEXT: ushll v6.4s, v2.4h, #0
; CHECK-GI-NEXT: ushll2 v2.4s, v2.8h, #0
; CHECK-GI-NEXT: ushll v7.4s, v3.4h, #0
; CHECK-GI-NEXT: ushll2 v3.4s, v3.8h, #0
; CHECK-GI-NEXT: add v0.4s, v4.4s, v0.4s
; CHECK-GI-NEXT: add v1.4s, v5.4s, v1.4s
; CHECK-GI-NEXT: add v2.4s, v6.4s, v2.4s
; CHECK-GI-NEXT: add v3.4s, v7.4s, v3.4s
; CHECK-GI-NEXT: uaddw2 v0.4s, v4.4s, v0.8h
; CHECK-GI-NEXT: uaddw2 v1.4s, v5.4s, v1.8h
; CHECK-GI-NEXT: uaddw2 v2.4s, v6.4s, v2.8h
; CHECK-GI-NEXT: uaddw2 v3.4s, v7.4s, v3.8h
; CHECK-GI-NEXT: add v0.4s, v0.4s, v1.4s
; CHECK-GI-NEXT: add v1.4s, v2.4s, v3.4s
; CHECK-GI-NEXT: add v0.4s, v0.4s, v1.4s
Collaborator Author:

This is, for me, the most surprising test result and the hardest to parse. The input LLVM IR looks like:

define i32 @add_pair_v8i16_v4i32_double_sext_zext_shuffle(<8 x i16> %ax, <8 x i16> %ay, <8 x i16> %bx, <8 x i16> %by) {
entry:
  %axx = zext <8 x i16> %ax to <8 x i32>
  %s1h = shufflevector <8 x i32> %axx, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %s1l = shufflevector <8 x i32> %axx, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %axs = add <4 x i32> %s1h, %s1l
  %ayy = zext <8 x i16> %ay to <8 x i32>
  %s2h = shufflevector <8 x i32> %ayy, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %s2l = shufflevector <8 x i32> %ayy, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %ays = add <4 x i32> %s2h, %s2l
  %az = add <4 x i32> %axs, %ays
  %bxx = zext <8 x i16> %bx to <8 x i32>
  %s3h = shufflevector <8 x i32> %bxx, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %s3l = shufflevector <8 x i32> %bxx, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %bxs = add <4 x i32> %s3h, %s3l
  %byy = zext <8 x i16> %by to <8 x i32>
  %s4h = shufflevector <8 x i32> %byy, <8 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %s4l = shufflevector <8 x i32> %byy, <8 x i32> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  %bys = add <4 x i32> %s4h, %s4l
  %bz = add <4 x i32> %bxs, %bys
  %z = add <4 x i32> %az, %bz
  %z2 = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %z)
  ret i32 %z2
}

This is translated into the following GlobalISel IR:

bb.1.entry:
  liveins: $q0, $q1, $q2, $q3
  %0:_(<8 x s16>) = COPY $q0
  %1:_(<8 x s16>) = COPY $q1
  %2:_(<8 x s16>) = COPY $q2
  %3:_(<8 x s16>) = COPY $q3
  %33:_(<4 x s16>), %34:_(<4 x s16>) = G_UNMERGE_VALUES %0:_(<8 x s16>)
  %35:_(<4 x s32>) = G_ZEXT %33:_(<4 x s16>)
  %36:_(<4 x s32>) = G_ZEXT %34:_(<4 x s16>)
  %8:_(<4 x s32>) = G_ADD %35:_, %36:_
  %37:_(<4 x s16>), %38:_(<4 x s16>) = G_UNMERGE_VALUES %1:_(<8 x s16>)
  %39:_(<4 x s32>) = G_ZEXT %37:_(<4 x s16>)
  %40:_(<4 x s32>) = G_ZEXT %38:_(<4 x s16>)
  %12:_(<4 x s32>) = G_ADD %39:_, %40:_
  %13:_(<4 x s32>) = G_ADD %8:_, %12:_
  %41:_(<4 x s16>), %42:_(<4 x s16>) = G_UNMERGE_VALUES %2:_(<8 x s16>)
  %43:_(<4 x s32>) = G_ZEXT %41:_(<4 x s16>)
  %44:_(<4 x s32>) = G_ZEXT %42:_(<4 x s16>)
  %17:_(<4 x s32>) = G_ADD %43:_, %44:_
  %45:_(<4 x s16>), %46:_(<4 x s16>) = G_UNMERGE_VALUES %3:_(<8 x s16>)
  %47:_(<4 x s32>) = G_ZEXT %45:_(<4 x s16>)
  %48:_(<4 x s32>) = G_ZEXT %46:_(<4 x s16>)
  %21:_(<4 x s32>) = G_ADD %47:_, %48:_
  %22:_(<4 x s32>) = G_ADD %17:_, %21:_
  %23:_(<4 x s32>) = G_ADD %13:_, %22:_
  %24:_(s32) = G_VECREDUCE_ADD %23:_(<4 x s32>)
  $w0 = COPY %24:_(s32)
  RET_ReallyLR implicit $w0

This corresponds to the shufflevectors perfectly and actually shows that they simplify the unmerges as well, which is a welcome bonus.

if (DstNumElts <= 2)
  return false;
Collaborator Author:

Maybe do these tests in a callback? Right now, if one fails, it exits the combiner completely rather than continuing. For now this is fine, but with more patterns that are close together, this might become a problem.

; CHECK-GI-NEXT: mov v0.b[2], v3.b[0]
; CHECK-GI-NEXT: mov v0.b[3], v4.b[0]
; CHECK-GI-NEXT: mov v0.b[4], v5.b[0]
; CHECK-GI-NEXT: mov v0.b[5], v6.b[0]
; CHECK-GI-NEXT: mov v0.b[6], v7.b[0]
; CHECK-GI-NEXT: mov v0.b[7], v16.b[0]
; CHECK-GI-NEXT: tbl v0.16b, { v0.16b, v1.16b }, v2.16b
; CHECK-GI-NEXT: mov v0.d[1], v1.d[0]
Collaborator Author:

The input LLVM IR is:

define <16 x i8> @test_concat_v16i8_v8i8_v16i8(<8 x i8> %x, <16 x i8> %y) #0 {
entry:
  %vecext = extractelement <8 x i8> %x, i32 0
  %vecinit = insertelement <16 x i8> undef, i8 %vecext, i32 0
  %vecext1 = extractelement <8 x i8> %x, i32 1
  %vecinit2 = insertelement <16 x i8> %vecinit, i8 %vecext1, i32 1
  %vecext3 = extractelement <8 x i8> %x, i32 2
  %vecinit4 = insertelement <16 x i8> %vecinit2, i8 %vecext3, i32 2
  %vecext5 = extractelement <8 x i8> %x, i32 3
  %vecinit6 = insertelement <16 x i8> %vecinit4, i8 %vecext5, i32 3
  %vecext7 = extractelement <8 x i8> %x, i32 4
  %vecinit8 = insertelement <16 x i8> %vecinit6, i8 %vecext7, i32 4
  %vecext9 = extractelement <8 x i8> %x, i32 5
  %vecinit10 = insertelement <16 x i8> %vecinit8, i8 %vecext9, i32 5
  %vecext11 = extractelement <8 x i8> %x, i32 6
  %vecinit12 = insertelement <16 x i8> %vecinit10, i8 %vecext11, i32 6
  %vecext13 = extractelement <8 x i8> %x, i32 7
  %vecinit14 = insertelement <16 x i8> %vecinit12, i8 %vecext13, i32 7
  %vecinit30 = shufflevector <16 x i8> %vecinit14, <16 x i8> %y, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 2>
  ret <16 x i8> %vecinit30
}

Our post-combiner output IR is:

bb.1.entry:
  liveins: $d0, $q1
  %0:_(<8 x s8>) = COPY $d0
  %1:_(<16 x s8>) = COPY $q1
  %3:_(s64) = G_CONSTANT i64 0
  %2:_(s8) = G_EXTRACT_VECTOR_ELT %0:_(<8 x s8>), %3:_(s64)
  %7:_(s64) = G_CONSTANT i64 1
  %6:_(s8) = G_EXTRACT_VECTOR_ELT %0:_(<8 x s8>), %7:_(s64)
  %10:_(s64) = G_CONSTANT i64 2
  %9:_(s8) = G_EXTRACT_VECTOR_ELT %0:_(<8 x s8>), %10:_(s64)
  %13:_(s64) = G_CONSTANT i64 3
  %12:_(s8) = G_EXTRACT_VECTOR_ELT %0:_(<8 x s8>), %13:_(s64)
  %16:_(s64) = G_CONSTANT i64 4
  %15:_(s8) = G_EXTRACT_VECTOR_ELT %0:_(<8 x s8>), %16:_(s64)
  %19:_(s64) = G_CONSTANT i64 5
  %18:_(s8) = G_EXTRACT_VECTOR_ELT %0:_(<8 x s8>), %19:_(s64)
  %22:_(s64) = G_CONSTANT i64 6
  %21:_(s8) = G_EXTRACT_VECTOR_ELT %0:_(<8 x s8>), %22:_(s64)
  %25:_(s64) = G_CONSTANT i64 7
  %24:_(s8) = G_EXTRACT_VECTOR_ELT %0:_(<8 x s8>), %25:_(s64)
  %34:_(<8 x s8>) = G_BUILD_VECTOR %2:_(s8), %6:_(s8), %9:_(s8), %12:_(s8), %15:_(s8), %18:_(s8), %21:_(s8), %24:_(s8)
  %31:_(<8 x s8>), %32:_(<8 x s8>) = G_UNMERGE_VALUES %1:_(<16 x s8>)
  %27:_(<16 x s8>) = G_CONCAT_VECTORS %34:_(<8 x s8>), %31:_(<8 x s8>)
  $q0 = COPY %27:_(<16 x s8>)
  RET_ReallyLR implicit $q0

The IR is alright; it correctly puts the vector together. However, there is a combiner missing that takes the extractions, turns them into a shufflevector, and then optimizes it out.

Comment on lines +2140 to 2144
; CHECK-GI-NEXT: // kill: def $d0 killed $d0 def $q0
; CHECK-GI-NEXT: mov s2, v0.s[1]
; CHECK-GI-NEXT: mov v0.s[1], v2.s[0]
; CHECK-GI-NEXT: ldr q2, [x8, :lo12:.LCPI135_0]
; CHECK-GI-NEXT: tbl v0.16b, { v0.16b, v1.16b }, v2.16b
; CHECK-GI-NEXT: mov v0.d[1], v1.d[0]
; CHECK-GI-NEXT: ret
Collaborator Author:

Same as above

@ValentijnvdBeek (Collaborator Author):

To explain the lateness of the reviews: ASUS managed to lose my Hawk Point laptop and has been promising that they will ship it soon (tm) for the past three weeks. I will see if I can get some progress done on my old 4-core.

ValentijnvdBeek and others added 2 commits September 23, 2024 23:53
These generators are used to match against shuffle masks for
optimizations. The idea is that each shuffle mask essentially encodes
a function that turns one vector into another. Generators are those
functions and allow us to match shufflevectors by generating masks.
Since masks are frequently very similar, this lets us define many
masks in relatively few lines.
@ValentijnvdBeek ValentijnvdBeek force-pushed the vvandebe.shufflevector.pattern.optimization branch from 13cc82a to c38562a Compare September 23, 2024 22:30
@ValentijnvdBeek ValentijnvdBeek force-pushed the vvandebe.shufflevector.pattern.optimization branch from c38562a to ebe6489 Compare September 23, 2024 22:36
@ValentijnvdBeek (Collaborator Author):

@konstantinschwarz @gbossu friendly review bump. Hopefully the tests don't fail (I forgot where the git-clang-format hook was again) :)

Changes from the last time:

  • Split some changes up into different commits
  • Fixed a few nits (the recent ones in a dedicated nit commit, as I am bettering my life)
  • Explained the assembly changes outside AIE2
  • Rebased onto the current main

It is a bit late, so I hope I haven't missed anything. Good luck today!

@ValentijnvdBeek (Collaborator Author):

FYI: there will be a merge conflict whenever LLVM main is merged into this branch, due to CombinerHelper::tryCombineShuffleVector being split into two parts. To incorporate this work into the new structure, simply put the stream-matching code in the new matchVectorMaskSequence function and it should be fine.

The relevant PR is: llvm/llvm-project#110545

@jsetoain the above PR is interesting for you since it sidesteps the need to convert into shufflevector explicitly. It will optimize sequences of inserts/extracts/buildvector automatically. You may want to cherry-pick the original version which turns into a G_SHUFFLE_VECTOR and then optimizes that rather than the current one that runs the optimizer directly.

Cheers!

@konstantinschwarz (Collaborator):

Hi @ValentijnvdBeek, I finally came back to this, sorry for the long delay.
In #224 I implemented an alternative combine for matching the G_SHUFFLE_VECTOR -> G_EXTRACT_SUBVECTOR (still G_UNMERGE_VALUES until the next upstream sync).

It uses the already available helper functions for creating contiguous shuffle masks. WDYT?

@ValentijnvdBeek (Collaborator Author):


It is fine; I am just happy if the work doesn't totally bit-rot. I like the linked code: for the most part it takes my approach and makes it less of a tech demo (which this sort of is, a bit). You do cover only a part of the approach, admittedly the part that this implements, but I have some worries about the general cost of this combiner in the future. It is pretty expensive, and putting it in many different combiners adds a lot of code (and causes the same values to be checked repeatedly). But as an approach within AIE, that is probably fine.

So as for this PR, one idea is that I close it and you pick it over for those combiners. I will then take it back, be inspired by your approach, and rewrite it with some of my crazier ideas to reduce the complexity, so that I can push it upstream directly. Together with maybe #129, that way it is also out of your hair and off the review todo list. Is that an idea, @konstantinschwarz?

Labels
backend:aie Code that modifies AIE code llvm:core Modifies non-AIE specific code llvm:instcombine Code that modifies the combiner

3 participants