[AArch64][GlobalISel] Combine G_EXTRACT_VECTOR_ELT and G_BUILD_VECTOR sequences into G_SHUFFLE_VECTOR #110545

ValentijnvdBeek · 2024-09-30T17:59:44Z

This combine tries to find all the build vectors whose source elements all originate from a G_EXTRACT_VECTOR_ELT from one or two donor vectors. One example where this may happen is for AI chips where there are a lot of matrix multiplications. Typically there vectors are dissected and then rearranged into the right transformation.

E.g.

  %donor1(<2 x s32>) = COPY $d0
  %donor2(<2 x s32>) = COPY $d1
  %ext1 = G_EXTRACT_VECTOR_ELT %donor1, 0
  %ext2 = G_EXTRACT_VECTOR_ELT %donor1, 1
  %ext3 = G_EXTRACT_VECTOR_ELT %donor2, 0
  %ext4 = G_EXTRACT_VECTOR_ELT %donor2, 1
  %vector = G_BUILD_VECTOR %ext1, %ext2, %ext3, %ext4
  ==>
  replace with:  %vector = G_SHUFFLE_VECTOR %donor1, %donor2, shufflemask(0, 1, 2, 3)

It has some speed ups in the AArch64 architecture. This pattern can also occur in other architectures were a lot of vector operations may happen. It can be tempting to express matrix operations in a similar fashion rather than as a one cohesive shufflevector directly.

In the buildShuffleVector method, there is a restriction that the input vectors must be larger than the mask. However, this is not the definition of a ShuffleVector instruction that is used inside of our test suite. For example: shuffle_concat_1 in combine_shuffle_vector.mir: 4xs8 -> 16xs8 v3s8_crash in legalize_insert_vector_elt: 3xs8 -> 12xs8 shuffle_vector_to_concat_vector_45670123 in prelegalizercombiner-shuffle-vector: 4xs32 -> 12xs32

github-actions · 2024-09-30T18:00:06Z

Thank you for submitting a Pull Request (PR) to the LLVM Project!

This PR will be automatically labeled and the relevant teams will be notified.

If you wish to, you can add reviewers by using the "Reviewers" section on this page.

If this is not working for you, it is probably because you do not have write permissions for the repository. In which case you can instead tag reviewers by name in a comment by using @ followed by their GitHub username.

If you have received no comments on your PR for a week, you can request a review by "ping"ing the PR by adding a comment “Ping”. The common courtesy "ping" rate is once a week. Please remember that you are asking for valuable time from other developers.

If you have further questions, they may be answered by the LLVM GitHub User Guide.

You can also ask questions in a comment on this PR, on the LLVM Discord or on the forums.

llvmbot · 2024-09-30T18:00:40Z

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-aarch64

Author: Valentijn (ValentijnvdBeek)

Changes

This combine tries to find all the build vectors whose source elements all originate from a G_EXTRACT_VECTOR_ELT from one or two donor vectors. One example where this may happen is for AI chips where there are a lot of matrix multiplications. Typically there vectors are dissected and then rearranged into the right transformation.

E.g.

  %donor1(&lt;2 x s32&gt;) = COPY $d0
  %donor2(&lt;2 x s32&gt;) = COPY $d1
  %ext1 = G_EXTRACT_VECTOR_ELT %donor1, 0
  %ext2 = G_EXTRACT_VECTOR_ELT %donor1, 1
  %ext3 = G_EXTRACT_VECTOR_ELT %donor2, 0
  %ext4 = G_EXTRACT_VECTOR_ELT %donor2, 1
  %vector = G_BUILD_VECTOR %ext1, %ext2, %ext3, %ext4
  ==&gt;
  replace with:  %vector = G_SHUFFLE_VECTOR %donor1, %donor2, shufflemask(0, 1, 2, 3)

It has some speed ups in the AArch64 architecture. This pattern can also occur in other architectures were a lot of vector operations may happen. It can be tempting to express matrix operations in a similar fashion rather than as a one cohesive shufflevector directly.

Patch is 29.44 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/110545.diff

7 Files Affected:

(modified) llvm/include/llvm/CodeGen/GlobalISel/CombinerHelper.h (+9)
(modified) llvm/include/llvm/Target/GlobalISel/Combine.td (+12-1)
(modified) llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp (+93)
(modified) llvm/lib/CodeGen/GlobalISel/MachineIRBuilder.cpp (-2)
(modified) llvm/test/CodeGen/AArch64/GlobalISel/combine-build-vector.mir (+302)
(modified) llvm/test/CodeGen/AArch64/arm64-neon-copy.ll (+16-88)
(modified) llvm/test/CodeGen/AArch64/arm64-rev.ll (+6-3)

diff --git a/llvm/include/llvm/CodeGen/GlobalISel/CombinerHelper.h b/llvm/include/llvm/CodeGen/GlobalISel/CombinerHelper.h
index 76d51ab819f441..15b0aaa5cf9d03 100644
--- a/llvm/include/llvm/CodeGen/GlobalISel/CombinerHelper.h
+++ b/llvm/include/llvm/CodeGen/GlobalISel/CombinerHelper.h
@@ -580,6 +580,15 @@ class CombinerHelper {
   bool matchExtractVecEltBuildVec(MachineInstr &MI, Register &Reg);
   void applyExtractVecEltBuildVec(MachineInstr &MI, Register &Reg);
 
+  /// Combine extracts of two different arrays into one build vector into a
+  /// shuffle vector.
+  bool matchCombineExtractToShuffle(
+      MachineInstr &MI, SmallVectorImpl<std::pair<Register, int>> &MatchInfo,
+      std::pair<Register, Register> &VectorRegisters);
+  void applyCombineExtractToShuffle(
+      MachineInstr &MI, SmallVectorImpl<std::pair<Register, int>> &MatchInfo,
+      std::pair<Register, Register> &VectorRegisters);
+
   bool matchExtractAllEltsFromBuildVector(
       MachineInstr &MI,
       SmallVectorImpl<std::pair<Register, MachineInstr *>> &MatchInfo);
diff --git a/llvm/include/llvm/Target/GlobalISel/Combine.td b/llvm/include/llvm/Target/GlobalISel/Combine.td
index f838c6e62a2ce3..0525bfe1b0ddb2 100644
--- a/llvm/include/llvm/Target/GlobalISel/Combine.td
+++ b/llvm/include/llvm/Target/GlobalISel/Combine.td
@@ -905,6 +905,16 @@ def extract_vec_elt_build_vec : GICombineRule<
     [{ return Helper.matchExtractVecEltBuildVec(*${root}, ${matchinfo}); }]),
   (apply [{ Helper.applyExtractVecEltBuildVec(*${root}, ${matchinfo}); }])>;
 
+def extract_vector_register_to_id_mapping_matchinfo :
+  GIDefMatchData<"SmallVector<std::pair<Register, int>>">;
+def vector_reg_pair_matchinfo :
+  GIDefMatchData<"std::pair<Register, Register>">;
+def extract_vector_element_build_vector_to_shuffle_vector : GICombineRule<
+  (defs root:$root, extract_vector_register_to_id_mapping_matchinfo:$matchinfo, vector_reg_pair_matchinfo:$regpair),
+  (match (wip_match_opcode G_BUILD_VECTOR):$root,
+    [{ return Helper.matchCombineExtractToShuffle(*${root}, ${matchinfo}, ${regpair}); }]),
+  (apply [{ Helper.applyCombineExtractToShuffle(*${root}, ${matchinfo}, ${regpair}); }])>;
+
 // Fold away full elt extracts from a build_vector.
 def extract_all_elts_from_build_vector_matchinfo :
   GIDefMatchData<"SmallVector<std::pair<Register, MachineInstr*>>">;
@@ -916,7 +926,8 @@ def extract_all_elts_from_build_vector : GICombineRule<
 
 def extract_vec_elt_combines : GICombineGroup<[
   extract_vec_elt_build_vec,
-  extract_all_elts_from_build_vector]>;
+  extract_all_elts_from_build_vector,
+  extract_vector_element_build_vector_to_shuffle_vector]>;
 
 def funnel_shift_from_or_shift : GICombineRule<
   (defs root:$root, build_fn_matchinfo:$info),
diff --git a/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp b/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
index c279289f9161bf..1499caeb37d134 100644
--- a/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
@@ -38,8 +38,10 @@
 #include "llvm/Support/MathExtras.h"
 #include "llvm/Target/TargetMachine.h"
 #include <cmath>
+#include <llvm/ADT/SmallVector.h>
 #include <optional>
 #include <tuple>
+#include <utility>
 
 #define DEBUG_TYPE "gi-combiner"
 
@@ -4205,6 +4207,97 @@ void CombinerHelper::applyExtractVecEltBuildVec(MachineInstr &MI,
   replaceSingleDefInstWithReg(MI, Reg);
 }
 
+bool CombinerHelper::matchCombineExtractToShuffle(
+    MachineInstr &MI, SmallVectorImpl<std::pair<Register, int>> &VecIndexPair,
+    std::pair<Register, Register> &VectorRegisters) {
+  assert(MI.getOpcode() == TargetOpcode::G_BUILD_VECTOR);
+  const GBuildVector *Build = cast<GBuildVector>(&MI);
+  // This combine tries to find all the build vectors whose source elements
+  // all originate from a G_EXTRACT_VECTOR_ELT from one or two donor vectors.
+  // One example where this may happen is for AI chips where there are a lot
+  // of matrix multiplications. Typically there vectors are disected and then
+  // rearranged into the right transformation.
+  // E.g.
+  //  %donor1(<2 x s32>) = COPY $d0
+  //  %donor2(<2 x s32>) = COPY $d1
+  //  %ext1 = G_EXTRACT_VECTOR_ELT %donor1, 0
+  //  %ext2 = G_EXTRACT_VECTOR_ELT %donor1, 1
+  //  %ext3 = G_EXTRACT_VECTOR_ELT %donor2, 0
+  //  %ext4 = G_EXTRACT_VECTOR_ELT %donor2, 1
+  /// %vector = G_BUILD_VECTOR %ext1, %ext2, %ext3, %ext4
+  // ==>
+  // replace with:
+  //   %vector = G_SHUFFLE_VECTOR %donor1, %donor2, shufflemask(0, 1, 2, 3)
+  SmallSetVector<Register, 2> RegisterVector;
+  const unsigned NumElements = Build->getNumSources();
+  for (unsigned Index = 0; Index < NumElements; Index++) {
+    Register SrcReg = peekThroughBitcast(Build->getSourceReg(Index), MRI);
+    auto *ExtractInstr = getOpcodeDef<GExtractVectorElement>(SrcReg, MRI);
+    if (!ExtractInstr)
+      return false;
+
+    // For shufflemasks we need to know exactly what index to place each element
+    // so if it this build vector doesn't use exclusively constants than we
+    // can't replace with a shufflevector
+    auto Cst = getIConstantVRegVal(ExtractInstr->getIndexReg(), MRI);
+    if (!Cst)
+      return false;
+    unsigned Idx = Cst->getZExtValue();
+
+    Register VectorReg = ExtractInstr->getVectorReg();
+    RegisterVector.insert(VectorReg);
+    VecIndexPair.emplace_back(std::make_pair(VectorReg, Idx));
+  }
+
+  // Create a pair so that we don't need to look for them later. This code is
+  // incorrect if we have more than two vectors in the set. Since we can only
+  // put two vectors in a shuffle, we reject any solution with more than two
+  // anyways.
+  VectorRegisters =
+      std::make_pair(RegisterVector.front(), RegisterVector.back());
+
+  // We check that they're the same type before running. We can also grow the
+  // smaller one to the target size, but there isn't an elegant way to do that
+  // until we have a good lowering for G_EXTRACT_SUBVECTOR.
+  if (MRI.getType(VectorRegisters.first) != MRI.getType(VectorRegisters.second))
+    return false;
+
+  return RegisterVector.size() <= 2;
+}
+
+void CombinerHelper::applyCombineExtractToShuffle(
+    MachineInstr &MI, SmallVectorImpl<std::pair<Register, int>> &MatchInfo,
+    std::pair<Register, Register> &VectorRegisters) {
+  assert(MI.getOpcode() == TargetOpcode::G_BUILD_VECTOR);
+
+  const Register FirstRegister = VectorRegisters.first;
+  const LLT FirstRegisterType = MRI.getType(FirstRegister);
+  const unsigned VectorSize = FirstRegisterType.getNumElements();
+  SmallVector<int, 32> ShuffleMask;
+  for (auto &Pair : MatchInfo) {
+    const Register VectorReg = Pair.first;
+    int Idx = Pair.second;
+
+    if (VectorReg != VectorRegisters.first) {
+      Idx += VectorSize;
+    }
+    ShuffleMask.emplace_back(Idx);
+  }
+
+  // We could reuse the same vector register and shuffle them both together
+  // but it is nicer for later optimizations to explicitely make it undef.
+  const GBuildVector *BuildVector = cast<GBuildVector>(&MI);
+  Register SecondRegister = VectorRegisters.second;
+  if (FirstRegister == SecondRegister) {
+    SecondRegister = MRI.createGenericVirtualRegister(FirstRegisterType);
+    Builder.buildUndef(SecondRegister);
+  }
+
+  Builder.buildShuffleVector(BuildVector->getOperand(0), FirstRegister,
+                             SecondRegister, ShuffleMask);
+  MI.eraseFromParent();
+}
+
 bool CombinerHelper::matchExtractAllEltsFromBuildVector(
     MachineInstr &MI,
     SmallVectorImpl<std::pair<Register, MachineInstr *>> &SrcDstPairs) {
diff --git a/llvm/lib/CodeGen/GlobalISel/MachineIRBuilder.cpp b/llvm/lib/CodeGen/GlobalISel/MachineIRBuilder.cpp
index 59f2fc633f5de7..1ddecefa173838 100644
--- a/llvm/lib/CodeGen/GlobalISel/MachineIRBuilder.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/MachineIRBuilder.cpp
@@ -772,8 +772,6 @@ MachineInstrBuilder MachineIRBuilder::buildShuffleVector(const DstOp &Res,
   LLT DstTy = Res.getLLTTy(*getMRI());
   LLT Src1Ty = Src1.getLLTTy(*getMRI());
   LLT Src2Ty = Src2.getLLTTy(*getMRI());
-  assert((size_t)(Src1Ty.getNumElements() + Src2Ty.getNumElements()) >=
-         Mask.size());
   assert(DstTy.getElementType() == Src1Ty.getElementType() &&
          DstTy.getElementType() == Src2Ty.getElementType());
   (void)DstTy;
diff --git a/llvm/test/CodeGen/AArch64/GlobalISel/combine-build-vector.mir b/llvm/test/CodeGen/AArch64/GlobalISel/combine-build-vector.mir
index 93f6051c3bd3b7..3cc836b9718297 100644
--- a/llvm/test/CodeGen/AArch64/GlobalISel/combine-build-vector.mir
+++ b/llvm/test/CodeGen/AArch64/GlobalISel/combine-build-vector.mir
@@ -151,3 +151,305 @@ body:             |
     RET_ReallyLR implicit $x0
 
 ...
+---
+name: reverse_concat_buildvector_shuffle
+tracksRegLiveness: true
+body:             |
+  bb.1:
+    liveins: $q0, $q1
+    ; CHECK-LABEL: name: reverse_concat_buildvector_shuffle
+    ; CHECK: liveins: $q0, $q1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<4 x s32>) = COPY $q1
+    ; CHECK-NEXT: [[SHUF:%[0-9]+]]:_(<8 x s32>) = G_SHUFFLE_VECTOR [[COPY]](<4 x s32>), [[COPY1]], shufflemask(3, 2, 1, 0, 7, 6, 5, 4)
+    ; CHECK-NEXT: RET_ReallyLR implicit [[SHUF]](<8 x s32>)
+    %0:_(<4 x s32>) = COPY $q0
+    %1:_(<4 x s32>) = COPY $q1
+    %2:_(s64) = G_CONSTANT i64 0
+    %3:_(s64) = G_CONSTANT i64 1
+    %4:_(s64) = G_CONSTANT i64 2
+    %5:_(s64) = G_CONSTANT i64 3
+    %10:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %5:_(s64)
+    %11:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %4:_(s64)
+    %12:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %3:_(s64)
+    %13:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %2:_(s64)
+    %14:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %5:_(s64)
+    %15:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %4:_(s64)
+    %16:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %3:_(s64)
+    %17:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %2:_(s64)
+    %18:_(<8 x s32>) = G_BUILD_VECTOR %10:_(s32), %11:_, %12:_, %13:_, %14:_, %15:_, %16:_, %17:_
+    RET_ReallyLR implicit %18
+...
+---
+name: reverse_interweave_buildvector_shuffle
+tracksRegLiveness: true
+body:             |
+  bb.1:
+    liveins: $q0, $q1
+    ; CHECK-LABEL: name: reverse_interweave_buildvector_shuffle
+    ; CHECK: liveins: $q0, $q1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<4 x s32>) = COPY $q1
+    ; CHECK-NEXT: [[SHUF:%[0-9]+]]:_(<8 x s32>) = G_SHUFFLE_VECTOR [[COPY]](<4 x s32>), [[COPY1]], shufflemask(3, 6, 1, 4, 7, 2, 5, 0)
+    ; CHECK-NEXT: RET_ReallyLR implicit [[SHUF]](<8 x s32>)
+    %0:_(<4 x s32>) = COPY $q0
+    %1:_(<4 x s32>) = COPY $q1
+    %2:_(s64) = G_CONSTANT i64 0
+    %3:_(s64) = G_CONSTANT i64 1
+    %4:_(s64) = G_CONSTANT i64 2
+    %5:_(s64) = G_CONSTANT i64 3
+    %10:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %5:_(s64)
+    %11:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %4:_(s64)
+    %12:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %3:_(s64)
+    %13:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %2:_(s64)
+    %14:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %5:_(s64)
+    %15:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %4:_(s64)
+    %16:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %3:_(s64)
+    %17:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %2:_(s64)
+    %18:_(<8 x s32>) = G_BUILD_VECTOR %10:_(s32), %11:_, %12:_, %13:_, %14:_, %15:_, %16:_, %17:_
+    RET_ReallyLR implicit %18
+...
+
+---
+name: reverse_interweave_same_size_as_dest_buildvector_shuffle
+tracksRegLiveness: true
+body:             |
+  bb.1:
+    liveins: $q0, $q1
+    ; CHECK-LABEL: name: reverse_interweave_same_size_as_dest_buildvector_shuffle
+    ; CHECK: liveins: $q0, $q1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<4 x s32>) = COPY $q1
+    ; CHECK-NEXT: [[SHUF:%[0-9]+]]:_(<4 x s32>) = G_SHUFFLE_VECTOR [[COPY]](<4 x s32>), [[COPY1]], shufflemask(3, 6, 1, 4)
+    ; CHECK-NEXT: RET_ReallyLR implicit [[SHUF]](<4 x s32>)
+    %0:_(<4 x s32>) = COPY $q0
+    %1:_(<4 x s32>) = COPY $q1
+    %2:_(s64) = G_CONSTANT i64 0
+    %3:_(s64) = G_CONSTANT i64 1
+    %4:_(s64) = G_CONSTANT i64 2
+    %5:_(s64) = G_CONSTANT i64 3
+    %10:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %5:_(s64)
+    %11:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %4:_(s64)
+    %12:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %3:_(s64)
+    %13:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %2:_(s64)
+    %14:_(<4 x s32>) = G_BUILD_VECTOR %10:_(s32), %11:_, %12:_, %13:_
+    RET_ReallyLR implicit %14
+...
+---
+name: reverse_interweave_half_size_as_dest_buildvector_shuffle
+tracksRegLiveness: true
+body:             |
+  bb.1:
+    liveins: $q0, $q1
+    ; CHECK-LABEL: name: reverse_interweave_half_size_as_dest_buildvector_shuffle
+    ; CHECK: liveins: $q0, $q1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<4 x s32>) = COPY $q1
+    ; CHECK-NEXT: [[SHUF:%[0-9]+]]:_(<2 x s32>) = G_SHUFFLE_VECTOR [[COPY]](<4 x s32>), [[COPY1]], shufflemask(3, 4)
+    ; CHECK-NEXT: RET_ReallyLR implicit [[SHUF]](<2 x s32>)
+    %0:_(<4 x s32>) = COPY $q0
+    %1:_(<4 x s32>) = COPY $q1
+    %2:_(s64) = G_CONSTANT i64 0
+    %3:_(s64) = G_CONSTANT i64 3
+    %10:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %3:_(s64)
+    %11:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %2:_(s64)
+    %12:_(<2 x s32>) = G_BUILD_VECTOR %10:_(s32), %11:_
+    RET_ReallyLR implicit %12
+...
+---
+name: reverse_concat_single_buildvector_shuffle
+tracksRegLiveness: true
+body:             |
+  bb.1:
+    liveins: $q0, $q1
+    ; CHECK-LABEL: name: reverse_concat_single_buildvector_shuffle
+    ; CHECK: liveins: $q0, $q1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:_(<4 x s32>) = G_IMPLICIT_DEF
+    ; CHECK-NEXT: [[SHUF:%[0-9]+]]:_(<4 x s32>) = G_SHUFFLE_VECTOR [[COPY]](<4 x s32>), [[DEF]], shufflemask(3, 1, 0, 2)
+    ; CHECK-NEXT: RET_ReallyLR implicit [[SHUF]](<4 x s32>)
+    %0:_(<4 x s32>) = COPY $q0
+    %1:_(s64) = G_CONSTANT i64 0
+    %2:_(s64) = G_CONSTANT i64 1
+    %3:_(s64) = G_CONSTANT i64 2
+    %4:_(s64) = G_CONSTANT i64 3
+    %10:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %4:_(s64)
+    %11:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %2:_(s64)
+    %12:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %1:_(s64)
+    %13:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %3:_(s64)
+    %18:_(<4 x s32>) = G_BUILD_VECTOR %10:_(s32), %11:_, %12:_, %13:_
+    RET_ReallyLR implicit %18
+...
+---
+name: reverse_concat_double_buildvector_shuffle
+tracksRegLiveness: true
+body:             |
+  bb.1:
+    liveins: $q0, $q1
+    ; CHECK-LABEL: name: reverse_concat_double_buildvector_shuffle
+    ; CHECK: liveins: $q0, $q1
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<4 x s32>) = COPY $q1
+    ; CHECK-NEXT: [[SHUF:%[0-9]+]]:_(<16 x s32>) = G_SHUFFLE_VECTOR [[COPY]](<4 x s32>), [[COPY1]], shufflemask(3, 2, 1, 0, 6, 4, 5, 7, 1, 0, 2, 0, 5, 4, 1, 7)
+    ; CHECK-NEXT: RET_ReallyLR implicit [[SHUF]](<16 x s32>)
+    %0:_(<4 x s32>) = COPY $q0
+    %1:_(<4 x s32>) = COPY $q1
+    %2:_(s64) = G_CONSTANT i64 0
+    %3:_(s64) = G_CONSTANT i64 1
+    %4:_(s64) = G_CONSTANT i64 2
+    %5:_(s64) = G_CONSTANT i64 3
+    %10:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %5:_(s64)
+    %11:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %4:_(s64)
+    %12:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %3:_(s64)
+    %13:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %2:_(s64)
+    %14:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %4:_(s64)
+    %15:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %2:_(s64)
+    %16:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %3:_(s64)
+    %17:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %5:_(s64)
+    %18:_(<16 x s32>) = G_BUILD_VECTOR %10:_(s32), %11:_, %12:_, %13:_, %14:_, %15:_, %16:_, %17:_, %12:_, %13:_, %11:_, %13:_, %16:_, %15:_, %12:_, %17:_
+    RET_ReallyLR implicit %18
+...
+---
+name: reverse_concat_buildvector_shuffle_three_sources
+tracksRegLiveness: true
+body:             |
+  bb.1:
+    liveins: $q0, $q1, $q2
+    ; CHECK-LABEL: name: reverse_concat_buildvector_shuffle_three_sources
+    ; CHECK: liveins: $q0, $q1, $q2
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+    ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<4 x s32>) = COPY $q1
+    ; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(<4 x s32>) = COPY $q2
+    ; CHECK-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 1
+    ; CHECK-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
+    ; CHECK-NEXT: [[EVEC:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY]](<4 x s32>), [[C1]](s64)
+    ; CHECK-NEXT: [[EVEC1:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY]](<4 x s32>), [[C]](s64)
+    ; CHECK-NEXT: [[EVEC2:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY1]](<4 x s32>), [[C1]](s64)
+    ; CHECK-NEXT: [[EVEC3:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY1]](<4 x s32>), [[C]](s64)
+    ; CHECK-NEXT: [[EVEC4:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY2]](<4 x s32>), [[C1]](s64)
+    ; CHECK-NEXT: [[EVEC5:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY2]](<4 x s32>), [[C]](s64)
+    ; CHECK-NEXT: [[BUILD_VECTOR:%[0-9]+]]:_(<8 x s32>) = G_BUILD_VECTOR [[EVEC1]](s32), [[EVEC]](s32), [[EVEC1]](s32), [[EVEC2]](s32), [[EVEC3]](s32), [[EVEC4]](s32), [[EVEC5]](s32), [[EVEC1]](s32)
+    ; CHECK-NEXT: RET_ReallyLR implicit [[BUILD_VECTOR]](<8 x s32>)
+    %0:_(<4 x s32>) = COPY $q0
+    %1:_(<4 x s32>) = COPY $q1
+    %2:_(<4 x s32>) = COPY $q2
+    %3:_(s64) = G_CONSTANT i64 1
+    %4:_(s64) = G_CONSTANT i64 2
+    %11:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %4:_(s64)
+    %12:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %3:_(s64)
+    %13:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %4:_(s64)
+    %14:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %3:_(s64)
+    %15:_(s32) = G_EXTRACT_VECTOR_ELT %2:_(<4 x s32>), %4:_(s64)
+    %16:_(s32) = G_EXTRACT_VECTOR_ELT %2:_(<4 x s32>), %3:_(s64)
+    %18:_(<8 x s32>) = G_BUILD_VECTOR %12:_(s32), %11:_, %12:_, %13:_, %14:_, %15:_, %16:_, %12:_
+    RET_ReallyLR implicit %18
+...
+---
+name: reverse_concat_buildvector_shuffle_different_element_size
+tracksRegLiveness: true
+body:             |
+  bb.1:
+    liveins: $q0, $d0
+    ; CHECK-LABEL: name: reverse_concat_buildvector_shuffle_different_element_size
+    ; CHECK: liveins: $q0, $d0
+    ; CHECK-NEXT: {{  $}}
+    ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+    ; CHECK-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
+    ; CHECK-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 1
+    ; CHECK-NEXT: [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
+    ; CHECK-NEXT: [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 3
+    ; CHECK-NEXT: [[EVEC:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY]](<4 x s32>), [[C3]](s64)
+    ; CHECK-NEXT: [[EVEC1:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY]](<4 x s32>), [[C2]](s64)
+    ; CHECK-NEXT: [[EVEC2:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY]](<4 x s32>), [[C1]](s64)
+    ; CHECK-NEXT: [[EVEC3:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY]](<4 x s32>), [[C]](s64)
+    ; CHECK-NEXT: [[DEF:%[0-9]+]]:_(s32) = G_IMPLICIT_DEF
+    ; CHECK-NEXT: [[BUILD_VECTOR:%[0-9]+]]:_(<8 x s32>) = G_BUILD_VECTOR [[EVEC]](s32), [[EVEC1]](s32), [[EVEC2]](s32), [[EVEC3]](s32), [[DEF]](s32), [[DEF]](s32), [[EVEC1]](s32), [[EVEC2]](s32)
+    ; CHECK-NEXT: RET_ReallyLR implicit [[BUILD_VECTOR]](<8 x s32>)
+    %0:_(<4 x s32>) = COPY $q0
+    %1:_(<2 x s32>) = COPY $d0
+    %2:_(s64) = G_CONSTANT i64 0
+    %3:_(s64) = G_CONSTANT i64 1
+    %4:_(s64) = G_CONSTANT i64 2
+    %5:_(s64) = G_CONSTANT i64 3
+    %10:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %5:_(s64)
+    %11:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %4:_(s64)
+    %12:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %3:_(s64)
+    %13:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %2:_(s64)
+    %14:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<2 x s32>), %5:_(s64)
+    %15:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<2 x s32>), %4:_(s64)
+    %18:_(<8 x s32>) = G_BUILD_VECTOR %10:_(s32), %11:_, %12:_, %13:_, %14:_, %15:_, %11:_, %12:_
+    RET_ReallyLR implicit %18
+...
+---
+name: reverse_concat_buil...
[truncated]

ValentijnvdBeek · 2024-09-30T18:10:44Z

llvm/test/CodeGen/AArch64/arm64-rev.ll

+; CHECK-GI-NEXT:    adrp x8, .LCPI27_0
 ; CHECK-GI-NEXT:    ldr q0, [x0]
-; CHECK-GI-NEXT:    add x8, x1, #2
-; CHECK-GI-NEXT:    st1.h { v0 }[6], [x1]
-; CHECK-GI-NEXT:    st1.h { v0 }[5], [x8]
+; CHECK-GI-NEXT:    ldr q2, [x8, :lo12:.LCPI27_0]
+; CHECK-GI-NEXT:    tbl.16b v0, { v0, v1 }, v2
+; CHECK-GI-NEXT:    mov h1, v0[1]
+; CHECK-GI-NEXT:    str h0, [x1]
+; CHECK-GI-NEXT:    str h1, [x1, #2]
 ; CHECK-GI-NEXT:    ret
 entry:
  %tmp2 = load <8 x i16>, ptr %source, align 4


This code outputs the following from IR translator:

bb.1.entry: liveins: $x0, $x1 %0:_(p0) = COPY $x0 %1:_(p0) = COPY $x1 %4:_(s64) = G_CONSTANT i64 6 %6:_(<2 x s16>) = G_IMPLICIT_DEF %7:_(s64) = G_CONSTANT i64 0 %9:_(s64) = G_CONSTANT i64 5 %11:_(s64) = G_CONSTANT i64 1 %2:_(<8 x s16>) = G_LOAD %0:_(p0) :: (load (<8 x s16>) from %ir.source, align 4) %3:_(s16) = G_EXTRACT_VECTOR_ELT %2:_(<8 x s16>), %4:_(s64) %5:_(<2 x s16>) = G_INSERT_VECTOR_ELT %6:_, %3:_(s16), %7:_(s64) %8:_(s16) = G_EXTRACT_VECTOR_ELT %2:_(<8 x s16>), %9:_(s64) %10:_(<2 x s16>) = G_INSERT_VECTOR_ELT %5:_, %8:_(s16), %11:_(s64) G_STORE %10:_(<2 x s16>), %1:_(p0) :: (store (<2 x s16>) into %ir.dst) RET_ReallyLR

Which is then combined into:

bb.1.entry: liveins: $x0, $x1 %0:_(p0) = COPY $x0 %1:_(p0) = COPY $x1 %2:_(<8 x s16>) = G_LOAD %0:_(p0) :: (load (<8 x s16>) from %ir.source, align 4) %12:_(<8 x s16>) = G_IMPLICIT_DEF %10:_(<2 x s16>) = G_SHUFFLE_VECTOR %2:_(<8 x s16>), %12:_, shufflemask(6, 5) G_STORE %10:_(<2 x s16>), %1:_(p0) :: (store (<2 x s16>) into %ir.dst) RET_ReallyLR

Which is the code that we would expect, since it creates a vector with the correct elements in the right place. It is not exactly SDAG, since it misses an optimization for shufflevector where it unmerges twice into the third half of the first vector and then reverses it.

llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp

ValentijnvdBeek · 2024-09-30T18:18:28Z

llvm/lib/CodeGen/GlobalISel/MachineIRBuilder.cpp

@@ -772,8 +772,6 @@ MachineInstrBuilder MachineIRBuilder::buildShuffleVector(const DstOp &Res,
  LLT DstTy = Res.getLLTTy(*getMRI());
  LLT Src1Ty = Src1.getLLTTy(*getMRI());
  LLT Src2Ty = Src2.getLLTTy(*getMRI());
-  assert((size_t)(Src1Ty.getNumElements() + Src2Ty.getNumElements()) >=


As outlined in the commit, there are a couple of locations in the AArch64 where this assertion doesn't hold. Therefore, I removed it to bring the build function in line with the real(?) world usage of ShuffleVector.

llvm-project/llvm/test/CodeGen/AArch64/GlobalISel/combine-shufflevector.mir

Line 35 in 5d45815

%x:_(<16 x s8>) = G_SHUFFLE_VECTOR %a:_(<4 x s8>), %b:_, shufflemask(0, 1, 2, 3, 4, 5, 6, 7, undef, undef, undef, undef, undef, undef, undef, undef)

llvm-project/llvm/test/CodeGen/AArch64/GlobalISel/legalize-insert-vector-elt.mir

Line 285 in 5d45815

%9:_(<12 x s8>) = G_SHUFFLE_VECTOR %6(<3 x s8>), %10, shufflemask(0, 3, 3, 3, 1, 3, 3, 3, 2, 3, 3, 3)

llvm-project/llvm/test/CodeGen/AArch64/GlobalISel/prelegalizercombiner-shuffle-vector.mir

Line 192 in 5d45815

%2:_(<12 x s32>) = G_SHUFFLE_VECTOR %0(<4 x s32>), %1(<4 x s32>), shufflemask(4,5,6,7,0,1,2,3,4,5,6,7)

Why did this assert start failing with your PR? It seemed to work before.

I/we build our LLVM for development with -DLLVM_ENABLE_ASSERTIONS=On. If an assert hits, my first thought is: Oh wait, I broke something. My thought is not: this is a legacy assert and I should remove it. Please don't remove asserts. You should think about fixing your code/contribution.

You are completely right, that the code should be checked with asserts and that they are an important part of ensuring quality going further. Thanks for the reminder, I think that is never unwelcome. However, in this case my PR doesn't break the assertion per se. Rather the examples of G_SHUFFLE_VECTOR, I mentioned before don't follow the assertion. The code asserts that the sum of the vectors is larger than the mask, but for none of those are true. Since those patterns exist, I want to support them and since we have a buildShuffleVector I want to use it (instead of reimplementing it).

Anyways, I am completely willing to drop the commit, you know much more than me about things like this. But my reason for removing it was to pull into line with how the opcode is used, not because I thought it was a legacy assert :)

The source vectors each must be the same size, and the result number of elements must exactly match the number of elements in the mask. The assert doesn't quite match the machine verifier's rules for G_SHUFFLE_VECTOR

ValentijnvdBeek · 2024-09-30T18:22:11Z

llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp

+  // We check that they're the same type before running. We can also grow the
+  // smaller one to the target size, but there isn't an elegant way to do that
+  // until we have a good lowering for G_EXTRACT_SUBVECTOR.
+  if (MRI.getType(VectorRegisters.first) != MRI.getType(VectorRegisters.second))
+    return false;
+


In the best case scenario, you would cast up the lower one into the right size by adding G_CONCAT_VECTOR instructions until they are in the same size. A few months ago the G_EXTRACT_SUBVECTOR was added and, very recently, a lot of work has gone into implementing it for the RISCV backend. But I don't think there is a nice fallback legalization strategy, which is what I would want before relying on it in the Combiner.

What has worked well (?) for me before is adding G_CONCAT_VECTOR and G_UNMERGE until the target has been reached. But, I am not sure if that is the best strategy. If anybody has a good idea, please let me know so that I can work on it.

tschuett · 2024-09-30T18:52:26Z

llvm/include/llvm/Target/GlobalISel/Combine.td

+  GIDefMatchData<"std::pair<Register, Register>">;
+def extract_vector_element_build_vector_to_shuffle_vector : GICombineRule<
+  (defs root:$root, extract_vector_register_to_id_mapping_matchinfo:$matchinfo, vector_reg_pair_matchinfo:$regpair),
+  (match (wip_match_opcode G_BUILD_VECTOR):$root,


There will be comments about style and algorithms. What I really don't like is the pattern. The pattern matches on build vectors, but the combine is about build vectors of extract vector elts.

def extract_vector_element_build_to_shuffle_vector2 : GICombineRule< (defs root:$root, build_fn_matchinfo:$matchinfo), (match (G_EXTRACT_VECTOR_ELT $el2, $vec2, $idx2), (G_EXTRACT_VECTOR_ELT $el1, $vec1, $idx1), (G_BUILD_VECTOR $root, $el1, $el2):$build, [{ return Helper.matchCombineExtractToShuffle(*${build}, ${matchinfo}, ${regpair}); }]), ( apply [{ Helper.applyCombineExtractToShuffle(*${build}, ${matchinfo}, ${regpair}); }] )>;

This a matcher, but we would copy and paste for many build vector sizes ...

https://llvm.org/docs/GlobalISel/MIRPatterns.html#gallery

Thanks for the review, sorry for committing something that you really don't like. I hope that are things about it you do enjoy. I really appreciate the link. I missed it when I looked for something like it before and I am happy with any information about tablegen that I can find.

In this case, I am not entirely sure if it is exactly the right solution. As you mention, there are a lot of sizes to take into account. For the backend that I am the most for familiar with, AI Engine, the vector size is a maximum of 2048 and a minimum scalar size of 8. So that has 256 patterns to match for both this combiner and the three around it (that use the same pattern). I think we only need to do the powers of 2 and 3 (AMDGPU), so that makes it about ~14 patterns or ~42 in total.

Another reservation that I have is that in this case, complete enumeration is a bit difficult. There is always possible that there is an architecture that has a vector size that just a bit larger or has a different size than the others. By enumerating explicitly, future/alternatives backends need to remember/know to update this combiner specifically.

However, this is my first contribution to LLVM. So these are just, somewhat layman, considerations. If you think that they are not as relevant as I think, than I more willing to change the pattern and also update the combiners around it with the same pattern.

We are interested in compile-time. The combiner engine is more efficient than the C++ code. That is why we strive for precise patterns. If we jump for every build vector into the C++ code, the success rate of the C++ code will be extremely low. It is waste of compile-time. For precise patterns, we only jump into C++ code when there is good chance to succeed. If the pattern of this combine is build vector and the chance of success of the combine is really low, then we have to try something else.

Thanks for the review, sorry for committing something that you really don't like. I hope that are things about it you do enjoy. I really appreciate the link. I missed it when I looked for something like it before and I am happy with any information about tablegen that I can find.

Please don't be discouraged, Thorsten was only referring to the approach itself and it's common and expected in LLVM code reviews to critique the code in that manner. It's not a reflection on whether we want your contribution at all, usually it's just a matter of making sure it's done in the best way possible. We'd love to have more people like yourself contribute to the project!

I haven't thought about this patch in detail yet but once I do I'll give my feedback.

@tschuett I fixed this in a fixup commit for powers of two (from what I could find also AMDGPU uses those as registers sizes). There seems to be a cap for the amount of patterns that can be supported (1024x8, 2048x16, 4096x32) and it does add a lot of extra lines. Just this pattern now takes 25-30% of the file and patterns around it should also be changed to this.

llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp

arsenm · 2024-10-01T08:17:13Z

llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp

+  // but it is nicer for later optimizations to explicitely make it undef.
+  const GBuildVector *BuildVector = cast<GBuildVector>(&MI);
+  Register SecondRegister = VectorRegisters.second;
+  if (FirstRegister == SecondRegister) {


Probably should also have a separate combine for this

Thanks for the review! What exactly should be a separate combine? This sequence with one source? Or turning shufflevectors with the same source registers into a source and an undefined register?

Shufflevectors with the same source registers into a source and undefined (this came up on another review recently too)

Sure, I will remove the check from the current PR. Sounds like a great target for me to work on when I have a bit of free time, thanks for the suggestion! I appreciate it.

arsenm · 2024-10-01T08:17:38Z

llvm/test/CodeGen/AArch64/GlobalISel/combine-build-vector.mir

+    %11:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %2:_(s64)
+    %12:_(<2 x s32>) = G_BUILD_VECTOR %10:_(s32), %11:_
+    RET_ReallyLR implicit %12
+...


Test with an unexpected other user

I think that I did what you asked. I added a commit where one of the arguments is not a G_EXTRACT_VECTOR_ELT and on where it is a G_BUILD_VECTOR_TRUNC (which I don't target yet). But I am not entirely sure what you mean with other user. I must I admit that I am not entirely familiar with the LLVM lingo yet :)

Is this resolved with the new commit, then I can close it.

tschuett · 2024-10-02T08:46:02Z

There are no legality checks.

aemerson · 2024-10-02T17:24:38Z

So while I see some nice code quality improvements in the tests, I'm not convinced this is a good transformation to make in the general case. The problem is that vector extracts->inserts are simple operations and a shuffle is in general a large and expensive operation, unless it can be pattern matched to a more precise variant like zip. For AArch64, tbl is an expensive operation that requires a load from a constant pool to materialize the indices, as well as the actual tbl itself.

In your example in this PR description, your input could have been optimized instead into a G_CONCAT_VECTOR right? More specific operations are usually faster and they're also easier for other optimizations to reason about. Shuffles suffer from the problem that they're more opaque unless we do expensive analysis.

ValentijnvdBeek · 2024-10-02T23:36:56Z

llvm/test/CodeGen/AArch64/arm64-neon-copy.ll

+; CHECK-LABEL: test_concat_v16i8_v8i8_v8i8:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    // kill: def $d0 killed $d0 def $q0
+; CHECK-NEXT:    // kill: def $d1 killed $d1 def $q1
+; CHECK-NEXT:    mov v0.d[1], v1.d[0]
+; CHECK-NEXT:    ret


This turns into the following Generic IR, by first turning it into a shufflevector and then running the G_CONCAT_VECTORS shufflevector combiner.

bb.1.entry: liveins: $d0, $d1 %0:_(<8 x s8>) = COPY $d0 %1:_(<8 x s8>) = COPY $d1 %51:_(<16 x s8>) = G_CONCAT_VECTORS %0:_(<8 x s8>), %1:_(<8 x s8>) $q0 = COPY %51:_(<16 x s8>) RET_ReallyLR implicit $q0

ValentijnvdBeek · 2024-10-03T00:01:08Z

First of thanks for the great detailed comment Amara, I don't have a lot of experience with AArch64, so your perspective is very helpful.

So while I see some nice code quality improvements in the tests, I'm not convinced this is a good transformation to make in the general case. The problem is that vector extracts->inserts are simple operations and a shuffle is in general a large and expensive operation, unless it can be pattern matched to a more precise variant like zip.

The combiner based on the lowering of the shuffle vector. When you lower a shuffle vector it will turn into a sequence of extract->buildvector operations which is exactly what I match here. So any backend either uses this lowering and performs the same, or has a more efficient implementation of shuffle vector and performs better. However, I am not sure if this assumption actually holds in reality, so hopefully you can shine a light on that.

In your example in this PR description, your input could have been optimized instead into a G_CONCAT_VECTOR right?

You are right and it does actually. Another combiner turns the shufflevector into a G_CONCAT_VECTORS, which is a consequence of this combiner being there. This is where the performance improvements in the tests come from. For example, it turns the large sequence of operations in arm64-neon-copy.ll into a shufflevector which is then turned into a G_CONCAT_VECTORS

I didn't mention it since it a consequence of another combiner kicking in, although this is an intentional effect. Probably should have put a comment explaining the speedups for the cases where it conforms to the Selection DAG output, but I didn't want to overload the PR with additional comments. Apologies for the confusion I caused by that oversight.

More specific operations are usually faster and they're also easier for other optimizations to reason about. Shuffles suffer from the problem that they're more opaque unless we do expensive analysis.

The background of the PR is that I worked during my internship on extending the analysis of shufflevectors and replacing them with more specific opcodes. By pulling these sequences into a shufflevector, the analysis is run on it and then either the sequence is replaced by an equivalent opcode or the shufflevector is lowered, in the generic case, into the exact same sequence as it was before. You mention that is the analysis is expensive, which is true, but we are already running this analysis for shufflevectors and I am hoping that the speed improvements in the generated code will make up for the additional compile overhead.

With some luck my previous company will let me upstream that code and allow me to implement some of the ideas I have with regards of reducing the cost of this analysis. But no promises.

aemerson · 2024-10-03T18:03:11Z

First of thanks for the great detailed comment Amara, I don't have a lot of experience with AArch64, so your perspective is very helpful.

So while I see some nice code quality improvements in the tests, I'm not convinced this is a good transformation to make in the general case. The problem is that vector extracts->inserts are simple operations and a shuffle is in general a large and expensive operation, unless it can be pattern matched to a more precise variant like zip.

The combiner based on the lowering of the shuffle vector. When you lower a shuffle vector it will turn into a sequence of extract->buildvector operations which is exactly what I match here. So any backend either uses this lowering and performs the same, or has a more efficient implementation of shuffle vector and performs better. However, I am not sure if this assumption actually holds in reality, so hopefully you can shine a light on that.

Yes that's true, but let's think about this from first principles. In LLVM IR, we don't have separate instructions for things like vector concat, zip, etc... because the general shuffle can implement those. We rely on optimizations later on in codegen to select the most optimal instructions for some patterns.

In the GISel MIR, we don't have the same philosophy as the IR. Our goal is not to have a minimal MIR but to instead aid in lowering and selecting good target instructions. So we decompose some complex operations that targets don't have native support for into simpler ops, because now one of the factors we have to take into account is the idea of instruction legality.

If we were to perform this transform, we would be going from simpler (but more numerous) operations into more complex single operation. In order to get performant output, we rely on another transform to then re-optimize our output into the idea form. Therefore, our transform is not so much an optimization but instead a form of canonicalization.

If we decided to instead directly pattern match into a vector concat for example, we're skipping the canonicalization step and going straight to the instruction we want to emit. There's no ambiguity here about what's the best final output, in this case it's always a G_CONCAT_VECTOR. So going directly has benefits in 1) not having to spend time going through the intermediate step, and b) having the guaranteed output generated.

In your example in this PR description, your input could have been optimized instead into a G_CONCAT_VECTOR right?

You are right and it does actually. Another combiner turns the shufflevector into a G_CONCAT_VECTORS, which is a consequence of this combiner being there. This is where the performance improvements in the tests come from. For example, it turns the large sequence of operations in arm64-neon-copy.ll into a shufflevector which is then turned into a G_CONCAT_VECTORS

I didn't mention it since it a consequence of another combiner kicking in, although this is an intentional effect. Probably should have put a comment explaining the speedups for the cases where it conforms to the Selection DAG output, but I didn't want to overload the PR with additional comments. Apologies for the confusion I caused by that oversight.

More specific operations are usually faster and they're also easier for other optimizations to reason about. Shuffles suffer from the problem that they're more opaque unless we do expensive analysis.

The background of the PR is that I worked during my internship on extending the analysis of shufflevectors and replacing them with more specific opcodes. By pulling these sequences into a shufflevector, the analysis is run on it and then either the sequence is replaced by an equivalent opcode or the shufflevector is lowered, in the generic case, into the exact same sequence as it was before. You mention that is the analysis is expensive, which is true, but we are already running this analysis for shufflevectors and I am hoping that the speed improvements in the generated code will make up for the additional compile overhead.

With some luck my previous company will let me upstream that code and allow me to implement some of the ideas I have with regards of reducing the cost of this analysis. But no promises.

Improving pattern matching of shuffle idioms so that we lower into more optimal instructions sounds like a good idea. I don't think that means that we need to canonicalize everything into shufflevectors however.

…iner

arsenm · 2024-10-10T20:09:16Z

FWIW the DAG does try to do this: https://github.com/llvm/llvm-project/blob/f0ed31ce4b63a5530fd1de875c0d1467d4d2c6ea/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L24096

I can see this pattern appearing if some operations require some degree of vector splitting, and others do not

aemerson · 2024-10-10T22:50:44Z

FWIW the DAG does try to do this: https://github.com/llvm/llvm-project/blob/f0ed31ce4b63a5530fd1de875c0d1467d4d2c6ea/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L24096

I can see this pattern appearing if some operations require some degree of vector splitting, and others do not

Yeah but I don't think it's a good transformation to do if we can express the shuffle in a simpler & equally compact way. I can see that being helpful for long, complex sequences of extracts/inserts but if we know the idiom (e.g. concat) we should go directly.

ValentijnvdBeek · 2024-10-11T12:49:42Z

llvm/include/llvm/Target/GlobalISel/Combine.td

-  extract_vector_element_build_vector_to_shuffle_vector]>;
+  extract_vector_element_build_vector_to_shuffle_vector,
+  extract_all_elts_from_build_vector]>;


nit: useless change, will be removed when squashed so this can be ignored

ValentijnvdBeek · 2024-10-11T12:51:45Z

llvm/test/CodeGen/AArch64/arm64-neon-copy.ll

-; CHECK-LABEL: getl:
-; CHECK:       // %bb.0:
-; CHECK-NEXT:    // kill: def $d0 killed $d0 killed $q0
-; CHECK-NEXT:    ret
+; CHECK-SD-LABEL: getl:
+; CHECK-SD:       // %bb.0:
+; CHECK-SD-NEXT:    // kill: def $d0 killed $d0 killed $q0
+; CHECK-SD-NEXT:    ret
+;
+; CHECK-GI-LABEL: getl:
+; CHECK-GI:       // %bb.0:
+; CHECK-GI-NEXT:    mov b2, v0.b[1]
+; CHECK-GI-NEXT:    mov v1.b[0], v0.b[0]
+; CHECK-GI-NEXT:    mov b3, v0.b[2]
+; CHECK-GI-NEXT:    mov v1.b[1], v2.b[0]
+; CHECK-GI-NEXT:    mov b2, v0.b[3]
+; CHECK-GI-NEXT:    mov v1.b[2], v3.b[0]
+; CHECK-GI-NEXT:    mov b3, v0.b[4]
+; CHECK-GI-NEXT:    mov v1.b[3], v2.b[0]
+; CHECK-GI-NEXT:    mov b2, v0.b[5]
+; CHECK-GI-NEXT:    mov v1.b[4], v3.b[0]
+; CHECK-GI-NEXT:    mov b3, v0.b[6]
+; CHECK-GI-NEXT:    mov b0, v0.b[7]
+; CHECK-GI-NEXT:    mov v1.b[5], v2.b[0]
+; CHECK-GI-NEXT:    mov v1.b[6], v3.b[0]
+; CHECK-GI-NEXT:    mov v1.b[7], v0.b[0]
+; CHECK-GI-NEXT:    fmov d0, d1
+; CHECK-GI-NEXT:    ret


This regression comes from the legalizer step of the shufflevector missing. In the previous iteration of the code, it would create a shufflevector which would be legalized into the correct size. It then adds undefs to the mask until it is the same size, rerun the combiner which is then able to turn it into a concat vectors. Since we are now running the analysis directly, we can't do that (yet, this pattern is covered in my additions to the analysis code)

ValentijnvdBeek · 2024-10-11T12:52:56Z

llvm/test/CodeGen/AArch64/arm64-rev.ll

+; CHECK-GI-NEXT:    add x8, x1, #2
+; CHECK-GI-NEXT:    st1.h { v0 }[6], [x1]
+; CHECK-GI-NEXT:    st1.h { v0 }[5], [x8]


This is inline with @aemerson comment about tbl, it no longer turns non-optimizable sequences into shufflevectors so it keeps previous iteration of the code.

ValentijnvdBeek · 2024-10-11T13:24:05Z

If we decided to instead directly pattern match into a vector concat for example, we're skipping the canonicalization step and going straight to the instruction we want to emit. There's no ambiguity here about what's the best final output, in this case it's always a G_CONCAT_VECTOR. So going directly has benefits in 1) not having to spend time going through the intermediate step, and b) having the guaranteed output generated.

Thanks for your clear explanation of GISelm MIR, I appreciate the time you took and I understand a lot better now. I have updated the code so that it directly runs the analysis rather than doing a step in the middle. That has the benefit that we can still share code without actually having to turn into shufflevector.

FWIW the DAG does try to do this: https://github.com/llvm/llvm-project/blob/f0ed31ce4b63a5530fd1de875c0d1467d4d2c6ea/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L24096

I can see this pattern appearing if some operations require some degree of vector splitting, and others do not

Yeah but I don't think it's a good transformation to do if we can express the shuffle in a simpler & equally compact way. I can see that being helpful for long, complex sequences of extracts/inserts but if we know the idiom (e.g. concat) we should go directly.

I mean it does both. The PR aims to simplify long, complex sequences that occur in matrix operations like the ones that were common for the backend that I used to work on (as mentioned in the PR description). With the side benefit of the code improvements that the shufflevector optimizations bring. Since this is not the backend I worked on and I don't how common it is in other backends, it isn't my focus atm.

But, both of you know more than I ever forget, so I will eagerly await the result of this discussion. One suggestion that I have is that, in line with the comment from @tschuett, I can use specific combiners for each size. Smaller sizes are optimized directly, but larger sequences (say >32) are turned into shufflevector. Does that sound like an idea?

arsenm · 2024-10-11T13:31:42Z

llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp

@@ -4205,6 +4218,76 @@ void CombinerHelper::applyExtractVecEltBuildVec(MachineInstr &MI,
  replaceSingleDefInstWithReg(MI, Reg);
 }

+bool CombinerHelper::matchCombineExtractToShuffle(


I suggest to follow along with the DAG version; you'll want something similar to buildLegalVectorShuffle, which will avoid creating shuffles which will just be broken up again https://github.com/llvm/llvm-project/blob/33c14f19656c751bbbc083e4a168ab898e583bfd/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L23501

Thanks for the suggestion, I am now avoiding making a shufflevector after input from Amara. But I will be sure to check this if I need to make a shufflevector in the future.

aemerson

LGTM with some nits, thanks!

aemerson · 2024-10-18T17:49:33Z

llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp

-  //       we could still emit an extract vector element in that case.
-  if (DstNumElts < 2 * SrcNumElts && DstNumElts != 1)
-    return false;
+bool CombinerHelper::matchVectorMaskSequence(


Since this is now a shared utility, could you give it a more specific name and also add a doc comment to explain what it does.

Added a comment and changed the name slightly, it is still a bit broad to make sure that it doesn't get pidgeonholed as just concatenation. Please tell me if you would like it to be changed

aemerson · 2024-10-18T17:50:59Z

llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp

+  // One example where this may happen is for AI chips where there are a lot
+  // of matrix multiplications. Typically there vectors are disected and then


We don't particularly attach specific use cases like "AI chips" to comments, the transform description should be enough.

… sequences into G_SHUFFLE_VECTOR This combine tries to find all the build vectors whose source elements all originate from a G_EXTRACT_VECTOR_ELT from one or two donor vectors. One example where this may happen is for AI chips where there are a lot of matrix multiplications. Typically there vectors are disected and then rearranged into the right transformation. E.g. %donor1(<2 x s32>) = COPY $d0 %donor2(<2 x s32>) = COPY $d1 %ext1 = G_EXTRACT_VECTOR_ELT %donor1, 0 %ext2 = G_EXTRACT_VECTOR_ELT %donor1, 1 %ext3 = G_EXTRACT_VECTOR_ELT %donor2, 0 %ext4 = G_EXTRACT_VECTOR_ELT %donor2, 1 %vector = G_BUILD_VECTOR %ext1, %ext2, %ext3, %ext4 ==> replace with: %vector = G_SHUFFLE_VECTOR %donor1, %donor2, shufflemask(0, 1, 2, 3)

ValentijnvdBeek · 2024-10-22T23:11:39Z

LGTM with some nits, thanks!

I addresses the nits. Thanks for the great feedback that you gave me, they were really helpful and I learned about LLVM from them!

ValentijnvdBeek changed the title ~~Valentijnbeek combine extract build~~ [AArch64][GlobalISel] Combine G_EXTRACT_VECTOR_ELT and G_BUILD_VECTOR sequences into G_SHUFFLE_VECTOR Sep 30, 2024

llvmbot added backend:AArch64 llvm:globalisel labels Sep 30, 2024

ValentijnvdBeek commented Sep 30, 2024

View reviewed changes

llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp Outdated Show resolved Hide resolved

ValentijnvdBeek commented Sep 30, 2024

View reviewed changes

tschuett requested review from davemgreen, arsenm, aemerson and tschuett September 30, 2024 18:35

tschuett reviewed Sep 30, 2024

View reviewed changes

arsenm reviewed Oct 1, 2024

View reviewed changes

ValentijnvdBeek commented Oct 2, 2024

View reviewed changes

[GISel] Factor out the mask matching code from the shufflevector comb…

0d5d582

…iner

ValentijnvdBeek force-pushed the valentijnbeek-combine-extract-build branch from 15949a1 to 4ecb536 Compare October 11, 2024 12:47

ValentijnvdBeek commented Oct 11, 2024

View reviewed changes

arsenm reviewed Oct 11, 2024

View reviewed changes

aemerson approved these changes Oct 18, 2024

View reviewed changes

ValentijnvdBeek mentioned this pull request Oct 22, 2024

[LLVM] Optimize G_SHUFFLE_VECTOR into more efficient generic opcodes Xilinx/llvm-aie#41

Open

4 tasks

ValentijnvdBeek added 8 commits October 22, 2024 23:52

Fixup! Rename function to a clearer name

4f2fdcd

Fixup! Add comment to function

633b8fc

Fixup! Remove redudant G_BUILD_VECTOR assert

6cb6d38

Fixup! Add aditional tests

f93503c

Fixup! Create undef using buildUndef

35500c6

fixup! Directly run the shuffle vector analysis code

1ef5773

Nit! remove use case from comment

3bdebfa

ValentijnvdBeek force-pushed the valentijnbeek-combine-extract-build branch from 4ecb536 to 9f8d0a4 Compare October 22, 2024 23:05

Fixup! Use the gallery MIR pattern to be more precise

baa7cd8

ValentijnvdBeek force-pushed the valentijnbeek-combine-extract-build branch from 9f8d0a4 to baa7cd8 Compare October 22, 2024 23:06

		// One example where this may happen is for AI chips where there are a lot
		// of matrix multiplications. Typically there vectors are disected and then

[AArch64][GlobalISel] Combine G_EXTRACT_VECTOR_ELT and G_BUILD_VECTOR sequences into G_SHUFFLE_VECTOR #110545

Are you sure you want to change the base?

[AArch64][GlobalISel] Combine G_EXTRACT_VECTOR_ELT and G_BUILD_VECTOR sequences into G_SHUFFLE_VECTOR #110545

Conversation

ValentijnvdBeek commented Sep 30, 2024

github-actions bot commented Sep 30, 2024

llvmbot commented Sep 30, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tschuett commented Oct 2, 2024

aemerson commented Oct 2, 2024

Choose a reason for hiding this comment

ValentijnvdBeek commented Oct 3, 2024

aemerson commented Oct 3, 2024 • edited Loading

arsenm commented Oct 10, 2024

aemerson commented Oct 10, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ValentijnvdBeek commented Oct 11, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aemerson left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ValentijnvdBeek commented Oct 22, 2024

llvmbot commented Sep 30, 2024 •

edited

Loading

aemerson commented Oct 3, 2024 •

edited

Loading