[AArch64][GlobalISel] Combine G_EXTRACT_VECTOR_ELT and G_BUILD_VECTOR sequences into G_SHUFFLE_VECTOR #110545
Conversation
In the buildShuffleVector method, there is a restriction that the combined input vectors must be at least as large as the mask. However, this does not match how G_SHUFFLE_VECTOR is actually used in our test suite. For example:
- shuffle_concat_1 in combine_shuffle_vector.mir: 4xs8 -> 16xs8
- v3s8_crash in legalize_insert_vector_elt: 3xs8 -> 12xs8
- shuffle_vector_to_concat_vector_45670123 in prelegalizercombiner-shuffle-vector: 4xs32 -> 12xs32
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-aarch64

Author: Valentijn (ValentijnvdBeek)

Changes

This combine tries to find all the build vectors whose source elements all originate from a G_EXTRACT_VECTOR_ELT from one or two donor vectors. One example where this may happen is for AI chips, where there are a lot of matrix multiplications. Typically these vectors are dissected and then rearranged into the right transformation. E.g.
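(the example from the patch's own code comment)

%donor1(<2 x s32>) = COPY $d0
%donor2(<2 x s32>) = COPY $d1
%ext1 = G_EXTRACT_VECTOR_ELT %donor1, 0
%ext2 = G_EXTRACT_VECTOR_ELT %donor1, 1
%ext3 = G_EXTRACT_VECTOR_ELT %donor2, 0
%ext4 = G_EXTRACT_VECTOR_ELT %donor2, 1
%vector = G_BUILD_VECTOR %ext1, %ext2, %ext3, %ext4
==>
%vector = G_SHUFFLE_VECTOR %donor1, %donor2, shufflemask(0, 1, 2, 3)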
It has some speed ups on the AArch64 architecture. This pattern can also occur in other architectures where a lot of vector operations happen; it can be tempting to express matrix operations in a similar fashion rather than as one cohesive shufflevector directly.

Patch is 29.44 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/110545.diff

7 Files Affected:
diff --git a/llvm/include/llvm/CodeGen/GlobalISel/CombinerHelper.h b/llvm/include/llvm/CodeGen/GlobalISel/CombinerHelper.h
index 76d51ab819f441..15b0aaa5cf9d03 100644
--- a/llvm/include/llvm/CodeGen/GlobalISel/CombinerHelper.h
+++ b/llvm/include/llvm/CodeGen/GlobalISel/CombinerHelper.h
@@ -580,6 +580,15 @@ class CombinerHelper {
bool matchExtractVecEltBuildVec(MachineInstr &MI, Register &Reg);
void applyExtractVecEltBuildVec(MachineInstr &MI, Register &Reg);
+ /// Combine extracts of two different arrays into one build vector into a
+ /// shuffle vector.
+ bool matchCombineExtractToShuffle(
+ MachineInstr &MI, SmallVectorImpl<std::pair<Register, int>> &MatchInfo,
+ std::pair<Register, Register> &VectorRegisters);
+ void applyCombineExtractToShuffle(
+ MachineInstr &MI, SmallVectorImpl<std::pair<Register, int>> &MatchInfo,
+ std::pair<Register, Register> &VectorRegisters);
+
bool matchExtractAllEltsFromBuildVector(
MachineInstr &MI,
SmallVectorImpl<std::pair<Register, MachineInstr *>> &MatchInfo);
diff --git a/llvm/include/llvm/Target/GlobalISel/Combine.td b/llvm/include/llvm/Target/GlobalISel/Combine.td
index f838c6e62a2ce3..0525bfe1b0ddb2 100644
--- a/llvm/include/llvm/Target/GlobalISel/Combine.td
+++ b/llvm/include/llvm/Target/GlobalISel/Combine.td
@@ -905,6 +905,16 @@ def extract_vec_elt_build_vec : GICombineRule<
[{ return Helper.matchExtractVecEltBuildVec(*${root}, ${matchinfo}); }]),
(apply [{ Helper.applyExtractVecEltBuildVec(*${root}, ${matchinfo}); }])>;
+def extract_vector_register_to_id_mapping_matchinfo :
+ GIDefMatchData<"SmallVector<std::pair<Register, int>>">;
+def vector_reg_pair_matchinfo :
+ GIDefMatchData<"std::pair<Register, Register>">;
+def extract_vector_element_build_vector_to_shuffle_vector : GICombineRule<
+ (defs root:$root, extract_vector_register_to_id_mapping_matchinfo:$matchinfo, vector_reg_pair_matchinfo:$regpair),
+ (match (wip_match_opcode G_BUILD_VECTOR):$root,
+ [{ return Helper.matchCombineExtractToShuffle(*${root}, ${matchinfo}, ${regpair}); }]),
+ (apply [{ Helper.applyCombineExtractToShuffle(*${root}, ${matchinfo}, ${regpair}); }])>;
+
// Fold away full elt extracts from a build_vector.
def extract_all_elts_from_build_vector_matchinfo :
GIDefMatchData<"SmallVector<std::pair<Register, MachineInstr*>>">;
@@ -916,7 +926,8 @@ def extract_all_elts_from_build_vector : GICombineRule<
def extract_vec_elt_combines : GICombineGroup<[
extract_vec_elt_build_vec,
- extract_all_elts_from_build_vector]>;
+ extract_all_elts_from_build_vector,
+ extract_vector_element_build_vector_to_shuffle_vector]>;
def funnel_shift_from_or_shift : GICombineRule<
(defs root:$root, build_fn_matchinfo:$info),
diff --git a/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp b/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
index c279289f9161bf..1499caeb37d134 100644
--- a/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
@@ -38,8 +38,10 @@
#include "llvm/Support/MathExtras.h"
#include "llvm/Target/TargetMachine.h"
#include <cmath>
+#include <llvm/ADT/SmallVector.h>
#include <optional>
#include <tuple>
+#include <utility>
#define DEBUG_TYPE "gi-combiner"
@@ -4205,6 +4207,97 @@ void CombinerHelper::applyExtractVecEltBuildVec(MachineInstr &MI,
replaceSingleDefInstWithReg(MI, Reg);
}
+bool CombinerHelper::matchCombineExtractToShuffle(
+ MachineInstr &MI, SmallVectorImpl<std::pair<Register, int>> &VecIndexPair,
+ std::pair<Register, Register> &VectorRegisters) {
+ assert(MI.getOpcode() == TargetOpcode::G_BUILD_VECTOR);
+ const GBuildVector *Build = cast<GBuildVector>(&MI);
+ // This combine tries to find all the build vectors whose source elements
+ // all originate from a G_EXTRACT_VECTOR_ELT from one or two donor vectors.
+ // One example where this may happen is for AI chips where there are a lot
+  // of matrix multiplications. Typically these vectors are dissected and then
+ // rearranged into the right transformation.
+ // E.g.
+ // %donor1(<2 x s32>) = COPY $d0
+ // %donor2(<2 x s32>) = COPY $d1
+ // %ext1 = G_EXTRACT_VECTOR_ELT %donor1, 0
+ // %ext2 = G_EXTRACT_VECTOR_ELT %donor1, 1
+ // %ext3 = G_EXTRACT_VECTOR_ELT %donor2, 0
+ // %ext4 = G_EXTRACT_VECTOR_ELT %donor2, 1
+  // %vector = G_BUILD_VECTOR %ext1, %ext2, %ext3, %ext4
+ // ==>
+ // replace with:
+ // %vector = G_SHUFFLE_VECTOR %donor1, %donor2, shufflemask(0, 1, 2, 3)
+ SmallSetVector<Register, 2> RegisterVector;
+ const unsigned NumElements = Build->getNumSources();
+ for (unsigned Index = 0; Index < NumElements; Index++) {
+ Register SrcReg = peekThroughBitcast(Build->getSourceReg(Index), MRI);
+ auto *ExtractInstr = getOpcodeDef<GExtractVectorElement>(SrcReg, MRI);
+ if (!ExtractInstr)
+ return false;
+
+    // For shufflemasks we need to know exactly what index to place each
+    // element at, so if this build vector doesn't use exclusively constants
+    // then we can't replace it with a shufflevector.
+ auto Cst = getIConstantVRegVal(ExtractInstr->getIndexReg(), MRI);
+ if (!Cst)
+ return false;
+ unsigned Idx = Cst->getZExtValue();
+
+ Register VectorReg = ExtractInstr->getVectorReg();
+ RegisterVector.insert(VectorReg);
+ VecIndexPair.emplace_back(std::make_pair(VectorReg, Idx));
+ }
+
+ // Create a pair so that we don't need to look for them later. This code is
+ // incorrect if we have more than two vectors in the set. Since we can only
+ // put two vectors in a shuffle, we reject any solution with more than two
+ // anyways.
+ VectorRegisters =
+ std::make_pair(RegisterVector.front(), RegisterVector.back());
+
+ // We check that they're the same type before running. We can also grow the
+ // smaller one to the target size, but there isn't an elegant way to do that
+ // until we have a good lowering for G_EXTRACT_SUBVECTOR.
+ if (MRI.getType(VectorRegisters.first) != MRI.getType(VectorRegisters.second))
+ return false;
+
+ return RegisterVector.size() <= 2;
+}
+
+void CombinerHelper::applyCombineExtractToShuffle(
+ MachineInstr &MI, SmallVectorImpl<std::pair<Register, int>> &MatchInfo,
+ std::pair<Register, Register> &VectorRegisters) {
+ assert(MI.getOpcode() == TargetOpcode::G_BUILD_VECTOR);
+
+ const Register FirstRegister = VectorRegisters.first;
+ const LLT FirstRegisterType = MRI.getType(FirstRegister);
+ const unsigned VectorSize = FirstRegisterType.getNumElements();
+ SmallVector<int, 32> ShuffleMask;
+ for (auto &Pair : MatchInfo) {
+ const Register VectorReg = Pair.first;
+ int Idx = Pair.second;
+
+ if (VectorReg != VectorRegisters.first) {
+ Idx += VectorSize;
+ }
+ ShuffleMask.emplace_back(Idx);
+ }
+
+ // We could reuse the same vector register and shuffle them both together
+  // but it is nicer for later optimizations to explicitly make it undef.
+ const GBuildVector *BuildVector = cast<GBuildVector>(&MI);
+ Register SecondRegister = VectorRegisters.second;
+ if (FirstRegister == SecondRegister) {
+ SecondRegister = MRI.createGenericVirtualRegister(FirstRegisterType);
+ Builder.buildUndef(SecondRegister);
+ }
+
+ Builder.buildShuffleVector(BuildVector->getOperand(0), FirstRegister,
+ SecondRegister, ShuffleMask);
+ MI.eraseFromParent();
+}
+
bool CombinerHelper::matchExtractAllEltsFromBuildVector(
MachineInstr &MI,
SmallVectorImpl<std::pair<Register, MachineInstr *>> &SrcDstPairs) {
diff --git a/llvm/lib/CodeGen/GlobalISel/MachineIRBuilder.cpp b/llvm/lib/CodeGen/GlobalISel/MachineIRBuilder.cpp
index 59f2fc633f5de7..1ddecefa173838 100644
--- a/llvm/lib/CodeGen/GlobalISel/MachineIRBuilder.cpp
+++ b/llvm/lib/CodeGen/GlobalISel/MachineIRBuilder.cpp
@@ -772,8 +772,6 @@ MachineInstrBuilder MachineIRBuilder::buildShuffleVector(const DstOp &Res,
LLT DstTy = Res.getLLTTy(*getMRI());
LLT Src1Ty = Src1.getLLTTy(*getMRI());
LLT Src2Ty = Src2.getLLTTy(*getMRI());
- assert((size_t)(Src1Ty.getNumElements() + Src2Ty.getNumElements()) >=
- Mask.size());
assert(DstTy.getElementType() == Src1Ty.getElementType() &&
DstTy.getElementType() == Src2Ty.getElementType());
(void)DstTy;
diff --git a/llvm/test/CodeGen/AArch64/GlobalISel/combine-build-vector.mir b/llvm/test/CodeGen/AArch64/GlobalISel/combine-build-vector.mir
index 93f6051c3bd3b7..3cc836b9718297 100644
--- a/llvm/test/CodeGen/AArch64/GlobalISel/combine-build-vector.mir
+++ b/llvm/test/CodeGen/AArch64/GlobalISel/combine-build-vector.mir
@@ -151,3 +151,305 @@ body: |
RET_ReallyLR implicit $x0
...
+---
+name: reverse_concat_buildvector_shuffle
+tracksRegLiveness: true
+body: |
+ bb.1:
+ liveins: $q0, $q1
+ ; CHECK-LABEL: name: reverse_concat_buildvector_shuffle
+ ; CHECK: liveins: $q0, $q1
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<4 x s32>) = COPY $q1
+ ; CHECK-NEXT: [[SHUF:%[0-9]+]]:_(<8 x s32>) = G_SHUFFLE_VECTOR [[COPY]](<4 x s32>), [[COPY1]], shufflemask(3, 2, 1, 0, 7, 6, 5, 4)
+ ; CHECK-NEXT: RET_ReallyLR implicit [[SHUF]](<8 x s32>)
+ %0:_(<4 x s32>) = COPY $q0
+ %1:_(<4 x s32>) = COPY $q1
+ %2:_(s64) = G_CONSTANT i64 0
+ %3:_(s64) = G_CONSTANT i64 1
+ %4:_(s64) = G_CONSTANT i64 2
+ %5:_(s64) = G_CONSTANT i64 3
+ %10:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %5:_(s64)
+ %11:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %4:_(s64)
+ %12:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %3:_(s64)
+ %13:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %2:_(s64)
+ %14:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %5:_(s64)
+ %15:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %4:_(s64)
+ %16:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %3:_(s64)
+ %17:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %2:_(s64)
+ %18:_(<8 x s32>) = G_BUILD_VECTOR %10:_(s32), %11:_, %12:_, %13:_, %14:_, %15:_, %16:_, %17:_
+ RET_ReallyLR implicit %18
+...
+---
+name: reverse_interweave_buildvector_shuffle
+tracksRegLiveness: true
+body: |
+ bb.1:
+ liveins: $q0, $q1
+ ; CHECK-LABEL: name: reverse_interweave_buildvector_shuffle
+ ; CHECK: liveins: $q0, $q1
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<4 x s32>) = COPY $q1
+ ; CHECK-NEXT: [[SHUF:%[0-9]+]]:_(<8 x s32>) = G_SHUFFLE_VECTOR [[COPY]](<4 x s32>), [[COPY1]], shufflemask(3, 6, 1, 4, 7, 2, 5, 0)
+ ; CHECK-NEXT: RET_ReallyLR implicit [[SHUF]](<8 x s32>)
+ %0:_(<4 x s32>) = COPY $q0
+ %1:_(<4 x s32>) = COPY $q1
+ %2:_(s64) = G_CONSTANT i64 0
+ %3:_(s64) = G_CONSTANT i64 1
+ %4:_(s64) = G_CONSTANT i64 2
+ %5:_(s64) = G_CONSTANT i64 3
+ %10:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %5:_(s64)
+ %11:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %4:_(s64)
+ %12:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %3:_(s64)
+ %13:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %2:_(s64)
+ %14:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %5:_(s64)
+ %15:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %4:_(s64)
+ %16:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %3:_(s64)
+ %17:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %2:_(s64)
+ %18:_(<8 x s32>) = G_BUILD_VECTOR %10:_(s32), %11:_, %12:_, %13:_, %14:_, %15:_, %16:_, %17:_
+ RET_ReallyLR implicit %18
+...
+
+---
+name: reverse_interweave_same_size_as_dest_buildvector_shuffle
+tracksRegLiveness: true
+body: |
+ bb.1:
+ liveins: $q0, $q1
+ ; CHECK-LABEL: name: reverse_interweave_same_size_as_dest_buildvector_shuffle
+ ; CHECK: liveins: $q0, $q1
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<4 x s32>) = COPY $q1
+ ; CHECK-NEXT: [[SHUF:%[0-9]+]]:_(<4 x s32>) = G_SHUFFLE_VECTOR [[COPY]](<4 x s32>), [[COPY1]], shufflemask(3, 6, 1, 4)
+ ; CHECK-NEXT: RET_ReallyLR implicit [[SHUF]](<4 x s32>)
+ %0:_(<4 x s32>) = COPY $q0
+ %1:_(<4 x s32>) = COPY $q1
+ %2:_(s64) = G_CONSTANT i64 0
+ %3:_(s64) = G_CONSTANT i64 1
+ %4:_(s64) = G_CONSTANT i64 2
+ %5:_(s64) = G_CONSTANT i64 3
+ %10:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %5:_(s64)
+ %11:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %4:_(s64)
+ %12:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %3:_(s64)
+ %13:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %2:_(s64)
+ %14:_(<4 x s32>) = G_BUILD_VECTOR %10:_(s32), %11:_, %12:_, %13:_
+ RET_ReallyLR implicit %14
+...
+---
+name: reverse_interweave_half_size_as_dest_buildvector_shuffle
+tracksRegLiveness: true
+body: |
+ bb.1:
+ liveins: $q0, $q1
+ ; CHECK-LABEL: name: reverse_interweave_half_size_as_dest_buildvector_shuffle
+ ; CHECK: liveins: $q0, $q1
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<4 x s32>) = COPY $q1
+ ; CHECK-NEXT: [[SHUF:%[0-9]+]]:_(<2 x s32>) = G_SHUFFLE_VECTOR [[COPY]](<4 x s32>), [[COPY1]], shufflemask(3, 4)
+ ; CHECK-NEXT: RET_ReallyLR implicit [[SHUF]](<2 x s32>)
+ %0:_(<4 x s32>) = COPY $q0
+ %1:_(<4 x s32>) = COPY $q1
+ %2:_(s64) = G_CONSTANT i64 0
+ %3:_(s64) = G_CONSTANT i64 3
+ %10:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %3:_(s64)
+ %11:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %2:_(s64)
+ %12:_(<2 x s32>) = G_BUILD_VECTOR %10:_(s32), %11:_
+ RET_ReallyLR implicit %12
+...
+---
+name: reverse_concat_single_buildvector_shuffle
+tracksRegLiveness: true
+body: |
+ bb.1:
+ liveins: $q0, $q1
+ ; CHECK-LABEL: name: reverse_concat_single_buildvector_shuffle
+ ; CHECK: liveins: $q0, $q1
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+ ; CHECK-NEXT: [[DEF:%[0-9]+]]:_(<4 x s32>) = G_IMPLICIT_DEF
+ ; CHECK-NEXT: [[SHUF:%[0-9]+]]:_(<4 x s32>) = G_SHUFFLE_VECTOR [[COPY]](<4 x s32>), [[DEF]], shufflemask(3, 1, 0, 2)
+ ; CHECK-NEXT: RET_ReallyLR implicit [[SHUF]](<4 x s32>)
+ %0:_(<4 x s32>) = COPY $q0
+ %1:_(s64) = G_CONSTANT i64 0
+ %2:_(s64) = G_CONSTANT i64 1
+ %3:_(s64) = G_CONSTANT i64 2
+ %4:_(s64) = G_CONSTANT i64 3
+ %10:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %4:_(s64)
+ %11:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %2:_(s64)
+ %12:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %1:_(s64)
+ %13:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %3:_(s64)
+ %18:_(<4 x s32>) = G_BUILD_VECTOR %10:_(s32), %11:_, %12:_, %13:_
+ RET_ReallyLR implicit %18
+...
+---
+name: reverse_concat_double_buildvector_shuffle
+tracksRegLiveness: true
+body: |
+ bb.1:
+ liveins: $q0, $q1
+ ; CHECK-LABEL: name: reverse_concat_double_buildvector_shuffle
+ ; CHECK: liveins: $q0, $q1
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<4 x s32>) = COPY $q1
+ ; CHECK-NEXT: [[SHUF:%[0-9]+]]:_(<16 x s32>) = G_SHUFFLE_VECTOR [[COPY]](<4 x s32>), [[COPY1]], shufflemask(3, 2, 1, 0, 6, 4, 5, 7, 1, 0, 2, 0, 5, 4, 1, 7)
+ ; CHECK-NEXT: RET_ReallyLR implicit [[SHUF]](<16 x s32>)
+ %0:_(<4 x s32>) = COPY $q0
+ %1:_(<4 x s32>) = COPY $q1
+ %2:_(s64) = G_CONSTANT i64 0
+ %3:_(s64) = G_CONSTANT i64 1
+ %4:_(s64) = G_CONSTANT i64 2
+ %5:_(s64) = G_CONSTANT i64 3
+ %10:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %5:_(s64)
+ %11:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %4:_(s64)
+ %12:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %3:_(s64)
+ %13:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %2:_(s64)
+ %14:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %4:_(s64)
+ %15:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %2:_(s64)
+ %16:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %3:_(s64)
+ %17:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %5:_(s64)
+ %18:_(<16 x s32>) = G_BUILD_VECTOR %10:_(s32), %11:_, %12:_, %13:_, %14:_, %15:_, %16:_, %17:_, %12:_, %13:_, %11:_, %13:_, %16:_, %15:_, %12:_, %17:_
+ RET_ReallyLR implicit %18
+...
+---
+name: reverse_concat_buildvector_shuffle_three_sources
+tracksRegLiveness: true
+body: |
+ bb.1:
+ liveins: $q0, $q1, $q2
+ ; CHECK-LABEL: name: reverse_concat_buildvector_shuffle_three_sources
+ ; CHECK: liveins: $q0, $q1, $q2
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+ ; CHECK-NEXT: [[COPY1:%[0-9]+]]:_(<4 x s32>) = COPY $q1
+ ; CHECK-NEXT: [[COPY2:%[0-9]+]]:_(<4 x s32>) = COPY $q2
+ ; CHECK-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 1
+ ; CHECK-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
+ ; CHECK-NEXT: [[EVEC:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY]](<4 x s32>), [[C1]](s64)
+ ; CHECK-NEXT: [[EVEC1:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY]](<4 x s32>), [[C]](s64)
+ ; CHECK-NEXT: [[EVEC2:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY1]](<4 x s32>), [[C1]](s64)
+ ; CHECK-NEXT: [[EVEC3:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY1]](<4 x s32>), [[C]](s64)
+ ; CHECK-NEXT: [[EVEC4:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY2]](<4 x s32>), [[C1]](s64)
+ ; CHECK-NEXT: [[EVEC5:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY2]](<4 x s32>), [[C]](s64)
+ ; CHECK-NEXT: [[BUILD_VECTOR:%[0-9]+]]:_(<8 x s32>) = G_BUILD_VECTOR [[EVEC1]](s32), [[EVEC]](s32), [[EVEC1]](s32), [[EVEC2]](s32), [[EVEC3]](s32), [[EVEC4]](s32), [[EVEC5]](s32), [[EVEC1]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit [[BUILD_VECTOR]](<8 x s32>)
+ %0:_(<4 x s32>) = COPY $q0
+ %1:_(<4 x s32>) = COPY $q1
+ %2:_(<4 x s32>) = COPY $q2
+ %3:_(s64) = G_CONSTANT i64 1
+ %4:_(s64) = G_CONSTANT i64 2
+ %11:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %4:_(s64)
+ %12:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %3:_(s64)
+ %13:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %4:_(s64)
+ %14:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %3:_(s64)
+ %15:_(s32) = G_EXTRACT_VECTOR_ELT %2:_(<4 x s32>), %4:_(s64)
+ %16:_(s32) = G_EXTRACT_VECTOR_ELT %2:_(<4 x s32>), %3:_(s64)
+ %18:_(<8 x s32>) = G_BUILD_VECTOR %12:_(s32), %11:_, %12:_, %13:_, %14:_, %15:_, %16:_, %12:_
+ RET_ReallyLR implicit %18
+...
+---
+name: reverse_concat_buildvector_shuffle_different_element_size
+tracksRegLiveness: true
+body: |
+ bb.1:
+ liveins: $q0, $d0
+ ; CHECK-LABEL: name: reverse_concat_buildvector_shuffle_different_element_size
+ ; CHECK: liveins: $q0, $d0
+ ; CHECK-NEXT: {{ $}}
+ ; CHECK-NEXT: [[COPY:%[0-9]+]]:_(<4 x s32>) = COPY $q0
+ ; CHECK-NEXT: [[C:%[0-9]+]]:_(s64) = G_CONSTANT i64 0
+ ; CHECK-NEXT: [[C1:%[0-9]+]]:_(s64) = G_CONSTANT i64 1
+ ; CHECK-NEXT: [[C2:%[0-9]+]]:_(s64) = G_CONSTANT i64 2
+ ; CHECK-NEXT: [[C3:%[0-9]+]]:_(s64) = G_CONSTANT i64 3
+ ; CHECK-NEXT: [[EVEC:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY]](<4 x s32>), [[C3]](s64)
+ ; CHECK-NEXT: [[EVEC1:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY]](<4 x s32>), [[C2]](s64)
+ ; CHECK-NEXT: [[EVEC2:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY]](<4 x s32>), [[C1]](s64)
+ ; CHECK-NEXT: [[EVEC3:%[0-9]+]]:_(s32) = G_EXTRACT_VECTOR_ELT [[COPY]](<4 x s32>), [[C]](s64)
+ ; CHECK-NEXT: [[DEF:%[0-9]+]]:_(s32) = G_IMPLICIT_DEF
+ ; CHECK-NEXT: [[BUILD_VECTOR:%[0-9]+]]:_(<8 x s32>) = G_BUILD_VECTOR [[EVEC]](s32), [[EVEC1]](s32), [[EVEC2]](s32), [[EVEC3]](s32), [[DEF]](s32), [[DEF]](s32), [[EVEC1]](s32), [[EVEC2]](s32)
+ ; CHECK-NEXT: RET_ReallyLR implicit [[BUILD_VECTOR]](<8 x s32>)
+ %0:_(<4 x s32>) = COPY $q0
+ %1:_(<2 x s32>) = COPY $d0
+ %2:_(s64) = G_CONSTANT i64 0
+ %3:_(s64) = G_CONSTANT i64 1
+ %4:_(s64) = G_CONSTANT i64 2
+ %5:_(s64) = G_CONSTANT i64 3
+ %10:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %5:_(s64)
+ %11:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %4:_(s64)
+ %12:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %3:_(s64)
+ %13:_(s32) = G_EXTRACT_VECTOR_ELT %0:_(<4 x s32>), %2:_(s64)
+ %14:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<2 x s32>), %5:_(s64)
+ %15:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<2 x s32>), %4:_(s64)
+ %18:_(<8 x s32>) = G_BUILD_VECTOR %10:_(s32), %11:_, %12:_, %13:_, %14:_, %15:_, %11:_, %12:_
+ RET_ReallyLR implicit %18
+...
+---
+name: reverse_concat_buil...
[truncated]
; CHECK-GI-NEXT: adrp x8, .LCPI27_0
; CHECK-GI-NEXT: ldr q0, [x0]
; CHECK-GI-NEXT: add x8, x1, #2
; CHECK-GI-NEXT: st1.h { v0 }[6], [x1]
; CHECK-GI-NEXT: st1.h { v0 }[5], [x8]
; CHECK-GI-NEXT: ldr q2, [x8, :lo12:.LCPI27_0]
; CHECK-GI-NEXT: tbl.16b v0, { v0, v1 }, v2
; CHECK-GI-NEXT: mov h1, v0[1]
; CHECK-GI-NEXT: str h0, [x1]
; CHECK-GI-NEXT: str h1, [x1, #2]
; CHECK-GI-NEXT: ret
entry:
%tmp2 = load <8 x i16>, ptr %source, align 4
This code outputs the following from IR translator:
bb.1.entry:
liveins: $x0, $x1
%0:_(p0) = COPY $x0
%1:_(p0) = COPY $x1
%4:_(s64) = G_CONSTANT i64 6
%6:_(<2 x s16>) = G_IMPLICIT_DEF
%7:_(s64) = G_CONSTANT i64 0
%9:_(s64) = G_CONSTANT i64 5
%11:_(s64) = G_CONSTANT i64 1
%2:_(<8 x s16>) = G_LOAD %0:_(p0) :: (load (<8 x s16>) from %ir.source, align 4)
%3:_(s16) = G_EXTRACT_VECTOR_ELT %2:_(<8 x s16>), %4:_(s64)
%5:_(<2 x s16>) = G_INSERT_VECTOR_ELT %6:_, %3:_(s16), %7:_(s64)
%8:_(s16) = G_EXTRACT_VECTOR_ELT %2:_(<8 x s16>), %9:_(s64)
%10:_(<2 x s16>) = G_INSERT_VECTOR_ELT %5:_, %8:_(s16), %11:_(s64)
G_STORE %10:_(<2 x s16>), %1:_(p0) :: (store (<2 x s16>) into %ir.dst)
RET_ReallyLR
Which is then combined into:
bb.1.entry:
liveins: $x0, $x1
%0:_(p0) = COPY $x0
%1:_(p0) = COPY $x1
%2:_(<8 x s16>) = G_LOAD %0:_(p0) :: (load (<8 x s16>) from %ir.source, align 4)
%12:_(<8 x s16>) = G_IMPLICIT_DEF
%10:_(<2 x s16>) = G_SHUFFLE_VECTOR %2:_(<8 x s16>), %12:_, shufflemask(6, 5)
G_STORE %10:_(<2 x s16>), %1:_(p0) :: (store (<2 x s16>) into %ir.dst)
RET_ReallyLR
Which is the code that we would expect, since it creates a vector with the correct elements in the right place. It is not exactly what SDAG produces, since it misses an optimization for shufflevector where it unmerges twice into the third half of the first vector and then reverses it.
@@ -772,8 +772,6 @@ MachineInstrBuilder MachineIRBuilder::buildShuffleVector(const DstOp &Res,
LLT DstTy = Res.getLLTTy(*getMRI());
LLT Src1Ty = Src1.getLLTTy(*getMRI());
LLT Src2Ty = Src2.getLLTTy(*getMRI());
assert((size_t)(Src1Ty.getNumElements() + Src2Ty.getNumElements()) >=
As outlined in the commit, there are a couple of locations in the AArch64 backend where this assertion doesn't hold. Therefore, I removed it to bring the build function in line with the real(?) world usage of ShuffleVector.
%x:_(<16 x s8>) = G_SHUFFLE_VECTOR %a:_(<4 x s8>), %b:_, shufflemask(0, 1, 2, 3, 4, 5, 6, 7, undef, undef, undef, undef, undef, undef, undef, undef)

llvm-project/llvm/test/CodeGen/AArch64/GlobalISel/legalize-insert-vector-elt.mir, line 285 at 5d45815:
%9:_(<12 x s8>) = G_SHUFFLE_VECTOR %6(<3 x s8>), %10, shufflemask(0, 3, 3, 3, 1, 3, 3, 3, 2, 3, 3, 3)

llvm-project/llvm/test/CodeGen/AArch64/GlobalISel/prelegalizercombiner-shuffle-vector.mir, line 192 at 5d45815:
%2:_(<12 x s32>) = G_SHUFFLE_VECTOR %0(<4 x s32>), %1(<4 x s32>), shufflemask(4,5,6,7,0,1,2,3,4,5,6,7)
Why did this assert start failing with your PR? It seemed to work before.
I/we build our LLVM for development with -DLLVM_ENABLE_ASSERTIONS=On. If an assert hits, my first thought is: oh wait, I broke something. My thought is not: this is a legacy assert and I should remove it. Please don't remove asserts. You should think about fixing your code/contribution.
You are completely right that the code should be checked with asserts and that they are an important part of ensuring quality going forward. Thanks for the reminder; I think that is never unwelcome. However, in this case my PR doesn't break the assertion per se. Rather, the examples of G_SHUFFLE_VECTOR I mentioned before don't satisfy the assertion: the code asserts that the combined element count of the input vectors is at least the size of the mask, but that holds for none of those examples. Since those patterns exist, I want to support them, and since we have a buildShuffleVector, I want to use it (instead of reimplementing it).
Anyways, I am completely willing to drop the commit; you know much more than me about things like this. But my reason for removing it was to pull it into line with how the opcode is used, not because I thought it was a legacy assert :)
The source vectors each must be the same size, and the result number of elements must exactly match the number of elements in the mask. The assert doesn't quite match the machine verifier's rules for G_SHUFFLE_VECTOR.
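For illustration, the half-size test in this patch already satisfies those rules: the two sources are the same type, and the two-element destination matches the two-entry mask:

%12:_(<2 x s32>) = G_SHUFFLE_VECTOR %0:_(<4 x s32>), %1:_(<4 x s32>), shufflemask(3, 4)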
// We check that they're the same type before running. We can also grow the
// smaller one to the target size, but there isn't an elegant way to do that
// until we have a good lowering for G_EXTRACT_SUBVECTOR.
if (MRI.getType(VectorRegisters.first) != MRI.getType(VectorRegisters.second))
return false;
In the best case scenario, you would cast the smaller one up to the right size by adding G_CONCAT_VECTORS instructions until the sizes match. A few months ago G_EXTRACT_SUBVECTOR was added and, very recently, a lot of work has gone into implementing it for the RISCV backend. But I don't think there is a nice fallback legalization strategy, which is what I would want before relying on it in the Combiner.
What has worked well (?) for me before is adding G_CONCAT_VECTORS and G_UNMERGE until the target size has been reached. But I am not sure if that is the best strategy. If anybody has a good idea, please let me know so that I can work on it.
GIDefMatchData<"std::pair<Register, Register>">;
def extract_vector_element_build_vector_to_shuffle_vector : GICombineRule<
(defs root:$root, extract_vector_register_to_id_mapping_matchinfo:$matchinfo, vector_reg_pair_matchinfo:$regpair),
(match (wip_match_opcode G_BUILD_VECTOR):$root,
There will be comments about style and algorithms. What I really don't like is the pattern. The pattern matches on build vectors, but the combine is about build vectors of extract vector elts.
def extract_vector_element_build_to_shuffle_vector2 : GICombineRule<
(defs root:$root, build_fn_matchinfo:$matchinfo),
(match
(G_EXTRACT_VECTOR_ELT $el2, $vec2, $idx2),
(G_EXTRACT_VECTOR_ELT $el1, $vec1, $idx1),
(G_BUILD_VECTOR $root, $el1, $el2):$build,
[{ return Helper.matchCombineExtractToShuffle(*${build}, ${matchinfo}, ${regpair}); }]),
( apply [{ Helper.applyCombineExtractToShuffle(*${build}, ${matchinfo}, ${regpair}); }] )>;
This is a matcher, but we would have to copy and paste it for many build vector sizes ...
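For example, covering three-source build vectors would need its own near-identical rule; a hypothetical sketch of the duplication (helper signatures illustrative only, not code from the PR):

def extract_vector_element_build_to_shuffle_vector3 : GICombineRule<
  (defs root:$root, build_fn_matchinfo:$matchinfo),
  (match
    (G_EXTRACT_VECTOR_ELT $el3, $vec3, $idx3),
    (G_EXTRACT_VECTOR_ELT $el2, $vec2, $idx2),
    (G_EXTRACT_VECTOR_ELT $el1, $vec1, $idx1),
    (G_BUILD_VECTOR $root, $el1, $el2, $el3):$build,
    [{ return Helper.matchCombineExtractToShuffle(*${build}, ${matchinfo}); }]),
  (apply [{ Helper.applyCombineExtractToShuffle(*${build}, ${matchinfo}); }])>;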
Thanks for the review, sorry for committing something that you really don't like. I hope there are things about it you do enjoy. I really appreciate the link. I missed it when I looked for something like it before and I am happy with any information about tablegen that I can find.
In this case, I am not entirely sure it is exactly the right solution. As you mention, there are a lot of sizes to take into account. For the backend that I am most familiar with, AI Engine, the vector size is at most 2048 with a minimum scalar size of 8, which gives up to 2048 / 8 = 256 element counts. So that is 256 patterns to match for both this combiner and the three around it (that use the same pattern). I think we only need to do the powers of 2 and 3 (AMDGPU), so that makes it about ~14 patterns, or ~42 in total.
Another reservation that I have is that in this case complete enumeration is a bit difficult. It is always possible that there is an architecture with a vector size that is just a bit larger or different from the others. By enumerating explicitly, future/alternative backends need to remember/know to update this combiner specifically.
However, this is my first contribution to LLVM, so these are just somewhat layman considerations. If you think that they are not as relevant as I think, then I am more than willing to change the pattern and also update the combiners around it with the same pattern.
We are interested in compile-time. The combiner engine is more efficient than the C++ code; that is why we strive for precise patterns. If we jump into the C++ code for every build vector, the success rate of the C++ code will be extremely low. It is a waste of compile-time. With precise patterns, we only jump into C++ code when there is a good chance of success. If the pattern of this combine is any build vector and the chance of success of the combine is really low, then we have to try something else.
> Thanks for the review, sorry for committing something that you really don't like. I hope there are things about it you do enjoy. I really appreciate the link. I missed it when I looked for something like it before and I am happy with any information about tablegen that I can find.
Please don't be discouraged, Thorsten was only referring to the approach itself and it's common and expected in LLVM code reviews to critique the code in that manner. It's not a reflection on whether we want your contribution at all, usually it's just a matter of making sure it's done in the best way possible. We'd love to have more people like yourself contribute to the project!
I haven't thought about this patch in detail yet but once I do I'll give my feedback.
@tschuett I fixed this in a fixup commit for powers of two (from what I could find, AMDGPU also uses those as register sizes). There seems to be a cap on the number of patterns that can be supported (1024x8, 2048x16, 4096x32), and it does add a lot of extra lines. Just this pattern now takes 25-30% of the file, and the patterns around it should also be changed to this.
// but it is nicer for later optimizations to explicitly make it undef.
const GBuildVector *BuildVector = cast<GBuildVector>(&MI);
Register SecondRegister = VectorRegisters.second;
if (FirstRegister == SecondRegister) {
Probably should also have a separate combine for this
Thanks for the review! What exactly should be a separate combine? This sequence with one source? Or turning shufflevectors with the same source registers into a source and an undefined register?
Shufflevectors with the same source registers into a source and undefined (this came up on another review recently too)
Sure, I will remove the check from the current PR. Sounds like a great target for me to work on when I have a bit of free time, thanks for the suggestion! I appreciate it.
%11:_(s32) = G_EXTRACT_VECTOR_ELT %1:_(<4 x s32>), %2:_(s64)
%12:_(<2 x s32>) = G_BUILD_VECTOR %10:_(s32), %11:_
RET_ReallyLR implicit %12
...
Test with an unexpected other user
I think that I did what you asked. I added a commit where one of the arguments is not a G_EXTRACT_VECTOR_ELT and one where it is a G_BUILD_VECTOR_TRUNC (which I don't target yet). But I am not entirely sure what you mean by "other user". I must admit that I am not entirely familiar with the LLVM lingo yet :)
Is this resolved with the new commit? Then I can close it.
There are no legality checks.
So while I see some nice code quality improvements in the tests, I'm not convinced this is a good transformation to make in the general case. The problem is that vector extracts->inserts are simple operations, while a shuffle is in general a large and expensive operation unless it can be pattern matched to a more precise variant, like zip for AArch64. In your example in this PR description, your input could have been optimized instead into a G_CONCAT_VECTORS, right? More specific operations are usually faster and they're also easier for other optimizations to reason about. Shuffles suffer from the problem that they're more opaque unless we do expensive analysis.
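Concretely, for the example in the PR description the resulting mask is the identity concatenation, which is exactly the G_CONCAT_VECTORS idiom (using the registers from that example):

%vector:_(<4 x s32>) = G_SHUFFLE_VECTOR %donor1:_(<2 x s32>), %donor2:_, shufflemask(0, 1, 2, 3)
==>
%vector:_(<4 x s32>) = G_CONCAT_VECTORS %donor1:_(<2 x s32>), %donor2:_(<2 x s32>)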
; CHECK-LABEL: test_concat_v16i8_v8i8_v8i8:
; CHECK: // %bb.0: // %entry
; CHECK-NEXT: // kill: def $d0 killed $d0 def $q0
; CHECK-NEXT: // kill: def $d1 killed $d1 def $q1
; CHECK-NEXT: mov v0.d[1], v1.d[0]
; CHECK-NEXT: ret
This turns into the following Generic IR, by first turning it into a shufflevector and then running the G_CONCAT_VECTORS shufflevector combiner.
bb.1.entry:
liveins: $d0, $d1
%0:_(<8 x s8>) = COPY $d0
%1:_(<8 x s8>) = COPY $d1
%51:_(<16 x s8>) = G_CONCAT_VECTORS %0:_(<8 x s8>), %1:_(<8 x s8>)
$q0 = COPY %51:_(<16 x s8>)
RET_ReallyLR implicit $q0
First off, thanks for the great detailed comment, Amara. I don't have a lot of experience with AArch64, so your perspective is very helpful.
The combiner is based on the lowering of the shuffle vector. When you lower a shuffle vector, it turns into a sequence of extract->buildvector operations, which is exactly what I match here. So any backend either uses this lowering and performs the same, or has a more efficient implementation of shuffle vector and performs better. However, I am not sure if this assumption actually holds in reality, so hopefully you can shine a light on that.
You are right, and it does actually: another combiner turns the shufflevector into a G_CONCAT_VECTORS. I didn't mention it since it is a consequence of another combiner kicking in, although this is an intentional effect. I probably should have put a comment explaining the speedups for the cases where it conforms to the SelectionDAG output, but I didn't want to overload the PR with additional comments. Apologies for the confusion I caused by that oversight.
The background of the PR is that I worked during my internship on extending the analysis of shufflevectors and replacing them with more specific opcodes. By pulling these sequences into a shufflevector, the analysis is run on it, and then either the sequence is replaced by an equivalent opcode or the shufflevector is lowered, in the generic case, into the exact same sequence as it was before. You mention that the analysis is expensive, which is true, but we are already running this analysis for shufflevectors, and I am hoping that the speed improvements in the generated code will make up for the additional compile overhead. With some luck my previous company will let me upstream that code and allow me to implement some of the ideas I have with regards to reducing the cost of this analysis. But no promises.
Yes that's true, but let's think about this from first principles. In LLVM IR, we don't have separate instructions for things like vector concat, zip, etc., because the general shuffle can implement those. We rely on optimizations later on in codegen to select the most optimal instructions for some patterns. In GISel MIR, we don't have the same philosophy as the IR. Our goal is not to have a minimal MIR but instead to aid in lowering and selecting good target instructions. So we decompose some complex operations that targets don't have native support for into simpler ops, because now one of the factors we have to take into account is the idea of instruction legality. If we were to perform this transform, we would be going from simpler (but more numerous) operations to a more complex single operation. In order to get performant output, we then rely on another transform to re-optimize our output into the ideal form. Therefore, our transform is not so much an optimization but instead a form of canonicalization. If we decided to instead directly pattern match into a vector concat, for example, we're skipping the canonicalization step and going straight to the instruction we want to emit. There's no ambiguity here about what's the best final output; in this case it's always a G_CONCAT_VECTORS. So going directly has benefits in a) not having to spend time going through the intermediate step, and b) having the guaranteed output generated.
Improving pattern matching of shuffle idioms so that we lower into more optimal instructions sounds like a good idea. I don't think that means we need to canonicalize everything into shufflevectors, however.
FWIW the DAG does try to do this: https://github.com/llvm/llvm-project/blob/f0ed31ce4b63a5530fd1de875c0d1467d4d2c6ea/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L24096. I can see this pattern appearing if some operations require some degree of vector splitting, and others do not.
Yeah, but I don't think it's a good transformation to do if we can express the shuffle in a simpler and equally compact way. I can see that being helpful for long, complex sequences of extracts/inserts, but if we know the idiom (e.g. concat) we should go directly.
Force-pushed from 15949a1 to 4ecb536
extract_vector_element_build_vector_to_shuffle_vector]>;
extract_vector_element_build_vector_to_shuffle_vector,
extract_all_elts_from_build_vector]>;
nit: useless change, will be removed when squashed so this can be ignored
; CHECK-LABEL: getl:
; CHECK: // %bb.0:
; CHECK-NEXT: // kill: def $d0 killed $d0 killed $q0
; CHECK-NEXT: ret
; CHECK-SD-LABEL: getl:
; CHECK-SD: // %bb.0:
; CHECK-SD-NEXT: // kill: def $d0 killed $d0 killed $q0
; CHECK-SD-NEXT: ret
;
; CHECK-GI-LABEL: getl:
; CHECK-GI: // %bb.0:
; CHECK-GI-NEXT: mov b2, v0.b[1]
; CHECK-GI-NEXT: mov v1.b[0], v0.b[0]
; CHECK-GI-NEXT: mov b3, v0.b[2]
; CHECK-GI-NEXT: mov v1.b[1], v2.b[0]
; CHECK-GI-NEXT: mov b2, v0.b[3]
; CHECK-GI-NEXT: mov v1.b[2], v3.b[0]
; CHECK-GI-NEXT: mov b3, v0.b[4]
; CHECK-GI-NEXT: mov v1.b[3], v2.b[0]
; CHECK-GI-NEXT: mov b2, v0.b[5]
; CHECK-GI-NEXT: mov v1.b[4], v3.b[0]
; CHECK-GI-NEXT: mov b3, v0.b[6]
; CHECK-GI-NEXT: mov b0, v0.b[7]
; CHECK-GI-NEXT: mov v1.b[5], v2.b[0]
; CHECK-GI-NEXT: mov v1.b[6], v3.b[0]
; CHECK-GI-NEXT: mov v1.b[7], v0.b[0]
; CHECK-GI-NEXT: fmov d0, d1
; CHECK-GI-NEXT: ret
This regression comes from the missing legalization step for the shufflevector. In the previous iteration of the code, it would create a shufflevector, which would be legalized to the correct size: undefs are added to the mask until it is the same size, the combiner is rerun, and it is then able to turn the shuffle into a concat vectors. Since we are now running the analysis directly, we can't do that (yet; this pattern is covered in my additions to the analysis code).
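For reference, the undef-padded mask shape being described is the one quoted earlier in the thread:

%x:_(<16 x s8>) = G_SHUFFLE_VECTOR %a:_(<4 x s8>), %b:_, shufflemask(0, 1, 2, 3, 4, 5, 6, 7, undef, undef, undef, undef, undef, undef, undef, undef)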
; CHECK-GI-NEXT: add x8, x1, #2
; CHECK-GI-NEXT: st1.h { v0 }[6], [x1]
; CHECK-GI-NEXT: st1.h { v0 }[5], [x8]
This is in line with @aemerson's comment about tbl: it no longer turns non-optimizable sequences into shufflevectors, so it keeps the previous iteration of the code.
Thanks for your clear explanation of GISel MIR, I appreciate the time you took and I understand a lot better now. I have updated the code so that it directly runs the analysis rather than doing a step in the middle. That has the benefit that we can still share code without actually having to turn the sequence into a shufflevector.
I mean it does both. The PR aims to simplify long, complex sequences that occur in matrix operations like the ones that were common for the backend I used to work on (as mentioned in the PR description), with the side benefit of the code improvements that the shufflevector optimizations bring. Since this is not the backend I worked on and I don't know how common it is in other backends, it isn't my focus at the moment. But both of you know more about this than I ever will, so I will eagerly await the result of this discussion. One suggestion I have is that, in line with the comment from @tschuett, I can use specific combiners for each size: smaller sizes are optimized directly, but larger sequences (say >32) are turned into a shufflevector. Does that sound like an idea?
@@ -4205,6 +4218,76 @@ void CombinerHelper::applyExtractVecEltBuildVec(MachineInstr &MI,
replaceSingleDefInstWithReg(MI, Reg);
}

bool CombinerHelper::matchCombineExtractToShuffle(
I suggest to follow along with the DAG version; you'll want something similar to buildLegalVectorShuffle, which will avoid creating shuffles which will just be broken up again https://github.com/llvm/llvm-project/blob/33c14f19656c751bbbc083e4a168ab898e583bfd/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp#L23501
Thanks for the suggestion, I am now avoiding making a shufflevector after input from Amara. But I will be sure to check this if I need to make a shufflevector in the future.
LGTM with some nits, thanks!
// we could still emit an extract vector element in that case.
if (DstNumElts < 2 * SrcNumElts && DstNumElts != 1)
return false;
bool CombinerHelper::matchVectorMaskSequence(
Since this is now a shared utility, could you give it a more specific name and also add a doc comment to explain what it does.
Added a comment and changed the name slightly; it is still a bit broad, to make sure that it doesn't get pigeonholed as just concatenation. Please tell me if you would like it to be changed.
// One example where this may happen is for AI chips where there are a lot
// of matrix multiplications. Typically these vectors are dissected and then
We don't particularly attach specific use cases like "AI chips" to comments, the transform description should be enough.
Removed
… sequences into G_SHUFFLE_VECTOR

This combine tries to find all the build vectors whose source elements all originate from a G_EXTRACT_VECTOR_ELT from one or two donor vectors. One example where this may happen is for AI chips where there are a lot of matrix multiplications. Typically these vectors are dissected and then rearranged into the right transformation. E.g.

%donor1(<2 x s32>) = COPY $d0
%donor2(<2 x s32>) = COPY $d1
%ext1 = G_EXTRACT_VECTOR_ELT %donor1, 0
%ext2 = G_EXTRACT_VECTOR_ELT %donor1, 1
%ext3 = G_EXTRACT_VECTOR_ELT %donor2, 0
%ext4 = G_EXTRACT_VECTOR_ELT %donor2, 1
%vector = G_BUILD_VECTOR %ext1, %ext2, %ext3, %ext4
==>
replace with:
%vector = G_SHUFFLE_VECTOR %donor1, %donor2, shufflemask(0, 1, 2, 3)
Force-pushed from 4ecb536 to 9f8d0a4
Force-pushed from 9f8d0a4 to baa7cd8
I addressed the nits. Thanks for the great feedback you gave me; it was really helpful and I learned about LLVM from it!