WIP: Working PR for AVX128 #3720
Do you want to define either a helper taking a lambda, or a macro, or maybe some template I wouldn't know how to get working, for:

```cpp
void AVXBinop(OpcodeArgs, HELPER) {
  const auto SrcSize = GetSrcSize(Op);
  const auto Is128Bit = SrcSize == Core::CPUState::XMM_SSE_REG_SIZE;

  auto Src1 = AVX128_LoadSource_WithOpSize(Op, Op->Src[0], Op->Flags, !Is128Bit);
  auto Src2 = AVX128_LoadSource_WithOpSize(Op, Op->Src[1], Op->Flags, !Is128Bit);

  RefPair Result {};
  Result.Low = HELPER(ElementSize, Src1.Low, Src2.Low);
  if (Is128Bit) {
    Result.High = LoadAndCacheNamedVectorConstant(OpSize::i128Bit, FEXCore::IR::NamedVectorConstant::NAMED_VECTOR_ZERO);
  } else {
    Result.High = HELPER(ElementSize, Src1.High, Src2.High);
  }

  AVX128_StoreResult_WithOpSize(Op, Op->Dest, Result);
}
```

Should let you deduplicate a lot of binops.

Alternatively, can we implement VPUNPCKL, VPACKUS, etc. with the usual VADD path? That would probably get most of the win with the least effort. But there's still a lot of boilerplate for implementations like VADDSUBP and VANDN.
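For illustration, the lambda-taking variant suggested above could look roughly like the following standalone sketch. `Half`, `RefPair`, and `AddLanes` here are hypothetical stand-ins that model each 128-bit half as four 32-bit lanes; they are not FEX's IR types, and the real helper would emit IR rather than compute values.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for illustration only; not FEX's actual types.
using Half = std::array<uint32_t, 4>; // one 128-bit half as four 32-bit lanes
struct RefPair {
  Half Low {};
  Half High {};
};

// Generic binop: applies Helper to the low halves, and to the high halves
// only when the operation is 256-bit wide.
template <typename HelperT>
RefPair AVXBinop(bool Is128Bit, const RefPair& Src1, const RefPair& Src2, HelperT&& Helper) {
  RefPair Result {};
  Result.Low = Helper(Src1.Low, Src2.Low);
  if (!Is128Bit) {
    Result.High = Helper(Src1.High, Src2.High);
  }
  // For 128-bit operations Result.High stays zero-initialized, which models
  // the zeroing of the upper 128 bits in the pseudocode above.
  return Result;
}

// Example helper: lane-wise 32-bit add.
inline Half AddLanes(const Half& A, const Half& B) {
  Half Out {};
  for (std::size_t i = 0; i < Out.size(); ++i) {
    Out[i] = A[i] + B[i];
  }
  return Out;
}
```

Each binop then reduces to a one-line call such as `AVXBinop(Is128Bit, Src1, Src2, AddLanes)`, or the same call with an inline lambda in place of `AddLanes`.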
Looks like the SVE-256 impls have an |
Did it. Lots of duplication removed.
Looks great!
Wasn't exposed before since we couldn't unit test the SVE256 implementation.
Previously we could always tell the size of the operation from how this affects the operating size of the instruction, converting 64-bit down to 32-bit as an example. AVX gather instructions are the first instruction class that can't infer this information. The element load size is determined by the W flag, but the operating size of 128-bit or 256-bit is determined by other means. Expose this flag so we can tell the two apart. The FMA instructions are going to need this flag as well.
This does a gather load three ways: SVE256, SVE128, and ASIMD. This operation is a bit special since it can't quite handle all gather loadstores in the 256-bit case, and requires the frontend to decompose the operation when the striding hits a mode that SVE doesn't support. The 128-bit case is a lot simpler since both support all the cases where stride doesn't match. I find this to be a nice compromise while there aren't any SVE256 products on the market.

In the 128-bit case there is an SVE path which is used if the passed-in stride is one SVE understands; otherwise it falls back to an ASIMD implementation which manually emulates everything that is necessary. This instruction very explicitly does exactly what the AVX gather instructions want, because the operation is complex enough that we don't want to try to make it a generic solution.
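To make concrete what has to be emulated per element, here is a scalar reference model of a masked gather with 32-bit indices and 32-bit data, in the style of VPGATHERDD. It is only a sketch of the x86-side semantics: the function name and signature are hypothetical, `Base` is assumed to already include any displacement, and this is not the IR operation added in this PR.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Scalar reference model of a VPGATHERDD-style gather.
// Hypothetical helper for illustration; not FEX's IR operation.
template <std::size_t N>
std::array<uint32_t, N> GatherDD(const uint8_t* Base,
                                 const std::array<int32_t, N>& Index,
                                 uint32_t Scale,
                                 std::array<uint32_t, N>& Mask,
                                 std::array<uint32_t, N> Dest) {
  // x86 addressing only allows these scale factors.
  assert(Scale == 1 || Scale == 2 || Scale == 4 || Scale == 8);

  for (std::size_t i = 0; i < N; ++i) {
    // Only elements whose mask has the most significant bit set are loaded;
    // the others keep the previous destination value.
    if (Mask[i] & 0x80000000u) {
      const uint8_t* Addr = Base + static_cast<int64_t>(Index[i]) * static_cast<int64_t>(Scale);
      std::memcpy(&Dest[i], Addr, sizeof(uint32_t));
    }
    // The instruction clears the mask register once it completes.
    Mask[i] = 0;
  }
  return Dest;
}
```

The signed index multiplied by a scale of 1, 2, 4, or 8 is the "stride" part: a hardware gather only helps when its addressing mode can express that combination directly, otherwise each lane is loaded individually as in the loop above.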
This is the last family of instructions that we needed to implement for AVX2 to be properly advertised!
Just to ensure we still have feature parity.
ADDSUB didn't cover this new variant.
We no longer care about AVX versions; consolidate them into a single config option which enables both.
This enables AVX, AVX2, FMA3 for the entire CPUID!

```bash
$ FEX_HOSTFEATURES=enableavx,enableavx2 ./Bin/FEXInterpreter /usr/bin/cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Cortex-A78AE
stepping : 0
microcode : 0x0
cpu MHz : 3000
cache size : 512 KB
physical id : 0
siblings : 12
core id : 0
cpu cores : 12
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht tm syscall nx mmxext fxsr_opt rdtscp lm 3dnow 3dnowext constant_tsc art rep_good nopl xtoplogy nonstop_tsc cpuid tsc_known_freq pni pclmulqdq dtes64 monitor tm2 ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx hypervisor lahf_lm cmp_legacy extapic abm 3dnowprefetch tce fsgsbase bmi1 avx2 smep bmi2 erms invpcid adx clflushopt clwb sha_ni clzero arat vpclmulqdq rdpid fsrm
bugs :
bogomips : 8000.0
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
```

Notice avx, avx2, and fma.
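If you want to check for these flags programmatically rather than by eye, a whole-token match is needed, since a substring search for `avx` would also match `avx2`. A minimal sketch, assuming the caller has already extracted the text after `flags :` from `/proc/cpuinfo` (`HasCpuFlag` is a hypothetical helper, not part of FEX):

```cpp
#include <sstream>
#include <string>

// Returns true if Flag appears as a whole whitespace-separated token
// in the flags line. Hypothetical helper for illustration.
inline bool HasCpuFlag(const std::string& FlagsLine, const std::string& Flag) {
  std::istringstream Stream(FlagsLine);
  std::string Token;
  while (Stream >> Token) {
    if (Token == Flag) {
      return true;
    }
  }
  return false;
}
```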
Now that everything is fanned out into PRs, closing this one.
This is the working branch for AVX128 and I'll be peeling off commits as other PRs get merged.
This is to allow visibility into where the implementation is currently at, to inform other PR reviews.