WIP: Working PR for AVX128 #3720

Closed · wants to merge 87 commits

Conversation

Sonicadvance1 (Member)

This is the working branch for AVX128 and I'll be peeling off commits as other PRs get merged.

This is to allow visibility into where the implementation currently stands, to inform other PR reviews.

@Sonicadvance1 (Member Author)

46% tests passed, 530 tests failed out of 986

@Sonicadvance1 (Member Author)

61% tests passed, 386 tests failed out of 986

@alyssarosenzweig (Collaborator) commented Jun 18, 2024

Do you want to define either a helper taking a lambda, or a macro, or maybe some template I wouldn't know how to get working, for:

void AVXBinop(OpcodeArgs, HELPER) {
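  // HELPER and ElementSize are placeholders: HELPER stands in for the
  // per-instruction IR builder call and ElementSize for its element width.
  // 128-bit VEX encodings zero the upper 128 bits of the destination, hence
  // the named-zero constant for Result.High below.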
  const auto SrcSize = GetSrcSize(Op);
  const auto Is128Bit = SrcSize == Core::CPUState::XMM_SSE_REG_SIZE;

  auto Src1 = AVX128_LoadSource_WithOpSize(Op, Op->Src[0], Op->Flags, !Is128Bit);
  auto Src2 = AVX128_LoadSource_WithOpSize(Op, Op->Src[1], Op->Flags, !Is128Bit);

  RefPair Result {};

  Result.Low = HELPER(ElementSize, Src1.Low, Src2.Low);
  if (Is128Bit) {
    Result.High = LoadAndCacheNamedVectorConstant(OpSize::i128Bit, FEXCore::IR::NamedVectorConstant::NAMED_VECTOR_ZERO);
  } else {
    Result.High = HELPER(ElementSize, Src1.High, Src2.High);
  }

  AVX128_StoreResult_WithOpSize(Op, Op->Dest, Result);
}

Should let you deduplicate a lot of binops.


Alternatively, can we implement VPUNPCKL, VPACKUS, etc with the usual VADD path? That would probably get most of the win with the least effort. But there's still a lot of boilerplate for implementations like VADDSUBP and VANDN.
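
For illustration, a lambda- or template-taking helper along those lines might look like the sketch below. This is only a hedged sketch reusing the names from the pseudocode above; the template shape and the `_VAdd` usage are assumptions, not the implementation that actually landed.

```cpp
// Hedged sketch only: a template helper in the spirit of the pseudocode above.
// HELPER becomes a callable and ElementSize a parameter; all names are taken
// from the pseudocode or are assumptions, not FEX's actual implementation.
template <typename BinOp>
void AVX128_VectorBinaryOpImpl(OpcodeArgs, size_t ElementSize, BinOp Helper) {
  const auto SrcSize = GetSrcSize(Op);
  const auto Is128Bit = SrcSize == Core::CPUState::XMM_SSE_REG_SIZE;

  auto Src1 = AVX128_LoadSource_WithOpSize(Op, Op->Src[0], Op->Flags, !Is128Bit);
  auto Src2 = AVX128_LoadSource_WithOpSize(Op, Op->Src[1], Op->Flags, !Is128Bit);

  RefPair Result {};
  Result.Low = Helper(ElementSize, Src1.Low, Src2.Low);

  if (Is128Bit) {
    // VEX.128 encodings zero the upper half of the destination register.
    Result.High = LoadAndCacheNamedVectorConstant(OpSize::i128Bit, FEXCore::IR::NamedVectorConstant::NAMED_VECTOR_ZERO);
  } else {
    Result.High = Helper(ElementSize, Src1.High, Src2.High);
  }

  AVX128_StoreResult_WithOpSize(Op, Op->Dest, Result);
}

// Per-instruction handlers then reduce to one-liners, e.g. (IR op name assumed):
// AVX128_VectorBinaryOpImpl(Op, OpSize::i8Bit,
//   [&](size_t ElementSize, Ref Src1, Ref Src2) { return _VAdd(OpSize::i128Bit, ElementSize, Src1, Src2); });
```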

@alyssarosenzweig (Collaborator)

> Do you want to define either a helper taking a lambda, or a macro, or maybe some template I wouldn't know how to get working, for: […] Should let you deduplicate a lot of binops.
>
> Alternatively, can we implement VPUNPCKL, VPACKUS, etc with the usual VADD path? That would probably get most of the win with the least effort. But there's still a lot of boilerplate for implementations like VADDSUBP and VANDN.

Looks like the SVE-256 impls have an ALUR helper for reversed bin ops, used for VANDN. Seems like we'll want to do that for avx128 too
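
For context, the reversal exists because x86 VPANDN computes ~Src1 & Src2 while the natural AArch64 mapping (a BIC-style and-not) computes Src1 & ~Src2, so the reversed helper only has to swap operand order. A hedged sketch, reusing the hypothetical helper sketched earlier; every name here is an assumption, not FEX's actual API:

```cpp
// Hypothetical reversed-operand wrapper in the spirit of the SVE-256 ALUR
// helper, layered on the earlier AVX128_VectorBinaryOpImpl sketch.
template <typename BinOp>
void AVX128_VectorBinaryOpImplR(OpcodeArgs, size_t ElementSize, BinOp Helper) {
  AVX128_VectorBinaryOpImpl(Op, ElementSize,
    [&](size_t Size, Ref Src1, Ref Src2) {
      // Swap operands: BIC(Src2, Src1) == Src2 & ~Src1 == x86's ~Src1 & Src2.
      return Helper(Size, Src2, Src1);
    });
}
```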

@Sonicadvance1 force-pushed the avx128_working branch 3 times, most recently from e9c86fe to b2bc452 on June 18, 2024 at 22:57
@Sonicadvance1 (Member Author)

> Do you want to define either a helper taking a lambda, or a macro, or maybe some template I wouldn't know how to get working, for: […] Should let you deduplicate a lot of binops.
>
> Alternatively, can we implement VPUNPCKL, VPACKUS, etc with the usual VADD path? […]

Did it. Lots of duplication removed.

@alyssarosenzweig (Collaborator)

Looks great!

@Sonicadvance1 force-pushed the avx128_working branch 2 times, most recently from 296eed3 to 94c22d5 on June 18, 2024 at 23:34
@Sonicadvance1 (Member Author)

65% tests passed, 344 tests failed out of 986
66 more handlers still to be implemented.
Slowing down a bit since I'm now hitting the more complex instruction implementations.

@Sonicadvance1 force-pushed the avx128_working branch 4 times, most recently from aa83c9d to 8be871a on June 19, 2024 at 14:57
@Sonicadvance1 (Member Author)

85% tests passed, 152 tests failed out of 986
11 more handlers to implement.

@Sonicadvance1 force-pushed the avx128_working branch 6 times, most recently from 6c10b3b to 7f74c83 on June 21, 2024 at 07:55

Wasn't exposed before since we couldn't unit test the SVE256
implementation.

Previously we could always tell the size of the operation from how this flag affects the operating size of the instruction, converting 64-bit down to 32-bit as an example.

AVX gather instructions are the first instruction class where this can't be inferred: the element load size is determined by the W flag, but the operating size of 128-bit or 256-bit is determined by other means.

Expose this flag so we can tell the difference. The FMA instructions are going to need this flag as well.
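
As a standalone illustration of that split (plain C++, not FEX code): the element size comes from VEX.W while the operating size comes from VEX.L, so neither field alone describes the full shape of a gather.

```cpp
// Standalone illustration, not FEX code: for the VGATHER* family, VEX.W picks
// the element size and VEX.L picks the operating size.
#include <cstdint>
#include <utility>

// Returns {element size in bytes, operating size in bytes}.
constexpr std::pair<uint8_t, uint8_t> GatherShape(bool VexW, bool VexL) {
  const uint8_t ElementSize = VexW ? 8 : 4;     // W=1: 64-bit elements (VGATHERDPD), W=0: 32-bit (VGATHERDPS)
  const uint8_t OperatingSize = VexL ? 32 : 16; // L=1: 256-bit ymm form, L=0: 128-bit xmm form
  return {ElementSize, OperatingSize};
}

static_assert(GatherShape(true, false).first == 8 && GatherShape(true, false).second == 16);  // VGATHERDPD xmm
static_assert(GatherShape(false, true).first == 4 && GatherShape(false, true).second == 32);  // VGATHERDPS ymm
```
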
This does a gather load three ways, SVE256, SVE128, and ASIMD.

This operation is a bit special since it can't quite handle all gather
loadstores in the 256-bit case, and requires the frontend to decompose the
operation when the striding hits a mode that SVE doesn't support!

The 128-bit case is a lot simpler since both SVE128 and ASIMD support all the
cases where the stride doesn't match. I find this to be a nice compromise
while there aren't any SVE256 products on the market.

In the 128-bit case there is an SVE path which is utilized if the passed
in stride supports what SVE understands, otherwise it falls back to an
ASIMD implementation which manually emulates everything that is
necessary.

This instruction very deliberately does exactly what AVX gather instructions
want, because it's complex enough that we don't want to try to make it a
generic solution.
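
To make that dispatch concrete, the 128-bit handling described here boils down to something like the sketch below; every helper name is a hypothetical stand-in rather than FEX's actual API.

```cpp
// Hedged sketch of the 128-bit gather dispatch described above; all helper
// names are hypothetical placeholders.
RefPair AVX128_Gather128(OpcodeArgs, size_t ElementLoadSize, size_t AddrElementSize) {
  if (SVESupportsGatherStride(ElementLoadSize, AddrElementSize)) {
    // Fast path: a single SVE gather handles this element/index-size combination.
    return AVX128_GatherSVE128(Op, ElementLoadSize, AddrElementSize);
  }
  // Fallback: manually emulate the gather with per-element ASIMD loads and inserts.
  return AVX128_GatherASIMD128(Op, ElementLoadSize, AddrElementSize);
}
```
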
This is the last family of instructions that we needed to implement for
AVX2 to be properly advertised!
Just to ensure we still have feature parity.
@Sonicadvance1 force-pushed the avx128_working branch 2 times, most recently from 2e1a7d0 to 13ee363 on June 23, 2024 at 20:40

We no longer care about separate AVX versions, so consolidate them into a
single config option which enables both.

This enables AVX, AVX2, and FMA3 for the entire CPUID!

```bash
$ FEX_HOSTFEATURES=enableavx,enableavx2 ./Bin/FEXInterpreter /usr/bin/cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Cortex-A78AE
stepping        : 0
microcode       : 0x0
cpu MHz         : 3000
cache size      : 512 KB
physical id     : 0
siblings        : 12
core id         : 0
cpu cores       : 12
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht tm syscall nx mmxext fxsr_opt rdtscp lm 3dnow 3dnowext constant_tsc art rep_good nopl xtoplogy nonstop_tsc cpuid tsc_known_freq pni pclmulqdq dtes64 monitor tm2 ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx hypervisor lahf_lm cmp_legacy extapic abm 3dnowprefetch tce fsgsbase bmi1 avx2 smep bmi2 erms invpcid adx clflushopt clwb sha_ni clzero arat vpclmulqdq rdpid fsrm
bugs            :
bogomips        : 8000.0
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment  : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
```

Notice avx, avx2, and fma in the flags list.
@alyssarosenzweig (Collaborator)

Now that everything is fanned out into PRs, closing this one.
