WIP: Working PR for AVX128 #3720

Closed · wants to merge 87 commits

Conversation

Sonicadvance1 (Member)

This is the working branch for AVX128 and I'll be peeling off commits as other PRs get merged.

This is to allow visibility into where the implementation currently stands, to inform other PR reviews.

@Sonicadvance1 (Member Author)

46% tests passed, 530 tests failed out of 986

@Sonicadvance1 (Member Author)

61% tests passed, 386 tests failed out of 986

@alyssarosenzweig (Collaborator) commented Jun 18, 2024

Do you want to define either a helper taking a lambda, or a macro, or maybe some template I wouldn't know how to get working, for:

void AVXBinop(OpcodeArgs, HELPER) {
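  // HELPER and ElementSize are placeholders: HELPER stands in for the
  // per-instruction IR builder call and ElementSize for its element width.
  // 128-bit VEX encodings zero the upper 128 bits of the destination, hence
  // the named-zero constant for Result.High below.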
  const auto SrcSize = GetSrcSize(Op);
  const auto Is128Bit = SrcSize == Core::CPUState::XMM_SSE_REG_SIZE;

  auto Src1 = AVX128_LoadSource_WithOpSize(Op, Op->Src[0], Op->Flags, !Is128Bit);
  auto Src2 = AVX128_LoadSource_WithOpSize(Op, Op->Src[1], Op->Flags, !Is128Bit);

  RefPair Result {};

  Result.Low = HELPER(ElementSize, Src1.Low, Src2.Low);
  if (Is128Bit) {
    Result.High = LoadAndCacheNamedVectorConstant(OpSize::i128Bit, FEXCore::IR::NamedVectorConstant::NAMED_VECTOR_ZERO);
  } else {
    Result.High = HELPER(ElementSize, Src1.High, Src2.High);
  }

  AVX128_StoreResult_WithOpSize(Op, Op->Dest, Result);
}

Should let you deduplicate a lot of binops.


Alternatively, can we implement VPUNPCKL, VPACKUS, etc with the usual VADD path? That would probably get most of the win with the least effort. But there's still a lot of boilerplate for implementations like VADDSUBP and VANDN.
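
For illustration, a lambda- or template-taking helper along those lines might look like the sketch below. This is only a hedged sketch reusing the names from the pseudocode above; the template shape and the `_VAdd` usage are assumptions, not the implementation that actually landed.

```cpp
// Hedged sketch only: a template helper in the spirit of the pseudocode above.
// HELPER becomes a callable and ElementSize a parameter; all names are taken
// from the pseudocode or are assumptions, not FEX's actual implementation.
template <typename BinOp>
void AVX128_VectorBinaryOpImpl(OpcodeArgs, size_t ElementSize, BinOp Helper) {
  const auto SrcSize = GetSrcSize(Op);
  const auto Is128Bit = SrcSize == Core::CPUState::XMM_SSE_REG_SIZE;

  auto Src1 = AVX128_LoadSource_WithOpSize(Op, Op->Src[0], Op->Flags, !Is128Bit);
  auto Src2 = AVX128_LoadSource_WithOpSize(Op, Op->Src[1], Op->Flags, !Is128Bit);

  RefPair Result {};
  Result.Low = Helper(ElementSize, Src1.Low, Src2.Low);

  if (Is128Bit) {
    // VEX.128 encodings zero the upper half of the destination register.
    Result.High = LoadAndCacheNamedVectorConstant(OpSize::i128Bit, FEXCore::IR::NamedVectorConstant::NAMED_VECTOR_ZERO);
  } else {
    Result.High = Helper(ElementSize, Src1.High, Src2.High);
  }

  AVX128_StoreResult_WithOpSize(Op, Op->Dest, Result);
}

// Per-instruction handlers then reduce to one-liners, e.g. (IR op name assumed):
// AVX128_VectorBinaryOpImpl(Op, OpSize::i8Bit,
//   [&](size_t ElementSize, Ref Src1, Ref Src2) { return _VAdd(OpSize::i128Bit, ElementSize, Src1, Src2); });
```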

@alyssarosenzweig (Collaborator)

> Do you want to define either a helper taking a lambda, or a macro, or maybe some template I wouldn't know how to get working, for: […] Should let you deduplicate a lot of binops.
>
> Alternatively, can we implement VPUNPCKL, VPACKUS, etc with the usual VADD path? That would probably get most of the win with the least effort. But there's still a lot of boilerplate for implementations like VADDSUBP and VANDN.

Looks like the SVE-256 impls have an ALUR helper for reversed bin ops, used for VANDN. Seems like we'll want to do that for avx128 too
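
For context, the reversal exists because x86 VPANDN computes ~Src1 & Src2 while the natural AArch64 mapping (a BIC-style and-not) computes Src1 & ~Src2, so the reversed helper only has to swap operand order. A hedged sketch, reusing the hypothetical helper sketched earlier; every name here is an assumption, not FEX's actual API:

```cpp
// Hypothetical reversed-operand wrapper in the spirit of the SVE-256 ALUR
// helper, layered on the earlier AVX128_VectorBinaryOpImpl sketch.
template <typename BinOp>
void AVX128_VectorBinaryOpImplR(OpcodeArgs, size_t ElementSize, BinOp Helper) {
  AVX128_VectorBinaryOpImpl(Op, ElementSize,
    [&](size_t Size, Ref Src1, Ref Src2) {
      // Swap operands: BIC(Src2, Src1) == Src2 & ~Src1 == x86's ~Src1 & Src2.
      return Helper(Size, Src2, Src1);
    });
}
```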

@Sonicadvance1 force-pushed the avx128_working branch 3 times, most recently from e9c86fe to b2bc452 on June 18, 2024 at 22:57
@Sonicadvance1 (Member Author)

> Do you want to define either a helper taking a lambda, or a macro, or maybe some template I wouldn't know how to get working, for: […] Should let you deduplicate a lot of binops.
>
> Alternatively, can we implement VPUNPCKL, VPACKUS, etc with the usual VADD path? […]

Did it. Lots of duplication removed.

@alyssarosenzweig (Collaborator)

Looks great!

@Sonicadvance1 force-pushed the avx128_working branch 2 times, most recently from 296eed3 to 94c22d5 on June 18, 2024 at 23:34
@Sonicadvance1 (Member Author)

65% tests passed, 344 tests failed out of 986
66 more handlers still to be implemented.
Slowing down a bit since I'm now hitting the more complex instruction implementations.

@Sonicadvance1 force-pushed the avx128_working branch 4 times, most recently from aa83c9d to 8be871a on June 19, 2024 at 14:57
@Sonicadvance1 (Member Author)

85% tests passed, 152 tests failed out of 986
11 more handlers to implement.

@Sonicadvance1 force-pushed the avx128_working branch 6 times, most recently from 6c10b3b to 7f74c83 on June 21, 2024 at 07:55

Wasn't exposed before since we couldn't unit test the SVE256
implementation.

Previously we could always tell the size of the operation from how this flag affects the operating size of the instruction, converting 64-bit down to 32-bit as an example.

AVX gather instructions are the first instruction class where this can't be inferred: the element load size is determined by the W flag, but the operating size of 128-bit or 256-bit is determined by other means.

Expose this flag so we can tell the difference. The FMA instructions are going to need this flag as well.
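
As a standalone illustration of that split (plain C++, not FEX code): the element size comes from VEX.W while the operating size comes from VEX.L, so neither field alone describes the full shape of a gather.

```cpp
// Standalone illustration, not FEX code: for the VGATHER* family, VEX.W picks
// the element size and VEX.L picks the operating size.
#include <cstdint>
#include <utility>

// Returns {element size in bytes, operating size in bytes}.
constexpr std::pair<uint8_t, uint8_t> GatherShape(bool VexW, bool VexL) {
  const uint8_t ElementSize = VexW ? 8 : 4;     // W=1: 64-bit elements (VGATHERDPD), W=0: 32-bit (VGATHERDPS)
  const uint8_t OperatingSize = VexL ? 32 : 16; // L=1: 256-bit ymm form, L=0: 128-bit xmm form
  return {ElementSize, OperatingSize};
}

static_assert(GatherShape(true, false).first == 8 && GatherShape(true, false).second == 16);  // VGATHERDPD xmm
static_assert(GatherShape(false, true).first == 4 && GatherShape(false, true).second == 32);  // VGATHERDPS ymm
```
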
This does a gather load three ways, SVE256, SVE128, and ASIMD.

This operation is a bit special since it can't quite handle all gather
loadstores in the 256-bit case, and requires the frontend to decompose the
operation when the striding hits a mode that SVE doesn't support!

The 128-bit case is a lot simpler since both SVE128 and ASIMD support all the
cases where the stride doesn't match. I find this to be a nice compromise
while there aren't any SVE256 products on the market.

In the 128-bit case there is an SVE path which is utilized if the passed
in stride supports what SVE understands, otherwise it falls back to an
ASIMD implementation which manually emulates everything that is
necessary.

This instruction very deliberately does exactly what AVX gather instructions
want, because it's complex enough that we don't want to try to make it a
generic solution.
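
To make that dispatch concrete, the 128-bit handling described here boils down to something like the sketch below; every helper name is a hypothetical stand-in rather than FEX's actual API.

```cpp
// Hedged sketch of the 128-bit gather dispatch described above; all helper
// names are hypothetical placeholders.
RefPair AVX128_Gather128(OpcodeArgs, size_t ElementLoadSize, size_t AddrElementSize) {
  if (SVESupportsGatherStride(ElementLoadSize, AddrElementSize)) {
    // Fast path: a single SVE gather handles this element/index-size combination.
    return AVX128_GatherSVE128(Op, ElementLoadSize, AddrElementSize);
  }
  // Fallback: manually emulate the gather with per-element ASIMD loads and inserts.
  return AVX128_GatherASIMD128(Op, ElementLoadSize, AddrElementSize);
}
```
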
This is the last family of instructions that we needed to implement for
AVX2 to be properly advertised!
Just to ensure we still have feature parity.
@Sonicadvance1 force-pushed the avx128_working branch 2 times, most recently from 2e1a7d0 to 13ee363 on June 23, 2024 at 20:40

We no longer care about separate AVX versions, so consolidate them into a
single config option which enables both.

This enables AVX, AVX2, and FMA3 for the entire CPUID!

```bash
$ FEX_HOSTFEATURES=enableavx,enableavx2 ./Bin/FEXInterpreter /usr/bin/cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Cortex-A78AE
stepping        : 0
microcode       : 0x0
cpu MHz         : 3000
cache size      : 512 KB
physical id     : 0
siblings        : 12
core id         : 0
cpu cores       : 12
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht tm syscall nx mmxext fxsr_opt rdtscp lm 3dnow 3dnowext constant_tsc art rep_good nopl xtoplogy nonstop_tsc cpuid tsc_known_freq pni pclmulqdq dtes64 monitor tm2 ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx hypervisor lahf_lm cmp_legacy extapic abm 3dnowprefetch tce fsgsbase bmi1 avx2 smep bmi2 erms invpcid adx clflushopt clwb sha_ni clzero arat vpclmulqdq rdpid fsrm
bugs            :
bogomips        : 8000.0
TLB size        : 2560 4K pages
clflush size    : 64
cache_alignment  : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
```

Notice avx, avx2, and fma in the flags list.
@alyssarosenzweig (Collaborator)

Now that everything is fanned out into PRs, closing this one.
