
arm64jit: Avoid fused multiplies in vcrsp.t #18249

Merged 1 commit into hrydgard:master on Sep 27, 2023

Conversation

unknownbrackets
Collaborator

@unknownbrackets unknownbrackets commented Sep 27, 2023

Note: this is for the old/current arm64 jit, not the new IR-based arm64 jit, which didn't have this problem.

With this change, issues in Harvest Moon with teleporting animals seem to disappear. It was causing some differences in signs of zeros in results, and slightly different result values.

I'm not sure what value was doing it, but none of the differences involved NaNs or infinities, so I assume it was one of the above two cases. For posterity and anyone interested, it was:

Harvest Moon (US)
PC=08910b34
vcrsp.t C000, C100, C110

Adding a FlushAll before and after made it easy to confirm it wasn't regcache related. Some of the different results (filtering out cases where the string format is identical, even if the bytes are different):

Different results encountered: 5.791998, 0.000000, -0.000000 (should be 5.791998, 0.000000, 0.000000)
Different results encountered: 2.152147, 0.000000, -0.000000 (should be 2.152147, 0.000000, 0.000000)
Different results encountered: 2.075424, 0.000000, -0.000000 (should be 2.075424, 0.000000, 0.000000)
Different results encountered: -0.000000, 0.000000, -0.000000 (should be -0.000000, 0.000000, 0.000000)
Different results encountered: -0.000000, -0.000000, -0.000000 (should be -0.000000, -0.000000, 0.000000)
Different results encountered: 0.000000, 0.000001, -0.000000 (should be 0.000000, 0.000001, 0.000000)

-[Unknown]

@unknownbrackets unknownbrackets added the arm64jit Occurs with JIT on 64-bit ARM devices, but not another CPU backend. label Sep 27, 2023
Owner

@hrydgard hrydgard left a comment


Interesting. Fused muls are quite scary; we probably should avoid them, since they likely don't happen on the real PSP even internally in ops like this.

@hrydgard hrydgard added this to the v1.16.5 milestone Sep 27, 2023
@hrydgard hrydgard merged commit d6a8bfd into hrydgard:master Sep 27, 2023
18 checks passed
@unknownbrackets
Collaborator Author

unknownbrackets commented Sep 27, 2023

I'm pretty convinced by the way prefixes happen on this instruction that it's actually just mapped to a series of vdots. I actually think a lot of instructions are prefix hacks for vdot.

In this case, the prefix bits are overridden such that the T swizzle is forced to yxwz, and the x (i.e. second position only) is negated, and this only applies to the third result. No prefixes apply to the first or second result. I think vcrsp.t actually just generates 3 vdots, in reverse order. I even think that it would have overlap hazards on hardware for this reason, but I haven't checked that part.

So in other words, I think it does that exponent-line-up thing that vdots do, which might lose precision compared to plain multiplies, rather than gaining precision as fused multiplies would.

-[Unknown]

@unknownbrackets unknownbrackets deleted the arm64jit-vcrsp branch September 27, 2023 06:53
@hrydgard
Owner

Yeah, agreed that they probably utilize that exponent-aligning vdot hardware whenever they can in ops like this, since it's going to be fewer cycles than a sequence of individual ops.

I wonder how fast it's possible to get a carefully asm-optimized implementation of those... Though IIRC we still have some slight inaccuracies even in our best emulation of them?

@unknownbrackets
Collaborator Author

Maybe I need to check again. It was either accurate to all tests or off by the lowest bit or two in a percentage of them...

-[Unknown]

@hrydgard
Owner

Maybe we can get fp64 to have a look in the latter case, hehe (which is what I seem to remember...)

@fp64
Contributor

fp64 commented Sep 30, 2023

Sure, I can have a look, assuming there's data to look at.

On the subject of FMA, I'm not sure how important (if at all) floating-point determinism is for most of the C++ code, but -ffp-contract=off is not currently set (it can default to fast even without -ffast-math, though it probably only affects things if FMA itself is enabled, e.g. with -march=native).

@fp64
Contributor

fp64 commented Oct 1, 2023

I wonder how fast it's possible to get a carefully asm-optimized implementation of those...

FWIW, I did an SSE2 reimplementation of vfpu_dot, and the speedup is visible, but not that impressive (this is about the best case; the win is smaller on some CPUs):

CPU: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
Testing correctness... ok.
Testing speed.
vfpu_dot:            58.54 ns/call
vfpu_dot_sse2<0>:    43.70 ns/call
vfpu_dot_sse2<1>:    35.29 ns/call

Note: vfpu_dot_sse2<1> assumes round-to-nearest-ties-to-even; vfpu_dot_sse2<0> does its own rounding (using the same code as vfpu_dot).

Edit: you can probably get more with AVX2, but maybe not much.

Can do a PR if you want this.

@unknownbrackets
Collaborator Author

Sure, I can have a look, assuming there's data to look at.

So I went back and checked and my memory was indeed wrong: it's still off on the rounding sometimes, at least per my earlier notes.

A challenge here is the range of inputs. A vdot takes 8 total floats as inputs to produce a single output. Part of my validation strategy (since it was all int math anyway) was actually that I simply compiled the emulated vdot as MIPS PSP code, and validated it directly there, without generating huge data files. But that doesn't help anyone without access to an Allegrex processor.

I suppose for generating data, it makes the most sense to cover all the exponent combinations with a few mantissa values, although some good mantissa cases are needed to properly exercise rounding. Could also exclude exponents not within 32 of the base, apart from a few samples to prove that rule, and maybe some random values for coverage. So basically 64K for the first lane, then times 1024^3 for the others? That's still a ton...

Maybe the better thing is to just export out a chunk of values the current one gets wrong, but that's probably harder to test against...
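The "all exponents, a few mantissas" idea can be sketched like this (purely illustrative, not an existing PPSSPP helper):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Build one lane's worth of test floats: every 8-bit exponent (including 0
// for zeros/denormals and 255 for inf/NaN) crossed with a few mantissas
// picked to exercise rounding: zero, all-ones, a lone low bit, and the
// halfway pattern.
static std::vector<float> BuildLaneSamples() {
    const uint32_t mantissas[] = { 0x000000, 0x7FFFFF, 0x000001, 0x400000 };
    std::vector<float> samples;
    for (uint32_t exp = 0; exp < 256; ++exp) {
        for (uint32_t m : mantissas) {
            uint32_t bits = (exp << 23) | m;
            float f;
            std::memcpy(&f, &bits, sizeof(f));
            samples.push_back(f);
        }
    }
    return samples;  // 1024 values per lane, in line with the 1024^3 estimate
}
```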

On the subject of FMA, I'm not sure how important (if at all) floating-point determinism is for most of the C++ code, but -ffp-contract=off is not currently set (it can default to fast even without -ffast-math, though it probably only affects things if FMA itself is enabled, e.g. with -march=native).

Most of the code, probably not that critical - a lot of it ends up going to the GPU where we don't have determinism anyway, so it's a losing battle. It does matter for CPU emulation, and would be interesting eventually to have determinism in the software renderer.

FWIW, I did an SSE2 reimplementation of vfpu_dot, and the speedup is visible, but not that impressive (this is about the best case; the win is smaller on some CPUs):

Well, this seems good, though I'm not sure if we can rely on the rounding mode or not. But it's actually way more interesting to make NEON versions than SSE here, because that's really where using the correct dots probably has the biggest observed performance impact. I wonder if adding some vector_size attributes might be enough to coax clang into vectorizing for ARM (though probably not entirely optimally) for us...

Anyway, if it matches all the values, probably worth a PR. If we can prove on MSVC that it's safe to assume nearest then that's almost double the speed, which is good.

-[Unknown]

@hrydgard
Owner

hrydgard commented Oct 2, 2023

As for the 8 inputs, I think that can be dimension-reduced a lot, assuming things are symmetric, which they should be - depending on which part we are less confident in. Below I name the two input vectors a and b, and exclude infs and nans from consideration.

If it's the actual multiplies, it should be enough to only vary, say, a.x and b.x, and hold the others at 0. If it's the rounding when aligning the mantissa and shifting bits out before summing, we should be able to create a focused set of inputs for that too... say, hold a.x and b.x at values producing a messy product, then set a.y to 1 and vary b.y through a large range. Although it may be hard to isolate that from the rounding after the sum?

Just some thoughts..

@fp64
Contributor

fp64 commented Oct 2, 2023

A challenge here is the range of inputs.

Combining the above suggestions with some of my own ideas:

  • completely random bitpatterns for a, b
  • random bitpatterns in a small exponent range (both all-[1.0f;8.0f] and all-[2^-65;2^-63] to test denormals)
  • a=[1,1,1,1], b=random small range (including all-denormals)
  • a=[x,0,0,0], b=[y,0,0,0], random small-exp-range x,y
  • a=[x,z,0,0], b=[y,w,0,0], random small-exp-range x,y,z,w (also the case where x*y and z*w overflow, but the result doesn't)
  • special values (+0,-0,inf,nan,2^-126,2^-139,1,1+(2^-23), etc.), including [1,1,1,1]*[-0,-0,-0,-0]
  • a=[x,1,0,0], b=[y,z,0,0] (FMA-like)
  • the values current implementation gets wrong
  • permutations (same for a and b) to test if result is order-independent (e.g. dpps is specced to (A+B)+(C+D))

not sure if we can rely on the rounding mode or not.

I wonder if it makes sense to check if(fegetround()==FE_TONEAREST), or to just check the result of adding 2 volatile floats (well, 2 pairs).

But it's actually way more interesting to make neon versions that SSE here

I neither know much NEON, nor have somewhere to test it (other than trying NEON_2_SSE.h or SIMDe locally). I may try pure GCC vector extensions version (which are annoyingly slightly different on GCC vs. clang).

then that's almost double the speed, which is good.

It does vary a lot depending on system (e.g. speed-up is 4% for vfpu_dot_sse2<0>, and 12% for vfpu_dot_sse2<1> on my machine).

@fp64
Contributor

fp64 commented Oct 2, 2023

That version actually has botched testing for NaNs, but it looks easy to fix.

@fp64
Contributor

fp64 commented Oct 2, 2023

Hmm, clang seems OK at the vector-extension version of multiply-high; GCC... not so much (on both x86 and ARM). Both autovectorize the C++ version poorly. Both look OK for the ARM variable per-lane shift.
https://godbolt.org/z/PvhsP9YhT

fp64 added a commit to fp64/ppsspp that referenced this pull request Oct 2, 2023
See hrydgard#18249. Speedup for this function ranges 10%..100%,
depending on system. Updated verification and speed measurements:
https://godbolt.org/z/W1z3sj6hz