RyuJIT: Vector.Dot not being inlined / converted to SIMD instructions? #6797
Comments
I'm not sure what
It's not a substitute. Horizontal add is needed, but I was timing it just to see if I could still offset the cost for big sizes, even at the expense of the multiplication and shuffles. I was definitely not expecting it to behave this badly in all cases.
@dotnet/jit-contrib
@redknightlois - We definitely can improve Vector.Dot() for the AVX case using phaddd. In the above screenshot there is another label that reads "Also this generates worse code".
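For reference on why `phaddd` helps here: the instruction sums adjacent pairs of 32-bit lanes across its two operands, so applying it twice reduces four lanes to a full horizontal sum. The following scalar model of the lane arithmetic is mine, not from the thread:

```csharp
using System;

static class PhadddModel
{
    // Scalar model of phaddd on four 32-bit lanes:
    // result = { a0+a1, a2+a3, b0+b1, b2+b3 }.
    public static int[] Phaddd(int[] a, int[] b)
    {
        return new[] { a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
    }

    public static void Main()
    {
        int[] v = { 1, 2, 3, 4 };
        // Two phaddd steps reduce four lanes to the horizontal sum,
        // replicated across the result.
        int[] once  = Phaddd(v, v);       // { 3, 7, 3, 7 }
        int[] twice = Phaddd(once, once); // { 10, 10, 10, 10 }
        Console.WriteLine(twice[0]);      // 10 == 1+2+3+4
    }
}
```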
Are you referring to
@sivarv Using The actual code is here: https://github.com/Corvalius/ravendb/blob/v4.0/bench/Micro.Benchmark/PageLocatorImpl/PageLocatorV2.cs But looking forward to being proven wrong :D
@redknightlois - the code pointer that you have given is using Vector instead of Vector. AFAIK, AVX supports instructions to perform horizontal add of 16-bit and 32-bit vector element types. I am assuming that you are open to using I have created the below PR to recognize Please elaborate on what you meant by "a bit worse code" when using Vector.One.
As you can see, using the static fields 'one' and 'zero' will read 32 bytes from memory in each iteration of the loop, which would be less efficient than using Vector.One directly. My guess is that using the _indexes static field in Vector.Dot() results in a call to the helper CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE and a null check inside the loop. Since the call would trash the upper 128 bits of YMM registers that are live, LSRA would add vextractf/vinsert instructions to save/restore the upper 128 bits of those YMM registers around the call. You can work around this by assigning the _indexes static field to a local outside the loop and using that local in the loop instead.
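That workaround might look roughly like the following sketch. The class, field, and method names are illustrative only (loosely inspired by, not taken from, the linked PageLocator code):

```csharp
using System.Numerics;

public static class PageLocatorSketch
{
    static readonly Vector<int> _indexes = new Vector<int>(1);

    // Reading the static inside the loop can emit a shared-statics helper
    // call plus a null check per iteration; the call also forces the JIT to
    // save/restore the upper halves of live YMM registers around it.
    public static int SumInLoop_Slow(int iterations)
    {
        int acc = 0;
        for (int i = 0; i < iterations; i++)
            acc += Vector.Dot(_indexes, _indexes); // static read each iteration
        return acc;
    }

    // Workaround: copy the static to a local once, outside the loop.
    public static int SumInLoop_Hoisted(int iterations)
    {
        Vector<int> indexes = _indexes; // single static read
        int acc = 0;
        for (int i = 0; i < iterations; i++)
            acc += Vector.Dot(indexes, indexes);
        return acc;
    }
}
```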
@sivarv Thanks for looking into it. Yes, whatever is faster is fine by me :) ... Just let me know which version of the JIT I have to select when a build is up on NuGet and I will take it for a spin using Looking at the code again, I may have messed up when investigating the issue and mixed the binaries for
These are the results using the indexes as a local instead. Note though that V2 is what V4 used to be, and V7 is our current version (not V1, which is the original used in this post).
@redknightlois - the PR meant to recognize Vector.Dot as a JIT intrinsic is merged. Regarding Vector.One and Vector.Zero: to further optimize the loop, you can declare two locals outside the while-loop and use 'one' and 'zero' within the loop. This avoids the overhead of constructing these constant vectors in each iteration. Hoisting constant SIMD vectors like this out of loops is something we are tracking as an issue: https://github.com/dotnet/coreclr/issues/7422. Until that is fixed, you will have to hand-optimize it.
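A sketch of that hand-hoisting pattern follows. The `CountMatches` method and its logic are hypothetical, just to give the hoisted locals a loop to live in:

```csharp
using System.Numerics;

public static class ConstantVectorHoisting
{
    // Until the JIT hoists constant SIMD vectors out of loops
    // (dotnet/coreclr#7422), materialize them once by hand.
    public static int CountMatches(int[] data, int key)
    {
        var one  = Vector<int>.One;   // constructed once, outside the loop
        var zero = Vector<int>.Zero;
        var keys = new Vector<int>(key);
        int count = 0;
        for (int i = 0; i <= data.Length - Vector<int>.Count; i += Vector<int>.Count)
        {
            var v = new Vector<int>(data, i);
            // Select 1 for matching lanes, 0 otherwise, then sum the lanes.
            var mask = Vector.ConditionalSelect(Vector.Equals(v, keys), one, zero);
            count += Vector.Dot(mask, one); // horizontal sum of 0/1 lanes
        }
        return count;
    }
}
```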
@redknightlois - did you get a chance to try out the new JIT and measure perf?
@sivarv not yet. I need dotnet/BenchmarkDotNet#292 to be resolved to get the proper benchmark data out. For
@redknightlois - did you get a chance to measure perf of your benchmark after modifying it to use
@sivarv The benchmarking tool is not working yet with the new builds. As soon as I get it working, I will post results.
We are trying SIMD alternatives to our very fast locality cache (as it is starting to show up in our profiling runs) and got very bad results.
Given that I couldn't find any way to get the codegen to emit `_mm_hadd_epi32`, I thought I could use `Vector.Dot`, which essentially achieves the same at a premium even if `dpps` is not available (SSE 4.1). While we weren't really expecting to beat the current code, there was a chance though. Surprisingly, our benchmark results were plain awful.
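For context on why `Vector.Dot` can stand in for a horizontal add: dotting a vector against `Vector<int>.One` multiplies every lane by 1 and then sums all the lanes, which is exactly a horizontal sum (at the premium of the extra multiply). A small sketch, with names of my own choosing:

```csharp
using System;
using System.Numerics;

public static class DotAsHorizontalSum
{
    // Vector.Dot(v, Vector<int>.One) == v[0] + v[1] + ... + v[Count-1]:
    // a dot product against the all-ones vector is a horizontal add.
    public static int HorizontalSum(Vector<int> v)
    {
        return Vector.Dot(v, Vector<int>.One);
    }

    public static void Main()
    {
        var data = new int[Vector<int>.Count];
        for (int i = 0; i < data.Length; i++) data[i] = i + 1; // 1, 2, 3, ...
        int n = data.Length;
        int sum = HorizontalSum(new Vector<int>(data));
        Console.WriteLine(sum == n * (n + 1) / 2); // True for any vector width
    }
}
```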
This is a dry run, but a proper one doesn't change the results by much.
Then, trying to understand why the results looked like this (they shouldn't be that bad), I found this:
Eventually I got around to firing up the profiler to double-check that.
And just for the purpose of completeness
Tomorrow I will probably play a bit with a native profiler to get a better idea of where the microarchitecture costs are, but this call doesn't look good. Any ideas?
PS: And not to beat a dead horse here, but this is yet another case of missing instructions.