-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize and tidy up affine transform code. #3663
Conversation
Looks like SSE2 (old code path) wasn't hit on fishtest and failed to compile because I forgot that |
clang 12.0.0:
gcc 10.3.0:
|
on Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
|
Another result
|
Some more results on an AWS cascade lake box:
|
|
And same box, using 45 threads (so a bit more noise):
|
@JavaMast out of curiosity, could you also test -bmi2 on the rocket lake CPU you have there? |
Speed for old network architecture: #3457 (comment) |
My BMI2 results on Skylake-X:
Would it make sense to put simd.h in the src\nnue\layers directory? |
I think the recent number with avx512 and vnni are not really much better than avx2, I wonder if we should consider dropping those implementations, and maybe revisit once the difference is more significant? |
A problem with currently available CPUs that are capable of AVX512 is the high dependency on luck (aka silicon lottery). Some chips just get a bit hotter and poof, the CPU throttles down so much that it may even lose speed compared to AVX2. Earlier models go as far as to always throttle once AVX512 instructions are used without a setting to change how aggressive it is in doing so. Then of course several other factors are at play such as thermal paste, cooler, and airflow of the PC case/server rack. |
Inline assembly constraints are incorrect in almost every place: they're writing to input operands, which can potentially confuse the compiler because it assumes input operands don't change. They probably should be changed to input/output operands with non-overlapping constraint, that is, |
Hmm, you're right. Someone mentioned it in the bug report too. I changed it now and inspected the assembly with XED, the results are the same aside from register assignment differences. I don't think a retest will be necessary for this? |
so, I will merge this as I think separating out some of the simd functionality is a good step. It is somewhat unfortunate that we need to use inline asm to work around the gcc issues, that should be considered an intermediate fix, and not the direction of things. The question on removing avx512/vnni support is still open. If it isn't useful (or if intel drops reduces support in hardware), we probably should drop it. |
There are some processors that do benefit from AVX512, and the situation with the next gen ryzens is uncertain. |
The new network caused some issues initially due to the very narrow neuron set between the first two FC layers. Necessary changes were hacked together to make it work. This patch is a mature approach to make the affine transform code faster, more readable, and easier to maintain should the layer sizes change again. The following changes were made:
Overall it should be a minor speedup (or at least neutral) with AVX2/BMI2 targets. The biggest gains are expected with SSSE3, AVX512, VNNI256, VNNI512 targets. The VNNI targets have particularily been impacted, due to a bug in GCC linked above.
Example hot path comparison with x86-64-vnni256 target (GCC 10.1 in MSYS2):
master:
patch:
Comparison for x86-64-modern:
(here the difference is harder to spot because in the patch 2x more work is done in each loop iteration. Notably it doesn't use the shuffles, which are slow, with everything else being about the same.)
master:
patch:
Bench of x86-64-modern target on my i7-920:
VNNI and AVX512 benchmarks are welcome.
No functional changes.