**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
Related to extending the tests for different feature flags (#1822), I wanted to take another look at the `avx512` feature and its performance. Benchmarks were run on an i9-11900KB @ 3 GHz (turbo disabled) with
(The second flag might require some explanation: it disables the `prefer-256-bit` feature, which makes LLVM use the full 512-bit vectors.)
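The exact flag set isn't captured above; disabling `prefer-256-bit` is typically done via `RUSTFLAGS` or a `.cargo/config.toml` entry along these lines (the concrete flags here are an assumption for illustration, not necessarily the command used for these benchmarks):

```toml
# Hypothetical .cargo/config.toml fragment: enable all native CPU features
# and turn off LLVM's prefer-256-bit heuristic so it emits 512-bit vectors.
[build]
rustflags = ["-C", "target-cpu=native", "-C", "target-feature=-prefer-256-bit"]
```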
For some reason the second benchmark is always significantly slower when both are run together; running them separately gives the same (higher) performance, and the assembly looks identical except for the and/or. I'm guessing it's branch-predictor or allocator related.
The generated assembly for `simd` and `avx512` looks identical; the loop processes 512 bits (64 bytes) per iteration. The auto-vectorized version instead gets unrolled 4 times, which reduces the loop overhead, so each iteration processes 4x512 bits.
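The kind of loop being compared can be sketched as a plain element-wise bitwise AND over two byte buffers; with AVX-512 enabled, LLVM turns the body into 512-bit vector operations and unrolls it, no manual SIMD required. (The function name and shape here are illustrative, not arrow-rs's actual buffer code.)

```rust
// Element-wise AND of two equal-length byte buffers. The zipped iterator
// lowers to a tight loop with no bounds checks, which LLVM auto-vectorizes.
fn buffer_bin_and(left: &[u8], right: &[u8]) -> Vec<u8> {
    assert_eq!(left.len(), right.len());
    left.iter().zip(right.iter()).map(|(l, r)| l & r).collect()
}

fn main() {
    let a = vec![0b1100_1100u8; 64];
    let b = vec![0b1010_1010u8; 64];
    let c = buffer_bin_and(&a, &b);
    // Only bits set in both inputs survive the AND.
    assert!(c.iter().all(|&x| x == 0b1000_1000));
}
```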
**Describe the solution you'd like**
With these benchmark results it seems that we can remove the `avx512` feature and simplify the buffer code.
The compiler's auto-vectorization has probably improved since we added the `avx512` feature, or the creation of buffers using `from_trusted_len_iter` led to some improvements.
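The benefit of a trusted-length iterator can be sketched like this: knowing the exact element count up front lets the buffer be allocated once at its final size and filled without reallocation or per-push capacity checks. (The function below is an illustrative stand-in, not arrow-rs's actual `from_trusted_len_iter` implementation.)

```rust
// Sketch: collect an iterator whose size_hint is trusted to be exact.
// A single pre-sized allocation plus a tight fill loop is what makes the
// resulting code easy for LLVM to vectorize.
fn collect_trusted_len<I: Iterator<Item = u8>>(iter: I) -> Vec<u8> {
    let (lower, upper) = iter.size_hint();
    // Trusted-len contract: the hint is exact.
    debug_assert_eq!(upper, Some(lower));
    let mut buf = Vec::with_capacity(lower);
    buf.extend(iter); // no reallocation along the way
    buf
}

fn main() {
    let buf = collect_trusted_len((0..100u8).map(|i| i * 2));
    assert_eq!(buf.len(), 100);
    assert_eq!(buf[99], 198);
}
```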
**Describe alternatives you've considered**
An `avx512` feature for other kernels would still be very useful. AVX-512, for example, has instructions that essentially implement the filter kernel for primitives in a single instruction, and it is unlikely that these will be supported in a portable way soon (rust-lang/portable-simd#240).
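For context, the scalar form of that filter kernel looks like the sketch below: copy the values whose mask entry is set, packed contiguously. The AVX-512 compress instructions (e.g. `vpcompressd` for 32-bit lanes) perform this for a whole vector at once. (The function here is an illustrative scalar sketch, not arrow-rs's filter kernel.)

```rust
// Scalar filter: keep values[i] where mask[i] is true, output densely
// packed. This per-element select-and-pack is exactly what an AVX-512
// compress instruction does for an entire vector register in one step.
fn filter_primitive(values: &[u32], mask: &[bool]) -> Vec<u32> {
    values
        .iter()
        .zip(mask.iter())
        .filter_map(|(&v, &keep)| keep.then_some(v))
        .collect()
}

fn main() {
    let values = [10, 20, 30, 40];
    let mask = [true, false, true, false];
    assert_eq!(filter_primitive(&values, &mask), vec![10, 30]);
}
```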
> For some reason the second benchmark is always significantly slower when run together, running them separately gives the same (higher) performance and the assembly looks identical except for the and/or. I'm guessing branch predictor or allocator related.
You might want to sample the CPU frequencies using `lscpu -e` or something similar while running these benchmarks. Since AVX-512 SIMD instructions consume much more power than regular 64-bit instructions (the registers are eight times wider), they produce more heat and the CPU cores can reduce their base frequencies.