Suggestion for a low-effort way to take advantage of SIMD and other architecture specific tricks LLVM knows about #42432
Comments
Note "0.0 as f32" is usually written 0f32 or 0.0_f32 or something like that. |
@leonardo-m pushed that fix now. Should probably use that in a bunch of other places in my code as well.
Searched around, and this is how the GNU toolchain does runtime dispatching of different implementations of functions in a way that avoids the dispatch cost on every call:
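For reference, the GNU mechanism being described is presumably the IFUNC-style load-time resolution. A rough Rust analogue that can be written by hand today is sketched below (assuming a recent toolchain with `std::sync::OnceLock` and `is_x86_feature_detected!`; the function names are made up): the chosen function pointer is cached, so the feature check happens only once rather than on every call.

```rust
use std::sync::OnceLock;

// Baseline version, compiled for the lowest common denominator.
fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

// Same source, but LLVM is allowed to use AVX2 instructions in this function.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

#[cfg(target_arch = "x86_64")]
fn sum_avx2_wrapper(xs: &[f32]) -> f32 {
    // Only ever selected after the runtime AVX2 check in select_sum().
    unsafe { sum_avx2(xs) }
}

// Pick the best implementation for the CPU we are actually running on.
fn select_sum() -> fn(&[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return sum_avx2_wrapper;
        }
    }
    sum_scalar
}

// Resolve once and cache the function pointer, so later calls pay only an
// indirect call instead of a feature check each time.
static SUM_IMPL: OnceLock<fn(&[f32]) -> f32> = OnceLock::new();

fn sum(xs: &[f32]) -> f32 {
    let f = SUM_IMPL.get_or_init(select_sum);
    f(xs)
}

fn main() {
    let data = vec![1.0_f32; 1024];
    println!("{}", sum(&data));
}
```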
Hi @pedrocr, I have been working on something almost exactly as you describe using a procedural macro. It's not finished, but I can publish it when I get home today if you would like to have a look / give feedback.
@parched sounds very interesting indeed :)
This is an RFC-level change to the language; I'd encourage you to move this to a thread on internals. Thanks!
Issue #27731 already tracks the fine work being done to expose SIMD in ways that are explicit to the programmer. If you're able to code in those specific ways, big gains can be obtained. However, there is something simple that can be done for performance-sensitive code that sometimes greatly improves its speed: just tell LLVM to take advantage of those instructions. The speedup from that is free in developer time and can be quite large. I extracted a simple benchmark from one of the computationally expensive functions in rawloader, matrix multiplying camera RGB values to get XYZ:
https://github.com/pedrocr/rustc-math-bench
I programmed the same multiplication over a 100MP image in both C and Rust. Here are the results; all values are in ms/megapixel, run on an i5-6200U. The `runbench` script in the repository will compile and run the tests for you with no other interaction. So Rust nightly is faster than clang (but that's probably LLVM 3.8 vs 4.0), and the reduction in runtime is quite worthwhile. The problem with doing this, of course, is that now the binary is not portable to architectures lower than mine, and it's not optimized for architectures above it either.
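The "tell LLVM to take advantage of those instructions" part presumably amounts to building with something like rustc's `-C target-cpu=native`. For context, the kernel being benchmarked is roughly this shape (a simplified sketch, not the actual code from rustc-math-bench; the names and matrix values are just illustrative):

```rust
// Simplified sketch of an RGB -> XYZ conversion kernel; the real code in
// rustc-math-bench differs in its details.
fn rgb_to_xyz(cam: &[f32], out: &mut [f32], m: &[[f32; 3]; 3]) {
    // Each pixel is three interleaved floats; apply the 3x3 matrix to each.
    for (pix, res) in cam.chunks_exact(3).zip(out.chunks_exact_mut(3)) {
        for i in 0..3 {
            res[i] = m[i][0] * pix[0] + m[i][1] * pix[1] + m[i][2] * pix[2];
        }
    }
}

fn main() {
    let pixels = 1_000_000; // 1MP here; the benchmark uses a 100MP image
    let cam = vec![0.5_f32; 3 * pixels];
    let mut xyz = vec![0.0_f32; 3 * pixels];
    // Illustrative matrix values only.
    let m = [
        [0.41, 0.36, 0.18],
        [0.21, 0.72, 0.07],
        [0.02, 0.12, 0.95],
    ];
    rgb_to_xyz(&cam, &mut xyz, &m);
    println!("{}", xyz[0]);
}
```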
My suggestion is to allow the developer to do something like `#[makefast] fn(...)`. Anything that gets annotated like that gets compiled multiple times, once for each of the architecture levels, and then at runtime, depending on the machine being used, the highest available level gets used. Ideally, also patch the call sites on program startup (or use ELF trickery) so the dispatch penalty disappears.
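To make the idea concrete, here is roughly what such an attribute would have to expand into, written out by hand (a sketch; the names and feature levels are made up, and today this is exactly the boilerplate a procedural macro like the one @parched mentions could generate):

```rust
// The same body compiled at several feature levels, plus a dispatcher that
// picks the best clone at runtime.

#[inline(always)]
fn scale_body(xs: &mut [f32]) {
    for x in xs.iter_mut() {
        *x = *x * 2.0 + 1.0;
    }
}

// Clone that LLVM may compile with AVX2 instructions (the shared body gets
// inlined into it, so it can use the wider vector units).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn scale_avx2(xs: &mut [f32]) {
    scale_body(xs)
}

// Clone restricted to SSE4.1, for older CPUs.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.1")]
unsafe fn scale_sse41(xs: &mut [f32]) {
    scale_body(xs)
}

// Dispatcher: checks features on every call for simplicity; caching the
// chosen function pointer (as in the earlier sketch) removes even that cost.
fn scale(xs: &mut [f32]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return unsafe { scale_avx2(xs) };
        }
        if is_x86_feature_detected!("sse4.1") {
            return unsafe { scale_sse41(xs) };
        }
    }
    scale_body(xs)
}

fn main() {
    let mut v = vec![1.0_f32; 16];
    scale(&mut v);
    println!("{:?}", &v[..4]);
}
```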