
Suggestion for a low-effort way to take advantage of SIMD and other architecture specific tricks LLVM knows about #42432

Closed
pedrocr opened this issue Jun 4, 2017 · 7 comments


@pedrocr

pedrocr commented Jun 4, 2017

Issue #27731 already tracks the fine work being done to expose SIMD in ways that are explicit to the programmer. If you're able to write code in those specific ways, big gains can be had. However, there is something simpler that can be done for performance-sensitive code that sometimes greatly improves its speed: just tell LLVM to take advantage of those instructions. The speedup is free in developer time and can be quite large. I extracted a simple benchmark from one of the computationally expensive functions in rawloader, multiplying camera RGB values by a matrix to get XYZ:

https://github.com/pedrocr/rustc-math-bench
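The kernel under test is essentially a 3×3 matrix applied to every pixel. A minimal Rust sketch of that inner loop (names and matrix values are illustrative, not the actual rustc-math-bench code):

```rust
// Hypothetical sRGB-style camera-to-XYZ matrix, for illustration only.
const CAM_TO_XYZ: [[f32; 3]; 3] = [
    [0.4124, 0.3576, 0.1805],
    [0.2126, 0.7152, 0.0722],
    [0.0193, 0.1192, 0.9505],
];

/// Apply a 3x3 matrix to one RGB triple.
pub fn rgb_to_xyz(m: &[[f32; 3]; 3], rgb: [f32; 3]) -> [f32; 3] {
    let mut out = [0f32; 3];
    for i in 0..3 {
        for j in 0..3 {
            out[i] += m[i][j] * rgb[j];
        }
    }
    out
}

/// Transform a whole image stored as interleaved RGB floats in place.
/// This is the loop -march=native lets LLVM vectorize.
pub fn transform(image: &mut [f32]) {
    for px in image.chunks_exact_mut(3) {
        let out = rgb_to_xyz(&CAM_TO_XYZ, [px[0], px[1], px[2]]);
        px.copy_from_slice(&out);
    }
}

fn main() {
    let mut img = vec![0.5f32; 9]; // three gray pixels
    transform(&mut img);
    println!("{:?}", &img[0..3]);
}
```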

I programmed the same multiplication over a 100MP image in both C and Rust. Here are the results; all values are in ms/megapixel, run on an i5-6200U. The runbench script in the repository will compile and run the tests for you with no other interaction.

| Compiler | -O3 | -O3 -march=native |
|---|---|---|
| rustc 1.19.0-nightly (e0cc22b 2017-05-31) | 11.76 | 6.92 (-41%) |
| clang 3.8.0-2ubuntu4 | 13.31 | 5.69 (-57%) |
| gcc 5.4.0 20160609 | 7.77 | 4.70 (-40%) |

So Rust nightly is faster than clang (though that's probably LLVM 3.8 vs 4.0), and the reduction in runtime is quite worthwhile. The problem with doing this, of course, is that the binary is no longer portable to architectures below mine, and it's not optimized for architectures above it either.

My suggestion is to allow the developer to write something like `#[makefast] fn(...)`. Anything annotated like that gets compiled multiple times, once for each architecture level, and then at runtime, depending on the machine being used, the highest supported level gets used. Ideally the call sites would also be patched on program startup (or via ELF trickery) so the dispatch penalty disappears.
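As a sketch of what such an attribute could expand to, here is hand-written runtime dispatch using the feature detection that later landed in std (`is_x86_feature_detected!` and `#[target_feature]`); the function names and the choice of AVX2 are just for illustration:

```rust
// Compiled with AVX2 enabled for this one function; LLVM may
// auto-vectorize the body with wider instructions.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(v: &[f32]) -> f32 {
    v.iter().sum()
}

// Baseline version, compiled for the crate's default target.
fn sum_fallback(v: &[f32]) -> f32 {
    v.iter().sum()
}

/// Dispatch to the best available implementation at runtime.
pub fn sum(v: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe: we just verified the CPU supports AVX2.
            return unsafe { sum_avx2(v) };
        }
    }
    sum_fallback(v)
}

fn main() {
    println!("{}", sum(&[1.0, 2.0, 3.0, 4.0]));
}
```

The check runs on every call here; the proposal above is precisely about hoisting that check out, e.g. by patching call sites once at startup.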

@leonardo-m

Note: `0.0 as f32` is usually written `0f32` or `0.0_f32` or something like that.
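For reference, the three spellings denote the same value; the suffixed forms just avoid the cast:

```rust
fn main() {
    let a = 0.0 as f32; // cast from the default f64 literal
    let b = 0f32;       // integer-style literal with type suffix
    let c = 0.0_f32;    // float literal with type suffix
    assert!(a == b && b == c);
    println!("{} {} {}", a, b, c);
}
```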

@pedrocr
Author

pedrocr commented Jun 4, 2017

@leonardo-m pushed that fix now. I should probably use that in a bunch of other places in my code as well.

@pedrocr
Author

pedrocr commented Jun 4, 2017

I searched around, and this is how the GNU toolchain does runtime dispatching of different implementations of functions in a way that avoids the dispatch cost on every call:

http://www.agner.org/optimize/blog/read.php?i=167
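The lazy-resolution idea described there (GNU ifunc: the first call resolves the best implementation and later calls jump straight to it) can be sketched in Rust with a cached function pointer; the feature test here is a placeholder, and the transmute assumes function pointers are `usize`-sized, as on mainstream platforms:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Two interchangeable implementations; `impl_fancy` stands in for a
// hypothetical SIMD/popcnt-accelerated version.
fn impl_generic(x: u64) -> u64 { x.count_ones() as u64 }
fn impl_fancy(x: u64) -> u64 { x.count_ones() as u64 }

// Cached resolved pointer; 0 means "not resolved yet".
static FN_PTR: AtomicUsize = AtomicUsize::new(0);

fn resolve() -> fn(u64) -> u64 {
    // Placeholder feature test; real code would query the CPU.
    let chosen: fn(u64) -> u64 = if cfg!(target_arch = "x86_64") {
        impl_fancy
    } else {
        impl_generic
    };
    FN_PTR.store(chosen as usize, Ordering::Relaxed);
    chosen
}

/// First call resolves and caches; later calls pay only an atomic load.
pub fn popcount(x: u64) -> u64 {
    let p = FN_PTR.load(Ordering::Relaxed);
    let f: fn(u64) -> u64 = if p == 0 {
        resolve()
    } else {
        unsafe { std::mem::transmute(p) }
    };
    f(x)
}

fn main() {
    println!("{}", popcount(0b1011));
}
```

The ifunc mechanism goes one step further by having the dynamic linker store the resolved pointer directly in the PLT, so even the atomic load disappears.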

@parched
Contributor

parched commented Jun 5, 2017

Hi @pedrocr, I have been working on something almost exactly like what you describe, using a procedural macro. It's not finished, but I can publish it when I get home today if you would like to have a look / give feedback.

@pedrocr
Author

pedrocr commented Jun 5, 2017

@parched sounds very interesting indeed :)

@steveklabnik
Member

This is an RFC-level change to the language; I'd encourage you to move this to a thread on internals. Thanks!

@pedrocr
Author

pedrocr commented Jun 5, 2017
