Allow multiple target features in is_target_feature_detected #348
Comments
This is unfortunately gonna be really hard to do without writing a full-blown procedural macro, and that's one that would have to live in rustc itself most likely.
A perma-unstable macro that splits strings at compile-time (the opposite of
It might be enough for now to mention in the docs that, among the interfaces #[cfg(target_feature = ..)], is_target_feature_detected! and #[target_feature(enable = "")], only the latter accepts (requires?) multiple inputs. Another possible advantage of allowing multiple features is that it would make it easier to minimize the cost of the dynamic check. It appears that each of these is_target_feature_detected checks is run separately, but they could be merged into a single check against the features cache. Though maybe some extra inlines would let the compiler merge those checks without code changes. Alternatively, each is_target_feature_detected! could have its own cache, which would be a single check. If the multiple checks are merged into a single check mask, then it might be worth merging the two 32-bit caches in the x86 detect code into a single 64-bit one when compiling on x86_64.
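The merged-cache idea above could be sketched roughly as follows. This is not the real stdsimd internals; the cache name, bit positions, and helper functions are all illustrative, assuming a 64-bit word that folds both 32-bit caches together so one relaxed load answers a multi-feature query:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical merged cache: two 32-bit words folded into one 64-bit word,
// so a query about several features is a single relaxed load plus a mask test.
static FEATURE_CACHE: AtomicU64 = AtomicU64::new(0);

// Illustrative bit assignments, not the real layout.
const SSE2: u64 = 1 << 0;
const AVX2: u64 = 1 << 37; // a bit from the "upper" 32-bit word

fn cache_detected_features(bits: u64) {
    FEATURE_CACHE.store(bits, Ordering::Relaxed);
}

// One load covers features from both former 32-bit caches.
fn all_detected(mask: u64) -> bool {
    FEATURE_CACHE.load(Ordering::Relaxed) & mask == mask
}

fn main() {
    cache_detected_features(SSE2 | AVX2);
    assert!(all_detected(SSE2 | AVX2)); // single load checks both words
    assert!(!all_detected(1 << 13));    // an undetected feature fails the mask test
}
```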
Hm, I hadn't thought of this. Right now each time we test the cache we perform a relaxed load. AFAICT, LLVM is able to merge multiple consecutive relaxed loads into a single load. For example, check this out: https://godbolt.org/g/jgi6R3. Here,
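The shape of what the godbolt link demonstrates can be sketched like this (the static name and masks are made up for illustration): two back-to-back relaxed loads of the same atomic, which LLVM is permitted to collapse into one load.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Pretend this is the feature cache; the initial value is arbitrary.
static CACHE: AtomicU32 = AtomicU32::new(0b11);

// Two consecutive relaxed loads of the same atomic. Per the discussion
// above, LLVM may legally merge these into a single load.
pub fn both_features(mask_a: u32, mask_b: u32) -> bool {
    let a = CACHE.load(Ordering::Relaxed) & mask_a != 0;
    let b = CACHE.load(Ordering::Relaxed) & mask_b != 0;
    a && b
}

fn main() {
    assert!(both_features(0b01, 0b10));
    assert!(!both_features(0b01, 0b100));
}
```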
What does that buy us? That multiplies the memory requirements of the SIMD run-time by the number of features (~64x increase on
That is how it is done, or what am I missing? On
@alexcrichton we can add an example to the docs that uses two target features to show this. This is probably obvious for those who have used
I didn't mean a load per feature but one for each is_target_feature_detected. So an AtomicBool, which would be a single load on x86. Now it could use the current system for singular features, which I think is already a single load even on x86, and AtomicBool for multiple features. Or even a global AtomicBool per used set of features. While that is a potentially exponential number of feature sets, it's one bool per is_target_feature_detected, which already uses more memory for the generated instructions.

But since it appears that LLVM can merge the relaxed loads and statically construct the feature mask, it's probably not a gain beyond maybe one fewer load on x86. I now expect the cache inefficiency would make it detrimental. I was originally more worried about wasting branch predictor capacity, but it already appears to compile to a single branch.

Technically even that branch can be removed in some cases, by using an AtomicUsize as a function pointer; I've seen an example macro for this somewhere for the older simd crate. Alex Crichton might even have written it. The basic idea is that you have a set of functions with identical signatures, from which you want to select the fastest version at run time. Initially the function pointer points to a detection function, which changes the function pointer to the fastest version and then calls that version. On clang/gcc that type of check can even be done at load time with the ifunc attribute.

Note the reason I'm focusing on the cost of is_target_feature_detected is that when the check was used on a ~20-instruction function, the dynamic checks showed up in perf half as much as the original function, though it was still cheaper than without SIMD. So I'm curious how much of that overhead can be reduced. Of course, on my end, moving the check out of the hot loop to a higher level is the performance-optimal solution. Now that I think about it, I probably shouldn't see "std::stdsimd::arch::detect::check_for" in perf to begin with.
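The self-patching function pointer described above could be sketched as follows. This is a hedged reconstruction, not the actual simd-crate macro: function names are made up, the "detection" is faked with a cfg test, and a zero sentinel stands in for initializing the atomic with the detector's address (which const evaluation disallows here).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

type Impl = fn(u32) -> u32;

// Stand-ins for a SIMD version and a scalar fallback.
fn fast(x: u32) -> u32 { x.wrapping_mul(2) }
fn slow(x: u32) -> u32 { x + x }

// 0 means "not yet detected". The real trick would start the atomic at the
// detector's address so the fast path is branch-free from the first call.
static IMPL: AtomicUsize = AtomicUsize::new(0);

fn detect(x: u32) -> u32 {
    // Pretend this branch is a run-time feature check.
    let chosen: Impl = if cfg!(target_arch = "x86_64") { fast } else { slow };
    // Patch the pointer, then run the chosen version.
    IMPL.store(chosen as usize, Ordering::Relaxed);
    chosen(x)
}

fn call(x: u32) -> u32 {
    match IMPL.load(Ordering::Relaxed) {
        0 => detect(x), // first call: detect and patch
        p => unsafe { std::mem::transmute::<usize, Impl>(p)(x) },
    }
}

fn main() {
    assert_eq!(call(21), 42); // first call runs detection
    assert_eq!(call(5), 10);  // later calls go straight to the chosen impl
}
```

The clang/gcc ifunc attribute does the same thing at load time: the dynamic linker calls a resolver once and binds the PLT entry to whatever function it returns.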
The atomic check would ideally be inlined into the caller of is_target_feature_detected, but that doesn't seem to be the case.
Oops, I just missed the 64-bit cache code when scanning through the code.
I really don't understand what you mean here. Could you ping me on IRC? Maybe it's easier if we just discuss this.
Can LLVM do this? I haven't checked that.
This is something that can be easily built on top of the current system without overhead. RFC 2045 had an example that builds this IIRC, and there are some crates that implement this with a proc-macro, but I don't know if these have been migrated to use
Do you have a minimal working example of this?
Definitely! I'm all for adding more examples :) FWIW, how much do we have to gain from optimizing these checks much more? I was under the impression that if they're called regularly throughout SIMD code it destroys performance anyway?
It is unclear to me that this is possible. @Cocalus mentioned:
So IIUC (which I am not sure I do), @Cocalus proposes to use the current system for calls to That is, Either we would store these
First, it's not one Then @Cocalus states:
On In any case, if each load of an So... while I think that there might be some potential for improvement, I don't think this potential is high. A judicious use of
As long as LLVM can hoist and merge the checks they should be cheap or free. Within the crate boundary this should happen often; across crates it would depend on whether the relevant parts of the crate interface are #[inline] or whether one gets lucky with ThinLTO/LTO.
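The "hoist the check out of the hot loop" approach mentioned earlier in the thread can be sketched like this, assuming an x86_64 target and using the macro's current spelling in std, `is_x86_feature_detected!` (the thread predates the rename). The function bodies are scalar stand-ins; a real SSE2 version would use intrinsics:

```rust
// One dynamic check per call to `sum`, not one per element.
#[target_feature(enable = "sse2")]
unsafe fn sum_sse2(data: &[u32]) -> u32 {
    // Stand-in body; a real version would use SSE2 intrinsics here.
    data.iter().sum()
}

fn sum_scalar(data: &[u32]) -> u32 {
    data.iter().sum()
}

fn sum(data: &[u32]) -> u32 {
    if is_x86_feature_detected!("sse2") {
        // Safe because we just verified sse2 is available at run time.
        unsafe { sum_sse2(data) }
    } else {
        sum_scalar(data)
    }
}

fn main() {
    assert_eq!(sum(&[1, 2, 3, 4]), 10);
}
```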
I think if the relaxed load and check could be inlined, it would speed things up when there are lots of dynamic checks. The other ideas are probably detrimental.
I might have time to take a swing at it, but I'm not having luck pointing an example at a local clone of stdsimd master.

```toml
[dependencies]
rand = "0.4.2"
stdsimd = { path = "../stdsimd" }
```
Also, if I try to run cargo test on stdsimd I get
This is on x86_64 Ubuntu 14.04 Linux, rustc 1.26.0-nightly (2789b067d 2018-03-06)
I've noticed this as well. I have fixed it locally, will send a PR to fix this soon. The temporary workaround is to export a
We should support:
so that users don't need to write
cc @Cocalus
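Until the macro itself accepts multiple features, the requested behavior can be approximated on the user side. Here is a minimal sketch with a hypothetical macro name, built on the single-feature macro as it is currently spelled in std (`is_x86_feature_detected!`), and assuming an x86_64 target:

```rust
// Hypothetical wrapper: expand several features into &&-chained checks
// on top of the existing single-feature macro.
macro_rules! all_features_detected {
    ($($feature:tt),+) => {
        true $(&& is_x86_feature_detected!($feature))+
    };
}

fn main() {
    // "sse" and "sse2" are part of the x86_64 baseline, so this holds
    // on any x86_64 CPU.
    assert!(all_features_detected!("sse", "sse2"));
}
```

This expands to `true && is_x86_feature_detected!("sse") && is_x86_feature_detected!("sse2")`, which is exactly the repetition the issue asks to avoid writing by hand.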