Vectorizing sqrt #9672
Comments
I personally would use … Apart from this, it may be straightforward to vectorize the nan-check as well. This would not require a front-end vectorizer. Instead, … where … Alternatively, the loop could also be rewritten to …, which may be faster since preconditions may be more tedious to check, and several …
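The inline code in the comment above was not preserved. As a rough, hypothetical sketch of the kind of rewrite being described (compute the square roots unconditionally inside the loop, then vectorize the NaN check as a separate reduction); `vsqrt!` and the use of `@fastmath` are my assumptions, not the commenter's code:

```julia
# Hypothetical sketch only; the original comment's code is lost and
# `vsqrt!` is a made-up name.
function vsqrt!(y::Vector{Float64}, x::Vector{Float64})
    @inbounds @simd for i in eachindex(x, y)
        # @fastmath sqrt carries no domain check, so the loop body is
        # branch-free and LLVM is free to vectorize it.
        y[i] = @fastmath sqrt(x[i])
    end
    # The NaN check is an ordinary reduction over the output (negative
    # inputs produce NaN from the unchecked sqrt), which vectorizes too.
    any(isnan, y) && throw(DomainError(x, "sqrt of a negative number"))
    return y
end
```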
It seems like this is another case where utilising the hardware floating point exception machinery, instead of explicit range checking, would be beneficial (see discussion in #5234).
I discovered that LLVM 3.5 can vectorize … On a 4th Generation Core i7, with arrays of 100,000 elements in cache, removing the branch got me almost 4x speedup for … I've started a discussion in the LLVM developer forum on the subject of removing superfluous floating-point checks.
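The benchmark referred to above is likewise not preserved. A hedged sketch of how such a measurement might be reproduced (the function names, the use of `@fastmath` to drop the branch, and the input data are my assumptions):

```julia
# Hypothetical reproduction sketch; `sum_sqrt` and `sum_sqrt_nocheck`
# are made-up names.
function sum_sqrt(x)
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += sqrt(x[i])            # domain check: one branch per element
    end
    return s
end

function sum_sqrt_nocheck(x)
    s = 0.0
    @inbounds @simd for i in eachindex(x)
        s += @fastmath sqrt(x[i])  # no domain check, so the loop can vectorize
    end
    return s
end

x = rand(100_000)                  # ~100,000 nonnegative elements, as above
sum_sqrt(x); sum_sqrt_nocheck(x)   # warm up so timings exclude compilation
@time sum_sqrt(x)
@time sum_sqrt_nocheck(x)
```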
Note: I plan to have …
Good to hear.
Does …?
Yes it does. It is now up to LLVM to vectorize the …
Yes, LLVM 3.5.0 will vectorize it. Here is an example.
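The example referenced above is not preserved in this thread. As a rough sketch of how one could check for vectorization directly (the function name and the use of `@fastmath` to remove the domain check are my assumptions), look for `<4 x double>` operations, e.g. a call to `llvm.sqrt.v4f64`, in the generated IR:

```julia
using InteractiveUtils   # provides @code_llvm on Julia 1.x

# Hypothetical helper; with the domain check removed via @fastmath the loop
# body is branch-free, so LLVM's loop vectorizer can emit vector sqrt calls.
function vsqrt_nocheck!(y, x)
    @inbounds @simd for i in eachindex(x, y)
        y[i] = @fastmath sqrt(x[i])
    end
    return y
end

x = rand(1000); y = similar(x)
# Inspect the IR; vector widths depend on the target CPU.
@code_llvm vsqrt_nocheck!(y, x)
```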
Do we need to keep this issue open then?
Let's close it. 😄 Also related, I submitted a patch to the LLVM community that removes the domain check in some obvious cases.
This issue is for discussing what needs to be done to vectorize `sqrt` effectively. There are two separate show-stoppers:

1. As of LLVM 3.3, the loop vectorizer does not handle calls, even to intrinsics such as `sqrt`. [Updated 2015-1-8 to say "3.3"]
2. Julia's `sqrt` introduces a branch to deal with a negative argument.

The first item is a general LLVM issue and we can take it up with the LLVM community. Oddly, LLVM's "BB Vectorizer" seems to have logic to vectorize intrinsics, but the "Loop Vectorizer" does not.
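To make the second show-stopper concrete, here is a hedged model of where the per-element branch comes from (this is not Base's actual definition of `sqrt`, just an illustration of its shape):

```julia
# Hedged model only; `checked_sqrt` is a made-up name. The x < 0 test is
# the domain check: inside a loop it becomes a branch in every iteration,
# which gets in the way of straightforward vectorization.
checked_sqrt(x::Float64) =
    x < 0 ? throw(DomainError(x, "sqrt of a negative number")) : (@fastmath sqrt(x))
```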
The second item would seem to be more specific to Julia. Some thoughts on ways to address it:

- `sqrt(x^2+y^2)` is a common idiom, and some users would be happy to write `sqrt(abs(z))` if that improved performance over `sqrt(z)`. It would be straightforward to write a pass to do floating-point range propagation through the SSA graph. That's probably too expensive to do everywhere in a JIT environment, but inside `@simd` loops it seems worth the effort (a sketch follows at the end of this post).
- A `@nothrow` macro that causes domain checking to be turned off, and NaNs returned instead.
- `@simd` semantics are defined to allow out-of-order checking of iterations. Exploiting this would let us vectorize other functions with domain checks too. But it's such a radical upgrade of the LLVM loop vectorizer that the LLVM community would likely not accept it unless C/C++ adopt similar semantics for SIMD loops. We could, however, implement it via a "front-end vectorizer", i.e. vectorizing before lowering from Julia IR to LLVM IR, much like ISPC does.
- A pass that removes domain checks inside `@simd` loops but leaves metadata to mark their place. After the vectorizer runs, we could reinsert the checks in vectorized form.

Anyway, that's my initial brain dump. What's other people's take on this?
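As a concrete, hedged illustration of the first idea in the list above: for the `sqrt(x^2+y^2)` idiom the argument is never negative, so the domain check is provably redundant. The proposed range-propagation pass would discover that automatically; the sketch below removes the check by hand with `@fastmath` (my choice of mechanism, not part of the proposal) so that the `@simd` loop can vectorize:

```julia
# Hedged sketch; `norms!` is a made-up name. The argument of sqrt is
# x^2 + y^2, which is never negative, so no domain check is needed here.
function norms!(r, x, y)
    @inbounds @simd for i in eachindex(r, x, y)
        r[i] = @fastmath sqrt(x[i]*x[i] + y[i]*y[i])
    end
    return r
end
```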