Vectorization of a FOR loop using "@simd" with nested "norm" #11037
There are a couple of issues here that thwart the vectorizer. Unfortunately I was not able to get past all of them, but here is as far as I got.
After these changes the loop generates better code, but to my chagrin, the vectorizer still refuses to vectorize it. It might be a cost-model issue. I'll need to take a closer look at it when I have time later this week. Though I'm guessing the code above, even unvectorized, is likely faster than the original.
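Arch's actual snippet was not preserved in this extract. As a hedged illustration only (the function and variable names below are made up, not his code), the kind of rewrite that helps LLVM's loop vectorizer is hoisting the three loads into local scalars and keeping the loop body branch-free:

```julia
# Illustrative sketch, not the original code from this comment:
# explicit scalar loads remove aliasing ambiguity and give the
# vectorizer a simple, branch-free loop body to work with.
function prune_sketch!(out::Vector{Float64}, A::Matrix{Float64}, scale::Float64)
    n = size(A, 2)
    @inbounds @simd for i = 1:n
        a = A[1, i]   # hoisted scalar loads
        b = A[2, i]
        c = A[3, i]
        out[i] = sqrt(a*a + b*b + c*c) * scale
    end
    return out
end
```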
This question is better suited for julia-users, our mailing list: https://groups.google.com/forum/#!forum/julia-users
I also took a crack at it, but since Arch has already chimed in, I think he's covered everything I was going to say, plus relevant autovectorizer knowledge.
Follow-up for @ArchRobison, if you don't mind: I tested your code without the transposed
Thanks for the follow-up @ArchRobison and @pao. I also ran the following tests. I modified my prune function as follows:

```julia
function prune_range!(DSMPerihelion::Array{Float64,2}, SolarRadius::Float64, DSMr::Array{Float64})
    bySradii = 1.0/SolarRadius
    len = size(DSMPerihelion)[2]
    @inbounds @simd for i = 1:len
        DSMr[i] = sqrt(DSMPerihelion[1,i]^2 + DSMPerihelion[2,i]^2 + DSMPerihelion[3,i]^2)*bySradii
    end
end
```

and got the performance results: Next I also tested @ArchRobison's version:

```julia
function prune_range2!(DSMPerihelion_Transpose::Array{Float64,2}, SolarRadius::Float64, DSMr::Array{Float64})
    bySradii = 1.0/SolarRadius
    len = size(DSMPerihelion_Transpose)[1]
    @inbounds @simd for i = 1:len
        r1 = DSMPerihelion_Transpose[i,1]
        r2 = DSMPerihelion_Transpose[i,2]
        r3 = DSMPerihelion_Transpose[i,3]
        DSMr[i] = sqrt(r1^2 + r2^2 + r3^2)*bySradii
    end
end
```

I couldn't use
So now I am a little confused:
thanks
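A self-contained version of the timing harness used for the numbers above (the 3×20979 size is from the issue; the cutoff value 2.3 is made up for illustration). Warming up first keeps compilation out of the measurement:

```julia
# Same prune function as in the comment above, repeated so this
# snippet runs on its own.
function prune_range!(DSMPerihelion::Array{Float64,2}, SolarRadius::Float64,
                      DSMr::Array{Float64})
    bySradii = 1.0/SolarRadius
    len = size(DSMPerihelion)[2]
    @inbounds @simd for i = 1:len
        DSMr[i] = sqrt(DSMPerihelion[1,i]^2 + DSMPerihelion[2,i]^2 +
                       DSMPerihelion[3,i]^2)*bySradii
    end
end

A = rand(3, 20979)
r = zeros(20979)
prune_range!(A, 2.3, r)        # warm up: compile before timing
@time prune_range!(A, 2.3, r)  # the in-place loop allocates essentially nothing
```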
I'm hoping we get to the point where we can do fast, small-vector norms without having to spell them out longhand. If we can reinterpret a slice as a fixed-size array, like those proposed in #7568, then we should be able to dispatch to a specialized implementation.
As to your question (1), this gets back to what Arch mentions in his reply about the cost model. If you go back to the original article, one of the points he makes is that the transformation must be "profitable", which is to say there are heuristics involved in deciding whether to vectorize, and they may need to be tuned a bit.
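A hedged sketch of the reinterpret-as-fixed-size idea, using the array `reinterpret` form from later Julia versions (it postdates this discussion, and `norm3` is a hypothetical helper, not a Base function): viewing the 3×N matrix as a vector of 3-tuples lets a specialized, fully inlined norm be dispatched without copying.

```julia
# Hypothetical fixed-size norm; small enough to inline completely.
@inline norm3(t::NTuple{3,Float64}) = sqrt(t[1]^2 + t[2]^2 + t[3]^2)

function prune_tuples!(out::Vector{Float64}, A::Matrix{Float64}, scale::Float64)
    # View the column-major 3×N data as N tuples of 3 Floats, no copy.
    cols = reinterpret(NTuple{3,Float64}, vec(A))
    @inbounds for i in eachindex(cols)
        out[i] = norm3(cols[i]) * scale
    end
    return out
end
```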
It would be great if we could tell the vectorizer to bypass its internal heuristics and always optimize a loop. Maybe like
I have not played with the ImmutableArrays package, but I hope it will take care of the issue of writing longhand, spelled-out code until the compiler becomes smarter. I plan to do some tests later today and see if ImmutableArrays can be used with the norm function.
ImmutableArrays.jl does provide a
@ArchRobison's code is vectorized for me with LLVM 3.6 but not LLVM 3.3. The untransposed version appears to vectorize as well (with
That is both my favorite label and least favorite label! Thanks for checking with a new LLVM.
Awesome... thanks guys. I will try with LLVM 3.6 and get back later today with some benchmark numbers. |
With respect to @GravityAssisted's question about strided vs. unstrided: if the loop is not vectorized, the transposition of
I concur with @pao's comment that it would be nice to be able to do small-vector norms. I thought it was just a matter of more clever inlining (particularly recognizing short fixed-count loops), but now I see that
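The strided-vs.-unstrided trade-off comes down to Julia's column-major storage, which a tiny example makes concrete: in the 3×N layout the three components of point `i` are adjacent in memory (cache-friendly, but stride-3 loads for the vectorizer), while in the N×3 transpose each component is a contiguous column (unit-stride loads, at the cost of three separate cache streams).

```julia
# Julia stores matrices column-major: consecutive memory locations
# walk down a column first.
A = reshape(collect(1.0:6.0), 3, 2)   # columns are [1,2,3] and [4,5,6]
# Memory order is A[1,1], A[2,1], A[3,1], A[1,2], A[2,2], A[3,2],
# so A[1,2] is the 4th element in memory.
```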
Hi all, sorry I was on vacation for a week! Hence the late reply. Following @ArchRobison's advice I made the switch to LLVM 3.6 and was able to get it vectorized 👍. Thanks for the explanation on cache misses.
Is this addressed now? |
SIMD tests have been added |
but not this test case |
Hi All,
I am a new Julia user and am trying to get a performance improvement from using the @simd macro in julia-0.3. I have the following function with a nested norm function call:
It runs on an input 2D array DSMPerihelion of size (3, 20979), and I get the following performance:

```julia
@time prune_range!(DSMPerihelion, SolarRadiiCutoff, DsmR)
elapsed time: 0.00252588 seconds (2181976 bytes allocated)
```

Now when I try to vectorize the code as follows:
I get almost the same performance:

```
elapsed time: 0.002716825 seconds (2181976 bytes allocated)
```

I looked at the code_llvm output and the code doesn't seem to get vectorized. Is there something I am missing for vectorizing this code? I followed the Intel blog (https://software.intel.com/en-us/articles/vectorization-in-julia) by @ArchRobison to understand vectorization in Julia. This function will be called millions of times, so it's important that it performs well; hence my effort to vectorize it.
Thanks for the help.
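For reference, a minimal sketch of the code_llvm check described above, using a simplified stand-in function (the name `scaled_norms!` is made up): a vectorized loop shows SIMD vector types such as `<4 x double>` in the printed IR.

```julia
# Simplified stand-in for the prune function in this issue.
function scaled_norms!(out::Vector{Float64}, A::Matrix{Float64}, s::Float64)
    @inbounds @simd for i = 1:size(A, 2)
        out[i] = sqrt(A[1,i]^2 + A[2,i]^2 + A[3,i]^2) * s
    end
    return out
end

# In the REPL (Julia 0.3 spelling; later versions also offer @code_llvm):
# code_llvm(scaled_norms!, (Vector{Float64}, Matrix{Float64}, Float64))
# Look for <4 x double> (or similar) vector operations in the output.
```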