Enable SLPVectorizer and make it work for tuples #6271
Conversation
This is very exciting – it's lovely to see such nice, short code generated. |
Cool! How does the vectorizer work with vectorized loads/stores? Could this be collapsed into a vectorized load / add / store? I assume performance will suffer somewhat due to Julia array alignment issues for larger vector widths.

```julia
function vadd_one!(arr::Array{Float64, 1})
    len = length(arr)  # assuming len is a multiple of 4
    one = (1.0, 1.0, 1.0, 1.0)
    for i = 1:4:len
        inp = (arr[i], arr[i+1], arr[i+2], arr[i+3])
        out = vadd(inp, one)
        for j = 1:4
            arr[i + j - 1] = out[j]  # j - 1 stores back to the same slots that were loaded
        end
    end
end
``` |
@jakebolewski : My impression is that SLPVectorizer is supposed to handle that sort of situation if bounds checking is turned off. But alas, even the SLPVectorizer in LLVM trunk failed to deal with the case cleanly. It vectorized the add partially, but scalarized the stores. So apparently there's more work to be done on SLPVectorizer. Though for vectorizing loops, it's usually better if the programmer does not choose the chunk size, because the optimal chunk size depends on the target platform, and since Julia is a JIT, it can exploit whatever vector lengths are on the platform. Leaving chunking to the compiler simplifies the code too. See here for an example using the proposed |
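For contrast with the hand-chunked version above, a sketch of the chunk-free style this comment advocates (the name `add_one!` is illustrative; `@inbounds` stands in for the bounds-check removal the comment mentions):

```julia
# Chunk-free formulation: no hard-coded vector width, so the compiler can
# pick whatever SIMD width the target platform supports.
# @inbounds removes bounds checks, which the vectorizers generally require.
function add_one!(arr::Vector{Float64})
    @inbounds for i = 1:length(arr)
        arr[i] += 1.0
    end
    return arr
end
```

Whether this actually vectorizes depends on the pass setup being discussed in this PR, but the source form no longer bakes in a chunk size of 4.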
@ArchRobison I agree that if possible the chunk size should not be chosen by the programmer, although this will not be the case if Julia gets builtin SIMD types. I saw that you were using a platform that supports AVX so I tailored my example to see if this patch helps the compiler take advantage of 256 bit SIMD instructions. I've been testing your |
The SIMD types proposal (#2299) could be architecture-specific just like the size of Int is. |
I feel you would want to support all SIMD types available for an architecture. Being able to mix and match 128-bit SIMD types and 256-bit SIMD types is often useful, so choosing the SIMD size should be left to the user. 128-bit SIMD ops are also more portable between different architectures (assuming Julia gets ARM support in the future). |
512-bit SIMD types are coming (and in fact here on Intel Xeon Phi). Code hardwired to 128 bit SIMD ops will be throwing away 3/4 of the ALU on those machines. The SIMD types should be parameterized by width, and a global constant could indicate the natural machine width. Though alas "natural width" and "fastest" are not always the same, often because of memory subsystem and pipelining quirks. I've had the same problem getting |
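A width-parameterized design might look like the following sketch (hypothetical API, not from the PR; `NATURAL_VECTOR_WIDTH` is an invented constant, and modern `where` syntax is used for brevity):

```julia
# Hypothetical width-parameterized tuple arithmetic, per the comment's
# suggestion: the width N is a type parameter, and a global constant
# advertises the natural machine width (4 x Float64 on AVX, 8 on AVX-512).
const NATURAL_VECTOR_WIDTH = 4  # assumed value for illustration

vadd(a::NTuple{N,T}, b::NTuple{N,T}) where {N,T} = ntuple(i -> a[i] + b[i], N)
```

Code written against `NATURAL_VECTOR_WIDTH` then adapts to the platform, while code that needs a specific width can still instantiate, say, `NTuple{2,Float64}` explicitly.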
LGTM. @JeffBezanson ? |
@loladiro Thanks for looking it over. The corresponding patch for LLVM trunk has yet to settle -- it was committed to trunk, but then reverted because it broke something. I'll wait for the dust to settle and then redo the backport. |
@ArchRobison Complex arithmetic doesn't seem to vectorize for me even with this patch and the corresponding patch to LLVM 3.3. Should that be happening? |
I was looking at that yesterday too. The catch is that SLPVectorizer does not know how to vectorize structs, even when they are isomorphic to tuples. See my email exchange for the LLVM details. What would be the impact if we lowered homogeneous composite types to LLVM vectors? Would that cause a calling convention mismatch with C? |
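To make the struct/tuple distinction concrete, an illustrative sketch (not from the PR): `Complex{Float64}` lowers to an LLVM struct of two doubles, which SLPVectorizer skips, while the isomorphic homogeneous 2-tuple is the shape this patch teaches it to handle.

```julia
# Struct form: Complex{Float64} lowers to an LLVM struct {double, double}.
cadd(a::Complex{Float64}, b::Complex{Float64}) = a + b

# Tuple form: isomorphic data, but expressed as NTuple{2,Float64}, the
# kind of homogeneous tuple idiom the patched SLPVectorizer can vectorize.
tadd(a::NTuple{2,Float64}, b::NTuple{2,Float64}) = (a[1] + b[1], a[2] + b[2])
```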
I removed the "WIP" from the title because I think this patch is about as ready as it is ever going to be for a while. I've updated the base note with new compilation slowdown measurements. Compilation slowdown seems to be about 4.5%. (Though as noted in the revised base note, someone else should check this.) I suppose this is good for codes that show improvement, and baggage for those that don't. Can anyone suggest any realistic workloads that might benefit from this patch? It would need to be code that benefits from using SIMD to implement tuple arithmetic. As @simonster noted, it won't help vectorize complex tuples. |
As long as it seems stable, I'm ok with merging this. 4.5% compilation time overhead is not bad at all. |
+1 to merging, all else being good. I wonder if this shows any improvements in our perf benchmarks. |
Right now this code pattern is not used much so I don't expect much impact, but this will soon change dramatically. Our tuples are overall not efficient enough to be used this way---especially arrays of them. But that will change in 0.4. The compiler is starting to get slow and will get even slower with MCJIT. I'd like to see if we can do some simple things to selectively enable these optimizations, e.g. not doing loop vectorization if there are no loops. |
I agree it would be worth looking at profiles and looking for fast rejection tests. For profiling, I need to build with symbols in the binaries (for Julia and LLVM) but no extra baggage. What's the easiest way to build that configuration? The
The LowerSIMDLoop pass is similarly driven by runOnLoop. For LLVM trunk, the story is less clear.
|
It should be much easier to turn this on now with the new |
Yes, it sounds worthwhile. Where's the new mechanism documented? |
It's an internal interface, so as usual it's not documented. But see #8459. |
Looks like there is a pass-order issue: Clang runs GVN (Global Value Numbering) before SLPVectorizer, but I put SLPVectorizer before GVN, and SLPVectorizer leans on GVN. I'll update my PR to reorder them. Though I also noticed that Clang 3.3 and 3.5.0 put the LoopVectorizer in significantly different places with respect to GVN/SLPVectorizer, so I have some more digging to do. [Update: just moving GVN does not solve the problem.] |
Very nice! |
[Pardon long answer: some details are here so I can look them up later.] Do you have an example of the loop that you were trying to both unroll and vectorize? The vectorizer has its own partial unroller. As far as I know, partial unrolling before vectorization will thwart the vectorizer because it destroys the regular access pattern. For the passes we've been discussing, Clang 3.5's order (per
|
I found the root problem: GVN uses LLVM's memory dependence analysis, which in turn seems to require "Basic Alias Analysis" to handle this case. And I'll measure the compile-time cost of adding the additional alias analysis. If it's unacceptably high, we can look at what it takes to fix GVN to handle this case with just the existing type-based alias analysis, which in principle should have been enough. Though hopefully the pass will pay for itself by better performance. |
I took @ViralBShah's suggestion and updated the PR to run |
I've updated the PR. See the revised base note for details and compilation-time overheads. When built with LLVM 3.5 and the PR, Alas, Travis is reporting a failure for the |
No, I've been seeing that intermittently on master and other PR's as well. I very strongly suspect it's unrelated. |
Nice! |
Looking forward to this being merged. Would love to try out |
Change pass order to more closely match Clang's. Add patch to LLVM 3.3 with enhancements for vectorizing tuples. The patch contains functionality and tests backported from SLPVectorizer changes added to LLVM 3.5.0.
The AppVeyor/Travis problems went away after I rebased earlier this week. I think this PR is in good shape now. |
Yay! |
[Updated 2014-Nov-25] This pull request enables implicit vectorization of tuple math idioms when -O is on the Julia command line.
The changes comprise:

- A run of `InstructionCombiningPass` to clean up after `SLPVectorizer`. Without the cleanup, the code is horrible.
- Changes to make `SLPVectorizer` work for tuply idioms of interest. The changes to LLVM 3.3 were backported from recent changes developed by me that are now part of LLVM 3.5. The patch includes backported LLVM unit tests.
- Running the vectorization passes after `LoopUnrollPass` instead of before.

For background, see Issue #5857 and in particular my comment. Here is an example of what the enhanced SLPVectorizer can do:
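The example code itself did not survive extraction; below is a minimal sketch of the kind of tuple idiom involved. The name `madd` comes from the surrounding text, but the body is an assumption and may differ from the PR's original example.

```julia
# Sketch (not the PR's original example): element-wise multiply-add on
# homogeneous 4-tuples. With the patched SLPVectorizer and -O, idioms like
# this can compile to a few SIMD instructions instead of lengthy scalar code.
function madd(a::NTuple{4,Float64}, b::NTuple{4,Float64}, c::NTuple{4,Float64})
    (a[1]*b[1] + c[1], a[2]*b[2] + c[2], a[3]*b[3] + c[3], a[4]*b[4] + c[4])
end
```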
Without the patch, the LLVM code for `madd` is lengthy scalar stuff.

The impact on compilation time with -O is significant using LLVM 3.3, but hardly noticeable with LLVM 3.5. I measured compilation overhead using:
Here's a table of compilation times (in seconds). "Baseline" is Julia without the PR.