SIMD performance regression tests #13686

damiendr · 2015-10-20T09:00:59Z

There are currenly a number of performance regressions with @simd loops, some involving inlining problems.

It would be great to have a test suite for @simd loops that checks for actual vector tags in the llvm code or SSE instructions in the native code. The current suite only checks for the correctness of the results.

I'm willing to contribute that. I'd need a bit of guidance though on how to do the PR, considering that the tests will likely fail with the latest head.

The text was updated successfully, but these errors were encountered:

timholy · 2015-10-20T11:38:09Z

A big 👍 from me. I'd say submit the PR and then hopefully that will trigger focused work on fixing the current problems. Presumably you can comment-out specific tests to check that your testing framework makes sense.

ViralBShah · 2015-10-20T11:47:14Z

cc @ArchRobison

yuyichao · 2015-10-20T12:07:37Z

A good starting point would be going through the issues/PRs with the performance label. I think we've been using that label for exactly this.

damiendr · 2015-10-20T14:58:28Z

OK, I have something almost ready. Here are the things that currently don't vectorize:

multiple reductions
loops on local variables without a type declaration (eg. a = Array(Float32, 100) (NOK) vs. a::Array{Float32,1} = Array(Float32, 100) (OK). That is a regression from 0.4-rc4.
loops with field access eg. obj.a[i] = ...
long expressions (depending on where the parentheses are)
function calls that could be vectorized eg. sqrt, sin, etc. with the exception of muladd (checked with @fastmath).

Is there any of these things that I should not include in the tests?

timholy · 2015-10-20T15:11:39Z

That's a great list, and great contribution. I'd suggest submitting it with the failing tests commented out, with

# TODO: uncomment the following tests
# failing tests here...
# TODO: uncomment the above tests

bracketing the failing tests. That might reduce the chances that someone will accidentally delete them.

Then it would be incredibly helpful to file separate issues about each failing case, each issue linking to the commented-out test(s). (Do you know the 'y' GitHub trick?)

yuyichao · 2015-10-20T15:12:31Z

loops on local variables without a type declaration (eg. a = Array(Float32, 100) (NOK) vs. a::Array{Float32,1} = Array(Float32, 100) (OK). That is a regression from 0.4-rc4.

Can you try #13463 (I'm still working on a Jameson compatible version of it...)?

loops with field access eg. obj.a[i] = ...

Is that a immutable? (It might be harder if it's not)

function calls that could be vectorized eg. sqrt, sin, etc. with the exception of muladd (checked with @fastmath).

Are you talking about using libmvec or sth like that? It'll be hard since AFAIK LLVM can't do that yet.

jiahao · 2015-10-20T15:13:16Z

I've also found that @simd does not vectorize division. (See #13681 for a discussion)

yuyichao · 2015-10-20T15:16:36Z

Seems that the following code vectorize just fine on llvm 37. Is this a llvm improvement or are your example more complicated?

julia> function f(a, s)
           @inbounds @simd for i in eachindex(a)
               a[i] /= s
           end
       end

damiendr · 2015-10-20T15:19:45Z

@timholy thanks! Will do that.

@yuyichao I'll check. As for sqrt etc. it seems like LLVM can do it and that it used to work: #9672

yuyichao · 2015-10-20T15:21:38Z

As for sqrt etc. it seems like LLVM can do it and that it used to work: #9672

When I check with clang, it does vectorize it at IR level but then lower it to scalar function calls.

damiendr · 2015-10-20T15:26:27Z

@yuyichao it was labelled "Fixed in LLVM-3.5" in #9672, maybe that's why? My julia HEAD install says it's using LLVM-3.3.

jiahao · 2015-10-20T15:27:33Z

@yuyichao I see vdivpd in the output of @code_native on your f on LLVM 3.3 too. However, consider generic_scale!, which is essentially

function p(a, s)
    @inbounds for i in eachindex(a)
        a[i] *= s
     end
end

When I add the @simd annotation I see a significant speedup for a = rand(10^8). @code_native spits out different code and in both cases emits vectorized instructions.

yuyichao · 2015-10-20T15:31:46Z

@damiendr I'm on LLVM 37.
Just checked and sqrt indeed vectorize at both IR and native level. sin still doesn't work likely because we pass the openlibm function pointer instead of using the llvm intrinsics (there's a issue and a PR IIRC). I would expect it to only work at IR level (at least on linux) at best though.

Ref: #13686 (comment) #13681 (comment)

damiendr · 2015-10-20T15:40:31Z

Is that a immutable? (It might be harder if it's not)

You're right, it works with an immutable but not with a type. It's a fairly common idiom though, so I'll open an issue for it. Right now my workaround is to unpack everything outside the simd loops (which is OK in my case because it's all generated code anyways).

damiendr · 2015-10-20T16:26:35Z

Here are the tests:
damiendr@cd72ff1

Will finish opening the issues & create a PR.

damiendr · 2015-10-20T17:29:11Z

OK, so I have some issues I don't understand with the PR. The new tests pass if I run the tests from the REPL but fail in make test. There isn't much information about why, only "Error during test". Any idea why that could be?

damiendr@a65c2b2

I'm not very familiar with PRs yet, sorry about that!

damiendr · 2015-10-20T18:06:28Z

Found the culprit: https://github.com/JuliaLang/julia/blob/5fbf28bfaa80098d8d70ca9c2131209949f36a21/test/Makefile

$(TESTS):
    @cd $(SRCDIR) && $(call PRINT_JULIA, $(call spawn,$(JULIA_EXECUTABLE)) --check-bounds=yes --startup-file=no ./runtests.jl $@)

Seems like the tests are run with --check-bounds, bypassing any @inbounds in the code.

malmaud · 2015-10-20T18:25:14Z

Thanks for this effort, @damiendr - it's much appreciated.

yuyichao · 2015-10-20T21:21:48Z

@jiahao (Finally get back to my computer). FYI, your p(a, s) vectorize just fine on llvm 37 too (without @simd).

timholy · 2015-10-20T21:27:31Z

There are good reasons to run most tests with forced bounds-checking, but you're right that this is problematic for testing SIMD. Seems worth running the SIMD tests by a separate mechanism.

yuyichao · 2015-10-20T21:30:10Z

Should this be part of the performance test handled by @nanosoldier?

damiendr · 2015-10-20T22:11:27Z

I think the important bit to avoid regressions is that these test are run with the CI and can make a build/PR fail. Is that the case of the tests in tests/perf? If so it would be quite natural to put them there.

tkelman · 2015-10-20T23:47:14Z

Demanding perf tests need to be run on consistent hardware and we can run them often, but primarily on master or specifically chosen PR's, not every single PR. We should first try tracking and making perf results visible so regressions get noticed promptly, and see what kind of hardware load that imposes. Gating every single build's success or failure on whether there's a significant perf regression would likely require more hardware than we can dedicate to it initially.

damiendr · 2015-10-21T00:02:54Z

The tests I added to simdloop.jl don't actually run anything costly. They just compile functions and look at the LLVM IR.
Base.runtests("simdloop") takes about 2 seconds on my laptop. Don't know how that compares to the tests in test/perf but that is certainly insignificant compared to those in make test-all, which take several minutes in total.

tkelman · 2015-10-21T00:24:36Z

2 seconds is fine to put in the normal test suite, but if it relies on running with bounds checks disabled then maybe running them inside a separate spawned process is the best thing to do.

ArchRobison · 2015-10-21T04:26:11Z

+1. Thanks for doing this. I recommend checking the LLVM output since it's a little easier to read, though it will still be target dependent. Checking that @simd did better than the auto-vectorizer can be tricky. You might be able to check for the absence of the run-time dependence check, but I forget whether it's easily recognizable.

I'm tied up this week but can take a look next week at cases that refuse to vectorize but seem like they should.

Ref: #13686 (comment) #13681 (comment)

yuyichao · 2015-10-22T20:47:47Z

@ArchRobison There's also another llvm 3.7 regression that is causing SIMD to fail in random cases (I noticed it in fill! for example). I didn't report it here and only to the llvm bug tracker since it is reproducible with clang 3.7 as well. Would be nice if you could have a look at that as well when you have time.

damiendr · 2015-10-23T16:40:16Z

So, in the short term, should I...

put the tests in /perf or in a separate project
create a new make target for tests that need to be run without --check-bounds (am not very familiar with make but can probably manage something)
add code to simdloop.jl to spawn a separate process for these tests?

Or maybe there is a way to override --check-bounds temporarily?

KristofferC · 2015-10-25T11:37:15Z

I don't see why this is supposed to go into the /perf directory when it is just directly testing codegen in a pass/fail way. I think spawning a separate process for the simd tests would work well.

Overall, I think we would have much use of tests that do not only test for correctness but also test that the generated code and that the number and size of allocations does not change (whenever it is feasible). Might help to catch things like: #13735

damiendr · 2015-10-25T14:35:15Z

I found a quick-and-dirty solution to spawn a new process from within simdloop.jl:

<imports>
<types>

if Base.JLOptions().check_bounds == 1
    testproc = addprocs(1, exeflags="--check-bounds=no")[1]
    @fetchfrom testproc include("testdefs.jl")
    @fetchfrom testproc include("simdloop.jl")
else
    <tests>
end

But I'm wondering if the cleanest way would not be to rewrite runtests.jl to have per-test exec flags. That is much more work though.

damiendr · 2015-10-25T14:37:39Z

Incidentally, I realised that ever since this --check-bounds option was added the entire simdloop test suite has been functionally dead. No vector code was ever generated and therefore the correctness of vectorized code could not possibly have been checked.

ArchRobison · 2015-10-26T22:07:51Z

Ouch. One way to catch this is to write a deliberately abusive use of @simd that generates incorrect results if the code is really vectorized. Then a wrong answer is right, and vice-versa :-). Though it's a bit tricky because LLVM's "vectorizer" largely acts like an auto-vectorizer and will refuse to vectorize some abuses. A loop over i with a[i] = b*a[i+k]+c where k is a parameter to the containing function, and k is actually negative, might do the trick. (Though it might vectorize without the b* and +c, it might not because of the low flop to memory hit ratio.)

I concur that @KristofferC that the tests that check for vectorization by inspecting the compiled code technically belong in some other directory.

damiendr · 2015-10-27T08:17:19Z

So, the tests (that is, those that should work already) now pass on my machine. However they still fail on the CI machines, in different places. Does Julia generate a CPU-specific LLVM IR?

Ref: JuliaLang#13686 (comment) JuliaLang#13681 (comment)

damiendr · 2015-10-27T13:25:42Z

After having a look at codegen.cpp and the many #ifdef I now believe that my current approach, eg. looking for vector.body in the LLVM IR can only work for one particular target platform. I'll check what can be done at the code_lowered/typed level...

yuyichao · 2015-10-27T14:16:33Z

I've also noticed cases that doesn't vectorize but still have the vector.* labels. Does checking for the existance of vector type like < x > works?

ArchRobison · 2015-10-27T14:32:41Z

@damiendr : Vectorization is going to be target specific since some targets may lack SIMD support, and cost models (i.e. whether vectorization is profitable) vary even across micro-architectures. So the tests will need to be platform specific.

With respect to vector labels showing up in code without vector instructions, this happens because the vectorizer does two optimizations: using SIMD instructions and interleaving separate loop iterations to hide latency. So presence of the labels indicates that the right semantic information got to the vectorizer, but its cost model indicated that only interleaving was worthwhile. This commonly happens for codes involving scatters or gathers on hardware that requires synthesizing those instructions from serial instructions.

damiendr · 2015-10-27T15:18:47Z

@ArchRobison Right. Initially I thought that the IR given by code_llvm would be the input to platform-specific transforms but I now see it is an output of these!

I'm not quite sure how to design the tests in these conditions, because ideally we should be testing only Julia's ability to work nicely with the vectorizer, not the vectorizer's cost model and the CPU features.

use LLVM options to force vectorization for testing, (eg. connected to a --vectorize={yes|no|all} flag)
assume/force a specific cpu for testing
have multiple cpu-specific test cases
rather than test for vectorized LLVM IR, test for a number of criteria for vectorizable code_typed, for instance conformance to a whitelist of function calls. It looks like this would already catch a lot of the issues I had like long expressions (Base.afoldl in code_typed) or non-inlined functions.

ArchRobison · 2015-10-27T15:44:39Z

I recommend making the tests resemble real field usage as much as possible, so we're testing what we're shipping. E.g., not rely on special options for testing. LLVM puts CPU-specific tests in separate subdirectories, one for each CPU. I think that's a reasonable scheme for us too. We could start with 64-bit x86 because it's prevalent and guaranteed to have at least SSE, and then expand to other targets. [Disclosure: I work for Intel.]

I think it's worth the trouble to test the LLVM IR for vector instructions, since there are all sorts of subtle things that can prevent vectorization.

As far as infrastructure, it's worth taking a look at how LLVM's tests combine both the input and the expected output into a single file. It's a nice arrangement maintenance-wise.

damiendr · 2015-10-27T16:07:14Z

I recommend making the tests resemble real field usage as much as possible, so we're testing what we're shipping. E.g., not rely on special options for testing. LLVM puts CPU-specific tests in separate subdirectories, one for each CPU. I think that's a reasonable scheme for us too. We could start with 64-bit x86 because it's prevalent and guaranteed to have at least SSE, and then expand to other targets.

I'm not sure if it's that simple. One test suceeds on my machine (Sandy Bridge Core i7-2635QM) and fails on the CI machine (Nehalem Xeon X5570). I'm not yet 100% sure that there aren't other factors or bugs involved but it could be that the tests only vectorize with AVX instructions, or just that the cost model is different for the Xeon.

yuyichao · 2015-10-27T16:15:48Z

I thought we disabled avx on non mcjit. Are you testing locally with a newer llvm?

ArchRobison · 2015-10-27T17:10:37Z

Is the test that fails on the CI machine use Float32 or Float64? I can see where a cost model might determine that Float64 is not worth the trouble with just SSE, but I would have expected Float32 to vectorize. I'm not sure if I can get Julia to build on our Nehalems here -- we tend to have antique environments on our antique machines.

damiendr · 2015-10-27T22:18:10Z

I ran some diagnostics and here are some elements:

The tests fail on the first types tested, which happen to be Int32 in one case and Float32 in the other
the AppVeyor x86_64 build has cpu_target = "x86-64" (vs. "native" on my machine and in the Travis build)
the code_warntype output is the same on my machine and on the test machines
failure is indeed caused by non-vectorized code_llvm, despite correct execflags
I checked the code emitted by my machine and it does contain AVX instructions, even though I use the built-in libLLVM-3.3 (if versioninfo() can be trusted). Here's a snippet:

    vblendvps   %ymm1, %ymm4, %ymm5, %ymm5
    vextractf128    $1, %ymm5, %xmm7
    vpcmpgtd    %xmm0, %xmm7, %xmm7
    vcmpnleps   %ymm4, %ymm6, %ymm4
    vpslld  $31, %xmm4, %xmm6
...
    vpxor   %xmm0, %xmm0, %xmm0
    vpxor   %xmm1, %xmm1, %xmm1
L224:   vpaddd  %xmm0, %xmm1, %xmm0
    vmovhlps    %xmm0, %xmm0, %xmm1 ## xmm1 = xmm0[1,1]
    vpaddd  %xmm1, %xmm0, %xmm0
    vphaddd %xmm0, %xmm0, %xmm0
    vmovd   %xmm0, %eax

None of the test machines seem to support the AVX instruction set.

So a tentative conclusion would be that most of the SIMD examples, including some very simple ones, fail to vectorize on non-AVX architectures.

It would be nice to run the tests in such a way that failures do not stop the next tests from running, to get a better idea of what fails & what does not.

eschnett · 2015-10-28T03:05:17Z

@ArchRobison To check for SIMD vectorization, you can craft a loop that involves floating-point operations with round-off errors. Depending on the order in which the operations are performed, you would get different round-off, and thus be able to tell whether SIMD vectorization was employed. This requires a reduction operations, though, so not all SIMD loops can be checked.

nalimilan · 2015-10-31T11:06:31Z

the AppVeyor x86_64 build has cpu_target = "x86-64" (vs. "native" on my machine and in the Travis build)

@damiendr This difference is a plausible explanation of the different results you get. Try passing JULIA_CPU_TARGET=x86_64 on your machine, and you'll likely get the same results as the AppVeyor build.

KristofferC · 2017-06-24T18:42:28Z

I think the extensive benchmarks together with the recent codegen tests cover this pretty ok.

jiahao added a commit that referenced this issue Oct 20, 2015

Add simd annotation to generic_scale!

c4c6bb6

Ref: #13686 (comment) #13681 (comment)

damiendr mentioned this issue Oct 20, 2015

SIMD performance regression tests #13692

Closed

JeffBezanson added performance Must go faster test This change adds or pertains to unit tests labels Oct 21, 2015

jiahao added a commit that referenced this issue Oct 21, 2015

Add simd annotation to generic_scale!

db90ac0

Ref: #13686 (comment) #13681 (comment)

damiendr closed this as completed Oct 23, 2015

damiendr reopened this Oct 23, 2015

damiendr mentioned this issue Oct 26, 2015

SIMD regression tests #13777

Closed

bjarthur pushed a commit to bjarthur/julia that referenced this issue Oct 27, 2015

Add simd annotation to generic_scale!

261460b

Ref: JuliaLang#13686 (comment) JuliaLang#13681 (comment)

jrevels added the potential benchmark Could make a good benchmark in BaseBenchmarks label Nov 17, 2015

jrevels added a commit to JuliaCI/BaseBenchmarks.jl that referenced this issue Jan 26, 2016

add SIMD benchmarks from JuliaLang/julia#13777 and JuliaLang/julia#13686

12637db

jrevels removed the potential benchmark Could make a good benchmark in BaseBenchmarks label Jan 26, 2016

tkelman mentioned this issue Jun 2, 2016

code gen tests #16721

Closed

KristofferC closed this as completed Jun 24, 2017

SIMD performance regression tests #13686

SIMD performance regression tests #13686

Comments

damiendr commented Oct 20, 2015

timholy commented Oct 20, 2015

ViralBShah commented Oct 20, 2015

yuyichao commented Oct 20, 2015

damiendr commented Oct 20, 2015

timholy commented Oct 20, 2015

yuyichao commented Oct 20, 2015

jiahao commented Oct 20, 2015

yuyichao commented Oct 20, 2015

damiendr commented Oct 20, 2015

yuyichao commented Oct 20, 2015

damiendr commented Oct 20, 2015

jiahao commented Oct 20, 2015

yuyichao commented Oct 20, 2015

damiendr commented Oct 20, 2015

damiendr commented Oct 20, 2015

damiendr commented Oct 20, 2015

damiendr commented Oct 20, 2015

malmaud commented Oct 20, 2015

yuyichao commented Oct 20, 2015

timholy commented Oct 20, 2015

yuyichao commented Oct 20, 2015

damiendr commented Oct 20, 2015

tkelman commented Oct 20, 2015

damiendr commented Oct 21, 2015

tkelman commented Oct 21, 2015

ArchRobison commented Oct 21, 2015

yuyichao commented Oct 22, 2015

damiendr commented Oct 23, 2015

KristofferC commented Oct 25, 2015

damiendr commented Oct 25, 2015

damiendr commented Oct 25, 2015

ArchRobison commented Oct 26, 2015

damiendr commented Oct 27, 2015

damiendr commented Oct 27, 2015

yuyichao commented Oct 27, 2015

ArchRobison commented Oct 27, 2015

damiendr commented Oct 27, 2015

ArchRobison commented Oct 27, 2015

damiendr commented Oct 27, 2015

yuyichao commented Oct 27, 2015

ArchRobison commented Oct 27, 2015

damiendr commented Oct 27, 2015

eschnett commented Oct 28, 2015

nalimilan commented Oct 31, 2015

KristofferC commented Jun 24, 2017