cmd/compile: BLAS regressions at tip #19595
Looks like there are some extra reg->reg moves in tip compared to 1.8. I strongly suspect the introduction of MULSDmem ops. Previously we had
which works fine with no reg moves. After introducing MULSDmem ops, we have
The op for x = m * load ... requires its output register (for x) to be the same as the input register (for m). That introduces two additional register moves: one to copy m to another register and one to move it back. The copy back then requires an additional branch instruction. Source code:
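To make the pattern concrete, here is a minimal Go sketch of the kind of inner loop described above (hypothetical function, not gonum's actual code): the multiplier m is live across every iteration, so fusing the load of v[i] into the multiply forces the extra register shuffling.

```go
package main

import "fmt"

// scaleSum multiplies each element of v by a constant m and accumulates
// the results. Because m is needed again on the next iteration, a fused
// multiply-from-memory op (e.g. MULSDmem), whose result must land in the
// same register as m, forces the compiler to copy m aside and restore it.
func scaleSum(m float64, v []float64) float64 {
	var sum float64
	for i := range v {
		sum += m * v[i] // m * load: candidate for load/op fusion
	}
	return sum
}

func main() {
	fmt.Println(scaleSum(2, []float64{1, 2, 3})) // 12
}
```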
1.8:
tip:
CL https://golang.org/cl/38337 mentions this issue.
@btracey Please try the patch above and see if it fixes everything for you.
It seems to have partially helped, but not fixed the problem. I'm not sure I applied it correctly though (see #19619)
It looks like all of the
I've taken a closer look at DgemvMedMedNoTransIncN. Its inner loop is gonum/internal/asm/f64/axpy.go:AxpyInc
whereas 1.8 is:
I'm pretty surprised this matters at the ~30% slowdown level. There might be something I'm missing.
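For reference, a pure-Go loop of the AxpyInc shape looks roughly like this (a simplified sketch; gonum's real AxpyInc takes uintptr strides and offsets). The body is just a fused multiply-add plus two strided index updates, which is small enough that one extra jump per iteration is measurable.

```go
package main

import "fmt"

// axpyInc computes y[iy] += alpha * x[ix] over strided slices, the shape
// of gonum's AxpyInc inner loop (simplified signature for illustration).
func axpyInc(alpha float64, x, y []float64, n, incX, incY int) {
	ix, iy := 0, 0
	for i := 0; i < n; i++ {
		y[iy] += alpha * x[ix]
		ix += incX
		iy += incY
	}
}

func main() {
	x := []float64{1, 2, 3}
	y := []float64{10, 20, 30}
	axpyInc(2, x, y, 3, 1, 1)
	fmt.Println(y) // [12 24 36]
}
```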
We might be able to get the effect of the loop peel just by changing the weights on block ordering -- for loops where the header block has a loop exit edge, make the branch to the header be lower than the backedge to the header.
@dr2chase I feel like that would interact badly with the register allocator. If you put the loop header block last, then the 2nd block of the loop appears first in the schedule. But none of its predecessors have been processed yet, so the register allocator won't know what register state to start with. It could be fixed, I suppose, computing a regalloc schedule instead of just doing blocks in order.
Which isn't quite the same as 1.8 either. Better or worse? I'm not sure.
When I did the explicit loop peel to fix `for i, e := range` slice loops in the presence of backedge rescheduling checks, PPC did get a little faster.
CL https://golang.org/cl/38431 mentions this issue.
We want to merge a load and op into a single instruction

    l = LOAD ptr mem
    y = OP x l

into

    y = OPload x ptr mem

However, all of our OPload instructions require that y uses the same register as x. If x is needed past this instruction, then we must copy x somewhere else, losing the whole benefit of merging the instructions in the first place.

Disable this optimization if x is live past the OP. Also disable this optimization if the OP is in a deeper loop than the load.

Update #19595

Change-Id: I87f596aad7e91c9127bfb4705cbae47106e1e77a
Reviewed-on: https://go-review.googlesource.com/38337
Reviewed-by: Ilya Tocar <[email protected]>
Regalloc uses loop depths - make sure they are initialized! Test to make sure we aren't pushing spills into loops. This fixes a generated-code performance bug introduced with the better spill placement change: https://go-review.googlesource.com/c/34822/

Update #19595

Change-Id: Ib9f0da6fb588503518847d7aab51e569fd3fa61e
Reviewed-on: https://go-review.googlesource.com/38434
Reviewed-by: David Chase <[email protected]>
CL https://golang.org/cl/38434 mentions this issue.
Old loops look like this:

    loop:
      CMPQ ...
      JGE exit
      ...
      JMP loop
    exit:

New loops look like this:

    JMP entry
    loop:
      ...
    entry:
      CMPQ ...
      JLT loop

This removes one instruction (the unconditional jump) from the inner loop. Kinda surprisingly, it matters.

This is a bit different than the peeling that the old obj library did in that we don't duplicate the loop exit test. We just jump to the test. I'm not sure if it is better or worse to do that (peeling gets rid of the JMP but means more code duplication), but this CL is certainly a much simpler compiler change, so I'll try this way first. The obj library used to do peeling before CL https://go-review.googlesource.com/c/36205 turned it off.

Fixes #15837 (remove obj instruction reordering) -- the reordering is already removed; this CL implements the only part of that reordering that we'd like to keep.

Fixes #14758 (append loop)

    name    old time/op  new time/op  delta
    Foo-12  817ns ± 4%   538ns ± 0%   -34.08%  (p=0.000 n=10+9)
    Bar-12  850ns ±11%   570ns ±13%   -32.88%  (p=0.000 n=10+10)

Update #19595 (BLAS slowdown)

    name                       old time/op  new time/op  delta
    DgemvMedMedNoTransIncN-12  13.2µs ± 9%  10.2µs ± 1%  -22.26%  (p=0.000 n=9+9)

Fixes #19633 (append loop)

    name    old time/op  new time/op  delta
    Foo-12  810ns ± 1%   540ns ± 0%   -33.30%  (p=0.000 n=8+9)

Update #18977 (Fannkuch11 regression)

    name          old time/op  new time/op  delta
    Fannkuch11-8  2.80s ± 0%   3.01s ± 0%   +7.47%  (p=0.000 n=9+10)

This one makes no sense. There's strictly 1 less instruction in the inner loop (17 instead of 18). They are exactly the same instructions except for the JMP that has been elided.

go1 benchmarks generally don't look very impressive. But the gains for the specific issues above make this CL still probably worth it.

    name                     old time/op  new time/op  delta
    BinaryTree17-8           2.32s ± 0%   2.34s ± 0%   +1.14%  (p=0.000 n=9+7)
    Fannkuch11-8             2.80s ± 0%   3.01s ± 0%   +7.47%  (p=0.000 n=9+10)
    FmtFprintfEmpty-8        44.1ns ± 1%  46.1ns ± 1%  +4.53%  (p=0.000 n=10+10)
    FmtFprintfString-8       67.8ns ± 0%  74.4ns ± 1%  +9.80%  (p=0.000 n=10+9)
    FmtFprintfInt-8          74.9ns ± 0%  78.4ns ± 0%  +4.67%  (p=0.000 n=8+10)
    FmtFprintfIntInt-8       117ns ± 1%   123ns ± 1%   +4.69%  (p=0.000 n=9+10)
    FmtFprintfPrefixedInt-8  160ns ± 1%   146ns ± 0%   -8.22%  (p=0.000 n=8+10)
    FmtFprintfFloat-8        214ns ± 0%   206ns ± 0%   -3.91%  (p=0.000 n=8+8)
    FmtManyArgs-8            468ns ± 0%   497ns ± 1%   +6.09%  (p=0.000 n=8+10)
    GobDecode-8              6.16ms ± 0%  6.21ms ± 1%  +0.76%  (p=0.000 n=9+10)
    GobEncode-8              4.90ms ± 0%  4.92ms ± 1%  +0.37%  (p=0.028 n=9+10)
    Gzip-8                   209ms ± 0%   212ms ± 0%   +1.33%  (p=0.000 n=10+10)
    Gunzip-8                 36.6ms ± 0%  38.0ms ± 1%  +4.03%  (p=0.000 n=9+9)
    HTTPClientServer-8       84.2µs ± 0%  86.0µs ± 1%  +2.14%  (p=0.000 n=9+9)
    JSONEncode-8             13.6ms ± 3%  13.8ms ± 1%  +1.55%  (p=0.003 n=9+10)
    JSONDecode-8             53.2ms ± 5%  52.9ms ± 0%  ~       (p=0.280 n=10+10)
    Mandelbrot200-8          3.78ms ± 0%  3.78ms ± 1%  ~       (p=0.661 n=10+9)
    GoParse-8                2.89ms ± 0%  2.94ms ± 2%  +1.50%  (p=0.000 n=10+10)
    RegexpMatchEasy0_32-8    68.5ns ± 2%  68.9ns ± 1%  ~       (p=0.136 n=10+10)
    RegexpMatchEasy0_1K-8    220ns ± 1%   225ns ± 1%   +2.41%  (p=0.000 n=10+10)
    RegexpMatchEasy1_32-8    64.7ns ± 0%  64.5ns ± 0%  -0.28%  (p=0.042 n=10+10)
    RegexpMatchEasy1_1K-8    348ns ± 1%   355ns ± 0%   +1.90%  (p=0.000 n=10+10)
    RegexpMatchMedium_32-8   102ns ± 1%   105ns ± 1%   +2.95%  (p=0.000 n=10+10)
    RegexpMatchMedium_1K-8   33.1µs ± 3%  32.5µs ± 0%  -1.75%  (p=0.000 n=10+10)
    RegexpMatchHard_32-8     1.71µs ± 1%  1.70µs ± 1%  -0.84%  (p=0.002 n=10+9)
    RegexpMatchHard_1K-8     51.1µs ± 0%  50.8µs ± 1%  -0.48%  (p=0.004 n=10+10)
    Revcomp-8                411ms ± 1%   402ms ± 0%   -2.22%  (p=0.000 n=10+9)
    Template-8               61.8ms ± 1%  59.7ms ± 0%  -3.44%  (p=0.000 n=9+9)
    TimeParse-8              306ns ± 0%   318ns ± 0%   +3.83%  (p=0.000 n=10+10)
    TimeFormat-8             320ns ± 0%   318ns ± 1%   -0.53%  (p=0.012 n=7+10)

Change-Id: Ifaf29abbe5874e437048e411ba8f7cfbc9e1c94b
Reviewed-on: https://go-review.googlesource.com/38431
Run-TryBot: Keith Randall <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: David Chase <[email protected]>
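The Foo and Bar append benchmarks above aren't reproduced in this thread; a loop of the shape those numbers suggest (my guess at the pattern, not the actual benchmark source) would be a tight append loop like this, where the rotation moves the bounds test to the bottom and drops the unconditional JMP from the hot path:

```go
package main

import "fmt"

// appendLoop is the kind of tight append loop the CL speeds up: with the
// rotated layout, each iteration ends in a conditional branch back to the
// loop body instead of a compare, a conditional exit, and a JMP.
func appendLoop(n int) []int {
	s := make([]int, 0, n) // preallocate so append never reallocates
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

func main() {
	fmt.Println(appendLoop(5)) // [0 1 2 3 4]
}
```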
@btracey would you mind re-measuring with the latest tip, to assess whether this still might need fixes for 1.9? Thanks.
Here are the current stats. Mostly better.
But a few non-trivial regressions still. If you're up for it, would you mind patching in CLs 43293 and 43491 and seeing whether they help any? (No particular reason to think they will, other than that they have a significant effect on code layout in general.)
With both applied:
Thanks. Looks like those CLs are a mixed bag, maybe a bit worse on balance. To confirm, are these latest numbers from 1.8 to tip+2CLs? I ask because the before numbers, e.g. for the last benchmark, don't line up: 10.4us two comments back vs 9.84us in the last comment. If you have all the benchmark files around, could you post the benchstat going from tip to tip+2CLs, maybe running with -geomean? Ta.
It was 1.8.1 to the most recent. I'm running the benchmarks on my laptop (which I've been using all day), so the benchmarks could shift a bit. This is why I re-ran the 1.8.1 results (and why they don't exactly match up).
Punting to 1.10.
@btracey - Comparing 1.8.7 with 1.12beta1, I see improvements
Kindly check and let us know if there is anything else to be done here. Thanks. |
Timed out in state WaitingForInfo. Closing. (I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)
Comparing go1.8 with go version devel +b9f6b22 Fri Mar 17 13:59:31 2017 +0000 darwin/amd64.
For all the benchmarks, the package is github.com/gonum/blas/native. Below, the noasm tag makes the code pure Go, rather than the ASM implementations of ddot and daxpy (which use SIMD and thus make the code significantly faster).

The first benchmark is Daxpy. The command is:

    go test -bench Daxpy -tags noasm -count 5
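For readers without the gonum tree handy, a unit-stride pure-Go daxpy kernel of the kind the noasm build falls back to looks roughly like this (a minimal sketch, not gonum's exact code):

```go
package main

import "fmt"

// daxpy computes y = alpha*x + y for unit-stride vectors -- the pure-Go
// fallback shape that the noasm tag selects instead of the SIMD assembly.
func daxpy(alpha float64, x, y []float64) {
	for i, v := range x {
		y[i] += alpha * v
	}
}

func main() {
	x := []float64{1, 1, 1}
	y := []float64{1, 2, 3}
	daxpy(3, x, y)
	fmt.Println(y) // [4 5 6]
}
```

In a _test.go file, a `func BenchmarkDaxpy(b *testing.B)` wrapper around such a kernel is what `go test -bench Daxpy` would drive.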
There are significant slowdowns at the smaller sizes, though they disappear at the larger sizes. I'm not sure these explain the regressions we're seeing in Dgemv, so it may be a separate issue.

The command is:

    go test -bench Dgemv -tags noasm -count 5

As we see, here some of the largest regressions are at the larger sizes. These additionally work out to about a 15% regression on Dgemm for almost all sizes, and about a 15% regression for the yet higher-level Cholesky decomposition.