cmd/compile: Fannkuch11 on AMD64 slow down 6% after removing assembler backend instruction reordering #18977
CL https://golang.org/cl/38431 mentions this issue.
Old loops look like this:

    loop:
      CMPQ ...
      JGE exit
      ...
      JMP loop
    exit:

New loops look like this:

      JMP entry
    loop:
      ...
    entry:
      CMPQ ...
      JLT loop

This removes one instruction (the unconditional jump) from the inner loop. Kinda surprisingly, it matters.

This is a bit different than the peeling that the old obj library did in that we don't duplicate the loop exit test. We just jump to the test. I'm not sure if it is better or worse to do that (peeling gets rid of the JMP but means more code duplication), but this CL is certainly a much simpler compiler change, so I'll try this way first.

The obj library used to do peeling before CL https://go-review.googlesource.com/c/36205 turned it off.

Fixes #15837 (remove obj instruction reordering)
The reordering is already removed, this CL implements the only part of that reordering that we'd like to keep.

Fixes #14758 (append loop)

    name    old time/op  new time/op  delta
    Foo-12  817ns ± 4%   538ns ± 0%   -34.08%  (p=0.000 n=10+9)
    Bar-12  850ns ±11%   570ns ±13%   -32.88%  (p=0.000 n=10+10)

Update #19595 (BLAS slowdown)

    name                       old time/op  new time/op  delta
    DgemvMedMedNoTransIncN-12  13.2µs ± 9%  10.2µs ± 1%  -22.26%  (p=0.000 n=9+9)

Fixes #19633 (append loop)

    name    old time/op  new time/op  delta
    Foo-12  810ns ± 1%   540ns ± 0%   -33.30%  (p=0.000 n=8+9)

Update #18977 (Fannkuch11 regression)

    name          old time/op  new time/op  delta
    Fannkuch11-8  2.80s ± 0%   3.01s ± 0%   +7.47%  (p=0.000 n=9+10)

This one makes no sense. There's strictly 1 less instruction in the inner loop (17 instead of 18). They are exactly the same instructions except for the JMP that has been elided.

go1 benchmarks generally don't look very impressive. But the gains for the specific issues above make this CL still probably worth it.

    name                     old time/op  new time/op  delta
    BinaryTree17-8            2.32s ± 0%   2.34s ± 0%  +1.14%  (p=0.000 n=9+7)
    Fannkuch11-8              2.80s ± 0%   3.01s ± 0%  +7.47%  (p=0.000 n=9+10)
    FmtFprintfEmpty-8        44.1ns ± 1%  46.1ns ± 1%  +4.53%  (p=0.000 n=10+10)
    FmtFprintfString-8       67.8ns ± 0%  74.4ns ± 1%  +9.80%  (p=0.000 n=10+9)
    FmtFprintfInt-8          74.9ns ± 0%  78.4ns ± 0%  +4.67%  (p=0.000 n=8+10)
    FmtFprintfIntInt-8        117ns ± 1%   123ns ± 1%  +4.69%  (p=0.000 n=9+10)
    FmtFprintfPrefixedInt-8   160ns ± 1%   146ns ± 0%  -8.22%  (p=0.000 n=8+10)
    FmtFprintfFloat-8         214ns ± 0%   206ns ± 0%  -3.91%  (p=0.000 n=8+8)
    FmtManyArgs-8             468ns ± 0%   497ns ± 1%  +6.09%  (p=0.000 n=8+10)
    GobDecode-8              6.16ms ± 0%  6.21ms ± 1%  +0.76%  (p=0.000 n=9+10)
    GobEncode-8              4.90ms ± 0%  4.92ms ± 1%  +0.37%  (p=0.028 n=9+10)
    Gzip-8                    209ms ± 0%   212ms ± 0%  +1.33%  (p=0.000 n=10+10)
    Gunzip-8                 36.6ms ± 0%  38.0ms ± 1%  +4.03%  (p=0.000 n=9+9)
    HTTPClientServer-8       84.2µs ± 0%  86.0µs ± 1%  +2.14%  (p=0.000 n=9+9)
    JSONEncode-8             13.6ms ± 3%  13.8ms ± 1%  +1.55%  (p=0.003 n=9+10)
    JSONDecode-8             53.2ms ± 5%  52.9ms ± 0%    ~     (p=0.280 n=10+10)
    Mandelbrot200-8          3.78ms ± 0%  3.78ms ± 1%    ~     (p=0.661 n=10+9)
    GoParse-8                2.89ms ± 0%  2.94ms ± 2%  +1.50%  (p=0.000 n=10+10)
    RegexpMatchEasy0_32-8    68.5ns ± 2%  68.9ns ± 1%    ~     (p=0.136 n=10+10)
    RegexpMatchEasy0_1K-8     220ns ± 1%   225ns ± 1%  +2.41%  (p=0.000 n=10+10)
    RegexpMatchEasy1_32-8    64.7ns ± 0%  64.5ns ± 0%  -0.28%  (p=0.042 n=10+10)
    RegexpMatchEasy1_1K-8     348ns ± 1%   355ns ± 0%  +1.90%  (p=0.000 n=10+10)
    RegexpMatchMedium_32-8    102ns ± 1%   105ns ± 1%  +2.95%  (p=0.000 n=10+10)
    RegexpMatchMedium_1K-8   33.1µs ± 3%  32.5µs ± 0%  -1.75%  (p=0.000 n=10+10)
    RegexpMatchHard_32-8     1.71µs ± 1%  1.70µs ± 1%  -0.84%  (p=0.002 n=10+9)
    RegexpMatchHard_1K-8     51.1µs ± 0%  50.8µs ± 1%  -0.48%  (p=0.004 n=10+10)
    Revcomp-8                 411ms ± 1%   402ms ± 0%  -2.22%  (p=0.000 n=10+9)
    Template-8               61.8ms ± 1%  59.7ms ± 0%  -3.44%  (p=0.000 n=9+9)
    TimeParse-8               306ns ± 0%   318ns ± 0%  +3.83%  (p=0.000 n=10+10)
    TimeFormat-8              320ns ± 0%   318ns ± 1%  -0.53%  (p=0.012 n=7+10)

Change-Id: Ifaf29abbe5874e437048e411ba8f7cfbc9e1c94b
Reviewed-on: https://go-review.googlesource.com/38431
Run-TryBot: Keith Randall <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: David Chase <[email protected]>
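For concreteness, here is the kind of Go loop the description above is about (a generic example chosen for illustration, not code taken from the CL):

```go
// sum is a plain counted loop of the shape discussed above (illustrative
// only; the exact instructions the compiler emits will differ). The
// i < len(xs) test corresponds to the CMPQ + conditional jump in the
// assembly sketches, and loop rotation decides whether that test sits at
// the top of the loop (with a JMP back to it at the bottom) or at the
// bottom (with no unconditional jump in the inner loop at all).
func sum(xs []int) int {
	s := 0
	for i := 0; i < len(xs); i++ {
		s += xs[i]
	}
	return s
}
```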
Tip (6897030) looks like a net win vs 1.8.1. Code size is smaller:
Execution time is faster, at least on my machine:
So whatever regression occurred has at least been compensated for, if not fixed, by CL 38431. I do see one notable regression, though, which is in the number of panicindex calls. I've filed #20332 for that. I'm going to close this as fixed. @cherrymui or @randall77, please reopen if you disagree.
@josharian, sad to bring bad news, but here is how it looks on my machine:
In other words, it is a ~10% regression.
@navytux will you re-measure including CL 43291?
let me try
Strangely, CL 43291 makes things only worse:
and I've double-checked it is reproducible via recompiling (
OK, re-opening. One other thing I see in tip vs 1.8.1 is that in 1.8.1, all the panicindex calls are at the end of the function, so jumps to them are always unlikely (forward jumps). In tip, they're scattered throughout. Not sure whether that's responsible, but I'll see if I can fix it and check whether it helps.
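To illustrate what the panicindex calls refer to (a generic example, not code from the benchmark): every slice index the compiler cannot prove in range compiles to a bounds check whose failure path calls runtime.panicindex, and the comment above is about where that failure path's code is laid out - at the end of the function (so the jump to it is a forward jump, treated as unlikely) or inline next to the access.

```go
// get contains one bounds-checked slice access (illustrative only).
// The generated code compares i against len(s) and, on failure, jumps to a
// small block that calls runtime.panicindex; the placement of that block
// within the function is what the comment above is about.
func get(s []int, i int) int {
	return s[i]
}
```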
@josharian thanks for the feedback and reopening.
CL https://golang.org/cl/43293 mentions this issue.
@navytux will you see what effect CL 43293 has on your machine? Thanks.
@josharian sure. Today's tip is
In other words, it is a partial improvement and we are now back to a 6% regression over go18:
For reference, here is how Fannkuch11 looks under
highlights:
To me, all this suggests that after CL 43293 or something similar is applied, the remaining problem is that:
Hope this helps a bit,
Thanks, Kirill, that's great. My guess was that CL 43293 was making the critical sections of code more compact, thus keeping more in cache; nice to have that confirmed. I'm working on making CL 43293 submittable now, and then will look again.
Josh, thanks for the feedback and patches. Offhand, I think that to get rid of this regression we need to fix the "Kinda surprisingly" part in 39ce590 (CL 38431) - in other words, to understand the root cause and apply loop rotation only when it is actually profitable, judging by the corresponding factors. The current state, it seems, is that we apply loop rotation always, and it sometimes helps and sometimes hurts. @randall77, @TocarIP please see #18977 (comment) for a bit of info about what is going on with Fannkuch11.
I'm going to dig a bit more, although I'll be mostly AFK for a while. In the meantime, Kirill, maybe you could confirm that disabling looprotate on your machine fixes the problem, by adding "return" at the very beginning of func loopRotate in cmd/compile/internal/ssa/looprotate.go to disable the pass.
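A minimal sketch of that experiment (assuming the pass still has the func loopRotate(f *Func) shape it had around this time; check the actual file before applying, this is a pointer rather than a patch):

```go
// cmd/compile/internal/ssa/looprotate.go (sketch; the original body of the
// pass is elided). With the early return in place the pass becomes a no-op,
// so rebuilding the toolchain and re-running Fannkuch11 shows whether loop
// rotation is what introduces the regression on this machine.
func loopRotate(f *Func) {
	return // experiment: disable the loop-rotation pass entirely
	// ... original pass body, unchanged ...
}
```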
Josh, thanks for the feedback. Sure, staying at the same tip = 482da51, here is what we have:
some comments: there are a bit more (0.2%) instructions, but insn/cycle grows from 2.10 to 2.13, and as a result the total CPU cycles are ~1% lower (which correlates with the total run time reported by perf and probably includes preparatory program actions).
insn/cycle grows from 2.13 to 2.15 (good), but there are 2.7% more total instructions, so overall cycles grow by 1.7% (which correlates with the total run time reported by perf). There are 12% more branches, and it looks like they are the only visible factor contributing to the instruction growth: δ(branches) / δ(insn) ≈ 1.
Here the number of instructions increases by 1.7%, mostly due to a 6% increase in N(branches) (δ(branches) / δ(insn) ≈ 0.82); however, the insn/cycle ratio also grows from 2.04 to 2.09, which overcompensates for all that, and the resulting N(cycles) decreases by ~1%. My observation:
( I've re-benchmarked the previous runs from this morning and re-verified all timings several times )
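A quick consistency check on the two δ(branches) / δ(insn) figures above, using only the percentages quoted there (the absolute counter values are not reproduced in this thread): writing B and I for the baseline branch and instruction counts,

$$\frac{\delta(\mathrm{branches})}{\delta(\mathrm{insn})} \approx \frac{0.12\,B}{0.027\,I} \approx 1 \quad\Longrightarrow\quad \frac{B}{I} \approx \frac{0.027}{0.12} \approx 0.23$$

Plugging that same B/I ≈ 0.23 into the second comparison gives (0.06 × 0.23) / 0.017 ≈ 0.8, close to the quoted 0.82, so in both runs most of the additional instructions are indeed branch instructions.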
I'm not an x86 guy (probably @TocarIP could help here), but I imagined: if there are several execution units inside the CPU with somewhat different properties, the overall load on those units could depend on the order of instructions in the instruction stream. With this thinking in mind I googled for "x86 add nop speedup" and found e.g. https://research.google.com/pubs/pub37077.html. Also, if we simplify and assume data loads/stores always hit the cache, what affects insn/cycle is the branch-misprediction rate per instruction, N(branch-misses) / N(insn) (note: not divided by N(branches)). The stackoverflow link and the MAO article above contain some information on what factors could affect branch prediction and other loop-related things. Coincidence or not, the discussion there is also usually about 5% - 7% speedups or slowdowns.
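The stackoverflow link referred to above is presumably the well-known "branch prediction on sorted vs. unsorted data" question; a minimal Go rendition of that experiment, purely as my own illustration (not code from this issue; on newer compilers the branch may be if-converted into a conditional move, which would hide the effect):

```go
// branch_test.go - the same work over the same data, but the x >= 128
// branch is perfectly predictable on sorted input and close to random
// otherwise, so the two benchmarks differ mainly in branch-misprediction
// cost. Run with: go test -bench=.
package branch

import (
	"math/rand"
	"sort"
	"testing"
)

var sink int // prevents the compiler from optimizing the call away

func sumAbove(xs []int, threshold int) int {
	s := 0
	for _, x := range xs {
		if x >= threshold {
			s += x
		}
	}
	return s
}

func run(b *testing.B, sorted bool) {
	xs := make([]int, 1<<16)
	r := rand.New(rand.NewSource(1))
	for i := range xs {
		xs[i] = r.Intn(256)
	}
	if sorted {
		sort.Ints(xs)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink = sumAbove(xs, 128)
	}
}

func BenchmarkUnsortedInput(b *testing.B) { run(b, false) }
func BenchmarkSortedInput(b *testing.B)   { run(b, true) }
```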
I've played a bit with https://github.com/andikleen/pmu-tools
Running ocperf.py record -e br_misp_retired.all_branches_pebs
and code layout:
Note that in this case the branch is unconditional!
I haven't had a chance to dig into this excellent data, but see also #20355 for something that seems definitely wrong with looprotate. It seems that looprotate is doing both good and bad work for this benchmark.
Also #20356.
It can be sped up 33% with -B. That might or might not be the same thing.
Correct. However, all runtime bounds checks in fannkuch can be statically proven to be unnecessary: that means the compiler can theoretically reach -B level with just clever BCE (bounds-check elimination).
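To illustrate the kind of BCE meant here (generic code, not the fannkuch source): whether the compiler can prove an index is in range decides whether the access carries a runtime check at all, which is exactly what -B removes wholesale. If I remember correctly, building with -gcflags='-d=ssa/check_bce/debug=1' reports which accesses still carry checks.

```go
// sumUpTo: n comes from the caller and may exceed len(p), so the compiler
// generally cannot prove i < len(p) and keeps a bounds check on p[i].
func sumUpTo(p []int, n int) int {
	s := 0
	for i := 0; i < n; i++ {
		s += p[i]
	}
	return s
}

// sumAll: ranging over the slice guarantees i < len(p), so the bounds
// check on p[i] is provably redundant and is eliminated.
func sumAll(p []int) int {
	s := 0
	for i := range p {
		s += p[i]
	}
	return s
}
```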
It wasn't spammy and it wasn't particularly noisy. #8717 led to episodic noise, but otherwise it reliably detected changes in multiple metrics within fractions of a percent. The dashboard wasn't disabled; it was broken by changes to the app and then never fixed.
The analysis looks great! Thanks!
@dvyukov thanks for the feedback.
We're punting more loop rotation work to 1.10.
Punting to 1.13, too late for anything major in 1.12.
CL https://go-review.googlesource.com/c/36205/ deletes assembler backend instruction reordering. This makes Fannkuch-11 on AMD64 slow down 6%. Figure out why and improve the compiler.