cmd/compile: suboptimal assembly for a + b + 1 #31900
Comments
Splitting the 3-operand LEA instruction into two 2-operand LEA instructions is only faster in tight loops that fit into the µOp cache. If there is no tight loop, the cost of decoding two instructions instead of one is higher than the one cycle the two 2-operand LEAs save during execution. For the example above, the 3-operand LEA will be faster than two LEAs or ADDs. The situation changes if the function is inlined into a loop that fits the µOp cache. Intel formulates the following rule in the Intel® 64 and IA-32 Architectures Optimization Reference Manual: "Assembly/Compiler Coding Rule 33. (ML impact, L generality) If an LEA instruction using the …"
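For reference, the two lowerings under discussion look roughly like this; the register choices are illustrative rather than taken from actual compiler output:

// Kept as a single 3-operand LEA: only one instruction to decode and to
// hold in the µOp cache, but, per the comment above, one extra cycle of
// latency on the Intel cores in question.
//     LEAQ 1(AX)(CX*1), AX
//
// Split into two 2-operand LEAs: one cycle less latency on the dependency
// chain, at the cost of decoding and caching a second instruction.
//     LEAQ (AX)(CX*1), AX
//     LEAQ 1(AX), AX
func addBothPlusOne(a, b int) int { // hypothetical helper, shown only to anchor the sequences
	return a + b + 1
}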
I'm fine with reverting the LEA splitting to keep things simple and to reduce binary size. However, the last time I benchmarked this, decoding two 2-operand LEAs was better than the latency of a 3-operand LEA. Maybe that has changed and they now require more than one or two cycles combined for decoding; otherwise I do not understand why they would be slower in the example above. The Intel rule also only seems to suggest to me that the 3-operand LEA is better for trace-cache utilization, not that two LEAs are only better when served from the trace cache. I need to have a look again.
@ulikunitz can you post your CPU type and the benchmark and measurements that show the 3-operand LEA being faster than two 2-operand LEAs in the example above, for comparison? Thanks.
Generally I think we should prefer ADD 1 over INC: the latter only has the upside of size, while the former seems never slower (some architectures take an extra cycle for the flag update of INC), fuses better, and doesn't partially update flags.
I have microbenchmarked the three variants of the function on multiple platforms: variant 1 is ADDQ + INCQ, variant 2 is two 2-operand LEAQs, and variant 3 is a single 3-operand LEAQ. On Intel, variant 3 is always the fastest. The 3-operand LEAQ was only slower on an old Athlon X2 5600. I used the following commands to produce the output.
Skylake (Xeon 2GHz)
Haswell
Nehalem (before Sandy Bridge, where the LEAQ implementation changed)
Athlon X2
The assembler code for reference:
Thank you very much for the data. A difference of 0.03 ns does not usually mean something is generally slower; it looks to be within benchmarking variance (going by the ±6%), which we will not be able to completely narrow down even with higher counts, whereas 0.2 or 0.3 ns likely does mean it is slower by a clock cycle. If these were really run as parallel benchmarks: I usually run with -cpu=1 for these to reduce load and interference, and I disable frequency scaling and turbo boost too. I will run the two-2-operand-LEA vs. 3-operand-LEA variants on my benchmarking computer once I am near it.
Currently I only have my laptop at hand (i7-3520M, 2.9 GHz, Ivy Bridge) to make a quick benchmark. On it, disabling the slow-LEA splitting (go tip 83f205f, with src/cmd/compile/internal/amd64/ssa.go line 608 as of 2e4edf4 set to false) makes the benchmarks 1 clock cycle (around 0.3 ns) slower. I regard 0.1 ns as within variance from runs here. old = go tip 83f205f, new = the same with the splitting disabled. The benchstat output, command, benchmarks, and functions I used are reproduced in the quoted reply below.
Martin,
could you provide the code on github.com? I would like to check it on the
machines I have access to.
Kind regards,
Ulrich
On Sun, May 12, 2019 at 09:22, Martin Möhrmann <
[email protected]> wrote:
… Currently I only have my laptop at hand (i7-3520M, 2.9 GHz) to make a quick
benchmark.
On it, disabling the slow-LEA splitting (go tip 83f205,
https://github.com/golang/go/blob/2e4edf46977994c9d26df9327f0e41c1b60f3435/src/cmd/compile/internal/amd64/ssa.go#L608
set to false)
makes the benchmarks 1 clock cycle (around 0.3 ns) slower. I regard 0.1 ns
as within variance from runs here.
benchstat ~/lea2.bench ~/lea3.bench
name old time/op new time/op delta
LEA22_1_noinline 4.24ns ± 0% 4.52ns ± 0% +6.70% (p=0.000 n=8+10)
LEA22_4_noinline 4.31ns ± 2% 4.61ns ± 2% +6.91% (p=0.000 n=10+10)
LEA22_1_inline 0.58ns ± 2% 0.87ns ± 4% +50.05% (p=0.000 n=10+10)
LEA22_4_inline 0.59ns ± 3% 0.97ns ± 3% +64.47% (p=0.000 n=10+9)
I used go tool objdump -s BenchmarkLEA22_4_inline to check that the
expected LEA instructions were emitted.
Command
go test -cpu=1 -count=10 -bench=.*
Benchmarks
var global int

func BenchmarkLEA22_1_noinline(b *testing.B) {
	var sink int
	for i := 0; i < b.N; i++ {
		sink = lea22_1_noinline(sink, sink)
	}
	global = sink
}

func BenchmarkLEA22_4_noinline(b *testing.B) {
	var sink int
	for i := 0; i < b.N; i++ {
		sink = lea22_4_noinline(sink, sink)
	}
	global = sink
}

func BenchmarkLEA22_1_inline(b *testing.B) {
	var sink int
	for i := 0; i < b.N; i++ {
		sink = lea22_1_inline(sink, sink)
	}
	global = sink
}

func BenchmarkLEA22_4_inline(b *testing.B) {
	var sink int
	for i := 0; i < b.N; i++ {
		sink = lea22_4_inline(sink, sink)
	}
	global = sink
}

Functions

//go:noinline
func lea22_1_noinline(a, b int) int {
	return 1 + a + b
}

func lea22_1_inline(a, b int) int {
	return 1 + a + b
}

//go:noinline
func lea22_4_noinline(a, b int) int {
	return 1 + (a + 4*b)
}

func lea22_4_inline(a, b int) int {
	return 1 + (a + 4*b)
}
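These snippets omit the test-file header; to build them as shown, the file would also need roughly the following lines (the package name is a placeholder of mine, not from the original message):

package lea_test // placeholder package name

import "testing"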
Change https://golang.org/cl/176622 mentions this issue:
Uploaded as https://go-review.googlesource.com/c/go/+/176622; it also contains the change to generate slow LEAs in src/cmd/compile/internal/amd64/ssa.go. For the original issue posted: to make this work, it would be nice if we had a last rule-based optimization pass, after the normal passes, to do these kinds of low-level transformations that should not interfere with other rules but need to be done before emitting instructions. This would also make the amd64/ssa.go code simpler. Replacing MOV with XOR and some other optimizations could fit this category as well; see #27034, where I had commented in that direction.
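As a purely illustrative sketch of that idea (this is not cmd/compile's actual API or data structures, just a toy model of a final peephole-style pass that runs after all other rewrite rules and splits a slow LEA right before emission):

package main

import "fmt"

// insn is a toy stand-in for an already-selected machine instruction.
type insn struct {
	op   string
	args []string // operands, e.g. "1", "AX", "CX*1"
	dst  string
}

// splitSlowLEA rewrites a 3-operand LEAQ (offset + base + index) into two
// 2-operand LEAQs. Doing this in a dedicated final pass keeps it from
// interfering with the normal rewrite rules while still happening before
// instructions are emitted.
func splitSlowLEA(prog []insn) []insn {
	out := make([]insn, 0, len(prog))
	for _, in := range prog {
		if in.op == "LEAQ" && len(in.args) == 3 {
			off, base, index := in.args[0], in.args[1], in.args[2]
			out = append(out,
				insn{op: "LEAQ", args: []string{base, index}, dst: in.dst}, // dst = base + index
				insn{op: "LEAQ", args: []string{off, in.dst}, dst: in.dst}, // dst = dst + off
			)
			continue
		}
		out = append(out, in)
	}
	return out
}

func main() {
	prog := []insn{{op: "LEAQ", args: []string{"1", "AX", "CX*1"}, dst: "AX"}}
	fmt.Println(splitSlowLEA(prog))
}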
Update:
On amd64, a + b + 1 currently assembles to two 2-operand LEAQ instructions. I believe the two LEAQs should instead be ADDQ CX, AX; INCQ AX. This would be six bytes of instructions instead of 8. I need to double-check, but I believe that this happens because we optimize a + b + const to LEAQ const(a)(b*1), dst, but then break apart the three-part LEAQ. We can't fix this when lowering to instructions, because LEAQ doesn't clobber flags and ADD and INC do. So we need to somehow catch this in the rewrite rules. I wonder whether we should reconsider the previous strategy of breaking apart "slow" LEAQs at the last minute (for #21735).
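For concreteness, here is a sketch of the example being described; the function name, registers, and per-instruction byte counts are my reading of the report above rather than actual compiler output:

// A function of the shape named in the issue title (name is illustrative):
func f(a, b int) int {
	return a + b + 1
}

// Current output once the 3-operand LEAQ has been broken apart
// (roughly 4 bytes per instruction, 8 bytes in total):
//     LEAQ (AX)(CX*1), AX
//     LEAQ 1(AX), AX
//
// Suggested alternative (roughly 3 bytes per instruction, 6 bytes in total),
// which cannot be produced while lowering to instructions because ADDQ and
// INCQ clobber flags and LEAQ does not:
//     ADDQ CX, AX
//     INCQ AX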
cc @martisch @randall77