cmd/compile: suboptimal assembly for a + b + 1 #31900
Comments
Splitting the 3-operand LEA instruction into two 2-operand LEA instructions is only faster in tight loops that fit into the µOp cache. If there is no tight loop, the cost of decoding two instructions instead of one is higher than the one cycle the two 2-operand LEAs save during execution. For the example above, the 3-operand LEA will be faster than two LEAs or ADDs. The situation changes if the function is inlined into a loop that fits the µOp cache. Intel formulates the following rule in the Intel® 64 and IA-32 Architectures Optimization Reference Manual: "Assembly/Compiler Coding Rule 33. (ML impact, L generality) If an LEA instruction using the …"
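For reference, the two lowerings under discussion look roughly like this; the register choices are illustrative rather than taken from actual compiler output:

// Kept as a single 3-operand LEA: only one instruction to decode and to
// hold in the µOp cache, but, per the comment above, one extra cycle of
// latency on the Intel cores in question.
//     LEAQ 1(AX)(CX*1), AX
//
// Split into two 2-operand LEAs: one cycle less latency on the dependency
// chain, at the cost of decoding and caching a second instruction.
//     LEAQ (AX)(CX*1), AX
//     LEAQ 1(AX), AX
func addBothPlusOne(a, b int) int { // hypothetical helper, shown only to anchor the sequences
	return a + b + 1
}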
I'm fine with reverting the LEA splitting to keep things simple and to reduce binary size. However, the last time I benchmarked this, decoding two 2-operand LEAs was better than the latency of a 3-operand LEA. Maybe that has changed and they now require more than one or two cycles combined for decoding; otherwise I do not understand why they would be slower in the example above. The Intel rule also only seems to suggest to me that the 3-operand LEA is better for trace-cache utilization, not that two LEAs are only better when served from the trace cache. I need to have a look again.
@ulikunitz can you post your CPU type and the benchmark and measurements that show the 3-operand LEA being faster than two 2-operand LEAs in the example above, for comparison? Thanks.
Generally I think we should prefer ADD 1 over INC: the latter only has the upside of size, while the former seems never slower (some architectures take an extra cycle for the flag update of INC), fuses better, and doesn't partially update flags.
I have microbenchmarked the three variants of the function on multiple platforms: variant 1 is ADDQ + INCQ, variant 2 is two 2-operand LEAQs, and variant 3 is a single 3-operand LEAQ. On Intel, variant 3 is always the fastest. The 3-operand LEAQ was only slower on an old Athlon X2 5600. I used the following commands to produce the output.
Skylake (Xeon 2GHz)
Haswell
Nehalem (before Sandy Bridge, where the LEAQ implementation changed)
Athlon X2
The assembler code for reference:
Thank you very much for the data. A difference of 0.03 ns does not usually mean something is generally slower; it looks to be within benchmarking variance (going by the ±6%), which we will not be able to completely narrow down even with higher counts, whereas 0.2 or 0.3 ns likely does mean it is slower by a clock cycle. If these were really run as parallel benchmarks: I usually run with -cpu=1 for these to reduce load and interference, and I disable frequency scaling and turbo boost too. I will run the two-2-operand-LEA vs. 3-operand-LEA variants on my benchmarking computer once I am near it.
Currently I only have my laptop at hand (i7-3520M, 2.9 GHz, Ivy Bridge) to make a quick benchmark. On it, disabling the slow-LEA splitting (go tip 83f205f, with src/cmd/compile/internal/amd64/ssa.go line 608 as of 2e4edf4 set to false) makes the benchmarks 1 clock cycle (around 0.3 ns) slower. I regard 0.1 ns as within variance from runs here. old = go tip 83f205f, new = the same with the splitting disabled. The benchstat output, command, benchmarks, and functions I used are reproduced in the quoted reply below.
Martin,
could you provide the code on github.com? I would like to check it on the
machines I have access to.
Kind regards,
Ulrich
On Sun, May 12, 2019 at 09:22, Martin Möhrmann <
[email protected]> wrote:
… Currently I only have my laptop at hand (i7-3520M, 2.9 GHz) to make a quick
benchmark.
On it, disabling the slow-LEA splitting (go tip 83f205,
https://github.com/golang/go/blob/2e4edf46977994c9d26df9327f0e41c1b60f3435/src/cmd/compile/internal/amd64/ssa.go#L608
set to false)
makes the benchmarks 1 clock cycle (around 0.3 ns) slower. I regard 0.1 ns
as within variance from runs here.
benchstat ~/lea2.bench ~/lea3.bench
name old time/op new time/op delta
LEA22_1_noinline 4.24ns ± 0% 4.52ns ± 0% +6.70% (p=0.000 n=8+10)
LEA22_4_noinline 4.31ns ± 2% 4.61ns ± 2% +6.91% (p=0.000 n=10+10)
LEA22_1_inline 0.58ns ± 2% 0.87ns ± 4% +50.05% (p=0.000 n=10+10)
LEA22_4_inline 0.59ns ± 3% 0.97ns ± 3% +64.47% (p=0.000 n=10+9)
I used go tool objdump -s BenchmarkLEA22_4_inline to check that the
expected LEA instructions were emitted.
Command
go test -cpu=1 -count=10 -bench=.*
Benchmarks
var global int

func BenchmarkLEA22_1_noinline(b *testing.B) {
	var sink int
	for i := 0; i < b.N; i++ {
		sink = lea22_1_noinline(sink, sink)
	}
	global = sink
}

func BenchmarkLEA22_4_noinline(b *testing.B) {
	var sink int
	for i := 0; i < b.N; i++ {
		sink = lea22_4_noinline(sink, sink)
	}
	global = sink
}

func BenchmarkLEA22_1_inline(b *testing.B) {
	var sink int
	for i := 0; i < b.N; i++ {
		sink = lea22_1_inline(sink, sink)
	}
	global = sink
}

func BenchmarkLEA22_4_inline(b *testing.B) {
	var sink int
	for i := 0; i < b.N; i++ {
		sink = lea22_4_inline(sink, sink)
	}
	global = sink
}

Functions

//go:noinline
func lea22_1_noinline(a, b int) int {
	return 1 + a + b
}

func lea22_1_inline(a, b int) int {
	return 1 + a + b
}

//go:noinline
func lea22_4_noinline(a, b int) int {
	return 1 + (a + 4*b)
}

func lea22_4_inline(a, b int) int {
	return 1 + (a + 4*b)
}
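These snippets omit the test-file header; to build them as shown, the file would also need roughly the following lines (the package name is a placeholder of mine, not from the original message):

package lea_test // placeholder package name

import "testing"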
Change https://golang.org/cl/176622 mentions this issue:
Uploaded as https://go-review.googlesource.com/c/go/+/176622; it also contains the change to generate slow LEAs in src/cmd/compile/internal/amd64/ssa.go. For the original issue posted: to make this work, it would be nice if we had a last rule-based optimization pass, after the normal passes, to do these kinds of low-level transformations that should not interfere with other rules but need to be done before emitting instructions. This would also make the amd64/ssa.go code simpler. Replacing MOV with XOR and some other optimizations could fit this category as well; see #27034, where I had commented in that direction.
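As a purely illustrative sketch of that idea (this is not cmd/compile's actual API or data structures, just a toy model of a final peephole-style pass that runs after all other rewrite rules and splits a slow LEA right before emission):

package main

import "fmt"

// insn is a toy stand-in for an already-selected machine instruction.
type insn struct {
	op   string
	args []string // operands, e.g. "1", "AX", "CX*1"
	dst  string
}

// splitSlowLEA rewrites a 3-operand LEAQ (offset + base + index) into two
// 2-operand LEAQs. Doing this in a dedicated final pass keeps it from
// interfering with the normal rewrite rules while still happening before
// instructions are emitted.
func splitSlowLEA(prog []insn) []insn {
	out := make([]insn, 0, len(prog))
	for _, in := range prog {
		if in.op == "LEAQ" && len(in.args) == 3 {
			off, base, index := in.args[0], in.args[1], in.args[2]
			out = append(out,
				insn{op: "LEAQ", args: []string{base, index}, dst: in.dst}, // dst = base + index
				insn{op: "LEAQ", args: []string{off, in.dst}, dst: in.dst}, // dst = dst + off
			)
			continue
		}
		out = append(out, in)
	}
	return out
}

func main() {
	prog := []insn{{op: "LEAQ", args: []string{"1", "AX", "CX*1"}, dst: "AX"}}
	fmt.Println(splitSlowLEA(prog))
}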
Update:
On amd64, a + b + 1 currently assembles to two 2-operand LEAQ instructions. I believe the two LEAQs should instead be ADDQ CX, AX; INCQ AX. This would be six bytes of instructions instead of 8. I need to double-check, but I believe that this happens because we optimize a + b + const to LEAQ const(a)(b*1), dst, but then break apart the three-part LEAQ. We can't fix this when lowering to instructions, because LEAQ doesn't clobber flags and ADD and INC do. So we need to somehow catch this in the rewrite rules. I wonder whether we should reconsider the previous strategy of breaking apart "slow" LEAQs at the last minute (for #21735).
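For concreteness, here is a sketch of the example being described; the function name, registers, and per-instruction byte counts are my reading of the report above rather than actual compiler output:

// A function of the shape named in the issue title (name is illustrative):
func f(a, b int) int {
	return a + b + 1
}

// Current output once the 3-operand LEAQ has been broken apart
// (roughly 4 bytes per instruction, 8 bytes in total):
//     LEAQ (AX)(CX*1), AX
//     LEAQ 1(AX), AX
//
// Suggested alternative (roughly 3 bytes per instruction, 6 bytes in total),
// which cannot be produced while lowering to instructions because ADDQ and
// INCQ clobber flags and LEAQ does not:
//     ADDQ CX, AX
//     INCQ AX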
cc @martisch @randall77