cmd/compile: BLAS regressions at tip #19595
Looks like there are some extra reg->reg moves in tip compared to 1.8. I strongly suspect the introduction of MULSDmem ops. Previously we had
which works fine with no reg moves. After introducing MULSDmem ops, we have
The op for x = m * load ... requires its output register (for x) to be the same as the input register (for m). That introduces two additional register moves: one to copy m to another register and one to move it back. The copy back then requires an additional branch instruction. Source code:
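To make the pattern concrete, here is a minimal Go sketch of the kind of inner loop described above (hypothetical function, not gonum's actual code): the multiplier m is live across every iteration, so fusing the load of v[i] into the multiply forces the extra register shuffling.

```go
package main

import "fmt"

// scaleSum multiplies each element of v by a constant m and accumulates
// the results. Because m is needed again on the next iteration, a fused
// multiply-from-memory op (e.g. MULSDmem), whose result must land in the
// same register as m, forces the compiler to copy m aside and restore it.
func scaleSum(m float64, v []float64) float64 {
	var sum float64
	for i := range v {
		sum += m * v[i] // m * load: candidate for load/op fusion
	}
	return sum
}

func main() {
	fmt.Println(scaleSum(2, []float64{1, 2, 3})) // 12
}
```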
1.8:
tip:
CL https://golang.org/cl/38337 mentions this issue.
@btracey Please try the patch above and see if it fixes everything for you.
It seems to have partially helped, but not fixed the problem. I'm not sure I applied it correctly though (see #19619)
It looks like all of the
I've taken a closer look at DgemvMedMedNoTransIncN. Its inner loop is gonum/internal/asm/f64/axpy.go:AxpyInc
whereas 1.8 is:
I'm pretty surprised this matters at the ~30% slowdown level. There might be something I'm missing.
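For reference, a pure-Go loop of the AxpyInc shape looks roughly like this (a simplified sketch; gonum's real AxpyInc takes uintptr strides and offsets). The body is just a fused multiply-add plus two strided index updates, which is small enough that one extra jump per iteration is measurable.

```go
package main

import "fmt"

// axpyInc computes y[iy] += alpha * x[ix] over strided slices, the shape
// of gonum's AxpyInc inner loop (simplified signature for illustration).
func axpyInc(alpha float64, x, y []float64, n, incX, incY int) {
	ix, iy := 0, 0
	for i := 0; i < n; i++ {
		y[iy] += alpha * x[ix]
		ix += incX
		iy += incY
	}
}

func main() {
	x := []float64{1, 2, 3}
	y := []float64{10, 20, 30}
	axpyInc(2, x, y, 3, 1, 1)
	fmt.Println(y) // [12 24 36]
}
```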
We might be able to get the effect of the loop peel just by changing the weights on block ordering -- for loops where the header block has a loop exit edge, make the branch to the header be lower than the backedge to the header.
@dr2chase I feel like that would interact badly with the register allocator. If you put the loop header block last, then the 2nd block of the loop appears first in the schedule. But none of its predecessors have been processed yet, so the register allocator won't know what register state to start with. It could be fixed, I suppose, computing a regalloc schedule instead of just doing blocks in order.
Which isn't quite the same as 1.8 either. Better or worse? I'm not sure.
When I did the explicit loop peel to fix `for i, e := range` slice loops in the presence of backedge rescheduling checks, PPC did get a little faster.
CL https://golang.org/cl/38431 mentions this issue.
We want to merge a load and op into a single instruction

    l = LOAD ptr mem
    y = OP x l

into

    y = OPload x ptr mem

However, all of our OPload instructions require that y uses the same register as x. If x is needed past this instruction, then we must copy x somewhere else, losing the whole benefit of merging the instructions in the first place.

Disable this optimization if x is live past the OP. Also disable this optimization if the OP is in a deeper loop than the load.

Update #19595

Change-Id: I87f596aad7e91c9127bfb4705cbae47106e1e77a
Reviewed-on: https://go-review.googlesource.com/38337
Reviewed-by: Ilya Tocar <[email protected]>
Regalloc uses loop depths - make sure they are initialized! Test to make sure we aren't pushing spills into loops. This fixes a generated-code performance bug introduced with the better spill placement change: https://go-review.googlesource.com/c/34822/

Update #19595

Change-Id: Ib9f0da6fb588503518847d7aab51e569fd3fa61e
Reviewed-on: https://go-review.googlesource.com/38434
Reviewed-by: David Chase <[email protected]>
CL https://golang.org/cl/38434 mentions this issue.
Old loops look like this:

    loop:
      CMPQ ...
      JGE exit
      ...
      JMP loop
    exit:

New loops look like this:

    JMP entry
    loop:
      ...
    entry:
      CMPQ ...
      JLT loop

This removes one instruction (the unconditional jump) from the inner loop. Kinda surprisingly, it matters.

This is a bit different than the peeling that the old obj library did in that we don't duplicate the loop exit test. We just jump to the test. I'm not sure if it is better or worse to do that (peeling gets rid of the JMP but means more code duplication), but this CL is certainly a much simpler compiler change, so I'll try this way first. The obj library used to do peeling before CL https://go-review.googlesource.com/c/36205 turned it off.

Fixes #15837 (remove obj instruction reordering) -- the reordering is already removed; this CL implements the only part of that reordering that we'd like to keep.

Fixes #14758 (append loop)

    name    old time/op  new time/op  delta
    Foo-12  817ns ± 4%   538ns ± 0%   -34.08%  (p=0.000 n=10+9)
    Bar-12  850ns ±11%   570ns ±13%   -32.88%  (p=0.000 n=10+10)

Update #19595 (BLAS slowdown)

    name                       old time/op  new time/op  delta
    DgemvMedMedNoTransIncN-12  13.2µs ± 9%  10.2µs ± 1%  -22.26%  (p=0.000 n=9+9)

Fixes #19633 (append loop)

    name    old time/op  new time/op  delta
    Foo-12  810ns ± 1%   540ns ± 0%   -33.30%  (p=0.000 n=8+9)

Update #18977 (Fannkuch11 regression)

    name          old time/op  new time/op  delta
    Fannkuch11-8  2.80s ± 0%   3.01s ± 0%   +7.47%  (p=0.000 n=9+10)

This one makes no sense. There's strictly 1 less instruction in the inner loop (17 instead of 18). They are exactly the same instructions except for the JMP that has been elided.

go1 benchmarks generally don't look very impressive. But the gains for the specific issues above make this CL still probably worth it.

    name                     old time/op  new time/op  delta
    BinaryTree17-8           2.32s ± 0%   2.34s ± 0%   +1.14%  (p=0.000 n=9+7)
    Fannkuch11-8             2.80s ± 0%   3.01s ± 0%   +7.47%  (p=0.000 n=9+10)
    FmtFprintfEmpty-8        44.1ns ± 1%  46.1ns ± 1%  +4.53%  (p=0.000 n=10+10)
    FmtFprintfString-8       67.8ns ± 0%  74.4ns ± 1%  +9.80%  (p=0.000 n=10+9)
    FmtFprintfInt-8          74.9ns ± 0%  78.4ns ± 0%  +4.67%  (p=0.000 n=8+10)
    FmtFprintfIntInt-8       117ns ± 1%   123ns ± 1%   +4.69%  (p=0.000 n=9+10)
    FmtFprintfPrefixedInt-8  160ns ± 1%   146ns ± 0%   -8.22%  (p=0.000 n=8+10)
    FmtFprintfFloat-8        214ns ± 0%   206ns ± 0%   -3.91%  (p=0.000 n=8+8)
    FmtManyArgs-8            468ns ± 0%   497ns ± 1%   +6.09%  (p=0.000 n=8+10)
    GobDecode-8              6.16ms ± 0%  6.21ms ± 1%  +0.76%  (p=0.000 n=9+10)
    GobEncode-8              4.90ms ± 0%  4.92ms ± 1%  +0.37%  (p=0.028 n=9+10)
    Gzip-8                   209ms ± 0%   212ms ± 0%   +1.33%  (p=0.000 n=10+10)
    Gunzip-8                 36.6ms ± 0%  38.0ms ± 1%  +4.03%  (p=0.000 n=9+9)
    HTTPClientServer-8       84.2µs ± 0%  86.0µs ± 1%  +2.14%  (p=0.000 n=9+9)
    JSONEncode-8             13.6ms ± 3%  13.8ms ± 1%  +1.55%  (p=0.003 n=9+10)
    JSONDecode-8             53.2ms ± 5%  52.9ms ± 0%  ~       (p=0.280 n=10+10)
    Mandelbrot200-8          3.78ms ± 0%  3.78ms ± 1%  ~       (p=0.661 n=10+9)
    GoParse-8                2.89ms ± 0%  2.94ms ± 2%  +1.50%  (p=0.000 n=10+10)
    RegexpMatchEasy0_32-8    68.5ns ± 2%  68.9ns ± 1%  ~       (p=0.136 n=10+10)
    RegexpMatchEasy0_1K-8    220ns ± 1%   225ns ± 1%   +2.41%  (p=0.000 n=10+10)
    RegexpMatchEasy1_32-8    64.7ns ± 0%  64.5ns ± 0%  -0.28%  (p=0.042 n=10+10)
    RegexpMatchEasy1_1K-8    348ns ± 1%   355ns ± 0%   +1.90%  (p=0.000 n=10+10)
    RegexpMatchMedium_32-8   102ns ± 1%   105ns ± 1%   +2.95%  (p=0.000 n=10+10)
    RegexpMatchMedium_1K-8   33.1µs ± 3%  32.5µs ± 0%  -1.75%  (p=0.000 n=10+10)
    RegexpMatchHard_32-8     1.71µs ± 1%  1.70µs ± 1%  -0.84%  (p=0.002 n=10+9)
    RegexpMatchHard_1K-8     51.1µs ± 0%  50.8µs ± 1%  -0.48%  (p=0.004 n=10+10)
    Revcomp-8                411ms ± 1%   402ms ± 0%   -2.22%  (p=0.000 n=10+9)
    Template-8               61.8ms ± 1%  59.7ms ± 0%  -3.44%  (p=0.000 n=9+9)
    TimeParse-8              306ns ± 0%   318ns ± 0%   +3.83%  (p=0.000 n=10+10)
    TimeFormat-8             320ns ± 0%   318ns ± 1%   -0.53%  (p=0.012 n=7+10)

Change-Id: Ifaf29abbe5874e437048e411ba8f7cfbc9e1c94b
Reviewed-on: https://go-review.googlesource.com/38431
Run-TryBot: Keith Randall <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: David Chase <[email protected]>
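The Foo and Bar append benchmarks above aren't reproduced in this thread; a loop of the shape those numbers suggest (my guess at the pattern, not the actual benchmark source) would be a tight append loop like this, where the rotation moves the bounds test to the bottom and drops the unconditional JMP from the hot path:

```go
package main

import "fmt"

// appendLoop is the kind of tight append loop the CL speeds up: with the
// rotated layout, each iteration ends in a conditional branch back to the
// loop body instead of a compare, a conditional exit, and a JMP.
func appendLoop(n int) []int {
	s := make([]int, 0, n) // preallocate so append never reallocates
	for i := 0; i < n; i++ {
		s = append(s, i)
	}
	return s
}

func main() {
	fmt.Println(appendLoop(5)) // [0 1 2 3 4]
}
```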
@btracey would you mind re-measuring with the latest tip, to assess whether this still might need fixes for 1.9? Thanks.
Here are the current stats. Mostly better.
But a few non-trivial regressions still. If you're up for it, would you mind patching in CLs 43293 and 43491 and seeing whether they help any? (No particular reason to think they will, other than that they have a significant effect on code layout in general.)
With both applied:
Thanks. Looks like those CLs are a mixed bag, maybe a bit worse on balance. To confirm, are these latest numbers from 1.8 to tip+2CLs? I ask because the before numbers, e.g. for the last benchmark, don't line up: 10.4us two comments back vs 9.84us in the last comment. If you have all the benchmark files around, could you post the benchstat going from tip to tip+2CLs, maybe running with -geomean? Ta.
It was 1.8.1 to the most recent. I'm running the benchmarks on my laptop (which I've been using all day), so the benchmarks could shift a bit. This is why I re-ran the 1.8.1 results (and why they don't exactly match up).
Punting to 1.10.
@btracey - Comparing 1.8.7 with 1.12beta1, I see improvements
Kindly check and let us know if there is anything else to be done here. Thanks. |
Timed out in state WaitingForInfo. Closing. (I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)
Comparing go1.8 with go version devel +b9f6b22 Fri Mar 17 13:59:31 2017 +0000 darwin/amd64.
For all the benchmarks, the package is github.com/gonum/blas/native. Below, the noasm tag makes the code pure Go, rather than the ASM implementations of ddot and daxpy (which use SIMD and thus make the code significantly faster).

The first benchmark is Daxpy. The command is:

    go test -bench Daxpy -tags noasm -count 5
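For readers without the gonum tree handy, a unit-stride pure-Go daxpy kernel of the kind the noasm build falls back to looks roughly like this (a minimal sketch, not gonum's exact code):

```go
package main

import "fmt"

// daxpy computes y = alpha*x + y for unit-stride vectors -- the pure-Go
// fallback shape that the noasm tag selects instead of the SIMD assembly.
func daxpy(alpha float64, x, y []float64) {
	for i, v := range x {
		y[i] += alpha * v
	}
}

func main() {
	x := []float64{1, 1, 1}
	y := []float64{1, 2, 3}
	daxpy(3, x, y)
	fmt.Println(y) // [4 5 6]
}
```

In a _test.go file, a `func BenchmarkDaxpy(b *testing.B)` wrapper around such a kernel is what `go test -bench Daxpy` would drive.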
There are significant slowdowns at the smaller sizes, though they disappear at the larger sizes. I'm not sure these explain the regressions we're seeing in Dgemv, so it may be a separate issue.

The command is:

    go test -bench Dgemv -tags noasm -count 5

As we see, here some of the largest regressions are at the larger sizes. These additionally work out to about a 15% regression on Dgemm for almost all sizes, and about a 15% regression for the yet higher-level Cholesky decomposition.