Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: BLAS regressions at tip #19595

Closed
btracey opened this issue Mar 17, 2017 · 20 comments
Closed

cmd/compile: BLAS regressions at tip #19595

btracey opened this issue Mar 17, 2017 · 20 comments
Labels
FrozenDueToAge Performance WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Milestone

Comments

@btracey
Copy link
Contributor

btracey commented Mar 17, 2017

Comparing go1.8 with go version devel +b9f6b22 Fri Mar 17 13:59:31 2017 +0000 darwin/amd64

For all the benchmarks, the package is The github.com/gonum/blas/native. Below, the noasm tag makes the code pure go, rather than the ASM implementations of ddot and daxpy (which use SIMD and thus make the code significantly faster).

The first benchmark is Daxpy. Command is go test -bench Daxpy -tags noasm -count 5

brendan:~/Documents/mygo/src/github.com/gonum$ benchstat daxpy18.txt dgaxpytip.txt 
name                      old time/op  new time/op   delta
DaxpySmallBothUnitary-8   21.8ns ± 2%   23.2ns ± 6%   +6.80%  (p=0.008 n=5+5)
DaxpySmallIncUni-8        25.6ns ± 1%   29.7ns ± 1%  +16.08%  (p=0.008 n=5+5)
DaxpySmallUniInc-8        24.8ns ± 1%   29.0ns ± 4%  +17.00%  (p=0.008 n=5+5)
DaxpySmallBothInc-8       25.3ns ± 1%   28.6ns ± 1%  +13.04%  (p=0.008 n=5+5)
DaxpyMediumBothUnitary-8   872ns ± 1%   1153ns ± 0%  +32.14%  (p=0.008 n=5+5)
DaxpyMediumIncUni-8       1.07µs ± 1%   1.17µs ± 1%   +9.10%  (p=0.008 n=5+5)
DaxpyMediumUniInc-8        928ns ± 1%   1162ns ± 1%  +25.12%  (p=0.008 n=5+5)
DaxpyMediumBothInc-8      1.15µs ± 1%   1.22µs ± 1%   +6.68%  (p=0.008 n=5+5)
DaxpyLargeBothUnitary-8   88.6µs ± 1%  116.5µs ± 4%  +31.57%  (p=0.008 n=5+5)
DaxpyLargeIncUni-8         181µs ± 1%    185µs ± 2%   +2.35%  (p=0.016 n=5+5)
DaxpyLargeUniInc-8         140µs ± 1%    146µs ± 2%   +4.20%  (p=0.008 n=5+5)
DaxpyLargeBothInc-8        294µs ± 1%    296µs ± 1%     ~     (p=0.310 n=5+5)
DaxpyHugeBothUnitary-8    13.2ms ± 1%   14.2ms ± 2%   +8.00%  (p=0.008 n=5+5)
DaxpyHugeIncUni-8         30.5ms ± 3%   30.2ms ± 1%     ~     (p=0.310 n=5+5)
DaxpyHugeUniInc-8         29.3ms ± 1%   29.1ms ± 1%     ~     (p=0.310 n=5+5)
DaxpyHugeBothInc-8        46.6ms ± 2%   46.3ms ± 1%     ~     (p=0.222 n=5+5)

There are significant slowdowns at the smaller sizes, though they disappear at the larger sizes. I'm not sure these explain the regressions we're seeing in Dgemv, so it may be a separate issue.

The command is go test -bench Dgemv -tags noasm -count 5.

brendan:~/Documents/mygo/src/github.com/gonum$ benchstat dgemv18.txt dgemvtip.txt 
name                      old time/op  new time/op   delta
DgemvSmSmNoTransInc1-8     171ns ± 0%    179ns ± 1%   +4.81%  (p=0.008 n=5+5)
DgemvSmSmNoTransIncN-8     206ns ± 1%    228ns ± 0%  +10.66%  (p=0.008 n=5+5)
DgemvSmSmTransInc1-8       209ns ± 0%    220ns ± 1%   +5.07%  (p=0.016 n=4+5)
DgemvSmSmTransIncN-8       221ns ± 1%    257ns ± 1%  +16.17%  (p=0.008 n=5+5)
DgemvMedMedNoTransInc1-8  9.93µs ± 3%   9.90µs ± 4%     ~     (p=0.841 n=5+5)
DgemvMedMedNoTransIncN-8  10.1µs ± 1%   12.9µs ± 1%  +27.62%  (p=0.008 n=5+5)
DgemvMedMedTransInc1-8    10.7µs ± 1%   12.8µs ± 1%  +19.38%  (p=0.008 n=5+5)
DgemvMedMedTransIncN-8    11.4µs ± 1%   13.9µs ± 1%  +22.11%  (p=0.008 n=5+5)
DgemvLgLgNoTransInc1-8     919µs ± 1%    922µs ± 2%     ~     (p=0.841 n=5+5)
DgemvLgLgNoTransIncN-8     937µs ± 2%   1219µs ± 2%  +30.12%  (p=0.008 n=5+5)
DgemvLgLgTransInc1-8       948µs ± 1%   1236µs ± 1%  +30.29%  (p=0.008 n=5+5)
DgemvLgLgTransIncN-8       988µs ± 1%   1223µs ± 1%  +23.85%  (p=0.008 n=5+5)
DgemvLgSmNoTransInc1-8    14.4µs ± 1%   14.5µs ± 1%     ~     (p=0.421 n=5+5)
DgemvLgSmNoTransIncN-8    17.4µs ± 1%   20.2µs ± 1%  +15.57%  (p=0.008 n=5+5)
DgemvLgSmTransInc1-8      17.6µs ± 2%   18.3µs ± 2%   +4.18%  (p=0.008 n=5+5)
DgemvLgSmTransIncN-8      18.0µs ± 1%   22.4µs ± 2%  +24.35%  (p=0.008 n=5+5)
DgemvSmLgNoTransInc1-8    8.65µs ± 1%   8.76µs ± 1%   +1.21%  (p=0.032 n=5+5)
DgemvSmLgNoTransIncN-8    8.74µs ± 1%  11.64µs ± 1%  +33.15%  (p=0.008 n=5+5)
DgemvSmLgTransInc1-8      9.36µs ± 1%  12.15µs ± 1%  +29.83%  (p=0.008 n=5+5)
DgemvSmLgTransIncN-8      10.1µs ± 0%   12.5µs ± 1%  +24.03%  (p=0.008 n=5+5)

As we see, here some of the largest regressions are at the larger sizes. These additionally work to about a 15% regression on Dgemm for almost all sizes, and about a 15% regression for the yet higher-level Cholesky decomposition.

@josharian josharian changed the title BLAS regressions at tip cmd/compile: BLAS regressions at tip Mar 17, 2017
@bradfitz bradfitz added this to the Go1.9 milestone Mar 17, 2017
@josharian josharian added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. Performance labels Mar 17, 2017
@bradfitz bradfitz removed the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Mar 17, 2017
@randall77
Copy link
Contributor

randall77 commented Mar 18, 2017

Looks like there are some extra reg->reg moves in tip compared to 1.8.

I strongly suspect the introduction of MULSDmem ops. Previously we had

    m = some constant
    for ... {
        x = load ...
        x = m * x
    }

which works fine with no reg moves. After introducing MULSDmem ops, we have

    m = some constant
    for ... {
        x = m * load ...
    }

The op for x = m * load ... requires its output register (for x) to be the same as the input register (m). So there are 2 additional register moves, one to copy m to another register and one to move it back. The copy back then requires an additional branch instruction.

@TocarIP

source code:

func f(dst []float64, alpha float64, x, y []float64) {
	for i, v := range x {
		dst[i] = alpha*v + y[i]
	}
}

1.8:

	0x0038 00056 (/Users/khr/go/tmp1.go:4)	MOVSD	(DI), X1
	0x003c 00060 (/Users/khr/go/tmp1.go:5)	MULSD	X0, X1
	0x0040 00064 (/Users/khr/go/tmp1.go:5)	CMPQ	R8, BX
	0x0043 00067 (/Users/khr/go/tmp1.go:5)	JCC	$0, 111
	0x0045 00069 (/Users/khr/go/tmp1.go:5)	MOVSD	(SI)(R8*8), X2
	0x004b 00075 (/Users/khr/go/tmp1.go:5)	ADDSD	X2, X1
	0x004f 00079 (/Users/khr/go/tmp1.go:5)	CMPQ	R8, DX
	0x0052 00082 (/Users/khr/go/tmp1.go:5)	JCC	$0, 111
	0x0054 00084 (/Users/khr/go/tmp1.go:5)	MOVSD	X1, (CX)(R8*8)
	0x005a 00090 (/Users/khr/go/tmp1.go:4)	ADDQ	$8, DI
	0x005e 00094 (/Users/khr/go/tmp1.go:4)	INCQ	R8
	0x0061 00097 (/Users/khr/go/tmp1.go:4)	CMPQ	R8, AX
	0x0064 00100 (/Users/khr/go/tmp1.go:4)	JLT	$0, 56

tip:

	0x0038 00056 (/Users/khr/go/tmp1.go:5)	MOVUPS	X0, X1
	0x003b 00059 (/Users/khr/go/tmp1.go:5)	MULSD	(DI), X0
	0x003f 00063 (/Users/khr/go/tmp1.go:5)	CMPQ	R8, SI
	0x0042 00066 (/Users/khr/go/tmp1.go:5)	JCC	$0, 122
	0x0044 00068 (/Users/khr/go/tmp1.go:5)	MOVSD	(CX)(R8*8), X2
	0x004a 00074 (/Users/khr/go/tmp1.go:5)	ADDSD	X2, X0
	0x004e 00078 (/Users/khr/go/tmp1.go:5)	CMPQ	R8, BX
	0x0051 00081 (/Users/khr/go/tmp1.go:5)	JCC	$0, 115
	0x0053 00083 (/Users/khr/go/tmp1.go:5)	MOVSD	X0, (DX)(R8*8)
	0x0059 00089 (/Users/khr/go/tmp1.go:4)	ADDQ	$8, DI
	0x005d 00093 (/Users/khr/go/tmp1.go:4)	INCQ	R8
	0x0060 00096 (/Users/khr/go/tmp1.go:4)	CMPQ	R8, AX
	0x0063 00099 (/Users/khr/go/tmp1.go:4)	JGE	$0, 106
	0x0065 00101 (/Users/khr/go/tmp1.go:5)	MOVUPS	X1, X0
	0x0068 00104 (/Users/khr/go/tmp1.go:4)	JMP	56

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/38337 mentions this issue.

@randall77
Copy link
Contributor

@btracey Please try the patch above and see if it fixes everything for you.

@randall77 randall77 self-assigned this Mar 18, 2017
@btracey
Copy link
Contributor Author

btracey commented Mar 20, 2017

It seems to have partially helped, but not fixed the problem. I'm not sure I applied it correctly though (see #19619)

DaxpySmallBothUnitary-8   21.6ns ± 1%   22.3ns ± 1%   +3.15%  (p=0.008 n=5+5)
DaxpySmallIncUni-8        25.6ns ± 0%   27.8ns ± 1%   +8.75%  (p=0.016 n=4+5)
DaxpySmallUniInc-8        24.9ns ± 1%   27.4ns ± 2%  +10.13%  (p=0.008 n=5+5)
DaxpySmallBothInc-8       25.4ns ± 0%   27.8ns ± 1%   +9.61%  (p=0.008 n=5+5)
DaxpyMediumBothUnitary-8   880ns ± 1%    884ns ± 1%     ~     (p=0.341 n=5+5)
DaxpyMediumIncUni-8       1.07µs ± 1%   1.18µs ± 1%  +10.12%  (p=0.008 n=5+5)
DaxpyMediumUniInc-8        927ns ± 0%   1168ns ± 1%  +26.07%  (p=0.008 n=5+5)
DaxpyMediumBothInc-8      1.15µs ± 2%   1.24µs ± 4%   +8.47%  (p=0.008 n=5+5)
DaxpyLargeBothUnitary-8   88.2µs ± 1%   89.2µs ± 0%   +1.18%  (p=0.008 n=5+5)
DaxpyLargeIncUni-8         185µs ± 1%    189µs ± 1%   +2.45%  (p=0.008 n=5+5)
DaxpyLargeUniInc-8         142µs ± 1%    146µs ± 1%   +3.11%  (p=0.008 n=5+5)
DaxpyLargeBothInc-8        292µs ± 0%    304µs ± 1%   +4.25%  (p=0.008 n=5+5)
DaxpyHugeBothUnitary-8    13.3ms ± 1%   13.1ms ± 1%   -1.43%  (p=0.016 n=5+5)
DaxpyHugeIncUni-8         30.4ms ± 2%   30.5ms ± 1%     ~     (p=0.690 n=5+5)
DaxpyHugeUniInc-8         29.1ms ± 0%   29.3ms ± 1%     ~     (p=0.286 n=4+5)
DaxpyHugeBothInc-8        47.4ms ± 1%   46.6ms ± 1%   -1.60%  (p=0.032 n=5+5)
DgemvSmSmNoTransInc1-8     171ns ± 1%    203ns ± 0%  +19.09%  (p=0.008 n=5+5)
DgemvSmSmNoTransIncN-8     206ns ± 1%    228ns ± 1%  +10.37%  (p=0.008 n=5+5)
DgemvSmSmTransInc1-8       210ns ± 1%    212ns ± 2%     ~     (p=0.333 n=5+5)
DgemvSmSmTransIncN-8       222ns ± 1%    245ns ± 1%  +10.08%  (p=0.008 n=5+5)
DgemvMedMedNoTransInc1-8  9.68µs ± 1%  10.59µs ± 1%   +9.41%  (p=0.008 n=5+5)
DgemvMedMedNoTransIncN-8  10.1µs ± 1%   13.0µs ± 1%  +28.10%  (p=0.008 n=5+5)
DgemvMedMedTransInc1-8    10.8µs ± 1%   10.9µs ± 3%     ~     (p=0.151 n=5+5)
DgemvMedMedTransIncN-8    11.3µs ± 0%   13.9µs ± 1%  +22.81%  (p=0.008 n=5+5)
DgemvLgLgNoTransInc1-8     924µs ± 2%    934µs ± 1%     ~     (p=0.310 n=5+5)
DgemvLgLgNoTransIncN-8     929µs ± 1%   1212µs ± 1%  +30.49%  (p=0.008 n=5+5)
DgemvLgLgTransInc1-8       952µs ± 2%    957µs ± 2%     ~     (p=0.690 n=5+5)
DgemvLgLgTransIncN-8       974µs ± 0%   1212µs ± 1%  +24.38%  (p=0.008 n=5+5)
DgemvLgSmNoTransInc1-8    14.4µs ± 1%   17.5µs ± 1%  +21.47%  (p=0.008 n=5+5)
DgemvLgSmNoTransIncN-8    17.9µs ± 1%   19.9µs ± 1%  +11.48%  (p=0.008 n=5+5)
DgemvLgSmTransInc1-8      17.5µs ± 1%   17.5µs ± 0%     ~     (p=0.690 n=5+5)
DgemvLgSmTransIncN-8      18.0µs ± 0%   20.5µs ± 1%  +14.09%  (p=0.008 n=5+5)
DgemvSmLgNoTransInc1-8    8.70µs ± 1%   8.85µs ± 2%   +1.75%  (p=0.016 n=5+5)
DgemvSmLgNoTransIncN-8    8.72µs ± 1%  11.65µs ± 2%  +33.57%  (p=0.008 n=5+5)
DgemvSmLgTransInc1-8      9.41µs ± 0%   9.39µs ± 1%     ~     (p=0.421 n=5+5)
DgemvSmLgTransIncN-8      10.2µs ± 2%   12.7µs ± 1%  +24.70%  (p=0.008 n=5+5)

@btracey
Copy link
Contributor Author

btracey commented Mar 20, 2017

It looks like all of the Daxpy benchmarks got faster with this patch, even though there seems to still be some regression overall. The Dgemv benchmarks have a less clear story.

@randall77
Copy link
Contributor

I've taken a closer look at DgemvMedMedNoTransIncN. Its inner loop is gonum/internal/asm/f64/axpy.go:AxpyInc
There's no difference in the inner loop with CL38337.
The difference between 1.8 and tip is that we're no longer loop peeling the branch at the start of the loop. See CL 36205 and issues #18977 #15837.
So tip looks like:

loop:
    CMPQ ...
    JGE exit
    ...
    JMP loop
exit:

whereas 1.8 is:

    CMPQ ...
    JGE exit
loop:
    ...
    CMPQ ...
    JLT loop
exit:

I'm pretty surprised this matters at the ~30% slowdown level. There might be something I'm missing.
In any case, we plan to fix this one for 1.9.

@dr2chase
Copy link
Contributor

We might be able to get the effect of the loop peel just by changing the weights on block ordering -- for loops where the header block has a loop exit edge, make the branch to the header be lower than the backedge to the header.

@randall77
Copy link
Contributor

@dr2chase I feel like that would interact badly with the register allocator. If you put the loop header block last, then the 2nd block of the loop appears first in the schedule. But none of its predecessors have been processed yet, so the register allocator won't know what register state to start with. It could be fixed, I suppose, computing a regalloc schedule instead of just doing blocks in order.
It would generate:

    JMP entry
loop:
    ...
entry:
    CMPQ ...
    JLT loop
exit:

Which isn't quite the same as 1.8 either. Better or worse? I'm not sure.
It definitely has the appeal of doing less CFG surgery.

@dr2chase
Copy link
Contributor

When I did the explicit loop peel to fix i,e:=range-of-slice loops in the presence of backedge rescheduling checks, PPC did get a little faster.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/38431 mentions this issue.

gopherbot pushed a commit that referenced this issue Mar 23, 2017
We want to merge a load and op into a single instruction

    l = LOAD ptr mem
    y = OP x l

into

    y = OPload x ptr mem

However, all of our OPload instructions require that y uses
the same register as x. If x is needed past this instruction, then
we must copy x somewhere else, losing the whole benefit of merging
the instructions in the first place.

Disable this optimization if x is live past the OP.

Also disable this optimization if the OP is in a deeper loop than the load.

Update #19595

Change-Id: I87f596aad7e91c9127bfb4705cbae47106e1e77a
Reviewed-on: https://go-review.googlesource.com/38337
Reviewed-by: Ilya Tocar <[email protected]>
gopherbot pushed a commit that referenced this issue Mar 23, 2017
Regalloc uses loop depths - make sure they are initialized!

Test to make sure we aren't pushing spills into loops.

This fixes a generated-code performance bug introduced with
the better spill placement change:
https://go-review.googlesource.com/c/34822/

Update #19595

Change-Id: Ib9f0da6fb588503518847d7aab51e569fd3fa61e
Reviewed-on: https://go-review.googlesource.com/38434
Reviewed-by: David Chase <[email protected]>
@gopherbot
Copy link
Contributor

CL https://golang.org/cl/38434 mentions this issue.

gopherbot pushed a commit that referenced this issue Apr 24, 2017
Old loops look like this:
   loop:
     CMPQ ...
     JGE exit
     ...
     JMP loop
   exit:

New loops look like this:
    JMP entry
  loop:
    ...
  entry:
    CMPQ ...
    JLT loop

This removes one instruction (the unconditional jump) from
the inner loop.
Kinda surprisingly, it matters.

This is a bit different than the peeling that the old obj
library did in that we don't duplicate the loop exit test.
We just jump to the test.  I'm not sure if it is better or
worse to do that (peeling gets rid of the JMP but means more
code duplication), but this CL is certainly a much simpler
compiler change, so I'll try this way first.

The obj library used to do peeling before
CL https://go-review.googlesource.com/c/36205 turned it off.

Fixes #15837 (remove obj instruction reordering)
The reordering is already removed, this CL implements the only
part of that reordering that we'd like to keep.

Fixes #14758 (append loop)
name    old time/op    new time/op    delta
Foo-12     817ns ± 4%     538ns ± 0%  -34.08%   (p=0.000 n=10+9)
Bar-12     850ns ±11%     570ns ±13%  -32.88%  (p=0.000 n=10+10)

Update #19595 (BLAS slowdown)
name                       old time/op  new time/op  delta
DgemvMedMedNoTransIncN-12  13.2µs ± 9%  10.2µs ± 1%  -22.26%  (p=0.000 n=9+9)

Fixes #19633 (append loop)
name    old time/op    new time/op    delta
Foo-12     810ns ± 1%     540ns ± 0%  -33.30%   (p=0.000 n=8+9)

Update #18977 (Fannkuch11 regression)
name         old time/op    new time/op    delta
Fannkuch11-8                2.80s ± 0%     3.01s ± 0%  +7.47%   (p=0.000 n=9+10)
This one makes no sense.  There's strictly 1 less instruction in the
inner loop (17 instead of 18).  They are exactly the same instructions
except for the JMP that has been elided.

go1 benchmarks generally don't look very impressive.  But the gains for the
specific issues above make this CL still probably worth it.
name                      old time/op    new time/op    delta
BinaryTree17-8              2.32s ± 0%     2.34s ± 0%  +1.14%    (p=0.000 n=9+7)
Fannkuch11-8                2.80s ± 0%     3.01s ± 0%  +7.47%   (p=0.000 n=9+10)
FmtFprintfEmpty-8          44.1ns ± 1%    46.1ns ± 1%  +4.53%  (p=0.000 n=10+10)
FmtFprintfString-8         67.8ns ± 0%    74.4ns ± 1%  +9.80%   (p=0.000 n=10+9)
FmtFprintfInt-8            74.9ns ± 0%    78.4ns ± 0%  +4.67%   (p=0.000 n=8+10)
FmtFprintfIntInt-8          117ns ± 1%     123ns ± 1%  +4.69%   (p=0.000 n=9+10)
FmtFprintfPrefixedInt-8     160ns ± 1%     146ns ± 0%  -8.22%   (p=0.000 n=8+10)
FmtFprintfFloat-8           214ns ± 0%     206ns ± 0%  -3.91%    (p=0.000 n=8+8)
FmtManyArgs-8               468ns ± 0%     497ns ± 1%  +6.09%   (p=0.000 n=8+10)
GobDecode-8                6.16ms ± 0%    6.21ms ± 1%  +0.76%   (p=0.000 n=9+10)
GobEncode-8                4.90ms ± 0%    4.92ms ± 1%  +0.37%   (p=0.028 n=9+10)
Gzip-8                      209ms ± 0%     212ms ± 0%  +1.33%  (p=0.000 n=10+10)
Gunzip-8                   36.6ms ± 0%    38.0ms ± 1%  +4.03%    (p=0.000 n=9+9)
HTTPClientServer-8         84.2µs ± 0%    86.0µs ± 1%  +2.14%    (p=0.000 n=9+9)
JSONEncode-8               13.6ms ± 3%    13.8ms ± 1%  +1.55%   (p=0.003 n=9+10)
JSONDecode-8               53.2ms ± 5%    52.9ms ± 0%    ~     (p=0.280 n=10+10)
Mandelbrot200-8            3.78ms ± 0%    3.78ms ± 1%    ~      (p=0.661 n=10+9)
GoParse-8                  2.89ms ± 0%    2.94ms ± 2%  +1.50%  (p=0.000 n=10+10)
RegexpMatchEasy0_32-8      68.5ns ± 2%    68.9ns ± 1%    ~     (p=0.136 n=10+10)
RegexpMatchEasy0_1K-8       220ns ± 1%     225ns ± 1%  +2.41%  (p=0.000 n=10+10)
RegexpMatchEasy1_32-8      64.7ns ± 0%    64.5ns ± 0%  -0.28%  (p=0.042 n=10+10)
RegexpMatchEasy1_1K-8       348ns ± 1%     355ns ± 0%  +1.90%  (p=0.000 n=10+10)
RegexpMatchMedium_32-8      102ns ± 1%     105ns ± 1%  +2.95%  (p=0.000 n=10+10)
RegexpMatchMedium_1K-8     33.1µs ± 3%    32.5µs ± 0%  -1.75%  (p=0.000 n=10+10)
RegexpMatchHard_32-8       1.71µs ± 1%    1.70µs ± 1%  -0.84%   (p=0.002 n=10+9)
RegexpMatchHard_1K-8       51.1µs ± 0%    50.8µs ± 1%  -0.48%  (p=0.004 n=10+10)
Revcomp-8                   411ms ± 1%     402ms ± 0%  -2.22%   (p=0.000 n=10+9)
Template-8                 61.8ms ± 1%    59.7ms ± 0%  -3.44%    (p=0.000 n=9+9)
TimeParse-8                 306ns ± 0%     318ns ± 0%  +3.83%  (p=0.000 n=10+10)
TimeFormat-8                320ns ± 0%     318ns ± 1%  -0.53%   (p=0.012 n=7+10)

Change-Id: Ifaf29abbe5874e437048e411ba8f7cfbc9e1c94b
Reviewed-on: https://go-review.googlesource.com/38431
Run-TryBot: Keith Randall <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: David Chase <[email protected]>
@josharian
Copy link
Contributor

@btracey would you mind re-measuring with the latest tip, to assess whether this still might need fixes for 1.9? Thanks.

@btracey
Copy link
Contributor Author

btracey commented May 14, 2017

Here are the current stats. Mostly better

DaxpySmallBothUnitary-8   19.5ns ± 1%  21.5ns ± 1%  +10.27%  (p=0.008 n=5+5)
DaxpySmallIncUni-8        25.9ns ± 1%  28.4ns ± 2%   +9.58%  (p=0.008 n=5+5)
DaxpySmallUniInc-8        25.5ns ± 1%  27.7ns ± 2%   +8.62%  (p=0.008 n=5+5)
DaxpySmallBothInc-8       25.1ns ± 1%  27.4ns ± 4%   +8.92%  (p=0.008 n=5+5)
DaxpyMediumBothUnitary-8   865ns ± 1%   861ns ± 1%     ~     (p=0.190 n=5+5)
DaxpyMediumIncUni-8       1.07µs ± 1%  1.08µs ± 2%     ~     (p=0.690 n=5+5)
DaxpyMediumUniInc-8        922ns ± 1%   915ns ± 1%     ~     (p=0.103 n=5+5)
DaxpyMediumBothInc-8      1.14µs ± 1%  1.15µs ± 1%     ~     (p=0.841 n=5+5)
DaxpyLargeBothUnitary-8   87.9µs ± 2%  87.3µs ± 3%     ~     (p=0.151 n=5+5)
DaxpyLargeIncUni-8         173µs ± 2%   174µs ± 2%     ~     (p=1.000 n=5+5)
DaxpyLargeUniInc-8         137µs ± 1%   138µs ± 1%     ~     (p=0.056 n=5+5)
DaxpyLargeBothInc-8        299µs ± 1%   287µs ± 2%   -4.03%  (p=0.008 n=5+5)
DaxpyHugeBothUnitary-8    12.9ms ± 1%  12.9ms ± 0%     ~     (p=0.841 n=5+5)
DaxpyHugeIncUni-8         29.9ms ± 1%  29.9ms ± 3%     ~     (p=0.841 n=5+5)
DaxpyHugeUniInc-8         28.8ms ± 1%  28.7ms ± 0%     ~     (p=0.111 n=5+4)
DaxpyHugeBothInc-8        46.1ms ± 2%  45.8ms ± 1%     ~     (p=0.151 n=5+5)
DgemmSmSmSm-8             1.73µs ± 1%  1.73µs ± 2%     ~     (p=0.690 n=5+5)
DgemmMedMedMed-8           469µs ± 2%   469µs ± 1%     ~     (p=0.841 n=5+5)
DgemmMedLgMed-8           3.20ms ± 2%  3.39ms ±14%     ~     (p=0.310 n=5+5)
DgemmLgLgLg-8              313ms ± 1%   317ms ± 1%   +1.58%  (p=0.008 n=5+5)
DgemmLgSmLg-8             4.85ms ± 2%  4.88ms ± 1%     ~     (p=0.690 n=5+5)
DgemmLgLgSm-8             3.38ms ± 3%  3.39ms ± 2%     ~     (p=0.841 n=5+5)
DgemmHgHgSm-8              299ms ± 2%   297ms ± 4%     ~     (p=0.690 n=5+5)
DgemmMedMedMedTNT-8        486µs ± 1%   483µs ± 4%     ~     (p=0.310 n=5+5)
DgemmMedMedMedNTT-8        448µs ± 1%   453µs ± 4%     ~     (p=0.690 n=5+5)
DgemmMedMedMedTT-8         556µs ± 1%   576µs ± 2%   +3.72%  (p=0.008 n=5+5)
DgemvSmSmNoTransInc1-8     168ns ± 1%   181ns ± 1%   +7.98%  (p=0.008 n=5+5)
DgemvSmSmNoTransIncN-8     207ns ± 1%   213ns ± 1%   +2.89%  (p=0.008 n=5+5)
DgemvSmSmTransInc1-8       188ns ± 1%   204ns ± 1%   +8.61%  (p=0.008 n=5+5)
DgemvSmSmTransIncN-8       222ns ± 1%   244ns ± 0%   +9.81%  (p=0.008 n=5+5)
DgemvMedMedNoTransInc1-8  9.43µs ± 1%  9.59µs ± 1%   +1.75%  (p=0.008 n=5+5)
DgemvMedMedNoTransIncN-8  10.6µs ± 2%  11.0µs ± 3%   +3.23%  (p=0.008 n=5+5)
DgemvMedMedTransInc1-8    10.0µs ± 2%   9.9µs ± 0%   -1.24%  (p=0.032 n=5+5)
DgemvMedMedTransIncN-8    11.2µs ± 0%  11.5µs ± 1%   +2.45%  (p=0.008 n=5+5)
DgemvLgLgNoTransInc1-8     909µs ± 1%   908µs ± 1%     ~     (p=0.690 n=5+5)
DgemvLgLgNoTransIncN-8     929µs ± 2%   929µs ± 1%     ~     (p=0.548 n=5+5)
DgemvLgLgTransInc1-8       920µs ± 1%   920µs ± 1%     ~     (p=1.000 n=5+5)
DgemvLgLgTransIncN-8       966µs ± 1%   977µs ± 1%   +1.19%  (p=0.016 n=5+5)
DgemvLgSmNoTransInc1-8    13.7µs ± 1%  14.7µs ± 1%   +6.91%  (p=0.008 n=5+5)
DgemvLgSmNoTransIncN-8    18.0µs ± 1%  18.0µs ± 1%     ~     (p=0.841 n=5+5)
DgemvLgSmTransInc1-8      15.2µs ± 1%  16.3µs ± 1%   +7.52%  (p=0.008 n=5+5)
DgemvLgSmTransIncN-8      18.4µs ± 1%  19.7µs ± 1%   +7.27%  (p=0.008 n=5+5)
DgemvSmLgNoTransInc1-8    8.93µs ± 1%  8.71µs ± 3%     ~     (p=0.056 n=5+5)
DgemvSmLgNoTransIncN-8    9.07µs ± 1%  9.03µs ± 1%     ~     (p=0.548 n=5+5)
DgemvSmLgTransInc1-8      9.59µs ± 1%  9.55µs ± 0%     ~     (p=0.341 n=5+5)
DgemvSmLgTransIncN-8      10.4µs ± 1%  10.3µs ± 1%     ~     (p=0.421 n=5+5)

@josharian
Copy link
Contributor

But a few non-trivial regressions still. If you're up for it, would you mind patching in CLs 43293 and 43491 and see whether they help any? (No particular reason to think they will, other than that they have a significant effect on code layout in general.)

@btracey
Copy link
Contributor Author

btracey commented May 14, 2017

With both applied:

name                      old time/op  new time/op   delta
DaxpySmallBothUnitary-8   19.1ns ± 1%   19.7ns ± 1%   +3.03%  (p=0.008 n=5+5)
DaxpySmallIncUni-8        25.3ns ± 1%   28.8ns ± 1%  +13.65%  (p=0.008 n=5+5)
DaxpySmallUniInc-8        25.1ns ± 1%   25.7ns ± 1%   +2.47%  (p=0.008 n=5+5)
DaxpySmallBothInc-8       24.7ns ± 1%   26.5ns ± 1%   +7.19%  (p=0.008 n=5+5)
DaxpyMediumBothUnitary-8   846ns ± 1%    858ns ± 3%     ~     (p=0.222 n=5+5)
DaxpyMediumIncUni-8       1.05µs ± 1%   1.22µs ±11%  +16.51%  (p=0.008 n=5+5)
DaxpyMediumUniInc-8        902ns ± 1%   1134ns ± 1%  +25.75%  (p=0.008 n=5+5)
DaxpyMediumBothInc-8      1.13µs ± 1%   1.19µs ± 3%   +5.80%  (p=0.008 n=5+5)
DaxpyLargeBothUnitary-8   85.8µs ± 1%   93.2µs ± 4%   +8.69%  (p=0.016 n=4+5)
DaxpyLargeIncUni-8         164µs ± 1%    179µs ± 2%   +8.89%  (p=0.008 n=5+5)
DaxpyLargeUniInc-8         134µs ± 0%    141µs ± 1%   +4.79%  (p=0.008 n=5+5)
DaxpyLargeBothInc-8        288µs ± 2%    287µs ± 2%     ~     (p=1.000 n=5+5)
DaxpyHugeBothUnitary-8    12.8ms ± 1%   12.8ms ± 1%     ~     (p=0.421 n=5+5)
DaxpyHugeIncUni-8         29.4ms ± 1%   29.3ms ± 1%     ~     (p=0.310 n=5+5)
DaxpyHugeUniInc-8         28.5ms ± 1%   28.4ms ± 1%     ~     (p=0.421 n=5+5)
DaxpyHugeBothInc-8        45.5ms ± 2%   45.5ms ± 1%     ~     (p=0.841 n=5+5)
DgemmSmSmSm-8             1.70µs ± 2%   1.66µs ± 1%   -2.55%  (p=0.008 n=5+5)
DgemmMedMedMed-8           467µs ± 3%    471µs ± 3%     ~     (p=0.421 n=5+5)
DgemmMedLgMed-8           3.16ms ± 3%   3.20ms ± 3%     ~     (p=0.151 n=5+5)
DgemmLgLgLg-8              315ms ± 2%    314ms ± 0%     ~     (p=0.730 n=5+4)
DgemmLgSmLg-8             4.87ms ± 5%   4.81ms ± 1%     ~     (p=0.690 n=5+5)
DgemmLgLgSm-8             3.37ms ± 3%   3.33ms ± 1%     ~     (p=0.222 n=5+5)
DgemmHgHgSm-8              298ms ± 3%    299ms ± 5%     ~     (p=1.000 n=5+5)
DgemmMedMedMedTNT-8        485µs ± 1%    492µs ± 2%     ~     (p=0.095 n=5+5)
DgemmMedMedMedNTT-8        452µs ± 1%    448µs ± 2%     ~     (p=0.310 n=5+5)
DgemmMedMedMedTT-8         561µs ± 3%    603µs ± 2%   +7.56%  (p=0.008 n=5+5)
DgemvSmSmNoTransInc1-8     166ns ± 1%    168ns ± 1%   +1.69%  (p=0.032 n=5+5)
DgemvSmSmNoTransIncN-8     205ns ± 1%    202ns ± 1%   -1.27%  (p=0.032 n=5+5)
DgemvSmSmTransInc1-8       186ns ± 1%    182ns ± 1%   -2.15%  (p=0.008 n=5+5)
DgemvSmSmTransIncN-8       219ns ± 1%    220ns ± 0%     ~     (p=0.175 n=5+4)
DgemvMedMedNoTransInc1-8  9.35µs ± 2%   9.36µs ± 0%     ~     (p=0.690 n=5+5)
DgemvMedMedNoTransIncN-8  10.4µs ± 0%   10.0µs ± 1%   -4.68%  (p=0.008 n=5+5)
DgemvMedMedTransInc1-8    9.82µs ± 0%   9.77µs ± 1%   -0.52%  (p=0.032 n=5+5)
DgemvMedMedTransIncN-8    11.0µs ± 1%   12.4µs ± 1%  +12.07%  (p=0.008 n=5+5)
DgemvLgLgNoTransInc1-8     893µs ± 1%    888µs ± 1%     ~     (p=0.421 n=5+5)
DgemvLgLgNoTransIncN-8     908µs ± 1%    904µs ± 0%     ~     (p=0.151 n=5+5)
DgemvLgLgTransInc1-8       905µs ± 1%    898µs ± 1%     ~     (p=0.056 n=5+5)
DgemvLgLgTransIncN-8       951µs ± 0%   1158µs ± 1%  +21.80%  (p=0.008 n=5+5)
DgemvLgSmNoTransInc1-8    13.5µs ± 2%   13.4µs ± 1%     ~     (p=0.690 n=5+5)
DgemvLgSmNoTransIncN-8    17.8µs ± 2%   17.1µs ± 1%   -3.83%  (p=0.008 n=5+5)
DgemvLgSmTransInc1-8      15.0µs ± 0%   14.6µs ± 1%   -2.41%  (p=0.008 n=5+5)
DgemvLgSmTransIncN-8      18.0µs ± 1%   18.0µs ± 0%     ~     (p=0.738 n=5+5)
DgemvSmLgNoTransInc1-8    8.49µs ± 1%   8.44µs ± 1%     ~     (p=0.381 n=5+5)
DgemvSmLgNoTransIncN-8    8.56µs ± 1%   8.56µs ± 3%     ~     (p=0.690 n=5+5)
DgemvSmLgTransInc1-8      9.02µs ± 0%   9.11µs ± 2%   +0.91%  (p=0.016 n=5+5)
DgemvSmLgTransIncN-8      9.84µs ± 1%  12.11µs ± 2%  +23.10%  (p=0.008 n=5+5)
```

@josharian
Copy link
Contributor

Thanks. Looks like those CLs are a mixed bag, maybe a bit worse on balance. To confirm, are these latest from 1.8 to tip+2 CLs? I ask because the before numbers e.g. for the last benchmark don't line up—10.4us two comments back vs 9.84us in the last comment. Iff you have all the benchmark files around, could you post the benchstat going from tip to tip+2CLs, maybe running with -geomean? Ta.

@btracey
Copy link
Contributor Author

btracey commented May 15, 2017

It was 1.8.1 to the most recent. I'm running the benchmarks on my laptop (which I've been using all day), so the benchmarks could shift a bit. This is why I re-ran the 1.8.1 results (and why they don't exactly match up).

-geomean doesn't do what I expected. It prints the exact same output as before, except with a [Geo mean] 41.0µs 42.2µs +2.94% at the bottom (I ran benchstat -geomean blasbench181.txt blasbenchtip.txt , using the latest from golang.org/x/perf/cmd/benchstat). It seems like it's taking the geometric mean of all the runs rather than each benchmark individually.

@randall77
Copy link
Contributor

Punting to 1.10.

@randall77 randall77 modified the milestones: Go1.10, Go1.9 Jun 6, 2017
@mdempsky mdempsky removed this from the Go1.10 milestone Nov 30, 2017
@mdempsky mdempsky added this to the Go1.11 milestone Nov 30, 2017
@bradfitz bradfitz modified the milestones: Go1.11, Unreleased Jun 13, 2018
@agnivade
Copy link
Contributor

agnivade commented Feb 1, 2019

@btracey - Comparing 1.8.7 with 1.12beta1, I see improvements

benchstat 1.8.txt 1.12.txt 
name                      old time/op  new time/op  delta
DaxpySmallBothUnitary-4   25.4ns ± 1%  23.0ns ± 1%  -9.52%  (p=0.008 n=5+5)
DaxpySmallIncUni-4        25.3ns ± 1%  26.0ns ± 1%  +2.61%  (p=0.008 n=5+5)
DaxpySmallUniInc-4        25.7ns ± 1%  25.3ns ± 2%    ~     (p=0.056 n=5+5)
DaxpySmallBothInc-4       25.8ns ± 1%  25.6ns ± 1%    ~     (p=0.373 n=5+5)
DaxpyMediumBothUnitary-4   972ns ± 0%   909ns ± 1%  -6.52%  (p=0.008 n=5+5)
DaxpyMediumIncUni-4       1.18µs ± 1%  1.18µs ± 1%    ~     (p=0.897 n=5+5)
DaxpyMediumUniInc-4       1.16µs ± 1%  1.15µs ± 1%  -1.08%  (p=0.032 n=5+5)
DaxpyMediumBothInc-4      1.36µs ± 1%  1.37µs ± 1%    ~     (p=0.381 n=5+5)
DaxpyLargeBothUnitary-4    105µs ± 0%   101µs ± 1%  -4.37%  (p=0.008 n=5+5)
DaxpyLargeIncUni-4         294µs ± 5%   278µs ± 1%  -5.58%  (p=0.008 n=5+5)
DaxpyLargeUniInc-4         206µs ± 0%   186µs ± 3%  -9.40%  (p=0.008 n=5+5)
DaxpyLargeBothInc-4        491µs ± 1%   451µs ± 0%  -8.16%  (p=0.008 n=5+5)
DaxpyHugeBothUnitary-4    14.3ms ± 0%  14.3ms ± 1%    ~     (p=0.310 n=5+5)
DaxpyHugeIncUni-4         33.3ms ± 0%  33.0ms ± 0%  -0.92%  (p=0.008 n=5+5)
DaxpyHugeUniInc-4         30.4ms ± 0%  30.6ms ± 0%  +0.83%  (p=0.008 n=5+5)
DaxpyHugeBothInc-4        49.6ms ± 0%  49.6ms ± 1%    ~     (p=0.690 n=5+5)

Kindly check and let us know if there is anything else to be done here. Thanks.

@agnivade agnivade added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Feb 1, 2019
@gopherbot
Copy link
Contributor

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

@golang golang locked and limited conversation to collaborators Feb 29, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge Performance WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided.
Projects
None yet
Development

No branches or pull requests

8 participants