Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: >100ms mark termination times #10898

Closed
aclements opened this issue May 18, 2015 · 15 comments
Closed

runtime: >100ms mark termination times #10898

aclements opened this issue May 18, 2015 · 15 comments
Milestone

Comments

@aclements
Copy link
Member

The program posted in https://groups.google.com/d/msg/golang-dev/_qY3IugLlAc/5ZIFjYHTHUwJ experiences very long mark termination times:

$ GODEBUG=gctrace=1 goscheme test.scm
gc #1 @0.240s 37%: 0+37+18+9+43 ms clock, 0+37+0+0/9/0+43 ms cpu, 4->5->2 MB, 4 MB goal, 1 P
gc #2 @0.514s 48%: 0+69+88+14+73 ms clock, 0+69+0+0/14/0+73 ms cpu, 4->5->3 MB, 4 MB goal, 1 P
gc #3 @0.756s 59%: 0+88+12+19+97 ms clock, 0+88+0+0/19/0+97 ms cpu, 4->6->4 MB, 5 MB goal, 1 P
gc #4 @1.107s 66%: 0+111+14+60+144 ms clock, 0+111+0+9/20/0+144 ms cpu, 5->8->6 MB, 7 MB goal, 1 P
gc #5 @1.664s 65%: 0+148+13+60+160 ms clock, 0+148+0+14/20/0+160 ms cpu, 6->10->7 MB, 8 MB goal, 1 P
gc #6 @2.166s 70%: 0+188+12+60+205 ms clock, 0+188+0+23/20/0+205 ms cpu, 7->12->9 MB, 8 MB goal, 1 P
gc #7 @2.812s 73%: 0+239+14+80+258 ms clock, 0+239+0+35/20/0+258 ms cpu, 14->16->12 MB, 15 MB goal, 1 P
gc #8 @3.934s 70%: 0+316+11+82+330 ms clock, 0+316+0+47/21/0+330 ms cpu, 19->20->14 MB, 19 MB goal, 1 P

It is also ~2x slower than it was on Go 1.4, which may be related.

@RLH

@aclements aclements added this to the Go1.5 milestone May 18, 2015
@aclements
Copy link
Member Author

It appears we're entering mark termination when there's still quite a bit of work on the work queues. Curiously, the work isn't trapped in the per-P work cache (which was what I originally suspected) because there's only one P and hence only one background mark worker, which disposes its per-P cache when it ends the concurrent mark phase. Rather, it's on the global work list.

entering mark termination:
P 0 cache => nil wbuf
full wbuf list: 24+24+24+24+24+24+24+24+24+24+24+24=288
partial wbuf list: 10=10
gc #14 @11.172s 69%: 0+578+13+120+595 ms clock, 0+578+0+74/37/0+595 ms cpu, 16->87->37 MB, 16 MB goal, 1 P

@aclements
Copy link
Member Author

The pending work isn't the problem. (It's almost certainly all recent allocations and write barriers for publishing these recent allocations, which means the pending pointers have almost no depth. That is, there's no choke-point in the work queue, so while there is pending work, it will be scanned in almost no time during mark termination.)

The problem is that this program has a G with a ~70MB stack, which takes >150ms to scan. We see this cost both during the concurrent scan phase and during the re-scan in mark termination (which is why the two durations closely align in each cycle). This probably also explains the 2x slowdown compared to 1.4: this program spends most of its time in GC, and each GC cycle is doing roughly twice as much as it did in 1.4 because of the stack re-scan.

I'm going to try implementing stack barriers to address this. While this example is just a toy program, it's easy to imagine that there are real programs that have large stacks like this and right now we can't deliver on our low-latency promise when active goroutines have large stacks. Assuming only the top of the stack changes during the concurrent GC phases, this should drastically reduce the stack re-scan time during mark termination.

@aclements aclements self-assigned this May 18, 2015
@minux
Copy link
Member

minux commented May 18, 2015

I don't think we can assume that only top of stack changes (it might be
true today due to deficiencies in escape analysis).

For example, the following contrived program can make the stack grow
arbitrarily large, yet the topmost frame can write to slot in the bottom
most frame.

package main

func F(x **int, i int) {
if i == 0 {
*x = &i
return
}
F(x, i-1)
}

func main() {
var x *int
F(&x, 10000)
}

And -gcflags -m confirms that x in main is allocated on stack, so this
problem is not theoretical.
$ go build -gcflags -m stk.go

command-line-arguments

./stk.go:3: moved to heap: i
./stk.go:5: &i escapes to heap
./stk.go:3: F x does not escape
./stk.go:13: main &x does not escape

@aclements
Copy link
Member Author

If you have multi-megabyte stacks and you've gone to the trouble of threading a stack pointer through 50 megabytes worth of stack, then, well, there's only so much we can do to reduce the stack scan cost. Don't do that. But otherwise, stack barriers will dramatically reduce the cost of scanning large stacks.

@minux
Copy link
Member

minux commented May 18, 2015 via email

@aclements
Copy link
Member Author

Writes to stack slots do not have write barriers. That's why they need to be rescanned.

@randall77
Copy link
Contributor

Only writes to the current frame might be missing write barriers. Writes to the non-current frame still have write barriers (because the compiler can't assume incoming pointer arguments are to the stack).
I'm not sure that helps any, but worth considering.

@minux
Copy link
Member

minux commented May 18, 2015 via email

@minux
Copy link
Member

minux commented May 18, 2015 via email

@aclements
Copy link
Member Author

You still need to re-scan all of the frames that were created since the concurrent scan. Any of those frames can still contain a pointer to a white heap object. The stack barriers I'm implementing will tell the GC which set of frames have been created since the first scan of the stack and hence need to be rescanned.

The fact that writes to up-pointers generate write barriers may simplify things, though. I was thinking I could only install stack barriers at frames where there are no pointers to higher frames (which is easy to compute, but does restrict where stack barriers can be installed, especially in examples like minux's). But if there are write barriers for these (and we don't simply ignore those write barriers like we do now), then I could install stack barriers wherever I want and use the write barrier to change the watermark if there's a write to a higher frame via a pointer.

@aclements
Copy link
Member Author

Actually, the fact that we generate write barriers for writes to up-pointers is incredibly useful. I was wrong that we ignore these write barriers: the slot may be in the stack, but if it's a write barrier we care about, the pointer will point to the heap and we'll grey the object. This is exactly what we need to handle up-pointers. Hence, we can put stack barriers between any two frames on the stack without worrying about up-pointers.

@RLH
Copy link
Contributor

RLH commented May 19, 2015

I think we are on the same page but just to make sure. A stack barrier
notes a location on the stack where all frames older
than the current frame has been scanned.

  1. Mark Termination needs to scan any stack frames created since the last
    time the stack was scanned.
  2. The write barrier is responsible for greying any objects written into a
    slot older than the stack frame barrier.
  3. Greying objects written into slots younger than the stack barrier is not
    needed for correctness but may help performance/scalability.
    Just checking....

On Tue, May 19, 2015 at 11:39 AM, Austin Clements [email protected]
wrote:

Actually, the fact that we generate write barriers for writes to
up-pointers is incredibly useful. I was wrong that we ignore these write
barriers: the slot may be in the stack, but if it's a write barrier we care
about, the pointer will point to the heap and we'll grey the object. This
is exactly what we need to handle up-pointers. Hence, we can put stack
barriers between any two frames on the stack without worrying about
up-pointers.


Reply to this email directly or view it on GitHub
#10898 (comment).

@aclements
Copy link
Member Author

Yes, it sounds like we're on the same page.

On Tue, May 19, 2015 at 12:02 PM, Richard L. Hudson <
[email protected]> wrote:

I think we are on the same page but just to make sure. A stack barrier
notes a location on the stack where all frames older
than the current frame has been scanned.

... where all frames older than that location were scanned during
concurrent scan. (Just being precise.)

  1. Mark Termination needs to scan any stack frames created since the last

time the stack was scanned.
2) The write barrier is responsible for greying any objects written into a
slot older than the stack frame barrier.
3) Greying objects written into slots younger than the stack barrier is not
needed for correctness but may help performance/scalability.
Just checking....

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/10314 mentions this issue.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/13614 mentions this issue.

aclements added a commit that referenced this issue Oct 22, 2015
The cost of scanning large stacks is currently dominated by the time
spent looking up and decoding the pcvalue table. However, large stacks
are usually large not because they contain calls to many different
functions, but because they contain many calls to the same, small set
of recursive functions. Hence, walking large stacks tends to make the
same pcvalue queries many times.

Based on this observation, this commit adds a small, very simple, and
fast cache in front of pcvalue lookup. We thread this cache down from
operations that make many pcvalue calls, such as gentraceback, stack
scanning, and stack adjusting.

This simple cache works well because it has minimal overhead when it's
not effective. I also tried a hashed direct-map cache, CLOCK-based
replacement, round-robin replacement, and round-robin with lookups
disabled until there had been at least 16 probes, but none of these
approaches had obvious wins over the random replacement policy in this
commit.

This nearly doubles the overall performance of the deep stack test
program from issue #10898:

name        old time/op  new time/op  delta
Issue10898   16.5s ±12%    9.2s ±12%  -44.37%  (p=0.008 n=5+5)

It's a very slight win on the garbage benchmark:

name              old time/op  new time/op  delta
XBenchGarbage-12  4.92ms ± 1%  4.89ms ± 1%  -0.75%  (p=0.000 n=18+19)

It's a wash (but doesn't harm performance) on the go1 benchmarks,
which don't have particularly deep stacks:

name                      old time/op    new time/op    delta
BinaryTree17-12              3.11s ± 2%     3.20s ± 3%  +2.83%  (p=0.000 n=17+20)
Fannkuch11-12                2.51s ± 1%     2.51s ± 1%  -0.22%  (p=0.034 n=19+18)
FmtFprintfEmpty-12          50.8ns ± 3%    50.6ns ± 2%    ~     (p=0.793 n=20+20)
FmtFprintfString-12          174ns ± 0%     174ns ± 1%  +0.17%  (p=0.048 n=15+20)
FmtFprintfInt-12             177ns ± 0%     165ns ± 1%  -6.99%  (p=0.000 n=17+19)
FmtFprintfIntInt-12          283ns ± 1%     284ns ± 0%  +0.22%  (p=0.000 n=18+15)
FmtFprintfPrefixedInt-12     243ns ± 1%     244ns ± 1%  +0.40%  (p=0.000 n=20+19)
FmtFprintfFloat-12           318ns ± 0%     319ns ± 0%  +0.27%  (p=0.001 n=19+20)
FmtManyArgs-12              1.12µs ± 0%    1.14µs ± 0%  +1.74%  (p=0.000 n=19+20)
GobDecode-12                8.69ms ± 0%    8.73ms ± 1%  +0.46%  (p=0.000 n=18+18)
GobEncode-12                6.64ms ± 1%    6.61ms ± 1%  -0.46%  (p=0.000 n=20+20)
Gzip-12                      323ms ± 2%     319ms ± 1%  -1.11%  (p=0.000 n=20+20)
Gunzip-12                   42.8ms ± 0%    42.9ms ± 0%    ~     (p=0.158 n=18+20)
HTTPClientServer-12         63.3µs ± 1%    63.1µs ± 1%  -0.35%  (p=0.011 n=20+20)
JSONEncode-12               16.9ms ± 1%    17.3ms ± 1%  +2.84%  (p=0.000 n=19+20)
JSONDecode-12               59.7ms ± 0%    58.5ms ± 0%  -2.05%  (p=0.000 n=19+17)
Mandelbrot200-12            3.92ms ± 0%    3.91ms ± 0%  -0.16%  (p=0.003 n=19+19)
GoParse-12                  3.79ms ± 2%    3.75ms ± 2%  -0.91%  (p=0.005 n=20+20)
RegexpMatchEasy0_32-12       102ns ± 1%     101ns ± 1%  -0.80%  (p=0.001 n=14+20)
RegexpMatchEasy0_1K-12       337ns ± 1%     346ns ± 1%  +2.90%  (p=0.000 n=20+19)
RegexpMatchEasy1_32-12      84.4ns ± 2%    84.3ns ± 2%    ~     (p=0.743 n=20+20)
RegexpMatchEasy1_1K-12       502ns ± 1%     505ns ± 0%  +0.64%  (p=0.000 n=20+20)
RegexpMatchMedium_32-12      133ns ± 1%     132ns ± 1%  -0.85%  (p=0.000 n=20+19)
RegexpMatchMedium_1K-12     40.1µs ± 1%    39.8µs ± 1%  -0.77%  (p=0.000 n=18+18)
RegexpMatchHard_32-12       2.08µs ± 1%    2.07µs ± 1%  -0.55%  (p=0.001 n=18+19)
RegexpMatchHard_1K-12       62.4µs ± 1%    62.0µs ± 1%  -0.74%  (p=0.000 n=19+19)
Revcomp-12                   545ms ± 2%     545ms ± 3%    ~     (p=0.771 n=19+20)
Template-12                 73.7ms ± 1%    72.0ms ± 0%  -2.33%  (p=0.000 n=20+18)
TimeParse-12                 358ns ± 1%     351ns ± 1%  -2.07%  (p=0.000 n=20+20)
TimeFormat-12                369ns ± 1%     356ns ± 0%  -3.53%  (p=0.000 n=20+18)
[Geo mean]                  63.5µs         63.2µs       -0.41%

name                      old speed      new speed      delta
GobDecode-12              88.3MB/s ± 0%  87.9MB/s ± 0%  -0.43%  (p=0.000 n=18+17)
GobEncode-12               116MB/s ± 1%   116MB/s ± 1%  +0.47%  (p=0.000 n=20+20)
Gzip-12                   60.2MB/s ± 2%  60.8MB/s ± 1%  +1.13%  (p=0.000 n=20+20)
Gunzip-12                  453MB/s ± 0%   453MB/s ± 0%    ~     (p=0.160 n=18+20)
JSONEncode-12              115MB/s ± 1%   112MB/s ± 1%  -2.76%  (p=0.000 n=19+20)
JSONDecode-12             32.5MB/s ± 0%  33.2MB/s ± 0%  +2.09%  (p=0.000 n=19+17)
GoParse-12                15.3MB/s ± 2%  15.4MB/s ± 2%  +0.92%  (p=0.004 n=20+20)
RegexpMatchEasy0_32-12     311MB/s ± 1%   314MB/s ± 1%  +0.78%  (p=0.000 n=15+19)
RegexpMatchEasy0_1K-12    3.04GB/s ± 1%  2.95GB/s ± 1%  -2.90%  (p=0.000 n=19+19)
RegexpMatchEasy1_32-12     379MB/s ± 2%   380MB/s ± 2%    ~     (p=0.779 n=20+20)
RegexpMatchEasy1_1K-12    2.04GB/s ± 1%  2.02GB/s ± 0%  -0.62%  (p=0.000 n=20+20)
RegexpMatchMedium_32-12   7.46MB/s ± 1%  7.53MB/s ± 1%  +0.86%  (p=0.000 n=20+19)
RegexpMatchMedium_1K-12   25.5MB/s ± 1%  25.7MB/s ± 1%  +0.78%  (p=0.000 n=18+18)
RegexpMatchHard_32-12     15.4MB/s ± 1%  15.5MB/s ± 1%  +0.62%  (p=0.000 n=19+19)
RegexpMatchHard_1K-12     16.4MB/s ± 1%  16.5MB/s ± 1%  +0.82%  (p=0.000 n=20+19)
Revcomp-12                 466MB/s ± 2%   466MB/s ± 3%    ~     (p=0.765 n=19+20)
Template-12               26.3MB/s ± 1%  27.0MB/s ± 0%  +2.38%  (p=0.000 n=20+18)
[Geo mean]                97.8MB/s       98.0MB/s       +0.23%

Change-Id: I281044ae0b24990ba46487cacbc1069493274bc4
Reviewed-on: https://go-review.googlesource.com/13614
Reviewed-by: Keith Randall <[email protected]>
@golang golang locked and limited conversation to collaborators Sep 22, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants