runtime: >100ms mark termination times #10898
It appears we're entering mark termination when there's still quite a bit of work on the work queues. Curiously, the work isn't trapped in the per-P work cache (which was what I originally suspected), because there's only one P and hence only one background mark worker, which disposes of its per-P cache when it ends the concurrent mark phase. Rather, it's on the global work list:

entering mark termination: ...
The pending work isn't the problem. (It's almost certainly all recent allocations and write barriers for publishing those recent allocations, which means the pending pointers have almost no depth. That is, there's no choke point in the work queue, so while there is pending work, it will be scanned in almost no time during mark termination.)

The problem is that this program has a G with a ~70MB stack, which takes >150ms to scan. We see this cost both during the concurrent scan phase and during the re-scan in mark termination (which is why the two durations closely align in each cycle). This probably also explains the 2x slowdown compared to 1.4: this program spends most of its time in GC, and each GC cycle is doing roughly twice as much work as it did in 1.4 because of the stack re-scan.

I'm going to try implementing stack barriers to address this. While this example is just a toy program, it's easy to imagine real programs that have large stacks like this, and right now we can't deliver on our low-latency promise when active goroutines have large stacks. Assuming only the top of the stack changes during the concurrent GC phases, stack barriers should drastically reduce the stack re-scan time during mark termination.
I don't think we can assume that only the top of the stack changes. For example, the following contrived program can make the stack grow:

package main

func F(x **int, i int) {
...

func main() {
...

And -gcflags -m confirms that x in main is allocated on the stack:

command-line-arguments
./stk.go:3: moved to heap: i
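A minimal sketch of the kind of program being described here — deep recursion that keeps a pointer into main's stack frame and would produce the `moved to heap: i` report quoted above. The body of F, the recursion depth, and the use of println are assumptions for illustration, not the original listing:

```go
package main

// F recurses i times, so the stack grows by one frame per call.
// At the bottom of the recursion it stores &i through x, i.e. it
// writes through an up-pointer into main's frame near the bottom
// of the stack. Escape analysis moves i to the heap ("moved to
// heap: i"), while x itself stays on main's stack.
func F(x **int, i int) {
	if i == 0 {
		*x = &i
		return
	}
	F(x, i-1)
}

func main() {
	var x *int
	F(&x, 1000000) // depth chosen arbitrarily to force a large stack
	println(*x)
}
```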
If you have multi-megabyte stacks and you've gone to the trouble of threading a stack pointer through 50 megabytes worth of stack, then, well, there's only so much we can do to reduce the stack scan cost. Don't do that. But otherwise, stack barriers will dramatically reduce the cost of scanning large stacks.
I guess my question is why we need to treat stacks separately and rescan them in the mark termination step. Writes to slots in the stack still have write barriers, so why are they any different from normal heap objects? Surely I'm missing something, so thanks for the explanations in advance.
Writes to stack slots do not have write barriers. That's why they need to be rescanned.
Only writes to the current frame might be missing write barriers. Writes to non-current frames still have write barriers (because the compiler can't assume incoming pointer arguments point to the stack).
Ahh, I see, thanks. I was puzzled by my example program, where the write to *x has a write barrier because the compiler doesn't know whether it's a stack slot or a heap slot.

If the huge stack is not just a single frame, then I think we can be sure that writes to bottom stack slots are using write barriers, so perhaps we do have a way to know whether we need to scan the whole stack during mark termination.
Yeah, I also think that we only need to scan the current (topmost) frame in mark termination. This should fix the problem nicely (unless the user code has a huge frame on top, but then we can't do anything about it unless we add write barriers for stack writes).
You still need to re-scan all of the frames that were created since the concurrent scan. Any of those frames can still contain a pointer to a white heap object. The stack barriers I'm implementing will tell the GC which set of frames have been created since the first scan of the stack and hence need to be rescanned.

The fact that writes to up-pointers generate write barriers may simplify things, though. I was thinking I could only install stack barriers at frames where there are no pointers to higher frames (which is easy to compute, but does restrict where stack barriers can be installed, especially in examples like minux's). But if there are write barriers for these (and we don't simply ignore those write barriers like we do now), then I could install stack barriers wherever I want and use the write barrier to change the watermark if there's a write to a higher frame via a pointer.
Actually, the fact that we generate write barriers for writes to up-pointers is incredibly useful. I was wrong that we ignore these write barriers: the slot may be in the stack, but if it's a write barrier we care about, the pointer will point to the heap and we'll grey the object. This is exactly what we need to handle up-pointers. Hence, we can put stack barriers between any two frames on the stack without worrying about up-pointers.
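To make the interplay described in the last two comments concrete, here is a small, self-contained toy model — not the runtime's actual implementation or data structures; every name in it (frame, shade, scanFrames, the barrier index) is invented for illustration. It models a stack as a slice of frames, scans it once, remembers a barrier at the scanned top, lets the mutator push new frames and write through an up-pointer (which, as noted above, goes through a write barrier that greys the heap object), and then re-scans only the frames created since the first scan:

```go
package main

import "fmt"

// A frame maps slot names to heap object IDs ("" means nil).
type frame map[string]string

// shaded records which heap objects the GC has greyed.
var shaded = map[string]bool{}

func shade(obj string) {
	if obj != "" {
		shaded[obj] = true
	}
}

// scanFrames greys every object referenced from the given frames.
func scanFrames(frames []frame) {
	for _, f := range frames {
		for _, obj := range f {
			shade(obj)
		}
	}
}

func main() {
	// The stack grows by appending; stack[0] is the oldest frame.
	stack := []frame{{"a": "obj1"}, {"b": "obj2"}}

	// Concurrent scan: scan the whole stack and remember a barrier
	// position at the current top of the stack.
	scanFrames(stack)
	barrier := len(stack)

	// The mutator keeps running: it pushes a new frame (above the
	// barrier, so it will be re-scanned later) ...
	stack = append(stack, frame{"c": "obj3"})

	// ... and writes a new heap pointer into an *older* frame through
	// an up-pointer. That store goes through a write barrier, so the
	// pointee is shaded even though its slot sits below the barrier
	// and will not be re-scanned.
	stack[0]["a"] = "obj4"
	shade("obj4") // what the compiler-emitted write barrier does

	// Mark termination: re-scan only the frames created since the
	// concurrent scan, i.e. those at or above the barrier.
	scanFrames(stack[barrier:])

	fmt.Println(shaded)
}
```

Running this prints all four objects as grey, even though the old frames below the barrier are never re-scanned.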
I think we are on the same page but just to make sure. A stack barrier ... where all frames older than that location were scanned during ...
Yes, it sounds like we're on the same page.
CL https://golang.org/cl/10314 mentions this issue.

CL https://golang.org/cl/13614 mentions this issue.
The cost of scanning large stacks is currently dominated by the time spent looking up and decoding the pcvalue table. However, large stacks are usually large not because they contain calls to many different functions, but because they contain many calls to the same, small set of recursive functions. Hence, walking large stacks tends to make the same pcvalue queries many times.

Based on this observation, this commit adds a small, very simple, and fast cache in front of pcvalue lookup. We thread this cache down from operations that make many pcvalue calls, such as gentraceback, stack scanning, and stack adjusting.

This simple cache works well because it has minimal overhead when it's not effective. I also tried a hashed direct-map cache, CLOCK-based replacement, round-robin replacement, and round-robin with lookups disabled until there had been at least 16 probes, but none of these approaches had obvious wins over the random replacement policy in this commit.

This nearly doubles the overall performance of the deep stack test program from issue #10898:

name old time/op new time/op delta
Issue10898 16.5s ±12% 9.2s ±12% -44.37% (p=0.008 n=5+5)

It's a very slight win on the garbage benchmark:

name old time/op new time/op delta
XBenchGarbage-12 4.92ms ± 1% 4.89ms ± 1% -0.75% (p=0.000 n=18+19)

It's a wash (but doesn't harm performance) on the go1 benchmarks, which don't have particularly deep stacks:

name old time/op new time/op delta
BinaryTree17-12 3.11s ± 2% 3.20s ± 3% +2.83% (p=0.000 n=17+20)
Fannkuch11-12 2.51s ± 1% 2.51s ± 1% -0.22% (p=0.034 n=19+18)
FmtFprintfEmpty-12 50.8ns ± 3% 50.6ns ± 2% ~ (p=0.793 n=20+20)
FmtFprintfString-12 174ns ± 0% 174ns ± 1% +0.17% (p=0.048 n=15+20)
FmtFprintfInt-12 177ns ± 0% 165ns ± 1% -6.99% (p=0.000 n=17+19)
FmtFprintfIntInt-12 283ns ± 1% 284ns ± 0% +0.22% (p=0.000 n=18+15)
FmtFprintfPrefixedInt-12 243ns ± 1% 244ns ± 1% +0.40% (p=0.000 n=20+19)
FmtFprintfFloat-12 318ns ± 0% 319ns ± 0% +0.27% (p=0.001 n=19+20)
FmtManyArgs-12 1.12µs ± 0% 1.14µs ± 0% +1.74% (p=0.000 n=19+20)
GobDecode-12 8.69ms ± 0% 8.73ms ± 1% +0.46% (p=0.000 n=18+18)
GobEncode-12 6.64ms ± 1% 6.61ms ± 1% -0.46% (p=0.000 n=20+20)
Gzip-12 323ms ± 2% 319ms ± 1% -1.11% (p=0.000 n=20+20)
Gunzip-12 42.8ms ± 0% 42.9ms ± 0% ~ (p=0.158 n=18+20)
HTTPClientServer-12 63.3µs ± 1% 63.1µs ± 1% -0.35% (p=0.011 n=20+20)
JSONEncode-12 16.9ms ± 1% 17.3ms ± 1% +2.84% (p=0.000 n=19+20)
JSONDecode-12 59.7ms ± 0% 58.5ms ± 0% -2.05% (p=0.000 n=19+17)
Mandelbrot200-12 3.92ms ± 0% 3.91ms ± 0% -0.16% (p=0.003 n=19+19)
GoParse-12 3.79ms ± 2% 3.75ms ± 2% -0.91% (p=0.005 n=20+20)
RegexpMatchEasy0_32-12 102ns ± 1% 101ns ± 1% -0.80% (p=0.001 n=14+20)
RegexpMatchEasy0_1K-12 337ns ± 1% 346ns ± 1% +2.90% (p=0.000 n=20+19)
RegexpMatchEasy1_32-12 84.4ns ± 2% 84.3ns ± 2% ~ (p=0.743 n=20+20)
RegexpMatchEasy1_1K-12 502ns ± 1% 505ns ± 0% +0.64% (p=0.000 n=20+20)
RegexpMatchMedium_32-12 133ns ± 1% 132ns ± 1% -0.85% (p=0.000 n=20+19)
RegexpMatchMedium_1K-12 40.1µs ± 1% 39.8µs ± 1% -0.77% (p=0.000 n=18+18)
RegexpMatchHard_32-12 2.08µs ± 1% 2.07µs ± 1% -0.55% (p=0.001 n=18+19)
RegexpMatchHard_1K-12 62.4µs ± 1% 62.0µs ± 1% -0.74% (p=0.000 n=19+19)
Revcomp-12 545ms ± 2% 545ms ± 3% ~ (p=0.771 n=19+20)
Template-12 73.7ms ± 1% 72.0ms ± 0% -2.33% (p=0.000 n=20+18)
TimeParse-12 358ns ± 1% 351ns ± 1% -2.07% (p=0.000 n=20+20)
TimeFormat-12 369ns ± 1% 356ns ± 0% -3.53% (p=0.000 n=20+18)
[Geo mean] 63.5µs 63.2µs -0.41%

name old speed new speed delta
GobDecode-12 88.3MB/s ± 0% 87.9MB/s ± 0% -0.43% (p=0.000 n=18+17)
GobEncode-12 116MB/s ± 1% 116MB/s ± 1% +0.47% (p=0.000 n=20+20)
Gzip-12 60.2MB/s ± 2% 60.8MB/s ± 1% +1.13% (p=0.000 n=20+20)
Gunzip-12 453MB/s ± 0% 453MB/s ± 0% ~ (p=0.160 n=18+20)
JSONEncode-12 115MB/s ± 1% 112MB/s ± 1% -2.76% (p=0.000 n=19+20)
JSONDecode-12 32.5MB/s ± 0% 33.2MB/s ± 0% +2.09% (p=0.000 n=19+17)
GoParse-12 15.3MB/s ± 2% 15.4MB/s ± 2% +0.92% (p=0.004 n=20+20)
RegexpMatchEasy0_32-12 311MB/s ± 1% 314MB/s ± 1% +0.78% (p=0.000 n=15+19)
RegexpMatchEasy0_1K-12 3.04GB/s ± 1% 2.95GB/s ± 1% -2.90% (p=0.000 n=19+19)
RegexpMatchEasy1_32-12 379MB/s ± 2% 380MB/s ± 2% ~ (p=0.779 n=20+20)
RegexpMatchEasy1_1K-12 2.04GB/s ± 1% 2.02GB/s ± 0% -0.62% (p=0.000 n=20+20)
RegexpMatchMedium_32-12 7.46MB/s ± 1% 7.53MB/s ± 1% +0.86% (p=0.000 n=20+19)
RegexpMatchMedium_1K-12 25.5MB/s ± 1% 25.7MB/s ± 1% +0.78% (p=0.000 n=18+18)
RegexpMatchHard_32-12 15.4MB/s ± 1% 15.5MB/s ± 1% +0.62% (p=0.000 n=19+19)
RegexpMatchHard_1K-12 16.4MB/s ± 1% 16.5MB/s ± 1% +0.82% (p=0.000 n=20+19)
Revcomp-12 466MB/s ± 2% 466MB/s ± 3% ~ (p=0.765 n=19+20)
Template-12 26.3MB/s ± 1% 27.0MB/s ± 0% +2.38% (p=0.000 n=20+18)
[Geo mean] 97.8MB/s 98.0MB/s +0.23%

Change-Id: I281044ae0b24990ba46487cacbc1069493274bc4
Reviewed-on: https://go-review.googlesource.com/13614
Reviewed-by: Keith Randall <[email protected]>
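The commit message above describes a small, fixed-size cache with random replacement placed in front of an expensive lookup. As a rough illustration of that pattern — not the runtime's actual pcvalue code; the cache size, field names, and the stub lookup are all invented — such a cache might look like this:

```go
package main

import (
	"fmt"
	"math/rand"
)

// slowLookup stands in for decoding a pcvalue table entry: expensive,
// but frequently called with the same key when walking deep stacks
// made of calls to the same few recursive functions.
func slowLookup(pc uintptr) int32 {
	return int32(pc % 97) // stub; pretend this decodes a table
}

// pcCache is a tiny fixed-size cache with random replacement, in the
// spirit of the commit message above.
type pcCache struct {
	entries [16]struct {
		pc  uintptr
		val int32
		ok  bool
	}
}

func (c *pcCache) lookup(pc uintptr) int32 {
	for i := range c.entries {
		if e := &c.entries[i]; e.ok && e.pc == pc {
			return e.val // hit: skip the expensive decode
		}
	}
	v := slowLookup(pc)
	// Random replacement: cheap, and good enough when the working set
	// (the handful of recursive functions on a deep stack) is small.
	e := &c.entries[rand.Intn(len(c.entries))]
	e.pc, e.val, e.ok = pc, v, true
	return v
}

func main() {
	var c pcCache
	// A deep stack built from a few recursive functions queries the
	// same handful of PCs over and over, so most lookups hit the cache.
	for i := 0; i < 1000; i++ {
		_ = c.lookup(uintptr(0x4010a0 + 16*(i%4)))
	}
	fmt.Println("done")
}
```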
The program posted in https://groups.google.com/d/msg/golang-dev/_qY3IugLlAc/5ZIFjYHTHUwJ experiences very long mark termination times:
$ GODEBUG=gctrace=1 goscheme test.scm
gc #1 @0.240s 37%: 0+37+18+9+43 ms clock, 0+37+0+0/9/0+43 ms cpu, 4->5->2 MB, 4 MB goal, 1 P
gc #2 @0.514s 48%: 0+69+88+14+73 ms clock, 0+69+0+0/14/0+73 ms cpu, 4->5->3 MB, 4 MB goal, 1 P
gc #3 @0.756s 59%: 0+88+12+19+97 ms clock, 0+88+0+0/19/0+97 ms cpu, 4->6->4 MB, 5 MB goal, 1 P
gc #4 @1.107s 66%: 0+111+14+60+144 ms clock, 0+111+0+9/20/0+144 ms cpu, 5->8->6 MB, 7 MB goal, 1 P
gc #5 @1.664s 65%: 0+148+13+60+160 ms clock, 0+148+0+14/20/0+160 ms cpu, 6->10->7 MB, 8 MB goal, 1 P
gc #6 @2.166s 70%: 0+188+12+60+205 ms clock, 0+188+0+23/20/0+205 ms cpu, 7->12->9 MB, 8 MB goal, 1 P
gc #7 @2.812s 73%: 0+239+14+80+258 ms clock, 0+239+0+35/20/0+258 ms cpu, 14->16->12 MB, 15 MB goal, 1 P
gc #8 @3.934s 70%: 0+316+11+82+330 ms clock, 0+316+0+47/21/0+330 ms cpu, 19->20->14 MB, 19 MB goal, 1 P
It is also ~2x slower than it was on Go 1.4, which may be related.
@RLH