runtime: >100ms mark termination times #10898
It appears we're entering mark termination when there's still quite a bit of work on the work queues. Curiously, the work isn't trapped in the per-P work cache (which was what I originally suspected), because there's only one P and hence only one background mark worker, which disposes of its per-P cache when it ends the concurrent mark phase. Rather, it's on the global work list:

entering mark termination: ...
The pending work isn't the problem. (It's almost certainly all recent allocations and write barriers for publishing those recent allocations, which means the pending pointers have almost no depth. That is, there's no choke point in the work queue, so while there is pending work, it will be scanned in almost no time during mark termination.)

The problem is that this program has a G with a ~70MB stack, which takes >150ms to scan. We see this cost both during the concurrent scan phase and during the re-scan in mark termination (which is why the two durations closely align in each cycle). This probably also explains the 2x slowdown compared to 1.4: this program spends most of its time in GC, and each GC cycle is doing roughly twice as much work as it did in 1.4 because of the stack re-scan.

I'm going to try implementing stack barriers to address this. While this example is just a toy program, it's easy to imagine real programs that have large stacks like this, and right now we can't deliver on our low-latency promise when active goroutines have large stacks. Assuming only the top of the stack changes during the concurrent GC phases, stack barriers should drastically reduce the stack re-scan time during mark termination.
I don't think we can assume that only the top of the stack changes. For example, the following contrived program can make the stack grow:

package main

func F(x **int, i int) {
...

func main() {
...

And -gcflags -m confirms that x in main is allocated on the stack:

command-line-arguments
./stk.go:3: moved to heap: i
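A minimal sketch of the kind of program being described here — deep recursion that keeps a pointer into main's stack frame and would produce the `moved to heap: i` report quoted above. The body of F, the recursion depth, and the use of println are assumptions for illustration, not the original listing:

```go
package main

// F recurses i times, so the stack grows by one frame per call.
// At the bottom of the recursion it stores &i through x, i.e. it
// writes through an up-pointer into main's frame near the bottom
// of the stack. Escape analysis moves i to the heap ("moved to
// heap: i"), while x itself stays on main's stack.
func F(x **int, i int) {
	if i == 0 {
		*x = &i
		return
	}
	F(x, i-1)
}

func main() {
	var x *int
	F(&x, 1000000) // depth chosen arbitrarily to force a large stack
	println(*x)
}
```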
If you have multi-megabyte stacks and you've gone to the trouble of threading a stack pointer through 50 megabytes worth of stack, then, well, there's only so much we can do to reduce the stack scan cost. Don't do that. But otherwise, stack barriers will dramatically reduce the cost of scanning large stacks.
I guess my question is why we need to treat stacks separately and rescan them in the mark termination step. Writes to slots in the stack still have write barriers, so why are they any different from normal heap objects? Surely I'm missing something, so thanks for the explanations in advance.
Writes to stack slots do not have write barriers. That's why they need to be rescanned.
Only writes to the current frame might be missing write barriers. Writes to non-current frames still have write barriers (because the compiler can't assume incoming pointer arguments point to the stack).
Ahh, I see, thanks. I was puzzled by my example program, where the write to *x has a write barrier because the compiler doesn't know whether it's a stack slot or a heap slot.

If the huge stack is not just a single frame, then I think we can be sure that writes to bottom stack slots are using write barriers, so perhaps we do have a way to know whether we need to scan the whole stack during mark termination.
Yeah, I also think that we only need to scan the current (topmost) frame in mark termination. This should fix the problem nicely (unless the user code has a huge frame on top, but then we can't do anything about it unless we add write barriers for stack writes).
You still need to re-scan all of the frames that were created since the concurrent scan. Any of those frames can still contain a pointer to a white heap object. The stack barriers I'm implementing will tell the GC which set of frames have been created since the first scan of the stack and hence need to be rescanned.

The fact that writes to up-pointers generate write barriers may simplify things, though. I was thinking I could only install stack barriers at frames where there are no pointers to higher frames (which is easy to compute, but does restrict where stack barriers can be installed, especially in examples like minux's). But if there are write barriers for these (and we don't simply ignore those write barriers like we do now), then I could install stack barriers wherever I want and use the write barrier to change the watermark if there's a write to a higher frame via a pointer.
Actually, the fact that we generate write barriers for writes to up-pointers is incredibly useful. I was wrong that we ignore these write barriers: the slot may be in the stack, but if it's a write barrier we care about, the pointer will point to the heap and we'll grey the object. This is exactly what we need to handle up-pointers. Hence, we can put stack barriers between any two frames on the stack without worrying about up-pointers.
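To make the interplay described in the last two comments concrete, here is a small, self-contained toy model — not the runtime's actual implementation or data structures; every name in it (frame, shade, scanFrames, the barrier index) is invented for illustration. It models a stack as a slice of frames, scans it once, remembers a barrier at the scanned top, lets the mutator push new frames and write through an up-pointer (which, as noted above, goes through a write barrier that greys the heap object), and then re-scans only the frames created since the first scan:

```go
package main

import "fmt"

// A frame maps slot names to heap object IDs ("" means nil).
type frame map[string]string

// shaded records which heap objects the GC has greyed.
var shaded = map[string]bool{}

func shade(obj string) {
	if obj != "" {
		shaded[obj] = true
	}
}

// scanFrames greys every object referenced from the given frames.
func scanFrames(frames []frame) {
	for _, f := range frames {
		for _, obj := range f {
			shade(obj)
		}
	}
}

func main() {
	// The stack grows by appending; stack[0] is the oldest frame.
	stack := []frame{{"a": "obj1"}, {"b": "obj2"}}

	// Concurrent scan: scan the whole stack and remember a barrier
	// position at the current top of the stack.
	scanFrames(stack)
	barrier := len(stack)

	// The mutator keeps running: it pushes a new frame (above the
	// barrier, so it will be re-scanned later) ...
	stack = append(stack, frame{"c": "obj3"})

	// ... and writes a new heap pointer into an *older* frame through
	// an up-pointer. That store goes through a write barrier, so the
	// pointee is shaded even though its slot sits below the barrier
	// and will not be re-scanned.
	stack[0]["a"] = "obj4"
	shade("obj4") // what the compiler-emitted write barrier does

	// Mark termination: re-scan only the frames created since the
	// concurrent scan, i.e. those at or above the barrier.
	scanFrames(stack[barrier:])

	fmt.Println(shaded)
}
```

Running this prints all four objects as grey, even though the old frames below the barrier are never re-scanned.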
I think we are on the same page but just to make sure. A stack barrier ... where all frames older than that location were scanned during ...
Yes, it sounds like we're on the same page.
CL https://golang.org/cl/10314 mentions this issue.

CL https://golang.org/cl/13614 mentions this issue.
The cost of scanning large stacks is currently dominated by the time spent looking up and decoding the pcvalue table. However, large stacks are usually large not because they contain calls to many different functions, but because they contain many calls to the same, small set of recursive functions. Hence, walking large stacks tends to make the same pcvalue queries many times.

Based on this observation, this commit adds a small, very simple, and fast cache in front of pcvalue lookup. We thread this cache down from operations that make many pcvalue calls, such as gentraceback, stack scanning, and stack adjusting.

This simple cache works well because it has minimal overhead when it's not effective. I also tried a hashed direct-map cache, CLOCK-based replacement, round-robin replacement, and round-robin with lookups disabled until there had been at least 16 probes, but none of these approaches had obvious wins over the random replacement policy in this commit.

This nearly doubles the overall performance of the deep stack test program from issue #10898:

name old time/op new time/op delta
Issue10898 16.5s ±12% 9.2s ±12% -44.37% (p=0.008 n=5+5)

It's a very slight win on the garbage benchmark:

name old time/op new time/op delta
XBenchGarbage-12 4.92ms ± 1% 4.89ms ± 1% -0.75% (p=0.000 n=18+19)

It's a wash (but doesn't harm performance) on the go1 benchmarks, which don't have particularly deep stacks:

name old time/op new time/op delta
BinaryTree17-12 3.11s ± 2% 3.20s ± 3% +2.83% (p=0.000 n=17+20)
Fannkuch11-12 2.51s ± 1% 2.51s ± 1% -0.22% (p=0.034 n=19+18)
FmtFprintfEmpty-12 50.8ns ± 3% 50.6ns ± 2% ~ (p=0.793 n=20+20)
FmtFprintfString-12 174ns ± 0% 174ns ± 1% +0.17% (p=0.048 n=15+20)
FmtFprintfInt-12 177ns ± 0% 165ns ± 1% -6.99% (p=0.000 n=17+19)
FmtFprintfIntInt-12 283ns ± 1% 284ns ± 0% +0.22% (p=0.000 n=18+15)
FmtFprintfPrefixedInt-12 243ns ± 1% 244ns ± 1% +0.40% (p=0.000 n=20+19)
FmtFprintfFloat-12 318ns ± 0% 319ns ± 0% +0.27% (p=0.001 n=19+20)
FmtManyArgs-12 1.12µs ± 0% 1.14µs ± 0% +1.74% (p=0.000 n=19+20)
GobDecode-12 8.69ms ± 0% 8.73ms ± 1% +0.46% (p=0.000 n=18+18)
GobEncode-12 6.64ms ± 1% 6.61ms ± 1% -0.46% (p=0.000 n=20+20)
Gzip-12 323ms ± 2% 319ms ± 1% -1.11% (p=0.000 n=20+20)
Gunzip-12 42.8ms ± 0% 42.9ms ± 0% ~ (p=0.158 n=18+20)
HTTPClientServer-12 63.3µs ± 1% 63.1µs ± 1% -0.35% (p=0.011 n=20+20)
JSONEncode-12 16.9ms ± 1% 17.3ms ± 1% +2.84% (p=0.000 n=19+20)
JSONDecode-12 59.7ms ± 0% 58.5ms ± 0% -2.05% (p=0.000 n=19+17)
Mandelbrot200-12 3.92ms ± 0% 3.91ms ± 0% -0.16% (p=0.003 n=19+19)
GoParse-12 3.79ms ± 2% 3.75ms ± 2% -0.91% (p=0.005 n=20+20)
RegexpMatchEasy0_32-12 102ns ± 1% 101ns ± 1% -0.80% (p=0.001 n=14+20)
RegexpMatchEasy0_1K-12 337ns ± 1% 346ns ± 1% +2.90% (p=0.000 n=20+19)
RegexpMatchEasy1_32-12 84.4ns ± 2% 84.3ns ± 2% ~ (p=0.743 n=20+20)
RegexpMatchEasy1_1K-12 502ns ± 1% 505ns ± 0% +0.64% (p=0.000 n=20+20)
RegexpMatchMedium_32-12 133ns ± 1% 132ns ± 1% -0.85% (p=0.000 n=20+19)
RegexpMatchMedium_1K-12 40.1µs ± 1% 39.8µs ± 1% -0.77% (p=0.000 n=18+18)
RegexpMatchHard_32-12 2.08µs ± 1% 2.07µs ± 1% -0.55% (p=0.001 n=18+19)
RegexpMatchHard_1K-12 62.4µs ± 1% 62.0µs ± 1% -0.74% (p=0.000 n=19+19)
Revcomp-12 545ms ± 2% 545ms ± 3% ~ (p=0.771 n=19+20)
Template-12 73.7ms ± 1% 72.0ms ± 0% -2.33% (p=0.000 n=20+18)
TimeParse-12 358ns ± 1% 351ns ± 1% -2.07% (p=0.000 n=20+20)
TimeFormat-12 369ns ± 1% 356ns ± 0% -3.53% (p=0.000 n=20+18)
[Geo mean] 63.5µs 63.2µs -0.41%

name old speed new speed delta
GobDecode-12 88.3MB/s ± 0% 87.9MB/s ± 0% -0.43% (p=0.000 n=18+17)
GobEncode-12 116MB/s ± 1% 116MB/s ± 1% +0.47% (p=0.000 n=20+20)
Gzip-12 60.2MB/s ± 2% 60.8MB/s ± 1% +1.13% (p=0.000 n=20+20)
Gunzip-12 453MB/s ± 0% 453MB/s ± 0% ~ (p=0.160 n=18+20)
JSONEncode-12 115MB/s ± 1% 112MB/s ± 1% -2.76% (p=0.000 n=19+20)
JSONDecode-12 32.5MB/s ± 0% 33.2MB/s ± 0% +2.09% (p=0.000 n=19+17)
GoParse-12 15.3MB/s ± 2% 15.4MB/s ± 2% +0.92% (p=0.004 n=20+20)
RegexpMatchEasy0_32-12 311MB/s ± 1% 314MB/s ± 1% +0.78% (p=0.000 n=15+19)
RegexpMatchEasy0_1K-12 3.04GB/s ± 1% 2.95GB/s ± 1% -2.90% (p=0.000 n=19+19)
RegexpMatchEasy1_32-12 379MB/s ± 2% 380MB/s ± 2% ~ (p=0.779 n=20+20)
RegexpMatchEasy1_1K-12 2.04GB/s ± 1% 2.02GB/s ± 0% -0.62% (p=0.000 n=20+20)
RegexpMatchMedium_32-12 7.46MB/s ± 1% 7.53MB/s ± 1% +0.86% (p=0.000 n=20+19)
RegexpMatchMedium_1K-12 25.5MB/s ± 1% 25.7MB/s ± 1% +0.78% (p=0.000 n=18+18)
RegexpMatchHard_32-12 15.4MB/s ± 1% 15.5MB/s ± 1% +0.62% (p=0.000 n=19+19)
RegexpMatchHard_1K-12 16.4MB/s ± 1% 16.5MB/s ± 1% +0.82% (p=0.000 n=20+19)
Revcomp-12 466MB/s ± 2% 466MB/s ± 3% ~ (p=0.765 n=19+20)
Template-12 26.3MB/s ± 1% 27.0MB/s ± 0% +2.38% (p=0.000 n=20+18)
[Geo mean] 97.8MB/s 98.0MB/s +0.23%

Change-Id: I281044ae0b24990ba46487cacbc1069493274bc4
Reviewed-on: https://go-review.googlesource.com/13614
Reviewed-by: Keith Randall <[email protected]>
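The commit message above describes a small, fixed-size cache with random replacement placed in front of an expensive lookup. As a rough illustration of that pattern — not the runtime's actual pcvalue code; the cache size, field names, and the stub lookup are all invented — such a cache might look like this:

```go
package main

import (
	"fmt"
	"math/rand"
)

// slowLookup stands in for decoding a pcvalue table entry: expensive,
// but frequently called with the same key when walking deep stacks
// made of calls to the same few recursive functions.
func slowLookup(pc uintptr) int32 {
	return int32(pc % 97) // stub; pretend this decodes a table
}

// pcCache is a tiny fixed-size cache with random replacement, in the
// spirit of the commit message above.
type pcCache struct {
	entries [16]struct {
		pc  uintptr
		val int32
		ok  bool
	}
}

func (c *pcCache) lookup(pc uintptr) int32 {
	for i := range c.entries {
		if e := &c.entries[i]; e.ok && e.pc == pc {
			return e.val // hit: skip the expensive decode
		}
	}
	v := slowLookup(pc)
	// Random replacement: cheap, and good enough when the working set
	// (the handful of recursive functions on a deep stack) is small.
	e := &c.entries[rand.Intn(len(c.entries))]
	e.pc, e.val, e.ok = pc, v, true
	return v
}

func main() {
	var c pcCache
	// A deep stack built from a few recursive functions queries the
	// same handful of PCs over and over, so most lookups hit the cache.
	for i := 0; i < 1000; i++ {
		_ = c.lookup(uintptr(0x4010a0 + 16*(i%4)))
	}
	fmt.Println("done")
}
```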
The program posted in https://groups.google.com/d/msg/golang-dev/_qY3IugLlAc/5ZIFjYHTHUwJ experiences very long mark termination times:
$ GODEBUG=gctrace=1 goscheme test.scm
gc #1 @0.240s 37%: 0+37+18+9+43 ms clock, 0+37+0+0/9/0+43 ms cpu, 4->5->2 MB, 4 MB goal, 1 P
gc #2 @0.514s 48%: 0+69+88+14+73 ms clock, 0+69+0+0/14/0+73 ms cpu, 4->5->3 MB, 4 MB goal, 1 P
gc #3 @0.756s 59%: 0+88+12+19+97 ms clock, 0+88+0+0/19/0+97 ms cpu, 4->6->4 MB, 5 MB goal, 1 P
gc #4 @1.107s 66%: 0+111+14+60+144 ms clock, 0+111+0+9/20/0+144 ms cpu, 5->8->6 MB, 7 MB goal, 1 P
gc #5 @1.664s 65%: 0+148+13+60+160 ms clock, 0+148+0+14/20/0+160 ms cpu, 6->10->7 MB, 8 MB goal, 1 P
gc #6 @2.166s 70%: 0+188+12+60+205 ms clock, 0+188+0+23/20/0+205 ms cpu, 7->12->9 MB, 8 MB goal, 1 P
gc #7 @2.812s 73%: 0+239+14+80+258 ms clock, 0+239+0+35/20/0+258 ms cpu, 14->16->12 MB, 15 MB goal, 1 P
gc #8 @3.934s 70%: 0+316+11+82+330 ms clock, 0+316+0+47/21/0+330 ms cpu, 19->20->14 MB, 19 MB goal, 1 P
It is also ~2x slower than it was on Go 1.4, which may be related.
@RLH