runtime: multi-ms sweep termination pauses #17831
Below are the distributions of GC STW pauses for the program (in milliseconds), via https://godoc.org/github.com/aclements/go-moremath/cmd/dist.

Sweep termination:
Mark termination:
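One way to produce per-phase samples like these, shown here only as an assumed setup rather than how the numbers above were actually collected, is to run the server with GODEBUG=gctrace=1 and extract the two STW entries from each line's "A+B+C ms clock" field (sweep termination, concurrent mark, mark termination in the Go 1.7/1.8 gctrace format), then pipe the resulting column of numbers from the server's stderr into the dist tool linked above. A minimal filter along those lines:

package main

import (
    "bufio"
    "fmt"
    "os"
    "regexp"
)

// clockRE matches the "A+B+C ms clock" field of a gctrace line; A is the
// sweep-termination STW time and C is the mark-termination STW time, in ms.
var clockRE = regexp.MustCompile(`: ([0-9.]+)\+([0-9.]+)\+([0-9.]+) ms clock`)

func main() {
    phase := "sweep"
    if len(os.Args) > 1 {
        phase = os.Args[1] // "sweep" or "mark"
    }
    sc := bufio.NewScanner(os.Stdin)
    for sc.Scan() {
        m := clockRE.FindStringSubmatch(sc.Text())
        if m == nil {
            continue
        }
        if phase == "sweep" {
            fmt.Println(m[1])
        } else {
            fmt.Println(m[3])
        }
    }
}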
Do we know what the running goroutines are doing? Could they be in a tight loop?
Hmm, yes. It looks like every time the beginning of a GC cycle is delayed, one of the running goroutines appears to be in a single expensive call to encoding/base64.StdEncoding.DecodeString; the "End Stack Trace" indicates a line right after that kind of call. The base64 messages that are decoded are around 1MB, which could keep a goroutine busy for several milliseconds.
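For a rough sanity check of that estimate, one can time a single StdEncoding.DecodeString call on a ~1MB payload; the synthetic input below is only a stand-in for the application's actual messages:

package main

import (
    "bytes"
    "encoding/base64"
    "fmt"
    "time"
)

func main() {
    // Build a ~1MB payload, roughly the message size described above.
    raw := bytes.Repeat([]byte("0123456789abcdef"), 64*1024) // 1 MiB
    enc := base64.StdEncoding.EncodeToString(raw)

    start := time.Now()
    if _, err := base64.StdEncoding.DecodeString(enc); err != nil {
        panic(err)
    }
    // A single call like this can plausibly run for a millisecond or more,
    // and (in Go 1.7/1.8) it contains no preemption points.
    fmt.Println("decode took", time.Since(start))
}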
Can that code be rewritten so that it makes a preemptible call once every so many iterations?
Are you suggesting a change to the encoding/base64 package, or to all applications that use it for large slices? My application uses encoding/base64 indirectly via a third-party package, so making a localized change isn't a straightforward solution. Fixing #10958 would be pretty helpful here. The program I've encountered this in is very throughput sensitive and somewhat latency sensitive, which goes against the assumptions used to de-prioritize that issue. Gradually disabling each P in preparation for a GC pause means that my 36-core server is effectively a 1- or 2-core server for several milliseconds each cycle.
The appropriate place would seem to be the encoding/base64 package.
I strongly disagree. We are not going to litter the standard library and every package containing a backwards branch with calls to help the GC or scheduler. Maybe @rhysh can hack up a local fork of the standard library as a short-term workaround, but fixing #10958 has to be the right long-term answer.
The main reason that issue has been de-prioritized is not because of a concern over throughput (though that does complicate the implementation), but just because we haven't seen it arise in "real" code. So, this is actually an excellent example to motivate re-prioritizing it. Unfortunately, fixing this is rather complicated (and it's clearly not a regression), so I think it's too late in Go 1.8 to put a real fix in the compiler.

@rhysh, to confirm that we really do understand the issue (and as a possible temporary fix), could you modify encoding/base64.Encoding.decode to force a preemption check at the top of the outer loop? It's slightly annoying to do this because our compiler is getting more clever, but calling the following function seems to do it:

var alwaysFalse = false

//go:noinline
func preempt(x int) {
    if alwaysFalse {
        println()
    }
}
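For illustration, here is roughly how such a helper might be wired into a long, call-free loop; decodeAll and its body are invented stand-ins, not the actual encoding/base64 code:

// decodeAll is an invented stand-in for a long, call-free loop such as the
// one in encoding/base64.(*Encoding).decode. Calling preempt at the top of
// the outer loop gives the goroutine a function-call prologue (the stack
// check) where a pending preemption request can be honored.
func decodeAll(chunks [][]byte) int {
    n := 0
    for i, chunk := range chunks {
        preempt(i) // forced preemption point, once per outer iteration
        for _, b := range chunk {
            n += int(b) // stand-in for the real per-byte decode work
        }
    }
    return n
}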
Whether the compiler puts it in or it is put in by hand, the appropriate …
I think a simpler way to add a preemption point would be to call runtime.Gosched.
I agree that that's simpler, but calling it on every loop iteration would be much more expensive than a check that normally does nothing.
Sorry, my unstated assumption was that you would not do anything on every single loop iteration. Just every 1024 iterations or something.
Oh, I see. That would be fine as a temporary solution, of course. Though in terms of an experiment with adding implicit preemption points in this loop (and in general), I'd rather try something closer to what the compiler would presumably do.
Even if and when the compiler does this, I think the compiler should not add a check to every loop iteration. Any non-preemptible loop is by definition quickly executing. Better to keep a counter in a register for when to check for preemption.
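A hand-written version of that idea might look like the following sketch, where the 1024 threshold and the use of runtime.Gosched as the explicit yield simply follow the discussion above, and the loop body is a placeholder:

package workload

import "runtime"

// processAll keeps the fast path to a decrement and compare, and pays for an
// explicit yield only once every 1024 iterations.
func processAll(items []int) int {
    sum := 0
    budget := 1024
    for _, v := range items {
        budget--
        if budget == 0 {
            runtime.Gosched() // explicit preemption point
            budget = 1024
        }
        sum += v // placeholder for the real per-item work
    }
    return sum
}

The point of the counter is that the common path stays a cheap arithmetic-and-branch sequence, while the comparatively expensive yield is amortized across 1024 iterations.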
Thank you all for the pointers on this—it looks like we're on the right track. @aclements I added a preemption point to encoding/base64 as you suggested, and that stack no longer shows up as the ending stack of goroutines that continue running while the runtime is trying to stop the world. I've added three additional preemption points to encoding/json and a handful to github.com/golang/protobuf/proto where it looked like they were required, based on the other end stack traces that showed up. Still, the 99th percentile STW times are close to 1ms, since I haven't successfully tracked them all down. Below are updated STW time distributions for the program.

Sweep termination:
Mark termination:
@ianlancetaylor, we keep assuming that we need a sophisticated solution to keep the costs low, but as far as I know, we haven't actually tried the simple solution and proven it insufficient. For example, it's not clear to me that checking and maintaining a counter in a loop is necessarily cheaper than the two instructions it would take to check the stack guard (at least on x86). I'd like us to try this experiment.

@rhysh, thanks for trying that out. It looks like that significantly improved the tails, at least. Do you still see the regular gaps in the execution trace with just one P running during sweep termination and mark termination?
We will be looking to see if we can find a solution that is safe and has low overhead. Rhys, you have inspected the routines that need to have backward branches instrumented: are they tight inner loops, or larger loops?
@aclements I'm happy to test a simple solution if you've got one, even if it has up to ~50% overhead. We'd definitely want something better for the release, but I could at least help validate an approach.

I'm no longer sure how to interpret the execution tracer output. Below is a screenshot of the execution tracer for the program after I added preemption points to several deserialization routines. It looks like Proc 31 is the last to stop running before the collection cycle is able to begin, but Proc 35 looks like it never stops at all. How was the collection cycle able to run without stopping G92 on Proc 35?

Also, the GC row claims that the cycle started around 545.4ms, but there are goroutines on Procs 15, 20, 22, and 26 that appear to be making syscalls and then continuing their work even after the "GC" row at the top indicates there's a collection going on. Procs 1, 3, 7, 10, 14, 16, 19, 24, 25, 28, 30, and 32 appear to schedule some goroutines for execution after 545.4ms. Comparing the "GC" and "Threads" rows, what is the runtime up to between the start of the "GC" row's "GC" span at 545.4ms and the beginning of the decline of the "Threads" row's count of active threads at 545.7ms?

@aclements does the runtime stop the world one P/M at a time in some order, or is each P/M able to learn independently that the world should be stopped? The execution traces make it look like there's some order, so that if one goroutine is running a non-preemptible loop, any Ps "behind" it in line will continue scheduling goroutines as usual.

@RLH I haven't been looking at the generated instructions; I've been looking at GC cycles with the execution tracer, finding goroutines that appear to have continued running when it looks like the scheduler is trying to stop the world, reading the end stack trace they list, and reading through the Go code to try to find the last non-preemptible loop that would have run right before it. Seeing samples of running stack traces (#16895) would be handy here. The loops I've found have been around 20–60 lines of Go code, so they'd probably count as "larger loops".
@rhysh Here's a likely simple solution; there are loops that will evade its notice if they're cleverly nested or contain calls to nosplit functions, but otherwise, it does the job.
Thanks @dr2chase — I tried CL 33093 PS 2 and it didn't have much of an impact on median STW times; any change was likely within the noise. Looking at the disassembly of encoding/base64.(*Encoding).decode, it looks like CL 33093 works as the commit message describes, adding reschedule-check calls only to innermost loops. Why are the inserted calls limited to innermost loops? Is it for performance, code size, or correctness? The three relevant lines are go/src/encoding/base64/base64.go lines 260, 281, and 304 (at 3df059e).
I'll see what I can do for a small reproducer, though it'll necessarily only cover functions I know are related to the problem.
First-implementation sanity. I will take a shot at getting the outer loops right, too. I could make a version that just stomped on all the backedges.
I made a version (PS 3) that stomps on all the backedges; I will be benchmarking it for overhead, but wanted to see if it is good enough for your purposes. The only exceptions are (1) inside nosplit functions and (2) irreducible loops, which cannot be written without goto.
Benchmarking results with checks on all the backedges show more noticeable overhead.
@dr2chase CL 33093 PS 3 has a very good effect on sweep termination pauses for my server, reducing their median duration by a factor of 20, from 4ms to 0.17ms. This matches the gains I was able to get from manually inserting these checks into encoding/base64, encoding/json, and github.com/golang/protobuf/proto (though other programs would have a different set of packages in need of such modification), as I described in #17831 (comment). The median duration of the STW pauses is still longer than the 100µs goal. I really like CL 33093 PS 3 rather than the manual modifications—it'll allow work to continue on diagnosing any other causes of long pauses without needing to worry if there are tight loops that weren't hand-modified. The overhead hasn't been a problem for my server. Thank you! Here's sweep termination time for my server: With 3df059e:
With CL 33093 PS 3, checked out (not cherry-picked):
Here's mark termination time for my server: With 3df059e:
With CL 33093 PS 3, checked out (not cherry-picked):
Do you think you're going to want this for more than just a diagnostic tool?
@dr2chase are you asking me, or asking @aclements / @RLH? If the change can ensure that all loops are preemptible within some roughly-bounded time without adding too much overhead, I think it would make a good solution to #10958. I'm not in a position to judge the cost/benefit of that work. The change provides a good reduction in one source of long STW pauses. It looks like there are still other factors that can lead to STW pauses of >100µs; I have some other programs that this change didn't help much, which I have not yet diagnosed. I've updated to PS 4, which continues to work as expected. Could it be included in the releases guarded by the GOEXPERIMENT value, as framepointers were for Go 1.5 and 1.6?
The intent is for this to land in go1.8 under GOEXPERIMENT, hopefully to make it into the first beta. I'm looking into reducing the overhead to make it less performance-risky for people to use/try, because there's a worry that maybe you're not the only person who has this problem -- it just happens that you're the person who's looking carefully.
Since we've now confirmed that this is because of a lack of preemption points in loops, I'm going to close this as a dup of #10958 so the general discussion stays in one place. It's fine to keep discussing things specific to this application here.
The cost is definitely too high to turn on by default in its current form, but the current prototype is really an upper bound on the cost. There are many optimizations we can apply but haven't yet, so I suspect there's a lot of headroom. We'll probably just go down the list of optimizations until the cost is under 1% and then we'll stop. :) But some of the optimizations are fairly intrusive, so they wouldn't happen until the 1.9 time frame.
How much in excess of 100µs? I only tested up to 12 cores, so it's possible there are things remaining in mark termination that aren't perfectly scalable with the number of cores. I have a modified version that logs precise time stamps of every step of mark termination to stderr if you'd be interested in trying that with one of these applications. If it's not much more than 100µs, though, I'm not too worried. Replying to some of your earlier questions:
Whichever goroutine triggered the GC cycle is responsible for the process of stopping the world, so it keeps running in some sense, though it's all in the runtime.
It's hard to say for sure, but there are a few things that happen between emitting the trace event for GC start and actually stopping the world, so I'm not surprised that there's 0.3ms in there, especially if there are a lot of Ps and goroutines.
The order is technically non-deterministic, but can appear to be sequential. Stopping the world sets a global flag that any P can observe when it passes through the scheduler, so any of them could stop at that point. It then attempts to forcibly preempt the running goroutine on each P in order so that it passes through the scheduler. This is a non-blocking operation (it just sets a flag on the g), but if goroutines are responding to preemption quickly, you'll see them descheduling roughly in P order. I don't think an unpreemptible P will do anything to prevent us from preempting later Ps.
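A toy model may make that sequence concrete; every name in it (stopRequested, g, stopTheWorld) is invented for illustration and is not the runtime's real code:

package model

import "sync/atomic"

// Toy model of the stop-the-world sequence described above. Every name here
// is invented for illustration and does not correspond to real runtime code.
var stopRequested atomic.Bool

type g struct {
    preempt atomic.Bool // only observed at preemption points (function prologues)
}

func stopTheWorld(running []*g) {
    // Any P that passes through the scheduler after this point parks itself.
    stopRequested.Store(true)
    // Ask each running goroutine to yield. This only sets a flag, so it is
    // non-blocking; goroutines stop roughly in the order they reach their
    // next preemption point, which can look like P order but need not be.
    for _, gp := range running {
        gp.preempt.Store(true)
    }
    // ... then wait for every P to stop. A goroutine in a call-free loop
    // never reaches a preemption point, which is what stretches the pause.
}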
CL https://golang.org/cl/33093 mentions this issue.
Gating via GOEXPERIMENT for Go 1.8 sounds great.
I have a program with a 95th percentile sweep termination time of 60ms, and 95th percentile mark termination time of around 30ms. This is measured with go version
It seems like pauses of more than 1ms are worth hunting down—are pauses of a few hundred microseconds worth reporting as well? I suppose anything more than 100µs is worth reporting even if addressing it isn't a priority for a release or two.
Ah! This makes a lot of sense, thank you. It means that the goroutine that looks like it keeps running is in fact the one that stopped (running user code) the earliest.
I see, the GC execution trace event is emitted early but sweep termination doesn't begin until later. When I look at the beginning of the decline in active Ps in the Threads row, the execution trace makes a lot more sense—although goroutines continue to be scheduled onto Ps after the GC row shows a collection is in progress, they are not scheduled once the Threads row begins to decline. Thank you for explaining!
I filed #17969 for the other program that sometimes sees hundred-millisecond pauses.
CL https://golang.org/cl/33910 mentions this issue.
Loop breaking with a counter. Benchmarked (see comments), eyeball checked for sanity on popular loops. This code ought to handle loops in general, and properly inserts phi functions in cases where the earlier version might not have.

Includes test, plus modifications to test/run.go to deal with timeout and killing a looping test. Tests broken by the addition of extra code (branch frequency and live vars) for added checks turn the check insertion off.

If GOEXPERIMENT=preemptibleloops, the compiler inserts reschedule checks on every backedge of every reducible loop. Alternately, specifying GO_GCFLAGS=-d=ssa/insert_resched_checks/on will enable it for a single compilation, but because the core Go libraries contain some loops that may run long, this is less likely to have the desired effect.

This is intended as a tool to help in the study and diagnosis of GC and other latency problems, now that goal STW GC latency is on the order of 100 microseconds or less.

Updates #17831.
Updates #10958.

Change-Id: I6206c163a5b0248e3f21eb4fc65f73a179e1f639
Reviewed-on: https://go-review.googlesource.com/33910
Run-TryBot: David Chase <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Keith Randall <[email protected]>
What version of Go are you using (go version)?
go version devel +3df059e Mon Nov 7 06:46:01 2016 +0000

What operating system and processor architecture are you using (go env)?
linux/amd64
What did you do?
I have a server program that makes http requests to collect a large volume of data which it then unmarshals and transforms. It runs with around 1000 active goroutines, has a live heap of around 30–100MB and executes around 50 GCs per second. The GC does a fine job keeping up with the program's throughput demands.
What did you expect to see?
I expected sub-millisecond stop-the-world times during garbage collections; per #17503 (comment), "The hybrid barrier is now live on master."
What did you see instead?
I see mark termination STW times of over 1ms somewhat regularly (which I thought the #17503 hybrid barrier would eliminate).
I see sweep termination STW times of over 1ms in the majority of GC cycles, which I find particularly surprising since I thought there was very little work required in that phase.
It looks like sometimes the runtime stops most of the goroutines in preparation for a GC, but a few remain running. See for instance Proc 15 around 285–289ms (Proc 28 offscreen is also running), or Proc 0 and Proc 19 around 312–314ms.
/cc @aclements @RLH