runtime: long GC STW pauses (≥80ms) #19378
Here is another
Is there anything I can do, or any more information I can gather, to help debug this? I know this probably isn't very high priority, but it is having a fairly severe effect on our system, so I'd love to do anything I can to get to the bottom of it.
@obeattie, sorry, I just assigned it to @aclements. It's likely nobody on the runtime team saw it until now.
@obeattie, thanks for the execution traces. Unfortunately, I spent a while looking at them and wasn't able to glean much about the actual problem. It's interesting that these are always periodic GCs and that your heap is under 4 MB. Looking through sweep termination, I could make guesses about what's going on, but they would all be shots in the dark. Background sweeping has long finished sweeping before we enter sweep termination, so I don't think the sweeping part of sweep termination is the problem. Could you collect a pprof profile? Given how idle your program generally is, I suspect you would get some samples under runtime.forcegchelper that may give us a better idea of what's holding it up. You might want to crank up the profile sampling rate with runtime.SetCPUProfileRate.
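For anyone following along, here is a minimal sketch of that kind of collection, assuming the profile is written from inside the program with runtime/pprof; the output path, the 500 Hz rate, and the sleep are illustrative choices, not values taken from this thread:

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	f, err := os.Create("cpu.pprof") // illustrative output path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Raise the sampling rate above the default 100 Hz so a mostly idle
	// program still yields useful samples. Calling this before
	// StartCPUProfile is a known workaround: StartCPUProfile will print a
	// complaint about the rate already being set, but the profile is still
	// collected at the higher rate.
	runtime.SetCPUProfileRate(500)

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// ... run the normal workload long enough to catch a slow GC ...
	time.Sleep(5 * time.Minute)
}
```

The resulting file can then be inspected with go tool pprof.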
Thanks for getting back to me 😄 I'll get this data as soon as I can.
@obeattie, ping.
So sorry for leaving this open for so long with no follow-up. 🤦‍♂️ Thank you for the ping! We have been able to mitigate the impact of this issue somewhat, but it is still happening. Our investigations were hampered by some infrastructure issues; we've resolved those in the last week, so we should be able to look into this more deeply now. My current thinking is that some behaviour of the Linux scheduler is involved: the fact that the GC pause times are so closely clustered around 100ms – which is the default CFS bandwidth period (cpu.cfs_period_us) – seems unlikely to be a coincidence.
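One way to test that theory directly is to watch the cgroup's CFS throttling counters while the pauses happen. A minimal sketch, assuming cgroup v1 with the cpu controller mounted at the conventional /sys/fs/cgroup/cpu path (an assumption, not a detail from this thread):

```go
package main

import (
	"fmt"
	"io/ioutil"
	"os"
)

func main() {
	// cpu.stat reports nr_periods, nr_throttled and throttled_time (ns)
	// for the CFS bandwidth controller. A rising nr_throttled that lines
	// up with the GC pauses points at quota enforcement rather than the
	// Go runtime itself.
	data, err := ioutil.ReadFile("/sys/fs/cgroup/cpu/cpu.stat")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read cpu.stat:", err)
		os.Exit(1)
	}
	fmt.Print(string(data))
}
```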
Hello @obeattie, I got interested in your issue and decided to see whether I could prove or disprove your theory about CFS before digging into it from a Go point of view. I wrote a little program that does network transactions (talking with itself, via localhost) and dies as soon as it sees one transaction that takes more than 80ms. It is here: https://play.golang.org/p/7ZXSoCeb18 (Put it in a directory called "server" and do "go build" on it.) Running "./server" does not trigger the long transactions, even with significant other load on the machine. (See the -spin argument for how I was generating load.) But it is possible to trigger the 80 ms transactions by putting the server inside of a cgroup which is using cpu.cfs_quota_us. To see it in action:
In another window, do the following:
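As a rough sketch of that kind of setup (not the exact commands used here): with cgroup v1 and the cpu controller mounted at /sys/fs/cgroup/cpu, something like the following puts a running process under a CFS quota. The group name and the quota/period values are illustrative.

```go
// Sketch: create a cpu cgroup with a CFS quota and move a process into it.
// Needs root. The mount path, group name, and values are assumptions.
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"path/filepath"
	"strconv"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <pid>", os.Args[0])
	}
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}

	cg := "/sys/fs/cgroup/cpu/gctest" // assumed cgroup v1 cpu hierarchy
	if err := os.MkdirAll(cg, 0755); err != nil {
		log.Fatal(err)
	}
	write := func(name, value string) {
		if err := ioutil.WriteFile(filepath.Join(cg, name), []byte(value), 0644); err != nil {
			log.Fatal(err)
		}
	}
	write("cpu.cfs_period_us", "100000") // 100ms period (the kernel default)
	write("cpu.cfs_quota_us", "50000")   // 50ms of CPU time per period
	write("cgroup.procs", fmt.Sprint(pid))
}
```

With ./server moved into such a group, the throttling behaviour described below kicks in once the quota is spent.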
As quota approaches period, the max latency goes up, eventually reaching 80ms and the log.Fatal triggers. If quota is at 200% of period (i.e. quota = 200000, period = 100000), you can even run "./server -spin" next to "./server" and you'll end up with max latencies of 50ms. (I can't quite explain why that is; I expected the max latency to still be 100 ms in that case, maybe I just got lucky.) What's happening is that when the quota is exhausted, all processes in the cgroup are starved of time slices until the next period starts. If period is 100ms, you can eventually expect to get unlucky and have one transaction take up to 100ms. The solution is to use a lower period, so that the worst case wait to finish any given transaction is lower. This config allows the spinner and the server to coexist with max latencies of 50 ms, at least on my machine:
Are you running your server inside of Docker, managed by Kubernetes? If so, there's still some work to do to figure out how to get your cgroups configured with a shorter period.
The second URL seems to indicate that Kubernetes will not allow you to change the period. :(
@jeffallen Thank you so much for contributing your time to investigate this. Your findings match mine, and indeed these problems are seen when running under Kubernetes with CFS quotas enforced. I will need to work out some way to change the period under k8s… Based on these findings I'm now confident that this is not a Go bug and I'm going to close the ticket. I think that fixing #19497 would be very helpful to anyone facing similar issues; debugging behaviour like this is somewhat like trying to find a needle in a haystack at the moment. Thank you to everyone who's helped with this ❤️
What version of Go are you using (go version)?
go1.8 linux/amd64
What operating system and processor architecture are you using (go env)?
The host where the application is running does not have go (just the compiled binary), so this is actually from the host which built it.

What did you do?
Following from a discussion on golang-nuts, I was advised to create an issue.
In our applications we are regularly seeing GC pauses of ≥80ms, even for extremely simple programs with light workloads. These long pauses seem to be sporadic; most GCs take fractions of a millisecond, as I would expect, but in some runs it seems that the sweep termination phase (which I understand is STW) is taking a very long time. Given that these applications are network RPC servers, pauses like these are extremely problematic for tail latencies.
I have captured a runtime/trace here showing one such pause, along with the output of GODEBUG=gctrace=1 below for the same period. The long GC is at 451,008.477 ms in the trace and corresponds to gc 12 below.

I'd really appreciate any help investigating what could cause this.
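For reference, a minimal sketch of capturing such an execution trace with runtime/trace; the file name and the one-minute window are arbitrary, and this is not necessarily how the trace above was produced:

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.out") // illustrative output path
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// ... serve traffic long enough for a slow GC to occur ...
	time.Sleep(time.Minute)
}
```

The resulting file is viewed with go tool trace, which is where per-GC events like the one at 451,008.477 ms show up.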