runtime: process hangs after moving to Go 1.17 on Windows #52178
Comments
dlv output from process dump
|
Is this a Go issue or something else? We still have no clue why it happens randomly. |
Does this issue also occur with Go 1.19? A few things I notice of interest:
|
We are not able to reproduce this issue in-house. It happens on our clients' machines, to which we don't have access. One more thing: we have this code that periodically scans memory stats. Do you see any problem with it?
|
I don't see anything fundamentally problematic with that loop, though you may consider if setting a memory limit (new in 1.19) is a better fit for your application: https://pkg.go.dev/runtime/debug#SetMemoryLimit |
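As a rough illustration of the suggestion above, here is a minimal sketch of setting a soft memory limit with `debug.SetMemoryLimit` (available from Go 1.19). The helper name `applyMemoryLimit` and the 512 MiB budget are assumptions made for the example, not values from this thread.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// applyMemoryLimit sets a soft memory limit for the Go runtime and returns
// the previously configured limit. A negative argument leaves the limit
// unchanged and just reports the current value.
// The helper name is hypothetical; only debug.SetMemoryLimit is a real API.
func applyMemoryLimit(limitBytes int64) int64 {
	return debug.SetMemoryLimit(limitBytes)
}

func main() {
	const limit = 512 << 20 // 512 MiB, an assumed budget for illustration
	prev := applyMemoryLimit(limit)
	fmt.Printf("previous limit: %d bytes, new limit: %d bytes\n", prev, limit)
}
```

The same limit can also be supplied through the `GOMEMLIMIT` environment variable, which avoids a code change.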
My best guess is it's stuck trying to STW because there's a bunch of goroutines stuck in semacquire (probably the STW sema for a GC). There's also a goroutine suspiciously in syscall_windows.go in the runtime. If you're willing to share full goroutine stacks, that'll help a good bit in determining the root cause. |
Actually no, taking a closer look, there are a lot of goroutines stuck in |
Thanks @prattmic, I will use that SetMemoryLimit call when we upgrade to Go 1.19. @mknyszek I don't have full goroutine stacks. We are not able to take a goroutine trace, or any other kind of trace, via pprof when the process hangs. The attached one was obtained from a Windows process dump. |
I see that you're checking the process dump in Delve; shouldn't that be able to provide a backtrace of each goroutine? |
I believe I am having the exact same issue on Windows (x64), but I managed to get a stack trace of all goroutines (see attached files). From what I can see, every "active" goroutine seems to be waiting on GC, and all goroutines involved in GC are "idle". While some other goroutines are waiting on channels/selects/IO, as soon as one attempts to allocate memory it hangs as well.
For example, I have enabled GoLang's built-in HTTP profiler. When the app hangs, I can fetch the "/debug/pprof/" home page exactly once, but if I attempt to click on any links to gather profiling info, that goroutine hangs indefinitely (and I can no longer even fetch the home page). To obtain the attached stack trace, I had to make a build with debug symbols, give it to the customer, wait for it to hang, and then use "delve attach" on the hung PID and "goroutines -t"...
Note: I have seen this happen with builds made using GoLang versions 1.16.6 and 1.18.3. I have not tried 1.19.1 yet, or "asyncpreemptoff=1". I also periodically call debug.FreeOSMemory() from main() (goroutine 1, which is nowhere near the FreeOSMemory call when the app hangs). I do not use runtime.ReadMemStats the way sanjayvora does; I just call FreeOSMemory once every N minutes.
Update: I almost forgot. We have never seen this occur on any development system, and we have several customers running this application 24/7 who have also never encountered it. So far it has happened on one customer's system, but it occurs very often, and that customer happens to have McAfee Endpoint Security installed. McAfee doesn't log any complaints about our app (and we see nothing about it in Windows Event Manager). It may not be a factor, but it wouldn't be the first time I've seen anti-virus drivers do unexpected things to apps. We may ask the customer to temporarily remove McAfee to see if it makes a difference, but if something like "asyncpreemptoff=1" can fix it... |
I believe that I stumbled across the same problem as described here. I wrote down my finding and thoughts in a containerd issue: containerd/containerd#6362 (comment) |
Update: Enabling GC traces revealed that right before the process gets unblocked, a forced GC is happening via the sysmon goroutine. |
Any updates on this issue? It seems that my application also encounters this problem, on Linux. We use go1.17.8, and here is the GC log of my application:
The GC log shows that the second STW lasts for about 10 minutes. But after that, the application can automatically return to normal. |
@codablock Sorry, somehow I totally missed your comment from back in November. That sounds like a different issue (this seems pretty Windows-specific; there's Windows syscall callback functionality that doesn't exist on other platforms and it seems to be implicated here, re-reading the thread). Also, reading your linked comment, I'm not sure I follow the conclusion that it's a bug in the Go runtime. It sounds like the process is getting stuck, but I don't think the ways you tried to perturb it would really shift what the Go runtime is doing. My gut feeling is an OS issue, but I could be wrong. @Jason7602 That also sounds like a different issue for the same reasons as above. In fact that's... quite surprising. Please file a new issue. Can you reproduce with the latest version of Go? Go 1.17 is no longer supported. |
Actually, looking at the goroutine dump earlier on, this looks like an issue I've seen before. I think it was around Go 1.17 as well, and I think we fixed it. What's really suspicious is the extremely long wait times reported (52 years?). Let me try to dig up the other issue, this particular issue might be fixed. |
Re-adding my 2 cents about the 1.17 question, which was:
*Note:* I have seen this happen with builds made using GoLang versions
1.16.6 and 1.18.3. I have not tried 1.19.1 yet
*Problem:* I can't reproduce this on any of my own development systems. I
can only reproduce it on a customer's system, and I can't waste a
customer's time hunting down an issue like this when I have a workaround. I
believe the customer had anti-virus installed, and it's possible that some
driver-level bug specific to that anti-virus was causing it, but I really
have no idea. If I ever manage to reproduce this on a development system, I
will add more info.
…On Mon, Apr 10, 2023 at 12:00 PM Michael Knyszek ***@***.***> wrote:
@codablock <https://github.com/codablock> Sorry, somehow I totally missed
your comment from back in November. That sounds like a different issue
(this seems pretty Windows-specific; there's Windows syscall callback
functionality that doesn't exist on other platforms and it seems to be
implicated here, re-reading the thread).
Also, reading your linked comment, I'm not sure I follow the conclusion
that it's a bug in the Go runtime. It sounds like the process is getting
stuck, but I don't think the ways you tried to perturb it would really
shift what the Go runtime is doing. My gut feeling is an OS issue, but I
could be wrong.
@Jason7602 <https://github.com/Jason7602> That also sounds like a
different issue for the same reasons as above. In fact that's... quite
surprising. Please file a new issue. Can you reproduce with the latest
version of Go? Go 1.17 is no longer supported.
|
@mknyszek Thank you for your reply. This is really a surprising phenomenon. It may be an OS problem, as you speculated, because if this is a bug in the go runtime, it is unlikely that it can be recovered after it hangs. In fact, this problem is not easy to reproduce. I will reproduce this problem as soon as possible, and then file a new issue and check whether this problem exists in the latest Go version. |
Hi @mknyszek, can you tell us in which version of Go it was fixed? |
Delve bug, please disregard. |
Everyone seems to be ignoring my reply (which was merely to call out my post from 2022-09-21), but if he thinks it was fixed during the 1.17 release cycle... I was still able to reproduce it on GoLang versions 1.16.6 through 1.18.3, so I do not believe it was fixed during the 1.17 release cycle unless it was a late fix that was made after the 1.18.3 release.
IMHO, the biggest obstacle to eliminating this problem is how difficult it is to reproduce. I only saw the problem on 1 customer system (out of hundreds that ran our GoLang apps with no problems). That 1 customer was able to reproduce it fairly reliably, but only after running our product 24/7 for several days, so each test cycle was very long. The only thing that stood out on his server was McAfee Endpoint Security. That may have had something to do with it, but it may not have.
Disabling async preemption fixed the problem permanently, and as we never managed to reproduce the problem in-house and we cannot use customers as guinea pigs, we chose to disable that GoLang option permanently in all of our Windows GoLang apps. If it helps at all, my earlier post did include full goroutine and environment dumps from the customer when the problem occurred.
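The thread does not say how async preemption was disabled in those builds. One possible approach, sketched here under that caveat, is to have the program re-launch itself with `GODEBUG=asyncpreemptoff=1` in its environment, since (as far as I understand) the runtime picks up this setting at process startup and calling `os.Setenv` later in `main` would be too late. The helper names `withAsyncPreemptOff` and `hasSetting` are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

const preemptOff = "asyncpreemptoff=1"

// hasSetting reports whether a GODEBUG value already contains asyncpreemptoff=1.
func hasSetting(godebug string) bool {
	for _, s := range strings.Split(godebug, ",") {
		if strings.TrimSpace(s) == preemptOff {
			return true
		}
	}
	return false
}

// withAsyncPreemptOff returns a copy of env in which GODEBUG includes
// asyncpreemptoff=1, keeping any other GODEBUG settings that were present.
// It matches the exact name "GODEBUG" (an assumption; Windows variable
// names are case-insensitive).
func withAsyncPreemptOff(env []string) []string {
	out := make([]string, 0, len(env)+1)
	found := false
	for _, kv := range env {
		if strings.HasPrefix(kv, "GODEBUG=") {
			found = true
			val := strings.TrimPrefix(kv, "GODEBUG=")
			if !hasSetting(val) {
				if val != "" {
					val += ","
				}
				val += preemptOff
			}
			kv = "GODEBUG=" + val
		}
		out = append(out, kv)
	}
	if !found {
		out = append(out, "GODEBUG="+preemptOff)
	}
	return out
}

func main() {
	if !hasSetting(os.Getenv("GODEBUG")) {
		// Re-launch this executable with the setting and mirror its exit code.
		exe, err := os.Executable()
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		cmd := exec.Command(exe, os.Args[1:]...)
		cmd.Env = withAsyncPreemptOff(os.Environ())
		cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
		if err := cmd.Run(); err != nil {
			if ee, ok := err.(*exec.ExitError); ok {
				os.Exit(ee.ExitCode())
			}
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		return
	}
	fmt.Println("running with GODEBUG =", os.Getenv("GODEBUG"))
}
```

Setting the variable in the service definition or launch script achieves the same thing without the extra process; the re-exec form is only useful when the launch environment is out of your control.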
Yeah, sorry. Missed that. And based on @aarzilli's comment, that long wait time was a red herring. I take back my speculation about this possibly being fixed.
See above, I don't think it was.
Got it. Async preemption does appear to be the common piece here. I dug up this open issue (#36492) that also gets resolved with FTR we're running our full test suite against Windows machines (from different eras) regularly, and digging around our open issues (with the caveat that GitHub search is a bit hard to work with) I don't see any Windows-specific test timeout issues. I'm inclined to believe that what's happening is this feature is not playing well with some version of Windows, or something in the environment on some Windows machines. We've also fixed async preemption bugs over the years, so it's possible that now we've hidden the issue. :( At this point we're almost certainly not going to go back and fix up Go 1.17 as per our policy, but at the very least I can file an issue to catch any Windows timeouts on our builders. |
Filed #59576 to track any failures we see on our builders. It's closed right now on purpose; if we see any failures the bot will open the issue and we'll get it in the triage queue. |
Agreed. I've seen it on recent versions of Windows Server (definitely way beyond Windows 7). While this may not help you reproduce it (especially as code changes over time), my recommendation would be:
|
Hi all, I don't know if this helps, but I had this exact same issue and finally tracked it down to a coding error. Before 1.17, the coding error never caused any issues. On Linux it still didn't cause an issue, but on Windows the process would sometimes just hang with lots of processes stuck. It could be reproduced on the same box when booted into Windows, but not when booted into Linux from a separate partition.
The fault in the code was that I had the unlock after I don't know if this helps, but maybe the fault is closer to home than you think. |
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes with 1.17.6
Not tried with 1.18
What operating system and processor architecture are you using (go env)?
Windows
go env Output
What did you do?
We migrated from Go 1.12 to Go 1.17.2. Since then, our process sometimes hangs on Windows. Most of the time it runs fine, but sometimes it hangs and never recovers. It doesn't even return Go profiles (we have pprof included in the process).
We see this issue in production but are not able to reproduce it in-house, so we don't have many details.
I am attaching a process dump that we took from explorer. I do see it doing a forced GC, but I can't figure out what the issue is.
One more thing: We have this runtime.ReadMemStats() happening every 30 sec. Not sure if that is related, but just noting down here.
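For readers unfamiliar with the pattern, a periodic `runtime.ReadMemStats` poll might look roughly like the sketch below. This is not the reporter's actual code (which is not included in the issue); the function names `sampleHeap` and `watchMemory` and the shortened demo interval are assumptions. Note that `ReadMemStats` briefly stops the world on every call.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// sampleHeap reads the runtime memory statistics once and returns the
// number of bytes of allocated heap objects.
func sampleHeap() uint64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m) // stops the world while the stats are collected
	return m.HeapAlloc
}

// watchMemory calls report with a fresh sample every interval until ctx
// is cancelled. Both names are hypothetical, chosen for this sketch.
func watchMemory(ctx context.Context, interval time.Duration, report func(heapAlloc uint64)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			report(sampleHeap())
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 350*time.Millisecond)
	defer cancel()
	// The issue describes a 30-second interval; 100ms keeps the demo short.
	watchMemory(ctx, 100*time.Millisecond, func(h uint64) {
		fmt.Printf("heap allocated: %d bytes\n", h)
	})
}
```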
What did you expect to see?
Process should continue and finish gracefully.
What did you see instead?
Process hangs