-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: morestack_noctxt missing SPWRITE, causes "traceback stuck" assert #54332
Comments
Given that you commented on #52116, I assume that this application was also suffering from that bug before this (seemingly incomplete) fix? |
This stack looks weird. 0x12eb5 is not even an aligned PC (ARM64 instructions are always 4 bytes). Why would it appear on the stack? 0x3ff93 from another stack trace is similarly weird. Maybe they are not PCs? Given that a nearby stack word has content 0x00012ee60001ce60, maybe 0x12eb5, 0x12ee6, 0x1ce60 are not PCs but some (32-bit?) integers on stack? On both stack traces the SIGPROF lands on PC 0x79650. What is that PC? Is it a Go function? Thanks. |
|
|
Addendum: I just trawled errors, and indeed, we do have significant numbers of "traceback stuck" dating back multiple months, just in applications that were able to restart after crashing without coming to my attention. |
Here are some additional crashes from other applications:
|
https://github.com/golang/go/blob/go1.18.5/src/runtime/asm_arm64.s#L321 -- culprit lines here. similarity to #52829, does this need to not have NOFRAME ie https://go-review.googlesource.com/c/go/+/405482/? |
Also repros with go1.19.0
https://github.com/golang/go/blob/go1.19/src/runtime/asm_arm64.s#L323-L324 |
Thanks for the details @lizthegrey !
I don't think this is necessary. The ARM32 function contains a call in some configurations, whereas the ARM64 function should not contain a call. So the signal lands at the entry of If you need a workaround you can try comment out the "traceback stuck" throw. The profile sample's stacks may be inaccurate (which probably was the case before that CL) but it probably won't crash. |
Need any additional debugging info? Any way that I can help? |
We have special cases for morestack and systemstack here, but I don't believe they are relevant because (a) @lizthegrey is this readily reproducible (ideally OSS, but I'm also curious if you could manually reproduce)? Or only occurs in prod? |
It reproduces multiple times per day in all of our environments. Sadly I'm not sure I can reduce to a minimal OSS test case. One thing I'd be happy to do is to run a version of this that is a branch build that patches go runtime to hang rather than panic when this occurs, then I'd be happy to ssh in and dlv the binary. |
Change https://go.dev/cl/424196 mentions this issue: |
Change https://go.dev/cl/424195 mentions this issue: |
If you are willing to patch the runtime, there are a few changes that would be helpful.
I think the latter is a bit simpler than attaching with a debugger, since it can track iteration state (previous frames) that will be gone once we decide to crash. Plus you don't need to watch for hung tasks and manually inspect them, but that is an option as well if the patch is concerning. [1] If adding tags to your build is difficult for some reason, you can also swap the build tag lines at the top of debuglog_off.go and debuglog_on.go. |
There's some slew since 1.18.5:
I'll fix the patch vs our base go version and try this overnight. |
Fresh crash for you with the additional telemetry. Details
|
Oops, apologies, I made this CL based on tip rather than 1.18. Did you get this fixed? If not I can provide another on 1.18.
Did this get built with |
I thought it did, but apparently it's missing them. Does
|
Yes, I also see that the stack dump doesn't include the entire stack either (https://go.dev/cl/424195). I'd guess that your build isn't picking up the patches to the runtime at all (e.g., wrong GOROOT), except that it clearly did when you got a build error... You could double check the symbols in the binary. If debuglog is enabled, it should have various |
oh, whoops, good catch.
that binary is pinned to an older build. the build instructions were correct, prod just didn't have the correct one. egg on my face, sorry, I'll re-try and let you know when we have a crash with the correct build. |
Thanks. By the way, do you happen to be using musl libc (/Alpine Linux)? (Is this even a cgo binary?) I'm investigating #54306 in parallel, presumably an issue with musl. It is probably unrelated, but it would help to know if you are using musl. |
No, we run on Ubuntu 22.04 containers with standard libc |
@lizthegrey have you seen more failures and do you have any logs to share? Could you doublecheck that the binary includes the patch? Thanks! |
No failures over weekend. Yes, 100% sure the crash from Fri includes the patch, and I can manually scrape the logs from another system, it'll just take some time and thus hoping it happens somewhere else where I can grab the logs automatically instead. |
https://share.getcloudapp.com/p9uQpJww -- no crashes today either. the crash on Friday must have been a fluke, but at least we have the telemetry to detect it if it does happen over the next few weeks. I'll still work on reconstructing the logs from the Friday crash immediately post-deploy though. |
Found the problem -- the pod was a lingering holdover of the previous build, despite running chronologically long after the new release should have gone. It did not have the patch. You are good to propose the thank you thank you thank you @cherrymui and @prattmic for your patience. |
Thanks @lizthegrey ! This is great. I'll send a CL soon. |
Change https://go.dev/cl/425396 mentions this issue: |
@gopherbot please backport this to previous releases. This may cause a runtime crash. Thanks. |
Backport issue(s) opened: #54674 (for 1.18), #54675 (for 1.19). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases. |
Change https://go.dev/cl/425615 mentions this issue: |
Change https://go.dev/cl/425616 mentions this issue: |
…architectures On LR architectures, morestack (and morestack_noctxt) are called with a special calling convention, where the caller doesn't save LR on stack but passes it as a register, which morestack will save to g.sched.lr. The stack unwinder currently doesn't understand it, and would fail to unwind from it. morestack already writes SP (as it switches stack), but morestack_noctxt (which tailcalls morestack) doesn't. If a profiling signal lands right in morestack_noctxt, the unwinder will try to unwind the stack and go off, and possibly crash. Marking morestack_noctxt SPWRITE stops the unwinding. Ideally we could teach the unwinder about the special calling convention, or change the calling convention to be less special (so the unwinder doesn't need to fetch a register from the signal context). This is a stop-gap solution, to stop the unwinder from crashing. Updates #54332. Fixes #54674. Change-Id: I75295f2e27ddcf05f1ea0b541aedcb9000ae7576 Reviewed-on: https://go-review.googlesource.com/c/go/+/425396 TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Cherry Mui <[email protected]> Reviewed-by: Michael Pratt <[email protected]> (cherry picked from commit e4be2ac) Reviewed-on: https://go-review.googlesource.com/c/go/+/425616
…architectures On LR architectures, morestack (and morestack_noctxt) are called with a special calling convention, where the caller doesn't save LR on stack but passes it as a register, which morestack will save to g.sched.lr. The stack unwinder currently doesn't understand it, and would fail to unwind from it. morestack already writes SP (as it switches stack), but morestack_noctxt (which tailcalls morestack) doesn't. If a profiling signal lands right in morestack_noctxt, the unwinder will try to unwind the stack and go off, and possibly crash. Marking morestack_noctxt SPWRITE stops the unwinding. Ideally we could teach the unwinder about the special calling convention, or change the calling convention to be less special (so the unwinder doesn't need to fetch a register from the signal context). This is a stop-gap solution, to stop the unwinder from crashing. Updates #54332. Fixes #54675. Change-Id: I75295f2e27ddcf05f1ea0b541aedcb9000ae7576 Reviewed-on: https://go-review.googlesource.com/c/go/+/425396 TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Cherry Mui <[email protected]> Reviewed-by: Michael Pratt <[email protected]> (cherry picked from commit e4be2ac) Reviewed-on: https://go-review.googlesource.com/c/go/+/425615
On LR architectures, morestack (and morestack_noctxt) are called with a special calling convention, where the caller doesn't save LR on stack but passes it as a register, which morestack will save to g.sched.lr. The stack unwinder currently doesn't understand it, and would fail to unwind from it. morestack already writes SP (as it switches stack), but morestack_noctxt (which tailcalls morestack) doesn't. If a profiling signal lands right in morestack_noctxt, the unwinder will try to unwind the stack and go off, and possibly crash. Marking morestack_noctxt SPWRITE stops the unwinding. Ideally we could teach the unwinder about the special calling convention, or change the calling convention to be less special (so the unwinder doesn't need to fetch a register from the signal context). This is a stop-gap solution, to stop the unwinder from crashing. Fixes golang#54332. Change-Id: I75295f2e27ddcf05f1ea0b541aedcb9000ae7576 Reviewed-on: https://go-review.googlesource.com/c/go/+/425396 TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Cherry Mui <[email protected]> Reviewed-by: Michael Pratt <[email protected]>
…architectures On LR architectures, morestack (and morestack_noctxt) are called with a special calling convention, where the caller doesn't save LR on stack but passes it as a register, which morestack will save to g.sched.lr. The stack unwinder currently doesn't understand it, and would fail to unwind from it. morestack already writes SP (as it switches stack), but morestack_noctxt (which tailcalls morestack) doesn't. If a profiling signal lands right in morestack_noctxt, the unwinder will try to unwind the stack and go off, and possibly crash. Marking morestack_noctxt SPWRITE stops the unwinding. Ideally we could teach the unwinder about the special calling convention, or change the calling convention to be less special (so the unwinder doesn't need to fetch a register from the signal context). This is a stop-gap solution, to stop the unwinder from crashing. Updates golang#54332. Fixes golang#54675. Change-Id: I75295f2e27ddcf05f1ea0b541aedcb9000ae7576 Reviewed-on: https://go-review.googlesource.com/c/go/+/425396 TryBot-Result: Gopher Robot <[email protected]> Run-TryBot: Cherry Mui <[email protected]> Reviewed-by: Michael Pratt <[email protected]> (cherry picked from commit e4be2ac) Reviewed-on: https://go-review.googlesource.com/c/go/+/425615
Change https://go.dev/cl/446895 mentions this issue: |
ref. CL 425396 Updates #54332. Change-Id: I1a235b0cca4dbf79cf61cf5f40b594fc2d940857 Reviewed-on: https://go-review.googlesource.com/c/go/+/446895 Reviewed-by: Cherry Mui <[email protected]> Reviewed-by: WANG Xuerui <[email protected]> Run-TryBot: Wayne Zuo <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]> Reviewed-by: xiaodong liu <[email protected]>
ref. CL 425396 Updates golang#54332. Change-Id: I1a235b0cca4dbf79cf61cf5f40b594fc2d940857 Reviewed-on: https://go-review.googlesource.com/c/go/+/446895 Reviewed-by: Cherry Mui <[email protected]> Reviewed-by: WANG Xuerui <[email protected]> Run-TryBot: Wayne Zuo <[email protected]> TryBot-Result: Gopher Robot <[email protected]> Reviewed-by: Ian Lance Taylor <[email protected]> Reviewed-by: xiaodong liu <[email protected]> (cherry picked from commit b5c8ae9)
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
yes, repros on go1.19.
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
Run continuous profiling via scraping /debug/pprof endpoints including mutex and blocking profiles, while performing a workload consuming from Kafka using Sarama and Klauspost Zstd libraries.
What did you expect to see?
No crashes.
What did you see instead?
The following crashes implicating
unexpected return pc for runtime.sigtramp
andtraceback stuck
(assert added in https://go-review.googlesource.com/c/go/+/400575/ by @cherrymui / #52116 which was backported to go1.18.5):The text was updated successfully, but these errors were encountered: