runtime: unexpected return pc crash on linux-amd64-alpine builder #54306
Comments
Change https://go.dev/cl/422097 mentions this issue: |
@rsc Assigning to you right now while you're updating the image, but feel free to unassign once you're done. Updates from internal discussion:
|
I updated the image already, just need to submit the CL. |
Also update Go version in buildlet/stage0 to make build work again.
For golang/go#54306 (but does not fix it).
Change-Id: I7dd656de9cb9f563b816929330fa53059c93b5b8
Reviewed-on: https://go-review.googlesource.com/c/build/+/422097
Run-TryBot: Russ Cox <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Auto-Submit: Russ Cox <[email protected]>
TryBot-Result: Gopher Robot <[email protected]>
Since the 10th, when https://go.dev/cl/422097 was submitted. (I'm almost certain the coordinator has been redeployed since then). 2022-08-16T20:39:44-e49e876/linux-amd64-alpine (Edit: rereading the CL, I see it wasn't intended to fix this) |
This is reproducible with
|
Elsewhere, @cherrymui mentioned that this looks like it could be corruption from a stack overflow. I agree. In the partial trace below, everything below
|
FWIW, I've been unable to reproduce this with |
I take that back. It took about an hour (instead of the usual ~5 minutes), but I did get a repro with |
I've been making slow progress on this. The most notable finding is that this reproduces when running only TestSetgid and TestSetgidStress, while it does not reproduce when running only various other tests I've tried. (I haven't tried each test individually, as there are dozens and the repro time is a bit high.) So this may be related to setgid, or just to signals in general. |
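For anyone trying to narrow this down the same way, here is a minimal sketch of a loop that reruns a command until its output contains the crash message. It is not the script used in this thread. The name `repeatUntilCrash` and the limit of 100 runs are assumptions made for illustration, and the command to run (for example a `go test -run` invocation restricted to the two tests above) is taken from the command line rather than hard-coded.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// crashMarker is the runtime message quoted in the issue title.
const crashMarker = "unexpected return pc"

// repeatUntilCrash calls run up to maxRuns times. It returns the 1-based
// attempt whose output contained crashMarker together with that output,
// or 0 and "" if no attempt crashed. (Hypothetical helper, not from the issue.)
func repeatUntilCrash(maxRuns int, run func() string) (attempt int, output string) {
	for i := 1; i <= maxRuns; i++ {
		out := run()
		if strings.Contains(out, crashMarker) {
			return i, out
		}
	}
	return 0, ""
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: repeat <command> [args...]")
		return
	}
	run := func() string {
		// CombinedOutput captures stdout and stderr; runtime crashes go to stderr.
		// The exit status is ignored because only the crash text matters here.
		out, _ := exec.Command(os.Args[1], os.Args[2:]...).CombinedOutput()
		return string(out)
	}
	const maxRuns = 100 // arbitrary limit chosen for this sketch
	if n, out := repeatUntilCrash(maxRuns, run); n > 0 {
		fmt.Printf("crash on attempt %d:\n%s", n, out)
	} else {
		fmt.Printf("no crash in %d attempts\n", maxRuns)
	}
}
```

Since a single repro can take anywhere from minutes to about an hour, stopping at the first match keeps the crashing output easy to find.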
It looks like the problem is that signal 34 (SIGRT_2) used by musl for setgid is not getting If I'm interpreting strace correctly, it looks like this signal is still
Only later does musl install the signal handler:
|
musl does not install the
|
(!) That means even if we set SA_ONSTACK for their handler, they will reinstall and overwrite it? |
https://git.musl-libc.org/cgit/musl/tree/src/thread/synccall.c#n102 Does it mean that they remove the handler at exit of the call? Hm.... |
Correct, they don't even try to match the existing flags or forward to an existing handler, so we can't install a dummy SA_ONSTACK handler.
Yes, that is what I see:
|
To summarize:
I don't see how we can work around this in Go given that we can't adjust the signal handler flags, nor does There are several changes on the musl side that could address this:
|
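To check from inside a process whether a given signal's handler was installed with SA_ONSTACK, something like the following sketch may help. It is an illustration, not code from the Go runtime or from musl. The `kernelSigaction` layout and the `saOnStack` value are assumptions that match linux/amd64 (and, as far as I know, linux/arm64), and `querySigaction` is a hypothetical helper that calls `rt_sigaction` with a nil new action so that nothing is modified.

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// saOnStack is SA_ONSTACK on Linux (assumed value for amd64/arm64).
const saOnStack = 0x08000000

// kernelSigaction mirrors the kernel's struct sigaction as used by the
// rt_sigaction system call on linux/amd64. This layout is an assumption
// of the sketch and is not portable to every architecture.
type kernelSigaction struct {
	handler  uintptr
	flags    uint64
	restorer uintptr
	mask     uint64
}

// querySigaction reads the current disposition of sig without changing it
// by passing a nil new action to rt_sigaction.
func querySigaction(sig int) (kernelSigaction, error) {
	var old kernelSigaction
	_, _, errno := syscall.Syscall6(
		syscall.SYS_RT_SIGACTION,
		uintptr(sig),
		0, // new action: nil, query only
		uintptr(unsafe.Pointer(&old)),
		unsafe.Sizeof(old.mask), // sigsetsize expected by the kernel
		0, 0,
	)
	if errno != 0 {
		return old, errno
	}
	return old, nil
}

func main() {
	// SIGSEGV is handled by the Go runtime; 34 is the signal discussed above.
	for _, sig := range []int{int(syscall.SIGSEGV), 34} {
		act, err := querySigaction(sig)
		if err != nil {
			fmt.Printf("signal %d: %v\n", sig, err)
			continue
		}
		fmt.Printf("signal %d: handler=%#x SA_ONSTACK=%v\n",
			sig, act.handler, act.flags&saOnStack != 0)
	}
}
```

Because musl installs its handler only for the duration of the call, a query like this would have to run while that call is in flight to observe the handler's flags; at other times signal 34 is expected to show no handler.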
Ah, it turns out this is a duplicate of #39857, which has been discussed at some length but not resolved. |
The revived linux-amd64-alpine builder has flaked twice in its short new lifetime with 'unexpected return pc' crashes during the cgo tests.
Here is a repro case using a gomote (note that if you ssh in, you have to set up your environment manually, and in particular you have to put /workdir/go/bin at the front of PATH and have to set GOROOT_BOOTSTRAP=/workdir/go1.4). Not sure why the environment is so messed up on Alpine. gomote run does not have these problems, only gomote ssh.
You may need to repeat the try.sh a few times depending on how flaky the machine is feeling but most runs get at least one failure.
Here are some failures from that script:
This one did not happen during garbage collection:
Here are the two build dashboard failures:
https://build.golang.org/log/658036e08c7a1d218c33808fdd1d6612b40502d8
and
https://build.golang.org/log/94cf14d78b116487dc76a921baf6ba76480a4c7a
Perhaps this is Alpine-specific, or perhaps it is musl-related.
The Alpine image may have an old Linux kernel; maybe we should update it.
There are a few other open 'unexpected return pc' issues.
Maybe they are all stale:
#35005 is the most interesting one but the repro case is a very large program running under Docker.