runtime: test timeouts / deadlocks on OpenBSD after CL 232298 #42237
Looks like it may also affect Windows:
Given that it's happening significantly on OpenBSD 6.2 and not at all on OpenBSD 6.4, it may be a problem with the old OpenBSD version or the amount of resources it has. An important difference between those two builders, beyond the OpenBSD version, is that the 6.4 builder was configured to significantly improve performance at the cost of disabling Spectre/Meltdown mitigations (see golang/build@31ed75f and golang/build@d3ddf67). CC @golang/release.
Looks like |
Just a thought: it may be worth stress testing this under the lock rank checker on these OSes. The dashboard only runs lock rank checking on Linux.
Looping in @mknyszek, since he's pretty familiar with the timer code.
Thanks. I have looked at most of the failures linked above and first impressions are that it's clearly not breaking the same way every time. The panic dumps look like a collection of blocked goroutines that are not consistent across failures. My first guess would be a missed thread wakeup somewhere leaving the runtime's Ps under-serviced. I haven't been able to do an in-depth analysis or testing yet. I also wonder if I missed something when rebasing to resolve conflicts with https://go-review.googlesource.com/c/go/+/259578/, so any help @prattmic could give double-checking that would be appreciated.
I took a look at that rebase when I had to rebase http://golang.org/cl/264477 around your change, and it seems OK to me, but I'll take another look.
I'm stress testing lock ranking on openbsd-amd64-62 now:
No luck so far, though. I'm not even getting the failures that are showing up at ~100% on the dashboard.
I'm able to get occasional timeouts by running all.bash on openbsd-amd64-62, but haven't had any luck getting any useful extra information yet.
It's slow going, but I've at least confirmed that we do indeed seem to be missing timer expirations. From one timeout:
All 3 of these timers should have expired 60-80s before this test failed, but didn't (unless I've missed a place where we leave expired timers in timer0When).
Hi, the issue can be reproduced by the following command on linux/amd64:
Thanks for that tip @erifan. Using that I was able to reproduce a deadlock pretty reliably. Setting some debug env vars ( Note that although I set the -test.timeout=30s that doesn't seem to work since the runtime gets deadlocked and timers aren't serviced. So on the run linked above I aborted the program after about 13s with When I look at the stack traces I am seeing a deadlock related to memory allocation. Most of the Ps have status
The stack trace for G1672 is disappointing:
Almost all other stack traces show goroutines in a runnable state, blocked either trying to allocate or at a function entry point, which I am guessing means they are blocked waiting for the GC, which isn't making progress for some reason. Maybe someone else can tease more information out of the report I've provided. Next step for me when I have some time is to revive some of the runtime debuglog code I preserved from when I was working on the timer changes in CL 232298. I've rebased that code and posted it at https://github.com/ChrisHines/go/tree/dlog-backup if anyone else wants to give that a try.
Trying the same test on the prior commit I did not observe any deadlocks after several attempts. |
Now that we have a local reproducer, would it make sense to roll back CL 232298 until it can be diagnosed and fixed? |
FWIW, we got a couple hits on the vanilla |
I've got a case in gdb (thanks for the repro @erifan!) with similar symptoms to @ChrisHines. One thread is in STW, waiting for running Ps to voluntarily stop:
Meanwhile:
P1 is in _Pidle. This indicates some kind of race. This P either should have been put in _Pgcstop directly by stopTheWorldWithSema, or it should have placed itself into _Pgcstop before sleeping.
@bcmills The rollback will be a bit messy due to conflicting changes that have gone in since, but I'll prepare a CL.
Change https://golang.org/cl/267198 mentions this issue:
I am not getting any deadlocks if I run with |
Capturing a debuglog trace as I described above it looks like the stuck P is blocked in So it looks like a deadlock on |
I'm getting close: the immediate cause is the new Here, P 9 is acquired by that
@ChrisHines interesting, can you actually see the M owning the P in gcstopm? As a continuation of above, what I'm seeing is that the stuck idle P is sitting an
This M is still sitting in
AFAICT, this is a general problem with M wakeup: it would seem that the One thought is that the CL did remove a |
Ah, I think I figured it out: Sure enough, adding That said, I don't know what this bug really has to do with the original CL. [1] This could also happen on any other call after |
Change https://golang.org/cl/267257 mentions this issue:
Thanks to @aclements and @mknyszek: the reason this CL trips up on this issue is that it introduces the first preemptible call to |
@prattmic I am finding the same behavior here. For example, in this case
Thanks for the bug fix. 🥇
@prattmic, any idea whether the NetBSD kernel has the same bug as OpenBSD? (https://build.golang.org/log/28fcb871ec9f9d01d787eec2c5f32f38450d5774 on CC @bsiegert
OpenBSD only supports the last two releases - that is currently 6.8 (released last month) and 6.7 (released May 2020). Anything older (6.6 and earlier) is unsupported and no longer receives reliability or security fixes. Based on previous discussions, Go should target the same - there is little gained by intentionally supporting OpenBSD 6.6 or earlier, or by running builders with EOL releases (in fact the opposite is true - Go risks running into problems by not running builders for supported releases).
@prattmic @4a6f656c I came here to suggest/ask something similar. If we've determined here that the Go code is correct and this issue is due to a kernel bug in OpenBSD 6.2 which has been fixed in OpenBSD 6.4, then I think we should drop the builders. The primary reason we still have OpenBSD 6.4 and 6.2 builders is that we didn't have enough bandwidth to update them more frequently (the last update was in 2018, see CL 139857 and CL 143458). If the builder health were at 100%, we would have OpenBSD 6.8 and 6.7 builders for the supported versions of OpenBSD. (Issue #35712 is about adding an OpenBSD 6.8 builder.) If all this sounds right, I think we should add to the Go 1.16 release notes that Go 1.16 requires at least OpenBSD 6.4, and disable the builder for Go 1.16 and newer. Opened #42426 for that.
@bcmills From a quick glance at the NetBSD source, it looks fine. I can try testing there and see what I can find.
@dmitshur - please let me know if you need assistance. While I can readily update the script that produces the OpenBSD images, I would not have the ability (at least as far as I know) to actually deploy the new builders, so I would need someone from the Go team to push it over the line.
Sounds good to me.
For easier testing, I created this simple test case. tl;dr, all the systems I tested seem fine except for OpenBSD 6.2.

```c
#include <errno.h>
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/event.h>
#include <sys/time.h>

int64_t nanoseconds(const struct timespec* ts) {
    return ts->tv_sec * 1000 * 1000 * 1000 + ts->tv_nsec;
}

int64_t nanotime(void) {
    struct timespec ts;
    int ret = clock_gettime(CLOCK_MONOTONIC, &ts);
    if (ret < 0) {
        perror("clock_gettime");
        exit(1);
    }
    return nanoseconds(&ts);
}

void do_kevent(int kq, const struct timespec* timeout) {
    int64_t start = nanotime();
    if (timeout != NULL) {
        printf("%" PRId64 ": %" PRId64 "ns timeout...\n", start, nanoseconds(timeout));
    } else {
        printf("%" PRId64 ": NULL timeout...\n", start);
    }
    fflush(stdout);

    struct kevent event;
    errno = 0;
    int ret = kevent(kq, NULL, 0, &event, 1, timeout);
    int64_t end = nanotime();
    printf("%" PRId64 ": kevent = %d, %d (took %" PRId64 "ns)\n", end, ret, errno, end - start);
    fflush(stdout);
}

int main(int argc, char** argv) {
    // Die if we hang.
    alarm(5);

    int kq = kqueue();
    if (kq < 0) {
        perror("kqueue");
        exit(1);
    }

    struct timespec ts;
    ts.tv_sec = 0;
    ts.tv_nsec = 0;
    do_kevent(kq, &ts);

    ts.tv_sec = 0;
    ts.tv_nsec = 500;
    do_kevent(kq, &ts);

    ts.tv_sec = 0;
    ts.tv_nsec = 1000;
    do_kevent(kq, &ts);

    ts.tv_sec = 1;
    ts.tv_nsec = 0;
    do_kevent(kq, &ts);

    // Expect to hang.
    do_kevent(kq, NULL);

    return 0;
}
```

Results:

openbsd-amd64-62 (buggy!):

openbsd-amd64-64:
(N.B. alarm expected here; NULL should hang forever)

netbsd-amd64-9_0:

freebsd-amd64-12_0:

darwin-amd64-10_15:

dragonfly-amd64-5_8:
@prattmic, the observed builder failures on NetBSD have been on the |
Hmm, actually, there's one on |
(So... do we need a separate issue to track the NetBSD deadlock?)
I shall leave no stone unturned, so I just ran the test of every builder matching So, yes, we'll need a new issue for NetBSD.
That builder runs macOS 10.11 El Capitan, and Go 1.14 is the last release that supports it, so it's normal for Go 1.16 to not work on it.
#42402 was closed as a duplicate of this issue.
@laboger, @prattmic, I think we're trying to keep this issue focused on the OpenBSD failure, which is now understood to be specific to an OpenBSD kernel bug. But there are clearly other issues on other platforms. Since @laboger reproduced timer hangs on linux/ppc64le and linux/amd64 in #42402, is #42424 really the right issue to merge that into?
@aclements I agree that #42402 shouldn't be closed as a duplicate of this; I've reopened it so it can be investigated. Given @prattmic's analysis above, that #42426 is in NeedsFix state now, and that we have other issues for the non-OpenBSD failures, I think there's nothing more to do here for OpenBSD in Go 1.16 specifically and this can be closed. Is my understanding right?
There appears to be an uptick in test timeouts (perhaps deadlocks?) on the OpenBSD and NetBSD builders starting at or around CL 232298. Since that CL affects runtime timers, we should investigate the failures to see whether it introduced a regression that could account for these timeouts.
2020-10-27T18:38:48-f0c9ae5/netbsd-amd64-9_0
2020-10-27T18:23:46-e3bb53a/openbsd-amd64-62
2020-10-27T18:13:59-3f6b1a0/netbsd-386-9_0
2020-10-27T18:13:59-3f6b1a0/netbsd-amd64-9_0
2020-10-27T18:13:59-3f6b1a0/openbsd-amd64-62
2020-10-27T17:24:42-8fdc79e/openbsd-amd64-62
CC @ALTree @ChrisHines @ianlancetaylor @aclements