runtime: TestGdbCoreCrashThreadBacktrace failures #64752
Found new dashboard test flakes for:
2023-12-14 22:24 linux-amd64-bullseye go@d73b4322 runtime.TestGdbCoreCrashThreadBacktrace (log)
This test is new in https://go.dev/cl/536895. The most obvious cause for this failure would be the crash thread failing to crash for >5s. However, we don't have any timing information in the test log.
Actually, a better theory:
Hi @prattmic, thanks! I would like to send a CL to fix this.
Thanks. Are you saying that we should add more timing logs in signal_unix.go (maybe also how many ms it is still waiting for SIGQUIT) in each loop? It seems hard to print timing information from the test code itself, other than the overall time.
Sorry, I'm not sure I follow. In this case, I guess it may really take more than 5s to deliver the SIGQUIT signals, since this case takes 7.67s to run.
However, it still seems strange to me that the ms which are in
This has only failed once, so the failure rate doesn't currently seem too bad, but ultimately this is testing a best-effort feature that intentionally "breaks" after 5s, so it won't ever be 100% reliable (given a slow enough machine). That makes it a good candidate for #39204.

@zzkcode thanks for pointing out the total test time, I missed that. That does imply that 5s really did pass, rather than something else happening, like an early

For now, perhaps we should modify the test to run a few times (3?) and at least verify that the functionality isn't failing 3/3 times?
Hi @prattmic, thanks for the confirmation.

IMHO, since this test case includes a small Go+cgo program with only a few (5 or 6, maybe?) threads to quit, it should usually take only a few seconds to exit. So we would expect this case to pass (all threads receive SIGQUIT and quit within 5 seconds) in the vast majority of cases, though not all of them (it will eventually quit anyway after 5 seconds). Like you said, however, it may be running on

I don't know whether looping the test more times is the best choice, or whether we should just skip it on some platforms. I would prefer the former for now, since it has only been reported once as you mentioned, but it won't surprise me if it fails 3 times in a row in the future (a rare event). So let's loop 3 times: if it passes, we break and skip the remaining loops; otherwise, give it another 2 chances. It may take longer (15s+) to run this single case in extreme scenarios. I want to submit a PR to fix this if that is OK.

One more tiny thing (unrelated to this test, but related to the code change; maybe it deserves a separate issue): now that 1.22rc1 is released, I have tried it several times. It usually takes at least 500ms (per loop) to quit on a crash, which is slower compared to 1.21 or earlier releases (it seems to stick somewhere for a moment while waiting for the first thread to crash). Maybe we could use exponential backoff, or just double the sleep time, if this is an issue. Probably not a big deal, just an observation.

FYI @cherrymui @bcmills
In general it is not a good idea for tests to assume that something happens within any particular arbitrary timeframe. Is there some way to eliminate the arbitrary 5-second interval? For example, could the test wait indefinitely instead of just for 5 seconds, or poll indefinitely to check whether the backtrace has been recorded yet? Then the timeout behavior of the test would follow the
Not easily, the timeout is hard-coded into a sleep in the fatal signal handler: https://cs.opensource.google/go/go/+/master:src/runtime/signal_unix.go;l=784;drc=de5b418bea70aaf27de1f47e9b5813940d1e15a4. But I suppose we could use a global variable and have the test use
This is a good point. This is an effect of https://cs.opensource.google/go/go/+/master:src/runtime/signal_unix.go;l=781;drc=de5b418bea70aaf27de1f47e9b5813940d1e15a4. The crashing thread will almost certainly sleep for 500ms at least once while waiting for SIGQUIT to make its rounds. IMO, we should just reduce this to something small, like 1ms, as it shouldn't take long for other threads to exit.
Oof, that's unfortunate. Is there some reason the runtime's watchdog needs to be that short? Perhaps we could extend the sleep until
@prattmic Maybe it will need more time to crash in a large enough application, like one with thousands of threads? In that case, it may have to loop many times in
@bcmills I don't have knowledge of its original background. However, this change is intended as a fix in https://go.dev/cl/536895. Before it, the first thread to see that all threads had received SIGQUIT would start to crash (usually the last thread to receive that signal), and all the other threads would sleep for 5 seconds. If the thread that was supposed to crash missed crashing in some edge case, the others would still be able to crash the process. So the process would always take no more than 5 seconds to crash. This change just follows that behavior. Maybe it is just for crashing ASAP?

FYI @ianlancetaylor @prattmic @bcmills

Did a tiny change, is this acceptable?
diff --git a/src/runtime/signal_unix.go b/src/runtime/signal_unix.go
index 84391d58ed..dc48067b45 100644
--- a/src/runtime/signal_unix.go
+++ b/src/runtime/signal_unix.go
@@ -774,14 +774,16 @@ func sighandler(sig uint32, info *siginfo, ctxt unsafe.Pointer, gp *g) {
print("\n-----\n\n")
raiseproc(_SIGQUIT)
}
+ totalSleepTimeUs := 10 * 1000 * 1000
if isCrashThread {
- i := 0
- for (crashing.Load() < mcount()-int32(extraMLength.Load())) && i < 10 {
- i++
- usleep(500 * 1000)
+ perLoopSleepTimeUs := 1000
+ i := totalSleepTimeUs / perLoopSleepTimeUs
+ for (crashing.Load() < mcount()-int32(extraMLength.Load())) && i > 0 {
+ i--
+ usleep(uint32(perLoopSleepTimeUs))
}
} else {
- usleep(5 * 1000 * 1000)
+ usleep(uint32(totalSleepTimeUs))
}
printDebugLog()
crash()

Tests pass on Linux.
I was unsure how long making these crash rounds takes, so I added some quick and dirty logging to the runtime and got:
So approximately 1ms per thread (this feels rather high to me, but now isn't the time to worry about optimizing crash too much). Perhaps 5ms is a good sleep value given these numbers. We could get fancy and have this depend on the M count, but I'm really not very concerned about the inefficiency of one thread waking a bit too frequently when we are crashing. Your patch above looks good, feel free to send a CL. Note that you'll need to make
Change https://go.dev/cl/554615 mentions this issue: |
Hi @prattmic, I have submitted a CL. However, I don't know whether I understood you correctly or I am just doing it the wrong way: the

Sorry if I got you wrong; please advise me on the right way to do this. Thanks.
Some analysis of the test log reported by

The program ends up running six goroutines in total. When run on my workstation, at the time of the crash the goroutines are parked in the following locations:
In a successful run on my local workstation, I see
In the builder log, we see
The thread in
So this looks like a genuine watchdog failure: the kernel defers delivery of the final

Perhaps
I wonder if this race is also responsible for one or more of the other patterns of test flakes, such as:
Thanks for the analysis @bcmills! I agree with you, this looks like a genuine race condition. It is also not new in https://go.dev/cl/536895; it appears to have ~always existed (https://go.dev/cl/245018 extended the race window, but it was present before that too).

Regarding a fix, I'll need to think some more. I'm not thrilled about masking signals, but I don't have a better idea at the moment.

I could imagine this causing some of those other crashes, but not all. e.g., #60316: I don't see how this could manifest as a SIGCHLD death.

Since this bug isn't new, fixing it need not block the release, but we should still do something about the test. Should we mark it flaky until this bug is fixed?
Yep, I agree that a
Change https://go.dev/cl/556356 mentions this issue: |
This test exercises the SIGQUIT crash loop and managed to trigger the race from golang#65138 at least once. For golang#65138. Fixes golang#64752. Change-Id: I11091510aa7ae4f58b1d748e53df2e3e3dbfb323 Reviewed-on: https://go-review.googlesource.com/c/go/+/556356 Auto-Submit: Michael Pratt <[email protected]> LUCI-TryBot-Result: Go LUCI <[email protected]> Reviewed-by: Bryan Mills <[email protected]> Reviewed-by: Cherry Mui <[email protected]>
Running with few threads usually does not need 500ms to crash, so let it crash as soon as possible. While the test may take more time on a slow machine, expand the sleep time in the test. Updates #64752 Change-Id: I635fab846bd5e1735808d4d47bb9032d5a04cc2b GitHub-Last-Rev: 84f3844 GitHub-Pull-Request: #65018 Reviewed-on: https://go-review.googlesource.com/c/go/+/554615 Reviewed-by: Michael Pratt <[email protected]> Reviewed-by: Bryan Mills <[email protected]> Auto-Submit: Cherry Mui <[email protected]> LUCI-TryBot-Result: Go LUCI <[email protected]> Reviewed-by: Cherry Mui <[email protected]>
Issue created automatically to collect these failures.
Example (log):
— watchflakes