runtime: TestGdbCoreCrashThreadBacktrace failures #64752
Found new dashboard test flakes for:
2023-12-14 22:24 linux-amd64-bullseye go@d73b4322 runtime.TestGdbCoreCrashThreadBacktrace (log)
This test is new in https://go.dev/cl/536895. The most obvious cause for this failure would be the crash thread failing to crash for >5s. However, we don't have any timing information in the test log.
Actually, a better theory:
Hi @prattmic, thanks! I would like to send a CL to fix this.
Thanks. Are you saying that we should add more timing logs in signal_unix.go (maybe also how many ms it is still waiting for SIGQUIT) in each loop? It seems hard to print timing information from the test code itself, other than the overall time.
Sorry, I'm not sure I follow. In this case, I guess it may really take more than 5s to deliver the SIGQUIT signals, since this case takes 7.67s to run.
However, it still seems strange to me that the ms which are in
This has only failed once, so the failure rate doesn't currently seem too bad, but ultimately this is testing a best-effort feature that intentionally "breaks" after 5s, so it won't ever be 100% reliable (given a slow enough machine). That makes it a good candidate for #39204.

@zzkcode thanks for pointing out the total test time, I missed that. That does imply that 5s really did pass, rather than something else happening, like an early

For now, perhaps we should modify the test to run a few times (3?) and at least verify that the functionality isn't failing 3/3 times?
Hi @prattmic, thanks for the confirmation.

IMHO, since this test case includes a small Go+cgo program with only a few (5 or 6, maybe?) threads to quit, it should usually take only a few seconds to exit. So we would expect this case to pass (all threads receive SIGQUIT and quit within 5 seconds) in the vast majority of cases, though not all of them (it will eventually quit anyway after 5 seconds). Like you said, however, it may be running on

I don't know whether looping the test more times is the best choice, or whether we should just skip it on some platforms. I would prefer the former for now, since it has only been reported once as you mentioned, but it won't surprise me if it fails 3 times in a row in the future (a rare event). So let's loop 3 times: if it passes, we break and skip the remaining loops; otherwise, give it another 2 chances. It may take longer (15s+) to run this single case in extreme scenarios. I want to submit a PR to fix this if that is OK.

One more tiny thing (unrelated to this test, but related to the code change; maybe it deserves a separate issue): now that 1.22rc1 is released, I have tried it several times. It usually takes at least 500ms (per loop) to quit on a crash, which is slower compared to 1.21 or earlier releases (it seems to stick somewhere for a moment while waiting for the first thread to crash). Maybe we could use exponential backoff, or just double the sleep time, if this is an issue. Probably not a big deal, just an observation.

FYI @cherrymui @bcmills
In general it is not a good idea for tests to assume that something happens within any particular arbitrary timeframe. Is there some way to eliminate the arbitrary 5-second interval? For example, could the test wait indefinitely instead of just for 5 seconds, or poll indefinitely to check whether the backtrace has been recorded yet? Then the timeout behavior of the test would follow the
Not easily, the timeout is hard-coded into a sleep in the fatal signal handler: https://cs.opensource.google/go/go/+/master:src/runtime/signal_unix.go;l=784;drc=de5b418bea70aaf27de1f47e9b5813940d1e15a4. But I suppose we could use a global variable and have the test use
This is a good point. This is an effect of https://cs.opensource.google/go/go/+/master:src/runtime/signal_unix.go;l=781;drc=de5b418bea70aaf27de1f47e9b5813940d1e15a4. The crashing thread will almost certainly sleep for 500ms at least once while waiting for SIGQUIT to make its rounds. IMO, we should just reduce this to something small, like 1ms, as it shouldn't take long for other threads to exit.
Oof, that's unfortunate. Is there some reason the runtime's watchdog needs to be that short? Perhaps we could extend the sleep until
@prattmic Maybe it will need more time to crash in a large enough application, like one with thousands of threads? In that case, it may have to loop many times in
@bcmills I don't have knowledge of its original background. However, this change is intended as a fix in https://go.dev/cl/536895. Before it, the first thread to see that all threads had received SIGQUIT would start to crash (usually the last thread to receive that signal), and all the other threads would sleep for 5 seconds. If the thread that was supposed to crash missed crashing in some edge case, the others would still be able to crash the process. So the process would always take no more than 5 seconds to crash. This change just follows that behavior. Maybe it is just for crashing ASAP?

FYI @ianlancetaylor @prattmic @bcmills

Did a tiny change, is this acceptable?
diff --git a/src/runtime/signal_unix.go b/src/runtime/signal_unix.go
index 84391d58ed..dc48067b45 100644
--- a/src/runtime/signal_unix.go
+++ b/src/runtime/signal_unix.go
@@ -774,14 +774,16 @@ func sighandler(sig uint32, info *siginfo, ctxt unsafe.Pointer, gp *g) {
print("\n-----\n\n")
raiseproc(_SIGQUIT)
}
+ totalSleepTimeUs := 10 * 1000 * 1000
if isCrashThread {
- i := 0
- for (crashing.Load() < mcount()-int32(extraMLength.Load())) && i < 10 {
- i++
- usleep(500 * 1000)
+ perLoopSleepTimeUs := 1000
+ i := totalSleepTimeUs / perLoopSleepTimeUs
+ for (crashing.Load() < mcount()-int32(extraMLength.Load())) && i > 0 {
+ i--
+ usleep(uint32(perLoopSleepTimeUs))
}
} else {
- usleep(5 * 1000 * 1000)
+ usleep(uint32(totalSleepTimeUs))
}
printDebugLog()
crash()

Tests pass on Linux.
I was unsure how long making these crash rounds takes, so I added some quick and dirty logging to the runtime and got:
So approximately 1ms per thread (this feels rather high to me, but now isn't the time to worry about optimizing crash too much). Perhaps 5ms is a good sleep value given these numbers. We could get fancy and have this depend on the M count, but I'm really not very concerned about the inefficiency of one thread waking a bit too frequently when we are crashing. Your patch above looks good, feel free to send a CL. Note that you'll need to make
Change https://go.dev/cl/554615 mentions this issue: |
Hi @prattmic, I have submitted a CL. However, I don't know whether I understood you correctly or I am just doing it the wrong way: the

Sorry if I got you wrong; please advise me on the right way to do this. Thanks.
Some analysis of the test log reported by

The program ends up running six goroutines in total. When run on my workstation, at the time of the crash the goroutines are parked in the following locations:
In a successful run on my local workstation, I see
In the builder log, we see
The thread in
So this looks like a genuine watchdog failure: the kernel defers delivery of the final

Perhaps
I wonder if this race is also responsible for one or more of the other patterns of test flakes, such as:
Thanks for the analysis @bcmills! I agree with you, this looks like a genuine race condition. It is also not new in https://go.dev/cl/536895; it appears to have ~always existed (https://go.dev/cl/245018 extended the race window, but it was present before that too).

Regarding a fix, I'll need to think some more. I'm not thrilled about masking signals, but I don't have a better idea at the moment.

I could imagine this causing some of those other crashes, but not all. e.g., #60316: I don't see how this could manifest as a SIGCHLD death.

Since this bug isn't new, fixing it need not block the release, but we should still do something about the test. Should we mark it flaky until this bug is fixed?
Yep, I agree that a
Change https://go.dev/cl/556356 mentions this issue: |
This test exercises the SIGQUIT crash loop and managed to trigger the race from golang#65138 at least once. For golang#65138. Fixes golang#64752. Change-Id: I11091510aa7ae4f58b1d748e53df2e3e3dbfb323 Reviewed-on: https://go-review.googlesource.com/c/go/+/556356 Auto-Submit: Michael Pratt <[email protected]> LUCI-TryBot-Result: Go LUCI <[email protected]> Reviewed-by: Bryan Mills <[email protected]> Reviewed-by: Cherry Mui <[email protected]>
Running with few threads usually does not need 500ms to crash, so let it crash as soon as possible. While the test may take more time on a slow machine, expand the sleep time in the test. Updates #64752 Change-Id: I635fab846bd5e1735808d4d47bb9032d5a04cc2b GitHub-Last-Rev: 84f3844 GitHub-Pull-Request: #65018 Reviewed-on: https://go-review.googlesource.com/c/go/+/554615 Reviewed-by: Michael Pratt <[email protected]> Reviewed-by: Bryan Mills <[email protected]> Auto-Submit: Cherry Mui <[email protected]> LUCI-TryBot-Result: Go LUCI <[email protected]> Reviewed-by: Cherry Mui <[email protected]>
Issue created automatically to collect these failures.
Example (log):
— watchflakes