Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/signal: TestSignalTrace failures with "timeout after 100ms waiting for hangup" #46736

Closed
bcmills opened this issue Jun 14, 2021 · 11 comments · Fixed by ferrmin/go#87
Closed
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Milestone

Comments

@bcmills bcmills added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jun 14, 2021
@bcmills bcmills added this to the Go1.17 milestone Jun 14, 2021
@bcmills
Copy link
Contributor Author

bcmills commented Jun 14, 2021

Marking as release-blocker for Go 1.17 (CC @golang/release) because this test is new as of 1.17 — we shouldn't be shipping new tests that are known to be flaky.

If we can confirm that the test failures are not due to an actual regression in Go 1.17, we can add a call to testenv.SkipFlaky to unblock the release while we figure out how to make the test more robust.

@AndrewGMorgan
Copy link
Contributor

The #45773 issue referred to exclusively linux failures. The code I was testing when I added this test case (for #44193) was linux specific, so they seemed relevant - I just hadn't seen #45773 until being cc:d on this one.

I recall being concerned while developing the fix for the #44193 that I wasn't breaking the thing it tests with my change. It didn't occur to me that the pre-existing code might not be working. Further, the fact that this present bug seems to be for non-linux code: darwin and illumos, I'm fairly sure they must have a different root cause.

Have we seen this issue on the 1.16 branch? Or do we believe this is a 1.17 regression?

@bcmills
Copy link
Contributor Author

bcmills commented Jun 15, 2021

Have we seen this issue on the 1.16 branch? Or do we believe this is a 1.17 regression?

TestSignalTrace itself is new in 1.17 (added in CL 305149, so the test flakiness is by definition a regression.

The new test verifies the fix for the bug found in #44193, which was similar to #43149, which was present in 1.16.

So my best guess is that the underlying cause was either similar or even worse in 1.16, and the test failures indicate either an incomplete fix or a bug in the test itself.

@AndrewGMorgan
Copy link
Contributor

I'm a bit confused, so here are some details that seem relevant to me. Please correct any of them, or add some more data points if known:

So, it feels like an important data point to seek is, do we have any crash logs like this from 1.16 (after CL 316869)?

If not, are we convinced that this present bug isn't purely a 1.17 regression with code paths tested by TestSignalTrace()?

@bcmills
Copy link
Contributor Author

bcmills commented Jun 16, 2021

So, it feels like an important data point to seek is, do we have any crash logs like this from 1.16 (after CL 316869)?

I don't see any failures on the dashboard for the 1.16 branch, but given how few test runs occur on that branch that doesn't tell us much. (The rate of failures at head is high enough to rule out a hardware flake, but the failures are still relatively infrequent overall.)

@toothrot
Copy link
Contributor

@AndrewGMorgan
Copy link
Contributor

The build failure logs for linux-ppc* seem to be all around the timeline of just before this:

https://go-review.googlesource.com/c/go/+/315049/ (submitted may 4)

is it reasonable to discount the linux-ppc* examples from the list at the top of this present bug? Or were those after that CL was applied? If so, it looks like illumos and darwin are the architectures that remain occasionally hiccuping, with a 100ms timeout. Is there some reason to expect signal delivery to be that long on these? Or trace stop/start to be significantly slower on these architectures?

@bcmills
Copy link
Contributor Author

bcmills commented Jun 17, 2021

Yes, I think it's reasonable to focus on the illumos and darwin failures.

I don't know why darwin would be particularly slow. The illumos builder is a reverse-builder run by @jclulow, and I think it's a bit more heavily loaded than many of the other builders.

(But, really, tests in the standard library should be written so that they don't assume fast hardware, because Go users in general should be able to run go test all on their code, and many Go users do have slow or older machines.)

@AndrewGMorgan
Copy link
Contributor

To be quite honest, when I added this test, I was just reusing the pre-existing waitSig() function, assuming it must be the standard way to do all this. Tuning whatever timeout never crossed my mind.

Given these occasional timeouts, I'd be tempted to replace all the complex code in signal_test.go:init() with simply setting the timeout to 30 * time.Second. Given that it is a timeout that is fatal if it ever occurs, and we don't expect it to ever fire for cause, this seems like a pragmatic way to avoid false positives on all architectures. If I read all the workaround code it seems to be saying that the OS and load etc. makes the timing unpredictable.

@ianlancetaylor
Copy link
Member

We currently use settleTime for two different things: for waitSig and friends, and for quiesce. We don't want to increase the time used by quiesce arbitrarily, because that will slow down the tests. But I agree that there doesn't seem to be a reason to use a short timeout for waitSig.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/329502 mentions this issue: os/signal: test with a significantly longer fatal timeout

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants