syscall: TestSetuidEtc failures on linux-ppc64-buildlet #42462
Comments
Marking as release-blocker at least until we determine whether this is due to an API change in Go 1.16 (per #42178 (comment)). |
Please assign this bug report to me. |
I noticed this on the dashboard. It seems like there is some kind of timing issue in reading the /proc//status file within one process. I briefly tried to reproduce it and was not able to but didn't try too hard. Both threads are trying to read the same file. Odd that it only happens on ppc64 and not ppc64le. |
Reviewing the failure logs, these three all share the same failure characteristics (root cause currently unknown)
Looking at build.golang.org, this failure mode appears to be rare. These failures are believed to have been resolved in 3a819e8 11 days ago (i.e., the failure signature matches something explicitly fixed by that commit): |
Looking for patterns, the three failures are listed below. The commonality visible in the first and third logs appears to be that, as Lynn points out, the listed thread ID is common to the failing test cases. I'm struck by how close together the three listed thread IDs are: 22053, 21921 and 21923. Is this to be expected? For example, is the build run in a container launched just to run the build tests, so the number of processes consumed by the time the test runs is generally about the same? Or is the available process range pretty small in this container and this is just evidence of ID reuse? From debugging #42178 we determined that ppc64 (sans "le") runs without cgo, so this code is executing the syscall.AllThreadsSyscall() function behind the scenes when the test calls syscall.Set*(). This may or may not be important. I wonder if similar sporadic failures have been seen in the nocgo build tests on other platforms?
|
Looking at this output using the following method, we see evidence that individual threads stop behaving as part of the collective process. In the first case this is immediate: the Setegid(1) fails to take. In the third case that call succeeds, but the next Seteuid(1) call fails. Looking at the code for the test (https://github.com/golang/go/blob/master/src/syscall/syscall_linux_test.go#L596), all of the test cases after the first failing one also fail, unless the test is looking for the stuck state. What also seems to happen is that the test stops failing at some point before getting to the last test (which should be numbered [20]). This suggests that whatever is causing the test to fail cleans itself up before the test reaches the end. My suspicion is that this test is, in some way, racing thread termination on this architecture's test environment. "grep uid" on the first and third of these yields:
and
and "grep gid" yields:
and
|
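For readers following the "grep uid" / "grep gid" method above, here is a minimal Go sketch of the same per-thread comparison. It is not the code from syscall_linux_test.go: the helper names `idFields` and `mismatched`, the thread IDs, and the sample status text are all hypothetical, and the sketch assumes each thread's status text has already been read from the /proc filesystem.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// idFields returns the whitespace-separated values of the "<key>:" line
// (for example "Uid" or "Gid") found in the text of a status file.
// Hypothetical helper, not part of the Go test suite.
func idFields(status, key string) ([]string, bool) {
	prefix := key + ":"
	for _, line := range strings.Split(status, "\n") {
		if strings.HasPrefix(line, prefix) {
			return strings.Fields(strings.TrimPrefix(line, prefix)), true
		}
	}
	return nil, false
}

// mismatched returns, in ascending order, the thread IDs whose "<key>:"
// line is missing or differs from want. A non-empty result means some
// thread did not pick up the credential change.
func mismatched(statuses map[int]string, key string, want []string) []int {
	var bad []int
	for tid, text := range statuses {
		got, ok := idFields(text, key)
		if !ok || strings.Join(got, " ") != strings.Join(want, " ") {
			bad = append(bad, tid)
		}
	}
	sort.Ints(bad)
	return bad
}

func main() {
	// Made-up sample data: thread 102 still shows the old effective UID.
	statuses := map[int]string{
		101: "Name:\tsyscall.test\nUid:\t0\t1\t0\t1\nGid:\t0\t0\t0\t0\n",
		102: "Name:\tsyscall.test\nUid:\t0\t0\t0\t0\nGid:\t0\t0\t0\t0\n",
	}
	fmt.Println(mismatched(statuses, "Uid", []string{"0", "1", "0", "1"}))
}
```

With the made-up data above, only thread 102 is reported, which mirrors the pattern in the logs where a single thread ID shows up in the failing cases.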
Racing with thread termination would also explain the second failure. |
This passes for me (total running time about 1m26s) on linux-ppc64-buildlet: go/bin/go test syscall -v -run=TestSetuidEtc -count=10000 Given the rarity of this failure mode, I'm going to have to think a bit more about how to fix it. |
Without the -v it runs quicker, but similarly no failures. I'm going to explore modifying the test to see if I can make it fail quicker, and/or more verbosely. |
Based on the theory that the issue is possibly a race somewhere with terminating a thread, I modified
[*] Note, this patch includes a fix that is unrelated to the bug in question, and fixes an issue with the last test fix... I was able to get some failures on my non-ppc64 workstation, but those turned out to be due to the fact that Using that I generated some similar failures on
|
Change https://golang.org/cl/268717 mentions this issue: |
The issue appears to be another race with interpreting the /proc/ filesystem in the testing code: |
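As an illustration of how a /proc reader can race with thread termination, the sketch below tolerates a thread that disappears between listing the thread IDs and reading its status. This is only one plausible shape for such a race and is not the change made in https://golang.org/cl/268717. The names `readThreadStatuses` and `fakeRead` are hypothetical; in real use the reader would presumably wrap something like `os.ReadFile` on the per-thread status file, and other errors (for example ESRCH on an exiting thread) might also need handling.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

// readThreadStatuses calls read for every thread ID and collects the
// results. A thread whose status file no longer exists is skipped,
// because it may simply have exited after the IDs were listed.
// Any other error aborts the scan. Hypothetical helper.
func readThreadStatuses(tids []int, read func(tid int) (string, error)) (map[int]string, error) {
	out := make(map[int]string)
	for _, tid := range tids {
		text, err := read(tid)
		if errors.Is(err, fs.ErrNotExist) {
			continue // thread vanished between listing and reading
		}
		if err != nil {
			return nil, fmt.Errorf("reading status of thread %d: %w", tid, err)
		}
		out[tid] = text
	}
	return out, nil
}

// fakeRead stands in for reading a per-thread status file. Thread 2
// pretends to have exited; negative IDs produce an unrelated error.
func fakeRead(tid int) (string, error) {
	switch {
	case tid < 0:
		return "", errors.New("invalid thread id")
	case tid == 2:
		return "", fmt.Errorf("thread %d: %w", tid, fs.ErrNotExist)
	}
	return "Uid:\t0\t1\t0\t1\n", nil
}

func main() {
	statuses, err := readThreadStatuses([]int{1, 2, 3}, fakeRead)
	fmt.Println(len(statuses), err)
}
```

Injecting the reader as a function keeps the sketch runnable without a Linux /proc filesystem and makes the vanished-thread path easy to exercise.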
Note, in passing, that the master branch just threw up another failure for linux-ppc64-buildlet: https://build.golang.org/log/650f97099ef1a84e84bb7f2d17f63a20d4c04559 This doesn't yet include the https://golang.org/cl/268717 fix. |
So, the first version of the changes unexpectedly failed: 2 of 23 SlowBots failed: My plan is to revert the thread termination part of the change to misc/cgo/test/issue1435.go and retry. I'm not yet clear if this is a newly uncovered bug, or something related to what I thought I was fixing. |
Some investigation later, there were two issues:
My plan is to submit a fix for this bug in two stages: first fix the original bug as reported, up to and including (1) of this comment, then investigate why (2) is failing. It could be some subtlety of glibc's thread implementation, since launching C.pthread_create() is the only way the issue1435.go test differs from the one in syscall_linux_test.go. |
These two tests continue to fail: 2 of 23 SlowBots failed: All evidence of (1) is gone now, so I'll focus on what is up with the cgo failures. The net change with this fix "shouldn't" have triggered any new failures, so I suspect we're tripping over what must have been, let's say, a heisenbug lurking all along in the cgo side of the feature... [Here's hoping it is not a glibc issue.] |
I've decided to split this issue into two (see #42494 for the aspect of the issue I plan not to resolve in this present bug). For this present bug, I want to get the checked in source back to testing stability by addressing the issue with the ppc64 build testing. |
2020-11-06T19:42:05-362d25f/linux-ppc64-buildlet
2020-10-30T00:03:40-01efc9a/linux-ppc64-buildlet
2020-10-29T19:03:09-f2c0c2b/linux-ppc64-buildlet
2020-10-23T23:01:52-64dc25b/linux-ppc64-buildlet
2020-10-23T20:54:25-5f616a6/linux-ppc64-buildlet
See previously #42178 (@AndrewGMorgan, @laboger, @ianlancetaylor)