runtime: macOS Sierra builders spinning #18751
(Other comment deleted. The SIGQUIT shown in the deleted post was the buildlet process, not the spinning one.)
Probably unrelated, but just in case... yesterday, the net/http tests started hanging on my laptop, and required a reboot to get them passing again. The test that was hanging was TestTransportPersistConnReadLoopEOF. I'm afraid I don't have the stack traces anymore. Again probably unrelated: many of the net/http benchmarks fail on Sierra with 'too many open files'.
@josharian, this is just
@bradfitz, can you
I did. I then also tried a Ctrl-\ at the parent (the buildlet) and got a backtrace of only the buildlet (backtrace paths were like /home/bradfitz instead of the expected /Users/gopher). Trying again now.
Possibly related: #17161. Infinite retry was added to solve that issue. From the second stacktrace, it looks like it may be going into that infinite retry loop.
That's too bad. A few other things come to mind: Does it freeze even if you run a trivial subcommand like, say,
I would expect to see Mach semaphore calls in the dtrace if this were the case, though maybe those are at the wrong level in the OS X kernel for dtrace? (I've always found this confusing.) The second traceback was just from the buildlet, so I don't think it's related. That's also exactly what I would expect a normal parked P's stack to look like.
Can you add a
In case it's not obvious, the
@aclements
Additionally, dtrace can only handle cgo-using (dynamically linked) Darwin binaries.
The most likely explanation is that you are bootstrapping with a stale Go 1.4, without the Sierra time fix patches.
@rsc, I addressed that in the top comment. That was a problem when I first set up the builder (cloning from 10.11), but it revealed itself as a problem very quickly. But I just double-checked anyway. I logged into a 10.12 VM, extracted https://storage.googleapis.com/golang/go1.4-bootstrap-20161024.tar.gz (linked from the https://golang.org/doc/install/source page) into a new directory, and diffed the VM's baked-in $HOME/go1.4 against the newly extracted dir. They're the same. So the Sierra machine's $HOME/go1.4 is a modern one. Unless https://storage.googleapis.com/golang/go1.4-bootstrap-20161024.tar.gz is bogus, but people have been successfully using it.
@josharian your hang is #18541. I think that's not the issue here (no network involved).
A little concerned about the go1.4-bootstrap tar not having e39f2da (fix unsigned shift "warnings"), but I just tried flipping my go1.4 to the CL from the bootstrap tar and it works OK. I still think this seems like the time problem, but clearly not that exact time problem (sorry for missing that in the original post). Maybe something is different when running under VMware. @bradfitz can you send me instructions to connect to the VMs off-list?
Done.
OK, I'm running make.bash in a loop waiting for a hang to poke at. In the meantime, I've remembered #18540, which may be the same thing - unexplained hang waiting for child.
I have run 61 successful make.bash runs in a row on that VM. I also restarted the buildlet. Lots of Go 1.6 SIGSEGVs, not surprising. On Go 1.8, one hang in crypto/ssh during netPipe, which does:
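(The original snippet isn't reproduced here; below is a minimal sketch of the pattern being discussed, assuming only the standard net package rather than the exact crypto/ssh test code.)

```go
package main

import (
	"fmt"
	"net"
)

// netPipe-style helper: listen on loopback, Dial the listener, then Accept
// the other end, returning a connected pair of net.Conns for a test.
func netPipe() (net.Conn, net.Conn, error) {
	listener, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, nil, err
	}
	defer listener.Close()

	// Dial the listener's address...
	c1, err := net.Dial("tcp", listener.Addr().String())
	if err != nil {
		return nil, nil, err
	}

	// ...then Accept the other side. The question raised below is whether
	// anything guarantees the Dial has completed before the Accept runs.
	c2, err := listener.Accept()
	if err != nil {
		c1.Close()
		return nil, nil, err
	}
	return c1, c2, nil
}

func main() {
	c1, c2, err := netPipe()
	if err != nil {
		fmt.Println("netPipe:", err)
		return
	}
	defer c1.Close()
	defer c2.Close()
	fmt.Println("connected:", c1.LocalAddr(), "<->", c2.RemoteAddr())
}
```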
That looks a little suspect. I don't know if it's guaranteed that the Dial will return before the Accept executes. I think there's a second instance of this happening right now, but that one has a 10-minute timeout. Still waiting for a hang like the one in the original report.
The buildlet is hung right now running 'go list', but the specific binary is Go 1.6.4, which doesn't have the time fixes. It looks to me like the buildlet is being used for the subrepo tests as well, including "subrepos on Go 1.6" (right now it is stuck on go list golang.org/x/blog/...). That's just not going to work on Sierra and we should avoid running those tests on the Sierra buildlet.
CL https://golang.org/cl/35643 mentions this issue.
@rsc, ugh, fun. The JSON at https://build.golang.org/?mode=json doesn't give me quite enough info to work with. I get:
Which is "master" for the "blog" repo, but the listed goRevision of aa1e69f is the Go 1.6 branch. I guess I need to query the git graph in the build system to figure out whether that revision has the Sierra time fix or not. |
Updates golang/go#18751

Change-Id: Iadd7dded079376a9bf9717ce8071604cee95d8ef
Reviewed-on: https://go-review.googlesource.com/35643
Reviewed-by: Kevin Burke <[email protected]>
Reviewed-by: Russ Cox <[email protected]>
Okay, the Sierra builder is only doing Go 1.7+ branches now, and isn't building subrepos. And the buildlet binary is verified to be built with Go 1.8. What's up with all the TLS timeout errors? See the third column at https://build.golang.org/, e.g. https://build.golang.org/log/e0a86210e3754182956958d6e8e76b5069c7788e
FWIW, the macOS firewall is off on that VM.
Maybe the kernel module is buggy even when the firewall is turned off? This looks like the usual bad firewall:
Each goroutine is waiting for the other (in the network stack).
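As an illustration only (not the failing test, and not asserting this is the exact mechanism at play here), this is the general shape of a hang where each side blocks in the network stack waiting on the other: both ends of a TCP connection write a large payload and neither reads, so once the kernel socket buffers fill, each Write blocks forever.

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		// Server side writes a large payload and never reads.
		c.Write(make([]byte, 1<<26))
		fmt.Println("server write finished") // not reached once buffers fill
	}()

	c, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	// Client side also writes a large payload and never reads: both Writes
	// block in the network stack, each waiting for the other side to drain.
	c.Write(make([]byte, 1<<26))
	fmt.Println("client write finished") // not reached: the program hangs
}
```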
Bumping to Go 1.10, but it'd be nice to understand why we can't run all.bash in a loop on Sierra without wedging the machine.
Is this still an issue?
Well, we worked around the issue by having the macOS Sierra VMs be single-use. After each build, the VM is nuked and re-cloned/booted from its base image. It adds about 60 seconds of wasted time, but it's probably the right thing to do from an isolation standpoint anyway. But I assume this issue remains: that we still can't run all.bash in a loop on Sierra.
Is it easy to try taking the nuke-the-VM step out? I found and fixed a big problem with Sierra stability vs all.bash a few weeks ago. After that I was able to run all.bash in a loop (4 loops in parallel in different directories, just to make things interesting) for 24 hours on a standard Sierra-image MacBook Pro. It would be interesting to know if the VMs are differently broken or had the same root cause.
…ersion

We used to do this only by necessity on Sierra, due to Sierra issues, but now that x/build/cmd/makemac is better, we can just do this all the time. Doing it always is better anyway, to guarantee fresh environments per build.

Also add the forgotten sudo in the Mac halt. The env/darwin/macstadium/image-setup-notes.txt even calls out how to set up password-free sudo on the images, but then I forgot to use it. It only worked before (if it did?) because the process ended and failed its heartbeat, at least some of the time. It's also possible it was never working.

The old reason that Sierra machines were special-cased to reboot was reportedly fixed anyway, in golang/go#18751 (comment).

Updates golang/go#9495

Change-Id: Iea21d7bc07467429cde79f4212c2b91458f8d8d8
Reviewed-on: https://go-review.googlesource.com/82355
Reviewed-by: Brad Fitzpatrick <[email protected]>
It's probably worth retrying this after all the work on #17490.
Posting here since #18541 (which mentions this test more directly) is closed with:
I just got this. Adding a data point here since it's not already mentioned in this issue.

Test Failure Output:
|
They seem fine lately. Closing.
Just got a failure on master. Reopening.
|
What makes you think this is the same bug? This bug was about something pretty specific. Not just "I saw a flaky test once".
@bradfitz isn't my error the exact same as #18751 (comment)?
That's not the point. Either because our tests sucked or macOS sucked (#18541 etc), a number of our network-intensive tests have been flaky on Macs. Independently (this bug), our macOS Sierra builders on VMware were spinning, which manifested in lots of weird errors, perhaps including the one you saw. Or not. They were related but different.

Because we could not diagnose either well, we eventually closed #18541 as likely being related to this one once we stopped seeing the errors (but note we also saw them on non-VMware Macs, like Russ's laptop), and then we closed this one when we upgraded our VMware cluster to a new version and also redid all the Mac VMs to be ephemeral, boot-per-build.

I really doubt your failure is related to the CPU spinning. If you have evidence that it is, let me know, but I think this is just a flaky test (#18541). I'd rather not re-use this bug.
I'll open a new bug if it reoccurs.
I finally (sorry) set up macOS Sierra VMs last week.
They keep hanging in infinite loops. I just attached to one of them to see what's going on:
The `go_bootstrap install -v std cmd` command is spinning. Using `dtrace -p <pid>`, I see an endless stream of:

The GOROOT_BOOTSTRAP is the Go 1.4 branch with the Sierra fixes. (It barely makes it a second into the build otherwise.)
The VMs are running under VMware ESXi, just like all the OS X 10.8, 10.10, and 10.11 builders, which are all fine.