
runtime: macOS Sierra builders spinning #18751

Closed · bradfitz opened this issue Jan 22, 2017 · 40 comments
Labels: FrozenDueToAge · help wanted · OS-Darwin · Testing (an issue that has been verified to require only test changes, not just a test failure)

Comments

@bradfitz (Contributor)

I finally (sorry) set up macOS Sierra VMs last week.

They keep hanging in infinite loops. I just attached to one of them to see what's going on:

[screenshot: 2017-01-22, 10:22 AM]

The go_bootstrap install -v std cmd command is spinning.

Using dtrace -p <pid>, I see an endless stream of:

select(0x0, 0x0, 0x0, 0x0, 0xC82003DF00) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0xC82003DF00) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0xC82003DF00) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0xC82003DF00) = 0 0
select(0x0, 0x0, 0x0, 0x0, 0xC82003DF00) = 0 0

The GOROOT_BOOTSTRAP is the Go 1.4 branch with the Sierra fixes (otherwise it barely makes it a second into the build).

The VMs are running under VMware ESXi, just like all the OS X 10.8, 10.10, and 10.11 builders, which are all fine.

@bradfitz (Contributor, Author)

(Other comment deleted. The SIGQUIT shown in the deleted post was the buildlet process, not the spinning one.)

@josharian (Contributor)

Probably unrelated, but just in case: yesterday the net/http tests started hanging on my laptop, and a reboot was required to get them passing again. The test that was hanging was TestTransportPersistConnReadLoopEOF. I'm afraid I don't have the stack traces anymore. Also probably unrelated: many of the net/http benchmarks fail on Sierra with 'too many open files'.

@bradfitz (Contributor, Author)

@josharian, this is just go install, so we're not even running tests yet. But I could imagine the highly-parallel net/http tests using lots of FDs. Please open a separate bug for that.

@aclements (Member)

@bradfitz, can you kill -QUIT the go_bootstrap process directly to get a traceback? Or, if that doesn't work, maybe send SIGQUIT to the whole make.bash process group (like Ctrl-\ at the terminal would).
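
For reference, a minimal Unix-only sketch of those two deliveries (the pids are placeholders, not from this issue): a negative pid addresses the whole process group, which is what Ctrl-\ does to the foreground group at a terminal.

package main

import "syscall"

func main() {
	const pid = 12345  // placeholder: the spinning go_bootstrap pid
	const pgid = 12300 // placeholder: make.bash's process group id
	_ = syscall.Kill(pid, syscall.SIGQUIT)   // SIGQUIT to one process
	_ = syscall.Kill(-pgid, syscall.SIGQUIT) // negative pid: the whole group
}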

@bradfitz (Contributor, Author)

I did kill -QUIT the go_bootstrap process directly to no avail, several times. It didn't appear to quit. Maybe that's telling in itself.

I then also tried a Ctrl-\ at the parent (the buildlet) and got a backtrace of only the buildlet (the backtrace paths were like /home/bradfitz instead of the expected /Users/gopher).

Trying again now.

@JayNakrani (Contributor) commented Jan 23, 2017

Possibly related: #17161. Infinite retry was added to solve that issue. From the second stacktrace, it looks like it may be going into that infinite retry loop.

@aclements (Member)

It didn't appear to quit. Maybe that's telling in itself.

That's too bad. A few other things come to mind: Does it freeze even if you run a trivial subcommand like, say, go_bootstrap version? Can you dtrace go_bootstrap from the start? Can this be bisected?

Possibly related: #17161. Infinite retry was added to solve that issue. From the second stacktrace, it looks like it may be going into that infinite retry loop.

I would expect to see Mach semaphore calls in the dtrace if this were the case, though maybe those are at the wrong level in the OS X kernel for dtrace? (I've always found this confusing.) The second traceback was just from the buildlet, so I don't think it's related. That's also exactly what I would expect a normal parked P's stack to look like.

@ianlancetaylor (Member)

Can you add a -x to the go_bootstrap command near the end of make.bash so that we can see where it is hanging?

In case it's not obvious, the select calls come from runtime.usleep; unfortunately we don't know the duration of the sleep, so we can't tell precisely which call is looping. It could just be the sysmon thread.
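
For illustration, a minimal darwin-only sketch of that select-as-sleep pattern (the 20ms duration is arbitrary, not inferred from the trace): runtime.usleep on darwin issues select(2) with no file descriptors and only a timeout, which is exactly the shape of the dtrace output above.

package main

import (
	"fmt"
	"syscall"
)

func main() {
	tv := syscall.Timeval{Usec: 20000} // 20ms; the real loop's duration is unknown
	// nfds=0 with nil fd sets: select has nothing to poll, so it just
	// waits out the timeout and returns 0 (the "= 0 0" in the dtrace log).
	if err := syscall.Select(0, nil, nil, nil, &tv); err == nil {
		fmt.Println("slept ~20ms via select(2)")
	}
}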

@quentinmit (Contributor)

@aclements dtrace only shows traditional BSD-style syscalls, not Mach calls.

@minux (Member) commented Jan 23, 2017 via email

@rsc (Contributor) commented Jan 23, 2017

The most likely explanation is that you are bootstrapping with a stale Go 1.4, without the Sierra time-fix patches.

@bradfitz (Contributor, Author)

@rsc, I addressed that in the top comment. That was a problem when I first set up the builder (cloned from the 10.11 image), but it revealed itself very quickly.

But I just double-checked anyway.

I logged into a 10.12 VM, extracted https://storage.googleapis.com/golang/go1.4-bootstrap-20161024.tar.gz (linked from https://golang.org/doc/install/source) into a new directory, and diffed the VM's baked-in $HOME/go1.4 against the freshly extracted dir. They're the same.

So the Sierra machine's $HOME/go1.4 is a modern one.

Unless https://storage.googleapis.com/golang/go1.4-bootstrap-20161024.tar.gz is bogus, but people have been successfully using it.

@rsc (Contributor) commented Jan 23, 2017

@josharian your hang is #18541. I think that's not the issue here (no network involved).

@rsc (Contributor) commented Jan 23, 2017

I'm a little concerned about the go1.4-bootstrap tar not having e39f2da (fix unsigned shift "warnings"), but I just tried flipping my go1.4 to the CL from the bootstrap tar and it works OK. I still think this seems like the time problem, but clearly not that exact time problem (sorry for missing that in the original post). Maybe something is different when running under VMware.

@bradfitz can you send me instructions to connect to the VMs off-list?

@bradfitz (Contributor, Author)

@bradfitz can you send me instructions to connect to the VMs off-list?

Done.

@rsc (Contributor) commented Jan 23, 2017

OK, I'm running make.bash in a loop waiting for a hang to poke at. In the meantime, I've remembered #18540, which may be the same thing: an unexplained hang waiting for a child.

@rsc (Contributor) commented Jan 23, 2017

I have run make.bash successfully 61 times in a row on that VM.

I also restarted the buildlet. Lots of Go 1.6 SIGSEGVs, not surprising. On Go 1.8, one hang in crypto/ssh during netPipe, which does:

func netPipe() (net.Conn, net.Conn, error) {
	listener, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, nil, err
	}
	defer listener.Close()
	c1, err := net.Dial("tcp", listener.Addr().String())
	if err != nil {
		return nil, nil, err
	}

	c2, err := listener.Accept()
	if err != nil {
		c1.Close()
		return nil, nil, err
	}

	return c1, c2, nil
}

That looks a little suspect. I don't know if it's guaranteed that the Dial will return before the Accept executes. I think there's a second instance of this happening right now, but it's a 10-minute timeout.
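
For comparison, a hedged sketch (an illustration, not the crypto/ssh fix) of a netPipe variant that dials and accepts concurrently, so the test never depends on Dial returning before Accept runs:

package main

import (
	"fmt"
	"net"
)

func netPipe() (net.Conn, net.Conn, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return nil, nil, err
	}
	defer ln.Close()

	type dialResult struct {
		c   net.Conn
		err error
	}
	ch := make(chan dialResult, 1)
	go func() {
		c, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			ln.Close() // unblock Accept if the dial failed
		}
		ch <- dialResult{c, err}
	}()

	c2, err := ln.Accept()
	d := <-ch
	if d.err != nil || err != nil {
		if d.c != nil {
			d.c.Close()
		}
		if c2 != nil {
			c2.Close()
		}
		if d.err != nil {
			err = d.err
		}
		return nil, nil, err
	}
	return d.c, c2, nil
}

func main() {
	c1, c2, err := netPipe()
	if err != nil {
		fmt.Println("netPipe:", err)
		return
	}
	fmt.Println("pipe established")
	c1.Close()
	c2.Close()
}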

Still waiting for a hang like in the original report.

@rsc (Contributor) commented Jan 24, 2017

The buildlet is hung right now running 'go list', but the specific binary is Go 1.6.4, which doesn't have the time fixes.

It looks to me like the buildlet is being used for the subrepo tests as well, including "subrepos on Go 1.6" (right now it is stuck on go list golang.org/x/blog/...). That's just not going to work on Sierra and we should avoid running those tests on the Sierra buildlet.

@gopherbot (Contributor)

CL https://golang.org/cl/35643 mentions this issue.

@bradfitz (Contributor, Author)

@rsc, ugh, fun. The JSON at https://build.golang.org/?mode=json doesn't give me quite enough info to work with. I get:

		{
			"repo": "blog",
			"revision": "59665c2db910d030fff2c1197aa31438eedd296e",
			"parentRevisions": [
				"165bdf8fabc3238454c4083e9d32e156a3440a5d"
			],
			"goRevision": "aa1e69f3fc21795b6fab531a07008e0744ffe5bf",
			"date": "2016-12-13T21:33:03Z",
			"branch": "master",

Which is "master" for the "blog" repo, but the listed goRevision of aa1e69f is the Go 1.6 branch.

I guess I need to query the git graph in the build system to figure out whether that revision has the Sierra time fix or not.
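
For what it's worth, a hedged sketch of that git-graph query (the fix hash and the checkout path are placeholders): `git merge-base --is-ancestor A B` exits 0 exactly when A is an ancestor of B, so it can answer whether a goRevision contains the Sierra time fix.

package main

import (
	"fmt"
	"os/exec"
)

// hasFix reports whether fixCommit is an ancestor of rev in the git
// repository at dir. git merge-base --is-ancestor exits 0 iff the first
// commit is an ancestor of the second, so Run returning nil means rev
// already contains the fix.
func hasFix(dir, fixCommit, rev string) bool {
	cmd := exec.Command("git", "-C", dir, "merge-base", "--is-ancestor", fixCommit, rev)
	return cmd.Run() == nil
}

func main() {
	// Placeholders: a Go checkout path, the Sierra fix's commit hash,
	// and the goRevision reported by the dashboard JSON above.
	fmt.Println(hasFix("/path/to/go", "0123abcd", "aa1e69f"))
}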

gopherbot pushed a commit to golang/build that referenced this issue Jan 25, 2017
Updates golang/go#18751

Change-Id: Iadd7dded079376a9bf9717ce8071604cee95d8ef
Reviewed-on: https://go-review.googlesource.com/35643
Reviewed-by: Kevin Burke <[email protected]>
Reviewed-by: Russ Cox <[email protected]>
@bradfitz (Contributor, Author)

Okay, the Sierra builder is only doing Go 1.7+ branches now, and isn't building subrepos.

And the buildlet binary is verified to be built with Go 1.8.

What's up with all the TLS timeout errors? See the third column at https://build.golang.org/

It's quite red:
[screenshot: 2017-01-26, 2:45 PM]

e.g. https://build.golang.org/log/e0a86210e3754182956958d6e8e76b5069c7788e

@bradfitz (Contributor, Author)

FWIW, the macOS firewall is off on that VM.

@rsc (Contributor) commented Jan 27, 2017

Maybe the kernel module is buggy even when the firewall is turned off? This looks like the usual bad firewall:

goroutine 186 [IO wait, 2 minutes]:
net.runtime_pollWait(0x17af590, 0x77, 0xc420051301)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc4200caa08, 0x77, 0x0, 0x0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitWrite(0xc4200caa08, 0xc42007ae38, 0x14cd0c0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/fd_poll_runtime.go:84 +0x34
net.(*netFD).connect(0xc4200ca9a0, 0x14cd0c0, 0xc42007ae38, 0x0, 0x0, 0x14c8900, 0xc4203cea80, 0x0, 0x0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/fd_unix.go:156 +0x24e
net.(*netFD).dial(0xc4200ca9a0, 0x14cd0c0, 0xc42007ae38, 0x14ce480, 0x0, 0x14ce480, 0xc420010e40, 0xc4202d8548, 0x11a3fce)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/sock_posix.go:140 +0xe9
net.socket(0x14cd0c0, 0xc42007ae38, 0x1366525, 0x3, 0x2, 0x1, 0x0, 0x0, 0x14ce480, 0x0, ...)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/sock_posix.go:92 +0x1a1
net.internetSocket(0x14cd0c0, 0xc42007ae38, 0x1366525, 0x3, 0x14ce480, 0x0, 0x14ce480, 0xc420010e40, 0x1, 0x0, ...)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/ipsock_posix.go:154 +0x129
net.doDialTCP(0x14cd0c0, 0xc42007ae38, 0x1366525, 0x3, 0x0, 0xc420010e40, 0x1517fd0, 0xc4202d8748, 0x1517fd0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/tcpsock_posix.go:62 +0xb9
net.dialTCP(0x14cd0c0, 0xc42007ae38, 0x1366525, 0x3, 0x0, 0xc420010e40, 0xc42007ae38, 0x40, 0x2)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/tcpsock_posix.go:58 +0xe4
net.dialSingle(0x14cd0c0, 0xc42007ae38, 0xc42001a080, 0x14cac80, 0xc420010e40, 0x0, 0x0, 0x0, 0x0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:521 +0x479
net.dialSerial(0x14cd0c0, 0xc42007ae38, 0xc42001a080, 0xc420015330, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:489 +0x220
net.(*Dialer).DialContext(0xc420051b88, 0x14cd0c0, 0xc42007ae38, 0x1366525, 0x3, 0xc420015310, 0xf, 0x0, 0x0, 0x0, ...)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:371 +0x70b
net.(*Dialer).Dial(0xc4202d8b88, 0x1366525, 0x3, 0xc420015310, 0xf, 0xc4202d8b48, 0x0, 0x0, 0xc4202d8b00)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/dial.go:294 +0x75
crypto/tls.DialWithDialer(0xc4202d8b88, 0x1366525, 0x3, 0xc420015310, 0xf, 0xc4200d8180, 0x11a220a, 0xc420015300, 0x9)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/crypto/tls/tls.go:120 +0xb0
crypto/tls.Dial(0x1366525, 0x3, 0xc420015310, 0xf, 0xc4200d8180, 0xc4200785a0, 0xc4202ec000, 0x0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/crypto/tls/tls.go:170 +0xa6
crypto/tls.TestTLSUniqueMatches(0xc4200821a0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/crypto/tls/tls_test.go:313 +0x2be
testing.tRunner(0xc4200821a0, 0x137fa10)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:657 +0x96
created by testing.(*T).Run
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/testing/testing.go:697 +0x2ca

goroutine 187 [IO wait, 2 minutes]:
net.runtime_pollWait(0x17af650, 0x72, 0x0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/runtime/netpoll.go:164 +0x59
net.(*pollDesc).wait(0xc4200ca998, 0x72, 0x0, 0xc4204e4000)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/fd_poll_runtime.go:75 +0x38
net.(*pollDesc).waitRead(0xc4200ca998, 0xffffffffffffffff, 0x0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/fd_poll_runtime.go:80 +0x34
net.(*netFD).accept(0xc4200ca930, 0x0, 0x14c8700, 0xc4204e4000)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/fd_unix.go:430 +0x1e5
net.(*TCPListener).accept(0xc420156008, 0xc400000076, 0x1, 0xc42047de08)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/tcpsock_posix.go:136 +0x2e
net.(*TCPListener).Accept(0xc420156008, 0x17, 0xc4203ce9a0, 0x17, 0xc4203ce9a0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/net/tcpsock.go:228 +0x49
crypto/tls.TestTLSUniqueMatches.func1(0x14cc940, 0xc420156008, 0xc4200821a0, 0xc4200785a0)
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/crypto/tls/tls_test.go:296 +0x60
created by crypto/tls.TestTLSUniqueMatches
	/var/folders/dx/k53rs1s93538b4x20g46cj_w0000gn/T/workdir/go/src/crypto/tls/tls_test.go:309 +0xf7

Each goroutine is waiting for the other (in the network stack).

@bradfitz (Contributor, Author)

Bumping to Go 1.10, but it'd be nice to understand why we can't run all.bash in a loop on Sierra without wedging the machine.

@aclements (Member)

Is this still an issue?

@bradfitz (Contributor, Author) commented Nov 8, 2017

Well, we worked around the issue by making the macOS Sierra VMs single-use. After each build, the VM is nuked and re-cloned/booted from its base image. That adds about 60 seconds of wasted time per build, but it's probably the right thing to do from an isolation standpoint anyway. But I assume the underlying issue remains: we can't run all.bash in a loop on Sierra.

@rsc (Contributor) commented Nov 9, 2017

Is it easy to try taking the nuke-the-VM step out? I found and fixed a big problem with Sierra stability vs all.bash a few weeks ago. After that I was able to run all.bash in a loop (4 loops in parallel in different directories, just to make things interesting) for 24 hours on a standard Sierra-image MacBook Pro. It would be interesting to know if the VMs are differently broken or had the same root cause.

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017
gopherbot pushed a commit to golang/build that referenced this issue Dec 7, 2017
…ersion

We used to do this only by necessity on Sierra, due to Sierra issues,
but now that x/build/cmd/makemac is better, we can just do this all
the time. Doing it always is better anyway, to guarantee fresh
environments per build.

Also add the forgotten sudo in the Mac halt. The
env/darwin/macstadium/image-setup-notes.txt even calls out how to
set up password-free sudo on the images, but then I forgot to use it.
It only worked before (if it did?) because the process ended and
failed its heartbeat, at least some of the time. It's also possible it
was never working. The old reason that Sierra machines were
special-cased to reboot was reportedly fixed anyway, in
golang/go#18751 (comment).

Updates golang/go#9495

Change-Id: Iea21d7bc07467429cde79f4212c2b91458f8d8d8
Reviewed-on: https://go-review.googlesource.com/82355
Reviewed-by: Brad Fitzpatrick <[email protected]>
@ianlancetaylor (Member)

It's probably worth retrying this after all the work on #17490.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jul 10, 2018
@dmitshur (Contributor) commented Nov 2, 2018

Posting here since #18541 (which mentions this test more directly) is closed with:

Closing. #18751 tracks the more general Sierra weirdnesses.

I just got this net/http TestTransportPersistConnLeak test failure on my work Mac (10.13.6) while running all.bash on CL 146860. Judging from issue search results, and the fact that it passed on a second run, it seems like a flaky test.

Adding a data point here since it's not already mentioned in this issue.

Test Failure Output
--- FAIL: TestTransportPersistConnLeak (0.00s)
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58386->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58389->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58391->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58390->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58388->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58395->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58394->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58396->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58393->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58392->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58405->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58410->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58404->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58407->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58401->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58387->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58402->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58399->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58400->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58398->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58403->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58408->127.0.0.1:58385: read: connection reset by peer
    transport_test.go:1413: client fetch error: Get http://127.0.0.1:58385: read tcp 127.0.0.1:58409->127.0.0.1:58385: read: connection reset by peer
panic: Fail in goroutine after TestTransportPersistConnLeak has completed

goroutine 2061 [running]:
testing.(*common).Fail(0xc000376600)
	/Users/dmitshur/gotip/src/testing/testing.go:553 +0x128
testing.(*common).Errorf(0xc000376600, 0x1511423, 0x16, 0xc0003c1f98, 0x1, 0x1)
	/Users/dmitshur/gotip/src/testing/testing.go:645 +0x83
net/http_test.TestTransportPersistConnLeak.func2(0xc000285b00, 0xc00051cc60, 0xc000319e00, 0xc000376600, 0xc000319e80)
	/Users/dmitshur/gotip/src/net/http/transport_test.go:1413 +0xec
created by net/http_test.TestTransportPersistConnLeak
	/Users/dmitshur/gotip/src/net/http/transport_test.go:1409 +0x226
panic: Fail in goroutine after TestTransportPersistConnLeak has completed

goroutine 2070 [running]:
testing.(*common).Fail(0xc000376600)
	/Users/dmitshur/gotip/src/testing/testing.go:553 +0x128
testing.(*common).Errorf(0xc000376600, 0x1511423, 0x16, 0xc000579f98, 0x1, 0x1)
	/Users/dmitshur/gotip/src/testing/testing.go:645 +0x83
net/http_test.TestTransportPersistConnLeak.func2(0xc000285b00, 0xc00051cc60, 0xc000319e00, 0xc000376600, 0xc000319e80)
	/Users/dmitshur/gotip/src/net/http/transport_test.go:1413 +0xec
created by net/http_test.TestTransportPersistConnLeak
	/Users/dmitshur/gotip/src/net/http/transport_test.go:1409 +0x226
FAIL	net/http	0.589s

@andybons andybons modified the milestones: Go1.12, Go1.13 Feb 12, 2019
@bradfitz (Contributor, Author)

They seem fine lately. Closing.

@FiloSottile (Contributor)

Just got a failure on master. Reopening.

--- FAIL: TestTransportPersistConnLeak (0.01s)
    transport_test.go:1572: client fetch error: Get http://127.0.0.1:55923: read tcp 127.0.0.1:55924->127.0.0.1:55923: read: connection reset by peer
    transport_test.go:1572: client fetch error: Get http://127.0.0.1:55923: write tcp 127.0.0.1:55947->127.0.0.1:55923: write: broken pipe
    transport_test.go:1572: client fetch error: Get http://127.0.0.1:55923: read tcp 127.0.0.1:55944->127.0.0.1:55923: read: connection reset by peer
    transport_test.go:1572: client fetch error: Get http://127.0.0.1:55923: read tcp 127.0.0.1:55938->127.0.0.1:55923: read: connection reset by peer
    transport_test.go:1572: client fetch error: Get http://127.0.0.1:55923: read tcp 127.0.0.1:55939->127.0.0.1:55923: read: connection reset by peer
    transport_test.go:1572: client fetch error: Get http://127.0.0.1:55923: read tcp 127.0.0.1:55940->127.0.0.1:55923: read: connection reset by peer
    transport_test.go:1572: client fetch error: Get http://127.0.0.1:55923: read tcp 127.0.0.1:55943->127.0.0.1:55923: read: connection reset by peer
    transport_test.go:1572: client fetch error: Get http://127.0.0.1:55923: write tcp 127.0.0.1:55942->127.0.0.1:55923: write: broken pipe
    transport_test.go:1572: client fetch error: Get http://127.0.0.1:55923: write tcp 127.0.0.1:55946->127.0.0.1:55923: write: broken pipe
    transport_test.go:1572: client fetch error: Get http://127.0.0.1:55923: read tcp 127.0.0.1:55935->127.0.0.1:55923: read: connection reset by peer
    transport_test.go:1572: client fetch error: Get http://127.0.0.1:55923: read tcp 127.0.0.1:55932->127.0.0.1:55923: read: connection reset by peer
    transport_test.go:1572: client fetch error: Get http://127.0.0.1:55923: read tcp 127.0.0.1:55930->127.0.0.1:55923: read: connection reset by peer
    transport_test.go:1572: client fetch error: Get http://127.0.0.1:55923: read tcp 127.0.0.1:55936->127.0.0.1:55923: read: connection reset by peer
    transport_test.go:1572: client fetch error: Get http://127.0.0.1:55923: read tcp 127.0.0.1:55925->127.0.0.1:55923: read: connection reset by peer
panic: Fail in goroutine after TestTransportPersistConnLeak has completed

goroutine 2210 [running]:
testing.(*common).Fail(0xc000620900)
	/Users/valsorda/go/src/testing/testing.go:612 +0x11e
testing.(*common).Errorf(0xc000620900, 0x15705c7, 0x16, 0xc000367f98, 0x1, 0x1)
	/Users/valsorda/go/src/testing/testing.go:707 +0x90
net/http_test.TestTransportPersistConnLeak.func2(0xc000449770, 0xc000501260, 0xc000337c00, 0xc000620900, 0xc000337c80)
	/Users/valsorda/go/src/net/http/transport_test.go:1572 +0xec
created by net/http_test.TestTransportPersistConnLeak
	/Users/valsorda/go/src/net/http/transport_test.go:1568 +0x222
panic: Fail in goroutine after TestTransportPersistConnLeak has completed

goroutine 2206 [running]:
testing.(*common).Fail(0xc000620900)
	/Users/valsorda/go/src/testing/testing.go:612 +0x11e
testing.(*common).Errorf(0xc000620900, 0x15705c7, 0x16, 0xc000667f98, 0x1, 0x1)
	/Users/valsorda/go/src/testing/testing.go:707 +0x90
net/http_test.TestTransportPersistConnLeak.func2(0xc000449770, 0xc000501260, 0xc000337c00, 0xc000620900, 0xc000337c80)
	/Users/valsorda/go/src/net/http/transport_test.go:1572 +0xec
created by net/http_test.TestTransportPersistConnLeak
	/Users/valsorda/go/src/net/http/transport_test.go:1568 +0x222
FAIL	net/http	0.921s

@FiloSottile FiloSottile reopened this May 2, 2019
@FiloSottile FiloSottile added the Testing An issue that has been verified to require only test changes, not just a test failure. label May 2, 2019
@bradfitz (Contributor, Author) commented May 2, 2019

What makes you think this is the same bug? This bug was about something pretty specific. Not just "I saw a flaky test once".

@FiloSottile (Contributor)

@bradfitz isn't my error the exact same as #18751 (comment)?

@bradfitz (Contributor, Author) commented May 2, 2019

That's not the point. Either because our tests sucked or macOS sucked (#18541 etc), a number of our network-intensive tests have been flaky on Macs.

Independently (this bug), our macOS Sierra builders on VMware were spinning, manifesting in lots of weird errors, perhaps including the one you saw. Or not.

They were related but different. Because we could not diagnose either well, we eventually closed #18541 as likely being related to this one once we stopped seeing the errors (but note we also saw them on non-VMware Macs, like Russ's laptop), and then we closed this one when we upgraded our VMware cluster to a new version and also redid all the Mac VMs to be ephemeral, boot-per-build.

I really doubt this bug is related to the CPU spinning. If you have evidence that it is, let me know, but I think this is just a flaky test. (#18541)

I'd rather not re-use this bug.

@FiloSottile (Contributor)

I'll open a new bug if it reoccurs.

@golang golang locked and limited conversation to collaborators May 1, 2020