net/http: flaky tests on macOS Sierra #18541
Comments
Still trying to check if I got the de-wedging of the TestTransportPersistConnLeak failure right, but now I'm getting this one repeatedly from TestTransportPersistConnReadLoopEOF instead. Looks like TestTransportPersistConnReadLoopEOF calls getConn, which kicks off a goroutine to dial, which is blocked in waitWrite on the connect. Meanwhile, the listener is blocked in waitRead on the accept. I assume those are using the same address, in which case you'd think they would complete each other.
Re last comment, my system got into a state where no connections to localhost worked at all, even after stopping the Apple firewall completely. Even 'telnet 127.0.0.1' hung (instead of being rejected). Rebooting cleared that. Now I just see lots of the 'connection reset by peer' in various tests. Still gathering data. |
After I went on my parallelization spree (which dropped short tests from ~30s to ~3s), I found I missed some global state in a few and servers & clients were being shut down by unrelated tests. I thought I'd fixed them all, but this might be another case. I'll test on my Sierra desktop here. |
This is typical, 18/100 failures of go test -short net/http. Always at least ~10%. If I turn off the Mac firewall, the failures disappear: 0/100. So the Mac firewall is breaking things. Question is what we should do. Sometimes the read failure comes back from t.getConn (before we've even sent a request), other times from pconn.roundTrip (trying to read response). In general if we are issuing a GET and we get a read failure, does it make sense to retry even once inside net/http instead of handing that off to all users? I thought I had seen some code to do that at one point. At least I understand the problem, and that it's not net/http. No longer for Go 1.8. /cc @bradfitz
Well, that explains why I couldn't reproduce. I wasn't using the firewall. The http package does retry idempotent requests, but it makes the assumption that requests' conn reads/writes can only mysteriously break on the 2nd and subsequent requests on a conn, but not the first. That base case also has the nice side effect of preventing endless retry loops, since eventually a request will require a new conn (after exhausting the idle conn pool), and if that new conn fails, perhaps there was something just bad about the request or network. |
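To make the retry rule above concrete, here is a simplified model of it. This is a sketch of the rule as described in this comment, not the actual net/http implementation; `shouldRetry`, `roundTrip`, and `errReset` are hypothetical names, and the "transport" is simulated with a counter of idle connections that all fail.

```go
package main

import (
	"errors"
	"fmt"
)

// errReset stands in for "read: connection reset by peer" (assumption: any
// mysterious conn read/write failure is treated the same way in this model).
var errReset = errors.New("read: connection reset by peer")

// isIdempotent reports whether a method is safe to replay.
func isIdempotent(method string) bool {
	switch method {
	case "GET", "HEAD", "OPTIONS", "TRACE":
		return true
	}
	return false
}

// shouldRetry models the rule: retry only idempotent requests, and only when
// the failure happened on a reused (idle-pool) connection, never a fresh one.
func shouldRetry(method string, connReused bool, err error) bool {
	return err != nil && connReused && isIdempotent(method)
}

// roundTrip simulates a transport holding `idle` pooled connections that all
// fail with errReset; once the pool is empty, a fresh connection is used and
// yields freshErr. It returns the number of attempts and the final error.
func roundTrip(method string, idle int, freshErr error) (attempts int, err error) {
	for {
		attempts++
		reused := idle > 0
		if reused {
			idle--
			err = errReset
		} else {
			err = freshErr
		}
		if !shouldRetry(method, reused, err) {
			return attempts, err
		}
	}
}

func main() {
	// Two stale idle conns, then a healthy fresh one: retried until success.
	fmt.Println(roundTrip("GET", 2, nil))
	// Fresh conn fails on its first request: no retry, error goes to the caller.
	fmt.Println(roundTrip("GET", 0, errReset))
}
```

The second call is the case this issue hits: the reset arrives on the first request of a new connection, so the model returns the error after a single attempt. The loop also always terminates, because each retry consumes one idle connection and a fresh connection is never retried.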
As mentioned in #18751, I've seen this one (and only once). I don't know what the OS X firewall is or whether I have it enabled, but I do run Little Snitch. On the off chance that it is relevant, Little Snitch was (I suspect) the source of net test failures ages ago because it was returning EAGAIN in places where package net did not anticipate it. Those were fixed. |
I see this on about 50% of test runs on a Google-managed laptop running 10.12.2. A sample:
Similar to @rsc's description above, if I turn off the firewall it starts always passing. |
@crawshaw What version of Go was that? |
@crawshaw Hi, There are two kinds as mentioned here: it might be not related to these, but I'd like to debug in the same environment. |
I am on macOS 10.12.2, |
@crawshaw Thanks. Edit: Ah if you said means
If it means "corp-managed software was installed on the laptop after the OS install", that could be a cause. |
So this happened during
The next package would be Interestingly, this is with firewall off (and macOS 10.12.3 (16D32)). It's also a personal laptop, so there's no corp-managed software here. |
@shurcooL did your system come out of that state (no localhost traffic), or did you have to reboot? |
@rsc I had to reboot. I waited a few minutes after ^C'ing the After rebooting, I've run Since then, I've turned on firewall again and ran |
OK, so the connection resets seem at least related to the presence of a network filter kext that Google has loaded onto our laptops. It hasn't changed in a few years, though, so this could still be a bad Sierra interaction, but one that only happens when using the kernel sflt_register functionality. From @shurcooL's experience it sounds like the localhost network stack wedge may be a distinct problem (obviously not caused by the Google kext), but maybe still introduced in Sierra. |
I just hit another localhost network stack wedge running package net's tests. (Is that this bug or #18751?) No Google kext here, but yes com.apple.nke.applicationfirewall. In case it helps, here are dtruss output and stack traces for the hang. I'll leave my laptop in this broken state overnight in case there's any further info I should grab before rebooting. |
@josharian Yes, that's the usual hang - the stack traces match what I've seen. It looks like what we really need is kernel stacks but I haven't yet figured out how to do that. Apple has things kind of locked down now it seems. |
This hasn't recurred (that I've noticed) since I updated the macOS Sierra builders to destroy & recreate their VMware machines after each build. |
Closing. #18751 tracks the more general Sierra weirdnesses. |
If I run:
Then I have yet to get 100 passes in a row. Usually I don't even get 10 in a row. There are two failure modes right now.
The first is:
and the second is for TestTransportPersistConnLeak to hang and cause the process to be killed by the 10 minute watchdog. I suspect that's the same problem, but the test is not written to cope correctly with failures from Get. Working on that.
This issue is about why we get "read: connection reset by peer" so consistently on macOS Sierra. Maybe it's a flaky test but it seems like maybe more.
/cc @bradfitz