net: dns lookup sometimes fails with "no such host" when there are too many open files #18588
Comments
I think the issue is in goLookupIPCNAMEOrder. What's happening is the AAAA query fails with "no such host" (because munic.io has no AAAA records), but the A query sometimes fails with "too many open files". Currently if there's more than one error, which one we return to the caller depends on a race. |
There's a related issue, #18183. I'm inclined to return a list of the effective errors when we cannot determine the cause of a Lookup API failure, so that we can survive the dual IP stack era. |
@mdempsky I didn't catch the reasoning; could you explain in more detail? |
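As a rough illustration of the "list of errors" idea, here is a minimal sketch that combines the errors from the two query lanes with `errors.Join` (available since Go 1.20, so not an option when this issue was filed). The sentinel errors and the `combineLaneErrors` and `hasError` helpers are hypothetical names invented for this example; they are not part of the `net` package.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels standing in for the two failures seen in this issue.
var (
	errNoSuchHost   = errors.New("no such host")
	errTooManyFiles = errors.New("too many open files")
)

// combineLaneErrors keeps both the A and AAAA errors instead of letting
// one overwrite the other. errors.Join drops nil arguments and returns
// nil when every argument is nil.
func combineLaneErrors(aErr, aaaaErr error) error {
	return errors.Join(aErr, aaaaErr)
}

// hasError reports whether target appears anywhere in err, including
// inside a joined error.
func hasError(err, target error) bool {
	return errors.Is(err, target)
}

func main() {
	err := combineLaneErrors(errTooManyFiles, errNoSuchHost)
	fmt.Println(hasError(err, errTooManyFiles))
	fmt.Println(hasError(err, errNoSuchHost))
}
```

With a joined error the caller can test for either cause, and the result does not depend on which goroutine reported first.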
We unfortunately didn't get to this, but if it's in 1.7 and 1.8, it's not a regression, so we'll look into addressing it for Go 1.10. |
I've looked into this a bit. Yes, the issue is in Normal case - (I have added some debugging lines inside the racer lane handler which prints the error)
Other case -
I think changing the logic to return the first error seen instead of the last one is a more elegant solution, as mentioned by @pmarks-net on a slightly different but related thread, #18183 (comment). Combining the errors is also a possibility, but as a user I would prefer to see one error rather than a list of errors. Overall, it seems to be a fairly simple fix. The real question is what the behavior should be: returning all errors, returning the first encountered error, or maybe something very different. Let me know what you guys think. |
According to So, "first" and "last" are meaningless when talking about A vs. AAAA; the only way I see to fix this case would be to define an ordering between errors, i.e. "too many open files" always replaces "no such host" when both are present. Out of curiosity, does enabling |
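The ordering idea above can be sketched as follows. This is only an illustration under stated assumptions: the sentinel errors, the `rank` function, and the `pickError` helper are hypothetical, and the precedence shown (anything other than "no such host" wins) is one possible choice, not what the `net` package does.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels standing in for the two failures seen in this issue.
var (
	errNoSuchHost   = errors.New("no such host")
	errTooManyFiles = errors.New("too many open files")
)

// rank assigns an assumed precedence to an error; higher values win.
// "no such host" ranks lowest so that any other failure replaces it.
func rank(err error) int {
	switch {
	case err == nil:
		return 0
	case errors.Is(err, errNoSuchHost):
		return 1
	default:
		return 2
	}
}

// pickError returns the highest-ranked error. Ties keep the earlier
// argument, but the two errors in this issue never tie, so the result
// does not depend on arrival order.
func pickError(errs ...error) error {
	var best error
	for _, err := range errs {
		if rank(err) > rank(best) {
			best = err
		}
	}
	return best
}

func main() {
	fmt.Println(pickError(errNoSuchHost, errTooManyFiles))
	fmt.Println(pickError(errTooManyFiles, errNoSuchHost))
}
```

Both calls in `main` select the same error, which is the property that "first" and "last" cannot provide when the goroutines race.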
"first" and "last" still hold meaning when you think of it as the first error returned, that is, from whichever goroutine returned first with an error. Defining a precedence order among errors might be overdoing it IMO. There are lots of errors and lots of edge cases where one error might rank above another. Maybe a simpler solution would work here?
I re-ran the examples with StrictErrors. No such difference. I am not super familiar with the code base, but when I print the index inside the Here is the error case, with added logging -
Mentioning the source code here for easy reference:
for i, fqdn := range conf.nameList(name) {
	for _, qtype := range qtypes {
		go func(qtype uint16) {
			cname, rrs, err := r.tryOneName(ctx, conf, fqdn, qtype)
			lane <- racer{cname, rrs, err}
		}(qtype)
	}
	hitStrictError := false
	for range qtypes {
		racer := <-lane
		if racer.error != nil {
			if nerr, ok := racer.error.(Error); ok && nerr.Temporary() && r.StrictErrors {
				fmt.Printf("[%d]: fqdn- %v, error- %v\n", i, fqdn, racer.error)
				// This error will abort the nameList loop.
				hitStrictError = true
				lastErr = racer.error
			} else if lastErr == nil || fqdn == name+"." {
				fmt.Printf("[%d]: fqdn- %v, error- %v\n", i, fqdn, racer.error)
				// Prefer error for original name.
				lastErr = racer.error
			}
			continue
		}
		addrs = append(addrs, addrRecordList(racer.rrs)...)
		if cname == "" {
			cname = racer.cname
		}
	}
	if hitStrictError {
		// If either family hit an error with StrictErrors enabled,
		// discard all addresses. This ensures that network flakiness
		// cannot turn a dualstack hostname IPv4/IPv6-only.
		addrs = nil
		break
	}
	if len(addrs) > 0 {
		break
	}
} |
When two goroutines are literally racing each other, "first" and "last" both yield nondeterministic behavior, so I think it would be difficult to argue that either option is actually better. At least, I have no opinion in the matter. |
A different question is why we get the 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files error in the first place. I hit this very quickly with 100 threads making outbound HTTP requests. It seems that the files opened for making UDP DNS requests are not being reused. This leak triggers the "too many open files" error once the user's open-file limit is hit. In my case this limit is set to 1024. With a maximum of 100 concurrent HTTP requests, I don't really see the need for more than 300 open files. I could raise the open-file limit, but that only increases the time to failure and does not eliminate it. |
I filed a new issue, #23866, for your concern. Let's focus on "error handling" here. |
I'm running into this as well on Go 1.12.5. I'd like to help if a decision is made on how to approach solving this. |
I think I just hit this issue on 1.14.5 as well:
I get this when running one of my load test apps for a few minutes with 2000 or so concurrent goroutines. |
Disable cgo and it’ll work, this is a limitation of the libc resolver. If you cannot rebuild the app without cgo, try GODEBUG=netdns=go |
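For programs that cannot change build flags or environment variables, the same preference can be expressed per resolver in code. The sketch below sets `PreferGo` on a `net.Resolver`, which the standard library documents as equivalent to `GODEBUG=netdns=go` scoped to that resolver. The `newGoResolver` helper, the 5-second timeout, and the `example.com` hostname are assumptions made for this illustration.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newGoResolver returns a resolver that prefers Go's built-in DNS client
// over the cgo (libc) resolver on platforms where the Go one is available.
func newGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

func main() {
	// Bound the lookup so a slow DNS server cannot hang the example.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := newGoResolver().LookupHost(ctx, "example.com")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(addrs)
}
```

Note that the original report in this issue was made against the pure Go resolver, so this setting changes which resolver runs but is not claimed here to fix the error-reporting race.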
@davecheney Thanks for that! I just tried
I'm wondering if this isn't a EDIT: Also tried recompiling with |
I experienced the same problem, and I did try to compile the code with |
5 years later, this is still an issue |
Change https://go.dev/cl/443255 mentions this issue: |
The discussion in #56192 seems to indicate that this is a |
I can believe that it's a glibc problem, although I just saw this same problem happen on a Mac. On the Mac we're also invoking C code to run the lookup (because DNS traffic is blocked for ordinary Go code), so maybe the Mac library has the same bug. |
What did you do?
My network-intensive code would sometimes fail to resolve domain names, returning "no such host" when using the default resolver (pure Go). The domains that fail to resolve are valid and are resolved successfully using dig against the same DNS server address.
I noticed that increasing the open-files limit fixes the issue.
I managed to reproduce the failure with: https://play.golang.org/p/MEZUS8h-o5
What did you expect to see?
Always "too many open files"
What did you see instead?
Sometimes Lookup returns "no such host" when it shouldn't.
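To show why the distinction matters to callers, here is a minimal sketch of how a program might react differently to the two failures. It can only help when the lookup reports the real cause; when "no such host" is returned in place of "too many open files", a check like this would misclassify the failure. The `classify`, `sampleNotFound`, and `sampleExhausted` helpers and the sample error values are hypothetical, and the substring match is an assumed fallback for errors whose cause is only available as text.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
	"syscall"
)

// classify sorts a lookup error into a coarse category.
func classify(err error) string {
	if err == nil {
		return "ok"
	}
	// Check descriptor exhaustion first: it is a local resource problem
	// and says nothing about whether the host exists.
	if errors.Is(err, syscall.EMFILE) || strings.Contains(err.Error(), "too many open files") {
		return "fd-exhausted"
	}
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) && dnsErr.IsNotFound {
		return "not-found"
	}
	return "other"
}

// sampleNotFound builds an illustrative "no such host" error.
func sampleNotFound() error {
	return &net.DNSError{Err: "no such host", Name: "example.invalid", IsNotFound: true}
}

// sampleExhausted builds an illustrative descriptor-exhaustion error.
func sampleExhausted() error {
	return &net.DNSError{Err: "dial udp 127.0.0.1:53: socket: too many open files", Name: "example.invalid"}
}

func main() {
	fmt.Println(classify(sampleNotFound()))
	fmt.Println(classify(sampleExhausted()))
}
```

A caller could retry or back off on `fd-exhausted` while treating `not-found` as final, which is only safe if the two are never confused.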
Does this issue reproduce with the latest release (go1.7.4)?
I first noticed the problem with 1.7.3. I reproduced the problem on 1.8beta2.
System details