
net: dns lookup sometimes fails with "no such host" when there are too many open files #18588

Open
josselin-c opened this issue Jan 10, 2017 · 24 comments
Labels
help wanted, NeedsFix

Comments

@josselin-c
Contributor

What did you do?

My network-intensive code sometimes fails to resolve domain names, returning "no such host" with the default (pure Go) resolver. The domains that fail to resolve are valid domains that resolve successfully with dig against the same DNS server address.

I noticed that increasing the open files limit fixes the issue.

I managed to reproduce the failure with: https://play.golang.org/p/MEZUS8h-o5

$ go build test.go
$ ulimit -n 20   # set the open file limit to something low
$ ./test
2017/01/10 11:01:11 lookup munic.io on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
$ ./test
2017/01/10 11:01:11 lookup munic.io on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
$ ./test
2017/01/10 12:01:12 lookup munic.io on 127.0.1.1:53: no such host
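
For reference, here is a minimal reproduction in the spirit of the playground link above (a sketch only; the exact test.go may differ). It fires many lookups concurrently so that, under a low open-file limit, some dials fail with EMFILE:

package main

import (
	"log"
	"net"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// With ulimit -n 20, most lookups fail with
			// "socket: too many open files", but occasionally
			// one fails with "no such host" instead.
			if _, err := net.LookupHost("munic.io"); err != nil {
				log.Println(err)
			}
		}()
	}
	wg.Wait()
}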

What did you expect to see?

Always "too many open files"

What did you see instead?

Sometimes Lookup returns "no such host" when it shouldn't.

Does this issue reproduce with the latest release (go1.7.4)?

I first noticed the problem with 1.7.3. I reproduced the problem on 1.8beta2.

System details

go version go1.8beta2 linux/amd64
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/costanzj/projects/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build335463609=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
PKG_CONFIG="pkg-config"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
GOROOT/bin/go version: go version go1.8beta2 linux/amd64
GOROOT/bin/go tool compile -V: compile version go1.8beta2 X:framepointer
uname -sr: Linux 4.4.0-38-generic
Distributor ID:	Ubuntu
Description:	Ubuntu 16.04.1 LTS
Release:	16.04
Codename:	xenial
/lib/x86_64-linux-gnu/libc.so.6: GNU C Library (Ubuntu GLIBC 2.23-0ubuntu5) stable release version 2.23, by Roland McGrath et al.
gdb --version: GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
@josharian josharian changed the title dns lookup sometimes fails with "no such host" net: dns lookup sometimes fails with "no such host" when there are too many open files Jan 10, 2017
@josharian
Contributor

cc @mikioh @mdempsky

@mdempsky
Contributor

I think the issue is in goLookupIPCNAMEOrder. What's happening is the AAAA query fails with "no such host" (because munic.io has no AAAA records), but the A query sometimes fails with "too many open files". Currently if there's more than one error, which one we return to the caller depends on a race.
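
To illustrate the shape of the race (a standalone sketch, not the actual net package code): the A and AAAA queries write their results to a shared channel, and if the receiver simply overwrites lastErr, whichever error arrives last is the one the caller sees.

package main

import (
	"errors"
	"fmt"
)

func main() {
	errNoHost := errors.New("no such host")              // AAAA: name has no AAAA records
	errTooManyFiles := errors.New("too many open files") // A: dial failed on EMFILE

	lane := make(chan error, 2)
	go func() { lane <- errNoHost }()
	go func() { lane <- errTooManyFiles }()

	var lastErr error
	for i := 0; i < 2; i++ {
		if err := <-lane; err != nil {
			lastErr = err // last writer wins, regardless of which query it came from
		}
	}
	fmt.Println(lastErr) // nondeterministic: prints either error
}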

@mikioh mikioh added this to the Go1.9 milestone Jan 11, 2017
@mikioh
Contributor

mikioh commented Jan 11, 2017

There's a related issue, #18183. I'm inclined to return a list of the effective errors when we cannot determine the cause of a Lookup API failure, so we can survive the dual IP stack era.

@shuting-yst

I think the issue is in goLookupIPCNAMEOrder. What's happening is the AAAA query fails with "no such host" (because munic.io has no AAAA records), but the A query sometimes fails with "too many open files". Currently if there's more than one error, which one we return to the caller depends on a race.

@mdempsky I didn't quite catch the reason; could you explain it in more detail?

@bradfitz bradfitz added the NeedsFix The path to resolution is known, but the work has not been done. label Jun 8, 2017
@bradfitz bradfitz modified the milestones: Go1.10, Go1.9 Jun 8, 2017
@bradfitz
Contributor

bradfitz commented Jun 8, 2017

We unfortunately didn't get to this, but if it's in 1.7 and 1.8, it's not a regression, so we'll look into addressing it for Go 1.10.

@agnivade
Contributor

I've looked into this a bit.

Yes, the issue is in goLookupIPCNAMEOrder. To be specific, it's in the racer goroutine calls. I did some debugging, and what is happening is that it always returns the last error. Usually there's only one error, "socket: too many open files", but sometimes "no such host" comes up too. And when it does, "no such host" is always the last error.

Normal case (I added some debugging lines inside the racer lane handler that print the error):

lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
2017/08/30 07:00:45 lookup munic.io on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files


lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
2017/08/30 07:00:45 lookup munic.io on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files

Other case:

lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: no such host
2017/08/30 12:30:46 lookup munic.io on 127.0.1.1:53: no such host

lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
lookup munic.io. on 127.0.1.1:53: no such host
lookup munic.io. on 127.0.1.1:53: no such host
lookup munic.io. on 127.0.1.1:53: no such host
lookup munic.io. on 127.0.1.1:53: no such host
lookup munic.io. on 127.0.1.1:53: no such host
lookup munic.io. on 127.0.1.1:53: no such host
2017/08/30 12:46:49 lookup munic.io on 127.0.1.1:53: no such host

I think changing the logic to return the first-seen error instead of the last one is a more elegant solution, as mentioned by @pmarks-net on a slightly different but related thread, #18183 (comment).

Combining the errors is also a possibility, but as a user I would prefer to see one error rather than a list of errors.

Overall, it seems to be a fairly simple fix. The real question is what it should be: returning all errors, returning the first error encountered, or maybe something different entirely. Let me know what you think.

@pmarks-net
Contributor

According to goLookupIPCNAMEOrder, no ordering exists between the A and AAAA queries (until the actual addresses are sorted by sortByRFC6724).

So, "first" and "last" are meaningless when talking about A vs. AAAA; the only way I see to fix this case would be to define an ordering between errors, i.e. "too many open files" always replaces "no such host" when both are present.

Out of curiosity, does enabling StrictErrors make your example behave more predictably? That might only help if the two errors are generated by different elements of conf.nameList.
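
One possible shape for such an ordering (a hypothetical helper, not code that exists in the net package) is to let a temporary error such as "socket: too many open files" take precedence, so resource exhaustion is never masked by a definitive-looking "no such host":

package main

import (
	"fmt"
	"net"
)

// preferError is a hypothetical precedence rule: given two lookup errors,
// keep a temporary one (e.g. EMFILE during dial) over a permanent one
// (e.g. "no such host"), so callers can tell exhaustion from NXDOMAIN.
func preferError(a, b error) error {
	if a == nil {
		return b
	}
	if b == nil {
		return a
	}
	if ne, ok := a.(net.Error); ok && ne.Temporary() {
		return a
	}
	if ne, ok := b.(net.Error); ok && ne.Temporary() {
		return b
	}
	return a
}

func main() {
	noHost := &net.DNSError{Err: "no such host", Name: "munic.io"}
	tooMany := &net.DNSError{Err: "socket: too many open files", Name: "munic.io", IsTemporary: true}
	fmt.Println(preferError(noHost, tooMany)) // keeps the "too many open files" error
}

With something along those lines, the assignment lastErr = racer.error in goLookupIPCNAMEOrder would become roughly lastErr = preferError(lastErr, racer.error), making the reported error independent of which query finishes last.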

@agnivade
Contributor

So, "first" and "last" are meaningless when talking about A vs. AAAA; the only way I see to fix this case would be to define an ordering between errors, i.e. "too many open files" always replaces "no such host" when both are present.

"first" and "last" still hold meaning when you think of it as first error returned. As in whichever goroutine returned first with an error. Defining a precedence order among errors might be overdoing it IMO. There are lots of errors and lots of edge cases where one error might rank above another. Maybe a simpler solution might work here ?

Out of curiosity, does enabling StrictErrors make your example behave more predictably?

I re-ran the examples with StrictErrors. No real difference. I am not super familiar with the code base, but when I print the index inside the conf.nameList range loop, I see that it has only one element.

Here is the error case, with added logging:

21:44:09-agniva-~/play/gosource/go/src$./lookup 
[0]: fqdn- munic.io., error- lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
[0]: fqdn- munic.io., error- lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
[0]: fqdn- munic.io., error- lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
[0]: fqdn- munic.io., error- lookup munic.io. on 127.0.1.1:53: dial udp 127.0.1.1:53: socket: too many open files
[0]: fqdn- munic.io., error- lookup munic.io. on 127.0.1.1:53: no such host
2017/08/31 21:44:09 lookup munic.io on 127.0.1.1:53: no such host

Mentioning the source code here for easy reference:

for i, fqdn := range conf.nameList(name) {
	for _, qtype := range qtypes {
		go func(qtype uint16) {
			cname, rrs, err := r.tryOneName(ctx, conf, fqdn, qtype)
			lane <- racer{cname, rrs, err}
		}(qtype)
	}
	hitStrictError := false
	for range qtypes {
		racer := <-lane
		if racer.error != nil {
			if nerr, ok := racer.error.(Error); ok && nerr.Temporary() && r.StrictErrors {
				fmt.Printf("[%d]: fqdn- %v, error- %v\n", i, fqdn, racer.error)
				// This error will abort the nameList loop.
				hitStrictError = true
				lastErr = racer.error
			} else if lastErr == nil || fqdn == name+"." {
				fmt.Printf("[%d]: fqdn- %v, error- %v\n", i, fqdn, racer.error)
				// Prefer error for original name.
				lastErr = racer.error
			}
			continue
		}
		addrs = append(addrs, addrRecordList(racer.rrs)...)
		if cname == "" {
			cname = racer.cname
		}
	}
	if hitStrictError {
		// If either family hit an error with StrictErrors enabled,
		// discard all addresses. This ensures that network flakiness
		// cannot turn a dualstack hostname IPv4/IPv6-only.
		addrs = nil
		break
	}
	if len(addrs) > 0 {
		break
	}
}

@pmarks-net
Contributor

"first" and "last" still hold meaning when you think of it as first error returned. As in whichever goroutine returned first with an error.

When two goroutines are literally racing each other, "first" and "last" both yield nondeterministic behavior, so I think it would be difficult to argue that either option is actually better. At least, I have no opinion on the matter.

@tjmerritt

A different question is why we get the dial udp 127.0.1.1:53: socket: too many open files error in the first place. I hit it very quickly with 100 threads making outbound HTTP requests. It seems that the file descriptors opened for UDP DNS requests are not being reused. This leak triggers the too many open files error once the user's open file limit is hit; in my case that limit is 1024. With a maximum of 100 concurrent HTTP requests, I don't really see the need for more than 300 open files. I could raise the open file limit, but that only increases the time to failure and does not eliminate it.

@mikioh
Contributor

mikioh commented Feb 16, 2018

@tjmerritt,

I filed a new issue, #23866, for your concern. Let's focus on "error handling" here.

@simar7

simar7 commented Jun 30, 2019

I'm running into this as well on Go 1.12.5. I'd like to help once a decision is made on how to approach solving this.

@mrichman

I think I just hit this issue on 1.14.5 as well:

RequestError: send request failed\ncaused by: Post https://dynamodb.us-east-1.amazonaws.com/: dial tcp: lookup dynamodb.us-east-1.amazonaws.com on 172.31.0.2:53: no such host

I get this when running one of my load test apps for a few minutes with 2000 or so concurrent goroutines.

@davecheney
Contributor

davecheney commented Jul 21, 2020

Disable cgo and it’ll work; this is a limitation of the libc resolver.

If you cannot rebuild the app without cgo, try GODEBUG=netdns=go
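
For completeness, the same thing can be requested per resolver in code rather than via the environment (a minimal sketch, assuming Go 1.9+ where net.Resolver.PreferGo is available; on some platforms the pure Go resolver may still be bypassed):

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Prefer the pure Go resolver, roughly equivalent to running the
	// binary with GODEBUG=netdns=go.
	r := &net.Resolver{PreferGo: true}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	addrs, err := r.LookupHost(ctx, "dynamodb.us-east-1.amazonaws.com")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(addrs)
}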

@mrichman

mrichman commented Jul 21, 2020

@davecheney Thanks for that! I just tried GODEBUG=netdns=go without recompiling. Now I'm getting these too many open files errors:

{"level":"error","ts":1595333131.5405927,"caller":"ddbb/main.go:26","msg":"RequestError: send request failed\ncaused by: Post https://dynamodb.us-east-1.amazonaws.com/: dial tcp: lookup dynamodb.us-east-1.amazonaws.com on 172.31.0.2:53: dial udp 172.31.0.2:53: socket: too many open files\n\n","stacktrace":"main.worker\n\t/home/ec2-user/ddbb/main.go:26"}

I'm wondering if this isn't a ulimit issue on my EC2 instance.

EDIT: Also tried recompiling with CGO_ENABLED=0 and I get the same error, so I'm assuming it's a ulimit issue.


@vkuznet

vkuznet commented Jun 6, 2021

I experienced the same problem. I tried building the code with GODEBUG=netdns=go and CGO_ENABLED=0, but the problem persists. I also adjusted the ulimit for the number of open files (e.g. my limit was set to 1048576), and yet the same issue appears. I've seen this issue on Linux, on macOS, and in k8s environments. What seems to work is literally reducing the number of concurrent calls and using a queue mechanism in the application, as sketched below.
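
A common shape for that kind of queueing workaround (a sketch, assuming the goal is simply to bound how many lookups are in flight at once with a buffered-channel semaphore):

package main

import (
	"log"
	"net"
	"sync"
)

func main() {
	hosts := []string{"example.com", "example.org", "example.net"}

	sem := make(chan struct{}, 10) // at most 10 lookups in flight at a time
	var wg sync.WaitGroup
	for _, h := range hosts {
		wg.Add(1)
		go func(host string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it when done
			if _, err := net.LookupHost(host); err != nil {
				log.Printf("%s: %v", host, err)
			}
		}(h)
	}
	wg.Wait()
}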

@matti

matti commented Aug 23, 2022

5 years later, this is still an issue

@gopherbot
Contributor

Change https://go.dev/cl/443255 mentions this issue: net: cgo resolver return correctly the too many open files error

@bcmills
Contributor

bcmills commented Jan 31, 2023

The discussion in #56192 seems to indicate that this is a glibc bug (https://sourceware.org/bugzilla/show_bug.cgi?id=29717) rather than a Go bug per se.

@rsc
Contributor

rsc commented Jan 31, 2023

I can believe that it's a glibc problem, although I just saw this same problem happen on a Mac. On the Mac we're also invoking C code to run the lookup (because DNS traffic is blocked for ordinary Go code), so maybe the Mac library has the same bug.

@mateusz834
Member

@rsc Side note: Today I stumbled across this: CL 121176, so it seems to be an old bug.
