internal DNS queries fail sometimes during build #2482
@smarterclayton - potentially the same issue as #2024
My suspicion is that the upstream is either taking too long or not responding. We should be able to reproduce this against the internal name server and determine whether it's a timeout/cache problem.
I suspect SkyDNS (as configured for OpenShift) is the problem here. It seems to be answering queries for things it shouldn't, acting as an open resolver, which is already a problem; but beyond that, it doesn't answer consistently if it's chained to more than one nameserver. So if you configure a node's /etc/resolv.conf with two nameservers (one for internal components, one for "real" DNS), you'll get inconsistent results. Here's a sequence of queries where I have just that setup. I use a side dnsmasq installation for resolving the actual OpenShift hosts, in addition to the regular DNS. 172.16.4.81 is the master where SkyDNS is running.
Thereafter it randomly answers with NOERROR or NXDOMAIN. Obviously this plays hob with builds and deploys contacting the master, and it would be the same problem if you're building or pulling from an internal repository. You'll only see the problem in the container where SkyDNS is inserted ahead of the other resolvers from the host, and only when you have multiple upstream nameservers that give different results. IMNSHO, SkyDNS should not be chaining requests to resolve any domains it doesn't own. That's behavior we must be able to disable to avoid deploying open resolvers. Even if it's configured to do that, it should consult the resolver chain in the correct order, not randomly select an upstream server. What options do we have for configuring SkyDNS?
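To make the failure mode above concrete, a reproduction along these lines might be a loop of identical queries against the master's SkyDNS; the master address 172.16.4.81 comes from the comment above, while the queried hostname is purely illustrative:

```bash
# Repeat the same query against SkyDNS on the master and watch the status
# field flip between NOERROR and NXDOMAIN when more than one upstream is chained.
for i in $(seq 1 10); do
  dig @172.16.4.81 node1.example.com A +noall +comments | grep 'status:'
  sleep 1
done
```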
@sosiouxme it looks like we can configure SkyDNS not to forward
Looking for the config file options that feed into that config... It should really be the default to not forward requests, BTW.
@sosiouxme I've tested setting
@ncdc I think for OpenShift we need some more options in https://github.com/openshift/origin/blob/master/pkg/cmd/server/origin/master.go#L773-L796, although if we just hardcode do-not-forward that would be fine (at least for now). It shouldn't spew warnings, though; it's not noteworthy to respond NXDOMAIN and let the next resolver handle it...
I think what we need is NoRec https://github.com/skynetservices/skydns/blob/master/server/config.go#L41
@sosiouxme NoRec isn't in our vendored copy, but we can update.
I've updated the vendored copy and enabled NoRec. It seems to be working - it responds with SERVFAIL. I do see this show up in the log every time I try to curl the docker-registry service via DNS:
Not exactly sure what that means.
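For reference, a quick way to check the NoRec behavior from a node might look like the following; the addresses and the cluster domain are assumptions, and per the comment above the out-of-zone query is expected to come back SERVFAIL rather than being forwarded:

```bash
# In-zone name: SkyDNS should still answer authoritatively.
dig @172.16.4.81 docker-registry.default.svc.cluster.local A

# Out-of-zone name: with NoRec enabled, expect SERVFAIL and no forwarding.
dig @172.16.4.81 google.com A
```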
If you update, be sure to grab the upstream patch we have applied over it.
Well, as long as the resolver moves on to the next one, I guess it's not important if the return code is a little funny and there is some extra spew in the log (we should probably fix the log spew, though; it's not an error of any kind).
If SkyDNS doesn't forward requests, what happens when a container asks for something that isn't in SkyDNS? e.g. google.com
It should move on to the next nameserver, which will be the host's nameserver[s].
Should the hosts be able to use SkyDNS for resolution? This came up in a dev list thread regarding using SkyDNS for finding the registry at the host level. If the hosts use SkyDNS for resolution, then the "next nameserver" would again be SkyDNS, unless we configure SkyDNS to forward...
@thoraxe I've tested this on my host, with this for /etc/resolv.conf:
It works fine for DNS resolution on the host. It will resolve cluster DNS entries. It will resolve non-cluster DNS entries like google.com - skydns doesn't find google.com, so the host tries the next nameserver (my x.x.x.x entry). I can't say I've specifically looked at how resolution works in the containers with this change in place.
Yeah, you would have to test from inside a container launched pointing at SkyDNS as its only resolver. My assumption is that your host tries to resolve the entry, SkyDNS rejects, and then it moves to the next resolver. Who actually answered your query for google.com? The normal DNS server?
Yes
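A hypothetical /etc/resolv.conf matching the setup described above (these are placeholder addresses, not the commenter's actual file):

```
# /etc/resolv.conf on the host (illustrative)
nameserver 172.16.4.81   # master running SkyDNS: answers only cluster names
nameserver x.x.x.x       # regular upstream resolver: answers everything else
```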
This is from within a pod; 192.168.122.90 is my master, and 192.168.122.1 is libvirt's dnsmasq assigned via DHCP to the host where this pod is running.
I guess this is a positive sign?
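The sort of checks that produce a trace like the one referenced above, run from inside the pod (assuming dig is available in the image; the queried names are illustrative):

```bash
# Query each resolver from the pod's /etc/resolv.conf directly and compare answers.
cat /etc/resolv.conf
dig @192.168.122.90 docker-registry.default.svc.cluster.local A   # master / SkyDNS
dig @192.168.122.1  google.com A                                  # libvirt dnsmasq from DHCP
```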
Containers don't get SkyDNS as their only resolver. Currently, containers get the /etc/resolv.conf from the host plus SkyDNS at the top. Andy's proposing putting it at the front of the host /etc/resolv.conf too, in which case it wouldn't need inserting at all.
#2569 prevents skydns from recursing requests it doesn't know about. So this should be working now; @TomasTomecek can you verify? Shortly, we will switch from having kubernetes insert skydns as the first resolver to the expectation that nodes will have it at the top of their /etc/resolv.conf (which docker already passes on to containers it deploys). This way, kubernetes pods, docker containers like builders spun off from a build (without help from kubernetes), and nodes will all have the same environment for DNS resolution (except that the node also has /etc/hosts).
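One way to sanity-check that plan on a node, given that docker copies the host's /etc/resolv.conf into containers it starts (assuming no --dns overrides are in play): the node and a throwaway container should show the same nameserver ordering.

```bash
# Compare the node's resolver configuration with what a plain docker container sees.
cat /etc/resolv.conf
docker run --rm busybox cat /etc/resolv.conf
```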
We disabled skydns by removing it from the master config (as suggested by @csrwng). I guess I can verify this (but since this wasn't happening consistently, it will be hard to do).
Fixed by the changes to put the master DNS entry in the host /etc/resolv.conf and disabling the open forwarding on the master.
FYI, we hit this issue in the lab in Brno some time back, but the issue wasn't the recursion; it was the fact that the resolv.conf being pushed from the DHCP server contained entries that were not reachable. Disabling PEERDNS on our minions and only using the SkyDNS resolver fixed our issues, until today, that is, when we updated to this version.
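For anyone else hitting the DHCP-pushed-resolver problem described above, disabling PEERDNS is a per-interface setting in the RHEL/CentOS network scripts; the interface name below is illustrative:

```bash
# /etc/sysconfig/network-scripts/ifcfg-eth0 (fragment): stop the DHCP client
# from overwriting /etc/resolv.conf with nameservers pushed by the DHCP server.
PEERDNS="no"
```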
still hitting this with 1.0.5
Can you reach out to @mfojtik and myself offline so we can debug this?
@smarterclayton already talking to @sosiouxme. EDIT: looks like it's related to this: https://docs.openshift.com/enterprise/3.0/admin_guide/iptables.html#restarting
I've got that issue on origin 1.5.1, and sometimes the application won't contact the database via the service name.
@metal3d Two things I'd check: are the service endpoints for the kubernetes service all reachable from the node where you saw the failures?
Also, are there any errors logged by the dnsmasq service?
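A couple of concrete ways to check those two things from the affected node (a working oc client and journald are assumed; the master host is a placeholder to be filled in from the endpoint list):

```bash
# Are the kubernetes service endpoints (the masters) reachable from this node?
oc get endpoints kubernetes -n default
MASTER_HOST=172.16.135.11                       # placeholder: repeat for each endpoint listed
curl -k "https://${MASTER_HOST}:8443/healthz"

# Any recent errors from the dnsmasq service?
journalctl -u dnsmasq --since "1 hour ago" | grep -iE 'error|refused|timeout'
```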
Hi @sdodson:
172.16.135.11:8443 and 172.16.135.12:8443 are my masters (etcd is also installed on node1 to have 3 servers). In containers, resolv.conf points at the node the pod runs on (e.g. 172.16.135.15). On each node, I've got:
So I tried changing the node configuration to set dnsIP to 172.30.0.1, and it seems to work. One more thing: now that I've set up the node config to hit SkyDNS, I see name resolution:
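For reference, that change corresponds to the dnsIP field in the node configuration; a hypothetical check (the file path may differ depending on how the cluster was installed):

```bash
# Confirm the node sends container DNS traffic to the cluster DNS address.
grep dnsIP /etc/origin/node/node-config.yaml
# expected output: dnsIP: 172.30.0.1
```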
That was not the case before, while "no-resolv" remained in the dnsmasq configuration, I think. Note that I've installed openshift-origin with openshift-ansible on a fresh CentOS 7 installation (I just installed and enabled the NetworkManager service). We have the same issue on CentOS 7 installed on Scaleway.io: sometimes dnsmasq fails to resolve a service name, just as on our 5 bare-metal machines here.
The problem is that now that I query SkyDNS directly, I get no IP rotation when I resolve a service name. It's a pity not to benefit from dnsmasq's options, caching, and so on...
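One way to keep dnsmasq's caching in front while still resolving cluster names is to forward only the cluster domain to SkyDNS. This is a sketch, not the configuration shipped by openshift-ansible, and the domain and addresses are assumptions:

```bash
# Hypothetical dnsmasq drop-in: cache everything locally, send only cluster
# names to SkyDNS, and use a normal upstream for the rest.
cat <<'EOF' > /etc/dnsmasq.d/cluster-dns.conf
no-resolv
server=/cluster.local/172.30.0.1
server=8.8.8.8
cache-size=1000
EOF
systemctl restart dnsmasq
```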
Forget what I said: I get the same error without going through dnsmasq, so the problem seems to come from SkyDNS or something similar:
Original issue description:
My git clone during build failed with:
Unfortunately, this is not 100% reproducible. Only happens sometimes. When I compare docker inspect, /etc/hosts, and /etc/resolv.conf of the "wrong" and "good" build containers, they match precisely.
Logs: container's hosts, container's resolv.conf, host's resolv.conf, docker inspect $build_container
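For completeness, the kinds of commands used to collect the artifacts listed above (the container id is a placeholder taken from docker ps -a):

```bash
# Compare the DNS-related state of a "wrong" and a "good" build container.
build_container=abc123                                           # placeholder: id from `docker ps -a`
docker inspect "$build_container"
docker exec "$build_container" cat /etc/hosts /etc/resolv.conf   # only if it is still running
cat /etc/hosts /etc/resolv.conf                                  # the same files on the host
```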