Pods are unable to resolve DNS for services both internally and externally. #2999
Kubedns deployment file
The kube-dns pods also continuously go back and forth between the Running and CrashLoopBackOff states.
@anujshankar I asked in #2971, but will paste here as well. Could you kindly:
What we'd like to see is: what is the difference, if any, between the DNS lookups going out on the wire? Are the pod-originating lookups being SNAT'd in a particular way compared to the DNS lookups from the node OS?
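One way to answer that is to capture DNS traffic on the node while running lookups from both places and compare the source addresses; a minimal sketch, assuming eth0 is the node's primary interface:

```sh
# Capture DNS traffic leaving the node (interface name is an assumption; adjust as needed)
sudo tcpdump -ni eth0 udp port 53

# In parallel, run a lookup from the node OS and one from inside a pod,
# then compare the source IPs seen on the wire: a pod-originating lookup that
# is SNAT'd should show up with the node's IP rather than the pod IP.
```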
For others facing this issue: we tried to implement the workaround suggested in #2880. We changed the DNS nameserver IP to 8.8.8.8 (Google DNS) on the nodes (VMs) by modifying the /etc/resolv.conf file (a sketch of the change follows this comment). This solved our problem for now. However, the scary part of this workaround is that /etc/resolv.conf should not be edited by hand this way on an Azure cluster, as it will be overwritten the next time the cluster is reprovisioned for any reason. Points to note and incidents:
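A minimal sketch of that temporary workaround on a single node, assuming /etc/resolv.conf is a plain file on the node image in use:

```sh
# Back up the current resolver configuration before touching it
sudo cp /etc/resolv.conf /etc/resolv.conf.bak

# Point the node at Google DNS (temporary; see the caveat above)
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# On Azure VMs this file is managed automatically, so the change can be
# overwritten on reboot or when the cluster is reprovisioned.
```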
@jackfrancis @CecileRobertMichon
Thanks for the update @anujshankar; of course this is not an acceptable workaround. We're investigating why 168.63.129.16 is dropping DNS requests from pod-originating traffic on some clusters.
@anujshankar Will you be able to spend some time this afternoon repro'ing this failure condition so we can do some real-time debugging?
Sure @jackfrancis, let's do it.
@jackfrancis take a look at the nslookups below, fired from within a pod. Any thoughts on this?
nslookup fired from a node:
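For context, a minimal sketch of how such a pod-versus-node comparison can be run; the pod name (dns-test) and the external name used here are illustrative, and assume a test pod with nslookup available (e.g. busybox):

```sh
# From inside a pod: resolve an internal service and an external name
kubectl exec -it dns-test -- nslookup kubernetes.default.svc.cluster.local
kubectl exec -it dns-test -- nslookup bing.com

# From the node OS: resolve the same external name, optionally directly
# against the Azure-provided resolver (168.63.129.16)
nslookup bing.com
nslookup bing.com 168.63.129.16
```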
@anujshankar Just to confirm, are you using the Azure CNI network implementation on your cluster?
@jackfrancis yes, we are using the Azure CNI network implementation, and we are running kubelet with
@khenidak FYI
@jackfrancis @khenidak this issue has come up again in our QA cluster.
Hi @anujshankar, the next time you encounter this issue, could you kindly:
Thanks!
Sure @jackfrancis!
@jackfrancis @khenidak I collected the output of "ebtables -t nat -L" and kern.log, and have sent the output to you both by mail. Let us know if we can do a screen-share session.
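A minimal sketch of how that output can be gathered on an affected node; the kern.log path assumes an Ubuntu-based node image:

```sh
# Layer-2 NAT rules on the node's bridge (relevant to how Azure CNI handles traffic)
sudo ebtables -t nat -L > ebtables-nat.txt

# Kernel log, useful for spotting dropped packets or bridge/netfilter messages
sudo cp /var/log/kern.log kern-$(hostname).log
```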
@jackfrancis would the solution mentioned in #2971 help us out too?
@anujshankar Can you please attach the output here? Depending on the reason, it may or may not help. In addition to what Jack has mentioned, please share the output of
We think it may, yes.
@sharmasushant can you send me your email ID? @jackfrancis that's great; before proceeding, it would be great if @sharmasushant took a close look at the output of the above commands.
@jackfrancis @khenidak @sharmasushant Outcome: we haven't faced any DNS resolution failures so far. We plan to re-create our QA and production clusters using the flannel network plugin as well. Do you think we are going in the right direction?
@anujshankar My email is [email protected]
Sure, will send you the details by EOD (GMT+6:30).
@sharmasushant I have sent you a mail with the details.
It has been over a month with flannel and we have not seen such a DNS resolution problem even once.
@diwakar-s-maurya Thanks for sharing! Would love to hear more anecdotes on real-world flannel experience, positive or negative.
Is this a request for help?:
Yes
Is this an ISSUE or FEATURE REQUEST? (choose one):
Issue
What version of acs-engine?:
0.15.2
Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm):
Kubernetes Version: 1.10.1
What happened:
All our internal services report a DNS resolution failure when called, resulting in an overall network outage in our cluster.
What you expected to happen:
DNS should be resolved.
How to reproduce it (as minimally and precisely as possible):
Happens erratically; the root cause of the problem is still unknown.
Anything else we need to know:
We are facing this issue in production. It results in a huge business impact, as all services running on our cluster go down.
Observation:
Current hack/fix that we use today (temporary):
Attached Kubedns logs