Prow jobs are failing with 'Could not resolve host: github.com' #20716
/assign @e-blackwelder
Various jobs are still flaky, examples:
/priority critical-urgent
@jkaniuk: You must be a member of the kubernetes/milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your and have them propose you as an additional delegate for this responsibility. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
It seems that flaking jobs run on cluster
It seems that migration to
EDIT: however, this can move the DNS problem to other parts of the test
It is not only scalability jobs that have this problem. Shouldn't we enable NodeLocalDNS on the Prow cluster?
/cc @cjwagner as test-infra-oncall
Cole, could you edit the kube-dns configmap in the cluster?
Regarding NodeLocalDNS - kubernetes/kubernetes#56903 - it seems that this would increase the reliability of DNS, so it should be enabled.
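Once the addon is enabled, node-local-dns runs as a DaemonSet in kube-system, so a quick way to check whether it is already present on the build cluster would be something like the sketch below (DaemonSet name and label are assumed from the upstream NodeLocal DNSCache manifests; the context name is taken from the kubectl command further down in this thread):

```
# Sketch only: check for the NodeLocal DNSCache DaemonSet on the prow-build cluster.
# DaemonSet name and label are assumed from the upstream addon manifests.
kubectl --context=gke_k8s-infra-prow-build_us-central1_prow-build \
  -n kube-system get daemonset node-local-dns
kubectl --context=gke_k8s-infra-prow-build_us-central1_prow-build \
  -n kube-system get pods -l k8s-app=node-local-dns -o wide
```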
This is exactly the patch I mentioned in #20816 (comment), + @BenTheElder for awareness
@jprzychodzen @chaodaiG There are no existing data entries in the
Not entirely sure, but based on my understanding, there are 2 things needed:
@cjwagner I was just referring to the scaling configuration for kube-dns, which of course is controlled in a ConfigMap object. For NodeLocal DNSCache, please run
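The exact command is elided above; as a rough sketch, enabling NodeLocal DNSCache on an existing GKE cluster generally looks like the following (cluster name, region, and project are taken from the kubectl context that appears later in this thread; the `--update-addons` flag is an assumption about the gcloud CLI, not quoted from the thread):

```
# Sketch only: enable the NodeLocal DNSCache addon on the prow-build cluster.
# Names come from the gke_k8s-infra-prow-build_us-central1_prow-build context; verify before running.
gcloud container clusters update prow-build \
  --project=k8s-infra-prow-build \
  --region=us-central1 \
  --update-addons=NodeLocalDNS=ENABLED
```

Enabling the addon typically recreates the cluster's nodes, which is consistent with the later comment that the problem stopped once the cluster's nodes were recreated.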
We use terraform to manage these clusters; will this cause terraform to think it needs to recreate the cluster? e.g.
I have drafted a change for this, kubernetes/k8s.io#1680; it shouldn't trigger cluster recreation.
/milestone v1.21
kubernetes/k8s.io#1686 (comment) - the addon should be installed to
Regarding editing the config map, where are the numbers coming from? Is this something we could check in as a resource to apply automatically?
This is the current setting:
I have briefly looked at the terraform docs; it doesn't seem like there is a way to apply a configmap.
The numbers were from "trial-and-error", as I had guessed: basically scaling up node-local-dns pods from 1 pod per 16 nodes to 1 pod per 4-8 nodes.
That's fine, I chose to have the build cluster terraform stop at the "infra" layer. It gets a cluster up and configured, but what is deployed to that cluster is something else's responsibility. In this case, files in a given cluster's
So if it's just a matter of committing a
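For illustration, committing and applying such a file could look like the sketch below (values are copied from the kubectl verification shown later in this thread; the actual manifest merged in kubernetes/k8s.io#1691 may be organized differently):

```
# Sketch only: apply a kube-dns-autoscaler ConfigMap with the scaling values reported later in this thread.
cat <<'EOF' | kubectl --context=gke_k8s-infra-prow-build_us-central1_prow-build apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
data:
  linear: '{"coresPerReplica":256,"nodesPerReplica":8,"min":4,"preventSinglePointFailure":true}'
EOF
```

With the autoscaler's linear mode, the kube-dns replica count is roughly the larger of ceil(cores/coresPerReplica) and ceil(nodes/nodesPerReplica), floored at min, so nodesPerReplica: 8 means about one kube-dns replica per 8 nodes.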
Generally sgtm. There is one catch: it's a system configmap with its value formatted as a string, so an interface change in the future will fail silently.
kubernetes/k8s.io#1691 merged, which deployed the configmap changes:

```
# kubectl --context=gke_k8s-infra-prow-build_us-central1_prow-build get configmap -n kube-system kube-dns-autoscaler -o=json | jq -r .data.linear | jq .
{
  "coresPerReplica": 256,
  "nodesPerReplica": 8,
  "min": 4,
  "preventSinglePointFailure": true
}
```

If you need to make any further changes, please do so by opening PRs against that repo.

If I look at https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-cos-master-scalability-100&width=5 as a reference, this problem seems to have resolved once the dns cache was enabled (but before this patch was added)?
I'm trying to see if I can find a cloud logging query that shows the extent of this issue, and to confirm it's gone.
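As a sketch of what such a query might look like (the filter string comes from the issue title, the project from the kubectl context above; the exact resource type and field names depend on how the build cluster's container logs are ingested):

```
# Sketch only: search recent build-cluster container logs for the DNS failure string.
gcloud logging read \
  'resource.type="k8s_container" AND textPayload:"Could not resolve host: github.com"' \
  --project=k8s-infra-prow-build \
  --freshness=7d \
  --limit=20
```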
/close
But I also haven't seen any occurrences of "Cannot resolve github.com" since the cluster's nodes were recreated. Please /reopen if you run into this again.
@spiffxp: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Job https://testgrid.k8s.io/sig-scalability-gce#gce-cos-1.19-scalability-100 was flaking a lot; it seems that the problem is now resolved. EDIT: not -> now
The most recent failure on https://testgrid.k8s.io/sig-scalability-gce#gce-cos-1.19-scalability-100 was from last Friday. The only red column was on last Friday at 4AM (https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability-stable1/1362733725674115072), which was before @spiffxp updated the nodepool with the fixes (see the timestamp at kubernetes/k8s.io#1686 (comment)). @jprzychodzen, did you see anything different there?
Sorry, a typo in my comment. I've fixed it. Thanks again for handling this issue. s/not/now/ changes the meaning a lot ;-)
What happened:
Many prow jobs started failing with errors like:
e.g. https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/1356950100881969152
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-build-fast/1356957645537284096/
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Please provide links to example occurrences, if any:
Anything else we need to know?:
https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-cos-master-scalability-100 suggests that this started happening between 03:43 PST and 04:27 PST.