Prow jobs are failing with 'Could not resolve host: github.com' #20716
/assign @e-blackwelder
Various jobs are still flaky, examples:
/priority critical-urgent
@jkaniuk: You must be a member of the kubernetes/milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your and have them propose you as an additional delegate for this responsibility. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
It seems that flaking jobs run on cluster
It seems that migration to
EDIT: however, this can move the DNS problem to other parts of the test
It is not only scalability jobs that have this problem. Shouldn't we enable NodeLocalDNS on the Prow cluster?
/cc @cjwagner as test-infra-oncall
Cole, could you edit the kube-dns configmap in the cluster?
Regarding NodeLocalDNS - kubernetes/kubernetes#56903 - it seems that this would increase the reliability of DNS, so it should be enabled.
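Once the addon is enabled, node-local-dns runs as a DaemonSet in kube-system, so a quick way to check whether it is already present on the build cluster would be something like the sketch below (DaemonSet name and label are assumed from the upstream NodeLocal DNSCache manifests; the context name is taken from the kubectl command further down in this thread):

```
# Sketch only: check for the NodeLocal DNSCache DaemonSet on the prow-build cluster.
# DaemonSet name and label are assumed from the upstream addon manifests.
kubectl --context=gke_k8s-infra-prow-build_us-central1_prow-build \
  -n kube-system get daemonset node-local-dns
kubectl --context=gke_k8s-infra-prow-build_us-central1_prow-build \
  -n kube-system get pods -l k8s-app=node-local-dns -o wide
```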
This is exactly the patch I mentioned in #20816 (comment), + @BenTheElder for awareness
@jprzychodzen @chaodaiG There are no existing data entries in the
Not entirely sure, but based on my understanding, there are 2 things needed:
@cjwagner I was just referring to the scaling configuration for kube-dns, which of course is controlled in a ConfigMap object. For NodeLocal DNSCache, please run
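The exact command is elided above; as a rough sketch, enabling NodeLocal DNSCache on an existing GKE cluster generally looks like the following (cluster name, region, and project are taken from the kubectl context that appears later in this thread; the `--update-addons` flag is an assumption about the gcloud CLI, not quoted from the thread):

```
# Sketch only: enable the NodeLocal DNSCache addon on the prow-build cluster.
# Names come from the gke_k8s-infra-prow-build_us-central1_prow-build context; verify before running.
gcloud container clusters update prow-build \
  --project=k8s-infra-prow-build \
  --region=us-central1 \
  --update-addons=NodeLocalDNS=ENABLED
```

Enabling the addon typically recreates the cluster's nodes, which is consistent with the later comment that the problem stopped once the cluster's nodes were recreated.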
We use terraform to manage these clusters; will this cause terraform to think it needs to recreate the cluster? e.g.
I have drafted a change for this, kubernetes/k8s.io#1680; it shouldn't trigger cluster recreation.
/milestone v1.21
kubernetes/k8s.io#1686 (comment) - the addon should be installed to
Regarding editing the config map, where are the numbers coming from? Is this something we could check in as a resource to apply automatically?
This is the current setting:
I have briefly looked at the terraform docs; it doesn't seem like there is a way to apply a configmap.
The numbers were from "trial-and-error", as I had guessed: basically scaling up node-local-dns pods from 1 pod per 16 nodes to 1 pod per 4-8 nodes.
That's fine, I chose to have the build cluster terraform stop at the "infra" layer. It gets a cluster up and configured, but what is deployed to that cluster is something else's responsibility. In this case, files in a given cluster's
So if it's just a matter of committing a
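For illustration, committing and applying such a file could look like the sketch below (values are copied from the kubectl verification shown later in this thread; the actual manifest merged in kubernetes/k8s.io#1691 may be organized differently):

```
# Sketch only: apply a kube-dns-autoscaler ConfigMap with the scaling values reported later in this thread.
cat <<'EOF' | kubectl --context=gke_k8s-infra-prow-build_us-central1_prow-build apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns-autoscaler
  namespace: kube-system
data:
  linear: '{"coresPerReplica":256,"nodesPerReplica":8,"min":4,"preventSinglePointFailure":true}'
EOF
```

With the autoscaler's linear mode, the kube-dns replica count is roughly the larger of ceil(cores/coresPerReplica) and ceil(nodes/nodesPerReplica), floored at min, so nodesPerReplica: 8 means about one kube-dns replica per 8 nodes.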
Generally sgtm. There is one catch: it's a system configmap with its value formatted as a string, so an interface change in the future will fail silently.
kubernetes/k8s.io#1691 merged, which deployed the configmap changes:

```
# kubectl --context=gke_k8s-infra-prow-build_us-central1_prow-build get configmap -n kube-system kube-dns-autoscaler -o=json | jq -r .data.linear | jq .
{
  "coresPerReplica": 256,
  "nodesPerReplica": 8,
  "min": 4,
  "preventSinglePointFailure": true
}
```

If you need to make any further changes, please do so by opening PRs against that repo.

If I look at https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-cos-master-scalability-100&width=5 as a reference, this problem seems to have resolved once the dns cache was enabled (but before this patch was added)?
I'm trying to see if I can find a cloud logging query that shows the extent of this issue, and to confirm it's gone.
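As a sketch of what such a query might look like (the filter string comes from the issue title, the project from the kubectl context above; the exact resource type and field names depend on how the build cluster's container logs are ingested):

```
# Sketch only: search recent build-cluster container logs for the DNS failure string.
gcloud logging read \
  'resource.type="k8s_container" AND textPayload:"Could not resolve host: github.com"' \
  --project=k8s-infra-prow-build \
  --freshness=7d \
  --limit=20
```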
/close
But I also haven't seen any occurrences of "Cannot resolve github.com" since the cluster's nodes were recreated. Please /reopen if you run into this again.
@spiffxp: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Job https://testgrid.k8s.io/sig-scalability-gce#gce-cos-1.19-scalability-100 was flaking a lot; it seems that the problem is now resolved. EDIT: not -> now
The most recent failure on https://testgrid.k8s.io/sig-scalability-gce#gce-cos-1.19-scalability-100 was from last Friday. The only red column was on last Friday at 4AM (https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability-stable1/1362733725674115072), which was before @spiffxp updated the nodepool with the fixes (see the timestamp at kubernetes/k8s.io#1686 (comment)). @jprzychodzen, did you see anything different there?
Sorry, a typo in my comment. I've fixed it. Thanks again for handling this issue. s/not/now/ changes the meaning a lot ;-)
What happened:
Many prow jobs started failing with errors like:
e.g. https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/1356950100881969152
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-build-fast/1356957645537284096/
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Please provide links to example occurrences, if any:
Anything else we need to know?:
https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-cos-master-scalability-100 suggests that this started happening between 03:43 PST and 04:27 PST.