capz-pr-upstream-k8s-ci-windows-containerd-main is extremely flaky #4144

Closed
Jont828 opened this issue Oct 16, 2023 · 15 comments
Labels
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@Jont828
Contributor

Jont828 commented Oct 16, 2023

Which jobs are failing: capz-pr-upstream-k8s-ci-windows-containerd-main

Which tests are failing: Conformance

Since when has it been failing: For some time, likely since the beginning of October.

Testgrid link: https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-pr-upstream-k8s-ci-windows-containerd-main

Reason for failure (if possible):

Anything else we need to know: It is extremely flaky, to the point where it can effectively be considered failing.

/kind failing-test


@k8s-ci-robot k8s-ci-robot added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Oct 16, 2023
@Jont828
Contributor Author

Jont828 commented Oct 16, 2023

@CecileRobertMichon @nojnhuh to follow up on what we mentioned at standup

@CecileRobertMichon
Contributor

@Jont828 are there any patterns in the errors/failures?

@nojnhuh
Contributor

nojnhuh commented Oct 17, 2023

Overall this looks very similar to #3842

@jackfrancis
Contributor

I may have just repro'd this locally

  [FAILED] Timed out after 1200.002s.
  KubeadmControlPlane object capz-conf-pp30cc/capz-conf-pp30cc-control-plane was not initialized in time
  The function passed to Eventually failed at /Users/jackfrancis/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/common.go:268 with:
  KubeadmControlPlane is not yet initialized
  Expected
      <bool>: false
  to be true
  In [It] at: /Users/jackfrancis/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/common.go:269 @ 10/17/23 10:54:50.168

  Full Stack Trace
    sigs.k8s.io/cluster-api-provider-azure/test/e2e.EnsureControlPlaneInitialized({_, _}, {{0x103075ff8, 0xc0002ffc80}, {0xc001012001, 0xb81f, 0xb820}, {0xc00041f4e0, 0x10}, {0xc000d990a0, ...}, ...}, ...)
    	/Users/jackfrancis/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/common.go:269 +0x40e
    sigs.k8s.io/cluster-api/test/framework/clusterctl.ApplyCustomClusterTemplateAndWait({_, _}, {{0x103075ff8, 0xc0002ffc80}, {0xc001012001, 0xb81f, 0xb820}, {0xc00041f4e0, 0x10}, {0xc000d990a0, ...}, ...}, ...)
    	/Users/jackfrancis/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/clusterctl/clusterctl_helpers.go:404 +0xd56
    sigs.k8s.io/cluster-api/test/framework/clusterctl.ApplyClusterTemplateAndWait({_, _}, {{0x103075ff8, 0xc0002ffc80}, {{0xc0017ac4e0, 0x5e}, {0xc0008e004c, 0x6d}, {0xc0008e3617, 0x1f}, ...}, ...}, ...)
    	/Users/jackfrancis/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/clusterctl/clusterctl_helpers.go:312 +0x994
    sigs.k8s.io/cluster-api-provider-azure/test/e2e.glob..func3.2()
    	/Users/jackfrancis/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:149 +0xad1
$ k get machinedeployments -A
NAMESPACE          NAME                      CLUSTER            REPLICAS   READY   UPDATED   UNAVAILABLE   PHASE       AGE   VERSION
capz-conf-pp30cc   capz-conf-pp30cc-md-0     capz-conf-pp30cc                                              Running     44m   v1.29.0-alpha.2.260+760599db27e272
capz-conf-pp30cc   capz-conf-pp30cc-md-win   capz-conf-pp30cc   2                  2         2             ScalingUp   44m   v1.29.0-alpha.2.260+760599db27e272
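To dig further into why the control plane never reported initialized, the KubeadmControlPlane and its machines can be inspected directly. A minimal sketch against the management cluster, reusing the resource names from this run:

  $ kubectl get kubeadmcontrolplane -A
  $ kubectl -n capz-conf-pp30cc describe kubeadmcontrolplane capz-conf-pp30cc-control-plane
  $ kubectl -n capz-conf-pp30cc get machines

The conditions in the describe output (and the corresponding AzureMachine events) should show whether the first control-plane VM ever came up.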

@jsturtevant
Contributor

From the logs of the last run it looks like kube-proxy for Windows failed to start; both of the Windows nodes came up, but kube-proxy and cloud-node-manager are failing: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/4052/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1714151216998518784/artifacts/clusters/capz-conf-qhdx2j/kube-system/kube-proxy-windows-nkq67/pod-describe.txt

  Normal   Pulled     19m (x7 over 29m)     kubelet            Container image "sigwindowstools/kube-proxy:v1.29.0-alpha.2.250_86676bce04dac0-calico-hostprocess" already present on machine
  Warning  BackOff    4m26s (x86 over 27m)  kubelet            Back-off restarting failed container kube-proxy in pod kube-proxy-windows-nkq67_kube-system(d74d0136-3139-43a0-b199-ff01041e89c4)
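
A natural next step to find out why kube-proxy keeps crash-looping would be the previous container logs from that pod; a minimal example, reusing the pod name from the run above:

  $ kubectl -n kube-system logs kube-proxy-windows-nkq67 -c kube-proxy --previous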

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/4052/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1714151216998518784/artifacts/clusters/capz-conf-qhdx2j/kube-system/cloud-node-manager-windows-k2xrq/pod-describe.txt

 Normal   Scheduled  29m                    default-scheduler  Successfully assigned kube-system/cloud-node-manager-windows-k2xrq to capz-conf-sdlj5
  Normal   Pulling    27m (x4 over 29m)      kubelet            Pulling image "capzci.azurecr.io/azure-cloud-node-manager:266c5e0"
  Warning  Failed     27m (x4 over 29m)      kubelet            Error: ErrImagePull
  Normal   BackOff    4m18s (x108 over 29m)  kubelet            Back-off pulling image "capzci.azurecr.io/azure-cloud-node-manager:266c5e0"

And kubelet is reporting an invalid image: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/4052/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1714151216998518784/artifacts/clusters/capz-conf-qhdx2j/machines/capz-conf-qhdx2j-md-win-8gm48-7f5cl/kubelet.log

[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod cloud-node-manager-windows-zcpvq_kube-system(f7c0e6ea-2f42-4063-8a32-c6e93ed938c2): ErrImagePull: rpc error: code = NotFound desc = failed to pull and unpack image "capzci.azurecr.io/azure-cloud-node-manager:266c5e0": no match for platform in manifest: not found
E1017 06:07:32.266805    3828 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cloud-node-manager\" with ErrImagePull: \"rpc error: code = NotFound desc = failed to pull and unpack image \\\"capzci.azurecr.io/azure-cloud-node-manager:266c5e0\\\": no match for platform in manifest: not found\"" pod="kube-system/cloud-node-manager-windows-zcpvq" podUID="f7c0e6ea-2f42-4063-8a32-c6e93ed938c2"

@jsturtevant
Contributor

jsturtevant commented Oct 17, 2023

A different failure also has issues with cloud-node-manager, but with a different error: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/4069/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1714025610105327616/artifacts/clusters/capz-conf-nq4wee/kube-system/cloud-node-manager-windows-xwn2p/pod-describe.txt

  Normal   Scheduled  30m                    default-scheduler  Successfully assigned kube-system/cloud-node-manager-windows-xwn2p to capz-conf-hrtrc
  Warning  Failed     29m                    kubelet            Failed to pull image "capzci.azurecr.io/azure-cloud-node-manager:266c5e0": failed to pull and unpack image "capzci.azurecr.io/azure-cloud-node-manager:266c5e0": failed to resolve reference "capzci.azurecr.io/azure-cloud-node-manager:266c5e0": failed to do request: Head "https://capzci.azurecr.io/v2/azure-cloud-node-manager/manifests/266c5e0": dial tcp: lookup capzci.azurecr.io: no such host
  Normal   Pulling    28m (x4 over 29m)      kubelet            Pulling image "capzci.azurecr.io/azure-cloud-node-manager:266c5e0"
  Warning  Failed     28m (x4 over 29m)      kubelet            Error: ErrImagePull
  Normal   BackOff    4m40s (x109 over 29m)  kubelet            Back-off pulling image "capzci.azurecr.io/azure-cloud-node-manager:266c5e0"

but the kubelet gives a similar error: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/4069/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1714025610105327616/artifacts/clusters/capz-conf-nq4wee/machines/capz-conf-nq4wee-md-win-gmdt8-dm7vq/kubelet.log

E1016 21:36:17.082538    3432 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cloud-node-manager\" with ErrImagePull: \"rpc error: code = NotFound desc = failed to pull and unpack image \\\"capzci.azurecr.io/azure-cloud-node-manager:266c5e0\\\": no match for platform in manifest: not found\"" pod="kube-system/cloud-node-manager-windows-xwn2p" podUID="8f95420f-6e73-42a9-b795-43b29df53ca4"

@jsturtevant
Contributor

jsturtevant commented Oct 17, 2023

from job https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1714025610105327616

we see that capzci.azurecr.io/azure-cloud-node-manager:266c5e0 is Linux-only:

regctl manifest get capzci.azurecr.io/azure-cloud-controller-manager:266c5e0
Name:        capzci.azurecr.io/azure-cloud-controller-manager:266c5e0
MediaType:   application/vnd.docker.distribution.manifest.v2+json
Digest:      sha256:d4e21d25eeabdbac71f3d87614b9a98805995fb2f0f30840cad7da09a4c620e7
Total Size:  17.99MB

Config:
  Digest:    sha256:fc7aa3ed566fcabcf6761407799a3565fda8d08cf6d6dc3ce63ba053d3cbb64e
  MediaType: application/vnd.docker.container.image.v1+json
  Size:      1793B

Layers:

  Digest:    sha256:63b450eae87c42ba59c0fa815ad0e5b8cb6fb76a039cc341dbff6e744fa77a77
  MediaType: application/vnd.docker.image.rootfs.diff.tar.gzip
  Size:      83958B
// ...truncated

Update: I got that wrong, I queried the wrong image: azure-cloud-controller-manager (which should be Linux-only) instead of azure-cloud-node-manager.
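
For reference, a quick way to confirm whether the node-manager image actually carries a windows/amd64 entry is to query the correct image the same way and look for a Windows platform in the manifest list; a minimal sketch (if it prints nothing, the manifest has no Windows entry):

  $ regctl manifest get capzci.azurecr.io/azure-cloud-node-manager:266c5e0 | grep -i windows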

@jsturtevant
Contributor

feels similar to #3270

@jsturtevant
Contributor

Retriggered on #4052, which was the most recent failure, and it worked. My current thought is that somewhere the cloud-node-manager is being built as Linux-only and then not rebuilt (maybe the cloud-provider project is running jobs at the same time?).

@CecileRobertMichon
Contributor

This is the script that decides whether it can reuse the existing images or needs to rebuild them: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/scripts/ci-build-azure-ccm.sh#L101
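
For context, reuse checks like this typically ask whether the expected image tag already exists in the registry before kicking off a rebuild. A rough, hypothetical sketch of that pattern (not the actual script; the image and tag below are the ones from the failing runs):

  IMAGE="capzci.azurecr.io/azure-cloud-node-manager"
  TAG="266c5e0"  # commit-based tag

  # If a manifest for this tag can already be fetched, skip the rebuild;
  # otherwise build and push a fresh image.
  if docker manifest inspect "${IMAGE}:${TAG}" > /dev/null 2>&1; then
    echo "${IMAGE}:${TAG} already exists, skipping build"
  else
    echo "${IMAGE}:${TAG} not found, building"
  fi

If the real check is tag-level like this rather than per-platform, it would happily reuse a Linux-only image, which would be consistent with the hypothesis above.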

@Jont828
Contributor Author

Jont828 commented Oct 18, 2023

@jsturtevant Thanks for looking into this. I checked in on the job today, and in one of the runs it looks like it failed to load the CAPZ manager image and that the Docker daemon might not be running.

{Failed to load images to the bootstrap cluster: Failed to load image "capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911" into the kind cluster "capz-e2e": error listing local image capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911: failure listing container image: capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Expected
    <*errors.withStack | 0xc000c824b0>: 
    Failed to load image "capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911" into the kind cluster "capz-e2e": error listing local image capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911: failure listing container image: capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
    {
        error: <*errors.withMessage | 0xc0003f28a0>{
            cause: <*errors.withStack | 0xc000c82480>{
                error: <*errors.withMessage | 0xc0003f2840>{
                    cause: <*errors.withStack | 0xc000c82450>{
                        error: <*errors.withMessage | 0xc0003f2680>{
                            cause: <errdefs.errUnknown>{
                                error: <client.errConnectionFailed>{
                                    host: "unix:///var/run/docker.sock",
                                },
                            },
                            msg: "failure listing container image: capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911",
                        },
                        stack: [0x3389c29, 0x3808ef0, 0x3808a97, 0x38c0ac8, 0x38bfc88, 0x165f7a7, 0x165ea5c, 0x1a6f667, 0x1a804f4, 0x1a83cb8, 0x15e4d21],
                    },
                    msg: "error listing local image capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911",
                },
                stack: [0x3809393, 0x3808a97, 0x38c0ac8, 0x38bfc88, 0x165f7a7, 0x165ea5c, 0x1a6f667, 0x1a804f4, 0x1a83cb8, 0x15e4d21],
            },
            msg: "Failed to load image \"capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911\" into the kind cluster \"capz-e2e\"",
        },
        stack: [0x3808cdf, 0x38c0ac8, 0x38bfc88, 0x165f7a7, 0x165ea5c, 0x1a6f667, 0x1a804f4, 0x1a83cb8, 0x15e4d21],
    }
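
For completeness, the quickest sanity check on that last error is simply whether the Docker daemon on the build machine is reachable at all (generic commands; the Prow agent environment may differ):

  $ docker info            # fails immediately if the daemon socket is unreachable
  $ systemctl status docker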

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 30, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned (Won't fix, can't repro, duplicate, stale) Mar 30, 2024