capz-pr-upstream-k8s-ci-windows-containerd-main is extremely flaky #4144

Closed
Jont828 opened this issue Oct 16, 2023 · 15 comments
Labels
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@Jont828
Contributor

Jont828 commented Oct 16, 2023

Which jobs are failing: capz-pr-upstream-k8s-ci-windows-containerd-main

Which tests are failing: Conformance

Since when has it been failing: For some time, likely since the beginning of October.

Testgrid link: https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-pr-upstream-k8s-ci-windows-containerd-main

Reason for failure (if possible):

Anything else we need to know: It is extremely flaky, to the point where it can effectively be considered failing.

/kind failing-test


@k8s-ci-robot k8s-ci-robot added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Oct 16, 2023
@Jont828
Contributor Author

Jont828 commented Oct 16, 2023

@CecileRobertMichon @nojnhuh to follow up on what we mentioned at standup

@CecileRobertMichon
Contributor

@Jont828 are there any patterns in the errors/failures?

@nojnhuh
Contributor

nojnhuh commented Oct 17, 2023

Overall this looks very similar to #3842

@jackfrancis
Contributor

I may have just repro'd this locally

  [FAILED] Timed out after 1200.002s.
  KubeadmControlPlane object capz-conf-pp30cc/capz-conf-pp30cc-control-plane was not initialized in time
  The function passed to Eventually failed at /Users/jackfrancis/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/common.go:268 with:
  KubeadmControlPlane is not yet initialized
  Expected
      <bool>: false
  to be true
  In [It] at: /Users/jackfrancis/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/common.go:269 @ 10/17/23 10:54:50.168

  Full Stack Trace
    sigs.k8s.io/cluster-api-provider-azure/test/e2e.EnsureControlPlaneInitialized({_, _}, {{0x103075ff8, 0xc0002ffc80}, {0xc001012001, 0xb81f, 0xb820}, {0xc00041f4e0, 0x10}, {0xc000d990a0, ...}, ...}, ...)
    	/Users/jackfrancis/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/common.go:269 +0x40e
    sigs.k8s.io/cluster-api/test/framework/clusterctl.ApplyCustomClusterTemplateAndWait({_, _}, {{0x103075ff8, 0xc0002ffc80}, {0xc001012001, 0xb81f, 0xb820}, {0xc00041f4e0, 0x10}, {0xc000d990a0, ...}, ...}, ...)
    	/Users/jackfrancis/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/clusterctl/clusterctl_helpers.go:404 +0xd56
    sigs.k8s.io/cluster-api/test/framework/clusterctl.ApplyClusterTemplateAndWait({_, _}, {{0x103075ff8, 0xc0002ffc80}, {{0xc0017ac4e0, 0x5e}, {0xc0008e004c, 0x6d}, {0xc0008e3617, 0x1f}, ...}, ...}, ...)
    	/Users/jackfrancis/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/clusterctl/clusterctl_helpers.go:312 +0x994
    sigs.k8s.io/cluster-api-provider-azure/test/e2e.glob..func3.2()
    	/Users/jackfrancis/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:149 +0xad1
$ k get machinedeployments -A
NAMESPACE          NAME                      CLUSTER            REPLICAS   READY   UPDATED   UNAVAILABLE   PHASE       AGE   VERSION
capz-conf-pp30cc   capz-conf-pp30cc-md-0     capz-conf-pp30cc                                              Running     44m   v1.29.0-alpha.2.260+760599db27e272
capz-conf-pp30cc   capz-conf-pp30cc-md-win   capz-conf-pp30cc   2                  2         2             ScalingUp   44m   v1.29.0-alpha.2.260+760599db27e272
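To dig further into why the control plane never reported initialized, the KubeadmControlPlane and its machines can be inspected directly. A minimal sketch against the management cluster, reusing the resource names from this run:

  $ kubectl get kubeadmcontrolplane -A
  $ kubectl -n capz-conf-pp30cc describe kubeadmcontrolplane capz-conf-pp30cc-control-plane
  $ kubectl -n capz-conf-pp30cc get machines

The conditions in the describe output (and the corresponding AzureMachine events) should show whether the first control-plane VM ever came up.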

@jsturtevant
Contributor

From the logs of the last run it looks like kube-proxy for Windows failed to start; both of the Windows nodes came up, but kube-proxy and cloud-node-manager are failing: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/4052/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1714151216998518784/artifacts/clusters/capz-conf-qhdx2j/kube-system/kube-proxy-windows-nkq67/pod-describe.txt

  Normal   Pulled     19m (x7 over 29m)     kubelet            Container image "sigwindowstools/kube-proxy:v1.29.0-alpha.2.250_86676bce04dac0-calico-hostprocess" already present on machine
  Warning  BackOff    4m26s (x86 over 27m)  kubelet            Back-off restarting failed container kube-proxy in pod kube-proxy-windows-nkq67_kube-system(d74d0136-3139-43a0-b199-ff01041e89c4)
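
A natural next step to find out why kube-proxy keeps crash-looping would be the previous container logs from that pod; a minimal example, reusing the pod name from the run above:

  $ kubectl -n kube-system logs kube-proxy-windows-nkq67 -c kube-proxy --previous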

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/4052/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1714151216998518784/artifacts/clusters/capz-conf-qhdx2j/kube-system/cloud-node-manager-windows-k2xrq/pod-describe.txt

 Normal   Scheduled  29m                    default-scheduler  Successfully assigned kube-system/cloud-node-manager-windows-k2xrq to capz-conf-sdlj5
  Normal   Pulling    27m (x4 over 29m)      kubelet            Pulling image "capzci.azurecr.io/azure-cloud-node-manager:266c5e0"
  Warning  Failed     27m (x4 over 29m)      kubelet            Error: ErrImagePull
  Normal   BackOff    4m18s (x108 over 29m)  kubelet            Back-off pulling image "capzci.azurecr.io/azure-cloud-node-manager:266c5e0"

And kubelet is reporting an invalid image: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/4052/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1714151216998518784/artifacts/clusters/capz-conf-qhdx2j/machines/capz-conf-qhdx2j-md-win-8gm48-7f5cl/kubelet.log

[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod cloud-node-manager-windows-zcpvq_kube-system(f7c0e6ea-2f42-4063-8a32-c6e93ed938c2): ErrImagePull: rpc error: code = NotFound desc = failed to pull and unpack image "capzci.azurecr.io/azure-cloud-node-manager:266c5e0": no match for platform in manifest: not found
E1017 06:07:32.266805    3828 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cloud-node-manager\" with ErrImagePull: \"rpc error: code = NotFound desc = failed to pull and unpack image \\\"capzci.azurecr.io/azure-cloud-node-manager:266c5e0\\\": no match for platform in manifest: not found\"" pod="kube-system/cloud-node-manager-windows-zcpvq" podUID="f7c0e6ea-2f42-4063-8a32-c6e93ed938c2"

@jsturtevant
Contributor

jsturtevant commented Oct 17, 2023

A different failure also has issues with cloud-node-manager, but with a different error: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/4069/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1714025610105327616/artifacts/clusters/capz-conf-nq4wee/kube-system/cloud-node-manager-windows-xwn2p/pod-describe.txt

  Normal   Scheduled  30m                    default-scheduler  Successfully assigned kube-system/cloud-node-manager-windows-xwn2p to capz-conf-hrtrc
  Warning  Failed     29m                    kubelet            Failed to pull image "capzci.azurecr.io/azure-cloud-node-manager:266c5e0": failed to pull and unpack image "capzci.azurecr.io/azure-cloud-node-manager:266c5e0": failed to resolve reference "capzci.azurecr.io/azure-cloud-node-manager:266c5e0": failed to do request: Head "https://capzci.azurecr.io/v2/azure-cloud-node-manager/manifests/266c5e0": dial tcp: lookup capzci.azurecr.io: no such host
  Normal   Pulling    28m (x4 over 29m)      kubelet            Pulling image "capzci.azurecr.io/azure-cloud-node-manager:266c5e0"
  Warning  Failed     28m (x4 over 29m)      kubelet            Error: ErrImagePull
  Normal   BackOff    4m40s (x109 over 29m)  kubelet            Back-off pulling image "capzci.azurecr.io/azure-cloud-node-manager:266c5e0"

but the kubelet gives a similar error: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/4069/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1714025610105327616/artifacts/clusters/capz-conf-nq4wee/machines/capz-conf-nq4wee-md-win-gmdt8-dm7vq/kubelet.log

E1016 21:36:17.082538    3432 pod_workers.go:1300] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"cloud-node-manager\" with ErrImagePull: \"rpc error: code = NotFound desc = failed to pull and unpack image \\\"capzci.azurecr.io/azure-cloud-node-manager:266c5e0\\\": no match for platform in manifest: not found\"" pod="kube-system/cloud-node-manager-windows-xwn2p" podUID="8f95420f-6e73-42a9-b795-43b29df53ca4"

@jsturtevant
Contributor

jsturtevant commented Oct 17, 2023

from job https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/directory/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1714025610105327616

we see that capzci.azurecr.io/azure-cloud-node-manager:266c5e0 is Linux-only:

regctl manifest get capzci.azurecr.io/azure-cloud-controller-manager:266c5e0
Name:        capzci.azurecr.io/azure-cloud-controller-manager:266c5e0
MediaType:   application/vnd.docker.distribution.manifest.v2+json
Digest:      sha256:d4e21d25eeabdbac71f3d87614b9a98805995fb2f0f30840cad7da09a4c620e7
Total Size:  17.99MB

Config:
  Digest:    sha256:fc7aa3ed566fcabcf6761407799a3565fda8d08cf6d6dc3ce63ba053d3cbb64e
  MediaType: application/vnd.docker.container.image.v1+json
  Size:      1793B

Layers:

  Digest:    sha256:63b450eae87c42ba59c0fa815ad0e5b8cb6fb76a039cc341dbff6e744fa77a77
  MediaType: application/vnd.docker.image.rootfs.diff.tar.gzip
  Size:      83958B
// ...truncated

Update: I got that wrong, I queried the wrong image: azure-cloud-controller-manager (which should be Linux-only) instead of azure-cloud-node-manager.
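
For reference, a quick way to confirm whether the node-manager image actually carries a windows/amd64 entry is to query the correct image the same way and look for a Windows platform in the manifest list; a minimal sketch (if it prints nothing, the manifest has no Windows entry):

  $ regctl manifest get capzci.azurecr.io/azure-cloud-node-manager:266c5e0 | grep -i windows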

@jsturtevant
Contributor

feels similar to #3270

@jsturtevant
Contributor

Retriggered on #4052, which was the most recent failure, and it worked. My current thought is that somewhere the cloud-node-manager is being built as Linux-only and then not rebuilt (maybe the cloud-provider project is running jobs at the same time?).

@CecileRobertMichon
Contributor

This is the script that decides whether it can reuse the existing images or needs to rebuild them: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/scripts/ci-build-azure-ccm.sh#L101
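
For context, reuse checks like this typically ask whether the expected image tag already exists in the registry before kicking off a rebuild. A rough, hypothetical sketch of that pattern (not the actual script; the image and tag below are the ones from the failing runs):

  IMAGE="capzci.azurecr.io/azure-cloud-node-manager"
  TAG="266c5e0"  # commit-based tag

  # If a manifest for this tag can already be fetched, skip the rebuild;
  # otherwise build and push a fresh image.
  if docker manifest inspect "${IMAGE}:${TAG}" > /dev/null 2>&1; then
    echo "${IMAGE}:${TAG} already exists, skipping build"
  else
    echo "${IMAGE}:${TAG} not found, building"
  fi

If the real check is tag-level like this rather than per-platform, it would happily reuse a Linux-only image, which would be consistent with the hypothesis above.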

@Jont828
Contributor Author

Jont828 commented Oct 18, 2023

@jsturtevant Thanks for looking into this. I checked in on the job today, and in one of the runs it looks like it failed to load the CAPZ manager image and that the Docker daemon might not be running.

{Failed to load images to the bootstrap cluster: Failed to load image "capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911" into the kind cluster "capz-e2e": error listing local image capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911: failure listing container image: capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Expected
    <*errors.withStack | 0xc000c824b0>: 
    Failed to load image "capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911" into the kind cluster "capz-e2e": error listing local image capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911: failure listing container image: capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
    {
        error: <*errors.withMessage | 0xc0003f28a0>{
            cause: <*errors.withStack | 0xc000c82480>{
                error: <*errors.withMessage | 0xc0003f2840>{
                    cause: <*errors.withStack | 0xc000c82450>{
                        error: <*errors.withMessage | 0xc0003f2680>{
                            cause: <errdefs.errUnknown>{
                                error: <client.errConnectionFailed>{
                                    host: "unix:///var/run/docker.sock",
                                },
                            },
                            msg: "failure listing container image: capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911",
                        },
                        stack: [0x3389c29, 0x3808ef0, 0x3808a97, 0x38c0ac8, 0x38bfc88, 0x165f7a7, 0x165ea5c, 0x1a6f667, 0x1a804f4, 0x1a83cb8, 0x15e4d21],
                    },
                    msg: "error listing local image capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911",
                },
                stack: [0x3809393, 0x3808a97, 0x38c0ac8, 0x38bfc88, 0x165f7a7, 0x165ea5c, 0x1a6f667, 0x1a804f4, 0x1a83cb8, 0x15e4d21],
            },
            msg: "Failed to load image \"capzci.azurecr.io/cluster-api-azure-controller-amd64:20231018215911\" into the kind cluster \"capz-e2e\"",
        },
        stack: [0x3808cdf, 0x38c0ac8, 0x38bfc88, 0x165f7a7, 0x165ea5c, 0x1a6f667, 0x1a804f4, 0x1a83cb8, 0x15e4d21],
    }
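
For completeness, the quickest sanity check on that last error is simply whether the Docker daemon on the build machine is reachable at all (generic commands; the Prow agent environment may differ):

  $ docker info            # fails immediately if the daemon socket is unreachable
  $ systemctl status docker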

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 30, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot closed this as not planned (Won't fix, can't repro, duplicate, stale) Mar 30, 2024