capz-pr-upstream-k8s-ci-windows-containerd-main is extremely flaky #4144
Comments
@CecileRobertMichon @nojnhuh to follow up on what we mentioned at standup
@Jont828 are there any patterns of errors/failures?
Overall this looks very similar to #3842
I may have just repro'd this locally
From the logs of the last run it looks like kube-proxy for Windows failed to start; both of the Windows nodes came up, but kube-proxy and cloud-node-manager are failing: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/4052/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1714151216998518784/artifacts/clusters/capz-conf-qhdx2j/kube-system/kube-proxy-windows-nkq67/pod-describe.txt
And kubelet is reporting an invalid image: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/4052/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1714151216998518784/artifacts/clusters/capz-conf-qhdx2j/machines/capz-conf-qhdx2j-md-win-8gm48-7f5cl/kubelet.log
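For anyone pulling the same details from a live repro rather than the CI artifacts, something like the following should show the same events and image errors (pod and cluster names are taken from the linked run above; assumes kubeconfig access to the workload cluster):

```sh
# Point kubectl at the workload cluster from the linked run (name taken from the artifacts above)
export KUBECONFIG=./capz-conf-qhdx2j.kubeconfig

# Same data as pod-describe.txt: events for the failing Windows kube-proxy pod
kubectl -n kube-system describe pod kube-proxy-windows-nkq67

# Logs from the failing container, including any image pull / invalid image errors
kubectl -n kube-system logs kube-proxy-windows-nkq67 --previous
```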
A different failure also has issues with cloud-node-manager, but a different error: https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/4069/pull-cluster-api-provider-azure-windows-containerd-upstream-with-ci-artifacts/1714025610105327616/artifacts/clusters/capz-conf-nq4wee/kube-system/cloud-node-manager-windows-xwn2p/pod-describe.txt
Update: I got that wrong, I passed the wrong image.
Feels similar to #3270
Retriggered on #4052, which was the most recent failure, and it worked. My current thought is that somewhere the cloud-node-manager is being built for Linux only and not being re-built (maybe the cloud-provider project is running jobs at the same time?).
This is the script that decides whether it can reuse the existing images or needs to rebuild them: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/main/scripts/ci-build-azure-ccm.sh#L101
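To illustrate the theory (a hypothetical sketch, not the actual contents of ci-build-azure-ccm.sh; the variable names and make target are placeholders): if the reuse check only verifies that the tag exists, a Linux-only manifest pushed by a concurrent cloud-provider job would be reused even though the Windows variant is missing.

```sh
IMAGE="${REGISTRY}/azure-cloud-node-manager:${IMAGE_TAG}"

if docker manifest inspect "${IMAGE}" > /dev/null 2>&1; then
  # Tag exists, but a concurrent job may have pushed a Linux-only manifest.
  # Checking for a windows entry in the manifest list would catch that case.
  if docker manifest inspect "${IMAGE}" | grep -q '"os": "windows"'; then
    echo "reusing ${IMAGE} (windows variant present)"
  else
    echo "${IMAGE} exists but has no windows variant, rebuilding"
    make node-image-push   # placeholder build target
  fi
else
  echo "${IMAGE} not found, building"
  make node-image-push     # placeholder build target
fi
```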
@jsturtevant Thanks for looking into this. I checked in on the job today, and from one of these runs it looks like it failed to load the CAPZ manager image and that the Docker daemon might not be running.
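A quick local sanity check for that theory might look like this (generic commands, not taken from the CI scripts; the image and kind cluster names are placeholders):

```sh
# Is the Docker daemon reachable at all?
docker info > /dev/null 2>&1 || echo "docker daemon is not reachable"

# Was the CAPZ manager image actually built?
docker images | grep cluster-api-azure-controller

# If it was built but never loaded into the bootstrap kind cluster, load it manually
kind load docker-image "${MANAGER_IMAGE}" --name capz-e2e
```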
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Which jobs are failing: capz-pr-upstream-k8s-ci-windows-containerd-main
Which tests are failing: Conformance
Since when has it been failing: For some time, likely since the beginning of October.
Testgrid link: https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-pr-upstream-k8s-ci-windows-containerd-main
Reason for failure (if possible):
Anything else we need to know: It is extremely flaky, to the point where we can consider it failing.
/kind failing-test
[One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]