
capz-conformance is failing to start the control plane occasionally #1370

Closed

jsturtevant opened this issue May 7, 2021 · 6 comments

Labels: kind/bug, kind/flake, lifecycle/rotten

Comments

@jsturtevant (Contributor) commented on May 7, 2021:

/kind bug
/kind flake

What steps did you take and what happened:

The capz-conformance job (https://testgrid.k8s.io/provider-azure-master-signal#capz-conformance) is occasionally failing to start the control plane:

INFO: Creating the workload cluster with name "capz-conf-lxkyn6" using the "conformance-ci-artifacts" template (Kubernetes v1.22.0-alpha.1.178+d5691f754f1812, 1 control-plane machines, 2 worker machines)
INFO: Getting the cluster template yaml
INFO: clusterctl config cluster capz-conf-lxkyn6 --infrastructure (default) --kubernetes-version v1.22.0-alpha.1.178+d5691f754f1812 --control-plane-machine-count 1 --worker-machine-count 2 --flavor conformance-ci-artifacts
INFO: Applying the cluster template yaml to the cluster
INFO: Waiting for the cluster infrastructure to be provisioned
STEP: Waiting for cluster to enter the provisioned phase
INFO: Waiting for control plane to be initialized
INFO: Waiting for the first control plane machine managed by capz-conf-lxkyn6/capz-conf-lxkyn6-control-plane to be provisioned
STEP: Waiting for one control plane node to exist
[AfterEach] Conformance Tests
  /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:139
STEP: Unable to dump workload cluster logs as the cluster is nil

What did you expect to happen:

Anything else you would like to add:

A few examples:

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-conformance-master/1390439872413569024
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-conformance-master/1390303576898670592
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-conformance-master/1389668383770808320

Environment:

  • cluster-api-provider-azure version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
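
When the run fails this early the workload cluster is nil, so the usual log dump is skipped. One way to see why provisioning stalled is to dump the AzureMachine conditions left behind in the test namespace. A rough sketch along those lines (the group/version and the helper below are assumptions for illustration, not code from the repo):

```go
package debug

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// dumpAzureMachineConditions lists the AzureMachine objects in the test
// namespace and prints their status conditions, which is where a provisioning
// failure (quota, image, VM create error, ...) usually shows up when the
// workload cluster never gets far enough for a log dump.
// The group/version is an assumption; match it to the CAPZ release in use.
func dumpAzureMachineConditions(ctx context.Context, c client.Client, namespace string) error {
	list := &unstructured.UnstructuredList{}
	list.SetGroupVersionKind(schema.GroupVersionKind{
		Group:   "infrastructure.cluster.x-k8s.io",
		Version: "v1alpha4",
		Kind:    "AzureMachineList",
	})
	if err := c.List(ctx, list, client.InNamespace(namespace)); err != nil {
		return err
	}
	for _, item := range list.Items {
		conditions, _, _ := unstructured.NestedSlice(item.Object, "status", "conditions")
		fmt.Printf("%s: %v\n", item.GetName(), conditions)
	}
	return nil
}
```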
@k8s-ci-robot added the kind/bug and kind/flake labels on May 7, 2021
@chewong (Member) commented on Jun 7, 2021:

#1419 will add the ability to upload machine boot logs even if provisioning failed. That should give us more insight into why it is failing.
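
For reference, a rough sketch of the kind of boot-diagnostics lookup that would help here, assuming the track-1 Azure SDK for Go; the actual mechanism used by #1419 may well differ, and the API-version import path is an assumption:

```go
package debug

import (
	"context"
	"fmt"

	"github.com/Azure/azure-sdk-for-go/services/compute/mgmt/2020-06-01/compute" // API-version path is an assumption
	"github.com/Azure/go-autorest/autorest/azure/auth"
)

// printBootDiagnosticsURIs fetches a VM's instance view and prints the blob
// URIs for its serial console log and console screenshot, which is the kind of
// data a boot-log upload would collect for a machine that never provisioned.
func printBootDiagnosticsURIs(ctx context.Context, subscriptionID, resourceGroup, vmName string) error {
	authorizer, err := auth.NewAuthorizerFromEnvironment()
	if err != nil {
		return err
	}
	vmClient := compute.NewVirtualMachinesClient(subscriptionID)
	vmClient.Authorizer = authorizer

	view, err := vmClient.InstanceView(ctx, resourceGroup, vmName)
	if err != nil {
		return err
	}
	if view.BootDiagnostics == nil {
		fmt.Println("boot diagnostics not enabled for", vmName)
		return nil
	}
	if view.BootDiagnostics.SerialConsoleLogBlobURI != nil {
		fmt.Println("serial console log:", *view.BootDiagnostics.SerialConsoleLogBlobURI)
	}
	if view.BootDiagnostics.ConsoleScreenshotBlobURI != nil {
		fmt.Println("console screenshot:", *view.BootDiagnostics.ConsoleScreenshotBlobURI)
	}
	return nil
}
```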

@jsturtevant (Contributor, issue author) commented:

We are still seeing this error. Here are some more logs showing where it is happening:

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-conformance-1-21/1404544474662572032
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-conformance-master/1404456323445166080

STEP: Waiting for one control plane node to exist
[AfterEach] Conformance Tests
  /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:172
STEP: Unable to dump workload cluster logs as the cluster is nil
STEP: Dumping all the Cluster API resources in the "capz-conf-gxhzo3" namespace
STEP: Deleting all clusters in the capz-conf-gxhzo3 namespace
STEP: Deleting cluster capz-conf-gxhzo3
INFO: Waiting for the Cluster capz-conf-gxhzo3/capz-conf-gxhzo3 to be deleted
STEP: Waiting for cluster capz-conf-gxhzo3 to be deleted
STEP: Deleting namespace used for hosting the "conformance-tests" test spec
INFO: Deleting namespace capz-conf-gxhzo3
STEP: Redacting sensitive information from logs

• Failure [1664.426 seconds]
Conformance Tests
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:45
  conformance-tests [Measurement]
  /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/conformance_test.go:79

  Timed out after 1200.000s.
  Expected
      <bool>: false
  to be true

  /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/controlplane_helpers.go:145

  Full Stack Trace
  sigs.k8s.io/cluster-api/test/framework.WaitForOneKubeadmControlPlaneMachineToExist(0x23b2940, 0xc0000640c0, 0x7f0ad038aeb8, 0xc001395180, 0xc00050f1e0, 0xc000fb8000, 0xc000428a00, 0x2, 0x2)
  	/home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/controlplane_helpers.go:145 +0x4c5
  sigs.k8s.io/cluster-api/test/framework.DiscoveryAndWaitForControlPlaneInitialized(0x23b2940, 0xc0000640c0, 0x7f0ad038aeb8, 0xc001395180, 0xc00050f1e0, 0xc000428a00, 0x2, 0x2, 0x0)
  	/home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/controlplane_helpers.go:233 +0x5a5
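
For reference, the check that is timing out here boils down to polling until one control-plane Machine for the workload cluster has a NodeRef. A rough sketch of that wait, assuming the standard CAPI labels and a v1alpha4 import path (an approximation, not the framework's actual code):

```go
package debug

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha4" // API version is an assumption; adjust to the release in use
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForOneControlPlaneMachine approximates what
// framework.WaitForOneKubeadmControlPlaneMachineToExist is waiting on: it
// returns once any control-plane Machine for the cluster has registered a
// node (Status.NodeRef set), or errors after the timeout (1200s in the run above).
func waitForOneControlPlaneMachine(ctx context.Context, c client.Client, namespace, clusterName string, timeout time.Duration) error {
	return wait.PollImmediate(10*time.Second, timeout, func() (bool, error) {
		machines := &clusterv1.MachineList{}
		if err := c.List(ctx, machines,
			client.InNamespace(namespace),
			client.MatchingLabels{"cluster.x-k8s.io/cluster-name": clusterName},
		); err != nil {
			return false, err
		}
		for _, m := range machines.Items {
			if _, isControlPlane := m.Labels["cluster.x-k8s.io/control-plane"]; !isControlPlane {
				continue
			}
			if m.Status.NodeRef != nil {
				return true, nil // at least one control plane node exists
			}
		}
		return false, nil // keep polling until the timeout expires
	})
}
```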

@k8s-triage-robot commented:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Sep 12, 2021
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Oct 12, 2021
@k8s-triage-robot commented:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor) commented:

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
