apiserver never comes online #1319
cloud-init logs:
[cloud-init log output elided]
cloud-init logs suggest that the apiserver came online originally, but something went wrong pretty quickly afterwards.
The systemd job has an error trying to grab the kubelet config, but I imagine that's an expected race condition that the job sometimes loses between itself and whoever is paving it.
[systemd log excerpt elided]
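One way to check whether that race is transient, assuming the standard kubelet systemd unit (the unit name and log phrasing here are assumptions, not from the thread):

```bash
# On the affected control plane VM. The kubelet unit name is an assumption;
# adjust for whatever unit the bootstrap provisioner actually installs.
sudo systemctl status kubelet --no-pager

# Look for an initial config/download error followed by a successful restart,
# which would suggest the race was simply lost once and then won on retry.
sudo journalctl -u kubelet --no-pager | grep -i -E 'config|error|started' | head -n 40
```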
On the next go-around it succeeded in starting up, and the first error (apart from the spurious AWS credentials one) has to do with the network not being ready. Not sure if that's also normal, i.e. just a race between the control plane starting up and the CNI being installed:
[log excerpt elided]
Here is a list of the next several error messages, in order:
[error messages elided]
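A quick way to tell whether this is just the CNI-rollout race, assuming Calico as the CNI (the selector below is an assumption; swap it for whatever CNI the template installs):

```bash
# Check whether nodes are NotReady only because the CNI hasn't rolled out yet.
kubectl get nodes -o wide

# Calico is assumed here; substitute the selector for the CNI actually in use.
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide

# Inspect node conditions for the NetworkUnavailable -> ready transition.
kubectl describe node <node-name> | sed -n '/Conditions:/,/Addresses:/p'
```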
What k8s version is this with? Any interesting cloud-controller-manager logs? Can you also verify that /etc/azure.json has all the expected info, including credentials?
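A sketch of how one might verify /etc/azure.json without printing secrets; the field names below are the standard Azure cloud provider config keys and are assumed to apply here:

```bash
# On a control plane VM: list which fields are present without dumping values.
sudo jq 'keys' /etc/azure.json

# Spot-check that the credential fields are non-empty (values never printed).
sudo jq -r '"tenantId set: \(.tenantId != null and .tenantId != "")"' /etc/azure.json
sudo jq -r '"aadClientId set: \(.aadClientId != null and .aadClientId != "")"' /etc/azure.json
sudo jq -r '"aadClientSecret set: \(.aadClientSecret != null and .aadClientSecret != "")"' /etc/azure.json
```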
1.20.5:
[log output elided]
There is no running cloud-controller-manager container.
O.K., I now see that only 2 of 3 control plane VMs came online. Will look at CAPI logs...
It looks like the 2nd (of 3) control plane machines is stuck in:
[machine status elided]
Here it is:
[log output elided]
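For reference, a sketch of how the stuck machine can be inspected from the management cluster; the namespace and deployment names assume a default CAPZ install and may differ in a tilt-based dev environment:

```bash
# Run against the management cluster.
kubectl get machines -A -o wide          # each machine's phase (Provisioning, Running, ...)
kubectl get kubeadmcontrolplane -A       # desired vs. ready control plane replicas

# Provider controller logs; names assume a default CAPZ installation.
kubectl -n capz-system logs deploy/capz-controller-manager --tail=100
```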
I can reproduce this pretty easily with this script: https://github.com/jackfrancis/cluster-api-provider-azure/blob/repro/repro.sh

I cannot reproduce using the above script if I revert this change: […]. In other words, if I add back the […], I can't repro.

What's the current CI test coverage for > 1 control plane node clusters since […]?
The "With 3 control-plane nodes and 2 worker nodes" spec runs on every PR and every periodic job. Looking at the job history (https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capz-periodic-capi-e2e-full-main), that spec failed 2 out of the last 32 runs. One of those was an unrelated temporary issue that affected all specs in that run; the other was a timeout waiting for the second worker node (all 3 control planes came up successfully). I'm not sure what's different between the above test, which seems stable, and your repro script. Do you have an estimate of what "pretty easily" is in terms of failure rate? One difference I see is that the k8s version in CI is 1.19.7 and you're testing with 1.20.5.
Greater than 50% in my tests. Is it possible that the mgmt cluster is responsible? I'm using the config that […].
Are we currently testing 1.20 for CAPZ?
:/ just repro'd with retry enabled:
[log output elided]
These events happen while the new control planes are coming up; they do not necessarily indicate a failure. Do you have cloud-init logs that show a kubeadm join failure/error?
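A sketch of pulling the join attempt out of cloud-init output; the log path is the conventional one, and the grep patterns are assumptions:

```bash
# On the stuck control plane VM: find the kubeadm join attempt and its outcome.
sudo grep -n -i -A 5 'kubeadm join' /var/log/cloud-init-output.log

# A non-zero exit or explicit error here would point at the join itself,
# rather than the transient "coming up" events mentioned above.
sudo grep -n -i -E 'error|fail' /var/log/cloud-init-output.log | tail -n 30
```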
I can't repro this when building a 1.19.7 cluster using the […]. This strongly suggests the culprit is 1.20.5. I'll repro using that version and gather data for kubeadm.
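One way to run that A/B comparison, as a sketch: generate otherwise-identical cluster specs that differ only in Kubernetes version. The flags below match clusterctl v0.3-era syntax, and the cluster name "repro" is made up:

```bash
# Known-good version.
clusterctl config cluster repro --kubernetes-version v1.19.7 \
  --control-plane-machine-count 3 --worker-machine-count 2 > repro-1.19.7.yaml

# Suspected-bad version.
clusterctl config cluster repro --kubernetes-version v1.20.5 \
  --control-plane-machine-count 3 --worker-machine-count 2 > repro-1.20.5.yaml

# Apply each in turn and compare whether all 3 control planes join.
```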
Moved here: […]
/kind bug
What steps did you take and what happened:
Built a "default" cluster using the provided tilt-accessible template:
https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/templates/cluster-template.yaml
... and the apiserver seems to never reconcile.
What did you expect to happen:
I expected the cluster to come online.
Anything else you would like to add:
The IaaS seems to be there, but the apiserver container is not running:
[container listing elided]
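A sketch of confirming the apiserver container's state directly on the control plane VM, assuming a node where crictl is available:

```bash
# On the control plane VM (assumes crictl is installed on the node).
sudo crictl ps -a | grep kube-apiserver   # running, exited, or absent?

# If it exited, grab its logs for the failure reason (substitute the real ID).
sudo crictl logs <container-id>

# The static pod manifest should exist even when the container is down.
ls -l /etc/kubernetes/manifests/kube-apiserver.yaml
```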
Environment:
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):