kubeadm join on control plane node failing: timeout waiting for etcd #2450
@neolit123 @fabriziopandini I was able to repro with CAPZ E2E by bumping the k8s version to 1.20.5: kubernetes-sigs/cluster-api-provider-azure#1322
I repro'd this in 1.20.6 as well.
You should run kubeadm with --v 4 at least.
Do you have kubelet logs?
Maybe it's this:
kubernetes/kubernetes#99305
Try debugging what is happening with the etcd cluster.
/priority awaiting-more-evidence
/area etcd
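To make the suggestion above concrete, here is a minimal sketch of that debugging pass. The join arguments are placeholders, and the etcd certificate paths are the ones kubeadm writes by default on control plane nodes:

```sh
# Re-run the join with verbose kubeadm output (all values are placeholders).
kubeadm join <control-plane-endpoint>:6443 \
  --control-plane \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --certificate-key <key> \
  --v=4

# Capture kubelet logs from the joining node (systemd-based distros).
journalctl -u kubelet --no-pager > kubelet.log

# From an existing control plane node, check etcd membership and health.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  member list
```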
Is there a 1.20.x version that works?
Thanks for your patience while I put together a response here. I've repro'd this so far on 1.20.0, 1.20.5, and 1.20.6, so there is probably no working 1.20 version in the cluster-api-provider-azure workflow I'm using. I'll do a clean repro with kubeadm --v=4 and link to a gist w/ kubelet logs.
To add another data point, maybe it helps: we were running consistently stable CAPO e2e tests against kubeadm / Kubernetes v1.20.4. I upgraded to 1.20.6 a few days ago and the results still look pretty stable: https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-openstack#periodic-e2e-test-main&width=20 Test setup: OpenStack devstack installed on a GCP VM; CAPO creates a cluster on the devstack with Ubuntu 20 and kubeadm/Kubernetes/... 1.20.6. This test case uses 3 control plane nodes and 1 worker node.
Well, gulp, I just successfully created a 3-control-plane 1.20.6 cluster. Logs are here: https://gist.github.com/jackfrancis/60309254ba0ce86987916cbc8376bbbf Now to get a repro of a 3-node 1.20.6 failure and compare the same logs as above.
O.K., here are logs from a failed build-out: https://gist.github.com/jackfrancis/e1e54203f0bcb5cbf612e926e9776bee What I see (at a high level) is that the 2nd control plane node comes online, but I never see an etcd container launch. Seems to me that's the most interesting thing happening. The apiserver comes up on the 2nd control plane node but can't find etcd:
Where should I look for more data on why etcd isn't bootstrapping itself on node #2?
Usually the containerd and kubelet logs. There you should see whether the kubelet tries to start the etcd container and what happens then. A higher kubelet log level usually makes this easier.
Oh, and the error you see is, afaik, only that etcd is not healthy / not up. I'm not sure it's already 100% safe to assume that the container is not started at all. Maybe the etcd container is started but crashlooping for some reason; in that case the etcd container logs are also helpful.
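Concretely, that inspection on the joining node could look something like this (assuming a containerd runtime with crictl installed):

```sh
# Did the kubelet ever create an etcd container on this node?
# -a includes exited containers, which matters for a crashloop.
crictl ps -a | grep etcd

# If a container shows up (running or exited), grab its logs by ID.
crictl logs <etcd-container-id>

# Cross-check kubelet and containerd activity around the join.
journalctl -u kubelet -u containerd --since "30 min ago" --no-pager
```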
there are a lot of these errors in the kubelet on the second node:
you should resolve this. i don't know what is causing that. the actual problem is this:
as noted here:
@neolit123 your suspicion makes sense: when we enable it, it seems that the ephemeral-storage failure can be overcome w/ retries
In any event, it's not obvious (to me) why that ephemeral-storage failure never occurs on the first control plane node. We only see this when the 2nd node comes online.
from my surface understanding it's due to a race condition in the kubelet. i cannot confirm whether it only happens on joining nodes, but i can see it being tied to kubelet logic that only happens on nodes that don't have a client yet, unlike the kubeadm init node which pre-bakes a client from the get-go. in any case, that k/k issue should be pushed forward in front of sig-storage and sig-node.
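One hedged way to see whether a joining node is in that window, assuming it has registered at all, is to check what it reports as allocatable (the node name below is a placeholder):

```sh
# Does the second control plane node report allocatable ephemeral-storage yet?
# An empty result here would be consistent with the race described above.
kubectl get node <second-control-plane-node> \
  -o jsonpath="{.status.allocatable['ephemeral-storage']}"

# Or eyeball the full capacity/allocatable block.
kubectl describe node <second-control-plane-node> | grep -A 7 'Allocatable:'
```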
@neolit123 I've been made to understand that
retrying node join until it succeeds due to a race condition in the kubelet sounds like an absurd workaround to me.
@neolit123 I see your point that we want to reduce the likelihood of kubelet race conditions. In the meantime we will continue to investigate how to produce a working 1.20+ kubeadm solution for folks. I'll follow the issue you linked and close this one for now, thanks!
For those who are following this from a cluster-api-provider-azure standpoint, I've run more tests today and have 11 successes in a row building a 3-control-plane-node cluster running 1.19.7. So, I think we have meaningful confidence that this is indeed a race condition that exhibits itself in 1.20+ only.
@jackfrancis the ephemeral-storage request was added to the etcd pod in kubeadm 1.20. my discussions with @fabriziopandini on this problem recently were around an alternative to remove this request and backport the change to older kubeadm... but in my opinion we shouldn't do that, and should instead resolve the actual problem in the kubelet.
cc @bboreham for visibility, who proposed the addition of requests to the etcd pods here:
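For context, that request lands in the etcd static pod manifest kubeadm writes on each control plane node. The values in the comment below are my recollection of the 1.20 defaults, so verify against your own node:

```sh
# On a control plane node, inspect the etcd static pod kubeadm generated.
# In 1.20+ you should see a requests block roughly like:
#   resources:
#     requests:
#       cpu: 100m
#       ephemeral-storage: 100Mi
#       memory: 100Mi
grep -A 5 'resources:' /etc/kubernetes/manifests/etcd.yaml
```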
Is this a BUG REPORT or FEATURE REQUEST?
BUG REPORT
Versions
kubeadm version (use `kubeadm version`):
kubeadm version: &version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:08:27Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Environment:
Kubernetes version (use `kubectl version`):
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
NAME="Ubuntu"
VERSION="18.04.5 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.5 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic
Kernel (e.g. `uname -a`):
Linux acse-test-capz-repro-c8cd6-control-plane-9kvrx 5.4.0-1041-azure #43~18.04.1-Ubuntu SMP Fri Feb 26 13:02:32 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Cluster built using cluster-api from this capz example template:
https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/templates/cluster-template.yaml
tl;dr 3 control plane nodes, 1 node pool w/ 1 worker node
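As a rough sketch of how that topology gets rendered from the template (the cluster name is arbitrary, and these are the clusterctl flags as I recall them from that release; the real repro script linked below handles the provider-specific environment variables):

```sh
# Render a CAPZ cluster manifest: 3 control plane nodes, 1 worker,
# pinned to the Kubernetes version under test.
clusterctl config cluster capz-repro \
  --kubernetes-version v1.20.5 \
  --control-plane-machine-count=3 \
  --worker-machine-count=1 > cluster.yaml

# Apply against a management cluster that has CAPZ installed.
kubectl apply -f cluster.yaml
```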
What happened?
From the cloud-init logs, kubeadm tells us that it timed out waiting for etcd:
What you expected to happen?
This does not repro in other Kubernetes versions. I've tested 1.19.7 specifically. I expected 1.20.5 to bootstrap as 1.19.7 does.
How to reproduce it (as minimally and precisely as possible)?
I have a repro script:
https://github.com/jackfrancis/cluster-api-provider-azure/blob/repro/repro.sh
Anything else we need to know?