kubeadm join on control plane node failing: timeout waiting for etcd #2450

Closed
jackfrancis opened this issue Apr 16, 2021 · 20 comments
Labels
area/etcd, priority/awaiting-more-evidence

Comments

@jackfrancis

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):

kubeadm version: &version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:08:27Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: cluster-api / capz
  • OS (e.g. from /etc/os-release):
    NAME="Ubuntu"
    VERSION="18.04.5 LTS (Bionic Beaver)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 18.04.5 LTS"
    VERSION_ID="18.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=bionic
    UBUNTU_CODENAME=bionic
  • Kernel (e.g. uname -a):
    Linux acse-test-capz-repro-c8cd6-control-plane-9kvrx 5.4.0-1041-azure #43~18.04.1-Ubuntu SMP Fri Feb 26 13:02:32 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Others:
    Cluster built using cluster-api from this capz example template:

https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/templates/cluster-template.yaml

tl;dr 3 control plane nodes, 1 node pool w/ 1 worker node

What happened?

  1. The first control plane node comes online and Ready (kubeadm init)
  2. The 2nd control plane node bootstraps but never comes online/Ready (kubeadm join)

From the cloud-init logs, kubeadm tells us that it timed out waiting for etcd:

[2021-04-16 22:09:39] [etcd] Announced new etcd member joining to the existing etcd cluster
[2021-04-16 22:09:39] [etcd] Creating static Pod manifest for "etcd"
[2021-04-16 22:09:39] [etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
[2021-04-16 22:10:12] [kubelet-check] Initial timeout of 40s passed.
[2021-04-16 22:42:38] error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: timeout waiting for etcd cluster to be available

What you expected to happen?

This does not repro in other Kubernetes versions. I've tested 1.19.7 specifically. I expected 1.20.5 to bootstrap as 1.19.7 does.

How to reproduce it (as minimally and precisely as possible)?

I have a repro script:

https://github.com/jackfrancis/cluster-api-provider-azure/blob/repro/repro.sh

Anything else we need to know?

@jackfrancis
Author

I repro'd this in 1.20.6 as well.

@neolit123
Member

neolit123 commented Apr 16, 2021 via email

@k8s-ci-robot added the priority/awaiting-more-evidence and area/etcd labels Apr 16, 2021
@neolit123
Member

neolit123 commented Apr 17, 2021 via email

@jackfrancis
Author

Thanks for your patience while I put together a response here.

I've repro'd this so far on 1.20.0, 1.20.5, and 1.20.6. So there is probably no working 1.20 version in the cluster-api-provider-azure workflow I'm using.

I'll do a clean repro with kubeadm --v=4 and link to a gist w/ kubelet logs.
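
For reference, roughly what I plan to capture on the stuck node (the extra kubeadm verbosity should come from the verbosity field in the KubeadmControlPlane's kubeadmConfigSpec, if I have the field name right; paths below assume a standard kubeadm/cloud-init Ubuntu image):

# on the joining (stuck) control plane node
sudo journalctl -u kubelet --no-pager > kubelet.log
sudo journalctl -u containerd --no-pager > containerd.log
sudo cp /var/log/cloud-init-output.log .    # includes the kubeadm join output from bootstrap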

@sbueringer
Member

sbueringer commented Apr 21, 2021

To add another data point, maybe it helps. We were running consistently stable CAPO e2e tests against kubeadm / Kubernetes v1.20.4. I upgraded to 1.20.6 a few days ago. The results still look pretty stable: https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-openstack#periodic-e2e-test-main&width=20

Test setup: OpenStack devstack installed on a GCP VM; CAPO creates a cluster on the devstack with Ubuntu 20 and kubeadm/Kubernetes/... 1.20.6. This test case uses 3 control plane nodes and 1 worker node.
E.g. logs of one of the control plane nodes: https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-provider-openstack-e2e-test-main/1384778504263962624/artifacts/clusters/bootstrap/instances/e2e-fo6lw6/cluster-e2e-fo6lw6-control-plane-b2fgl/cloud-final.log

@jackfrancis
Author

Well, gulp, I just successfully created a 3-control plane 1.20.6 cluster. Logs are here:

https://gist.github.com/jackfrancis/60309254ba0ce86987916cbc8376bbbf

Now to get a repro of a 3 node 1.20.6 failure and compare the same logs as above.

@jackfrancis
Author

O.K., here are logs from a failed build-out:

https://gist.github.com/jackfrancis/e1e54203f0bcb5cbf612e926e9776bee

What I see (from a high level) is that the 2nd control plane node comes online, but I never see an etcd container launch. Seems to me that's the most interesting thing happening. The apiserver comes up on the 2nd control plane node but can't find etcd:

$ sudo crictl logs f988a3dcffae2
Flag --insecure-port has been deprecated, This flag has no effect now and will be removed in v1.24.
I0421 18:45:45.061936       1 server.go:632] external host was not specified, using 10.0.0.5
I0421 18:45:45.062500       1 server.go:182] Version: v1.20.6
I0421 18:45:46.067019       1 shared_informer.go:240] Waiting for caches to sync for node_authorizer
I0421 18:45:46.068258       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I0421 18:45:46.068290       1 plugins.go:161] Loaded 10 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
I0421 18:45:46.069587       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I0421 18:45:46.069610       1 plugins.go:161] Loaded 10 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
I0421 18:45:46.072107       1 client.go:360] parsed scheme: "endpoint"
I0421 18:45:46.072152       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379  <nil> 0 <nil>}]
W0421 18:45:46.072495       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...

Where to look for more data on why etcd isn't bootstrapping itself on node #2?

@sbueringer
Member

Usually, the containerd and kubelet logs. There you should see whether the kubelet tries to start the etcd container and what happens then. A higher kubelet log level usually makes it easier.
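
Something like this, roughly (the kubeadm-flags.env path is the usual location on a kubeadm-provisioned node; adjust if your image differs):

# follow kubelet and containerd while the join is stuck
sudo journalctl -u kubelet -f
sudo journalctl -u containerd -f

# one way to raise kubelet verbosity: append --v=4 to KUBELET_KUBEADM_ARGS
# in /var/lib/kubelet/kubeadm-flags.env, then restart the kubelet
sudo systemctl restart kubelet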

@sbueringer
Member

Oh, and the error you see is afaik only that etcd is not healthy / not up. I'm not sure it's safe to assume the container was not started at all. Maybe the etcd container started but is crashlooping for some reason. In that case the etcd container logs are also helpful.
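
Roughly (the container id below is a placeholder):

# was an etcd container created at all, and is it restarting?
sudo crictl ps -a | grep etcd

# if one exists (even exited), its logs are worth a look
sudo crictl logs <etcd-container-id>

# the static pod manifest kubeadm wrote should also be on disk
ls -l /etc/kubernetes/manifests/etcd.yaml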

@neolit123
Member

neolit123 commented Apr 21, 2021

there are a lot of these errors in the kubelet on the second node:

Apr 21 18:39:34 acse-test-93710-control-plane-l5z4q kubelet[2105]: E0421 18:39:34.974782    2105 driver-call.go:266] Failed to unmarshal output for command: init, output: "", error: unexpected end of JSON input
Apr 21 18:39:34 acse-test-93710-control-plane-l5z4q kubelet[2105]: W0421 18:39:34.974828    2105 driver-call.go:149] FlexVolume: driver call failed: executable: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds, args: [init], error: fork/exec /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds: no such file or directory, output: ""
Apr 21 18:39:34 acse-test-93710-control-plane-l5z4q kubelet[2105]: E0421 18:39:34.974869    2105 plugins.go:747] Error dynamically probing plugins: Error creating Flexvolume plugin from directory nodeagent~uds, skipping. Error: unexpected end of JSON input
Apr 21 18:39:34 acse-test-93710-control-plane-l5z4q kubelet[2105]: E0421 18:39:34.975564    2105 driver-call.go:266] Failed to unmarshal output for command: init, output: "", error: unexpected end of JSON input
Apr 21 18:39:34 acse-test-93710-control-plane-l5z4q kubelet[2105]: W0421 18:39:34.975585    2105 driver-call.go:149] FlexVolume: driver call failed: executable: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds, args: [init], error: fork/exec /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds: no such file or directory, output: ""
Apr 21 18:39:34 acse-test-93710-control-plane-l5z4q kubelet[2105]: E0421 18:39:34.975610    2105 plugins.go:747] Error dynamically probing plugins: Error creating Flexvolume plugin from directory nodeagent~uds, skipping. Error: unexpected end of JSON input

you should resolve this. i don't know what is causing that.

the actual problem is this:

Apr 21 18:39:34 acse-test-93710-control-plane-l5z4q kubelet[2105]: W0421 18:39:34.788517    2105 predicate.go:113] Failed to admit pod etcd-acse-test-93710-control-plane-l5z4q_kube-system(d5445ac6339e94c35a047386352476ea) - Unexpected error while attempting to recover from admission failure: preemption: error finding a set of pods to preempt: no set of running pods found to reclaim resources: [(res: ephemeral-storage, q: 104857600), ]

as noted here:
#2450 (comment)
kubernetes/kubernetes#99305
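
a couple of checks that might help confirm it is that admission race (node name is a placeholder):

# confirm the admission failure in the kubelet log on the joining node
sudo journalctl -u kubelet | grep -i "Failed to admit pod etcd"

# check what ephemeral-storage the kubelet reports as allocatable; a missing or zero
# value right after startup would explain the 100Mi (104857600) request being rejected
kubectl get node <joining-node> -o jsonpath='{.status.allocatable.ephemeral-storage}'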

@jackfrancis
Author

@jackfrancis
Author

@neolit123 your suspicion makes sense: when we enable useExperimentalRetryJoin it's harder to repro this

it seems that the ephemeral-storage failure can be overcome w/ retries
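
For reference, this is roughly how we enable it on the KubeadmControlPlane (field path per the cluster-api bootstrap API, object name is a placeholder; I may not have the path exactly right):

kubectl patch kubeadmcontrolplane <cluster-name>-control-plane --type merge \
  -p '{"spec":{"kubeadmConfigSpec":{"useExperimentalRetryJoin":true}}}'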

@jackfrancis
Author

In any event, it is not obvious (to me) why that ephemeral-storage failure never occurs on the first control plane node. We only see it when the 2nd node comes online.

@neolit123
Member

from my surface understanding it's due to a race condition in the kubelet. i cannot confirm whether it only happens on joining nodes, but i can see it being tied to kubelet logic that only happens on nodes that don't have a client yet, unlike the kubeadm init node which pre-bakes a client from the get-go.

in any case that k/k issue should be pushed forward in front of sig-storage and sig-node.
if you have people that can join the respective meetings and talk with maintainers it might speed things up.
it also must be backported to older releases.

@jackfrancis
Author

@neolit123 I've been made to understand that useExperimentalRetryJoin is discouraged for kubeadm 1.19+. Is that correct? Is there a blessed way to "retry kubeadm join" to overcome such race conditions?

@neolit123
Member

retrying node join until it succeeds, because of a race condition in the kubelet, sounds like an absurd workaround to me.
the bug in the kubelet must be fixed, but until then you can use any means available to mitigate.
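
for illustration only, such a retry loop looks roughly like this (join arguments are placeholders, and a half-completed join may need extra cleanup beyond kubeadm reset):

for i in 1 2 3; do
  sudo kubeadm join <LB_ENDPOINT>:6443 --control-plane \
    --token <TOKEN> --discovery-token-ca-cert-hash sha256:<HASH> \
    --certificate-key <KEY> --v=4 && break
  sudo kubeadm reset -f    # crude cleanup between attempts
  sleep 30
done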

@jackfrancis
Author

@neolit123 I see your point that we want to reduce the likelihood of kubelet race conditions

In the meantime we will continue to investigate how to produce a working 1.20+ kubeadm solution for folks.

I'll follow the issue you linked and close this one for now, thanks!

@jackfrancis
Author

For those who are following this from a cluster-api-provider-azure standpoint, I've run more tests today and have 11 successes in a row building a 3-control-plane-node cluster running 1.19.7 (useExperimentalRetryJoin disabled). The purpose of these additional tests is to gain more confidence that the symptom we're seeing is specific to 1.20+.

So, I think we have meaningful confidence that this is indeed a race condition that exhibits itself in 1.20+ only.

@neolit123
Member

@jackfrancis the ephemeral storage request was added to the etcd pod in kubeadm 1.20:
kubernetes/kubernetes@2ebd293#diff-509876ddb9971f8787c885617ebaabe5da81755c5f1cddc1e2877af0d47928b2R255

my recent discussions with @fabriziopandini on this problem were around an alternative: remove this request and backport the change to older kubeadm... but in my opinion we shouldn't do that and should instead resolve the actual problem in the kubelet.
this can trigger a bug for any static pod that requests ephemeral storage.
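
for anyone checking an affected node, the request in question is visible in the generated manifest:

# per the linked commit, the etcd static pod now requests 100Mi of ephemeral-storage
# (which matches the 104857600 in the kubelet admission error above)
grep -A 6 'resources:' /etc/kubernetes/manifests/etcd.yaml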

cc @bboreham for visibility, who proposed the addition of requests to the etcd pods here:
#2195
