kubeadm join on control plane node failing: timeout waiting for etcd #2450

Closed
jackfrancis opened this issue Apr 16, 2021 · 20 comments
Labels
area/etcd, priority/awaiting-more-evidence

Comments

@jackfrancis

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version (use kubeadm version):

kubeadm version: &version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:08:27Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.5", GitCommit:"6b1d87acf3c8253c123756b9e61dac642678305f", GitTreeState:"clean", BuildDate:"2021-03-18T01:10:43Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: cluster-api / capz
  • OS (e.g. from /etc/os-release):
    NAME="Ubuntu"
    VERSION="18.04.5 LTS (Bionic Beaver)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 18.04.5 LTS"
    VERSION_ID="18.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=bionic
    UBUNTU_CODENAME=bionic
  • Kernel (e.g. uname -a):
    Linux acse-test-capz-repro-c8cd6-control-plane-9kvrx 5.4.0-1041-azure #43~18.04.1-Ubuntu SMP Fri Feb 26 13:02:32 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
  • Others:
    Cluster built using cluster-api from this capz example template:

https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/templates/cluster-template.yaml

tl;dr 3 control plane nodes, 1 node pool w/ 1 worker node

What happened?

  1. The first control plane node comes online and Ready (kubeadm init)
  2. The 2nd control plane node bootstraps but never comes online/Ready (kubeadm join)

From the cloud-init logs, kubeadm tells us that it timed out waiting for etcd:

[2021-04-16 22:09:39] [etcd] Announced new etcd member joining to the existing etcd cluster
[2021-04-16 22:09:39] [etcd] Creating static Pod manifest for "etcd"
[2021-04-16 22:09:39] [etcd] Waiting for the new etcd member to join the cluster. This can take up to 40s
[2021-04-16 22:10:12] [kubelet-check] Initial timeout of 40s passed.
[2021-04-16 22:42:38] error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: timeout waiting for etcd cluster to be available

What you expected to happen?

This does not repro in other Kubernetes versions. I've tested 1.19.7 specifically. I expected 1.20.5 to bootstrap as 1.19.7 does.

How to reproduce it (as minimally and precisely as possible)?

I have a repro script:

https://github.com/jackfrancis/cluster-api-provider-azure/blob/repro/repro.sh

Anything else we need to know?

@jackfrancis
Author

I repro'd this in 1.20.6 as well.

@neolit123
Member

neolit123 commented Apr 16, 2021 via email

@k8s-ci-robot added the priority/awaiting-more-evidence and area/etcd labels Apr 16, 2021
@neolit123
Member

neolit123 commented Apr 17, 2021 via email

@jackfrancis
Author

Thanks for your patience while I put together a response here.

I've repro'd this so far on 1.20.0, 1.20.5, and 1.20.6. So there is probably no working 1.20 version in the cluster-api-provider-azure workflow I'm using.

I'll do a clean repro with kubeadm --v=4 and link to a gist w/ kubelet logs.
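
For reference, roughly what I plan to capture on the stuck node (the extra kubeadm verbosity should come from the verbosity field in the KubeadmControlPlane's kubeadmConfigSpec, if I have the field name right; paths below assume a standard kubeadm/cloud-init Ubuntu image):

# on the joining (stuck) control plane node
sudo journalctl -u kubelet --no-pager > kubelet.log
sudo journalctl -u containerd --no-pager > containerd.log
sudo cp /var/log/cloud-init-output.log .    # includes the kubeadm join output from bootstrap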

@sbueringer
Member

sbueringer commented Apr 21, 2021

To add another data point, maybe it helps. We were running consistently stable CAPO e2e tests against kubeadm / Kubernetes v1.20.4. I upgraded to 1.20.6 a few days ago. The results still look pretty stable: https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-openstack#periodic-e2e-test-main&width=20

Test setup: OpenStack devstack installed on a GCP VM; CAPO creates a cluster on the devstack with Ubuntu 20 and kubeadm/Kubernetes/... 1.20.6. This test case uses 3 control plane nodes and 1 worker node.
E.g. logs of one of the control plane nodes: https://storage.googleapis.com/kubernetes-jenkins/logs/periodic-cluster-api-provider-openstack-e2e-test-main/1384778504263962624/artifacts/clusters/bootstrap/instances/e2e-fo6lw6/cluster-e2e-fo6lw6-control-plane-b2fgl/cloud-final.log

@jackfrancis
Author

Well, gulp, I just successfully created a 3-control plane 1.20.6 cluster. Logs are here:

https://gist.github.com/jackfrancis/60309254ba0ce86987916cbc8376bbbf

Now to get a repro of a 3 node 1.20.6 failure and compare the same logs as above.

@jackfrancis
Author

O.K., here are logs from a failed build-out:

https://gist.github.com/jackfrancis/e1e54203f0bcb5cbf612e926e9776bee

What I see (from a high level) is that the 2nd control plane node comes online, but I never see an etcd container launch. Seems to me that's the most interesting thing happening. The apiserver comes up on the 2nd control plane node but can't find etcd:

$ sudo crictl logs f988a3dcffae2
Flag --insecure-port has been deprecated, This flag has no effect now and will be removed in v1.24.
I0421 18:45:45.061936       1 server.go:632] external host was not specified, using 10.0.0.5
I0421 18:45:45.062500       1 server.go:182] Version: v1.20.6
I0421 18:45:46.067019       1 shared_informer.go:240] Waiting for caches to sync for node_authorizer
I0421 18:45:46.068258       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I0421 18:45:46.068290       1 plugins.go:161] Loaded 10 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
I0421 18:45:46.069587       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
I0421 18:45:46.069610       1 plugins.go:161] Loaded 10 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
I0421 18:45:46.072107       1 client.go:360] parsed scheme: "endpoint"
I0421 18:45:46.072152       1 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://127.0.0.1:2379  <nil> 0 <nil>}]
W0421 18:45:46.072495       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://127.0.0.1:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...

Where to look for more data on why etcd isn't bootstrapping itself on node #2?

@sbueringer
Member

Usually, the containerd and kubelet logs. There you should see whether the kubelet tries to start the etcd container and what happens then. A higher kubelet log level usually makes it easier.
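
Something like this, roughly (the kubeadm-flags.env path is the usual location on a kubeadm-provisioned node; adjust if your image differs):

# follow kubelet and containerd while the join is stuck
sudo journalctl -u kubelet -f
sudo journalctl -u containerd -f

# one way to raise kubelet verbosity: append --v=4 to KUBELET_KUBEADM_ARGS
# in /var/lib/kubelet/kubeadm-flags.env, then restart the kubelet
sudo systemctl restart kubelet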

@sbueringer
Member

Oh, and the error you see is afaik only that etcd is not healthy / not up. I'm not sure it's safe to assume the container was not started at all. Maybe the etcd container started but is crashlooping for some reason. In that case the etcd container logs are also helpful.
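
Roughly (the container id below is a placeholder):

# was an etcd container created at all, and is it restarting?
sudo crictl ps -a | grep etcd

# if one exists (even exited), its logs are worth a look
sudo crictl logs <etcd-container-id>

# the static pod manifest kubeadm wrote should also be on disk
ls -l /etc/kubernetes/manifests/etcd.yaml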

@neolit123
Member

neolit123 commented Apr 21, 2021

there are a lot of these errors in the kubelet on the second node:

Apr 21 18:39:34 acse-test-93710-control-plane-l5z4q kubelet[2105]: E0421 18:39:34.974782    2105 driver-call.go:266] Failed to unmarshal output for command: init, output: "", error: unexpected end of JSON input
Apr 21 18:39:34 acse-test-93710-control-plane-l5z4q kubelet[2105]: W0421 18:39:34.974828    2105 driver-call.go:149] FlexVolume: driver call failed: executable: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds, args: [init], error: fork/exec /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds: no such file or directory, output: ""
Apr 21 18:39:34 acse-test-93710-control-plane-l5z4q kubelet[2105]: E0421 18:39:34.974869    2105 plugins.go:747] Error dynamically probing plugins: Error creating Flexvolume plugin from directory nodeagent~uds, skipping. Error: unexpected end of JSON input
Apr 21 18:39:34 acse-test-93710-control-plane-l5z4q kubelet[2105]: E0421 18:39:34.975564    2105 driver-call.go:266] Failed to unmarshal output for command: init, output: "", error: unexpected end of JSON input
Apr 21 18:39:34 acse-test-93710-control-plane-l5z4q kubelet[2105]: W0421 18:39:34.975585    2105 driver-call.go:149] FlexVolume: driver call failed: executable: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds, args: [init], error: fork/exec /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds/uds: no such file or directory, output: ""
Apr 21 18:39:34 acse-test-93710-control-plane-l5z4q kubelet[2105]: E0421 18:39:34.975610    2105 plugins.go:747] Error dynamically probing plugins: Error creating Flexvolume plugin from directory nodeagent~uds, skipping. Error: unexpected end of JSON input

you should resolve this. i don't know what is causing that.

the actual problem is this:

Apr 21 18:39:34 acse-test-93710-control-plane-l5z4q kubelet[2105]: W0421 18:39:34.788517    2105 predicate.go:113] Failed to admit pod etcd-acse-test-93710-control-plane-l5z4q_kube-system(d5445ac6339e94c35a047386352476ea) - Unexpected error while attempting to recover from admission failure: preemption: error finding a set of pods to preempt: no set of running pods found to reclaim resources: [(res: ephemeral-storage, q: 104857600), ]

as noted here:
#2450 (comment)
kubernetes/kubernetes#99305
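
a couple of checks that might help confirm it is that admission race (node name is a placeholder):

# confirm the admission failure in the kubelet log on the joining node
sudo journalctl -u kubelet | grep -i "Failed to admit pod etcd"

# check what ephemeral-storage the kubelet reports as allocatable; a missing or zero
# value right after startup would explain the 100Mi (104857600) request being rejected
kubectl get node <joining-node> -o jsonpath='{.status.allocatable.ephemeral-storage}'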

@jackfrancis
Author

@jackfrancis
Author

@neolit123 your suspicion makes sense: when we enable useExperimentalRetryJoin it's harder to repro this

it seems that the ephemeral-storage failure can be overcome w/ retries
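
For reference, this is roughly how we enable it on the KubeadmControlPlane (field path per the cluster-api bootstrap API, object name is a placeholder; I may not have the path exactly right):

kubectl patch kubeadmcontrolplane <cluster-name>-control-plane --type merge \
  -p '{"spec":{"kubeadmConfigSpec":{"useExperimentalRetryJoin":true}}}'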

@jackfrancis
Author

In any event, it is not obvious (to me) why that ephemeral-storage failure never occurs on the first control plane node. We only see it when the 2nd node comes online.

@neolit123
Member

from my surface understanding it's due to a race condition in the kubelet. i cannot confirm whether it only happens on joining nodes, but i can see it being tied to kubelet logic that only happens on nodes that don't have a client yet, unlike the kubeadm init node which pre-bakes a client from the get-go.

in any case that k/k issue should be pushed forward in front of sig-storage and sig-node.
if you have people that can join the respective meetings and talk with maintainers it might speed things up.
it also must be backported to older releases.

@jackfrancis
Author

@neolit123 I've been made to understand that useExperimentalRetryJoin is discouraged for kubeadm 1.19+. Is that correct? Is there a blessed way to "retry kubeadm join" to overcome such race conditions?

@neolit123
Member

retrying node join until it succeeds, because of a race condition in the kubelet, sounds like an absurd workaround to me.
the bug in the kubelet must be fixed, but until then you can use any means available to mitigate.
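
for illustration only, such a retry loop looks roughly like this (join arguments are placeholders, and a half-completed join may need extra cleanup beyond kubeadm reset):

for i in 1 2 3; do
  sudo kubeadm join <LB_ENDPOINT>:6443 --control-plane \
    --token <TOKEN> --discovery-token-ca-cert-hash sha256:<HASH> \
    --certificate-key <KEY> --v=4 && break
  sudo kubeadm reset -f    # crude cleanup between attempts
  sleep 30
done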

@jackfrancis
Author

@neolit123 I see your point that we want to reduce the likelihood of kubelet race conditions

In the meantime we will continue to investigate how to produce a working 1.20+ kubeadm solution for folks.

I'll follow the issue you linked and close this one for now, thanks!

@jackfrancis
Author

For those who are following this from a cluster-api-provider-azure standpoint, I've run more tests today and have 11 successes in a row building a 3-control-plane-node cluster running 1.19.7 (useExperimentalRetryJoin disabled). The purpose of these additional tests is to gain more confidence that the symptom we're seeing is specific to 1.20+.

So, I think we have meaningful confidence that this is indeed a race condition that exhibits itself in 1.20+ only.

@neolit123
Member

@jackfrancis the ephemeral storage request was added to the etcd pod in kubeadm 1.20:
kubernetes/kubernetes@2ebd293#diff-509876ddb9971f8787c885617ebaabe5da81755c5f1cddc1e2877af0d47928b2R255

my recent discussions with @fabriziopandini on this problem were around an alternative: remove this request and backport the change to older kubeadm... but in my opinion we shouldn't do that and should instead resolve the actual problem in the kubelet.
this can trigger a bug for any static pod that requests ephemeral storage.
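
for anyone checking an affected node, the request in question is visible in the generated manifest:

# per the linked commit, the etcd static pod now requests 100Mi of ephemeral-storage
# (which matches the 104857600 in the kubelet admission error above)
grep -A 6 'resources:' /etc/kubernetes/manifests/etcd.yaml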

cc @bboreham for visibility, who proposed the addition of requests to the etcd pods here:
#2195
