
RFE: a consistent way for infrastructure providers to report bootstrap success or failure #2554

Closed
ncdc opened this issue Mar 5, 2020 · 10 comments
Labels
kind/proposal Issues or PRs related to proposals.

Comments

ncdc (Contributor) commented Mar 5, 2020

⚠️ Cluster API maintainers can ask to turn an issue-proposal into a CAEP when necessary; this is to be expected for large changes that impact multiple components, breaking changes, or new large features.

Goals

  1. Provide a definitive way to determine if a machine bootstrapped successfully or not

Non-Goals/Future Work

  1. N/A

User Story

As a user, I would like to know if a Machine failed to bootstrap, so that I don't have a node joining the cluster that may not be fully functional.

Detailed Description

There is currently nothing in the contract for infrastructure providers that indicates whether machine bootstrapping succeeded or failed. We have seen multiple instances (with the kubeadm bootstrap provider) where cloud-init runs, kubeadm join executes, and a new Node joins the workload cluster, but kubeadm actually failed (exited with a non-zero code). In some circumstances, the Node joins the cluster but may be missing default taints/labels (such as the master node role).

I'd like us to find a way for an infrastructure provider to report if bootstrapping succeeded or failed. The exact manner by which each infrastructure provider checks for success or failure will probably need to vary, but we should be able to define a common status field in each "infrastructure machine" that indicates success or failure.
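To make the idea concrete, here is a minimal sketch of what a common status field could look like on an infrastructure machine type. The FooMachine name and the BootstrapSucceeded field are purely illustrative, not part of any existing contract:

```go
package v1alpha3

// FooMachineStatus sketches what a hypothetical infrastructure machine's
// status could look like with the proposed field; neither FooMachine nor
// BootstrapSucceeded exists in any provider today.
type FooMachineStatus struct {
	// Ready is the existing contract field indicating the infrastructure is ready.
	Ready bool `json:"ready"`

	// BootstrapSucceeded would be set by the infrastructure provider once it
	// determines whether the bootstrap data ran to completion on the host:
	// nil = unknown, true = succeeded, false = failed.
	// +optional
	BootstrapSucceeded *bool `json:"bootstrapSucceeded,omitempty"`
}
```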

I'm not sure what we should do exactly around remediating Machines in this state. We could potentially integrate with MachineHealthCheck, and we'll need to figure something out for KubeadmControlPlane Machines too.

Contract changes [optional]

  • All infrastructure providers must populate the new field/condition described below based on bootstrap success or failure
  • Machine controller copies field/condition to its own status

Data model changes [optional]

  • Add a way (status field or condition) for infrastructure machine CRDs to indicate bootstrap succeeded/failed

/kind proposal

k8s-ci-robot added the kind/proposal label on Mar 5, 2020
ncdc (Contributor, Author) commented Mar 5, 2020

xref kubernetes-sigs/cluster-api-provider-aws#972 for the original report

vincepri (Member) commented Mar 5, 2020

/milestone Next

yastij (Member) commented Apr 1, 2020

@ncdc - it depends on how much granularity we want here. I'd be fine just knowing whether cloud-init succeeded or not. A trivial way to do it is to label or annotate the corresponding Node object on startup.
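For illustration only, a minimal client-go sketch of what such a step could look like if it ran at the end of cloud-init; the label key, the kubeconfig path, and the assumption that the Node name equals the hostname are all invented for the example:

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig on the host whose RBAC allows patching Nodes; as
	// noted in the next comment, that generally only holds on control plane
	// nodes, not on workers.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/admin.conf")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodeName, _ := os.Hostname()
	// Hypothetical label recording that cloud-init finished successfully.
	patch := []byte(`{"metadata":{"labels":{"bootstrap.cluster.x-k8s.io/succeeded":"true"}}}`)
	if _, err := client.CoreV1().Nodes().Patch(context.TODO(), nodeName,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```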

detiber (Member) commented Apr 1, 2020

@yastij that is definitely a possibility; however, it assumes that there are credentials available on the host with enough permissions to modify the Node object after starting. If I'm not mistaken, that is only the case for control plane Nodes and not for general worker Nodes.

dlipovetsky (Contributor) commented:

I'd be fine just knowing whether cloud-init succeeded or not. A trivial way to do it is to label or annotate the corresponding Node object on startup.

Is it common for cloud-init to fail, but for the Node resource to be created?

dlipovetsky (Contributor) commented Apr 1, 2020

I would be in favor of using a reporting mechanism that handles a broad range of bootstrap failures. And that implies the mechanism is not related to whatever is being bootstrapped (Kubernetes, in this case). I think the mechanism will have to be infrastructure-specific.

(Thinking out loud) For example, CAPA could fetch the console output of an EC2 instance and look for a sentinel value. If the value isn't there after some time, CAPA can decide that the instance failed to bootstrap and save the console output somewhere in the Cluster API control plane. [Edit: Looks like my mind rehashed the discussion in kubernetes-sigs/cluster-api-provider-aws/issues/972, which I failed to read before writing down my thoughts!]
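A rough sketch of that idea using the aws-sdk-go EC2 GetConsoleOutput call; the sentinel string is invented for the example, and this is not how CAPA is actually implemented:

```go
package main

import (
	"encoding/base64"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// bootstrapSucceeded checks the instance console output for a sentinel line
// that the bootstrap script would print on success. Both the sentinel and the
// overall approach are illustrative only.
func bootstrapSucceeded(instanceID string) (bool, error) {
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)

	out, err := svc.GetConsoleOutput(&ec2.GetConsoleOutputInput{
		InstanceId: aws.String(instanceID),
	})
	if err != nil {
		return false, err
	}
	if out.Output == nil {
		return false, nil // console output not available yet; retry later
	}
	decoded, err := base64.StdEncoding.DecodeString(aws.StringValue(out.Output))
	if err != nil {
		return false, err
	}
	// Hypothetical sentinel that cloud-init would emit at the end of bootstrap.
	return strings.Contains(string(decoded), "CAPI_BOOTSTRAP_SUCCESS"), nil
}
```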

vincepri (Member) commented Apr 1, 2020

I think the mechanism will have to be infrastructure-specific.

+1 to this. CAPI can offer common (optional) ways to tap into the information infrastructure providers expose.

ncdc (Contributor, Author) commented Apr 1, 2020

Recording for posterity from a 1:1 I had with @yastij: he asked if we could use the kubeletExtraArgs to set the initial-node-labels to indicate "bootstrapping done", and I pointed out that the kubelet registers its Node with the apiserver before init/join is 100% done, so this won't work, unfortunately.

@dlipovetsky

I think the mechanism will have to be infrastructure-specific.

Agreed; I don't think there is any generic option.

Is it common for cloud-init to fail, but for the Node resource be created?

We see it fairly frequently. You get a Node, but the phases at the end of the kubeadm join process that label/taint the Node and update the kubeadm-config ConfigMap may fail. This is why @randomvariable added #2763 and #2783.

I would be in favor of using a reporting mechanism that handles a broad range of bootstrap failures. And that implies the mechanism is not related to whatever is being bootstrapped (Kubernetes, in this case).

This is where we have to get the contract right. I'm thinking that if an infra provider detects that bootstrapping failed, it sets <Infra>Machine.Status.FailureReason/Message, which then bubbles up to the Machine. That's the MVP for me.
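A sketch of what that MVP could look like inside an infra provider's reconciler, assuming the provider's machine status already carries the standard FailureReason/FailureMessage fields; the type and helper names here are invented:

```go
package foo

import (
	"k8s.io/utils/pointer"
	capierrors "sigs.k8s.io/cluster-api/errors"
)

// fooMachineStatus stands in for a provider's infra machine status that
// already has the usual FailureReason/FailureMessage fields.
type fooMachineStatus struct {
	FailureReason  *capierrors.MachineStatusError
	FailureMessage *string
}

// setBootstrapFailure records a terminal bootstrap failure on the infra
// machine; per the MVP described above, this would then bubble up to the
// owning Machine's status.
func setBootstrapFailure(status *fooMachineStatus, detail string) {
	reason := capierrors.CreateMachineError
	status.FailureReason = &reason
	status.FailureMessage = pointer.StringPtr("bootstrap failed: " + detail)
}
```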

MVP++ would be determining a reliable way to report more details about the bootstrap failure. For example, let's say you had a typo in your Kubernetes version, and kubeadm init/join just kept failing to pull images. If the only signal we have is "failure", we have no idea why, and automatic remediation won't help - it'll just create an endless cycle of failing infra.

I don't know what this would look like exactly. We could try to gather the cloud-init output, but etcd has a 1.5MB limit per entry, and it really shouldn't be used to store logs. We could alternatively define specific phases for bootstrapping (modeled closely after the kubeadm phases), and then have status conditions for each phase. We'd need to figure out how to report that information back, but being able to see something like machine.status.conditions[.type==BootstrapImagePull].success=false (or whatever) would be really helpful imho.
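A per-phase condition might look roughly like this; the condition type names and the example values are made up, and the real shape would depend on whatever conditions design Cluster API adopts:

```go
package foo

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical per-phase condition types, loosely modeled on kubeadm's phases.
const (
	BootstrapImagePullCondition    = "BootstrapImagePull"
	BootstrapKubeletStartCondition = "BootstrapKubeletStart"
	BootstrapJoinCondition         = "BootstrapJoin"
)

// BootstrapCondition is a minimal sketch of a per-phase bootstrap condition.
type BootstrapCondition struct {
	Type               string                 `json:"type"`
	Status             corev1.ConditionStatus `json:"status"`
	Reason             string                 `json:"reason,omitempty"`
	Message            string                 `json:"message,omitempty"`
	LastTransitionTime metav1.Time            `json:"lastTransitionTime,omitempty"`
}

// Example of the kind of signal described above: the image pull phase failed.
var exampleCondition = BootstrapCondition{
	Type:    BootstrapImagePullCondition,
	Status:  corev1.ConditionFalse,
	Reason:  "ImagePullFailed",
	Message: "example only: kubeadm could not pull the control plane images",
}
```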

vincepri (Member) commented:

As per April 27th 2020 community guidelines, this project follows the process outlined in https://github.com/kubernetes-sigs/cluster-api/blob/master/CONTRIBUTING.md#proposal-process-caep for large features or changes.

Following those guidelines, I'll go ahead and close this issue for now and defer to contributors interested in pushing the proposal forward to open a collaborative document proposal instead, and follow the process as described.

/close

k8s-ci-robot (Contributor) commented:

@vincepri: Closing this issue.

In response to this:

As per April 27th 2020 community guidelines, this project follows the process outlined in https://github.com/kubernetes-sigs/cluster-api/blob/master/CONTRIBUTING.md#proposal-process-caep for large features or changes.

Following those guidelines, I'll go ahead and close this issue for now and defer to contributors interested in pushing the proposal forward to open a collaborative document proposal instead, and follow the process as described.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
