
RFE: a consistent way for infrastructure providers to report bootstrap success or failure #2554

Closed
ncdc opened this issue Mar 5, 2020 · 10 comments
Labels
kind/proposal Issues or PRs related to proposals.

Comments

ncdc (Contributor) commented Mar 5, 2020

⚠️ Cluster API maintainers can ask to turn an issue-proposal into a CAEP when necessary; this is to be expected for large changes that impact multiple components, breaking changes, or new large features.

Goals

  1. Provide a definitive way to determine if a machine bootstrapped successfully or not

Non-Goals/Future Work

  1. N/A

User Story

As a user, I would like to know if a Machine failed to bootstrap, so that I don't have a node joining the cluster that may not be fully functional.

Detailed Description

There is currently nothing in the contract for infrastructure providers that indicates whether machine bootstrapping succeeded or failed. We have seen multiple instances (with the kubeadm bootstrap provider) where cloud-init runs, kubeadm join executes, and a new Node joins the workload cluster, but kubeadm actually failed (exited with a non-zero code). In some circumstances, the Node joins the cluster but may be missing default taints/labels (such as the master node role).

I'd like us to find a way for an infrastructure provider to report if bootstrapping succeeded or failed. The exact manner by which each infrastructure provider checks for success or failure will probably need to vary, but we should be able to define a common status field in each "infrastructure machine" that indicates success or failure.
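To make the idea concrete, here is a minimal sketch of what a common status field could look like on an infrastructure machine type. The FooMachine name and the BootstrapSucceeded field are purely illustrative, not part of any existing contract:

```go
package v1alpha3

// FooMachineStatus sketches what a hypothetical infrastructure machine's
// status could look like with the proposed field; neither FooMachine nor
// BootstrapSucceeded exists in any provider today.
type FooMachineStatus struct {
	// Ready is the existing contract field indicating the infrastructure is ready.
	Ready bool `json:"ready"`

	// BootstrapSucceeded would be set by the infrastructure provider once it
	// determines whether the bootstrap data ran to completion on the host:
	// nil = unknown, true = succeeded, false = failed.
	// +optional
	BootstrapSucceeded *bool `json:"bootstrapSucceeded,omitempty"`
}
```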

I'm not sure what we should do exactly around remediating Machines in this state. We could potentially integrate with MachineHealthCheck, and we'll need to figure something out for KubeadmControlPlane Machines too.

Contract changes [optional]

  • All infrastructure providers must populate the new field/condition described below based on bootstrap success or failure
  • Machine controller copies field/condition to its own status

Data model changes [optional]

  • Add a way (status field or condition) for infrastructure machine CRDs to indicate bootstrap succeeded/failed

/kind proposal

k8s-ci-robot added the kind/proposal label on Mar 5, 2020
ncdc (Contributor, Author) commented Mar 5, 2020

xref kubernetes-sigs/cluster-api-provider-aws#972 for the original report

vincepri (Member) commented Mar 5, 2020

/milestone Next

yastij (Member) commented Apr 1, 2020

@ncdc - it depends on how much granularity we want here. I'd be fine just knowing whether cloud-init succeeded or not. A trivial way to do it is to label or annotate the corresponding Node object on startup.
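For illustration only, a minimal client-go sketch of what such a step could look like if it ran at the end of cloud-init; the label key, the kubeconfig path, and the assumption that the Node name equals the hostname are all invented for the example:

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig on the host whose RBAC allows patching Nodes; as
	// noted in the next comment, that generally only holds on control plane
	// nodes, not on workers.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/admin.conf")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodeName, _ := os.Hostname()
	// Hypothetical label recording that cloud-init finished successfully.
	patch := []byte(`{"metadata":{"labels":{"bootstrap.cluster.x-k8s.io/succeeded":"true"}}}`)
	if _, err := client.CoreV1().Nodes().Patch(context.TODO(), nodeName,
		types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```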

detiber (Member) commented Apr 1, 2020

@yastij that is definitely a possibility; however, it assumes that there are credentials available on the host with enough permissions to modify the Node object after starting. If I'm not mistaken, that is only the case for control plane Nodes and not for general worker Nodes.

dlipovetsky (Contributor) commented:

I'd be fine just knowing whether cloud-init succeeded or not. A trivial way to do it is to label or annotate the corresponding Node object on startup.

Is it common for cloud-init to fail, but for the Node resource to be created?

dlipovetsky (Contributor) commented Apr 1, 2020

I would be in favor of using a reporting mechanism that handles a broad range of bootstrap failures. And that implies the mechanism is not related to whatever is being bootstrapped (Kubernetes, in this case). I think the mechanism will have to be infrastructure-specific.

(Thinking out loud) For example, CAPA could fetch the console output of an EC2 instance and look for a sentinel value. If the value isn't there after some time, CAPA can decide that the instance failed to bootstrap and save the console output somewhere in the Cluster API control plane. [Edit: Looks like my mind rehashed the discussion in kubernetes-sigs/cluster-api-provider-aws/issues/972, which I failed to read before writing down my thoughts!]
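A rough sketch of that idea using the aws-sdk-go EC2 GetConsoleOutput call; the sentinel string is invented for the example, and this is not how CAPA is actually implemented:

```go
package main

import (
	"encoding/base64"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// bootstrapSucceeded checks the instance console output for a sentinel line
// that the bootstrap script would print on success. Both the sentinel and the
// overall approach are illustrative only.
func bootstrapSucceeded(instanceID string) (bool, error) {
	sess := session.Must(session.NewSession())
	svc := ec2.New(sess)

	out, err := svc.GetConsoleOutput(&ec2.GetConsoleOutputInput{
		InstanceId: aws.String(instanceID),
	})
	if err != nil {
		return false, err
	}
	if out.Output == nil {
		return false, nil // console output not available yet; retry later
	}
	decoded, err := base64.StdEncoding.DecodeString(aws.StringValue(out.Output))
	if err != nil {
		return false, err
	}
	// Hypothetical sentinel that cloud-init would emit at the end of bootstrap.
	return strings.Contains(string(decoded), "CAPI_BOOTSTRAP_SUCCESS"), nil
}
```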

vincepri (Member) commented Apr 1, 2020

I think the mechanism will have to be infrastructure-specific.

+1 to this. CAPI can offer common (optional) ways to tap into the information infrastructure providers expose.

ncdc (Contributor, Author) commented Apr 1, 2020

Recording for posterity from a 1:1 I had with @yastij: he asked if we could use the kubeletExtraArgs to set the initial-node-labels to indicate "bootstrapping done", and I pointed out that the kubelet registers its Node with the apiserver before init/join is 100% done, so this won't work, unfortunately.

@dlipovetsky

I think the mechanism will have to be infrastructure-specific.

Agreed; I don't think there is any generic option.

Is it common for cloud-init to fail, but for the Node resource be created?

We see it fairly frequently. You get a Node, but the phases at the end of the kubeadm join process that label/taint the Node and update the kubeadm-config ConfigMap may fail. This is why @randomvariable added #2763 and #2783.

I would be in favor of using a reporting mechanism that handles a broad range of bootstrap failures. And that implies the mechanism is not related to whatever is being bootstrapped (Kubernetes, in this case).

This is where we have to get the contract right. I'm thinking that if an infra provider detects that bootstrapping failed, it sets <Infra>Machine.Status.FailureReason/Message, which then bubbles up to the Machine. That's the MVP for me.
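A sketch of what that MVP could look like inside an infra provider's reconciler, assuming the provider's machine status already carries the standard FailureReason/FailureMessage fields; the type and helper names here are invented:

```go
package foo

import (
	"k8s.io/utils/pointer"
	capierrors "sigs.k8s.io/cluster-api/errors"
)

// fooMachineStatus stands in for a provider's infra machine status that
// already has the usual FailureReason/FailureMessage fields.
type fooMachineStatus struct {
	FailureReason  *capierrors.MachineStatusError
	FailureMessage *string
}

// setBootstrapFailure records a terminal bootstrap failure on the infra
// machine; per the MVP described above, this would then bubble up to the
// owning Machine's status.
func setBootstrapFailure(status *fooMachineStatus, detail string) {
	reason := capierrors.CreateMachineError
	status.FailureReason = &reason
	status.FailureMessage = pointer.StringPtr("bootstrap failed: " + detail)
}
```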

MVP++ would be determining a reliable way to report more details about the bootstrap failure. For example, let's say you had a typo in your Kubernetes version, and kubeadm init/join just kept failing to pull images. If the only signal we have is "failure", we have no idea why, and automatic remediation won't help - it'll just create an endless cycle of failing infra.

I don't know what this would look like exactly. We could try to gather the cloud-init output, but etcd has a 1.5MB limit per entry, and it really shouldn't be used to store logs. We could alternatively define specific phases for bootstrapping (modeled closely after the kubeadm phases), and then have status conditions for each phase. We'd need to figure out how to report that information back, but being able to see something like machine.status.conditions[.type==BootstrapImagePull].success=false (or whatever) would be really helpful imho.
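A per-phase condition might look roughly like this; the condition type names and the example values are made up, and the real shape would depend on whatever conditions design Cluster API adopts:

```go
package foo

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical per-phase condition types, loosely modeled on kubeadm's phases.
const (
	BootstrapImagePullCondition    = "BootstrapImagePull"
	BootstrapKubeletStartCondition = "BootstrapKubeletStart"
	BootstrapJoinCondition         = "BootstrapJoin"
)

// BootstrapCondition is a minimal sketch of a per-phase bootstrap condition.
type BootstrapCondition struct {
	Type               string                 `json:"type"`
	Status             corev1.ConditionStatus `json:"status"`
	Reason             string                 `json:"reason,omitempty"`
	Message            string                 `json:"message,omitempty"`
	LastTransitionTime metav1.Time            `json:"lastTransitionTime,omitempty"`
}

// Example of the kind of signal described above: the image pull phase failed.
var exampleCondition = BootstrapCondition{
	Type:    BootstrapImagePullCondition,
	Status:  corev1.ConditionFalse,
	Reason:  "ImagePullFailed",
	Message: "example only: kubeadm could not pull the control plane images",
}
```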

vincepri (Member) commented:

As per April 27th 2020 community guidelines, this project follows the process outlined in https://github.com/kubernetes-sigs/cluster-api/blob/master/CONTRIBUTING.md#proposal-process-caep for large features or changes.

Following those guidelines, I'll go ahead and close this issue for now and defer to contributors interested in pushing the proposal forward to open a collaborative document proposal instead, and follow the process as described.

/close

k8s-ci-robot (Contributor) commented:

@vincepri: Closing this issue.

In response to this:

As per April 27th 2020 community guidelines, this project follows the process outlined in https://github.com/kubernetes-sigs/cluster-api/blob/master/CONTRIBUTING.md#proposal-process-caep for large features or changes.

Following those guidelines, I'll go ahead and close this issue for now and defer to contributors interested in pushing the proposal forward to open a collaborative document proposal instead, and follow the process as described.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
