RFE: a consistent way for infrastructure providers to report bootstrap success or failure #2554
Comments
xref kubernetes-sigs/cluster-api-provider-aws#972 for the original report
/milestone Next
@ncdc - it depends on how much granularity we want here. I'd be just fine knowing if cloud-init succeeded or not. A trivial way to do it is to label or annotate the corresponding Node object on startup (see the sketch below).
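(For illustration, a minimal Go sketch of that idea, assuming the host has a kubeconfig that is allowed to patch Nodes and that the hostname matches the Node name; the label key is made up:)

```go
// Hypothetical sketch: a tiny program run at the end of cloud-init that
// labels its own Node to signal bootstrap success. It assumes the host has
// a kubeconfig with permission to patch Nodes (as the next comment notes,
// generally only true for control plane hosts) and that the hostname
// matches the Node name. The label key is invented for illustration.
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/kubelet.conf")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodeName, err := os.Hostname()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Merge-patch a label onto our own Node object.
	patch := []byte(`{"metadata":{"labels":{"example.x-k8s.io/bootstrap-succeeded":"true"}}}`)
	if _, err := client.CoreV1().Nodes().Patch(
		context.TODO(), nodeName, types.StrategicMergePatchType, patch, metav1.PatchOptions{},
	); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```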
@yastij that is definitely a possibility, however it assumes that there are credentials available on the host with enough permissions to modify the Node object after starting. If I'm not mistaken, that is only the case for control plane Nodes, not for general worker Nodes.
Is it common for cloud-init to fail, but for the Node resource to be created?
I would be in favor of a reporting mechanism that handles a broad range of bootstrap failures, which implies the mechanism is not tied to whatever is being bootstrapped (Kubernetes, in this case). I think the mechanism will have to be infrastructure-specific. (Thinking out loud) For example, CAPA could fetch the console output of an EC2 instance and look for a sentinel value. If the value isn't there after some time, CAPA can decide that the instance failed to bootstrap and save the console output somewhere in the Cluster API control plane. [Edit: Looks like my mind rehashed the discussion in kubernetes-sigs/cluster-api-provider-aws/issues/972, which I failed to read before writing down my thoughts!]
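(A rough Go sketch of that console-output idea, using the AWS SDK's GetConsoleOutput; this is not actual CAPA code, and the sentinel string and instance ID are invented for the example:)

```go
// Illustrative sketch only: fetch an EC2 instance's console output and scan
// it for a sentinel string that the bootstrap script would print on success.
package main

import (
	"encoding/base64"
	"fmt"
	"strings"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

const sentinel = "CAPI-BOOTSTRAP-SUCCESS" // hypothetical marker

// bootstrapSucceeded reports whether the sentinel appears in the instance's
// console output, returning the decoded output for later storage.
func bootstrapSucceeded(svc *ec2.EC2, instanceID string) (bool, string, error) {
	out, err := svc.GetConsoleOutput(&ec2.GetConsoleOutputInput{
		InstanceId: aws.String(instanceID),
	})
	if err != nil {
		return false, "", err
	}
	// Console output is returned base64-encoded.
	decoded, err := base64.StdEncoding.DecodeString(aws.StringValue(out.Output))
	if err != nil {
		return false, "", err
	}
	return strings.Contains(string(decoded), sentinel), string(decoded), nil
}

func main() {
	svc := ec2.New(session.Must(session.NewSession()))
	ok, console, err := bootstrapSucceeded(svc, "i-0123456789abcdef0")
	if err != nil {
		panic(err)
	}
	if !ok {
		// A controller could save this output somewhere in the Cluster API
		// control plane instead of printing it.
		fmt.Println("bootstrap not (yet) successful; console output:\n" + console)
	}
}
```

A real controller would poll this on an interval and only declare failure after a timeout, per the "if the value isn't there after some time" idea above.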
+1 to this. CAPI can offer common (optional) ways to tap into the information infrastructure providers expose.
Recording for posterity from a 1:1 I had with @yastij: he asked if we could use the kubeletExtraArgs to set the initial-node-labels to indicate "bootstrapping done", and I pointed out that the kubelet registers its Node (applying any initial labels) as soon as it starts, before the remaining bootstrap phases have run, so such a label can't confirm that bootstrapping actually completed.
Agreed; I don't think there is any generic option.
We see it fairly frequently. You get a Node, but the phases at the end of the kubeadm join process still fail.
This is where we have to get the contract right. I'm thinking that if an infra provider detects that bootstrapping failed, it sets a failure status on the infrastructure machine. MVP++ would be determining a reliable way to report more details about the bootstrap failure, for example when you had a typo in your Kubernetes version; I don't know what this would look like exactly. We could try to gather the cloud-init output, but etcd has a 1.5MB limit per entry, and it really shouldn't be used to store logs. We could alternatively define specific phases for bootstrapping (modeled closely after the kubeadm phases), and then have status conditions for each phase. We'd need to figure out how to report that information back, but being able to see something like `machine.status.conditions[.type==BootstrapImagePull].success=false` (or whatever) would be really helpful, IMHO.
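(Sketching what such per-phase conditions could look like as Go types; every name here is hypothetical and does not exist in Cluster API, with the phases loosely modeled on kubeadm's:)

```go
// Hypothetical Go types for per-phase bootstrap conditions.
package v1alpha3

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// BootstrapConditionType identifies one phase of machine bootstrapping.
type BootstrapConditionType string

const (
	// Invented phase names, roughly mirroring kubeadm's phases.
	BootstrapPreflight    BootstrapConditionType = "BootstrapPreflight"
	BootstrapImagePull    BootstrapConditionType = "BootstrapImagePull"
	BootstrapKubeletStart BootstrapConditionType = "BootstrapKubeletStart"
	BootstrapJoinComplete BootstrapConditionType = "BootstrapJoinComplete"
)

// BootstrapCondition records the outcome of a single bootstrap phase.
// Keeping only a short reason/message, rather than full cloud-init logs,
// stays well under etcd's ~1.5MB per-entry limit mentioned above.
type BootstrapCondition struct {
	Type               BootstrapConditionType `json:"type"`
	Status             corev1.ConditionStatus `json:"status"`
	Reason             string                 `json:"reason,omitempty"`
	Message            string                 `json:"message,omitempty"`
	LastTransitionTime metav1.Time            `json:"lastTransitionTime,omitempty"`
}
```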
As per the April 27th, 2020 community guidelines, this project follows the process outlined in https://github.com/kubernetes-sigs/cluster-api/blob/master/CONTRIBUTING.md#proposal-process-caep for large features or changes. Following those guidelines, I'll go ahead and close this issue for now and defer to contributors interested in pushing the proposal forward to open a collaborative document proposal instead, following the process as described. /close
@vincepri: Closing this issue.
Goals
Non-Goals/Future Work
User Story
As a user, I would like to know if a Machine failed to bootstrap, so that I don't have a node joining the cluster that may not be fully functional.
Detailed Description
There is currently nothing in the contract for infrastructure providers for indicating whether machine bootstrapping succeeded or failed. We have seen multiple instances where (with the kubeadm bootstrap provider) cloud-init runs, `kubeadm join` executes, and a new `Node` joins the workload cluster, but `kubeadm` actually had an error (a non-zero exit code). In some circumstances, the `Node` joins the cluster but is missing default taints/labels (such as the master node role).

I'd like us to find a way for an infrastructure provider to report whether bootstrapping succeeded or failed. The exact manner by which each infrastructure provider checks for success or failure will probably need to vary, but we should be able to define a common status field in each "infrastructure machine" that indicates success or failure (sketched below).
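A minimal sketch of what such a common field might look like on a provider's infrastructure machine status; the field names are hypothetical and not part of any existing contract, with only `Ready` taken from today's contract:

```go
// Hypothetical additions to a provider's "infrastructure machine" status.
package v1alpha3

// FooMachineStatus stands in for any provider's infrastructure machine status.
type FooMachineStatus struct {
	// Ready is the existing contract field signaling provisioning is complete.
	Ready bool `json:"ready"`

	// BootstrapSucceeded reports whether the bootstrap data (e.g. cloud-init)
	// ran to completion; nil means unknown or still in progress.
	// +optional
	BootstrapSucceeded *bool `json:"bootstrapSucceeded,omitempty"`

	// BootstrapFailureMessage carries a short, human-readable explanation
	// when BootstrapSucceeded is false.
	// +optional
	BootstrapFailureMessage *string `json:"bootstrapFailureMessage,omitempty"`
}
```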
I'm not sure what we should do exactly around remediating Machines in this state. We could potentially integrate with MachineHealthCheck, and we'll need to figure something out for KubeadmControlPlane Machines too.
Contract changes [optional]
Data model changes [optional]
/kind proposal