Decide on & document how infrastructure providers should (or must) handle permanently failed servers/VMs #1205
Comments
Something else related to this is what the expected behavior of the MachineSet controller should be when it encounters a Machine that has its ErrorReason/ErrorMessage set. I'd like us to consider having the MachineSet controller delete Machines in this state so that replacement replicas can be created.
Based on discussions held during the Node Lifecycle Workstream, the general consensus was not to recreate missing Machine instances. /assign @timothysc
That's what I recall, although I have had difficulty locating any documentation on this. Hence this issue.
Hi @ncdc, This sums up our discussion quite well, thank you. The CAPV issue to which you referred is kubernetes-sigs/cluster-api-provider-vsphere#409. We also discussed adding entropy to VM/guest names in the form of a UUID-based suffix (something like the first seven characters from a UUID) so that a CAPV machine might still recreate a missing VM, but ensure that VMs/guests always have unique names, which in turn ensures unique node names. As you pointed out, this is not an issue with CAPA since:
My note to Andy in our original conversation was that not recreating missing machines seems like an anti-pattern given the nature of CAPI and Kubernetes. If a machine's VM/instance is missing, shouldn't it be reconciled? I don't feel strongly about this, but at least one user did assume that missing VMs should be reconciled; they filed an issue with CAPV, and we implemented that logic. Andy and I just want a written consensus on this so we can point to the agreed behavior if/when future questions come in about why CAP* doesn't recreate missing instances/VMs.
FWIW, I do feel that not recreating backend VMs/instances is an anti-pattern in Kubernetes, and I completely understand why a user would reasonably expect a CAPI provider to ensure that a backend instance/VM is recreated if missing.
@akutz The UUID of a Machine (or of a provider CRD object) cannot be relied on, since it will change if the cluster-api components are pivoted or restored from backup.
Hi @detiber, which is why CAPV updates the Machine spec to include the VM's own instance UUID.
I'm having trouble finding the notes/discussions, but if I recall correctly the rationale was that since a Machine is an abstraction over a Kubernetes Node, if we destroy/recreate a Machine as part of error recovery, then we will generally get a completely different Node object rather than the same Node that previously existed.
Do you mean server/VM/instance?
@ncdc yes, that is indeed what I meant.
Which is why I made this remark to Andy on Slack:
I consider the CAPI
@detiber OK, that's not necessarily the case; it depends on the provider. CAPV is able to create a replacement VM that will reassociate with the original Node (which can potentially be an issue if the provider ID is outdated).
I don't see this as an anti-pattern. I find that an immutable Machine object is useful, and even necessary for some semantics. If a VM/instance is missing, then the controller will act according to the semantics it provides. For example, the MachineSet controller can create a new Machine, then delete the Machine with the missing VM/instance. If the Machine is not owned by a controller, then a user will need to do this. Moreover, if an instance backing the Machine object changes, CAPI clients may want to know what the previous instance was, why it was changed, etc. Do you have thoughts on how they would find that information?
If a Machine is mutable in this way, how will CAPI clients (including the planned control plane providers) implement StatefulSet-like semantics if/when they are needed?
This is also related to in-place vs. replace upgrades. /cc @neolit123
Hi @dlipovetsky,
I'm not entirely sure I understand how recreating a VM/instance for a Machine makes a Machine mutable. The Machine is an immutable spec that guarantees the existence of a back-end instance/VM. If a Machine is not a contract that guarantees the existence of the back-end instance/VM, then why is a Machine even represented as a resource? The Machine is the intended/desired state that a VM/instance fulfills.
Do K8s clients know when/if the pod fulfilling a service changes? If so, then I imagine the record would be the same here. The Machine is the intended state, and the VM/instance fulfills that state. I don't quite understand how this is any different from any other K8s resource.
Another data point to bring up: a Pod has a restart policy that applies to all of its containers. Valid values are Always, OnFailure, and Never. If a container terminates, it may be restarted, depending on the policy. Maybe we need a similar restart (recreate) policy for Machines?
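To make that analogy concrete, here is a minimal, purely hypothetical sketch of what a Pod-style restart/recreate policy could look like if it were added to a Machine-like spec. None of these type names, constants, or fields exist in Cluster API today; they are assumptions made only for illustration.

```go
package main

import "fmt"

// MachineRecreatePolicy is a hypothetical field mirroring the spirit of a Pod's restartPolicy.
type MachineRecreatePolicy string

const (
	// RecreateAlways: the provider recreates the backing instance whenever it is missing or failed.
	RecreateAlways MachineRecreatePolicy = "Always"
	// RecreateOnFailure: recreate only when the instance reports a failure, not when it is merely missing.
	RecreateOnFailure MachineRecreatePolicy = "OnFailure"
	// RecreateNever: never recreate; surface a terminal error and let a higher-level controller act.
	RecreateNever MachineRecreatePolicy = "Never"
)

// shouldRecreate decides whether a provider reconciler would create a replacement
// instance for a Machine whose backing VM/instance is gone or unhealthy.
func shouldRecreate(policy MachineRecreatePolicy, instanceMissing, instanceFailed bool) bool {
	switch policy {
	case RecreateAlways:
		return instanceMissing || instanceFailed
	case RecreateOnFailure:
		return instanceFailed
	default: // RecreateNever or unset
		return false
	}
}

func main() {
	// With Never, a missing instance is left alone and the Machine is marked failed instead.
	fmt.Println(shouldRecreate(RecreateNever, true, false)) // false
}
```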
Unlike Pods, we have multiple backing implementations. Can we ensure that such policies can be handled and enforced across all of the different backing implementations? How would we handle cases where they can't be for a particular provider?
Good point. You could potentially have an infra provider install a validating webhook that validates a Machine if its infraRef is for that provider, and it could reject unsupported policies.
Having different providers support and reject different sets of policies seems like it could lead to end-user confusion and frustration.
What about a retention policy, with the options being retain or delete? If the server is in error (missing or some actual error):
Then, to handle the need to recreate: if the Machine belongs to a MachineSet, the MachineSet logic could identify the unhealthy Machine, delete it, and create a replacement.
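As a rough illustration of that replacement flow, here is a simplified sketch. The Machine type and FailureReason field below are stand-ins invented for this example, not the actual Cluster API types, and the real MachineSet controller is of course driven by controller-runtime against the API server.

```go
package main

import "fmt"

// Machine is a simplified stand-in for the real API type.
type Machine struct {
	Name          string
	FailureReason *string // set by the infra provider on a terminal error
}

// machinesToReplace returns the Machines a MachineSet-style controller would
// delete so that its replica logic creates fresh replacements.
func machinesToReplace(machines []Machine) []Machine {
	var failed []Machine
	for _, m := range machines {
		if m.FailureReason != nil {
			failed = append(failed, m)
		}
	}
	return failed
}

func main() {
	reason := "instance not found"
	machines := []Machine{
		{Name: "worker-a"},
		{Name: "worker-b", FailureReason: &reason},
	}
	for _, m := range machinesToReplace(machines) {
		// In the real controller this would be a delete call against the API server;
		// the replica count then drives creation of a replacement Machine.
		fmt.Println("would delete and replace:", m.Name)
	}
}
```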
I agree with you. Having read @detiber's, @ncdc's, and your comments, I see I did not understand the problem with enough nuance. I now think the problem is not replacing the back-end instance/VM, but rather the difference in network identity and on-disk state (among other things) between the original back-end instance/VM and its replacement. For convenience, let me refer to the two back-end VMs/instances as "old" and "new." I think that replacing an old instance with a new one should be equivalent to power cycling the old instance. If, for example, it is not possible to assign the old instance's network identity to the new instance and restore the old instance's on-disk state to the new instance, then the replacement must not happen.
+1 to a machine lifecycle policy. I think VM lifecycles have diverged too much across the various cloud providers to find a single policy that will fit all the use cases; the AWS EC2 vs. vSphere VM case is already drastically different. Adding to Andy's list of
Chiming in here. In OpenShift, we decided it doesn't make sense to recreate a missing VM, or a VM that is otherwise unhealthy. In v1alpha1, the AWS provider would originally create a new instance even if the VM was only stopped (and delete the other instance, depending on whether or not it was a master). For cluster-api, I would like to get to the place where a Machine object only ever results in a single instance being created. Firstly, this is pretty easy to grok, even if it doesn't align with someone's initial expectation. Secondly, we're wasting a lot of cloud API quota trying to keep track of instance state. If we know we only create once, or delete once, we don't need to worry about hitting the cloud API's quota limit. You'll definitely run into this if you're deploying multiple large clusters in the same AWS account. Trimming API calls wherever possible should be a goal. The 'only created once' behavior will probably work across all providers, and that's something we should try to keep as similar as possible. Also, we can easily remediate problematic machines at a higher level. If something detects that a machine needs to be re-created, it just deletes that Machine object, and the MachineSet spins up a new one, or reboots it, or does nothing.
Also, recreating the backing instances messes up our bootstrap flow. The fact of the matter is, there are other components that do things when a machine is created, and trying to signal to a bunch of loosely coupled components that 'hey, this machine is actually different now' is quite problematic.
Generally, I think we should strive to provide the most similar cluster-api experience across all the different providers. This makes behaviour predictable across any cloud, providing a better UX for cluster-api. It also makes it easier to build further common abstractions and higher-level tooling on top of the cluster-api core primitives, yielding added value for end consumers and favouring the development of the community cluster-api ecosystem. I think we should keep narrowing down the scope of the provider implementations to purely non-opinionated cloud API interactions, and let core cluster-api make some decisions, enforce the semantics, commonalities, and invariants that we desire for a provider to be compliant, and let the flexibility happen on top of the core cluster/machine API.
I would include in the list of permanent failures a few other use cases in addition to a cloud instance being removed out of band, e.g. invalid infra API input that sneaked past pre-persistence validation, or a cloud zone outage. For all the scenarios falling into that bucket, we already have a cloud-agnostic semantic: https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20190610-machine-states-preboot-bootstrapping.md#failed
@ncdc @detiber I think we should try to unify provider behaviour/expectations to reduce side effects and UX skew, and let the flexibility be built on top of the core, upper-level cluster/machine layer.
If anything, I think we could bubble up the
This would in essence favour the interaction between these composable building blocks (Machine, MachineSet, machine health checker) in a cloud-provider-agnostic manner, which I believe is in line with the core principles of cluster-api.
@detiber I'd love to see those notes if you manage to find them.
/assign |
@randomvariable I'd hold off on documenting this until we discuss it during the community meeting or resolve the conversations with the stakeholders above.
Bumping this to try to come to a resolution. I like @dlipovetsky's comment:
I also don't want to have a muddy UX, which could easily happen if we introduce restart policies and the set of supported policies varies per provider. I have a couple of options for language we can consider adopting:
Word-smithing notwithstanding, how do y'all feel about these?
Even though option 2 is the most restrictive, it is the one I'm leaning towards the most, given that it most resembles the current state of things.
The thing that I worry about the most with option 2 is that we mark an InfraMachine and Machine resource as permanently failed, but the machine recovers prior to workloads being evacuated, and now we potentially disrupt workloads even more by deleting the Machine/InfraMachine through remediation when we could have left them alone.
I'm for option 2. It's how the cloud typically works. A machine represents a request for a unit of compute resource, and then serves as a record of that request. When you create an instance in AWS or a similar cloud provider, the cloud gives you that instance and tells you its ID. What happens to that instance later is entirely up to you. There's no mechanism to say "If this instance fails, give me an identical one, and also pretend it was the one that was running previously, with the same networking, same instance ID, same everything." Machines are by nature imperative because the cloud is imperative. Machines are merely a wrapper around the imperative cloud. MachineSets and higher-order things are the declarative layer, and that's where the recreation, if any, should be handled.
For me, this is why remediation of machines that are already nodes should always start by looking at the node. It doesn't make a lot of sense to me to keep track of what's happening to the actual instance. The node will go unready, and pod health checks will fail. We remediate the node, not the machine. In doing so, we cordon, then drain, then delete the backing instance. If the node has failed for 5 minutes, let's clean it up; the MachineSet will give us a new one.
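For what it's worth, here is a very rough sketch of the order of operations in that node-driven flow. The node type, the five-minute threshold check, and the print statements are stand-ins for illustration only; real remediation would cordon and drain through the Kubernetes API and delete the Machine so its MachineSet creates a replacement.

```go
package main

import (
	"fmt"
	"time"
)

// node is a simplified stand-in for a Kubernetes Node.
type node struct {
	Name          string
	Unschedulable bool
	NotReadySince time.Time
}

const failureThreshold = 5 * time.Minute

// remediate sketches the node-driven flow: if the Node has been unready past the
// threshold, cordon it, drain it, and delete the backing Machine.
func remediate(n *node) {
	if time.Since(n.NotReadySince) < failureThreshold {
		return // still within the grace period; leave it alone
	}
	n.Unschedulable = true                              // cordon
	fmt.Printf("draining pods from %s\n", n.Name)       // drain
	fmt.Printf("deleting Machine backing %s\n", n.Name) // MachineSet replaces it
}

func main() {
	n := &node{Name: "worker-b", NotReadySince: time.Now().Add(-10 * time.Minute)}
	remediate(n)
}
```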
I wonder if we can combine elements from what I wrote along with @michaelgugino's comments and some edits to address @detiber's concerns:
Potentially out of scope for this discussion, or pending the outcome of #1684, add some text about how CAPI should handle failed Nodes. WDYT?
Agree with that. That also ties in with the changes in kubernetes-sigs/cluster-api-provider-aws#1256
@ncdc I agree with that phrasing.
I had an opportunity to re-read this conversation, and I wanted to bubble up some things to make sure they're not lost in the shuffle: There was a suggestion for a machine lifecycle policy, which was argued against on the grounds that there are multiple implementing providers that might not be able to meet all the provided policies. Could/should there be a follow-on conversation about a higher-level controller that might add some of this auto-remediation (similar to what was possibly outlined here)? If that discussion is happening, maybe provide a link to that issue? EDIT: Ha, it's here: #1684. In several of the cloud-based products that I've encountered, infrastructure self-healing is usually presented as a feature, and an often desired one. If the default behavior of Cluster API is to do nothing with unhealthy machines, that should be clearly documented, along with additional mechanisms to address that gap.
I think the default behavior of 'cluster api' as a set of components would be to remediate that machine. The question is where that remediation takes place. For a variety of reasons, some of us believe that should be an entirely separate concern. In most cases, we want to remediate based on node conditions, not machine conditions, and it makes sense to move the logic to that component.
Is it fair to close this now that the Machine health check & remediation bits have shipped in v0.3.0?
I wasn't privy to this conversation originally, but having scanned through, if #1205 (comment) is satisfied, then yes, MHC will solve this issue and remediate any machine which goes into a permanently failed state (if it has
Good enough for me! I think we can close this now; if there is any other follow-up, we can either tackle that separately or reopen it later. /close
/kind feature
Describe the solution you'd like
Let's say I have a Machine, and my infrastructure provider created a server/VM. Everything is working fine, and then, for whatever reason, the server/VM gets into a bad state. Let's imagine someone/something external to Cluster API manually deleted it. I'd like for us to decide on and document how an infrastructure provider should (or must) address the situation.
CAPA is interested in setting the Machine's error reason & message to indicate that the underlying EC2 instance can't be found. This would put the Machine into a permanent failure state, and it would require manual or automated intervention outside of the CAPA AWSMachine reconciler to resolve.
If CAPA instead created a replacement EC2 instance, the new instance would most likely get a different private IP address than the previous one, and this would result in a new Kubernetes Node being created (because Nodes running in AWS are named based on the private DNS name).
CAPV recently switched from the above behavior, and it will now create a replacement VM if it can't find the expected VM. This was in response to a user's bug report. Creating a new VM will most likely result in reusing the same Kubernetes Node (the node name is based on the VM name, which remains constant and is controlled/generated by CAPV; right, @akutz?). This can have potential issues if the Node's ProviderID still points at the original (missing) VM (again, @akutz, please correct any bugs in my comments).
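To make the CAPA option concrete, here is a self-contained sketch of the "mark it permanently failed" flow. The status struct, the describeInstance helper, and the "UpdateError" reason string are simplified stand-ins for illustration; the actual reconciler uses the AWS SDK and the Cluster API error-reason types rather than anything shown here.

```go
package main

import (
	"errors"
	"fmt"
)

var errInstanceNotFound = errors.New("instance not found")

// awsMachineStatus is a simplified stand-in for the provider machine status.
type awsMachineStatus struct {
	ErrorReason  string
	ErrorMessage string
}

// describeInstance stands in for a call to the cloud API (e.g. EC2 DescribeInstances).
func describeInstance(instanceID string) error {
	return errInstanceNotFound // pretend the instance was deleted out of band
}

// reconcileInstance marks the machine as permanently failed instead of creating
// a replacement instance, leaving remediation to a higher-level controller or a human.
func reconcileInstance(instanceID string, status *awsMachineStatus) {
	if err := describeInstance(instanceID); errors.Is(err, errInstanceNotFound) {
		status.ErrorReason = "UpdateError"
		status.ErrorMessage = fmt.Sprintf("EC2 instance %q not found; this is a terminal error", instanceID)
	}
}

func main() {
	var status awsMachineStatus
	reconcileInstance("i-0123456789abcdef0", &status)
	fmt.Printf("%+v\n", status)
}
```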
/kind documentation
/milestone v1alpha2
/priority important-soon