
Decide on & document how infrastructure providers should (or must) handle permanently failed servers/VMs #1205

Closed
ncdc opened this issue Jul 29, 2019 · 41 comments
Labels
help wanted - Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
kind/documentation - Categorizes issue or PR as related to documentation.
kind/feature - Categorizes issue or PR as related to a new feature.
priority/important-longterm - Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@ncdc
Contributor

ncdc commented Jul 29, 2019

/kind feature

Describe the solution you'd like
Let's say I have a Machine, and my infrastructure provider created a server/VM. Everything is working fine, and then, for whatever reason, the server/VM gets in a bad state. Let's imagine someone/something external to Cluster API manually deleted it. I'd like for us to decide on and document how an infrastructure provider should (or must) address the situation.

CAPA is interested in setting the Machine's error reason & message to indicate the underlying EC2 instance can't be found. This would put the Machine into a permanent failure state, and it would require manual or automated intervention outside of the CAPA AWSMachine reconciler to resolve.

If CAPA instead created a replacement EC2 instance, the new VM most likely will get a different private IP address than the previous instance, and this will result in a new Kubernetes Node being created (because Nodes running in AWS are named based on the private DNS name).

CAPV recently switched away from the above behavior, and it now creates a replacement VM if it can't find the expected VM. This was in response to a user's bug report. Creating a new VM will most likely result in reusing the same Kubernetes Node (the node name is based on the VM name, which remains constant and is controlled by / generated by CAPV - right @akutz?). This can cause issues if the Node's ProviderID still points at the original (missing) VM (again, @akutz, please correct any bugs in my comments).

/kind documentation
/milestone v1alpha2
/priority important-soon

@k8s-ci-robot k8s-ci-robot added this to the v1alpha2 milestone Jul 29, 2019
@k8s-ci-robot k8s-ci-robot added the kind/feature, kind/documentation, and priority/important-soon labels Jul 29, 2019
@ncdc
Contributor Author

ncdc commented Jul 29, 2019

I think something else that is related to this is what the expected behavior of the MachineSet controller should be if it encounters a Machine that has its ErrorReason/ErrorMessage set. I'd like us to consider having the MachineSet controller delete Machines in this state, so replacement replicas can be created.

@detiber
Member

detiber commented Jul 29, 2019

Based on discussions that were held during the Node Lifecycle Workstream, the general consensus was to not recreate missing Machine instances.

/assign @timothysc

@ncdc
Contributor Author

ncdc commented Jul 29, 2019

That's what I recall, although I have had difficulty locating any docs on this. Hence the issue.

@akutz
Contributor

akutz commented Jul 29, 2019

Hi @ncdc,

This sums up our discussion quite well, thank you. The CAPV issue to which you referred is kubernetes-sigs/cluster-api-provider-vsphere#409.

We also discussed adding entropy to VM/guest names in the form of a UUID-based suffix (something like the first seven characters from a UUID) so that a CAPV machine might still recreate a missing VM, but ensure that VMs/guests always have unique names, which in turn ensures unique node names.

As you pointed out, this is not an issue with CAPA since:

  1. CAPA doesn't automatically replace a missing instance
  2. If it did, CAPA instance names are based on the IP, which means replacement instances would get unique names (and thus unique node names)

My note to Andy in our original conversation was that not creating missing machines seems like an anti-pattern given the nature of CAPI and Kubernetes. If a machine's VM/instance is missing, shouldn't it be reconciled?

I don't feel strongly about this, but at least one user did assume that missing VMs should be reconciled, and they filed an issue with CAPV, and we implemented that logic.

Andy and I just want a written consensus on this in order to point to the suggested behavior if/when future questions are submitted about why CAP* doesn't recreate missing instances/VMs.

@akutz
Contributor

akutz commented Jul 29, 2019

FWIW, I do feel that not recreating backend VMs/instances is an anti-pattern in Kubernetes, and I completely understand why a user would reasonably expect a CAPI provider to ensure that a backend instance/VM is recreated if missing.

@detiber
Member

detiber commented Jul 29, 2019

@akutz The UUID of a Machine (or a provider CRD object) cannot be relied on, since it will change if the cluster-api components are pivoted or restored from backup.

@akutz
Contributor

akutz commented Jul 29, 2019

Hi @detiber,

Which is why CAPV updates a Machine spec to include the VM's own instance UUID.

@detiber
Member

detiber commented Jul 29, 2019

I'm having trouble finding the notes/discussions, but iirc the rationale was that since a Machine is an abstraction over a Kubernetes Node, if we destroy/recreate a Machine as part of recovery of errors, then we will generally get a completely different Node object rather than the same Node that previously existed.

@ncdc
Contributor Author

ncdc commented Jul 29, 2019

if we destroy/recreate a Machine

do you mean server/VM/instance?

@detiber
Member

detiber commented Jul 29, 2019

@ncdc yes, that is indeed what I meant.

@akutz
Contributor

akutz commented Jul 29, 2019

we will generally get a completely different Node object rather than the same Node that previously existed.

Which is why I made this remark to Andy on Slack:

I never considered having a 1-to-many relationship for CAP* machines to Kube nodes. I suppose it’s still 1-to-1 at any given point-in-time, but it’s 1-to-many over time since replacing machine’s failed instance/node/VM should likely cause the creation of a net-new node in the cluster, not a node with the same name.

I consider the CAPI Machine to almost be the contract for the node. What satisfies that contract is unimportant as long as you can find the thing (the node) that satisfies the Machine contract.

@ncdc
Contributor Author

ncdc commented Jul 29, 2019

@detiber ok, that's not necessarily the case - it depends on the provider. CAPV is able to create a replacement VM that will reassociate with the original Node (which can be an issue if the provider ID is outdated).

@dlipovetsky
Contributor

My note to Andy in our original conversation was that not creating missing machines seems like an anti-pattern given the nature of CAPI and Kubernetes. If a machine's VM/instance is missing, shouldn't it be reconciled?

I do not recognize an anti-pattern. I find that an immutable Machine object is useful, and even necessary for some semantics.

If a VM/instance is missing, then the controller will act according to the semantics it provides. For example, the MachineSet controller can create a new Machine, then delete the Machine with the missing VM/instance. If the Machine is not owned by a controller, then a user will need to do this.

Moreover, if an instance backing the Machine object changes, CAPI clients may want to know what the previous instance was, why it was changed, etc. Do you have thoughts on how they would find that information?

I suppose it’s still 1-to-1 at any given point-in-time, but it’s 1-to-many over time since replacing machine’s failed instance/node/VM

If a Machine is mutable in this way, how will CAPI clients (including the planned control plane providers) implement StatefulSet-like semantics if/when they are needed?

@dlipovetsky
Contributor

This is also related to in-place vs. re-place upgrades.

/cc @neolit123

@akutz
Contributor

akutz commented Jul 30, 2019

Hi @dlipovetsky,

I find that an immutable Machine object is useful,

I'm not entirely sure I understand how recreating a VM/instance for a Machine makes a Machine mutable. The Machine is an immutable spec that guarantees the existence of a back-end instance/VM.

If a Machine is not a contract that guarantees the existence of the back-end instance/VM, then why is a Machine even represented as a resource? The Machine is the intended/desired state that a VM/instance fulfills.

Moreover, if an instance backing the Machine object changes, CAPI clients may want to know what the previous instance was, why it was changed, etc. Do you have thoughts on how they would find that information?

Do K8s clients know when/if the pod fulfilling a service changes? If so, then I imagine the record will be the same.

The Machine is the intended state and the VM/instance fulfills that state.

I don't quite understand how this is any different than any other K8s resource.

@ncdc
Contributor Author

ncdc commented Jul 30, 2019

Another data point to bring up: a Pod has a restart policy that applies to all of its containers. Valid values are Always, OnFailure, and Never. If a container terminates, it may be restarted, depending on the policy.

Maybe we need a similar restart (recreate) policy for Machines?
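
To make the analogy concrete, here is a minimal sketch of what a Pod-like policy could look like on the Machine API. The field, type, and value names below are hypothetical; nothing like this exists in Cluster API today.

```go
// Hypothetical sketch only: illustrates a Pod-style restart (recreate) policy for Machines.
package v1alpha2

// MachineRestartPolicy describes how a provider may react when the infrastructure
// backing a Machine fails or disappears.
type MachineRestartPolicy string

const (
	// Always: the provider recreates missing or failed infrastructure.
	MachineRestartPolicyAlways MachineRestartPolicy = "Always"
	// OnFailure: recreate only when the infrastructure reports a failure,
	// not when it was deleted out of band.
	MachineRestartPolicyOnFailure MachineRestartPolicy = "OnFailure"
	// Never: never recreate; surface the failure and leave remediation to a
	// higher layer (e.g. a MachineSet or an operator).
	MachineRestartPolicyNever MachineRestartPolicy = "Never"
)

// MachineSpec shows where such a field might live (all other fields omitted).
type MachineSpec struct {
	// RestartPolicy would default to "Never" in this sketch.
	RestartPolicy MachineRestartPolicy `json:"restartPolicy,omitempty"`
}
```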

@detiber
Member

detiber commented Jul 30, 2019

Unlike Pods, we have multiple backing implementations. Can we ensure that such policies can be handled and enforced across all of them? And how do we handle cases where they can't be for a particular provider?

@ncdc
Contributor Author

ncdc commented Jul 30, 2019

Good point. You could potentially have an infra provider install a validating webhook that validates a machine if its infraRef is for that provider, and it could reject unsupported policies.

@detiber
Member

detiber commented Jul 30, 2019

Having different providers supporting and rejecting different sets of policies seems like it could lead to end-user confusion and frustration.

@ncdc
Contributor Author

ncdc commented Jul 30, 2019

What about a retention policy with options being retain or delete. If the server is in error (missing or some actual error):

  • retain = don't do anything
  • delete = delete the server (if it exists)

Then to handle the need to recreate, if the Machine belongs to a MachineSet, the MachineSet logic could identify the unhealthy Machine, delete it, and create a replacement.

@dlipovetsky
Contributor

dlipovetsky commented Jul 30, 2019

@akutz,

The Machine is an immutable spec that guarantees the existence of a back-end instance/VM.

I agree with you. Having read @detiber's, @ncdc's, and your comments, I see I did not understand the problem with enough nuance.

I now think the problem is not replacing the back-end instance/VM, but rather the difference in network identity and on-disk state (among other things) between the original back-end instance/VM and its replacement.

For convenience, let me refer to the two back-end VMs/instances as "old" and "new."

I think that replacing an old instance with a new one should be equivalent to power cycling the "old" instance.

If, for example, it is not possible to assign the old instance's network identity to the new instance and restore the old instance's on-disk state to the new instance, then the replacement must not happen.

@andrewsykim
Member

+1 to a machine lifecycle policy. I think VM lifecycle varies too much across cloud providers to find a single policy that will fit all the use cases. The AWS EC2 vs vSphere VM case is already drastically different.

Adding to Andy's list of Retain and Delete, some other possible policies off the top of my head:

  • ReconcileOnDelete - only reconcile if an instance is deleted outside of CAPI, but not if an instance is in some failure state
  • ReconcileOnFailure - only reconcile if the instance is in some failure state
  • Ignore - same as the Retain policy suggested by Andy, but I think it makes it clearer what is happening
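
Gathering the values proposed so far (Andy's Retain/Delete plus the three above) into one place, a hypothetical enumeration; the names are illustrative only and not part of any API:

```go
// Hypothetical sketch: the lifecycle policy values proposed in this thread, as Go constants.
package v1alpha2

type MachineLifecyclePolicy string

const (
	// Retain / Ignore: leave a failed or missing instance alone; external tooling
	// (or a human) decides what to do next.
	MachineLifecyclePolicyRetain MachineLifecyclePolicy = "Retain"
	MachineLifecyclePolicyIgnore MachineLifecyclePolicy = "Ignore"
	// Delete: delete the instance (if it still exists) when it is in error.
	MachineLifecyclePolicyDelete MachineLifecyclePolicy = "Delete"
	// ReconcileOnDelete: recreate only if the instance was deleted outside of CAPI,
	// not if it is in some failure state.
	MachineLifecyclePolicyReconcileOnDelete MachineLifecyclePolicy = "ReconcileOnDelete"
	// ReconcileOnFailure: recreate only if the instance is in some failure state.
	MachineLifecyclePolicyReconcileOnFailure MachineLifecyclePolicy = "ReconcileOnFailure"
)
```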

@michaelgugino
Contributor

Chiming in here.

In OpenShift, we decided it doesn't make sense to recreate a missing VM, or a VM that is otherwise unhealthy. In v1alpha1, the AWS provider would originally create a new instance even if the VM was merely stopped (and delete the other instance depending on whether or not it was a master).

For cluster-api, I would like to get to the place where a machine-object only results in a single instance being created, ever. Firstly, this is pretty easy to grok, even if it doesn't align with someone's initial expectation. Secondly, we're wasting a lot of cloud-api quota trying to keep track of instance state. If we know we only create once, or delete once, we don't need to worry about hitting the cloud api's quota limit. You'll definitely run into this if you're deploying multiple large clusters in the same AWS account. Trimming API calls wherever possible should be a goal.

Having the 'only created once' behavior will probably work across all providers, and that's something we should try to keep as similar as possible. Also, we can easily remediate problematic machines at a higher level. If something detects that a machine needs to be re-created... it just deletes that machine object and the machineset spins up a new one, or reboots it, or does nothing.

@michaelgugino
Contributor

Also, recreating the backing instances messes up our bootstrap flow. The fact of the matter is, there are other components that do things when a machine is created, and trying to signal to a bunch of loosely coupled components that 'hey, this machine is actually different now' is quite problematic.

@enxebre
Member

enxebre commented Jul 31, 2019

Generally I think we should strive to provide the most similar cluster-api experience across all the different providers. This makes behaviour predictable across any cloud, providing better UX for cluster-api. It also makes it easier to build further common abstractions and higher-level tooling on top of the cluster-api core primitives, yielding added value for end consumers and favouring the development of the cluster-api community ecosystem.

I think we should keep narrowing down the scope of the provider implementations to purely non-opinionated cloud API interactions, let the core cluster-api make some decisions and enforce the semantics, commonalities, and invariants that we require for a provider to be compliant, and let the flexibility happen on top of the core cluster/machine API.

how infrastructure providers should (or must) handle permanently failed servers/VMs

I would include in the list of permanent failures a few other use cases in addition to a cloud instance being removed out of band, e.g. invalid infra API input that sneaked past pre-persistence validation, or a cloud zone outage. For all the scenarios falling into that bucket we already have a cloud-agnostic semantic: https://github.com/kubernetes-sigs/cluster-api/blob/master/docs/proposals/20190610-machine-states-preboot-bootstrapping.md#failed

+1 to a machine lifecycle policy. I think VM lifecycle varies too much across cloud providers to find a single policy that will fit all the use cases. The AWS EC2 vs vSphere VM case is already drastically different.

What about a retention policy with options being retain or delete. If the server is in error (missing or some actual error):

retain = don't do anything
delete = delete the server (if it exists)
Then to handle the need to recreate, if the Machine belongs to a MachineSet, the MachineSet logic could identify the unhealthy Machine, delete it, and create a replacement.

@ncdc @detiber I think we should try to unify provider behaviour/expectations to reduce side effects and UX skew, and let the flexibility be built on top of the core, upper-level cluster/machine layer.
If the end user's desired experience is for a new instance to be recreated, then we should leverage the common cluster/machine API that we are building exactly to reduce the cross-provider maintainability burden and to abstract away commonalities like this.
i.e. an ad-hoc controller (analogous to pod GC) could watch the phase="failed" machines and make decisions based on the strategy of choice, e.g. delete the machine object and let the machineSet create a new one. This could be a "machine health check controller" and could eventually also watch node conditions, enabling smarter plumbing, e.g. https://github.com/kubernetes/node-problem-detector.
Now the big difference and benefit is that this controller would know nothing about providers (that's an implementation detail); instead it just needs to know about the common layer, the cluster/machine API. So we can simplify the ecosystem and reduce unpredictable side effects by enforcing the 1:1 machine to cloud instance relationship as an invariant, while providing the same end-user value at the right layer.
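
A minimal sketch of such an ad-hoc controller, using controller-runtime. The wiring (manager setup, watches, RBAC) is omitted, and the ErrorReason/ErrorMessage field names and import paths follow the v1alpha2 naming used in this thread, so treat it as an approximation rather than working code:

```go
package remediation

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1alpha2"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// FailedMachineReconciler knows nothing about infrastructure providers; it only
// looks at the common Machine API.
type FailedMachineReconciler struct {
	Client client.Client
}

func (r *FailedMachineReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var machine clusterv1.Machine
	if err := r.Client.Get(ctx, req.NamespacedName, &machine); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Only act on machines that report a terminal failure.
	if machine.Status.ErrorReason == nil && machine.Status.ErrorMessage == nil {
		return ctrl.Result{}, nil
	}

	// Only remediate machines owned by a MachineSet, so the set replaces the replica;
	// standalone machines are left for a human (or other tooling) to handle.
	if owner := metav1.GetControllerOf(&machine); owner == nil || owner.Kind != "MachineSet" {
		return ctrl.Result{}, nil
	}

	// Deleting the Machine lets the owning MachineSet create a fresh replacement.
	return ctrl.Result{}, r.Client.Delete(ctx, &machine)
}
```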

I think something else that is related to this is what the expected behaviour of the MachineSet controller should be if it encounters a Machine that has its ErrorReason/ErrorMessage set. I'd like us to consider having the MachineSet controller delete Machines in this state, so replacement replicas can be created.

If anything, I think we could bubble up the status.errorMessage from the machines to machineSet.status.errorMessage when calculating status, for the sake of a better UX. I think we should keep the machineSet controller intentionally narrow in responsibility, just ensuring the desired number of replicas. The ad-hoc component described above would contain the domain knowledge and provide the flexibility to react to the different scenarios, conveyed in cloud-agnostic semantics.

This would in essence favour the interaction between these composable building blocks (machine, machineSet, machine health checker) in a cloud-provider-agnostic manner, which I believe is among the core principles of cluster-api.

I'm having trouble finding the notes/discussions, but iirc the rationale was that since a Machine is an abstraction over a Kubernetes Node, if we destroy/recreate a Machine as part of recovery of errors, then we will generally get a completely different Node object rather than the same Node that previously existed.

@detiber I'd love to see those notes if you manage to find them.

@ncdc ncdc modified the milestones: v0.2.x (v1alpha2), Next Sep 4, 2019
@timothysc timothysc removed their assignment Sep 26, 2019
@timothysc timothysc added the help wanted and priority/important-longterm labels and removed the priority/important-soon label Sep 26, 2019
@randomvariable
Member

/assign

@vincepri
Member

vincepri commented Oct 1, 2019

@randomvariable I'd hold off on documenting this task until we discuss it during the community meeting or resolve the conversations with the stakeholders above

@ncdc
Contributor Author

ncdc commented Oct 30, 2019

Bumping this to try to come to a resolution.

I like @dlipovetsky's comment:

I think that replacing an old instance with a new one should be equivalent to power cycling the "old" instance.

If, for example, it is not possible to assign the old instance's network identity to the new instance and restore the old instance's on-disk state to the new instance, then the replacement must not happen.

I also don't want to have a muddy UX, which could easily happen if we introduce restart policies and the set of supported policies varies per provider.

I have a couple of options for language we can consider adopting:

  1. A Machine represents infrastructure for a single Kubernetes Node. The Node has a specific, single, immutable identity (its name). An infrastructure provider may, at its discretion, replace a Machine's infrastructure with replacement infrastructure, as long as the Node's identity remains consistent and immutable.

  2. A Machine represents infrastructure for a single Kubernetes Node. The Node has a specific, single, immutable identity (its name). An infrastructure provider MUST only ever create a single instance of the infrastructure (virtual machine, bare metal server, or the equivalent) for a Machine. In the event the infrastructure fails, the infrastructure provider MUST mark the infrastructure resource ("AcmeMachine") as failed, setting status.errorReason and status.errorMessage accordingly. This is similar to a Pod with a RestartPolicy of Never.

Word-smithing notwithstanding, how do y'all feel about these?
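
For option 2, a minimal sketch of what the provider-side contract might look like. Everything here (the "Acme" types, the cloud client, the error reason strings) is invented for illustration; the point is only that the provider never creates a second instance and surfaces a terminal failure through errorReason/errorMessage:

```go
package acme

import (
	"context"
	"errors"
)

// AcmeMachineStatus mirrors the provider-side status fields option 2 relies on.
type AcmeMachineStatus struct {
	Ready        bool
	ErrorReason  *string
	ErrorMessage *string
}

// AcmeMachine is a stand-in for a provider's infrastructure machine CRD.
type AcmeMachine struct {
	ProviderID string
	Status     AcmeMachineStatus
}

// CloudClient is a stand-in for the provider's SDK.
type CloudClient interface {
	// GetInstance returns ErrInstanceNotFound if the instance no longer exists.
	GetInstance(ctx context.Context, providerID string) (*Instance, error)
}

// Instance is a stand-in for the cloud's server/VM object.
type Instance struct{ State string }

var ErrInstanceNotFound = errors.New("instance not found")

func strPtr(s string) *string { return &s }

// reconcileInstance never creates a second instance for a machine. When the backing
// instance is gone it marks the AcmeMachine as permanently failed, analogous to a Pod
// with RestartPolicy=Never; remediation is left to higher layers (e.g. a MachineSet).
func reconcileInstance(ctx context.Context, cloud CloudClient, m *AcmeMachine) error {
	instance, err := cloud.GetInstance(ctx, m.ProviderID)
	switch {
	case errors.Is(err, ErrInstanceNotFound):
		m.Status.ErrorReason = strPtr("InstanceNotFound")
		m.Status.ErrorMessage = strPtr("backing instance no longer exists and will not be recreated")
		return nil // terminal state: stop reconciling
	case err != nil:
		return err // transient error: retry
	}

	if instance.State == "terminated" {
		m.Status.ErrorReason = strPtr("InstanceTerminated")
		m.Status.ErrorMessage = strPtr("backing instance was terminated out of band")
		return nil
	}

	m.Status.Ready = true
	return nil
}
```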

@vincepri
Member

Even though option 2 is the most restrictive, it's the one I'm leaning towards the most, given that it most resembles the current state of things.

@detiber
Member

detiber commented Oct 31, 2019

The thing that I worry about the most with option 2 is that we mark an InfraMachine and Machine resource as a permanent error, but the infrastructure recovers before workloads are evacuated, and now we potentially disrupt workloads even more by deleting the Machine/InfraMachine through remediation when we could have left them alone.

@michaelgugino
Contributor

I'm for option 2. It's how the cloud typically works. A machine represents a request for a unit of compute resource, and then serves as a record of that request.

When you create an instance in AWS or a similar cloud provider, the cloud gives you that instance, and tells you its ID. What happens to that instance later is entirely up to you. There's no mechanism to say "If this instance fails, give me an identical one, and also pretend it was the one that was running previously with same networking, same instance ID, same everything."

Machines are by nature imperative because the cloud is imperative. Machines are merely a wrapper to the imperative cloud. MachineSets and higher-order things are the declarative layer, and that's where the recreation, if any, should be handled.

@michaelgugino
Contributor

The thing that I worry about the most with option 2 is that we mark an InfraMachine and Machine resource as a permanent error, but the infrastructure recovers before workloads are evacuated, and now we potentially disrupt workloads even more by deleting the Machine/InfraMachine through remediation when we could have left them alone.

For me, this is why remediation of machines that are already nodes should always start by looking at the node. It doesn't make a lot of sense to me to keep track of what's happening to the actual instance. The node will go unready, pod health checks will fail. We remediate the node, not the machine. In doing so, we cordon, and then drain, and then delete the backing instance. If the node has failed for 5 minutes, then let's clean it up; the machineset will give us a new one.

@ncdc
Contributor Author

ncdc commented Oct 31, 2019

I wonder if we can combine elements from what I wrote along with @michaelgugino's comments and some edits to address @detiber's concerns:

  • 1:1 relationship between Machine & backing infra
  • Infra provider MUST NOT create more than one "backing infra" instance for a Machine
  • Infra provider MAY detect permanent failures and mark the "AcmeMachine" as failed (e.g., for AWS, if the ec2 instance is terminated or not found, this can immediately be identified as failed)

Potentially out of scope for this discussion, or pending the outcome of #1684, add some text about how CAPI should handle failed Nodes.

WDYT?
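
To illustrate the AWS example in the last bullet, a hedged sketch of the "terminated or not found" check using aws-sdk-go. The EC2 API calls, error code, and state constant come from the SDK, but the surrounding AWSMachine reconciler, and how the result feeds into errorReason/errorMessage, are left out:

```go
package capa

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

// instancePermanentlyFailed reports whether the EC2 instance backing a machine is gone
// for good (not found, or terminated), i.e. whether the AWSMachine could immediately be
// marked as failed.
func instancePermanentlyFailed(client ec2iface.EC2API, instanceID string) (bool, error) {
	out, err := client.DescribeInstances(&ec2.DescribeInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
	if err != nil {
		// A missing instance surfaces as an API error rather than an empty result.
		if aerr, ok := err.(awserr.Error); ok && aerr.Code() == "InvalidInstanceID.NotFound" {
			return true, nil
		}
		return false, err // transient or unknown error: retry, don't mark failed
	}

	for _, reservation := range out.Reservations {
		for _, instance := range reservation.Instances {
			if aws.StringValue(instance.State.Name) == ec2.InstanceStateNameTerminated {
				return true, nil
			}
			return false, nil // instance exists in a non-terminal state
		}
	}
	// No matching instance returned: treat as not found.
	return true, nil
}
```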

@randomvariable
Member

Agree with that. That also ties in with the changes in kubernetes-sigs/cluster-api-provider-aws#1256

@michaelgugino
Contributor

@ncdc I agree with that phrasing.

@voor
Member

voor commented Dec 3, 2019

I had an opportunity to re-read through this conversation and I wanted to bubble up some things to make sure they're not being lost in the shuffle:

There was a suggestion for a machine lifecycle policy, which was argued against because there are multiple implementing providers that might not be able to support all the provided policies -- could/should there be a follow-on conversation on potentially a higher-level controller to add some of this auto-remediation (similar to what was possibly outlined here)? If that discussion is happening, maybe provide a link to that issue? EDIT: Ha, it's here #1684

In several of the cloud-based products that I've encountered, infrastructure self-healing is usually presented as a feature, and an often desired one. If the default behavior of the Cluster API is to do nothing to unhealthy machines, that should be clearly documented, along with the additional mechanisms available to address that gap.

@michaelgugino
Contributor

If the default behavior of the Cluster API is to do nothing to unhealthy machines

I think the default behavior of 'cluster api' as a set of components would be to remediate that machine. The question is where that remediation takes place. For a variety of reasons, some of us believe that should be an entirely separate concern. In most cases, we want to remediate based on node conditions, not machine conditions, and it makes sense to move the logic to that component.

@voor

@randomvariable randomvariable removed their assignment Dec 16, 2019
@vincepri
Member

Is it fair to close this now that Machine health check & remediation bits have been shipped in v0.3.0?

@JoelSpeed @enxebre

@JoelSpeed
Contributor

I wasn't privy to this conversation originally, but having scanned through, if #1205 (comment) is satisfied, then yes, MHC will solve this issue and remediate any machine which goes into a permanently failed state (if it has failureReason set, IIRC)

@vincepri
Member

good enough for me! I think we can close this now, if there is any other follow-up we can either tackle that separately or reopen it later

/close

@k8s-ci-robot
Contributor

@vincepri: Closing this issue.

In response to this:

good enough for me! I think we can close this now, if there is any other follow-up we can either tackle that separately or reopen it later

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
