
KEP-4563: Evacuation API #4565

Open
atiratree wants to merge 1 commit into master from evacuation-api

Conversation

atiratree (Member)

  • One-line PR description: introduce Evacuation API to allow managed graceful pod removal

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Mar 28, 2024
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 28, 2024
Comment on lines +228 to +236
We will introduce a new term called evacuation. This is a contract between the evacuation instigator,
the evacuee, and the evacuator. The contract is enforced by the API and an evacuation controller.
We can think of evacuation as a managed and safer alternative to eviction.
Contributor

Watch out for the risk of confusing end users.

We already have preemption and eviction and people confuse the two. Or three, because there are two kinds of eviction. And there's disruption in the mix.

Do we want to rename Scheduling, Preemption and Eviction to Scheduling, Preemption, Evacuation and Eviction?

Member Author

Good idea, I added a mention of which kind of eviction I mean here.

Do we want to rename Scheduling, Preemption and Eviction to Scheduling, Preemption, Evacuation and Eviction?

Yes, I think we want to add a new concept there and generally update the docs once we have an alpha.

Member

I'm with Tim here.

Preemption vs Eviction is already quite confusing. And TBH, I couldn't fully understand what the "evacuation" is supposed to solve by reading the summary or motivation.

From Goals:

Changes to the eviction API to support the evacuation process.

If this is already going to be part of the Eviction API, maybe it should be named as a form of eviction. Something like "cooperative eviction" or "eviction with ack" or something along those lines?

Contributor

I'm all for framing it as another type of eviction; we already have two, so the extra cognitive load for users is not so much a problem.

Member Author (@atiratree, May 8, 2024)

@alculquicondor I have updated the summary and goals, I hope it makes more sense now.

I think the name should make the most sense to the person creating the Evacuation (Evacuation Instigator ATM). So CooperativeEviction or EvictionWithAck is a bit misleading IMO, because from that person's perspective there is no additional step required of them. Only the evacuators and the evacuation controller implement the cooperative evacuation process, but this is hidden from the normal user.

My suggestions:

  • GracefulEviction (might confuse people if it is associated with graceful pod termination, which it is not)
  • SafeEviction (*safer than the API-initiated one for some pods)
  • Or just call it Eviction? And tell people to use it instead of the Eviction API endpoint? This might be a bit confusing (at least in the beginning)

Contributor

We can bikeshed names for the API kind; I'd throw a few of my own into the hat:

  • EvictionRequest
  • PodEvictionRequest

Comment on lines 1130 to 1302
<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
Contributor

We have a number of examples of having a SomethingRequest or SomethingClaim API that then causes a something (certificate signing, node provisioning, etc).
Think of TokenRequest (a subresource for a ServiceAccount), or CertificateSigningRequest.

I would confidently and strongly prefer to have an EvictionRequest or PodEvictionRequest API, rather than an Evacuation API kind.

It's easy to teach that we have evictions and that an EvictionRequest is asking for one to happen; it's hard to teach the difference between an eviction and an evacuation.

As a side effect, this makes the feature gate easier to name (eg PodEvictionRequests).

Member Author

As you have mentioned, we already have different kinds of eviction. So I think it would be good to use a completely new term to distinguish it from the others.

Also, Evacuation does not always result in eviction (and PDB consultation). It depends on the controller/workload. For some workloads like DaemonSets and static pods, API eviction has never worked before. This could also be very confusing if we name it the same way.

I think Evacuation fits this better because

  1. The name is shorter. If we go with EvacuationRequest then the evacuation will become just an abstract term and less recognizable.
  2. It seems it will have quite a lot of functionality included (state synchronization between multiple instigators and multiple evacuators, state of the evacuee and evacuation). TokenRequest and CertificateSigningRequest are simpler and not involved in a complex process.

Contributor

I suggested EvictionRequest so that we don't have to have a section with the (too long) title: Scheduling, Preemption, Evacuation and Eviction. Not EvacuationRequest.

Adding another term doesn't scale so well: it means helping n people understand the difference between evacuation and eviction. It's a scaling challenge where n is not only large, it probably includes quite a few Kubernetes maintainers.

Contributor

As for CertificateSigningRequest being simple: I don't buy it. There are three controllers, custom signers, an integration with trust bundles, the horrors of ASN.1 and X.509… trust me, it's complicated enough.

Member Author

I understand that it will be confusing for people, but that will happen regardless of what term we use.

My main issue is that evacuation does not directly translate to eviction. Therefore, I think it would be preferable to choose a new term (not necessarily evacuation).

I would like to get additional opinions from people about this. And we will definitely have to come back to this in the API review.

keps/sig-apps/4563-evacuation-api/README.md (outdated review thread, resolved)
Comment on lines 253 to 255
The evacuee can be any pod. The owner/controller of the pod is the evacuator. In a
special case, the evacuee can be its own evacuator. The evacuator should decide what action to take
when it observes an evacuation intent:
Contributor

This doesn't match common English use. An evacuator would be something initiating evacuation, not the owner / manager of the thing being evacuated.

Member Author

I don't know: It seemed to fit somehow, according to the dictionary ^^

a person or thing that evacuates.

In my view, even though it is the owner/manager of the pod, it also implements the evacuation part (evacuates).

Nevertheless, I am open to suggestions.

How about Evacuator -> Evacuation Responder -> Evacuee?

keps/sig-apps/4563-evacuation-api/README.md (6 outdated review threads, resolved)
@atiratree atiratree force-pushed the evacuation-api branch 8 times, most recently from 13611ce to 2d15b79 Compare March 28, 2024 20:48
@atiratree (Member Author)

  • I have updated the KEP to include support for multiple evacuators
  • Evacuators can now advertise which pods they are able to evacuate, even before the evacuation. The advantage of this approach is that we can trigger an eviction immediately without a delay (was known as acceptDeadlineSeconds) if we do not find an evacuator. I have added a bunch of restrictions to ensure the API cannot be misused.
  • Clarified how the Evacuation objects should be named and how the admission should work in general (also for pods). This will ensure a 1:1 mapping between pods and Evacuations.
  • Removed the ability to add a full reference of the evacuator because it would be a hassle to synchronize considering the evacuator leader election and multiple evacuators in play.


Example evacuation triggers:
- Node maintenance controller: node maintenance triggered by an admin.
- Descheduler: descheduling triggered by a descheduling rule.
Contributor

If the descheduler requests an eviction, what thing is being evacuated?

(the node maintenance and cluster autoscaler examples are easier: you're evacuating an entire node)

Member Author

A single pod or multiple pods. The descheduler can use it as a new mechanism instead of eviction.

Contributor

OK, so people typically think of “evacuate” as a near synonym of “drain” - you drain a node, you evacuate a rack or zone full of servers. Saying that you can evacuate a Pod might make people think its containers all get stopped, or just confuse readers. We do need to smooth out how we can teach this.

Member Author

It seems it can be used in both scenarios https://www.merriam-webster.com/grammar/can-you-evacuate-people.
Evacuation of containers doesn't make sense because they are tied to the pod lifecycle. But, I guess it could be confusing if we do not make it explicitly clear what we are targeting.

Contributor (@sftim, Jun 3, 2024)

Thing is, Kubernetes users typically - before we accept this KEP - use “evacuate” as a synonym for drain.

I'm (still) fine with the API idea, and still concerned about the naming.

Contributor

Just to +1 the potential confusion of the term "evacuation".
Is it okay to have a glossary of terms for "evacuation", "eviction", and "drain" (or any other potentially confusing terms) added somewhere in this KEP?

Member Author

I can include it in the KEP. And yes, we are going to change the name to something more suitable.

Member

"To evacuate a person" implies "get them out of trouble, to safety" as opposed to "to empty" (as in vacuum). It's not ENTRIELY wrong in this context, but it's not entirely right either.

Member Author

I will change the name to EvictionRequest. It was originally chosen to distinguish it from eviction, but there is value in making it familiar to existing concepts.

@atiratree atiratree force-pushed the evacuation-api branch 6 times, most recently from 1900020 to 085672a Compare April 10, 2024 21:47
keps/sig-apps/4563-evacuation-api/README.md (review thread, resolved)

#### Deployment Controller

The deployment controller must first decide whether a particular Deployment supports the evacuation

Do we need to consider integration with HPA? During an evacuation, if HPA scales down a Deployment, do we just follow the existing ReplicaSet's scale-down behavior to delete those Pods that need to be reduced, aside from the Pods scheduled for termination in the Evacuation API?

Member Author

Regardless of the Deployment strategy, if the Deployment is scaled down during an evacuation, the ReplicaSet is scaled down and the pods with matching Evacuations can be terminated immediately by the ReplicaSet controller.

There are ways for the HPA to inject itself into the evacuation process. For example, it could become the Evacuation Instigator and select the pods that should be scaled down by creating Evacuations for supported workloads. This is not a feature that the HPA offers today, and I am not sure if it should. I think it is better to leave it out of this KEP.

Deployments and ReplicaSets should work with both manual and automatic scaling.

Member Author (@atiratree, Jun 3, 2024)

@toVersus I have added a new followup for the HPA use case (see HorizontalPodAutoscaler).

  • It should be possible for the HPA to become the evacuator of the workload, if Evacuations are observed.
  • I also roughly outline how evacuation conflicts should be resolved.


Thank you for adding the HPA controller section in the Follow-up design!

Upon detecting an evacuation of pods under the HPA workload, the HPA would increase the number of replicas by the number of evacuations.

Just to confirm, the behavior described here refers to when the HPA controller becomes the Evacuator (i.e., when evacuation priority annotations are added to the HPA object), right? To avoid conflicts with other pod controllers, I agree that the HPA controller should not perform evacuation by default.

Although this is something we can decide during future discussions about pod controller's Evacuation API support, another concern is whether the HPA controller will modify the replicas of Deployments, etc., during evacuation. If the pod controller increases the replicas to account for maxSurge and then the HPA controller modifies the replicas, there could be side effects. It seems simpler for the HPA controller not to perform scale-in or scale-out during evacuation, but if the evacuation takes a long time, not being able to scale in or scale out for a long period of time could itself be problematic.

Member Author

Yes, it should become an evacuator and add annotations to pods. Whether the evacuation should happen by default or not is up for discussion. I do not see a downside to adding it as a lower-priority evacuator.

The priorities and scaling operations depend on the workload. But it should never happen that both the deployment controller and the HPA increase replicas/pods at the same time.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: atiratree
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@atiratree (Member Author)

atiratree commented Jun 3, 2024

Added the following sections to the Alternatives:

Other changes:

  • Added recommendations for the evict vs. delete call: we usually cannot use the evict call because the evacuation may not be in a reversible state and there is a chance a PDB would block it.
  • Added a .status.evacuationCancellationPolicy field to allow evacuators to control the cancellation policy.
  • We still need the finalizers to track the instigators and which ones can cancel/remove the evacuation object.
  • Added affinity-based eviction to the motivation and use cases.
  • Updated validation and admission requirements.
  • Added HorizontalPodAutoscaler into Follow-up Design Details for Kubernetes Workloads section.
  • Filled in PRR.

@atiratree atiratree force-pushed the evacuation-api branch 4 times, most recently from f2e68a6 to 85f94b9 Compare June 4, 2024 16:22
Contributor — @soltysh left a comment

I have similar concerns as others expressed in #4213, and I've also voiced them there. There are several alternatives you've described here that you rejected due to simplicity. I'm not sure that an entirely new API will solve this problem. I'd first suggest tackling the smaller problems that we currently see with both PDBs and the Eviction API before overhauling this and introducing yet another API, which will most likely confuse users rather than provide them with the suggested flexibility.

An evacuator could reconcile the status properly without making any progress. It is thus
recommended to check `creationTimestamp` of the Evacuations and observe
`evacuation_controller_active_evacuator` metric to see how long it takes for an evacuator to
complete the evacuation. This metric can also be used to implement additional alerts.
Contributor

What about the risk behind adopting this new API? Most users are very slow to adopt these. My concern is that, aside from the naming confusion overload mentioned in https://github.com/kubernetes/enhancements/pull/4565/files#r1543242249, a lot of users will prefer the simpler eviction API that is already battle tested, even with its limitations, mostly because it is already widely used. Your proposed mechanism requires work on the k8s core side on one hand, and from application owners managing the evacuation process on the other.

I will repeat, similarly to what has been stated in #4213: we should revisit the PDB and eviction APIs, both of which are widely used, identify the gaps, and look at the options to extend them with the missing functionality. I don't think we need full-fledged, feature-rich functionality, but rather a functional and easy-to-understand API built on the existing primitives.

Member Author (@atiratree, Jun 7, 2024)

I do not see a big risk in terms of adoption.

The eviction API is mostly used by a handful of components (kubectl drain, cluster autoscaler, karpenter, scheduler, descheduler, etc.). It is enough to introduce the Evacuation API to a handful of components to see an impact. Users usually do not use this API by themselves - there is not even kubectl support.

From the evacuator side, there is no user action required either. Please see the follow-up workload integration in this KEP; DaemonSet is part of the NodeMaintenance KEP.

The user can take action if they want to further improve the situation:

At this time, I do not see any extensions that could be made to PDBs and eviction API to solve the current issues. Please see the analysis in the alternatives section: https://github.com/atiratree/kube-enhancements/blob/improve-node-maintenance/keps/sig-apps/4212-declarative-node-maintenance/README.md#alternatives.

Member Author (@atiratree, Jun 7, 2024)

Apart from NodeMaintenance, we already have two additional use cases that would like to adopt the Evacuation API.

Contributor

Descheduler integration (kubernetes-sigs/descheduler#1354)
Why not align with the effort that scheduler is pushing to start using evictions (see #3280)?

Member Author

That is too specific to preemption and expects pods to be ordered by priority. It would not help in cases where priority is not used or cooperative eviction is needed.

@atiratree (Member Author)

There are several alternatives you've described here that you rejected due to simplicity.

Do you mean the Pod API? We have analyzed the problem and found serious concurrency, observability, and extensibility issues.

I'd first suggest tackling the smaller problems that we currently see with both PDBs and the Eviction API before overhauling this and introducing yet another API, which will most likely confuse users rather than provide them with the suggested flexibility.

I would prefer not to have a new API, but at this time, I do not see how we can incrementally improve the situation for PDBs and the eviction API to address the problems at hand.

We need to come up with a good name so that it is not such a big hurdle for users to understand. Fortunately, most of the time, this API will be used internally as a building block by the cluster components, similar to the eviction API today.

@atiratree atiratree force-pushed the evacuation-api branch 2 times, most recently from b55bd3d to cd1dcdd Compare June 10, 2024 15:12
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 11, 2024
@atiratree (Member Author)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 19, 2024
Member — @thockin left a comment

I am out of time for now, but I have to admit I am worried about how this will play out. Coordinating between N arbitrary actors via the API is not something we really do anywhere else (AFAIK "classic" DRA was the one place, and that is being removed).

Can we pin down some very concrete examples of what REAL evacuators might really do, and how the lifecycle works, especially WRT cancellation of an eviction? You can lose the details and just tell a story. This is significantly complicated and I don't feel able to defend it right now.

Could this be positioned like "readiness gates"? That sort of feels similar.


The major issues are:

1. Without extra manual effort, an application running with a single replica has to settle for
Member

I'm torn on this as a motivating case. If you have a single-replica app, you shouldn't really be allowed to use a PDB (at least not as defined).

Member Author

Unfortunately, we made it possible and people do it. They also utilize validation webhooks/controllers to implement their desired behavior.


- After a partial cleanup (e.g. storage migrated, notification sent) and if the application is
still in an available state, the eviction API can be used to respect PodDisruptionBudgets.

In the end, the evacuation should always end with a pod being deleted (evict or delete call) by one
Member

Earlier you said "The instigator's responsibility is to communicate an intent ... it should remove its intent when the evacuation is no longer necessary." Reconcile that with "evacuation should always end with a pod being deleted"?

Member Author

Right, we should make it clear that the pod might not end up deleted/terminated. I also want to add a mechanism that would result in the pod termination, but not deletion.

Contributor

In this context (or in general), what is the difference of "pod termination" and "pod deletion"?

Member Author

"pod termination" -> pod terminates but the pod object is kept in the cluster/etcd)- meant for pods with OnFailure, Never restart policies
"pod deletion" -> pod terminates and the pod object is also deleted - meant for pods with Always restart policy

Contributor

thank you. so you mean by default, evacuation aims to delete, but there will be a separate set of rules for this mechanism to determine if a pod is going to be terminated instead?

Member Author

thank you. so you mean by default, evacuation aims to delete,

yes

but there will be a separate set of rules for this mechanism to determine if a pod is going to be terminated instead?

I think it should be decided by the evacuator through the Evacuation API whether it is useful to keep the pod around after the termination.

There can be many evacuation instigators for a single Evacuation.

When an instigator decides that a pod needs to be evacuated, it should create an Evacuation:
- With a predictable and consistent name in the following format: `${POD_UID}-${POD_NAME_PREFIX}`.
Member

I don't love magical hardcoded formatted strings. Why is it not sufficient to use POD_NAME or, if you really want to handle the ABA case, just POD_UID? What I mean is that we already have places that assert that two things are linked 1:1 by having the same name. In this case, I suppose it's possible that you have:

  • Pod "foo" exists
  • create evacuation "foo", with pod's UID inside
  • delete pod foo
  • create pod foo
  • try to create evacuation "foo", with new UID inside, fail

I (maybe wrongly) presume there's a cleanup loop that removes evacuations which have no matching pod?

Member Author

The name of the Evacuation object must be unique and predictable for each pod to prevent the creation of multiple Evacuations for the same pod. We do not expect the evacuators to support interaction with multiple Evacuations for a pod.

I was trying to combine ease of use (POD_NAME) and uniqueness of each pod instance (POD_UID).

  • Having name only: we can solve the uniqueness by an admission plugin and checking .spec.podRef.
  • Having ID only: we can let clients (kubectl) print out the name from .spec.podRef.

Even though we do not expect a big number of instigators, I agree that ${POD_UID}-${POD_NAME_PREFIX} is overly complicated for the end user, especially with the character-count limit.

I (maybe wrongly) presume there's a cleanup loop that removes evacuations which have no matching pod?

Yes, there is a GC, but that can have a delay. Nevertheless, I am mainly concerned with having the 1:1 mapping, to make the implementation easier for all parties (mainly Evacuators).
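
For illustration, a minimal sketch of what an Evacuation created by an instigator might look like under this naming scheme. Only the quoted name format, `.spec.podRef`, and `.spec.progressDeadlineSeconds` come from this thread and the surrounding KEP excerpts; the API group/version and everything else are assumptions, not the final API:

```yaml
# Hypothetical sketch only; group/version and most field names are assumed.
apiVersion: apps.k8s.io/v1alpha1        # assumed group/version
kind: Evacuation
metadata:
  # ${POD_UID}-${POD_NAME_PREFIX}, truncated to fit the object name length limit
  name: 0b3f57b2-8d35-4f8c-9a1c-0c9a7b6d2e11-nginx-6d4cf56db
  namespace: blueberry
spec:
  podRef:                               # 1:1 mapping to the evacuee, checked on admission
    name: nginx-6d4cf56db-xk2vq
    uid: 0b3f57b2-8d35-4f8c-9a1c-0c9a7b6d2e11
  progressDeadlineSeconds: 1800         # default discussed later in this conversation
```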

after the previous pod is removed.
- `.spec.progressDeadlineSeconds` should be set to a reasonable value. It is recommended to leave
it at the default value of 1800 (30m) to allow for potential evacuator disruptions.
- `.spec.evacuators` value will be resolved on the Evacuation admission from the pod and should
Member

I have gotten this far, but I really don't know what this is describing. This is a very long and detailed KEP, but I think it would really benefit from an overview with fewer details. E.g.

"""
Pods carry new information about which "evacuators" (described later) are involved in their lifecycle. Multiple evacuation initiators can coordinate, via a new "Evacuation" API object, to indicate the need for a pod to be removed from a node. This new object allows the evacuators to take action before the pod is removed, and...
"""

Member Author

I will go over the whole KEP and try to make it more accessible.

characters (`63 - len("priority_")`)
- `PRIORITY` and `ROLE`
- `controller` should always set a `PRIORITY=10000` and `ROLE=controller`
- other evacuators should set `PRIORITY` according to their own needs (minimum value is 0,
Member

is 0 "highest priority" or "lowest priority"? What does this priority govern? Can two of them have the same prio?

Member Author

0 is "lowest priority"

What does this priority govern?

The order of Evacuator evaluation. The highest priority evacuators are evaluated first.

Can two of them have the same prio?

I think they can, but then their order is not defined in the API and they are most likely just ordered alphabetically by their class.

One exception is 10000/controller, which should be unique and set by the controller.
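
For illustration, a rough sketch of how the evacuator advertisement annotations on a pod might look, based on the `priority_` key prefix and the PRIORITY/ROLE values quoted above. The annotation key domain, the class names, the non-controller role value, and the PRIORITY/ROLE encoding are assumptions, not the KEP's final format:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-6d4cf56db-xk2vq
  annotations:
    # the managing controller registers itself with the reserved priority and role
    evacuation.coordination.k8s.io/priority_replicaset: "10000/controller"
    # a lower-priority custom evacuator class; role value here is only a placeholder
    evacuation.coordination.k8s.io/priority_backup-agent: "1000/evacuator"
```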

namespace: blueberry
```

The evacuator should observe the evacuation objects that match the pods that the evacuator manages.
Member

How does an evacuator know which pods it manages?

Member Author

This is evacuator specific.

  • It can use the same labels to select the Evacuation as it does for its own pods (e.g. the Deployment's .spec.selector); see the sketch below. We will add the labels to the Evacuation on admission.
  • Others might have annotations/ownerReferences on the pod and then need to pair them.
  • Some might react to all pods (e.g. log the state of each pod).
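
A minimal sketch of the label-based pairing from the first bullet above. That pod labels are copied onto the Evacuation at admission comes from this reply; the concrete labels, names, kind, and group/version are illustrative assumptions:

```yaml
# Evacuation as it might look after admission, with the evacuee pod's labels copied on
# so that a workload controller can pair it using its own .spec.selector (e.g. app: web).
apiVersion: apps.k8s.io/v1alpha1       # assumed group/version
kind: Evacuation                       # assumed kind and fields, see note above
metadata:
  name: 7f3a2c1d-9e4b-4a6f-b2d3-5c6e7f8a9b0c-web-7f9c6bdd8
  namespace: blueberry
  labels:
    app: web                           # copied from the pod on admission
spec:
  podRef:
    name: web-7f9c6bdd8-abcde
```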

```

The evacuator should observe the evacuation objects that match the pods that the evacuator manages.
It should start the evacuation only if it observes `.status.activeEvacuatorClass` that matches the
Member

Whose status? Pod or Evacuation?

Member Author

Evacuation.

It should start the evacuation only if it observes `.status.activeEvacuatorClass` that matches the
`EVACUATOR_CLASS` it previously set in the annotation.

If the evacuator is not interested in evacuating the pod, it should set
Member

Elsewhere you talked about an evacuation not being necessary and being cancelled. Can it be cancelled once evacuators have started running? What if they have done something irreversible? Are they all expected to be reversible? Idempotent? Do they get called to reverse an evacuation?

Contributor

The inspiration for this was (partly) the declarative node maintenance KEP - #4212

When declaring a node as under maintenance, it feels implicit that you can end the maintenance at any time, either as a cancellation or by marking it complete. That's how organizations getting humans to maintain things typically model it.

(also, you can't “reverse“ a maintenance notification; you can rescind it, but you can't create a second maintenance period that overlaps with and cancels out the first one).

Member

My point was that I can't tell, from this KEP, how to answer those questions. If those answers are in there, they are buried.

Member Author

Elsewhere you talked about an evacuation not being necessary and being cancelled.

I am sorry, I will make this part more clear.

Can it be cancelled once evacuators have started running? What if they have done something irreversible? Are they all expected to be reversible? Idempotent? Do they get called to reverse an evacuation?

As @sftim has said, one of the consumers/inspirations is the NodeMaintenance or other projects that might want to stop the maintenance.

The cancellation should be under the control of the current Evacuator. Evacuators should offer the ability to cancel if possible, but they can forbid it if the termination is irreversible. Please see .status.evacuationCancellationPolicy.

`.spec.progressDeadlineSeconds` and make sure to update the status periodically in less than that
duration. For example, if `.spec.progressDeadlineSeconds` is set to the default value of 1800 (30m),
it may update the status every 3 minutes. The status updates should look as follows:
- Check that `.status.activeEvacuatorClass` still matches the `EVACUATOR_CLASS` previously set in
Member

This feels racy and brittle - easy to miss "events" ?

Member Author

.status.activeEvacuatorClass is only reconciled/coordinated by the evacuation controller (single instance).

The reaction time can be tweaked with .spec.progressDeadlineSeconds.

I think the following values might be reasonable, but welcome any other suggestions.

	// The minimum value is 600 (10m) and the maximum value is 21600 (6h).
	// The default value is 1800 (30m).

FYI, this mechanism is very quick if there is no evacuator advertised on the pod. It will instantly fall back to eviction.
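
To illustrate the heartbeat pattern described here, a rough sketch of the relevant Evacuation fields. Only `.spec.progressDeadlineSeconds`, `.status.activeEvacuatorClass`, and the min/default/max values quoted above come from the thread; the condition used as a heartbeat and its values are hypothetical placeholders:

```yaml
spec:
  progressDeadlineSeconds: 1800        # default 30m; min 600 (10m), max 21600 (6h) per the comment above
status:
  activeEvacuatorClass: replicaset     # reconciled only by the evacuation controller
  conditions:
  - type: Progressing                  # hypothetical heartbeat condition refreshed by the active evacuator
    status: "True"
    reason: EvacuationInProgress
    message: Evacuator refreshed the status; updating roughly every 3 minutes.
    lastTransitionTime: "2024-06-07T12:00:00Z"
```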

Member Author — @atiratree left a comment

@thockin Thanks a lot for the feedback. I see there are still quite a lot of places to improve.

Can we pin down some very concrete examples of what REAL evacuators might really do, and how the lifecycle works, especially WRT cancellation of an eviction?

I guess the user stories are still pretty abstract and it is best to show the flow with real examples.

I will work on the updates soon.



### Non-Goals

- Implement evacuation capabilities in Kubernetes workloads (ReplicaSet, Deployment, etc.).
- Synchronizing of the evacuation status to the pod status.
Contributor

Does it mean that the pod status will not reflect that the pod is being evacuated? On a high level, how do we identify that a pod is currently being evacuated?

Member Author

You would have to pair the pod to an evacuation. I think it would be a nice followup once we are set on the Evacuation API to add for example a pod condition to indicate that. This thread is also relevant: #4565 (comment)

Contributor

so what we monitor is the existing evacuations and check the evacuation statuses instead? if an evacuation that failed to pair with a pod is eventually garbage collected, will the pod ever know it was intended to be evacuated?

Member Author

I think it would be a nice followup once we are set on the Evacuation API to add for example a pod condition to indicate that.

I think we can add this followup to the Beta graduation.

so what we monitor is the existing evacuations and check the evacuation statuses instead?

yes, for now

if an evacuation that failed to pair with a pod is eventually garbage collected, will the pod ever know it was intended to be evacuated?

The pairing is the responsibility of the Evacuator. Nevertheless, we should check as much as we can on admission.

Contributor

Thank you!

needs high availability either, due to a preference for a simpler deployment model, lack of
application support for HA, or to minimize compute costs. Also, any automated solution needs
to edit the PDB to account for the additional pod that needs to be spun to move the workload
from one node to another. This has been discussed in issue [kubernetes/kubernetes#66811](https://github.com/kubernetes/kubernetes/issues/66811)
Contributor

As far as I have read about the Evacuation API, it seems to always respect PDBs.
But I wish to know: will the Evacuation API be able to disable or ignore PDBs to help the case where the operator expects zero living nodes in the cluster (i.e. shutting down the cluster during weekends/holidays for maintenance and starting it again afterwards)?
The above case requires all nodes to shut down, and having PDBs with minAvailable: 1 or maxUnavailable: 0 will definitely block draining the nodes.
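
For reference, this is the kind of PodDisruptionBudget the comment has in mind; it uses the existing policy/v1 API (not part of this KEP) and, with a single matching replica, blocks any eviction-based drain:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: single-replica-app
spec:
  minAvailable: 1            # with one matching replica, no voluntary disruption is allowed
  selector:
    matchLabels:
      app: single-replica-app
```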

Member Author

The feature is not limited by PDBs; workloads can use them, but they can also bypass them if there is a controller that can make more advanced decisions.

Contributor

But for the k8s built-in controllers like Deployments, StatefulSets, etc., if a PDB is configured, how will the Evacuation API be able to bypass the disruption budget and allow node termination below the PDB-configured values?

Member Author

I will update the KEP soon. But the mechanism of optionally bypassing the PDB should not change, and it requires action from the evacuator. PTAL at the KEP or just the attached diagram: https://github.com/kubernetes/enhancements/pull/4565/files#diff-560381a7190e130fded9e8758d53d2e5c4b69c39838485fc7a3fb13be096ae47

Contributor

Looking at the diagram, in the innermost loop it says the Evacuator updates the Evacuation status. I just want to clarify the meaning of the part where it says:

"make sure the Evacuation status is up to date, evict the pod if it is not".

Does it mean if the evacuator failed to update the Evacuation status for a specific time, the pod will be terminated regardless?

Member Author

yup, this is implemented via the .spec.progressDeadlineSeconds, but the min/default values should be forgiving to allow for evacuator disruptions

Contributor

oh thank you. As I see it, does .spec.progressDeadlineSeconds double as the polling duration counting from the most recent evacuator update, waiting for other potential instigators?
