
KEP-4563: Evacuation API #4565

Open
atiratree wants to merge 1 commit into master from evacuation-api

Conversation

atiratree (Member)

  • One-line PR description: introduce Evacuation API to allow managed graceful pod removal

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Mar 28, 2024
@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Mar 28, 2024
Comment on lines +228 to +236
We will introduce a new term called evacuation. This is a contract between the evacuation instigator,
the evacuee, and the evacuator. The contract is enforced by the API and an evacuation controller.
We can think of evacuation as a managed and safer alternative to eviction.
Contributor

Watch out for the risk of confusing end users.

We already have preemption and eviction and people confuse the two. Or three, because there are two kinds of eviction. And there's disruption in the mix.

Do we want to rename Scheduling, Preemption and Eviction to Scheduling, Preemption, Evacuation and Eviction?

Member Author

Good idea, I added a mention of which kind of eviction I mean here.

Do we want to rename Scheduling, Preemption and Eviction to Scheduling, Preemption, Evacuation and Eviction?

Yes, I think we want to add a new concept there and generally update the docs once we have an alpha.

Member

I'm with Tim here.

Preemption vs Eviction is already quite confusing. And TBH, I couldn't fully understand what the "evacuation" is supposed to solve by reading the summary or motivation.

From Goals:

Changes to the eviction API to support the evacuation process.

If this is already going to be part of the Eviction API, maybe it should be named as a form of eviction. Something like "cooperative eviction" or "eviction with ack" or something along those lines?

Contributor

I'm all for framing it as another type of eviction; we already have two, so the extra cognitive load for users is not so much a problem.

Member Author (@atiratree, May 8, 2024)

@alculquicondor I have updated the summary and goals, I hope it makes more sense now.

I think the name should make the most sense to the person creating the Evacuation (Evacuation Instigator ATM). So CooperativeEviction or EvictionWithAck is a bit misleading IMO, because from that person's perspective there is no additional step required of them. Only the evacuators and the evacuation controller implement the cooperative evacuation process, but this is hidden from the normal user.

My suggestions:

  • GracefulEviction (might confuse people if it is associated with graceful pod termination, which it is not)
  • SafeEviction (*safer than the API-initiated one for some pods)
  • Or just call it Eviction? And tell people to use it instead of the Eviction API endpoint? This might be a bit confusing (at least in the beginning)

Contributor

We can bikeshed names for the API kind; I'd throw a few of my own into the hat:

  • EvictionRequest
  • PodEvictionRequest

Comment on lines 1130 to 1302
<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
Contributor

We have a number of examples of having a SomethingRequest or SomethingClaim API that then causes a something (certificate signing, node provisioning, etc).
Think of TokenRequest (a subresource for a ServiceAccount), or CertificateSigningRequest.

I would confidently and strongly prefer to have an EvictionRequest or PodEvictionRequest API, rather than an Evacuation API kind.

It's easy to teach that we have evictions and that an EvictionRequest is asking for one to happen; it's hard to teach the difference between an eviction and an evacuation.

As a side effect, this makes the feature gate easier to name (eg PodEvictionRequests).

Member Author

As you have mentioned, we already have different kinds of eviction. So I think it would be good to use a completely new term to distinguish it from the others.

Also, Evacuation does not always result in eviction (and PDB consultation). It depends on the controller/workload. For some workloads like DaemonSets and static pods, API eviction has never worked before. This could also be very confusing if we name it the same way.

I think Evacuation fits this better because

  1. The name is shorter. If we go with EvacuationRequest then the evacuation will become just an abstract term and less recognizable.
  2. It seems it will have quite a lot of functionality included (state synchronization between multiple instigators and multiple evacuators, state of the evacuee and evacuation). TokenRequest and CertificateSigningRequest are simpler and not involved in a complex process.

Contributor

I suggested EvictionRequest so that we don't have to have a section with the (too long) title: Scheduling, Preemption, Evacuation and Eviction. Not EvacuationRequest.

Adding another term doesn't scale so well: it means helping n people understand the difference between evacuation and eviction. It's a scaling challenge where n is not only large, it probably includes quite a few Kubernetes maintainers.

Contributor

As for CertificateSigningRequest being simple: I don't buy it. There are three controllers, custom signers, an integration with trust bundles, the horrors of ASN.1 and X.509… trust me, it's complicated enough.

Member Author

I understand that it will be confusing for people, but that will happen regardless of what term we use.

My main issue is that evacuation does not directly translate to eviction. Therefore, I think it would be preferable to choose a new term (not necessarily evacuation).

I would like to get additional opinions from people about this. And we will definitely have to come back to this in the API review.

keps/sig-apps/4563-evacuation-api/README.md (outdated review thread, resolved)
Comment on lines 253 to 255
The evacuee can be any pod. The owner/controller of the pod is the evacuator. In a
special case, the evacuee can be its own evacuator. The evacuator should decide what action to take
when it observes an evacuation intent:
Contributor

This doesn't match common English use. An evacuator would be something initiating evacuation, not the owner / manager of the thing being evacuated.

Member Author

I don't know: It seemed to fit somehow, according to the dictionary ^^

a person or thing that evacuates.

In my view, even though it is the owner/manager of the pod, it also implements the evacuation part (evacuates).

Nevertheless, I am open to suggestions.

How about Evacuator -> Evacuation Responder -> Evacuee?

keps/sig-apps/4563-evacuation-api/README.md (6 outdated review threads, resolved)
@atiratree atiratree force-pushed the evacuation-api branch 8 times, most recently from 13611ce to 2d15b79 Compare March 28, 2024 20:48
@atiratree (Member Author)

  • I have updated the KEP to include support for multiple evacuators
  • Evacuators can now advertise which pods they are able to evacuate, even before the evacuation. The advantage of this approach is that we can trigger an eviction immediately without a delay (was known as acceptDeadlineSeconds) if we do not find an evacuator. I have added a bunch of restrictions to ensure the API cannot be misused.
  • Clarified how the Evacuation objects should be named and how the admission should work in general (also for pods). This will ensure a 1:1 mapping between pods and Evacuations.
  • Removed the ability to add a full reference of the evacuator because it would be a hassle to synchronize considering the evacuator leader election and multiple evacuators in play.


Example evacuation triggers:
- Node maintenance controller: node maintenance triggered by an admin.
- Descheduler: descheduling triggered by a descheduling rule.
Contributor

If the descheduler requests an eviction, what thing is being evacuated?

(the node maintenance and cluster autoscaler examples are easier: you're evacuating an entire node)

Member Author

A single pod or multiple pods. The descheduler can use it as a new mechanism instead of eviction.

Contributor

OK, so people typically think of “evacuate” as a near synonym of “drain” - you drain a node, you evacuate a rack or zone full of servers. Saying that you can evacuate a Pod might make people think its containers all get stopped, or just confuse readers. We do need to smooth out how we can teach this.

Member Author

It seems it can be used in both scenarios https://www.merriam-webster.com/grammar/can-you-evacuate-people.
Evacuation of containers doesn't make sense because they are tied to the pod lifecycle. But, I guess it could be confusing if we do not make it explicitly clear what we are targeting.

Contributor (@sftim, Jun 3, 2024)

Thing is, Kubernetes users typically - before we accept this KEP - use “evacuate” as a synonym for drain.

I'm (still) fine with the API idea, and still concerned about the naming.

Contributor

Just to +1 the potential confusion of the term "evacuation".
Is it okay to have a glossary of terms for "evacuation", "eviction", and "drain" (or any other potentially confusing terms) added somewhere in this KEP?

Member Author

I can include it in the KEP. And yes, we are going to change the name to something more suitable.

Member

"To evacuate a person" implies "get them out of trouble, to safety" as opposed to "to empty" (as in vacuum). It's not ENTRIELY wrong in this context, but it's not entirely right either.

Member Author

I will change the name to EvictionRequest. It was originally chosen to distinguish it from eviction, but there is value in making it familiar to existing concepts.

@atiratree atiratree force-pushed the evacuation-api branch 6 times, most recently from 1900020 to 085672a Compare April 10, 2024 21:47
keps/sig-apps/4563-evacuation-api/README.md (review thread, resolved)

#### Deployment Controller

The deployment controller must first decide whether a particular Deployment supports the evacuation

Do we need to consider integration with HPA? During an evacuation, if HPA scales down a Deployment, do we just follow the existing ReplicaSet's scale-down behavior to delete those Pods that need to be reduced, aside from the Pods scheduled for termination in the Evacuation API?

Member Author

Regardless of the Deployment strategy, if the Deployment is scaled down during an evacuation, the ReplicaSet is scaled down and the pods with matching Evacuations can be terminated immediately by the ReplicaSet controller.

There are ways for the HPA to inject itself into the evacuation process. For example, it could become the Evacuation Instigator and select the pods that should be scaled down by creating Evacuations for supported workloads. This is not a feature that the HPA offers today, and I am not sure if it should. I think it is better to leave it out of this KEP.

Deployments and ReplicaSets should work with both manual and automatic scaling.

Member Author (@atiratree, Jun 3, 2024)

@toVersus I have added a new followup for the HPA use case (see HorizontalPodAutoscaler).

  • It should be possible for the HPA to become the evacuator of the workload, if Evacuations are observed.
  • I also roughly outline how evacuation conflicts should be resolved.


Thank you for adding the HPA controller section in the Follow-up design!

Upon detecting an evacuation of pods under the HPA workload, the HPA would increase the number of replicas by the number of evacuations.

Just to confirm, the behavior described here refers to when the HPA controller becomes the Evacuator (i.e., when evacuation priority annotations are added to the HPA object), right? To avoid conflicts with other pod controllers, I agree that the HPA controller should not perform evacuation by default.

Although this is something we can decide during future discussions about pod controller's Evacuation API support, another concern is whether the HPA controller will modify the replicas of Deployments, etc., during evacuation. If the pod controller increases the replicas to account for maxSurge and then the HPA controller modifies the replicas, there could be side effects. It seems simpler for the HPA controller not to perform scale-in or scale-out during evacuation, but if the evacuation takes a long time, not being able to scale in or scale out for a long period of time could itself be problematic.

Member Author

Yes, it should become an evacuator and add annotations to pods. Whether the evacuation should happen by default or not is up for discussion. I do not see a downside to adding it as a lower-priority evacuator.

The priorities and scaling operations depend on the workload. But it should never happen that both the deployment controller and the HPA increase replicas/pods at the same time.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: atiratree
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@atiratree (Member Author)

atiratree commented Jun 3, 2024

Added the following sections to the Alternatives:

Other changes:

  • Added recommendations for the evict vs. delete call: we usually cannot use the evict call because the evacuation may not be in a reversible state and there is a chance a PDB would block it.
  • Added a .status.evacuationCancellationPolicy field to allow evacuators to control the cancellation policy.
  • We still need the finalizers to track the instigators and which ones can cancel/remove the evacuation object.
  • Added affinity-based eviction to the motivation and use cases.
  • Updated validation and admission requirements.
  • Added HorizontalPodAutoscaler into Follow-up Design Details for Kubernetes Workloads section.
  • Filled in PRR.

@atiratree atiratree force-pushed the evacuation-api branch 4 times, most recently from f2e68a6 to 85f94b9 Compare June 4, 2024 16:22
Contributor — @soltysh left a comment

I have similar concerns as others expressed in #4213, and I've also voiced them there. There are several alternatives you've described here that you rejected due to simplicity. I'm not sure that an entirely new API will solve this problem. I'd first suggest tackling the smaller problems that we currently see with both PDBs and the Eviction API before overhauling this and introducing yet another API, which will most likely confuse users rather than provide them with the suggested flexibility.

An evacuator could reconcile the status properly without making any progress. It is thus
recommended to check `creationTimestamp` of the Evacuations and observe
`evacuation_controller_active_evacuator` metric to see how long it takes for an evacuator to
complete the evacuation. This metric can also be used to implement additional alerts.
Contributor

What about the risk behind adopting this new API? Most users are very slow to adopt these. My concern is that, aside from the naming confusion overload mentioned in https://github.com/kubernetes/enhancements/pull/4565/files#r1543242249, a lot of users will prefer the simpler eviction API that is already battle tested, even with its limitations, mostly because it is already widely used. Your proposed mechanism requires work on the k8s core side on one hand, and from application owners managing the evacuation process on the other.

I will repeat, similarly to what has been stated in #4213: we should revisit the PDB and eviction APIs, both of which are widely used, identify the gaps, and look at the options to extend them with the missing functionality. I don't think we need full-fledged, feature-rich functionality, but rather a functional and easy-to-understand API built on the existing primitives.

Member Author (@atiratree, Jun 7, 2024)

I do not see a big risk in terms of adoption.

The eviction API is mostly used by a handful of components (kubectl drain, cluster autoscaler, karpenter, scheduler, descheduler, etc.). It is enough to introduce the Evacuation API to a handful of components to see an impact. Users usually do not use this API by themselves - there is not even kubectl support.

From the evacuator side, there is no user action required either. Please see the follow-up workload integration in this KEP; DaemonSet is part of the NodeMaintenance KEP.

The user can take action if they want to further improve the situation:

At this time, I do not see any extensions that could be made to PDBs and eviction API to solve the current issues. Please see the analysis in the alternatives section: https://github.com/atiratree/kube-enhancements/blob/improve-node-maintenance/keps/sig-apps/4212-declarative-node-maintenance/README.md#alternatives.

Member Author (@atiratree, Jun 7, 2024)

Apart from NodeMaintenance, we already have two additional use cases that would like to adopt the Evacuation API.

Contributor

Descheduler integration (kubernetes-sigs/descheduler#1354)
Why not align with the effort that scheduler is pushing to start using evictions (see #3280)?

Member Author

That is too specific to preemption and expects pods to be ordered by priority. It would not help in cases where priority is not used or cooperative eviction is needed.

@atiratree (Member Author)

There are several alternatives you've described here that you rejected due to simplicity.

Do you mean the Pod API? We have analyzed the problem and found serious concurrency, observability, and extensibility issues.

I'd first suggest tackling the smaller problems that we currently see with both PDBs and the Eviction API before overhauling this and introducing yet another API, which will most likely confuse users rather than provide them with the suggested flexibility.

I would prefer not to have a new API, but at this time, I do not see how we can incrementally improve the situation for PDBs and the eviction API to address the problems at hand.

We need to come up with a good name so that it is not such a big hurdle for users to understand. Fortunately, most of the time, this API will be used internally as a building block by the cluster components, similar to the eviction API today.

@atiratree atiratree force-pushed the evacuation-api branch 2 times, most recently from b55bd3d to cd1dcdd Compare June 10, 2024 15:12
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 11, 2024
@atiratree (Member Author)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 19, 2024
Member — @thockin left a comment

I am out of time for now, but I have to admit I am worried about how this will play out. Coordinating between N arbitrary actors via the API is not something we really do anywhere else (AFAIK "classic" DRA was the one place, and that is being removed).

Can we pin down some very concrete examples of what REAL evacuators might really do, and how the lifecycle works, especially WRT cancellation of an eviction? You can lose the details and just tell a story. This is significantly complicated and I don't feel able to defend it right now.

Could this be positioned like "readiness gates"? That sort of feels similar.


The major issues are:

1. Without extra manual effort, an application running with a single replica has to settle for
Member

I'm torn on this as a motivating case. If you have a single-replica app, you shouldn't really be allowed to use a PDB (at least not as defined).

Member Author

Unfortunately, we made it possible and people do it. They also utilize validation webhooks/controllers to implement their desired behavior.


- After a partial cleanup (e.g. storage migrated, notification sent) and if the application is
still in an available state, the eviction API can be used to respect PodDisruptionBudgets.

In the end, the evacuation should always end with a pod being deleted (evict or delete call) by one
Member

Earlier you said "The instigator's responsibility is to communicate an intent ... it should remove its intent when the evacuation is no longer necessary." Reconcile that with "evacuation should always end with a pod being deleted"?

Member Author

Right, we should make it clear that the pod might not end up deleted/terminated. I also want to add a mechanism that would result in the pod termination, but not deletion.

Contributor

In this context (or in general), what is the difference of "pod termination" and "pod deletion"?

Member Author

"pod termination" -> pod terminates but the pod object is kept in the cluster/etcd)- meant for pods with OnFailure, Never restart policies
"pod deletion" -> pod terminates and the pod object is also deleted - meant for pods with Always restart policy

Contributor

thank you. so you mean by default, evacuation aims to delete, but there will be a separate set of rules for this mechanism to determine if a pod is going to be terminated instead?

Member Author

thank you. so you mean by default, evacuation aims to delete,

yes

but there will be a separate set of rules for this mechanism to determine if a pod is going to be terminated instead?

I think it should be decided by the evacuator through the Evacuation API whether it is useful to keep the pod around after the termination.

There can be many evacuation instigators for a single Evacuation.

When an instigator decides that a pod needs to be evacuated, it should create an Evacuation:
- With a predictable and consistent name in the following format: `${POD_UID}-${POD_NAME_PREFIX}`.
Member

I don't love magical hardcoded formatted strings. Why is it not sufficient to use POD_NAME or, if you really want to handle the ABA case, just POD_UID? What I mean is that we already have places that assert that two things are linked 1:1 by having the same name. In this case, I suppose it's possible that you have:

  • Pod "foo" exists
  • create evacuation "foo", with pod's UID inside
  • delete pod foo
  • create pod foo
  • try to create evacuation "foo", with new UID inside, fail

I (maybe wrongly) presume there's a cleanup loop that removes evacuations which have no matching pod?

Member Author

The name of the Evacuation object must be unique and predictable for each pod to prevent the creation of multiple Evacuations for the same pod. We do not expect the evacuators to support interaction with multiple Evacuations for a pod.

I was trying to combine ease of use (POD_NAME) and uniqueness of each pod instance (POD_UID).

  • Having name only: we can solve the uniqueness by an admission plugin and checking .spec.podRef.
  • Having ID only: we can let clients (kubectl) print out the name from .spec.podRef.

Even though we do not expect a big number of instigators, I agree that ${POD_UID}-${POD_NAME_PREFIX} is overly complicated for the end user, especially with the character-count limit.

I (maybe wrongly) presume there's a cleanup loop that removes evacuations which have no matching pod?

Yes, there is a GC, but that can have a delay. Nevertheless, I am mainly concerned with having the 1:1 mapping, to make the implementation easier for all parties (mainly Evacuators).
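
For illustration, a minimal sketch of what an Evacuation created by an instigator might look like under this naming scheme. Only the quoted name format, `.spec.podRef`, and `.spec.progressDeadlineSeconds` come from this thread and the surrounding KEP excerpts; the API group/version and everything else are assumptions, not the final API:

```yaml
# Hypothetical sketch only; group/version and most field names are assumed.
apiVersion: apps.k8s.io/v1alpha1        # assumed group/version
kind: Evacuation
metadata:
  # ${POD_UID}-${POD_NAME_PREFIX}, truncated to fit the object name length limit
  name: 0b3f57b2-8d35-4f8c-9a1c-0c9a7b6d2e11-nginx-6d4cf56db
  namespace: blueberry
spec:
  podRef:                               # 1:1 mapping to the evacuee, checked on admission
    name: nginx-6d4cf56db-xk2vq
    uid: 0b3f57b2-8d35-4f8c-9a1c-0c9a7b6d2e11
  progressDeadlineSeconds: 1800         # default discussed later in this conversation
```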

after the previous pod is removed.
- `.spec.progressDeadlineSeconds` should be set to a reasonable value. It is recommended to leave
it at the default value of 1800 (30m) to allow for potential evacuator disruptions.
- `.spec.evacuators` value will be resolved on the Evacuation admission from the pod and should
Member

I have gotten this far, but I really don't know what this is describing. This is a very long and detailed KEP, but I think it would really benefit from an overview with fewer details. E.g.

"""
Pods carry new information about which "evacuators" (described later) are involved in their lifecycle. Multiple evacuation initiators can coordinate, via a new "Evacuation" API object, to indicate the need for a pod to be removed from a node. This new object allows the evacuators to take action before the pod is removed, and...
"""

Member Author

I will go over the whole KEP and try to make it more accessible.

characters (`63 - len("priority_")`)
- `PRIORITY` and `ROLE`
- `controller` should always set a `PRIORITY=10000` and `ROLE=controller`
- other evacuators should set `PRIORITY` according to their own needs (minimum value is 0,
Member

is 0 "highest priority" or "lowest priority"? What does this priority govern? Can two of them have the same prio?

Member Author

0 is "lowest priority"

What does this priority govern?

The order of Evacuator evaluation. The highest priority evacuators are evaluated first.

Can two of them have the same prio?

I think they can, but then their order is not defined in the API and they are most likely just ordered alphabetically by their class.

One exception is 10000/controller, which should be unique and set by the controller.
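
For illustration, a rough sketch of how the evacuator advertisement annotations on a pod might look, based on the `priority_` key prefix and the PRIORITY/ROLE values quoted above. The annotation key domain, the class names, the non-controller role value, and the PRIORITY/ROLE encoding are assumptions, not the KEP's final format:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-6d4cf56db-xk2vq
  annotations:
    # the managing controller registers itself with the reserved priority and role
    evacuation.coordination.k8s.io/priority_replicaset: "10000/controller"
    # a lower-priority custom evacuator class; role value here is only a placeholder
    evacuation.coordination.k8s.io/priority_backup-agent: "1000/evacuator"
```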

namespace: blueberry
```

The evacuator should observe the evacuation objects that match the pods that the evacuator manages.
Member

How does an evacuator know which pods it manages?

Member Author

This is evacuator specific.

  • It can use the same labels to select the Evacuation as it does for its own pods (e.g. the Deployment's .spec.selector); see the sketch below. We will add the labels to the Evacuation on admission.
  • Others might have annotations/ownerReferences on the pod and then need to pair them.
  • Some might react to all pods (e.g. log the state of each pod).
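
A minimal sketch of the label-based pairing from the first bullet above. That pod labels are copied onto the Evacuation at admission comes from this reply; the concrete labels, names, kind, and group/version are illustrative assumptions:

```yaml
# Evacuation as it might look after admission, with the evacuee pod's labels copied on
# so that a workload controller can pair it using its own .spec.selector (e.g. app: web).
apiVersion: apps.k8s.io/v1alpha1       # assumed group/version
kind: Evacuation                       # assumed kind and fields, see note above
metadata:
  name: 7f3a2c1d-9e4b-4a6f-b2d3-5c6e7f8a9b0c-web-7f9c6bdd8
  namespace: blueberry
  labels:
    app: web                           # copied from the pod on admission
spec:
  podRef:
    name: web-7f9c6bdd8-abcde
```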

```

The evacuator should observe the evacuation objects that match the pods that the evacuator manages.
It should start the evacuation only if it observes `.status.activeEvacuatorClass` that matches the
Member

Whose status? Pod or Evacuation?

Member Author

Evacuation.

It should start the evacuation only if it observes `.status.activeEvacuatorClass` that matches the
`EVACUATOR_CLASS` it previously set in the annotation.

If the evacuator is not interested in evacuating the pod, it should set
Member

Elsewhere you talked about an evacuation not being necessary and being cancelled. Can it be cancelled once evacuators have started running? What if they have done something irreversible? Are they all expected to be reversible? Idempotent? Do they get called to reverse an evacuation?

Contributor

The inspiration for this was (partly) the declarative node maintenance KEP - #4212

When declaring a node as under maintenance, it feels implicit that you can end the maintenance at any time, either as a cancellation or by marking it complete. That's how organizations getting humans to maintain things typically model it.

(also, you can't “reverse“ a maintenance notification; you can rescind it, but you can't create a second maintenance period that overlaps with and cancels out the first one).

Member

My point was that I can't tell, from this KEP, how to answer those questions. If those answers are in there, they are buried.

Member Author

Elsewhere you talked about an evacuation not being necessary and being cancelled.

I am sorry, I will make this part more clear.

Can it be cancelled once evacuators have started running? What if they have done something irreversible? Are they all expected to be reversible? Idempotent? Do they get called to reverse an evacuation?

As @sftim has said, one of the consumers/inspirations is the NodeMaintenance or other projects that might want to stop the maintenance.

The cancellation should be under the control of the current Evacuator. Evacuators should offer the ability to cancel if possible, but they can forbid it if the termination is irreversible. Please see .status.evacuationCancellationPolicy.

`.spec.progressDeadlineSeconds` and make sure to update the status periodically in less than that
duration. For example, if `.spec.progressDeadlineSeconds` is set to the default value of 1800 (30m),
it may update the status every 3 minutes. The status updates should look as follows:
- Check that `.status.activeEvacuatorClass` still matches the `EVACUATOR_CLASS` previously set in
Member

This feels racy and brittle - easy to miss "events" ?

Member Author

.status.activeEvacuatorClass is only reconciled/coordinated by the evacuation controller (single instance).

The reaction time can be tweaked with .spec.progressDeadlineSeconds.

I think the following values might be reasonable, but welcome any other suggestions.

	// The minimum value is 600 (10m) and the maximum value is 21600 (6h).
	// The default value is 1800 (30m).

FYI, this mechanism is very quick if there is no evacuator advertised on the pod. It will instantly fall back to eviction.
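
To illustrate the heartbeat pattern described here, a rough sketch of the relevant Evacuation fields. Only `.spec.progressDeadlineSeconds`, `.status.activeEvacuatorClass`, and the min/default/max values quoted above come from the thread; the condition used as a heartbeat and its values are hypothetical placeholders:

```yaml
spec:
  progressDeadlineSeconds: 1800        # default 30m; min 600 (10m), max 21600 (6h) per the comment above
status:
  activeEvacuatorClass: replicaset     # reconciled only by the evacuation controller
  conditions:
  - type: Progressing                  # hypothetical heartbeat condition refreshed by the active evacuator
    status: "True"
    reason: EvacuationInProgress
    message: Evacuator refreshed the status; updating roughly every 3 minutes.
    lastTransitionTime: "2024-06-07T12:00:00Z"
```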

Member Author — @atiratree left a comment

@thockin Thanks a lot for the feedback. I see there are still quite a lot of places to improve.

Can we pin down some very concrete examples of what REAL evacuators might really do, and how the lifecycle works, especially WRT cancellation of an eviction?

I guess the user stories are still pretty abstract and it is best to show the flow with real examples.

I will work on the updates soon.



### Non-Goals

- Implement evacuation capabilities in Kubernetes workloads (ReplicaSet, Deployment, etc.).
- Synchronizing of the evacuation status to the pod status.
Contributor

Does it mean that the pod status will not reflect that the pod is being evacuated? On a high level, how do we identify that a pod is currently being evacuated?

Member Author

You would have to pair the pod to an evacuation. I think it would be a nice followup once we are set on the Evacuation API to add for example a pod condition to indicate that. This thread is also relevant: #4565 (comment)

Contributor

so what we monitor is the existing evacuations and check the evacuation statuses instead? if an evacuation that failed to pair with a pod is eventually garbage collected, will the pod ever know it was intended to be evacuated?

Member Author

I think it would be a nice followup once we are set on the Evacuation API to add for example a pod condition to indicate that.

I think we can add this followup to the Beta graduation.

so what we monitor is the existing evacuations and check the evacuation statuses instead?

yes, for now

if an evacuation that failed to pair with a pod is eventually garbage collected, will the pod ever know it was intended to be evacuated?

The pairing is the responsibility of the Evacuator. Nevertheless, we should check as much as we can on admission.

Contributor

Thank you!

needs high availability either, due to a preference for a simpler deployment model, lack of
application support for HA, or to minimize compute costs. Also, any automated solution needs
to edit the PDB to account for the additional pod that needs to be spun to move the workload
from one node to another. This has been discussed in issue [kubernetes/kubernetes#66811](https://github.com/kubernetes/kubernetes/issues/66811)
Contributor

As far as I have read about the Evacuation API, it seems to always respect PDBs.
But I wish to know: will the Evacuation API be able to disable or ignore PDBs to help the case where the operator expects zero living nodes in the cluster (i.e. shutting down the cluster during weekends/holidays for maintenance and starting it again afterwards)?
The above case requires all nodes to shut down, and having PDBs with minAvailable: 1 or maxUnavailable: 0 will definitely block draining the nodes.
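
For reference, this is the kind of PodDisruptionBudget the comment has in mind; it uses the existing policy/v1 API (not part of this KEP) and, with a single matching replica, blocks any eviction-based drain:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: single-replica-app
spec:
  minAvailable: 1            # with one matching replica, no voluntary disruption is allowed
  selector:
    matchLabels:
      app: single-replica-app
```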

Member Author

The feature is not limited by PDBs; workloads can use them, but they can also bypass them if there is a controller that can make more advanced decisions.

Contributor

But for the k8s built-in controllers like Deployments, StatefulSets, etc., if a PDB is configured, how will the Evacuation API be able to bypass the disruption budget and allow node termination below the PDB-configured values?

Member Author

I will update the KEP soon. But the mechanism of optionally bypassing the PDB should not change, and it requires action from the evacuator. PTAL at the KEP or just the attached diagram: https://github.com/kubernetes/enhancements/pull/4565/files#diff-560381a7190e130fded9e8758d53d2e5c4b69c39838485fc7a3fb13be096ae47

Contributor

Looking at the diagram, in the innermost loop it says the Evacuator updates the Evacuation status. I just want to clarify the meaning of the part where it says:

"make sure the Evacuation status is up to date, evict the pod if it is not".

Does it mean if the evacuator failed to update the Evacuation status for a specific time, the pod will be terminated regardless?

Member Author

yup, this is implemented via the .spec.progressDeadlineSeconds, but the min/default values should be forgiving to allow for evacuator disruptions

Contributor

oh thank you. As I see it, does .spec.progressDeadlineSeconds double as the polling duration counting from the most recent evacuator update, waiting for other potential instigators?
