KEP-4563: Evacuation API #4565
Conversation
atiratree commented on Mar 28, 2024
- One-line PR description: introduce Evacuation API to allow managed graceful pod removal
- Issue link: Evacuation API #4563
- Other comments: part of the functionality has been split from KEP-4212: Declarative Node Maintenance #4213 into this KEP
We will introduce a new term called evacuation. This is a contract between the evacuation instigator,
the evacuee, and the evacuator. The contract is enforced by the API and an evacuation controller.
We can think of evacuation as a managed and safer alternative to eviction.
Watch out for the risk of confusing end users.
We already have preemption and eviction and people confuse the two. Or three, because there are two kinds of eviction. And there's disruption in the mix.
Do we want to rename Scheduling, Preemption and Eviction to Scheduling, Preemption, Evacuation and Eviction?
Good idea, I added a mention of which kind of eviction I mean here.
Do we want to rename Scheduling, Preemption and Eviction to Scheduling, Preemption, Evacuation and Eviction?
Yes, I think we want to add a new concept there and generally update the docs once we have an alpha.
I'm with Tim here.
Preemption vs Eviction is already quite confusing. And TBH, I couldn't fully understand what the "evacuation" is supposed to solve by reading the summary or motivation.
From Goals:
Changes to the eviction API to support the evacuation process.
If this is already going to be part of the Eviction API, maybe it should be named as a form of eviction. Something like "cooperative eviction" or "eviction with ack" or something along those lines?
I'm all for framing it as another type of eviction; we already have two, so the extra cognitive load for users is not so much a problem.
@alculquicondor I have updated the summary and goals, I hope it makes more sense now.
I think the name should make the most sense to the person creating the Evacuation (Evacuation Instigator ATM). So `CooperativeEviction` or `EvictionWithAck` is a bit misleading IMO, because from that person's perspective there is no additional step required of them. Only the evacuators and the evacuation controller implement the cooperative evacuation process, but this is hidden from the normal user.
My suggestions:
- `GracefulEviction` (might confuse people if it is associated with graceful pod termination, which it is not)
- `SafeEviction` (safer than the API-initiated one for some pods)
- Or just call it `Eviction`? And tell people to use it instead of the Eviction API endpoint? This might be a bit confusing (at least in the beginning)
We can bikeshed names for the API kind; I'd throw a few of my own into the hat:
- EvictionRequest
- PodEvictionRequest
<!--
What other approaches did you consider, and why did you rule them out? These do
not need to be as detailed as the proposal, but should include enough
information to express the idea and why it was not acceptable.
-->
We have a number of examples of having a SomethingRequest or SomethingClaim API that then causes a something (certificate signing, node provisioning, etc).
Think of TokenRequest (a subresource for a ServiceAccount), or CertificateSigningRequest.
I would confidently and strongly prefer to have an EvictionRequest or PodEvictionRequest API, rather than an Evacuation API kind.
It's easy to teach that we have evictions and that an EvictionRequest is asking for one to happen; it's hard to teach the difference between an eviction and an evacuation.
As a side effect, this makes the feature gate easier to name (e.g. `PodEvictionRequests`).
As you have mentioned, we already have different kinds of eviction. So I think it would be good to use a completely new term to distinguish it from the others.
Also, Evacuation does not always result in eviction (and PDB consultation). It depends on the controller/workload. For some workloads like DaemonSets and static pods, API eviction has never worked before. This could also be very confusing if we name it the same way.
I think Evacuation fits this better because
- The name is shorter. If we go with EvacuationRequest then the evacuation will become just an abstract term and less recognizable.
- It seems it will have quite a lot of functionality included (state synchronization between multiple instigators and multiple evacuators, state of the evacuee and evacuation). TokenRequest and CertificateSigningRequest are simpler and not involved in a complex process.
I suggested EvictionRequest so that we don't have to have a section with the (too long) title: Scheduling, Preemption, Evacuation and Eviction. Not EvacuationRequest.
Adding another term doesn't scale so well: it means helping n people understand the difference between evacuation and eviction. It's a scaling challenge where n is not only large, it probably includes quite a few Kubernetes maintainers.
As for CertificateSigningRequest being simple: I don't buy it. There are three controllers, custom signers, an integration with trust bundles, the horrors of ASN.1 and X.509… trust me, it's complicated enough.
I understand that it will be confusing for people, but that will happen regardless of which term we use.
My main issue is that evacuation does not directly translate to eviction. Therefore, I think it would be preferable to choose a new term (not necessarily evacuation).
I would like to get additional opinions from people about this. And we will definitely have to come back to this in the API review.
The evacuee can be any pod. The owner/controller of the pod is the evacuator. In a
special case, the evacuee can be its own evacuator. The evacuator should decide what action to take
when it observes an evacuation intent:
This doesn't match common English use. An evacuator would be something initiating evacuation, not the owner/manager of the thing being evacuated.
I don't know: It seemed to fit somehow, according to the dictionary ^^
a person or thing that evacuates.
In my view, even though it is the owner/manager of the pod, it also implements the evacuation part (evacuates).
Nevertheless, I am open to suggestions.
How about Evacuator -> Evacuation Responder -> Evacuee?
Example evacuation triggers:
- Node maintenance controller: node maintenance triggered by an admin.
- Descheduler: descheduling triggered by a descheduling rule.
If the descheduler requests an eviction, what thing is being evacuated?
(the node maintenance and cluster autoscaler examples are easier: you're evacuating an entire node)
A single pod or multiple pods. The descheduler can use it as a new mechanism instead of eviction.
OK, so people typically think of “evacuate” as a near synonym of “drain” - you drain a node, you evacuate a rack or zone full of servers. Saying that you can evacuate a Pod might make people think its containers all get stopped, or just confuse readers. We do need to smooth out how we can teach this.
It seems it can be used in both scenarios https://www.merriam-webster.com/grammar/can-you-evacuate-people.
Evacuation of containers doesn't make sense because they are tied to the pod lifecycle. But, I guess it could be confusing if we do not make it explicitly clear what we are targeting.
Thing is, Kubernetes users typically - before we accept this KEP - use “evacuate” as a synonym for drain.
I'm (still) fine with the API idea, and still concerned about the naming.
Just to +1 the potential confusion of the term "evacuation".
Is it okay to have a glossary of terms for "evacuation", "eviction", and "drain" (or any other potentially confusing terms) added somewhere in this KEP?
I can include it in the KEP. And yes, we are going to change the name to something more suitable.
"To evacuate a person" implies "get them out of trouble, to safety" as opposed to "to empty" (as in vacuum). It's not ENTIRELY wrong in this context, but it's not entirely right either.
I will change the name to EvictionRequest. It was originally chosen to distinguish it from eviction, but there is value in making it familiar to existing concepts.
#### Deployment Controller

The deployment controller must first decide whether a particular Deployment supports the evacuation
Do we need to consider integration with HPA? During an evacuation, if HPA scales down a Deployment, do we just follow the existing ReplicaSet's scale-down behavior to delete those Pods that need to be reduced, aside from the Pods scheduled for termination in the Evacuation API?
Regardless of the Deployment strategy, if the Deployment is scaled down during an evacuation, the ReplicaSet is scaled down and the pods with matching Evacuations can be terminated immediately by the ReplicaSet controller.
There are ways for the HPA to inject itself into the evacuation process. For example, it could become the Evacuation Instigator and select the pods that should be scaled down by creating Evacuations for supported workloads. This is not a feature that HPA offers today, and I am not sure if it should. I think it is better to leave it out of this KEP.
Deployments and ReplicaSets should work with both manual and automatic scaling.
@toVersus I have added a new followup for the HPA use case (see HorizontalPodAutoscaler).
- It should be possible for the HPA to become the evacuator of the workload, if Evacuations are observed.
- I also roughly outline how the evacuation conflicts should be resolved.
Thank you for adding the HPA controller section in the Follow-up design!
Upon detecting an evacuation of pods under the HPA workload, the HPA would increase the number of replicas by the number of evacuations.
Just to confirm, the behavior described here refers to when the HPA controller becomes the Evacuator (i.e., when evacuation priority annotations are added to the HPA object), right? To avoid conflicts with other pod controllers, I agree that the HPA controller should not perform evacuation by default.
Although this is something we can decide during future discussions about pod controllers' Evacuation API support, another concern is whether the HPA controller will modify the replicas of Deployments, etc., during evacuation. If the pod controller increases the replicas to account for maxSurge and then the HPA controller modifies the replicas, there could be side effects. It seems simpler for the HPA controller not to perform scale-in or scale-out during evacuation, but if the evacuation takes a long time, not being able to scale for a long period could itself be problematic.
Yes, it should become an evacuator and add annotations to pods. It is up for discussion whether the evacuation should happen by default or not; I do not see a downside to adding it as a lower-priority evacuator.
The priorities and scaling operations depend on the workload. But it should never happen that both the deployment controller and the HPA increase replicas/pods at the same time.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: atiratree. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
Added the following sections to the Alternatives:
Other changes:
I have similar concerns as others expressed in #4213, and I have also voiced them there. There are several alternatives you've described here that you rejected for simplicity. I'm not sure that an entirely new API will solve this problem. I'd first suggest tackling the smaller problems that we currently see with both PDBs and the Eviction API before overhauling this and introducing yet another API, which will most likely confuse users rather than provide them with the suggested flexibility.
An evacuator could reconcile the status properly without making any progress. It is thus
recommended to check `creationTimestamp` of the Evacuations and observe the
`evacuation_controller_active_evacuator` metric to see how long it takes for an evacuator to
complete the evacuation. This metric can also be used to implement additional alerts.
What about the risk behind adopting this new API? Most users are very slow to adopt these. My concern is that, aside from the naming confusion overload as mentioned in https://github.com/kubernetes/enhancements/pull/4565/files#r1543242249, a lot of users will prefer the simpler eviction API that is already battle-tested even with its limitations, but mostly because it's already being widely used. Your proposed mechanism requires work on the k8s core side on one hand, and on the other requires application owners to manage the evacuation process.
I will repeat what has been stated in #4213: should we revisit the PDB and eviction APIs, both of which are widely used, identify the gaps, and consider options to extend them with the missing functionality? I don't think we need full-fledged, feature-rich functionality, but rather a functional and easy-to-understand API built on the existing primitives.
I do not see a big risk in terms of adoption.
The eviction API is mostly used by a handful of components (kubectl drain, cluster autoscaler, karpenter, scheduler, descheduler, etc.). It is enough to introduce the Evacuation API to a handful of components to see an impact. Users usually do not use this API by themselves - there is not even kubectl support.
From the evacuator side, there is no user action required either. Please see the follow-up workload integration in this KEP; DaemonSet is part of the NodeMaintenance KEP.
The user can take action if they want to further improve the situation:
- Implement custom evacuation strategy for their workload or use a strategy implemented by another party/component.
- Add additional checks to the pods (aspect oriented approach) as discussed in KEP-4212: Declarative Node Maintenance #4213 (comment).
At this time, I do not see any extensions that could be made to PDBs and eviction API to solve the current issues. Please see the analysis in the alternatives section: https://github.com/atiratree/kube-enhancements/blob/improve-node-maintenance/keps/sig-apps/4212-declarative-node-maintenance/README.md#alternatives.
Apart from the NodeMaintenance we already have two additional use cases that would like to adopt Evacuation API
- Affinity Based Eviction (Implement Affinity Based Eviction #4328)
- Descheduler integration (KEP-1397: descheduler integration with evacuation API as an alternative to eviction API kubernetes-sigs/descheduler#1354)
Descheduler integration (kubernetes-sigs/descheduler#1354)
Why not align with the effort that scheduler is pushing to start using evictions (see #3280)?
That is too specific to preemption and expects pods ordered by priority. It would not help in cases where priority is not used or cooperative eviction is needed.
Do you mean the Pod API? We have analyzed the problem and found serious concurrency, observability and extensibility issues.
I would prefer not to have a new API, but at this time, I do not see how we can incrementally improve the situation for PDBs and the eviction API to address the problems at hand. We need to come up with a good name so that it is not such a big hurdle for users to understand. Fortunately, most of the time, this API will be used internally as a building block by cluster components, similar to the eviction API today.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
I am out of time for now, but I have to admit I am worried about how this will play out. Coordinating between N arbitrary actors via the API is not something we really do anywhere else (AFAIK "classic" DRA was the one place, and that is being removed).
Can we pin down some very concrete examples of what REAL evacuators might really do, and how the lifecycle works, especially WRT cancellation of an eviction? You can lose the details and just tell a story. This is significantly complicated and I don't feel able to defend it right now.
Could this be positioned like "readiness gates"? That sort of feels similar.
The major issues are:

1. Without extra manual effort, an application running with a single replica has to settle for
I'm torn on this as a motivating case. If you have a single-replica app, you shouldn't really be allowed to use a PDB (at least not as defined).
Unfortunately, we made it possible and people do it. And also utilize validation webhooks/controllers to implement their desired behavior.
- After a partial cleanup (e.g. storage migrated, notification sent) and if the application is
  still in an available state, the eviction API can be used to respect PodDisruptionBudgets.

In the end, the evacuation should always end with a pod being deleted (evict or delete call) by one
Earlier you said "The instigator's responsibility is to communicate an intent ... it should remove its intent when the evacuation is no longer necessary." Reconcile that with "evacuation should always end with a pod being deleted"?
Right, we should make it clear that the pod might not end up deleted/terminated. I also want to add a mechanism that would result in the pod termination, but not deletion.
In this context (or in general), what is the difference of "pod termination" and "pod deletion"?
- "pod termination" -> the pod terminates but the pod object is kept in the cluster/etcd - meant for pods with `OnFailure` or `Never` restart policies
- "pod deletion" -> the pod terminates and the pod object is also deleted - meant for pods with the `Always` restart policy
thank you. so you mean by default, evacuation aims to delete, but there will be a separate set of rules for this mechanism to determine if a pod is going to be terminated instead?
thank you. so you mean by default, evacuation aims to delete,
yes
but there will be a separate set of rules for this mechanism to determine if a pod is going to be terminated instead?
I think it should be decided by the evacuator through the Evacuation API whether it is useful to keep the pod around after the termination.
There can be many evacuation instigators for a single Evacuation.

When an instigator decides that a pod needs to be evacuated, it should create an Evacuation:
- With a predictable and consistent name in the following format: `${POD_UID}-${POD_NAME_PREFIX}`.
I don't love magical hardcoded formatted strings. Why is it not sufficient to use POD_NAME or, if you really want to handle the ABA case, just POD_UID? What I mean is that we already have places that assert that two things are linked 1:1 by having the same name. In this case, I suppose it's possible that you have:
- Pod "foo" exists
- create evacuation "foo", with pod's UID inside
- delete pod foo
- create pod foo
- try to create evacuation "foo", with new UID inside, fail
I (maybe wrongly) presume there's a cleanup loop that removes evacuations which have no matching pod?
The name of the Evacuation object must be unique and predictable for each pod to prevent the creation of multiple Evacuations for the same pod. We do not expect the evacuators to support interaction with multiple Evacuations for a pod.
I was trying to combine ease of use (POD_NAME) and uniqueness of each pod instance (POD_UID).
- Having the name only: we can solve the uniqueness with an admission plugin checking `.spec.podRef`.
- Having the UID only: we can let clients (kubectl) print out the name from `.spec.podRef`.
Even though we do not expect a big number of instigators, I agree that `${POD_UID}-${POD_NAME_PREFIX}` is overly complicated for the end user, especially with the limit on the number of characters.
I (maybe wrongly) presume there's a cleanup loop that removes evacuations which have no matching pod?
Yes, there is a GC, but that can have a delay. Nevertheless, I am mainly concerned about having the 1:1 mapping, to make the implementation easier for all the parties (mainly Evacuators).
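To make the discussion above concrete, here is a rough sketch of what such an Evacuation object might look like. This is an illustration only: the `apiVersion`, the exact field names, and all values are assumptions, since the API shape is still under review in this KEP.

```yaml
# Hypothetical sketch only -- the Evacuation kind, its group/version, and
# the exact field names are still under discussion in this KEP.
apiVersion: evacuation.k8s.io/v1alpha1          # assumed group/version
kind: Evacuation
metadata:
  # Predictable, unique name derived from the pod instance:
  # ${POD_UID}-${POD_NAME_PREFIX}
  name: 0b6a1f7e-3c9d-4a21-9f5e-6d2b8c4e1a70-nginx
  namespace: blueberry
spec:
  podRef:                         # pins the exact pod instance (the ABA case above)
    name: nginx-7bf54c9d7-abcde
    uid: 0b6a1f7e-3c9d-4a21-9f5e-6d2b8c4e1a70
  progressDeadlineSeconds: 1800   # recommended default (30m)
```

Pinning the pod UID in both the name and `.spec.podRef` is what keeps a recreated pod with the same name from matching a stale Evacuation.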
after the previous pod is removed.
- `.spec.progressDeadlineSeconds` should be set to a reasonable value. It is recommended to leave
  it at the default value of 1800 (30m) to allow for potential evacuator disruptions.
- `.spec.evacuators` value will be resolved on the Evacuation admission from the pod and should
I have gotten this far, but I really don't know what this is describing. This is a very long and detailed KEP, but I think it would really benefit from an overview with fewer details. E.g.
"""
Pods carry new information about which "evacuators" (described later) are involved in their lifecycle. Multiple evacuation instigators can coordinate, via a new "Evacuation" API object, to indicate the need for a pod to be removed from a node. This new object allows the evacuators to take action before the pod is removed, and...
"""
I will go over the whole KEP and try to make it more accessible.
characters (`63 - len("priority_")`)
- `PRIORITY` and `ROLE`
- `controller` should always set a `PRIORITY=10000` and `ROLE=controller`
- other evacuators should set `PRIORITY` according to their own needs (minimum value is 0,
is 0 "highest priority" or "lowest priority"? What does this priority govern? Can two of them have the same prio?
0 is "lowest priority"
What does this priority govern?
The order of Evacuator evaluation. The highest priority evacuators are evaluated first.
Can two of them have the same prio?
I think they can, but then their order is not defined in the API and they are most likely just ordered alphabetically by their class.
One exception is `10000/controller`, which should be unique and set by the controller.
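As a concrete illustration of the annotation scheme discussed above, a pod registering two evacuators might look like the sketch below. Only the `priority_` key prefix and the fixed `10000/controller` entry come from the thread; the full annotation keys, the class names, and the `ROLE` value for the non-controller evacuator are assumptions.

```yaml
# Hypothetical sketch -- annotation keys and values are illustrative only.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-7bf54c9d7-abcde
  annotations:
    # The owning controller always registers at the fixed priority/role:
    priority_replicaset.apps: "10000/controller"
    # Other evacuators choose their own PRIORITY (0 is the lowest);
    # the ROLE value here is an assumed placeholder:
    priority_my-storage-migrator: "500/evacuator"
spec:
  containers:
  - name: nginx
    image: nginx
```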
  namespace: blueberry

The evacuator should observe the evacuation objects that match the pods that the evacuator manages.
How does an evacuator know which pods it manages?
This is evacuator specific.
- It can use the same labels to select the evacuation as it does for its own pods (e.g. Deployment's `.spec.selector`). We will add the labels to the Evacuation on admission.
- Others might have annotations/ownerReferences on the pod and then they need to pair them.
- Some might react to all pods (e.g. log the state of each pod).
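The label-based pairing in the first bullet can be sketched as follows (the group/version, field names, and labels are all assumed for illustration): the evacuee's labels are copied onto the Evacuation on admission, so a controller can reuse its existing `.spec.selector` to watch for matching Evacuations.

```yaml
# Hypothetical sketch -- the Evacuation kind and its fields are assumptions.
apiVersion: evacuation.k8s.io/v1alpha1
kind: Evacuation
metadata:
  name: 0b6a1f7e-3c9d-4a21-9f5e-6d2b8c4e1a70-nginx
  namespace: blueberry
  labels:
    app: nginx            # copied from the evacuee pod on admission
spec:
  podRef:
    name: nginx-7bf54c9d7-abcde
```

A Deployment with `.spec.selector.matchLabels: {app: nginx}` could then list and watch Evacuations with the same selector it already uses for its pods.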
The evacuator should observe the evacuation objects that match the pods that the evacuator manages.
It should start the evacuation only if it observes `.status.activeEvacuatorClass` that matches the
whose status? Pod or Evactuation?
Evacuation.
`EVACUATOR_CLASS` it previously set in the annotation.
If the evacuator is not interested in evacuating the pod, it should set
Elsewhere you talked about an evacuation not being necessary and being cancelled. Can it be cancelled once evacuators have started running? What if they have done something irreversible? Are they all expected to be reversible? Idempotent? Do they get called to reverse an evacuation?
The inspiration for this was (partly) the declarative node maintenance KEP - #4212
With declaring a node as under maintenance it feels implicit that you can end maintenance at any time, either as a cancellation or by marking it complete. That's how organizations getting humans to maintain things typically model it.
(also, you can't “reverse“ a maintenance notification; you can rescind it, but you can't create a second maintenance period that overlaps with and cancels out the first one).
My point was that I can't tell, from this KEP, how to answer those questions. If those answers are in there, they are buried.
> Elsewhere you talked about an evacuation not being necessary and being cancelled.

I am sorry, I will make this part more clear.

> Can it be cancelled once evacuators have started running? What if they have done something irreversible? Are they all expected to be reversible? Idempotent? Do they get called to reverse an evacuation?

As @sftim has said, one of the consumers/inspirations is the NodeMaintenance or other projects that might want to stop the maintenance. The cancellation should be under the control of the current evacuator. Evacuators should offer the ability to cancel if possible, but they can forbid it if the termination is irreversible. Please see `.status.evacuationCancellationPolicy`.
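A minimal sketch of how that cancellation contract could surface in the Evacuation status; the field casing and the allowed values are assumptions based on the field named in this thread:

```yaml
status:
  activeEvacuatorClass: storage.example.com   # hypothetical evacuator class
  # "Allow": instigators may still cancel (e.g. by deleting the Evacuation).
  # "Forbid": the active evacuator has started irreversible work.
  evacuationCancellationPolicy: Forbid
```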
`.spec.progressDeadlineSeconds` and make sure to update the status periodically in less than that
duration. For example, if `.spec.progressDeadlineSeconds` is set to the default value of 1800 (30m),
it may update the status every 3 minutes. The status updates should look as follows:
- Check that `.status.activeEvacuatorClass` still matches the `EVACUATOR_CLASS` previously set in
This feels racy and brittle - easy to miss "events" ?
`.status.activeEvacuatorClass` is only reconciled/coordinated by the evacuation controller (a single instance). The reaction time can be tweaked with `.spec.progressDeadlineSeconds`. I think the following values might be reasonable, but welcome any other suggestions:

- The minimum value is 600 (10m) and the maximum value is 21600 (6h).
- The default value is 1800 (30m).

FYI, this mechanism is very quick if there is no evacuator advertised on the pod. It will instantly fall back to eviction.
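Putting the heartbeat contract together, a hedged sketch under the bounds quoted above; the condition type and timestamp field are illustrative, not from the KEP:

```yaml
spec:
  progressDeadlineSeconds: 1800          # min 600 (10m), max 21600 (6h), default 1800 (30m)
status:
  activeEvacuatorClass: storage.example.com
  conditions:
  - type: Progressing                    # hypothetical condition type
    status: "True"
    # Refreshed well within the deadline (e.g. every 3 minutes); if the deadline
    # passes without an update, the controller falls back to eviction.
    lastProbeTime: "2024-03-28T10:03:00Z"
```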
@thockin Thanks a lot for the feedback. I see there are still quite a lot of places to improve.

> Can we pin down some very concrete examples of what REAL evacuators might really do, and how the lifecycle works, especially WRT cancellation of an eviction?

I guess the user stories are still pretty abstract and it is best to show the flow with real examples. I will work on the updates soon.
Example evacuation triggers:
- Node maintenance controller: node maintenance triggered by an admin.
- Descheduler: descheduling triggered by a descheduling rule.
I will change the name to EvictionRequest. The name Evacuation was originally chosen to distinguish it from eviction, but there is value in aligning it with existing concepts.
The major issues are:

1. Without extra manual effort, an application running with a single replica has to settle for
Unfortunately, we made it possible and people do it. They also utilize validation webhooks/controllers to implement their desired behavior.
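For context, this is the standard single-replica setup being described: with one replica and a PDB like the following (standard `policy/v1` API), the Eviction API can never succeed, because removing the only pod would violate `minAvailable`:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: single-replica-pdb
  namespace: blueberry           # example namespace reused from this thread
spec:
  minAvailable: 1                # with replicas: 1, every eviction would violate the budget
  selector:
    matchLabels:
      app: single-replica-app    # illustrative label
```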
- After a partial cleanup (e.g. storage migrated, notification sent) and if the application is
  still in an available state, the eviction API can be used to respect PodDisruptionBudgets.

In the end, the evacuation should always end with a pod being deleted (evict or delete call) by one
Right, we should make it clear that the pod might not end up deleted/terminated. I also want to add a mechanism that would result in the pod termination, but not deletion.
There can be many evacuation instigators for a single Evacuation.

When an instigator decides that a pod needs to be evacuated, it should create an Evacuation:
- With a predictable and consistent name in the following format: `${POD_UID}-${POD_NAME_PREFIX}`.
The name of the Evacuation object must be unique and predictable for each pod to prevent the creation of multiple Evacuations for the same pod. We do not expect the evacuators to support interaction with multiple Evacuations for a pod. I was trying to combine ease of use (POD_NAME) and uniqueness of each pod instance (POD_UID).

- Having the name only: we can solve the uniqueness by an admission plugin checking `.spec.podRef`.
- Having the ID only: we can let clients (kubectl) print out the name from `.spec.podRef`.

Even though we do not expect a big number of instigators, I agree that `${POD_UID}-${POD_NAME_PREFIX}` is overly complicated for the end user, especially with the character count limit.

> I (maybe wrongly) presume there's a cleanup loop that removes evacuations which have no matching pod?

Yes, there is a GC, but that can have a delay. Nevertheless, I am mainly concerned about having the 1:1 mapping, to make the implementation easier for all the parties (mainly evacuators).
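Under the `${POD_UID}-${POD_NAME_PREFIX}` scheme, an instigator would create something like the following sketch; the API group/version and the `podRef` shape are assumptions:

```yaml
apiVersion: coordination.k8s.io/v1alpha1   # hypothetical group/version
kind: Evacuation
metadata:
  # ${POD_UID}-${POD_NAME_PREFIX}: unique per pod instance and predictable, so
  # concurrent instigators converge on one object instead of creating duplicates.
  name: c8b5e9a0-1111-2222-3333-444455556666-blueberry-app
  namespace: blueberry
spec:
  podRef:                                  # 1:1 mapping back to the evacuated pod
    name: blueberry-app-7d9c4f5b6-x2k4q
    uid: c8b5e9a0-1111-2222-3333-444455556666
  progressDeadlineSeconds: 1800
```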
after the previous pod is removed.
- `.spec.progressDeadlineSeconds` should be set to a reasonable value. It is recommended to leave
  it at the default value of 1800 (30m) to allow for potential evacuator disruptions.
- `.spec.evacuators` value will be resolved on the Evacuation admission from the pod and should
I will go over the whole KEP and try to make it more accessible.
characters (`63 - len("priority_")`)
- `PRIORITY` and `ROLE`
- `controller` should always set a `PRIORITY=10000` and `ROLE=controller`
- other evacuators should set `PRIORITY` according to their own needs (minimum value is 0,
### Non-Goals

- Implement evacuation capabilities in Kubernetes workloads (ReplicaSet, Deployment, etc.).
- Synchronizing the evacuation status to the pod status.
does it mean that pod status will not reflect the pod is being evacuated? on a high level, how do we identify a pod is currently being evacuated?
You would have to pair the pod to an evacuation. I think it would be a nice followup once we are set on the Evacuation API to add for example a pod condition to indicate that. This thread is also relevant: #4565 (comment)
so what we monitor is the existing evacuations and check the evacuation statuses instead? if an evacuation that failed to pair with a pod is eventually garbage collected, will the pod ever know it was intended to be evacuated?
> I think it would be a nice followup once we are set on the Evacuation API to add for example a pod condition to indicate that.

I think we can add this followup to the Beta graduation.

> so what we monitor is the existing evacuations and check the evacuation statuses instead?

Yes, for now.

> if an evacuation that failed to pair with a pod is eventually garbage collected, will the pod ever know it was intended to be evacuated?

The pairing is the responsibility of the evacuator. Nevertheless, we should check as much as we can on admission.
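If that follow-up lands, a pod under evacuation might surface it with a condition roughly like this; the condition type and reason are purely hypothetical, not part of this KEP:

```yaml
# Pod .status excerpt (hypothetical Beta follow-up)
status:
  conditions:
  - type: EvacuationInProgress
    status: "True"
    reason: EvacuationCreated
    message: An Evacuation targeting this pod is in progress.
```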
Thank you!
needs high availability either, due to a preference for a simpler deployment model, lack of
application support for HA, or to minimize compute costs. Also, any automated solution needs
to edit the PDB to account for the additional pod that needs to be spun up to move the workload
from one node to another. This has been discussed in issue [kubernetes/kubernetes#66811](https://github.com/kubernetes/kubernetes/issues/66811)
As far as I have read about the Evacuation API, it seems to always respect PDBs. But I wish to know: will the Evacuation API be able to disable or ignore PDBs to help the case where the operator expects zero living nodes in the cluster (i.e. shutting down the cluster during weekends/holidays for maintenance and starting it again after the weekend)? The above case requires all nodes to shut down, and having PDBs with `minAvailable: 1` or `maxUnavailable: 0` will definitely block draining the nodes.
The feature is not limited by PDBs; the workloads can use them, but they can also bypass them if there is a controller that can make more advanced decisions.
But for the k8s built-in controllers like deployments, statefulsets, etc; if a PDB was configured, how will EvacuationAPI be able to bypass the disruption budget and allow node termination below the PDB configured values?
I will update the KEP soon. But the mechanism of optionally bypassing the PDB should not change, and it requires action from the evacuator. PTAL at the KEP or just the attached diagram: https://github.com/kubernetes/enhancements/pull/4565/files#diff-560381a7190e130fded9e8758d53d2e5c4b69c39838485fc7a3fb13be096ae47
Looking at the diagram, in the innermost loop it says the evacuator updates the Evacuation status. I just want to clarify the meaning of the part where it says "make sure the Evacuation status is up to date, evict the pod if it is not". Does it mean that if the evacuator fails to update the Evacuation status for a specific time, the pod will be terminated regardless?
Yup, this is implemented via `.spec.progressDeadlineSeconds`, but the min/default values should be forgiving to allow for evacuator disruptions.
Oh, thank you. As I see it, does `.spec.progressDeadlineSeconds` double as the polling duration, counting from the most recent evacuator update, while waiting for other potential instigators?