From ec699a1e166bc8a2e052567cae4ec5e07575e3c7 Mon Sep 17 00:00:00 2001 From: Jan Chaloupka Date: Mon, 26 Feb 2024 10:58:49 +0100 Subject: [PATCH] KEP: Extend descheduler eviction handling for eviction requests --- proposals/evictions-in-background/README.md | 879 ++++++++++++++++++++ proposals/evictions-in-background/kep.yaml | 16 + 2 files changed, 895 insertions(+) create mode 100644 proposals/evictions-in-background/README.md create mode 100644 proposals/evictions-in-background/kep.yaml diff --git a/proposals/evictions-in-background/README.md b/proposals/evictions-in-background/README.md new file mode 100644 index 0000000000..127d988fba --- /dev/null +++ b/proposals/evictions-in-background/README.md @@ -0,0 +1,879 @@ + +# KEP-NNNN: Extend descheduler eviction handling for eviction requests + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) 
+ - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + +The descheduler eviction 
policy is built on top of the eviction API. The API currently does not support eviction requests that are not completed right away. Instead, any eviction needs to either succeed or be rejected in response. Nevertheless, there are cases where an eviction request is expected only to initiate the eviction, with the response confirming or rejecting the eviction initiation (or its promise). + +The descheduler treats all pods as cattle. Each descheduling loop consists of pod nomination followed by eviction of the nominated pods. When the eviction of one pod does not succeed, another pod is selected until a limit on the number of evictions is reached or all nominated pods are evicted. The descheduler does not keep track of pods that failed to be evicted. The next descheduling loop repeats the same routine. The order of eviction is not guaranteed. Thus, any subsequent loop can either evict pods that failed to be evicted in the previous loop or completely ignore them if a limit is reached. + +## Motivation + +The Kubernetes ecosystem is wide. There are various types of workloads in various environments that may require different eviction policies to be applied. Some pods can be evicted right away, while others may need more time for cleanup. Other pods may need more time than a graceful termination period gives, or need to be able to retry eviction in case a live migration did not succeed. + +### Goals + +- **Enhance Descheduler eviction policy**: Allow pods that require eviction without getting evicted right away to get a promise from a third-party component on handling the eviction without directly invoking the upstream eviction API. Eventually, the eviction API is to be replaced with a new [evacuation API](https://github.com/kubernetes/enhancements/pull/4565). +- **No more evictions, only requests for evacuation**: the descheduler will no longer guarantee eviction. All the configurable limits per namespace/node/run will change their semantics to a maximum number of evacuation requests per namespace/node/run. 
+- **New sorting plugin to prioritize in-progress evictions**: Sort the nominated pods to prefer those with eviction already in progress (or its promise) to minimize workload disruptions. +- **New prefilter plugin to exclude in-progress evictions**: Filter out all pods that already have an evacuation request present. +- **Measure pending evictions**: New metric to keep count of evictions in progress/evacuation requests. + +### Non-Goals + +- Implementation of any customized eviction policy: The Descheduler will only acknowledge the existence of pods that require different eviction handling. The actual eviction (e.g. live migration) will be implemented by third parties. + +## Proposal + +This enhancement proposes the implementation of a more informed eviction policy for the Descheduler. Plugins (which construct lists of pods nominated for eviction) become aware of pods with eviction in progress, so that they avoid evicting more pods than a configured limit allows. Reducing unnecessary evictions can improve overall workload distribution and save the cost of bringing up new pods. + +### Expected Outcomes + +- Live migrations implemented through external admission webhooks are no longer perceived as eviction errors. Currently, no pod with a live migration in background is evicted properly. This results in eviction of all such pods from all possible nodes until a limit is reached, which is undesirable behaviour. +- Descheduling plugins are more aware of pods with live migration in background, which improves the decision making behind which pods get nominated for eviction. +- The descheduler will no longer be responsible for direct eviction. Instead, it only requests eviction and lets other components (more informed and following various policies) perform the actual eviction. + +### User Stories (Optional) + +- As a cluster admin running a K8s cluster with a descheduler, I want to evict KubeVirt pods that may require live migration of VMs before a pod can be evicted. 
A VM live migration can take minutes to complete or may require retries. +- As an end user deploying my application on a K8s cluster with a descheduler, I want to evict pods while performing pod live migration to persist the pod's state without storing any necessary data into pod-independent persistent storage. +- As a developer I want to be able to implement a custom eviction policy to address various company use cases that are not supported by default when running a descheduler instance. +- As a descheduler plugin developer I want to be aware of evictions in progress to improve pod nomination and thus avoid unnecessary disruptions. +- As a security professional I want to make sure any live migration adheres to security policies and protects sensitive data. + +### Annotation vs. evacuation API based eviction + +The [evacuation API](https://github.com/kubernetes/enhancements/pull/4565) is expected to replace the [eviction API](https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/). Instead of checking whether an eviction succeeded or failed, an evacuation request is created. In case the request fails to be created, the pod eviction is skipped as previously. In addition, each cycle will keep count of the evacuation requests created instead of the pods evicted. + +The annotation based approach is a special case of invoking the eviction API: + +| | Eviction API | Evacuation API | +|--|--|--| +|**Eviction vs. eviction request**|Eviction of a pod with `descheduler.alpha.kubernetes.io/request-evict-only` annotation. Checking the eviction error, return code and response text.|Creation of an evacuation CR for the given pod. 
Checking the evacuation CR creation error.| +|**Pod nomination and sorting**|Preferring pods with both `descheduler.alpha.kubernetes.io/request-evict-only` and `descheduler.alpha.kubernetes.io/eviction-in-progress` annotations.|Preferring pods that have a corresponding evacuation CR present.| +|**Cache reconstruction**|When a descheduler gets restarted and the internal cache of annotated pods (i.e. eviction requests) is cleared, the descheduler lists all pods and populates the cache with any pod that has both annotations (`descheduler.alpha.kubernetes.io/request-evict-only` and `descheduler.alpha.kubernetes.io/eviction-in-progress`) present. In case only the first annotation is present but the eviction request was already created, the update event handler will catch the addition of the second annotation and the cache gets synced. In the worst case the limit on the maximum number of pods evicted gets exceeded.|The descheduler waits until all evacuation CRs are synced.| + +Given the evacuation API is still a work in progress without any existing implementation, the annotation based eviction can be seen as a v1alpha1 implementation of this proposal, with integration with the evacuation API following as a v1alpha2 or beta implementation. + +### Workflow Description + +This section outlines the workflow for enabling the proposed evictions in background that prevent workloads from being evicted right away without following additional eviction policies. + +#### Actors + +- Cluster Administrator: Responsible for configuring descheduling policies and overseeing cluster operations. +- User: Individuals or services running workloads within the cluster. +- Descheduler: Component responsible for requesting pod evictions. + +#### Workflow Steps + +1. Policy configuration + 1. The cluster administrator configures the descheduler to enable the new functionality. +2. A user deploys a workload across various nodes + 1. A scheduler distributes the pods based on configured scheduling policies +3. Eviction nomination + 1. 
At the beginning of each descheduling cycle the descheduler increases all the internal counters to take into account all pods that are subject to background eviction. It does so by checking the `descheduler.alpha.kubernetes.io/eviction-in-progress` annotation, the list of evacuation requests, or the internal caches. + 2. Each descheduling plugin nominates a set of pods to be evicted. + 3. All the nominated pods are sorted so that pods with eviction in progress come first. +4. Pod eviction: the descheduler starts evicting nominated pods until a limit is reached + 1. If an annotation based eviction is enabled: + 1. If a pod is not annotated with `descheduler.alpha.kubernetes.io/request-evict-only` evict the pod normally without additional handling. + 2. If a pod is annotated with `descheduler.alpha.kubernetes.io/request-evict-only` evict the pod. + 1. If the eviction fails, interpret the error as a possible request for eviction in background. + 2. The error code and the response text are checked to distinguish between a genuine error and a confirmation of an eviction in background. + 2. If an evacuation API is utilized create an evacuation request + +### Notes/Constraints/Caveats (Optional) + +- The upstream eviction API does not currently support evictions in background implemented by external components. Consequently, there is no community-agreed error code or response text for this type of eviction. +- The eviction in background error code will be temporarily mapped to 429 (TooManyRequests). +- The response text will be searched for strings containing the "Eviction triggered" prefix or similar. +- The descheduler will still treat all the pods with eviction in background as cattle. It will only prioritize these pods to be processed sooner. There may still be unnecessary/sub-optimal evictions. + +### Risks and Mitigations + +- If an external component responsible for eviction in background does not work properly the descheduler might evict fewer pods than expected. 
Pods that require eviction (e.g. long running or violating third party policies) may then keep running longer than expected, until the corresponding entry in the cache expires. +- Most of the heavy lifting happens in the external component. The descheduler only acknowledges pods that are evicted in background. All the internal counters take these pods into account. When these pods are not properly evicted the internal counters might limit eviction of other pods that (when evicted) might improve overall cluster resource utilization. +- This new functionality needs to stay feature gated until the evacuation API is available and stable. + +## Design Details + +### Implementation Strategies + +- New category of pods for eviction: Two new annotations are introduced: + - `descheduler.alpha.kubernetes.io/request-evict-only`: to signal pods whose eviction is not expected to be completed right away. Instead, an eviction request is expected to be intercepted by an external component which will initiate the eviction process for the pod, e.g. by performing live migration or scheduling the eviction to a more suitable time. + - `descheduler.alpha.kubernetes.io/eviction-in-progress`: to mark pods whose eviction was initiated by an external component. Future implementation may read the eviction status from the [evacuation CR status](https://github.com/kubernetes/enhancements/pull/4565). +- Keep track of evictions in progress: A new cache will be introduced to keep track of pods annotated with `descheduler.alpha.kubernetes.io/request-evict-only` that were requested to be evicted through the upstream eviction API. In case a descheduler gets unexpectedly restarted any pod annotated with `descheduler.alpha.kubernetes.io/eviction-in-progress` will be added to the cache if not already present. The cache will help with sorting and with avoiding duplicate eviction requests for pods that were not annotated with `descheduler.alpha.kubernetes.io/eviction-in-progress` quickly enough. 
Each item in the cache will have a TTL attribute to address cases where an eviction request was intercepted by an external component but the component failed to annotate the pod. +- Implement a new built-in sorting plugin: Extend the descheduling framework with a new sorting plugin that will prefer pods with the `descheduler.alpha.kubernetes.io/eviction-in-progress` annotation or those that are already in the cache of to-be-evicted pods. The plugin will be enabled when the new functionality is enabled (through a feature gate). Each plugin will have an option to either take the new sorting into account or not. The sorting plugin can be executed either in the pre-filter or the pre-eviction phase. +- Implement a multi-level sorting: Allow plugins to first sort pods based on their eviction-in-progress status and then based on configured sorting plugins. Preferring pods that are already in the process of eviction over other pods is a crucial step for reducing unnecessary disruptions. +- Internal counters aware of eviction in progress: At the beginning of each descheduling loop all internal counters will take into account pods that are either annotated with `descheduler.alpha.kubernetes.io/eviction-in-progress` or are present in the cache of to-be-evicted pods. +- The [evacuation API](https://github.com/kubernetes/enhancements/pull/4565) expects the evacuation request to be reconciled during each descheduling cycle. Given each cycle starts from scratch the descheduler needs to first list all evacuation requests and sort the pods accordingly in each plugin. Other components may create an evacuation request as well. The descheduler needs to take this into account, e.g. by creating a snapshot (per cycle, per strategy) to avoid inconsistencies. An evacuation request may exist without a corresponding pod before the evacuation controller garbage collects the request. 
+- The descheduler needs to keep a cache of evacuation API requests (CRs) and remove itself from finalizers if the request is no longer needed (e.g. `evacuation.coordination.k8s.io/instigator_descheduler.sigs.k8s.io` as a finalizer). + +#### API changes + +- **New eviction limits**: introduction of new `MaxNoOfPodsEvacuationsPerNode` and `MaxNoOfPodsEvacuationsPerNamespace` feature gated v1alpha2 fields. +- **New configurable list of eviction message prefixes (introduced later when needed)**: each third party validating admission webhook can provide a different response message. It is not feasible to ask every party to change their messages for a temporary solution until the evacuation API is in place. Thus, a new feature gated `evictionMessagePrefixes` field (`[]string`) is temporarily introduced for a list of known prefixes signaling an eviction in background. The "Eviction triggered" prefix will be acknowledged by default. + +#### Code changes + +The code responsible for invoking the eviction API is located under `sigs.k8s.io/descheduler/pkg/descheduler/evictions` in the private `evictPod` function. The function can be generalized into a generic interface: + +```go +// PodEvacuator abstracts over evicting a pod and requesting its evacuation. +type PodEvacuator interface { + EvacuatePod(ctx context.Context, pod *v1.Pod) error +} +``` + +to provide the [descheduling framework](https://github.com/kubernetes-sigs/descheduler/pull/1372) a generic way of evicting/evacuating pods. + +#### Metrics + +A new metric `pod_evacuations` for counting the number of pods requested to be evacuated is introduced. The descheduler will not observe whether a pod was ultimately evicted. It only provides a summary of how many pods were requested to be evacuated. If needed, another metric `pod_evacuations_in_progress` for counting the number of evacuations that are being reconciled can be introduced. + + +### Open Questions [optional] + +This is where to call out areas of the design that require closure before deciding +to implement the design. 
+ +* The evacuation API proposal mentions the evacuation instigator should remove its intent when the evacuation is no longer necessary. Should each descheduling cycle reset all the eviction requests or wait until some of the eviction requests were completed to free a "bucket" for new evictions? + **Answer**: The first implementation will not reset any evacuation request. The descheduler will account for existing requests and update the internal counters accordingly. In the future a mechanism for deleting evacuation requests that are too old can be introduced, e.g. based on a new `--max-evacuation-request-ttl` option. +* The evacuation API proposal mentions more than one entity can request an eviction. Should the descheduler take these into account as well? What if another entity decides to evict a pod that is of a low priority from the descheduler's point of view? + **Answer**: The first implementation will consider all pods with an existing evacuation request as "already getting evacuated", independent of the pod priorities. +* With evacuation requests a plugin does not evict a pod right away. This means other plugins will need to check whether a pod has a corresponding evacuation request present and, if so, take each such pod into account when constructing a list of nominated pods. For that it might be easier to see the current evictions as marking pods for future evacuation and perform the actual eviction/evacuation at the end of each descheduling cycle instead of evicting pods from within plugins. + **Answer**: This is a breaking change for existing plugins. The first implementation will filter out pods that already have an evacuation request present, keeping backward compatibility with the current implementation. Later, a new proposal can be created to discuss the possibility of evicting pods at the end of each descheduling cycle. 
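
The answers above (counting existing evacuation requests against the limit and filtering out pods that already have a request) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the descheduler's implementation: `Pod`, `podKey`, and `filterNominated` are hypothetical names, `Pod` is a simplified stand-in for `v1.Pod`, and the set of existing evacuation requests is assumed to have been listed at the start of the cycle.

```go
package main

import "fmt"

// Pod is a simplified, hypothetical stand-in for v1.Pod; only the fields
// needed for this sketch are included.
type Pod struct {
	Namespace string
	Name      string
}

// podKey returns a namespace/name identifier for a pod.
func podKey(p Pod) string {
	return p.Namespace + "/" + p.Name
}

// filterNominated drops nominated pods that already have an evacuation
// request and caps the result so that existing requests plus new ones do not
// exceed maxRequests. The existing set is assumed to be a per-cycle snapshot,
// regardless of which component created the requests.
func filterNominated(nominated []Pod, existing map[string]bool, maxRequests int) []Pod {
	// Existing requests already consume part of the limit.
	budget := maxRequests - len(existing)
	var out []Pod
	for _, p := range nominated {
		if budget <= 0 {
			break
		}
		if existing[podKey(p)] {
			// Treated as "already getting evacuated"; no second request.
			continue
		}
		out = append(out, p)
		budget--
	}
	return out
}

func main() {
	existing := map[string]bool{"default/vm-a": true}
	nominated := []Pod{
		{Namespace: "default", Name: "vm-a"},
		{Namespace: "default", Name: "web-1"},
		{Namespace: "default", Name: "web-2"},
	}
	// With a limit of 2 and one existing request, one new request fits.
	for _, p := range filterNominated(nominated, existing, 2) {
		fmt.Println(podKey(p)) // prints "default/web-1"
	}
}
```

Counting existing requests before iterating keeps the limit meaningful across cycles: a request created in an earlier cycle (or by another component) still occupies a slot until it disappears from the listed set.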
+ +### Test Plan + + + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- ``: `` - `` + +##### Integration tests + + + + + +- : + +##### e2e tests + + + +- : + +### Graduation Criteria + + + +#### Alpha + +- Feature implemented behind a feature flag +- Initial e2e tests completed and enabled +- Annotation based evictions + +#### Beta + +- Gather feedback from developers and surveys +- Evacuation API based evictions + +### Upgrade / Downgrade Strategy + + + +### Version Skew Strategy + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [ ] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: + - Components depending on the feature gate: +- [ ] Other + - Describe the mechanism: + - Will enabling / disabling the feature require downtime of the control + plane? + - Will enabling / disabling the feature require downtime or reprovisioning + of a node? + +###### Does enabling the feature change any default behavior? + + + +Eviction of pods can be delayed. Users need to make sure they have reliable external components that can evict a pod based on the corresponding evacuation request. + +For the annotation based eviction only pods annotated with `descheduler.alpha.kubernetes.io/request-evict-only` are affected. The eviction itself is still performed through the eviction API. Only the error (code) is interpreted differently. Pods with both annotations present are not evicted more than once. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +Yes. + +###### What happens if we reenable the feature if it was previously rolled back? 
+ +The same as when the feature is enabled the first time. + +###### Are there any tests for feature enablement/disablement? + + + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +When the evacuation API is used new evacuation create/update/list/watch requests are sent. + +###### Will enabling / using this feature result in introducing new API types? + + + +No. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + + +No. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? 
+ + + +No. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +Very likely. When the evacuation API is used the duration of the eviction itself depends heavily on the external controllers facilitating the eviction. + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +No. + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + + + +## Infrastructure Needed (Optional) + + diff --git a/proposals/evictions-in-background/kep.yaml b/proposals/evictions-in-background/kep.yaml new file mode 100644 index 0000000000..ee772a22e7 --- /dev/null +++ b/proposals/evictions-in-background/kep.yaml @@ -0,0 +1,16 @@ +title: Extend descheduler eviction handling for eviction requests +kep-number: NNNN +authors: + - "@ingvagabund" +owning-sig: sig-scheduling +participating-sigs: + - TBD +status: provisional +creation-date: 2024-02-26 +reviewers: + - TBD +approvers: + - TBD +feature-gates: + - TBD +stage: alpha