From 77d24ee26af64ceb842b964cebc8f7844aadc526 Mon Sep 17 00:00:00 2001
From: Michal Wozniak
Date: Fri, 4 Nov 2022 16:52:50 +0100
Subject: [PATCH 1/2] Update on changes for Beta for decisions taken during the implementation phase

---
 .../README.md | 301 +++++++-----------
 1 file changed, 108 insertions(+), 193 deletions(-)

diff --git a/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md b/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md
index 67b99278cbf..b38e816f043 100644
--- a/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md
+++ b/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md
@@ -103,6 +103,9 @@ tags, and then generate with `hack/update-toc.sh`.
  - [Disconnected node when taint-manager is disabled](#disconnected-node-when-taint-manager-is-disabled)
  - [Direct container kill](#direct-container-kill)
  - [Termination initiated by Kubelet](#termination-initiated-by-kubelet)
+    - [Active deadline timeout exceeded](#active-deadline-timeout-exceeded)
+    - [Admission failures](#admission-failures)
+    - [Resource limits exceeded](#resource-limits-exceeded)
  - [JobSpec API alternatives](#jobspec-api-alternatives)
  - [Failing delete after a condition is added](#failing-delete-after-a-condition-is-added)
  - [Marking pods as Failed](#marking-pods-as-failed)
@@ -110,11 +113,8 @@ tags, and then generate with `hack/update-toc.sh`.
  - [Garbage collected pods](#garbage-collected-pods)
  - [Evolving condition types](#evolving-condition-types)
  - [Stale DisruptionTarget condition which is not cleaned up](#stale-disruptiontarget-condition-which-is-not-cleaned-up)
-  - [OOM killer invoked when memory limits are not exceeded](#oom-killer-invoked-when-memory-limits-are-not-exceeded)
-  - [Broken compatibility of communicating OOM kills between container runtime and kubelet](#broken-compatibility-of-communicating-oom-kills-between-container-runtime-and-kubelet)
- [Design Details](#design-details)
  - [New PodConditions](#new-podconditions)
-  - [Detection and handling of exceeded limits](#detection-and-handling-of-exceeded-limits)
  - [Interim FailureTarget Job condition](#interim-failuretarget-job-condition)
  - [JobSpec API](#jobspec-api)
  - [Evaluation](#evaluation)
@@ -285,8 +285,13 @@ know that this has succeeded?
- Handling of Pod configuration errors resulting in pods stuck in the `Pending` state (value of `status.phase`) rather than `Failed` (such as incorrect image name, non-matching configMap references, incorrect PVC references).
-- Adding of `ResourceExhausted` condition for Windows containers due to
-  excessive memory consumption.
+- Adding pod conditions to indicate admission failures
+  (see: [admission failures](#admission-failures))
+  or exceeding of the active deadline timeout
+  (see: [active deadline timeout exceeded](#active-deadline-timeout-exceeded)).
+  Also, adding pod conditions to
+  indicate failures due to exhausting a resource (memory or ephemeral storage)
+  (see: [Resource limits exceeded](#resource-limits-exceeded)).
- Adding of the disruption condition for graceful node shutdown on Windows, as there is some ground work required first to support Pod eviction due to graceful node shutdown on Windows.
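The next two hunks trim the KEP's example Job manifests by dropping the rule that matched the `ResourceExhausted` condition, which this patch removes from the proposal. For orientation, here is a minimal sketch of a Job using the pod failure policy as it reads after this change; the rule set (ignore disruptions, fail the Job fast on a non-retriable exit code) mirrors the examples and prose in this KEP, while the job name, container name, image and the exit code `42` are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job              # placeholder name
spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
    # Pod failures caused by disruptions (preemption, API-initiated eviction,
    # taint-based eviction) carry the DisruptionTarget condition; retry them
    # without counting against backoffLimit.
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
    # Terminate the whole Job on a non-retriable software error, signalled
    # here by the placeholder exit code 42 of the "main" container.
    - action: FailJob
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
  template:
    spec:
      restartPolicy: Never       # required for the pod failure policy to apply
      containers:
      - name: main
        image: registry.example.com/app:latest   # placeholder image
```

Rules are evaluated in order and the first matching rule determines how the pod failure is handled; failures matching no rule are counted towards `backoffLimit` as before.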
@@ -422,9 +427,6 @@ spec:
   backoffLimit: 3
   podFailurePolicy:
     rules:
-    - action: FailJob
-      onPodConditions:
-      - type: ResourceExhausted
     - action: Ignore
       onPodConditions:
       - type: DisruptionTarget
@@ -464,9 +466,6 @@ spec:
   backoffLimit: 3
   podFailurePolicy:
     rules:
-    - action: FailJob
-      onPodConditions:
-      - type: ResourceExhausted
     - action: Count
       onPodConditions:
      - type: DisruptionTarget
@@ -516,9 +515,7 @@ it effectively means that the use of job's pod failure policy requires
This is in order to avoid the problematic race-conditions between Kubelet and Job controller. For example, Kubelet could restart a failed container before the Job controller decides to terminate the corresponding job due to a rule using
-`onExitCodes`. Also, Kubelet could restart a failed container due to exceeded
-resource limits before the Job controller decides to terminate the job due
-to a rule using `onPodConditions` to match the `ResourceExhausted` condition.
+`onExitCodes`.

#### Current state review

@@ -763,10 +760,16 @@ by the `crictl stop` command
In Alpha, there is no support for Pod conditions for failures or disruptions initiated by kubelet.
For Beta we introduce handling of Pod failures initiated by Kubelet by adding
-dedicated Pod conditions. In particular, we add the condition in case a pod is
-evicted due to exceeded memory (only Linux) or ephemeral-storage limits.
-Additionally, we add the disruption condition (introduced in Alpha) in case of
-disruptions initiated by kubetlet (see [Design details](#design-details)).
+the pod disruption condition (introduced in Alpha) in case of disruptions
+initiated by kubelet (see [Design details](#design-details)).
+
+Kubelet can also evict a pod in some scenarios which are not covered by
+adding a pod failure condition:
+- [active deadline timeout exceeded](#active-deadline-timeout-exceeded)
+- [admission failures](#admission-failures)
+- [resource limits exceeded](#resource-limits-exceeded)
+
+##### Active deadline timeout exceeded

Kubelet can also evict a pod due to exceeded active deadline timeout (configured by pod's `.spec.activeDeadlineSeconds` field). On one hand, exceeding the timeout,
@@ -780,6 +783,67 @@ in terms of retriability and evolving
Pod condition types to do not add any pod condition in this case. It should be re-considered in the future if there is a good motivating use-case.

+##### Admission failures
+
+In some scenarios a pod admission failure could be restarted and is likely to
+succeed on another node (for example a pod scheduled to a node with resource
+pressure, see: [Pod admission error](#pod-admission-error)). However, in some
+situations isn't clear as the failure can be caused by incompatible pod and node
+configurations. As node configurations are often the same within a cluster it is
+likely that the pod would fail if restarted on any node in the cluster. In that
+case, adding `DisruptionTarget` condition could cause a never-ending loop of
+retries if the pod failure policy was configured to ignore such failures. Thus,
+we decide not to add any pod condition for such failures. If there is a
+sufficient motivating use-case a dedicated pod condition might be introduced to
+annotate some of the admission failure scenarios.
+
+##### Resource limits exceeded
+
+A Pod failure-initiated by Kubelet can be caused by exceeding pod's
+(or container's) resource (memory or ephemeral-storage) limits.
We have considered (and prepared an initial implementation, see PR [Add ResourceExhausted pod condition for oom killer and exceeding of local storage limits](https://github.com/kubernetes/kubernetes/pull/113436)) introduction of a dedicated Pod failure condition `ResourceExhausted` to annotate pod failures due to the above scenarios.

However, it turned out, that there are complications with detection of exceeding of memory limits:
- the approach we considered is to rely on the Out-Of-Memory (OOM) killer. In particular, we could detect that a pod was terminated due to OOM killer based on the container's `reason` field being equal to `OOMKilled`. This value is set on Linux by the leading container runtime implementations: containerd (see [here](https://github.com/containerd/containerd/blob/23f66ece59654ea431700576b6020baffe1a4e49/pkg/cri/server/events.go#L344) for event handling and [here](https://github.com/containerd/containerd/blob/36d0cfd0fddb3f2ca4301533a7e7dcf6853dc92c/pkg/cri/server/helpers.go#L62) for the constant definition) and CRI-O (see [here](https://github.com/cri-o/cri-o/blob/edf889bd277ae9a8aa699c354f12baaef3d9b71d/server/container_status.go#L88-L89)).
- setting of the `reason` field to equal `OOMKilled` is not standardized. We have started an effort to standardize the handling of handling OOM killed containers (see: [Documentation for the CRI API reason field to standardize the field for containers terminated by OOM killer](https://github.com/kubernetes/kubernetes/pull/112977)). However, in the process it turned out that on some configurations (for example the CRI-O with cgroupv2, see: [Add e2e_node test for oom killed container reason](https://github.com/kubernetes/kubernetes/pull/113205)), the container's `reason` field is not set to `OOMKilled`.
- OOM killer might get invoked not only when container's limits are exceeded, but also when the system is running low on memory. In such a scenario there can be race conditions in which both the `DisruptionTarget` condition and the `ResourceExhausted` could be added.

Thus, we decide not to annotate the scenarios with the `ResourceExhausted` condition. While there are no known issues with detection of the exceeding of Pod's ephemeral storage limits we prefer to avoid future extension of the semantics of the new condition type. Alternatively, we could introduce a pair of dedicated pod condition types: `OOMKilled` and `EphemeralStorageLimitExceeded`. This approach, however, could create an unnecessary proliferation of the pod condition types.

Finally, we would like to first hear user feedback on the preferred approach and also on how important it is to cover the resource limits exceeded scenarios. We believe to be able to collect this feedback after users start to use the feature - using job failure policies based on the `DisruptionTarget` condition and container exit codes.

#### JobSpec API alternatives

Alternative versions of the JobSpec API to define requirements on exit codes and
@@ -861,16 +925,13 @@ defined in the k8s.io/apis/core/v1 package.
Thus, every addition of a new condition type will require an API review. The constants will allow users of the package to reduce the risk of typing mistakes.

-Additionally, we introduce a pair of generic condition types:
-- `DisruptionTarget`: indicates pod disruption (due to e.g. preemption,
-API-initiated eviction or taint-based eviction).
- `ResourceExhausted`: indicates the pod is evicted due to exceeding ephemeral-storage limits or due to being terminated by OOM killer.
+Additionally, we introduce a generic condition type `DisruptionTarget` which
+indicates pod disruption (due to e.g. preemption, API-initiated eviction or
+taint-based eviction).

A more detailed information about the condition (containing the kubernetes
-component name which initiated the disruption or the resource name for the
-exceeded resource limit) is conveyed by the `reason` and `message` fields of
-the added pod condition.
+component name which initiated the disruption) is conveyed by the `reason` and
+`message` fields of the added pod condition.

Finally, we are going to cover the handling of pod failures associated with the new pod condition types in integration tests.

@@ -896,63 +957,6 @@ Pod retry.
Given the factors above we assess this is an acceptable risk.

-#### OOM killer invoked when memory limits are not exceeded
-
-Currently, it is not possible to recognize if a pod's container terminated by
-OOM killer exceeded its configured limits or was terminated due the node running
-very low on memory (the container runtime would annotate the terminated container
-with the `OOMKiller` reason in both cases). As a consequence, the
-`ResourceExhausted` condition might be added to a terminated pod even if the
-pod's containers didn't exceed their configured limits. This in turn, may lead
-to invoking the `FailJob` action if a user configures the job's pod failure
-policy to interpret the presence of `ResourceExhausted` as a non-retriable failure.
-
-In order to mitigate this risk we are going to describe the scenario and the
-risk clearly in the documentation. Further, in order to minimize the risk we
-recommend:
-- configuring `requests.memory`=`limits.memory` for all pod containers
-- leaving enough available memory and enough memory reserved for the system, see:
-[best practices for node-pressure eviction configuration](https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#node-pressure-eviction-good-practices).
-
-Additionally, we are going to introduce a validation step to check if the job
-configuration contains a container for which `requests.memory`!=`limits.memory`
-and there is a pod failure policy rule on the `ResourceExhausted` pod condition
-type with action `FailJob`.
-
-In the future, if there is a motivating user feedback, we might consider
-introducing, aside of the `ResourceExhausted` condition, a pair of specific
-conditions (`OOMKiller` and `EphemeralStorageLimitExceeded`) added to a pod to
-indicate if it is being terminated by OOM killer or by Kubelet due to exceeding
-its ephemeral-storage limits.
-
-#### Broken compatibility of communicating OOM kills between container runtime and kubelet
-
-For Beta, the implementation of detecting if a container was OOM killed will
-rely on the `reason` field set to `OOMKilled`. This is currently done by the
-leading container runtime implementations (containerd and CRI-O), but is not
-standardized and thus could break in the future
-(see [here](#detection-and-handling-of-exceeded-limits)).
-
-If the compatibility is broken, then OOM kill events will not be detected by the
-new Kubelet code and the `ResourceExhausted` Pod condition will not be added.
-As a consequence a Pod might be unnecessarily restarted in a scenario when
-a user configures the job's pod failure policy to interpret the presence of
-`ResourceExhausted` as a non-retriable failure.
-
-First, we expect the change of the behaviour of the implementations unlikely because the behaviour was introduced a long time ago (about 5 years ago for both containerd and CRI-O) and probably many systems depend on the behaviour already.
-
-Second, in order to mitigate this risk for Beta we are going to describe the
-risk in the user-facing documentation.
-
-Finally, we are going to discuss the standardization of the CRI API for
-communication between container runtime and kubelet for the OOM kill events.
-A kick-off the discussion is posted as a message to the CNFC Tag Runtime mailing
-list (see: [Standardization of the OOM kill communication between container runtime and kubelet](https://lists.cncf.io/g/cncf-tag-runtime/topic/standardization_of_the_oom/94093173?p=,,,20,0,0,0::recentpostdate/sticky,,,20,2,0,94093173,previd%3D1664810761326460953,nextid%3D1637210209940337491&previd=1664810761326460953&nextid=1637210209940337491)).
-The discussion and review of the implementation is added to graduation criteria
-for [GA](#ga).
-
- `k8s.io/kubernetes/pkg/kubelet/eviction`: `13 Sep 2022` - `67.7%`
-- `k8s.io/kubernetes/pkg/kubelet`: `13 Sep 2022` - `65.1%`

##### Integration tests

@@ -1476,15 +1393,11 @@ Below are some examples to consider, in addition to the aforementioned [maturity
- Address reviews and bug reports from Alpha users
- E2e tests are in Testgrid and linked in KEP
- implementation of extending the existing Job controller's metrics:
-  `job_finished_total` by the `reason` field; and `job_pods_finished_total`
-  by the `failure_policy_action` field (see also
-  [here](#how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads))
-- implementation of adding pod failure conditions (`DisruptionTarget` or
-  `ResourceExhausted` depending on the scenario) by Kubelet when terminating a
-  Pod (see: [Termination initiated by Kubelet](#termination-initiated-by-kubelet))
-- Commonize the code for appending pod conditions between components
-- Do not update the pod disruption condition (with type=`DisruptionTarget`) if
-  it is already present with `status=True`
+  `job_finished_total` by the `reason` field; and introduction of the
+  `pod_failures_handled_by_failure_policy_total` metric with the `action` label
+  (see also [here](#how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads))
+- implementation of adding pod disruption conditions (`DisruptionTarget`)
+  by Kubelet when terminating a Pod (see: [Termination initiated by Kubelet](#termination-initiated-by-kubelet))
- Refactor adding of pod conditions with the use of [SSA](https://kubernetes.io/docs/reference/using-api/server-side-apply/) client.
- The feature flag enabled by default

@@ -1784,12 +1697,11 @@ It can be used to determine what is the relative frequency of job terminations
due to different reasons. For example, if jobs are terminated often due to `BackoffLimitExceeded` it may suggest that the pod failure policy should be extended with new rules to terminate jobs early more often
- - `job_pods_finished_total` (existing, extended by a label): the new
-`failure_policy_action` label tracks the number of failed pods that are handled
-by a specific failure policy action. Possible values are: `JobFailed`, `Ignored`
-and `Counted`. If a pod failure does not match the pod failure policy then
-the value for the label is left empty. This metric can be used to assess the
-coverage of pod failure scenarios with `spec.podFailurePolicy` rules.
+ - `pod_failures_handled_by_failure_policy_total` (new): the `action` label
+tracks the number of failed pods that are handled by a specific failure policy
+action. Possible values are: `FailJob`, `Ignore`
+and `Count`. This metric can be used to assess the coverage of pod failure
+scenarios with `spec.podFailurePolicy` rules.
- [x] Pod .status
- - Condition type: `DisruptionTarget` or `ResourceExhausted` when a Pod is terminated due to a reason listed in [design details](#design-details).
+ - Condition type: `DisruptionTarget` when a Pod is terminated due to a reason listed in [design details](#design-details).
- [x] Job .status
 - Condition reason: `PodFailurePolicy` for the job `Failed` condition if the job was terminated due to the matching `FailJob` rule.

@@ -2078,6 +1990,9 @@ technics apply):
- added [Story 3](#story-3) to demonstrate how to use the API to ensure there are no infinite Pod retries
- updated [Graduation Criteria for Beta](#beta)
- updated of kep.yaml and [PRR questionnaire](#production-readiness-review-questionnaire) to prepare the KEP for Beta
+- 2022-10-27: PR "Use SSA to add pod failure conditions" ([link](https://github.com/kubernetes/kubernetes/pull/113304))
+- 2022-10-31: PR "Extend metrics with the new labels" ([link](https://github.com/kubernetes/kubernetes/pull/113324))
+- 2022-11-03: PR "Fix disruption controller permissions to allow patching pod's status" ([link](https://github.com/kubernetes/kubernetes/pull/113580))

@@ -785,21 +786,20 @@ future if there is a good motivating use-case.
##### Admission failures

-In some scenarios a pod admission failure could be restarted and is likely to
-succeed on another node (for example a pod scheduled to a node with resource
-pressure, see: [Pod admission error](#pod-admission-error)). However, in some
-situations isn't clear as the failure can be caused by incompatible pod and node
-configurations. As node configurations are often the same within a cluster it is
-likely that the pod would fail if restarted on any node in the cluster. In that
-case, adding `DisruptionTarget` condition could cause a never-ending loop of
-retries if the pod failure policy was configured to ignore such failures. Thus,
-we decide not to add any pod condition for such failures. If there is a
-sufficient motivating use-case a dedicated pod condition might be introduced to
-annotate some of the admission failure scenarios.
+In some scenarios a pod admission failure could result in a successful pod restart on another
+node (for example a pod scheduled to a node with resource pressure, see:
+[Pod admission error](#pod-admission-error)). However, in other situations it won't be as clear,
+since the failure can be caused by incompatible pod and node configurations. Node configurations
+are often the same within a cluster, so it is likely that the pod would fail if restarted on any
+other node in the cluster. In that case, adding the `DisruptionTarget` condition could cause
+a never-ending loop of retries if the pod failure policy was configured to ignore such failures.
+Given the above, we decide not to add any pod condition for such failures. If there is a sufficient
+motivating use-case, a dedicated pod condition might be introduced to annotate some of the
+admission failure scenarios.

##### Resource limits exceeded

-A Pod failure-initiated by Kubelet can be caused by exceeding pod's
+A Pod failure initiated by Kubelet can be caused by exceeding pod's
(or container's) resource (memory or ephemeral-storage) limits.
We have considered (and prepared an initial implementation, see PR [Add ResourceExhausted pod condition for oom killer and exceeding of local storage limits](https://github.com/kubernetes/kubernetes/pull/113436)) introduction of a dedicated Pod failure condition `ResourceExhausted` to annotate pod failures due to the above scenarios. However, it turned out, that there are complications with detection of exceeding
-of memory limits:
+memory limits:
- the approach we considered is to rely on the Out-Of-Memory (OOM) killer. In particular, we could detect that a pod was terminated due to OOM killer based on the container's `reason` field being equal to `OOMKilled`. This value is set on Linux by the leading container runtime implementations: containerd (see [here](https://github.com/containerd/containerd/blob/23f66ece59654ea431700576b6020baffe1a4e49/pkg/cri/server/events.go#L344) for event handling and [here](https://github.com/containerd/containerd/blob/36d0cfd0fddb3f2ca4301533a7e7dcf6853dc92c/pkg/cri/server/helpers.go#L62) for the constant definition) and CRI-O (see [here](https://github.com/cri-o/cri-o/blob/edf889bd277ae9a8aa699c354f12baaef3d9b71d/server/container_status.go#L88-L89)).
-- setting of the `reason` field to equal `OOMKilled` is not standardized. We have
-started an effort to standardize the handling of handling OOM killed containers
+- setting the `reason` field to `OOMKilled` is not standardized, either. We have
+started an effort to standardize the handling of OOM killed containers
(see: [Documentation for the CRI API reason field to standardize the field for containers terminated by OOM killer](https://github.com/kubernetes/kubernetes/pull/112977)).
-However, in the process it turned out that on some configurations
+However, in the process it turned out that in some configurations
(for example the CRI-O with cgroupv2, see: [Add e2e_node test for oom killed container reason](https://github.com/kubernetes/kubernetes/pull/113205)), the container's `reason` field is not set to `OOMKilled`.
- OOM killer might get invoked not only when container's limits are exceeded,
-but also when the system is running low on memory. In such a scenario there
+but also when the system is running low on memory. In such scenario there
can be race conditions in which both the `DisruptionTarget` condition and the `ResourceExhausted` could be added.

Thus, we decide not to annotate the scenarios with the `ResourceExhausted` condition. While there are no known issues with detection of the exceeding of
-Pod's ephemeral storage limits we prefer to avoid future extension of the
+Pod's ephemeral storage limits, we prefer to avoid future extension of the
semantics of the new condition type. Alternatively, we could introduce a pair of dedicated pod condition types: `OOMKilled` and `EphemeralStorageLimitExceeded`. This approach, however, could create an unnecessary proliferation of the pod condition types.

Finally, we would like to first hear user feedback on the preferred approach and also on how important it is to cover the resource limits exceeded scenarios.
-We believe to be able to collect this feedback after users start to use the
-feature - using job failure policies based on the `DisruptionTarget` condition
-and container exit codes.

#### JobSpec API alternatives

@@ -996,7 +993,7 @@ When the failure is initiated by a component which deletes the pod, then the
API call to append the condition will be issued as a pod status update call before the Pod delete request (not necessarily as the last update request before the actual delete). For Kubelet, which does not delete the pod itself, the pod
-condition is added in the same API request as to change the phase to failed.
+condition is added in the same API request as the phase change to failed.
This way the Job controller will be able to see the condition, and match it against the pod failure policy, when handling a failed pod.

@@ -2067,6 +2064,19 @@ first iteration of the feature, we intend to provide a user-friendly API
targeting the known use-cases. A more flexible API can be considered as a future improvement.

+### Possible future extensions
+
+One possible direction of extending the feature is adding pod failure
+conditions in the following scenarios (see links for discussions on the factors
+that made us decide not to cover the scenarios in Beta):
+- [active deadline timeout exceeded](#active-deadline-timeout-exceeded)
+- [admission failures](#admission-failures)
+- [resource limits exceeded](#resource-limits-exceeded).
+
+We are going to re-evaluate the decisions based on user feedback after users
+start to use the feature, using job failure policies based on the `DisruptionTarget` condition
+and container exit codes.
+
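To close the loop on the `DisruptionTarget`-based policies referred to above, here is a sketch of how the condition may surface in a failed Pod's status; the `reason`, `message` and timestamp values are illustrative placeholders, since the actual values depend on which component initiated the disruption:

```yaml
status:
  phase: Failed
  conditions:
  - type: DisruptionTarget
    status: "True"
    # reason/message convey which kubernetes component initiated the
    # disruption; the values below are only illustrative.
    reason: PreemptionByScheduler
    message: "kube-scheduler: preempting to accommodate a higher priority pod"
    lastTransitionTime: "2022-11-04T12:00:00Z"
```

A `podFailurePolicy` rule with `action: Ignore` and `onPodConditions` matching `DisruptionTarget` would catch such a pod, so its failure would not count against the Job's `backoffLimit`.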