diff --git a/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md b/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md
index 67b99278cbfb..013518376f63 100644
--- a/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md
+++ b/keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md
@@ -285,8 +285,9 @@ know that this has succeeded?
 - Handling of Pod configuration errors resulting in pods stuck in the `Pending`
   state (value of `status.phase`) rather than `Failed` (such as incorrect image
   name, non-matching configMap references, incorrect PVC references).
-- Adding of `ResourceExhausted` condition for Windows containers due to
-  excessive memory consumption.
+- Adding pod conditions to indicate failures due to admission errors or
+  exceeding the active deadline timeout. Also, adding pod conditions to indicate failures due to exhausting resources
+  (memory or ephemeral storage) (see: [Termination initiated by Kubelet](#termination-initiated-by-kubelet))
 - Adding of the disruption condition for graceful node shutdown on Windows,
   as there is some ground work required first to support Pod eviction due to
   graceful node shutdown on Windows.
@@ -422,9 +423,6 @@ spec:
   backoffLimit: 3
   podFailurePolicy:
     rules:
-    - action: FailJob
-      onPodConditions:
-      - type: ResourceExhausted
     - action: Ignore
       onPodConditions:
       - type: DisruptionTarget
@@ -464,9 +462,6 @@ spec:
   backoffLimit: 3
   podFailurePolicy:
     rules:
-    - action: FailJob
-      onPodConditions:
-      - type: ResourceExhausted
     - action: Count
       onPodConditions:
       - type: DisruptionTarget
@@ -516,9 +511,7 @@ it effectively means that the use of job's pod failure policy requires
 This is in order to avoid the problematic race-conditions between Kubelet and
 Job controller. For example, Kubelet could restart a failed container before
 the Job controller decides to terminate the corresponding job due to a rule using
-`onExitCodes`. Also, Kubelet could restart a failed container due to exceeded
-resource limits before the Job controller decides to terminate the job due
-to a rule using `onPodConditions` to match the `ResourceExhausted` condition.
+`onExitCodes`.
 
 #### Current state review
 
@@ -763,10 +756,8 @@ by the `crictl stop` command
 In Alpha, there is no support for Pod conditions for failures or disruptions
 initiated by kubelet.
 For Beta we introduce handling of Pod failures initiated by Kubelet by adding
-dedicated Pod conditions. In particular, we add the condition in case a pod is
-evicted due to exceeded memory (only Linux) or ephemeral-storage limits.
-Additionally, we add the disruption condition (introduced in Alpha) in case of
-disruptions initiated by kubetlet (see [Design details](#design-details)).
+the pod disruption condition (introduced in Alpha) in case of disruptions
+initiated by kubelet (see [Design details](#design-details)).
 
 Kubelet can also evict a pod due to exceeded active deadline timeout (configured
 by pod's `.spec.activeDeadlineSeconds` field). On one hand, exceeding the timeout,
@@ -780,6 +771,20 @@ in terms of retriability and evolving Pod condition types
 to do not add any pod condition in this case. It should be re-considered in the
 future if there is a good motivating use-case.
 
+Also, Kubelet can evict a pod due to admission failures. In some scenarios a pod
+that failed admission could be restarted and is likely to succeed on another node,
+like in case of a pod scheduled to a node with resource pressure, see:
+[Pod admission error](#pod-admission-error). However, in some situations it is
+less clear, as the failure can be caused by incompatible pod and node
+configurations. As node configurations are often the same within a cluster,
+it is likely that the pod would fail if restarted on any node in the cluster.
+Adding the `DisruptionTarget` condition could cause a never-ending loop of retries
+if the pod failure policy was configured to ignore such failures. Thus, similarly
+as in the case of pod failures due to exceeding the active deadline timeout, we decide
+not to add any pod condition for such failures. If there is a sufficient motivating
+use-case, a dedicated pod condition might be introduced to annotate pod failures
+due to some of the admission failure scenarios.
+
 #### JobSpec API alternatives
 
 Alternative versions of the JobSpec API to define requirements on exit codes and
@@ -861,11 +866,9 @@ defined in the k8s.io/apis/core/v1 package.
 Thus, every addition of a new condition type will require an API review.
 The constants will allow users of the package to reduce the risk of typing mistakes.
 
-Additionally, we introduce a pair of generic condition types:
-- `DisruptionTarget`: indicates pod disruption (due to e.g. preemption,
-API-initiated eviction or taint-based eviction).
-- `ResourceExhausted`: indicates the pod is evicted due to exceeding
-ephemeral-storage limits or due to being terminated by OOM killer.
+Additionally, we introduce a generic condition type `DisruptionTarget`: it
+indicates pod disruption (due to e.g. preemption, API-initiated eviction or
+taint-based eviction).
 
 A more detailed information about the condition (containing the kubernetes
 component name which initiated the disruption or the resource name for the
@@ -983,7 +986,7 @@ condition makes it easier to determine if a failed pod should be restarted):
 - DeletionByTaintManager (Pod evicted by kube-controller-manager due to taints)
 - EvictionByEvictionAPI (Pod deleted by Eviction API)
 - DeletionByPodGC (an orphaned Pod deleted by pod GC)
-- DeletionByKubelet (Pod deleted due to graceful node shutdown, node pressure or Pod admission errors).
+- DeletionByKubelet (Pod deleted due to graceful node shutdown or node pressure).
 
 We introduce a Pod condition type, called `ResourceExhausted`, used by Kubelet
 to indicate that a Pod failure is caused by exhausting pod's (or container's)
@@ -995,10 +998,13 @@ the reason to one of the values:
 
 The already existing `status.conditions` field in Pod will be used by kubernetes
 components to append a dedicated condition.
 
-The API call to append the condition will be issued as a pod status update call
+When the failure is initiated by a component which deletes the pod, the API
+call to append the condition will be issued as a pod status update call
 before the Pod delete request (not necessarily as the last update request before
-the actual delete). This way the Job controller will be able to see the
-condition, and match it against the pod failure policy, when handling a failed pod.
+the actual delete). For Kubelet, which does not delete the pod itself, the pod
+condition is added in the same API request that changes the pod phase to `Failed`.
+This way the Job controller will be able to see the condition, and match it
+against the pod failure policy, when handling a failed pod.
 During the implementation process we are going to review the places where the
 pod delete requests are issued to modify the code to also append a meaningful
@@ -1275,21 +1281,11 @@ spec:
         containerName: main-job-container
         operator: In
         values: [1,2,3]
-    - action: FailJob
-      onPodConditions:
-      - type: ResourceExhausted
     - action: Ignore
       onPodConditions:
       - type: DisruptionTarget
 ```
-
-Note that, it may happen that both the `DisruptionTarget` condition and
-`ResourceExhausted` conditions are both added to the failed Pod status. This
-may happen when different components (for example `kube-apiserver` and `kubelet`)
-update the Pod status. The priority of handling of the conditions can be
-expressed by the order of rules configured in the `podFailurePolicy` field.
-
 ### Evaluation
 
 We use the `syncJob` function of the Job controller to evaluate the specified
@@ -1362,12 +1358,8 @@ the following scenarios will be covered with unit tests:
 - a failed container with non-zero exit code,
 - a dedicated Pod condition indicating termmination originated by a kubernetes component
 - adding of the `DisruptionTarget` by Kubelet in case of:
-  - admission errors
   - eviciton due to graceful node shutdown
   - eviction due to node pressure
-- adding of the `ResourceExhausted` by Kubelet:
-  - in response to OOMKilled reason
-  - due to exceeding ephemeral-storage limits
 - `k8s.io/kubernetes/pkg/kubelet/eviction`: `13 Sep 2022` - `67.7%`
-- `k8s.io/kubernetes/pkg/kubelet`: `13 Sep 2022` - `65.1%`
 
 ##### Integration tests
@@ -1476,15 +1467,12 @@ Below are some examples to consider, in addition to the aforementioned [maturity
 - Address reviews and bug reports from Alpha users
 - E2e tests are in Testgrid and linked in KEP
 - implementation of extending the existing Job controller's metrics:
-  `job_finished_total` by the `reason` field; and `job_pods_finished_total`
-  by the `failure_policy_action` field (see also
-  [here](#how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads))
-- implementation of adding pod failure conditions (`DisruptionTarget` or
-  `ResourceExhausted` depending on the scenario) by Kubelet when terminating a
-  Pod (see: [Termination initiated by Kubelet](#termination-initiated-by-kubelet))
+  `job_finished_total` by the `reason` field; and introduction of the
+  `pod_failures_handled_by_failure_policy_total` metric with the `action` label
+  (see also [here](#how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads))
+- implementation of adding pod disruption conditions (`DisruptionTarget`)
+  by Kubelet when terminating a Pod (see: [Termination initiated by Kubelet](#termination-initiated-by-kubelet))
 - Commonize the code for appending pod conditions between components
-- Do not update the pod disruption condition (with type=`DisruptionTarget`) if
-  it is already present with `status=True`
 - Refactor adding of pod conditions with the use of
   [SSA](https://kubernetes.io/docs/reference/using-api/server-side-apply/) client.
 - The feature flag enabled by default
@@ -1784,12 +1772,11 @@ It can be used to determine what is the relative frequency of job terminations
 due to different reasons. For example, if jobs are terminated often due to
 `BackoffLimitExceeded` it may suggest that the pod failure policy should be
 extended with new rules to terminate jobs early more often
-  - `job_pods_finished_total` (existing, extended by a label): the new
-`failure_policy_action` label tracks the number of failed pods that are handled
-by a specific failure policy action. Possible values are: `JobFailed`, `Ignored`
-and `Counted`. If a pod failure does not match the pod failure policy then
-the value for the label is left empty. This metric can be used to assess the
-coverage of pod failure scenarios with `spec.podFailurePolicy` rules.
+  - `pod_failures_handled_by_failure_policy_total` (new): the `action` label
+tracks the number of failed pods that are handled by a specific failure policy
+action. Possible values are: `FailJob`, `Ignore`
+and `Count`. This metric can be used to assess the coverage of pod failure
+scenarios with `spec.podFailurePolicy` rules.
 - [x] Pod .status
-  - Condition type: `DisruptionTarget` or `ResourceExhausted` when a Pod is terminated due to a reason listed in [design details](#design-details).
+  - Condition type: `DisruptionTarget` when a Pod is terminated due to a reason listed in [design details](#design-details).
 - [x] Job .status
   - Condition reason: `PodFailurePolicy` for the job `Failed` condition if the job was terminated due to the matching `FailJob` rule.
@@ -2078,6 +2065,9 @@ technics apply):
 - added [Story 3](#story-3) to demonstrate how to use the API to ensure there
   are no infinite Pod retries
 - updated [Graduation Criteria for Beta](#beta)
 - updated of kep.yaml and [PRR questionnaire](#production-readiness-review-questionnaire) to prepare the KEP for Beta
+- 2022-10-27: PR "Use SSA to add pod failure conditions" ([link](https://github.com/kubernetes/kubernetes/pull/113304))
+- 2022-10-31: PR "Extend metrics with the new labels" ([link](https://github.com/kubernetes/kubernetes/pull/113324))
+- 2022-11-03: PR "Fix disruption controller permissions to allow patching pod's status" ([link](https://github.com/kubernetes/kubernetes/pull/113580))
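For reference, the post-change policy shape that the edited examples converge on can be assembled into a single manifest. This is only a minimal sketch built from the fragments shown in the hunks above (`podFailurePolicy`, the `onExitCodes` rule, the `Ignore` rule on `DisruptionTarget`); the metadata name, image, and command are illustrative assumptions, not taken from the KEP:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-pod-failure-policy-example   # illustrative name, not from the KEP
spec:
  template:
    spec:
      restartPolicy: Never               # the pod failure policy requires restartPolicy: Never
      containers:
      - name: main-job-container
        image: job-image                 # illustrative image, not from the KEP
        command: ["./program"]
  backoffLimit: 3
  podFailurePolicy:
    rules:
    # Rules are evaluated in order; the first matching rule decides the action.
    - action: FailJob
      onExitCodes:
        containerName: main-job-container
        operator: In
        values: [1, 2, 3]
    # Failures carrying the DisruptionTarget condition (e.g. preemption,
    # API-initiated eviction, taint-based eviction) are ignored, i.e. they
    # do not count towards backoffLimit.
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
```

A pod failure that matches none of the rules is handled in the default way, i.e. counted against `backoffLimit`.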