Update on changes for Beta
mimowo committed Nov 7, 2022
1 parent 4ded6c4 commit 5acc235
Showing 1 changed file with 44 additions and 54 deletions.
98 changes: 44 additions & 54 deletions keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md
@@ -285,8 +285,9 @@ know that this has succeeded?
- Handling of Pod configuration errors resulting in pods stuck in the `Pending`
state (value of `status.phase`) rather than `Failed` (such as incorrect image
name, non-matching configMap references, incorrect PVC references).
- Adding pod conditions to indicate failures due to admission failures or
  exceeding the active deadline timeout. Also, adding pod conditions to indicate
  failures due to exhausting a resource (memory or ephemeral storage)
  (see: [Termination initiated by Kubelet](#termination-initiated-by-kubelet)).
- Adding of the disruption condition for graceful node shutdown on Windows, as
  some groundwork is required first to support Pod eviction due to graceful
  node shutdown on Windows.
@@ -422,9 +423,6 @@ spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
@@ -464,9 +462,6 @@ spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: Count
      onPodConditions:
      - type: DisruptionTarget
@@ -516,9 +511,7 @@ it effectively means that the use of job's pod failure policy requires
This is in order to avoid problematic race conditions between the Kubelet and
the Job controller. For example, the Kubelet could restart a failed container
before the Job controller decides to terminate the corresponding job due to a
rule using `onExitCodes`.
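
As an illustrative sketch (not part of the KEP text itself), a Job satisfying
this requirement could look as follows; the name, image, and exit-code value
are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-exit-code-example        # hypothetical name
spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: main-job-container
        operator: In
        values: [42]                 # placeholder exit code for a permanent bug
  template:
    spec:
      restartPolicy: Never           # required when podFailurePolicy is used
      containers:
      - name: main-job-container
        image: example.com/batch/app:v1   # placeholder image
```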

#### Current state review

@@ -763,10 +756,8 @@ by the `crictl stop` command
In Alpha, there is no support for Pod conditions for failures or disruptions initiated by kubelet.

For Beta we introduce handling of Pod failures initiated by the Kubelet by adding
the pod disruption condition (introduced in Alpha) in case of disruptions
initiated by the Kubelet (see [Design details](#design-details)).

The Kubelet can also evict a pod when it exceeds the active deadline timeout
(configured by the pod's `.spec.activeDeadlineSeconds` field). On one hand, exceeding the timeout,
@@ -780,6 +771,20 @@ in terms of retriability and evolving Pod condition types
to not add any pod condition in this case. It should be reconsidered in the
future if there is a good motivating use-case.
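
For reference, a minimal sketch of how this timeout is configured (the name,
image, and deadline value are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: deadline-example             # hypothetical name
spec:
  activeDeadlineSeconds: 600         # kubelet fails the pod ~10 minutes after it starts
  restartPolicy: Never
  containers:
  - name: main
    image: example.com/batch/app:v1  # placeholder image
```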

Also, the Kubelet can reject a pod due to admission failures. In some scenarios a
pod that failed admission could be restarted and is likely to succeed on another
node, as in the case of a pod scheduled to a node with resource pressure (see:
[Pod admission error](#pod-admission-error)). However, in some situations it is
less clear, as the failure can be caused by incompatible pod and node
configurations. Since node configurations are often the same within a cluster,
it is likely that the pod would fail if restarted on any node in the cluster.
Adding the `DisruptionTarget` condition could then cause a never-ending loop of
retries if the pod failure policy was configured to ignore such failures. Thus,
similarly to pod failures due to exceeding the active deadline timeout, we decide
not to add any pod condition for such failures. If there is a sufficient
motivating use-case, a dedicated pod condition might be introduced to annotate
pod failures due to some of the admission failure scenarios.

#### JobSpec API alternatives

Alternative versions of the JobSpec API to define requirements on exit codes and
@@ -861,11 +866,9 @@ defined in the k8s.io/apis/core/v1 package. Thus, every addition of a new
condition type will require an API review. The constants will allow users of the
package to reduce the risk of typing mistakes.

Additionally, we introduce a generic condition type, `DisruptionTarget`, which
indicates pod disruption (due to e.g. preemption, API-initiated eviction or
taint-based eviction).

More detailed information about the condition (containing the kubernetes
component name which initiated the disruption or the resource name for the
@@ -983,7 +986,7 @@ condition makes it easier to determine if a failed pod should be restarted):
- DeletionByTaintManager (Pod evicted by kube-controller-manager due to taints)
- EvictionByEvictionAPI (Pod deleted by Eviction API)
- DeletionByPodGC (an orphaned Pod deleted by pod GC)
- DeletionByKubelet (Pod deleted due to graceful node shutdown or node pressure).
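
For example (an illustrative sketch, not normative API output), a Pod deleted
via the Eviction API might end up with a status like:

```yaml
status:
  phase: Failed
  conditions:
  - type: DisruptionTarget
    status: "True"
    reason: EvictionByEvictionAPI                 # one of the reasons listed above
    message: "pod deleted via the Eviction API"   # illustrative message
    lastTransitionTime: "2022-11-07T10:00:00Z"    # illustrative timestamp
```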

@@ -995,10 +998,13 @@ the reason to one of the values:
The already existing `status.conditions` field in Pod will be used by kubernetes
components to append a dedicated condition.

When the failure is initiated by a component which deletes the pod, the API
call to append the condition will be issued as a pod status update call
before the Pod delete request (not necessarily as the last update request before
the actual delete). For the Kubelet, which does not delete the pod itself, the
pod condition is added in the same API request that transitions the pod's phase
to `Failed`. This way the Job controller will be able to see the condition, and
match it against the pod failure policy, when handling a failed pod.
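
To illustrate the Kubelet path (a sketch; the `reason` value is taken from the
list above):

```yaml
# single pod status update issued by the kubelet (sketch)
status:
  phase: Failed                  # phase transition...
  conditions:
  - type: DisruptionTarget       # ...and the condition, in the same request
    status: "True"
    reason: DeletionByKubelet    # reason string from the list above
```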

During the implementation process we are going to review the places where the
pod delete requests are issued, to modify the code to also append a meaningful
@@ -1275,21 +1281,11 @@ spec:
        containerName: main-job-container
        operator: In
        values: [1,2,3]
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
```

### Evaluation
We use the `syncJob` function of the Job controller to evaluate the specified
@@ -1362,12 +1358,8 @@ the following scenarios will be covered with unit tests:
- a failed container with non-zero exit code,
- a dedicated Pod condition indicating termination originated by a kubernetes component
- adding of the `DisruptionTarget` condition by Kubelet in case of:
  - eviction due to graceful node shutdown
  - eviction due to node pressure
<!--
Additionally, for Alpha try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
@@ -1387,7 +1379,6 @@ The core packages (with their unit test coverage) which are going to be modified
The kubelet packages (with their unit test coverage) which are going to be modified during implementation:
- `k8s.io/kubernetes/pkg/kubelet/nodeshutdown`: `13 Sep 2022` - `74.9%` <!--(handling of nodeshutdown)-->
- `k8s.io/kubernetes/pkg/kubelet/eviction`: `13 Sep 2022` - `67.7%` <!--(handling of node-pressure eviction)-->

##### Integration tests

@@ -1476,15 +1467,12 @@ Below are some examples to consider, in addition to the aforementioned [maturity
- Address reviews and bug reports from Alpha users
- E2e tests are in Testgrid and linked in KEP
- implementation of extending the existing Job controller's metrics:
  `job_finished_total` by the `reason` field; and introduction of the
  `pod_failures_handled_by_failure_policy_total` metric with the `action` label
  (see also [here](#how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads))
- implementation of adding the pod disruption condition (`DisruptionTarget`)
  by Kubelet when terminating a Pod (see: [Termination initiated by Kubelet](#termination-initiated-by-kubelet))
- Commonize the code for appending pod conditions between components
- Do not update the pod disruption condition (with type=`DisruptionTarget`) if
it is already present with `status=True`
- Refactor adding of pod conditions with the use of
[SSA](https://kubernetes.io/docs/reference/using-api/server-side-apply/) client.
- The feature flag enabled by default
@@ -1784,12 +1772,11 @@ It can be used to determine what is the relative frequency of job terminations
due to different reasons. For example, if jobs are often terminated due to
`BackoffLimitExceeded`, it may suggest that the pod failure policy should be
extended with new rules so that jobs are terminated early more often.
- `pod_failures_handled_by_failure_policy_total` (new): the `action` label
  tracks the number of failed pods that are handled by a specific failure policy
  action. Possible values are: `FailJob`, `Ignore` and `Count`. This metric can
  be used to assess the coverage of pod failure scenarios with
  `spec.podFailurePolicy` rules.
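
As a usage sketch (assuming Prometheus scrapes kube-controller-manager; the
rule file below is illustrative and not part of the KEP):

```yaml
# illustrative Prometheus rule file
groups:
- name: pod-failure-policy            # hypothetical group name
  rules:
  - record: job:pod_failures_ignored:rate5m
    # per-second rate of pod failures handled by Ignore rules, over 5 minutes
    expr: rate(pod_failures_handled_by_failure_policy_total{action="Ignore"}[5m])
```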

<!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
@@ -1809,7 +1796,7 @@ Recall that end users cannot usually observe component logs or access metrics.
-->

- [x] Pod .status
- Condition type: `DisruptionTarget` when a Pod is terminated due to a reason listed in [design details](#design-details).
- [x] Job .status
- Condition reason: `PodFailurePolicy` for the job `Failed` condition if the job was terminated due to the matching `FailJob` rule.

@@ -2078,6 +2065,9 @@ technics apply):
- added [Story 3](#story-3) to demonstrate how to use the API to ensure there are no infinite Pod retries
- updated [Graduation Criteria for Beta](#beta)
- updated kep.yaml and the [PRR questionnaire](#production-readiness-review-questionnaire) to prepare the KEP for Beta
- 2022-10-27: PR "Use SSA to add pod failure conditions" ([link](https://github.com/kubernetes/kubernetes/pull/113304))
- 2022-10-31: PR "Extend metrics with the new labels" ([link](https://github.com/kubernetes/kubernetes/pull/113324))
- 2022-11-03: PR "Fix disruption controller permissions to allow patching pod's status" ([link](https://github.com/kubernetes/kubernetes/pull/113580))

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
