Update on changes for Beta
mimowo committed Nov 7, 2022
1 parent 4ded6c4 commit 5acc235
Showing 1 changed file with 44 additions and 54 deletions.
98 changes: 44 additions & 54 deletions keps/sig-apps/3329-retriable-and-non-retriable-failures/README.md
@@ -285,8 +285,9 @@ know that this has succeeded?
- Handling of Pod configuration errors resulting in pods stuck in the `Pending`
state (value of `status.phase`) rather than `Failed` (such as incorrect image
name, non-matching configMap references, incorrect PVC references).
- Adding pod conditions to indicate failures due to admission failures or
  exceeding the active deadline timeout. Also, adding pod conditions to indicate
  failures due to exhausting a resource (memory or ephemeral storage)
  (see: [Termination initiated by Kubelet](#termination-initiated-by-kubelet)).
- Adding of the disruption condition for graceful node shutdown on Windows, as
  some groundwork is required first to support Pod eviction due to graceful
  node shutdown on Windows.
@@ -422,9 +423,6 @@ spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
@@ -464,9 +462,6 @@ spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: Count
      onPodConditions:
      - type: DisruptionTarget
@@ -516,9 +511,7 @@ it effectively means that the use of job's pod failure policy requires
This is in order to avoid problematic race conditions between the Kubelet and
the Job controller. For example, the Kubelet could restart a failed container
before the Job controller decides to terminate the corresponding job due to a
rule using `onExitCodes`.
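
As an illustrative sketch (not part of the KEP text itself), a Job satisfying
this requirement could look as follows; the name, image, and exit-code value
are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: job-exit-code-example        # hypothetical name
spec:
  backoffLimit: 3
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: main-job-container
        operator: In
        values: [42]                 # placeholder exit code for a permanent bug
  template:
    spec:
      restartPolicy: Never           # required when podFailurePolicy is used
      containers:
      - name: main-job-container
        image: example.com/batch/app:v1   # placeholder image
```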

#### Current state review

@@ -763,10 +756,8 @@ by the `crictl stop` command
In Alpha, there is no support for Pod conditions for failures or disruptions initiated by kubelet.

For Beta we introduce handling of Pod failures initiated by the Kubelet by adding
the pod disruption condition (introduced in Alpha) in case of disruptions
initiated by the Kubelet (see [Design details](#design-details)).

The Kubelet can also evict a pod when it exceeds the active deadline timeout
(configured by the pod's `.spec.activeDeadlineSeconds` field). On one hand, exceeding the timeout,
@@ -780,6 +771,20 @@ in terms of retriability and evolving Pod condition types
to not add any pod condition in this case. It should be reconsidered in the
future if there is a good motivating use-case.
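
For reference, a minimal sketch of how this timeout is configured (the name,
image, and deadline value are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: deadline-example             # hypothetical name
spec:
  activeDeadlineSeconds: 600         # kubelet fails the pod ~10 minutes after it starts
  restartPolicy: Never
  containers:
  - name: main
    image: example.com/batch/app:v1  # placeholder image
```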

Also, the Kubelet can reject a pod due to admission failures. In some scenarios a
pod that failed admission could be restarted and is likely to succeed on another
node, as in the case of a pod scheduled to a node with resource pressure (see:
[Pod admission error](#pod-admission-error)). However, in some situations it is
less clear, as the failure can be caused by incompatible pod and node
configurations. Since node configurations are often the same within a cluster,
it is likely that the pod would fail if restarted on any node in the cluster.
Adding the `DisruptionTarget` condition could then cause a never-ending loop of
retries if the pod failure policy was configured to ignore such failures. Thus,
similarly to pod failures due to exceeding the active deadline timeout, we decide
not to add any pod condition for such failures. If there is a sufficient
motivating use-case, a dedicated pod condition might be introduced to annotate
pod failures due to some of the admission failure scenarios.

#### JobSpec API alternatives

Alternative versions of the JobSpec API to define requirements on exit codes and
@@ -861,11 +866,9 @@ defined in the k8s.io/apis/core/v1 package. Thus, every addition of a new
condition type will require an API review. The constants will allow users of the
package to reduce the risk of typing mistakes.

Additionally, we introduce a generic condition type, `DisruptionTarget`, which
indicates pod disruption (due to e.g. preemption, API-initiated eviction or
taint-based eviction).

More detailed information about the condition (containing the kubernetes
component name which initiated the disruption or the resource name for the
@@ -983,7 +986,7 @@ condition makes it easier to determine if a failed pod should be restarted):
- DeletionByTaintManager (Pod evicted by kube-controller-manager due to taints)
- EvictionByEvictionAPI (Pod deleted by Eviction API)
- DeletionByPodGC (an orphaned Pod deleted by pod GC)
- DeletionByKubelet (Pod deleted due to graceful node shutdown or node pressure).
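
For example (an illustrative sketch, not normative API output), a Pod deleted
via the Eviction API might end up with a status like:

```yaml
status:
  phase: Failed
  conditions:
  - type: DisruptionTarget
    status: "True"
    reason: EvictionByEvictionAPI                 # one of the reasons listed above
    message: "pod deleted via the Eviction API"   # illustrative message
    lastTransitionTime: "2022-11-07T10:00:00Z"    # illustrative timestamp
```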

@@ -995,10 +998,13 @@ the reason to one of the values:
The already existing `status.conditions` field in Pod will be used by kubernetes
components to append a dedicated condition.

When the failure is initiated by a component which deletes the pod, the API
call to append the condition will be issued as a pod status update call
before the Pod delete request (not necessarily as the last update request before
the actual delete). For the Kubelet, which does not delete the pod itself, the
pod condition is added in the same API request that transitions the pod's phase
to `Failed`. This way the Job controller will be able to see the condition, and
match it against the pod failure policy, when handling a failed pod.
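
To illustrate the Kubelet path (a sketch; the `reason` value is taken from the
list above):

```yaml
# single pod status update issued by the kubelet (sketch)
status:
  phase: Failed                  # phase transition...
  conditions:
  - type: DisruptionTarget       # ...and the condition, in the same request
    status: "True"
    reason: DeletionByKubelet    # reason string from the list above
```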

During the implementation process we are going to review the places where the
pod delete requests are issued, to modify the code to also append a meaningful
@@ -1275,21 +1281,11 @@ spec:
        containerName: main-job-container
        operator: In
        values: [1,2,3]
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
```

### Evaluation
We use the `syncJob` function of the Job controller to evaluate the specified
@@ -1362,12 +1358,8 @@ the following scenarios will be covered with unit tests:
- a failed container with non-zero exit code,
- a dedicated Pod condition indicating termination originated by a kubernetes component
- adding of the `DisruptionTarget` condition by Kubelet in case of:
  - eviction due to graceful node shutdown
  - eviction due to node pressure
<!--
Additionally, for Alpha try to enumerate the core package you will be touching
to implement this enhancement and provide the current unit coverage for those
@@ -1387,7 +1379,6 @@ The core packages (with their unit test coverage) which are going to be modified
The kubelet packages (with their unit test coverage) which are going to be modified during implementation:
- `k8s.io/kubernetes/pkg/kubelet/nodeshutdown`: `13 Sep 2022` - `74.9%` <!--(handling of nodeshutdown)-->
- `k8s.io/kubernetes/pkg/kubelet/eviction`: `13 Sep 2022` - `67.7%` <!--(handling of node-pressure eviction)-->

##### Integration tests

@@ -1476,15 +1467,12 @@ Below are some examples to consider, in addition to the aforementioned [maturity
- Address reviews and bug reports from Alpha users
- E2e tests are in Testgrid and linked in KEP
- implementation of extending the existing Job controller's metrics:
  `job_finished_total` by the `reason` field; and introduction of the
  `pod_failures_handled_by_failure_policy_total` metric with the `action` label
  (see also [here](#how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads))
- implementation of adding the pod disruption condition (`DisruptionTarget`)
  by Kubelet when terminating a Pod (see: [Termination initiated by Kubelet](#termination-initiated-by-kubelet))
- Commonize the code for appending pod conditions between components
- Do not update the pod disruption condition (with type=`DisruptionTarget`) if
it is already present with `status=True`
- Refactor adding of pod conditions with the use of
[SSA](https://kubernetes.io/docs/reference/using-api/server-side-apply/) client.
- The feature flag enabled by default
@@ -1784,12 +1772,11 @@ It can be used to determine what is the relative frequency of job terminations
due to different reasons. For example, if jobs are often terminated due to
`BackoffLimitExceeded`, it may suggest that the pod failure policy should be
extended with new rules so that jobs are terminated early more often.
- `pod_failures_handled_by_failure_policy_total` (new): the `action` label
  tracks the number of failed pods that are handled by a specific failure policy
  action. Possible values are: `FailJob`, `Ignore` and `Count`. This metric can
  be used to assess the coverage of pod failure scenarios with
  `spec.podFailurePolicy` rules.
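
As a usage sketch (assuming Prometheus scrapes kube-controller-manager; the
rule file below is illustrative and not part of the KEP):

```yaml
# illustrative Prometheus rule file
groups:
- name: pod-failure-policy            # hypothetical group name
  rules:
  - record: job:pod_failures_ignored:rate5m
    # per-second rate of pod failures handled by Ignore rules, over 5 minutes
    expr: rate(pod_failures_handled_by_failure_policy_total{action="Ignore"}[5m])
```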

<!--
Ideally, this should be a metric. Operations against the Kubernetes API (e.g.,
@@ -1809,7 +1796,7 @@ Recall that end users cannot usually observe component logs or access metrics.
-->

- [x] Pod .status
- Condition type: `DisruptionTarget` when a Pod is terminated due to a reason listed in [design details](#design-details).
- [x] Job .status
- Condition reason: `PodFailurePolicy` for the job `Failed` condition if the job was terminated due to the matching `FailJob` rule.

@@ -2078,6 +2065,9 @@ technics apply):
- added [Story 3](#story-3) to demonstrate how to use the API to ensure there are no infinite Pod retries
- updated [Graduation Criteria for Beta](#beta)
- updated kep.yaml and the [PRR questionnaire](#production-readiness-review-questionnaire) to prepare the KEP for Beta
- 2022-10-27: PR "Use SSA to add pod failure conditions" ([link](https://github.com/kubernetes/kubernetes/pull/113304))
- 2022-10-31: PR "Extend metrics with the new labels" ([link](https://github.com/kubernetes/kubernetes/pull/113324))
- 2022-11-03: PR "Fix disruption controller permissions to allow patching pod's status" ([link](https://github.com/kubernetes/kubernetes/pull/113580))

<!--
Major milestones in the lifecycle of a KEP should be tracked in this section.
