
Introduce the ResourceExhausted Pod condition into the API types #113248

Conversation

@mimowo (Contributor) commented Oct 21, 2022

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

To decouple the API changes from the kubelet implementation (and thus speed up the acceptance and review process) in:
#112360

Which issue(s) this PR fixes:

Special notes for your reviewer:

The new Pod condition follows the same naming convention as the previously added pod condition, DisruptionTarget.
For now we use the "Alpha" prefix in case the feature does not get fully promoted to Beta in this release cycle.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 21, 2022
@k8s-ci-robot (Contributor) commented:

@mimowo: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 21, 2022
@mimowo (Contributor Author) commented Oct 21, 2022

/assign @liggitt

@mimowo (Contributor Author) commented Oct 21, 2022

/retest

@mimowo (Contributor Author) commented Oct 21, 2022

/unassign @alculquicondor

@k8s-triage-robot commented:
This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@mimowo (Contributor Author) commented Oct 21, 2022

/retest

// ResourceExhausted indicates the pod is about to be deleted due to either
// exceeding its ephemeral storage limits or running an OOM killed container.
// The constant is to be renamed once the name is accepted within the KEP-3329.
AlphaNoCompatGuaranteeResourceExhausted = "ResourceExhausted"

Member:

Since apiserver doesn't need this constant, it doesn't need to be defined in this file, only in the versioned one.

Contributor Author (@mimowo):

We will eventually want it here, and it might also be needed by the disruption controller to clean up stale conditions. I think it is possible that a pod temporarily exceeds its ephemeral storage limit, adding the condition succeeds, but the delete fails. Then, IIUC, kubelet wouldn't guarantee retrying the delete in the future.

Member:

The disruption controller is not in apiserver, so it should use the versioned packages.

Contributor Author (@mimowo):

True. Actually, I think we may not need the change in the disruption controller either; please take a look at the other comment.
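
A minimal sketch of the versioned-vs-internal distinction discussed above, assuming the alpha name proposed in this PR: consumers outside apiserver, such as kubelet or the disruption controller, would reference the condition type through the versioned package.

package consumer

import (
	// Versioned API package, for components outside apiserver; the internal
	// k8s.io/kubernetes/pkg/apis/core package stays apiserver-only.
	corev1 "k8s.io/api/core/v1"
)

// Spelled as a literal here because the constant proposed in this PR is
// not merged into the versioned types yet.
const resourceExhausted corev1.PodConditionType = "ResourceExhausted"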

@@ -2658,6 +2658,10 @@ const (
// disruption (such as preemption, eviction API or garbage-collection).
// The constant is to be renamed once the name is accepted within the KEP-3329.
AlphaNoCompatGuaranteeDisruptionTarget PodConditionType = "DisruptionTarget"
// ResourceExhausted indicates the pod is about to be deleted due to either
// exceeding its ephemeral storage limits or running an OOM killed container.

Member:

Is this always added when there is a failure or only if .RestartPolicy=Never?

@mimowo (Contributor Author) commented Oct 21, 2022:

I add it whenever there is a pod failure (and we detect that disk limits were exceeded or a container was OOM-killed). This means I don't add the condition when a container is OOM-killed but the pod continues to run because the container is restarted.

Member:

Please add these semantics to the comment.

Contributor Author (@mimowo):

Updated the comment. Two remarks though:

  1. The pod may not necessarily be in the Failed phase yet, as kubelet makes the transition once all the containers are actually stopped, so typically the condition is added before the Failed phase is sent to the API server.
  2. On further thought, I now add the ResourceExhausted condition only when spec.restartPolicy=Never. I think it simplifies the code, as it makes the disruption controller change unnecessary. If we know a container exceeded its limits and is not restarted, then the pod exceeded its limits, and this will not change, so the condition will never be stale. Note that the Job pod failure policy can currently only be used with restartPolicy=Never anyway.

If 2. makes sense, then I don't actually need this new PodCondition type in the API, as it will only be used by kubelet. Should we drop this PR then, or still declare the types for better visibility? What do you think @liggitt and @alculquicondor?

@@ -2434,6 +2434,10 @@ const (
// disruption (such as preemption, eviction API or garbage-collection).
// The constant is to be renamed once the name is accepted within the KEP-3329.
AlphaNoCompatGuaranteeDisruptionTarget PodConditionType = "DisruptionTarget"
// ResourceExhausted indicates the pod is about to be deleted due to either
// exceeding its ephemeral storage limits or running an OOM killed container.

Member:

Are the examples given here (ephemeral storage and OOM) just examples, or the final list of all resources this will represent?

Member:

Is this condition expected to remain once set, or to flutter?

Contributor Author (@mimowo):

This is the complete list of scenarios, as documented in the KEP and covered by the initial implementation.

Contributor Author (@mimowo):

The condition is generally expected to remain once set, with the same logic as for the DisruptionTarget condition. In two rare scenarios a condition, once set, could change: (1) a race in which another controller attempts to add the condition at roughly the same time; (2) the delete operation fails and the pod is not deleted; then, if the pod is still not deleted after 2 minutes, the condition is removed by the disruption controller.
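
Since the condition is meant to be sticky, a consumer can simply scan the pod's status for it. A minimal sketch (the helper name is made up; the condition type assumes the alpha name from this PR):

import corev1 "k8s.io/api/core/v1"

// hasPodCondition reports whether the pod carries the given condition
// type with status True.
func hasPodCondition(pod *corev1.Pod, t corev1.PodConditionType) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == t && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// e.g. hasPodCondition(pod, corev1.PodConditionType("ResourceExhausted"))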

// ResourceExhausted indicates the pod is about to be deleted due to either
// exceeding its ephemeral storage limits or running an OOM killed container.
// The constant is to be renamed once the name is accepted within the KEP-3329.
AlphaNoCompatGuaranteeResourceExhausted = "ResourceExhausted"

Member:

Is this expected to be a new condition type, or a reason on conditions of the DisruptionTarget type? How is this intended to be used by consumers?

@mimowo (Contributor Author) commented Oct 21, 2022:

It is a new condition type, but it has a very similar lifecycle and mechanics to DisruptionTarget. While DisruptionTarget is used to indicate that a pod was disrupted, ResourceExhausted is expected to be used as an indicator that a pod was terminated due to a non-retriable application or configuration issue. Example job failure policies using this condition are shown in the KEP.
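
A minimal sketch of such a policy using the batch/v1 Go types; the condition type is spelled as a literal string since the constant proposed here is not yet in the versioned package, and the matching semantics are those described in the KEP:

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// Fail the Job as soon as a failed pod carries the ResourceExhausted
// condition with status True.
var podFailurePolicy = &batchv1.PodFailurePolicy{
	Rules: []batchv1.PodFailurePolicyRule{
		{
			Action: batchv1.PodFailurePolicyActionFailJob,
			OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{
				{
					Type:   corev1.PodConditionType("ResourceExhausted"),
					Status: corev1.ConditionTrue,
				},
			},
		},
	},
}

The policy would go into the Job's spec.podFailurePolicy field, behind the JobPodFailurePolicy feature gate.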

Member:

OOM or ephemeral storage limits being reached doesn't necessarily mean the error is not retriable, right?

Member:

Is this condition expected to be mutually exclusive with DisruptionTarget?

Contributor Author (@mimowo):

OOM or ephemeral storage limits being reached doesn't necessarily mean the error is not retriable, right?

Correct; we leave the final decision to the user via the podFailurePolicy configuration.

Is this condition expected to be mutually exclusive with DisruptionTarget?

Not necessarily; in some race situations both conditions may be added (there is no mechanism to prevent this). The user controls which condition type takes priority through the order of the Job's podFailurePolicy rules, as in the sketch below.
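
A sketch of that ordering, assuming rules are evaluated in order with the first match winning (the semantics described in the KEP): an Ignore rule for DisruptionTarget listed before a FailJob rule for ResourceExhausted means a pod carrying both conditions is retried rather than failing the Job.

// Reuses the batchv1/corev1 imports from the earlier sketch.
var rules = []batchv1.PodFailurePolicyRule{
	{
		// Listed first: pods terminated by a disruption are ignored
		// (retried), even if they also carry ResourceExhausted.
		Action: batchv1.PodFailurePolicyActionIgnore,
		OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{
			{Type: corev1.PodConditionType("DisruptionTarget"), Status: corev1.ConditionTrue},
		},
	},
	{
		Action: batchv1.PodFailurePolicyActionFailJob,
		OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{
			{Type: corev1.PodConditionType("ResourceExhausted"), Status: corev1.ConditionTrue},
		},
	},
}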

Member:

I think there is one scenario in which they are non-orthogonal: when the OS is low on memory. This situation can potentially result both in node memory pressure (resulting in DisruptionTarget) and in the OOM killer killing one of the pod's containers (resulting in ResourceExhausted), even if the pod's container limits are not exceeded.

That's a great insight, and it makes me wonder how a user can reliably write a rule to fail job pods on ResourceExhausted conditions if it can be due to things unrelated to the workload.

Member:

The main thing I want to ensure is that any API surface is clear about how it should be used, and reliably works when used that way. Understanding whether this condition can be depended on to indicate a problem with the workload itself seems important.

Member:

Unfortunately, there is no reliable way to detect when a pod was OOM-killed due to memory being exhausted by other pods: #112910

Users can have higher confidence that ResourceExhausted is due to the pod itself if all their pods have memory requests == limits. And this is the use case we are trying to serve.

@mimowo can you include some of these details in the comment?

@mimowo (Contributor Author) commented Oct 27, 2022:

Yes, the most important use case is serving workloads where all pods declare requests == limits for their containers. Admitting such a pod indicates that the node has enough memory to run it, so users could then reliably interpret an OOM-killer invocation as the pod's limits being exceeded. Technically, some other process on the machine could still cause the system to run out of memory and kill the workload container, but that seems an edge case.

When there are containers running with requests < limits, they could eat up the node's memory after being admitted without exceeding their limits, still resulting in an OOM-killer invocation. However, the invocation of the OOM killer can generally be prevented by configuring the cluster to leave enough memory to the system, so that memory pressure is raised before the OOM killer is invoked, following the practices here: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#node-pressure-eviction-good-practices. When pods are evicted due to memory pressure, the DisruptionTarget condition is added.

Given the limitations of the OOM killer, I am not sure what can be done better except documenting this.
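
To make that concrete, a minimal sketch of a container declaring memory requests equal to limits (the name, image, and quantity are illustrative):

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// With memory requests == limits, the pod is only admitted to a node that
// has the full 512Mi available for it, so an OOM kill of this container
// can, with high confidence, be attributed to exceeding its own limit.
var container = corev1.Container{
	Name:  "main",                   // illustrative name
	Image: "example.com/app:latest", // illustrative image
	Resources: corev1.ResourceRequirements{
		Requests: corev1.ResourceList{corev1.ResourceMemory: resource.MustParse("512Mi")},
		Limits:   corev1.ResourceList{corev1.ResourceMemory: resource.MustParse("512Mi")},
	},
}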

@mimowo (Contributor Author) commented Oct 28, 2022:

FYI, it turns out that on CRI-O with cgroups v2, OOM-killed containers don't have the reason field set to OOMKilled, but to a generic Error. It seems like a bug in CRI-O, but the discussion is open: #112977 (comment).
We are still discussing the importance of the feature for our use cases, and are thus considering dropping ResourceExhausted from the KEP and only adding the DisruptionTarget condition in case of node-pressure eviction and node graceful shutdown.

@k8s-ci-robot k8s-ci-robot removed the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Oct 24, 2022
The main motivation is to decouple the API changes from the
kubelet implementation in the PR:
kubernetes#112360

Use the same naming convention as for the previously added pod
condition: AlphaNoCompatGuaranteeDisruptionTarget. Use the "Alpha" prefix
in case the feature does not get fully promoted to Beta in this release
cycle.
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 24, 2022
@mimowo mimowo force-pushed the handling-pod-failures-beta-resourceexhausted-api branch from eb41376 to 7acef19 Compare October 24, 2022 13:25
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mimowo
Once this PR has been reviewed and has the lgtm label, please ask for approval from liggitt by writing /assign @liggitt in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

// ResourceExhausted indicates the pod is in the Failed phase or is about to
// transition into the Failed phase (and is about to be deleted) due to either:
// - exceeding its ephemeral storage limits; or
// - running an OOM killed container when the pod's .spec.restartPolicy=Never.

Member:

When it exceeds ephemeral storage limits, is the condition always applied, or only if restartPolicy=Never?

Contributor Author (@mimowo):

In this case the condition is added regardless of restartPolicy, as the entire pod is evicted by the eviction manager.

@alculquicondor (Member) commented:

/lgtm

/assign @liggitt

@mimowo (Contributor Author) commented Oct 27, 2022

@liggitt let me know if there is anything else needing attention here. I would like to use the new types in my downstream PR for the kubelet changes.

@mimowo mimowo marked this pull request as draft November 7, 2022 07:43
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 7, 2022
@mimowo (Contributor Author) commented Nov 8, 2022

Closing for the 1.26 release cycle to avoid confusion. It won't be done for 1.26, as discussed in the KEP (see the update PR: kubernetes/enhancements#3646). We may reconsider the decision in the future.

@mimowo mimowo closed this Nov 8, 2022
@mimowo mimowo deleted the handling-pod-failures-beta-resourceexhausted-api branch November 29, 2023 15:02