Introduce the ResourceExhausted Pod condition into the API types #113248
Conversation
@mimowo: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @liggitt
/retest
/unassign @alculquicondor
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
/retest
// ResourceExhausted indicates the pod is about to be deleted due to either
// exceeding its ephemeral storage limits or running an OOM killed container.
// The constant is to be renamed once the name is accepted within the KEP-3329.
AlphaNoCompatGuaranteeResourceExhausted = "ResourceExhausted"
since apiserver doesn't need this constant, it doesn't need to be defined in this file, only the versioned one.
We want it here eventually, and it might also be needed by the disruption controller to clean up stale ones. I think it is possible that a pod temporarily exceeds its ephemeral storage limit, adding the condition succeeds, but the delete fails. Then, IIUC, kubelet wouldn't guarantee retrying the delete in the future.
the disruption controller is not in apiserver, so it should use the versioned packages
True, and actually I think we may not need the change in the disruption controller either; please take a look at the other comment.
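For reference, a minimal sketch (assumed wiring, not code from this PR) of how a consumer built against the versioned API, such as the disruption controller, would check the condition through k8s.io/api/core/v1 rather than the internal pkg/apis/core package:

```go
package disruptionsketch

import (
	v1 "k8s.io/api/core/v1"
)

// hasResourceExhausted checks the pod's status conditions via the versioned
// k8s.io/api/core/v1 types, the way a component outside the apiserver would.
// The string literal stands in for whatever versioned constant this PR
// ultimately defines.
func hasResourceExhausted(pod *v1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == v1.PodConditionType("ResourceExhausted") && c.Status == v1.ConditionTrue {
			return true
		}
	}
	return false
}
```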
@@ -2658,6 +2658,10 @@ const (
// disruption (such as preemption, eviction API or garbage-collection).
// The constant is to be renamed once the name is accepted within the KEP-3329.
AlphaNoCompatGuaranteeDisruptionTarget PodConditionType = "DisruptionTarget"
// ResourceExhausted indicates the pod is about to be deleted due to either
// exceeding its ephemeral storage limits or running an OOM killed container.
Is this always added when there is a failure or only if .RestartPolicy=Never?
I add it whenever there is a pod failure (and we detect that disk limits were exceeded or an OOM kill occurred). This still means I don't add the condition when a container is OOM killed but the pod continues to run because the container is restarted.
Please add these semantics to the comment.
Updated the comment. Two remarks though:
1. The pod may not necessarily be in the Failed phase yet, as kubelet makes the transition only once all the containers are actually stopped, so typically the condition is added before the Failed phase is sent to the api-server.
2. After another thought, I now add the ResourceExhausted condition only when spec.restartPolicy=Never. I think it simplifies the code, as it makes the disruption controller change unnecessary. If we know a container exceeded its limits and it is not restarted, then the pod exceeded its limits, and this will not change - so the condition will never be stale. Note that the Job pod failure policy can currently only be used with restartPolicy=Never anyway.
If you think 2. makes sense, then I actually don't need this new PodCondition type in the API, as it will only be used by kubelet. Should we drop this PR then, or still declare the types for better visibility? What do you think @liggitt and @alculquicondor?
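A minimal sketch of the decision described in point 2 above, not the actual kubelet code; the boolean inputs stand in for the kubelet's own detection logic and are hypothetical:

```go
package kubeletsketch

import (
	v1 "k8s.io/api/core/v1"
)

// maybeAddResourceExhausted only adds the condition when the pod cannot
// recover by restarting the container, i.e. spec.restartPolicy=Never,
// so the condition can never become stale.
func maybeAddResourceExhausted(pod *v1.Pod, oomKilled, ephemeralStorageExceeded bool) {
	if !oomKilled && !ephemeralStorageExceeded {
		return
	}
	if pod.Spec.RestartPolicy != v1.RestartPolicyNever {
		// A restarted container keeps the pod running, so the condition
		// would be misleading here.
		return
	}
	pod.Status.Conditions = append(pod.Status.Conditions, v1.PodCondition{
		// String literal used because the constant's final name is still
		// under discussion in this PR.
		Type:    v1.PodConditionType("ResourceExhausted"),
		Status:  v1.ConditionTrue,
		Reason:  "ResourceLimitsExceeded", // illustrative reason, not from the PR
		Message: "the pod exceeded its resource limits",
	})
}
```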
pkg/apis/core/types.go
Outdated
@@ -2434,6 +2434,10 @@ const (
// disruption (such as preemption, eviction API or garbage-collection).
// The constant is to be renamed once the name is accepted within the KEP-3329.
AlphaNoCompatGuaranteeDisruptionTarget PodConditionType = "DisruptionTarget"
// ResourceExhausted indicates the pod is about to be deleted due to either
// exceeding its ephemeral storage limits or running an OOM killed container.
are the examples given here (ephemeral storage and OOM) just examples or the final list of all resources this will represent?
is this condition expected to remain once set or flutter?
This is the complete list of scenarios, as documented in the KEP and covered by the initial implementation.
The condition is generally expected to remain once set, following the same logic as the DisruptionTarget condition. In two rare scenarios an already-set condition could change: (1) a race when another controller attempts to add the condition at roughly the same time, or (2) the pod is not deleted because the delete operation fails - then, if the pod is not deleted within 2 minutes, the condition will be removed by the disruption controller.
// ResourceExhausted indicates the pod is about to be deleted due to either
// exceeding its ephemeral storage limits or running an OOM killed container.
// The constant is to be renamed once the name is accepted within the KEP-3329.
AlphaNoCompatGuaranteeResourceExhausted = "ResourceExhausted"
is this expected to be a new condition type or a reason for conditions of the DisruptionTarget type? how is this intended to be used by consumers?
It is a new condition type, but it has a very similar lifecycle and mechanics to DisruptionTarget. While DisruptionTarget is used to indicate that a pod was disrupted, ResourceExhausted is expected to be used as an indicator that a pod was terminated due to a non-retriable application or configuration issue. Example job failure policies using this condition are shown in the KEP:
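The KEP examples are not reproduced in this thread, but a hedged sketch along those lines, built with the k8s.io/api/batch/v1 types and assuming the condition keeps the name ResourceExhausted, could look like this: fail the Job as soon as a pod reports ResourceExhausted, and ignore failures that are only due to disruptions.

```go
package jobpolicysketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// podFailurePolicy fails the Job when a pod carries the ResourceExhausted
// condition, and ignores (does not count) failures caused only by disruptions.
// The condition type names are string literals because the constants discussed
// in this PR are still subject to renaming.
var podFailurePolicy = &batchv1.PodFailurePolicy{
	Rules: []batchv1.PodFailurePolicyRule{
		{
			Action: batchv1.PodFailurePolicyActionFailJob,
			OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{
				{Type: corev1.PodConditionType("ResourceExhausted"), Status: corev1.ConditionTrue},
			},
		},
		{
			Action: batchv1.PodFailurePolicyActionIgnore,
			OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{
				{Type: corev1.PodConditionType("DisruptionTarget"), Status: corev1.ConditionTrue},
			},
		},
	},
}
```

Because podFailurePolicy rules are evaluated in order, listing the FailJob rule first gives it priority if both conditions happen to be set on the same pod.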
OOM or ephemeral storage limits being reached doesn't necessarily mean the error is not retriable, right?
is this condition expected to be mutually exclusive with DisruptionTarget?
OOM or ephemeral storage limits being reached doesn't necessarily mean the error is not retriable, right?
Correct, we leave the final decision to the user via the podFailurePolicy configuration.
is this condition expected to be mutually exclusive with DisruptionTarget?
Not necessarily; in some race situations both conditions may be added (there is no mechanism to prevent this). The user can control which condition type takes priority through the order of the rules in the Job's podFailurePolicy.
I think there is one scenario in which they are non-orthogonal - when the OS is low on memory. This situation can potentially result in both node memory pressure (resulting in DisruptionTarget) and the OOM killer killing one of the pod's containers (resulting in ResourceExhausted), even if the pod's container limits are not exceeded.
That's a great insight, and it makes me wonder how a user can reliably write a rule to fail job pods on ResourceExhausted conditions if the condition can be due to things unrelated to the workload.
The main thing I want to ensure is that any API surface is clear about how it should be used, and reliably works when used that way. Understanding whether this condition can be depended on to indicate there's a problem with the workload itself seems important
Unfortunately, there is no reliable way to detect when a pod was OOM-killed due to memory being exhausted by other pods: #112910
Users can have higher confidence that ResourceExhausted is due to the pod itself if all their pods have memory requests==limits, and this is the use case we are trying to serve.
@mimowo can you include some of these details in the comment?
Yes, the most important use case is serving workloads where all pods declare requests=limits for their containers. Admitting such pods indicates that the node has enough memory to run them, so users could reliably interpret an invocation of the OOM killer as the pod's limits being exceeded. Still, technically some other process on the machine could cause the system to run out of memory and kill the workload container, but that seems like an edge case.
When there are containers running with requests < limits, they could eat up the node's memory after being admitted without exceeding their limits, still resulting in an OOM killer invocation. However, invocation of the OOM killer can generally be prevented by configuring the cluster to leave enough memory for the system, so that memory pressure is raised before the OOM killer is invoked, following the practices here: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#node-pressure-eviction-good-practices. When pods are evicted due to memory pressure, the DisruptionTarget condition is added.
Given the limitation of the OOM killer, I'm not sure what can be done better except for documenting it.
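A minimal sketch of the requests=limits use case described above; the names and quantities are illustrative, not taken from the PR:

```go
package podspecsketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// A container whose memory request equals its memory limit. If this holds
// for all containers in the pod, an OOM kill of one of them is very likely
// caused by the pod exceeding its own declared limit rather than by
// pressure from other workloads on the node.
var workerContainer = corev1.Container{
	Name:  "worker",
	Image: "registry.example.com/worker:latest", // hypothetical image
	Resources: corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("1Gi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("1Gi"),
		},
	},
}
```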
FYI it turns out that on CRI-O with cgroups v2, OOM killed containers don't have the reason field set to OOMKilled, but to the generic Error. It seems like a bug in CRI-O, but the discussion is still open: #112977 (comment).
We are still discussing the importance of the feature for the use cases we have. Thus, we are considering dropping ResourceExhausted from the KEP and only adding the DisruptionTarget condition in case of node-pressure eviction and node graceful shutdown.
The main motivation is to decouple the API changes from the kubelet implementation in the PR: kubernetes#112360. Use the same naming convention as for the previously added pod condition: AlphaNoCompatGuaranteeDisruptionTarget. Use the "Alpha" prefix in case the feature does not get fully promoted to Beta in this release cycle.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: mimowo. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
// ResourceExhausted indicates the pod is in the Failed phase or is about to
// transition into the Failed phase (and is about to be deleted) due to either:
// - exceeding its ephemeral storage limits; or
// - running an OOM killed container when the pod's .spec.restartPolicy=Never.
when it exceeds ephemeral storage limits, is it always applied or only if restartPolicy=Never?
In this case the condition is added regardless of restartPolicy, as the entire pod is evicted by the eviction_manager.
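Illustrative only: roughly what a consumer could observe in pod.status.conditions after such an eviction, assuming the condition keeps the name ResourceExhausted; the Reason and Message strings are assumptions, not the exact kubelet output.

```go
package statussketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// The condition as it might appear on an evicted pod's status. Consumers such
// as the Job controller's podFailurePolicy match on Type and Status only.
var exampleCondition = corev1.PodCondition{
	Type:               corev1.PodConditionType("ResourceExhausted"),
	Status:             corev1.ConditionTrue,
	LastTransitionTime: metav1.Now(),
	Reason:             "EphemeralStorageLimitExceeded", // illustrative
	Message:            "pod exceeded its ephemeral-storage limit and is being evicted",
}
```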
/lgtm
/assign @liggitt
@liggitt let me know if there is something else needing attention here. I would like to use the new types in my downstream PR for the kubelet changes.
Closing for the 1.26 release cycle to avoid confusion. It won't be done for 1.26, as discussed in the KEP (see the update PR: kubernetes/enhancements#3646). We may reconsider this decision in the future.
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
In order to decouple the API changes from the kubelet implementation in the PR (and thus to speed up the acceptance / review process):
#112360
Which issue(s) this PR fixes:
Special notes for your reviewer:
The new Pod condition follows the same naming convention as the previously added pod condition: DisruptionTarget.
For now we use the "Alpha" prefix in case the feature does not get fully promoted to Beta in this release cycle.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: