
Introduce the ResourceExhausted Pod condition into the API types #113248

Conversation

@mimowo (Contributor) commented Oct 21, 2022

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

To decouple the API changes from the kubelet implementation (and thus speed up the acceptance and review process) in:
#112360

Which issue(s) this PR fixes:

Special notes for your reviewer:

The new Pod condition follows the same naming convention as the previously added pod condition, DisruptionTarget.
For now we use the "Alpha" prefix in case the feature does not get fully promoted to Beta in this release cycle.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/feature Categorizes issue or PR as related to a new feature. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 21, 2022
@k8s-ci-robot (Contributor) commented:

@mimowo: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 21, 2022
@mimowo (Contributor Author) commented Oct 21, 2022

/assign @liggitt

@mimowo (Contributor Author) commented Oct 21, 2022

/retest

@mimowo (Contributor Author) commented Oct 21, 2022

/unassign @alculquicondor

@k8s-triage-robot commented:
This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@mimowo (Contributor Author) commented Oct 21, 2022

/retest

// ResourceExhausted indicates the pod is about to be deleted due to either
// exceeding its ephemeral storage limits or running an OOM killed container.
// The constant is to be renamed once the name is accepted within the KEP-3329.
AlphaNoCompatGuaranteeResourceExhausted = "ResourceExhausted"

Member:

Since apiserver doesn't need this constant, it doesn't need to be defined in this file, only in the versioned one.

Contributor Author (@mimowo):

We will eventually want it here, and it might also be needed by the disruption controller to clean up stale conditions. I think it is possible that a pod temporarily exceeds its ephemeral storage limit, adding the condition succeeds, but the delete fails. Then, IIUC, kubelet wouldn't guarantee retrying the delete in the future.

Member:

The disruption controller is not in apiserver, so it should use the versioned packages.

Contributor Author (@mimowo):

True. Actually, I think we may not need the change in the disruption controller either; please take a look at the other comment.
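
A minimal sketch of the versioned-vs-internal distinction discussed above, assuming the alpha name proposed in this PR: consumers outside apiserver, such as kubelet or the disruption controller, would reference the condition type through the versioned package.

package consumer

import (
	// Versioned API package, for components outside apiserver; the internal
	// k8s.io/kubernetes/pkg/apis/core package stays apiserver-only.
	corev1 "k8s.io/api/core/v1"
)

// Spelled as a literal here because the constant proposed in this PR is
// not merged into the versioned types yet.
const resourceExhausted corev1.PodConditionType = "ResourceExhausted"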

@@ -2658,6 +2658,10 @@ const (
// disruption (such as preemption, eviction API or garbage-collection).
// The constant is to be renamed once the name is accepted within the KEP-3329.
AlphaNoCompatGuaranteeDisruptionTarget PodConditionType = "DisruptionTarget"
// ResourceExhausted indicates the pod is about to be deleted due to either
// exceeding its ephemeral storage limits or running an OOM killed container.

Member:

Is this always added when there is a failure or only if .RestartPolicy=Never?

@mimowo (Contributor Author) commented Oct 21, 2022:

I add it whenever there is a pod failure (and we detect that disk limits were exceeded or a container was OOM-killed). This means I don't add the condition when a container is OOM-killed but the pod continues to run because the container is restarted.

Member:

Please add these semantics to the comment.

Contributor Author (@mimowo):

Updated the comment. Two remarks though:

  1. The pod may not necessarily be in the Failed phase yet, as kubelet makes the transition once all the containers are actually stopped, so typically the condition is added before the Failed phase is sent to the API server.
  2. On further thought, I now add the ResourceExhausted condition only when spec.restartPolicy=Never. I think it simplifies the code, as it makes the disruption controller change unnecessary. If we know a container exceeded its limits and is not restarted, then the pod exceeded its limits, and this will not change, so the condition will never be stale. Note that the Job pod failure policy can currently only be used with restartPolicy=Never anyway.

If 2. makes sense, then I don't actually need this new PodCondition type in the API, as it will only be used by kubelet. Should we drop this PR then, or still declare the types for better visibility? What do you think @liggitt and @alculquicondor?

@@ -2434,6 +2434,10 @@ const (
// disruption (such as preemption, eviction API or garbage-collection).
// The constant is to be renamed once the name is accepted within the KEP-3329.
AlphaNoCompatGuaranteeDisruptionTarget PodConditionType = "DisruptionTarget"
// ResourceExhausted indicates the pod is about to be deleted due to either
// exceeding its ephemeral storage limits or running an OOM killed container.

Member:

Are the examples given here (ephemeral storage and OOM) just examples, or the final list of all resources this will represent?

Member:

Is this condition expected to remain once set, or to flutter?

Contributor Author (@mimowo):

This is the complete list of scenarios, as documented in the KEP and covered by the initial implementation.

Contributor Author (@mimowo):

The condition is generally expected to remain once set, with the same logic as for the DisruptionTarget condition. In two rare scenarios a condition, once set, could change: (1) a race in which another controller attempts to add the condition at roughly the same time; (2) the delete operation fails and the pod is not deleted; then, if the pod is still not deleted after 2 minutes, the condition is removed by the disruption controller.
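
Since the condition is meant to be sticky, a consumer can simply scan the pod's status for it. A minimal sketch (the helper name is made up; the condition type assumes the alpha name from this PR):

import corev1 "k8s.io/api/core/v1"

// hasPodCondition reports whether the pod carries the given condition
// type with status True.
func hasPodCondition(pod *corev1.Pod, t corev1.PodConditionType) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == t && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// e.g. hasPodCondition(pod, corev1.PodConditionType("ResourceExhausted"))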

// ResourceExhausted indicates the pod is about to be deleted due to either
// exceeding its ephemeral storage limits or running an OOM killed container.
// The constant is to be renamed once the name is accepted within the KEP-3329.
AlphaNoCompatGuaranteeResourceExhausted = "ResourceExhausted"

Member:

Is this expected to be a new condition type, or a reason on conditions of the DisruptionTarget type? How is this intended to be used by consumers?

@mimowo (Contributor Author) commented Oct 21, 2022:

It is a new condition type, but it has a very similar lifecycle and mechanics to DisruptionTarget. While DisruptionTarget is used to indicate that a pod was disrupted, ResourceExhausted is expected to be used as an indicator that a pod was terminated due to a non-retriable application or configuration issue. Example job failure policies using this condition are shown in the KEP.
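
A minimal sketch of such a policy using the batch/v1 Go types; the condition type is spelled as a literal string since the constant proposed here is not yet in the versioned package, and the matching semantics are those described in the KEP:

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// Fail the Job as soon as a failed pod carries the ResourceExhausted
// condition with status True.
var podFailurePolicy = &batchv1.PodFailurePolicy{
	Rules: []batchv1.PodFailurePolicyRule{
		{
			Action: batchv1.PodFailurePolicyActionFailJob,
			OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{
				{
					Type:   corev1.PodConditionType("ResourceExhausted"),
					Status: corev1.ConditionTrue,
				},
			},
		},
	},
}

The policy would go into the Job's spec.podFailurePolicy field, behind the JobPodFailurePolicy feature gate.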

Member:

OOM or ephemeral storage limits being reached doesn't necessarily mean the error is not retriable, right?

Member:

Is this condition expected to be mutually exclusive with DisruptionTarget?

Contributor Author (@mimowo):

OOM or ephemeral storage limits being reached doesn't necessarily mean the error is not retriable, right?

Correct; we leave the final decision to the user via the podFailurePolicy configuration.

Is this condition expected to be mutually exclusive with DisruptionTarget?

Not necessarily; in some race situations both conditions may be added (there is no mechanism to prevent this). The user controls which condition type takes priority through the order of the Job's podFailurePolicy rules, as in the sketch below.
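
A sketch of that ordering, assuming rules are evaluated in order with the first match winning (the semantics described in the KEP): an Ignore rule for DisruptionTarget listed before a FailJob rule for ResourceExhausted means a pod carrying both conditions is retried rather than failing the Job.

// Reuses the batchv1/corev1 imports from the earlier sketch.
var rules = []batchv1.PodFailurePolicyRule{
	{
		// Listed first: pods terminated by a disruption are ignored
		// (retried), even if they also carry ResourceExhausted.
		Action: batchv1.PodFailurePolicyActionIgnore,
		OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{
			{Type: corev1.PodConditionType("DisruptionTarget"), Status: corev1.ConditionTrue},
		},
	},
	{
		Action: batchv1.PodFailurePolicyActionFailJob,
		OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{
			{Type: corev1.PodConditionType("ResourceExhausted"), Status: corev1.ConditionTrue},
		},
	},
}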

Member:

I think there is one scenario in which they are non-orthogonal: when the OS is low on memory. This situation can potentially result both in node memory pressure (resulting in DisruptionTarget) and in the OOM killer killing one of the pod's containers (resulting in ResourceExhausted), even if the pod's container limits are not exceeded.

That's a great insight, and it makes me wonder how a user can reliably write a rule to fail job pods on ResourceExhausted conditions if it can be due to things unrelated to the workload.

Member:

The main thing I want to ensure is that any API surface is clear about how it should be used, and reliably works when used that way. Understanding whether this condition can be depended on to indicate a problem with the workload itself seems important.

Member:

Unfortunately, there is no reliable way to detect when a pod was OOM-killed due to memory being exhausted by other pods: #112910

Users can have higher confidence that ResourceExhausted is due to the pod itself if all their pods have memory requests == limits. And this is the use case we are trying to serve.

@mimowo can you include some of these details in the comment?

@mimowo (Contributor Author) commented Oct 27, 2022:

Yes, the most important use case is serving workloads where all pods declare requests == limits for their containers. Admitting such a pod indicates that the node has enough memory to run it, so users could then reliably interpret an OOM-killer invocation as the pod's limits being exceeded. Technically, some other process on the machine could still cause the system to run out of memory and kill the workload container, but that seems an edge case.

When there are containers running with requests < limits, they could eat up the node's memory after being admitted without exceeding their limits, still resulting in an OOM-killer invocation. However, the invocation of the OOM killer can generally be prevented by configuring the cluster to leave enough memory to the system, so that memory pressure is raised before the OOM killer is invoked, following the practices here: https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/#node-pressure-eviction-good-practices. When pods are evicted due to memory pressure, the DisruptionTarget condition is added.

Given the limitations of the OOM killer, I am not sure what can be done better except documenting this.
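
To make that concrete, a minimal sketch of a container declaring memory requests equal to limits (the name, image, and quantity are illustrative):

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// With memory requests == limits, the pod is only admitted to a node that
// has the full 512Mi available for it, so an OOM kill of this container
// can, with high confidence, be attributed to exceeding its own limit.
var container = corev1.Container{
	Name:  "main",                   // illustrative name
	Image: "example.com/app:latest", // illustrative image
	Resources: corev1.ResourceRequirements{
		Requests: corev1.ResourceList{corev1.ResourceMemory: resource.MustParse("512Mi")},
		Limits:   corev1.ResourceList{corev1.ResourceMemory: resource.MustParse("512Mi")},
	},
}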

@mimowo (Contributor Author) commented Oct 28, 2022:

FYI, it turns out that on CRI-O with cgroups v2, OOM-killed containers don't have the reason field set to OOMKilled, but to a generic Error. It seems like a bug in CRI-O, but the discussion is open: #112977 (comment).
We are still discussing the importance of the feature for our use cases, and are thus considering dropping ResourceExhausted from the KEP and only adding the DisruptionTarget condition in case of node-pressure eviction and node graceful shutdown.

@k8s-ci-robot k8s-ci-robot removed the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Oct 24, 2022
The main motivation is to decouple the API changes from the
kubelet implementation in the PR:
kubernetes#112360

Use the same naming convention as for the previously added pod
condition: AlphaNoCompatGuaranteeDisruptionTarget. Use the "Alpha" prefix
in case the feature does not get fully promoted to Beta in this release
cycle.
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 24, 2022
@mimowo mimowo force-pushed the handling-pod-failures-beta-resourceexhausted-api branch from eb41376 to 7acef19 Compare October 24, 2022 13:25
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mimowo
Once this PR has been reviewed and has the lgtm label, please ask for approval from liggitt by writing /assign @liggitt in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

// ResourceExhausted indicates the pod is in the Failed phase or is about to
// transition into the Failed phase (and is about to be deleted) due to either:
// - exceeding its ephemeral storage limits; or
// - running an OOM killed container when the pod's .spec.restartPolicy=Never.

Member:

When it exceeds ephemeral storage limits, is the condition always applied, or only if restartPolicy=Never?

Contributor Author (@mimowo):

In this case the condition is added regardless of restartPolicy, as the entire pod is evicted by the eviction manager.

@alculquicondor (Member) commented:

/lgtm

/assign @liggitt

@mimowo (Contributor Author) commented Oct 27, 2022

@liggitt let me know if there is anything else needing attention here. I would like to use the new types in my downstream PR for the kubelet changes.

@mimowo mimowo marked this pull request as draft November 7, 2022 07:43
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 7, 2022
@mimowo (Contributor Author) commented Nov 8, 2022

Closing for the 1.26 release cycle to avoid confusion. It won't be done for 1.26, as discussed in the KEP (see the update PR: kubernetes/enhancements#3646). We may reconsider the decision in the future.

@mimowo mimowo closed this Nov 8, 2022
@mimowo mimowo deleted the handling-pod-failures-beta-resourceexhausted-api branch November 29, 2023 15:02