DRA pod status (KEP 4680) #48547
Doc looks good to me.
Hello @pacoxu & @tengqm 👋! I'm reaching out from the Docs team. Just checking in as we approach Docs Freeze on Tuesday November 26th 18:00 PDT. This documentation appears to still be under review. To meet the Docs Freeze, this PR must have a technical review as well as lgtm and approve labels applied, without any unaddressed comments or concerns from SIG Docs. Thank you!
/approve
/lgtm
Feel free to remove the hold once the other DRA feature owners have reviewed.
will be added to each container status, within the `.status` for each Pod. The `allocatedResourcesStatus`
field reports health information for each device assigned to the container.

For a failed Pod, or or where you suspect a fault, you can use this status to understand whether
or or
There are cases when devices fail or are shut down. The responsibility of the DRA plugin
in this case is to notify the kubelet about the situation using the `WatchResources` API.

Pods that were assigned to the failed devices will continue be assigned to this device.
Pods that were assigned to the failed devices will continue be assigned to this device.

Pods that were assigned to the failed devices will continue to be assigned to these devices.
PR needs rebase.
This is also not blocking feedback, but I think that it will help readability. If you can implement, please do, either in this PR or another.
For docs:
/lgtm
It is typical that code relying on the device will start failing and Pod may get
into Failed phase if `restartPolicy` for the Pod was not `Always` or enter the crash loop
otherwise.
It is typical that code relying on the device will start failing and Pod may get
into Failed phase if `restartPolicy` for the Pod was not `Always` or enter the crash loop
otherwise.

Code that relies on the devices will usually fail. Pods that don't set the `restartPolicy`
field to `Always` might enter the `Failed` phase. Pods that do set the `restartPolicy`
field to `Always` might enter a crash loop.
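For reviewers who want to see the two outcomes in this wording side by side, here is a minimal Python sketch that tells them apart when triaging a Pod. It assumes the standard Pod status fields `status.phase` and `status.containerStatuses[].state.waiting.reason`; the helper name `failure_mode` and both sample Pods are hypothetical and only illustrate the idea.

```python
from typing import Any, Optional


def failure_mode(pod: dict[str, Any]) -> Optional[str]:
    """Classify a Pod as "Failed", "CrashLoop", or None if neither is observed."""
    status = pod.get("status", {})
    # A Pod whose restartPolicy is not Always can end up in the Failed phase.
    if status.get("phase") == "Failed":
        return "Failed"
    # A Pod with restartPolicy Always keeps restarting; the kubelet reports
    # the back-off as a waiting container with reason CrashLoopBackOff.
    for container_status in status.get("containerStatuses", []):
        waiting = container_status.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            return "CrashLoop"
    return None


# Hypothetical Pod objects, trimmed to the fields the helper reads.
failed_pod = {"spec": {"restartPolicy": "Never"}, "status": {"phase": "Failed"}}
looping_pod = {
    "spec": {"restartPolicy": "Always"},
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {"name": "trainer", "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
        ],
    },
}

print(failure_mode(failed_pod))
print(failure_mode(looping_pod))
```

Neither outcome proves a device fault on its own, which is why the health field discussed later in this review is useful alongside it.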
Merge this into the previous paragraph to make that a single "problem statement"
By enabling the feature gate `ResourceHealthStatus`, the field `allocatedResourcesStatus`
will be added to each container status, within the `.status` for each Pod. The `allocatedResourcesStatus`
field reports health information for each device assigned to the container.
By enabling the feature gate `ResourceHealthStatus`, the field `allocatedResourcesStatus`
will be added to each container status, within the `.status` for each Pod. The `allocatedResourcesStatus`
field reports health information for each device assigned to the container.

If you enable the `ResourceHealthStatus` feature gate, the `allocatedResourcesStatus` field
is added to each entry in the `status.containerStatuses` field of the Pod. The
`allocatedResourcesStatus` field reports health information for each device assigned to the
container.
For a failed Pod, or or where you suspect a fault, you can use this status to understand whether
the Pod behavior may be associated with device failure. For example, if an accelerator is reporting
an over-temperature event, the `allocatedResourcesStatus` field may be able to report this.
For a failed Pod, or or where you suspect a fault, you can use this status to understand whether
the Pod behavior may be associated with device failure. For example, if an accelerator is reporting
an over-temperature event, the `allocatedResourcesStatus` field may be able to report this.

For a failed Pod, or where you suspect a fault, you can use the `allocatedResourcesStatus`
field to understand whether the Pod behavior might be associated with device failure. For
example, if an accelerator is reporting an over-temperature event, the
`allocatedResourcesStatus` field might help you to identify the event.
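To make the suggested wording concrete, the following Python sketch scans a Pod's status for devices that are not reported as healthy. It assumes the alpha API shape described in KEP 4680, where each `allocatedResourcesStatus` entry has a `name` and a `resources` list whose items carry `resourceID` and `health`. The helper `unhealthy_devices`, the sample Pod, and the resource names in it are hypothetical; check the field names against the API reference for your cluster version.

```python
from typing import Any


def unhealthy_devices(pod: dict[str, Any]) -> list[tuple[str, str, str, str]]:
    """Return (container, resource name, resource ID, health) for every
    device whose reported health is anything other than "Healthy"."""
    findings = []
    for container_status in pod.get("status", {}).get("containerStatuses", []):
        # The field is only present when the ResourceHealthStatus gate is on.
        for resource in container_status.get("allocatedResourcesStatus", []):
            for device in resource.get("resources", []):
                # Assumption: a missing health value is treated as "Unknown".
                health = device.get("health", "Unknown")
                if health != "Healthy":
                    findings.append((
                        container_status.get("name", ""),
                        resource.get("name", ""),
                        device.get("resourceID", ""),
                        health,
                    ))
    return findings


# Hypothetical Pod, trimmed to the fields the helper reads. In practice the
# dict could come from the JSON output of `kubectl get pod <name> -o json`.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "trainer",
                "allocatedResourcesStatus": [
                    {
                        "name": "example.com/gpu",
                        "resources": [
                            {"resourceID": "gpu-0", "health": "Healthy"},
                            {"resourceID": "gpu-1", "health": "Unhealthy"},
                        ],
                    }
                ],
            }
        ]
    }
}

for finding in unhealthy_devices(sample_pod):
    print(finding)
```

A Pod created with the feature gate disabled has no `allocatedResourcesStatus` entries, so the helper returns an empty list rather than raising.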
(the typos that @pacoxu pointed out might be good to fix before submission)
The typos can be addressed in a follow-up PR
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: reylejano, tengqm
#48861 rebases and addresses comments
/lgtm cancel
Because of https://github.com/kubernetes/website/pull/48861/files#r1859269441
/approve cancel
There are pending changes. @pacoxu, when we hold PRs here, we prefer to state under what conditions someone could remove the hold.
Description
This is a placeholder PR for KEP: kubernetes/enhancements#4680 (alpha2)