DRA pod status (KEP 4680) #48547

SergeyKanzhelev · 2024-10-25T16:59:30Z

Description

This is a ~~placeholder~~ PR for KEP: kubernetes/enhancements#4680 (alpha2)

netlify · 2024-10-25T16:59:49Z

👷 Deploy Preview for kubernetes-io-vnext-staging processing.

Name	Link
🔨 Latest commit	`ffd715d`
🔍 Latest deploy log	https://app.netlify.com/sites/kubernetes-io-vnext-staging/deploys/671bce75186e6100080f31dc

netlify · 2024-10-25T17:08:23Z

✅ Pull request preview available for checking

Built without sensitive environment variables

Name	Link
🔨 Latest commit	`ffd715d`
🔍 Latest deploy log	https://app.netlify.com/sites/kubernetes-io-main-staging/deploys/671bce75b1f4ac0008d67f1a
😎 Deploy Preview	https://deploy-preview-48547--kubernetes-io-main-staging.netlify.app

tengqm · 2024-10-26T06:09:54Z

Doc looks good to me.
Pls provide link(s) to the upstream PR(s) so that people can help track the progress of this feature.

hacktivist123 · 2024-11-25T22:46:44Z

Hello @pacoxu & @tengqm 👋! I'm reaching out from the Docs team. Just checking in as we approach Docs Freeze on Tuesday November 26th 18:00 PDT. This documentation appears to still be under review. To meet the Docs Freeze, this PR must have a technical review as well as lgtm and approve labels applied, without any unaddressed comments or concerns from SIG Docs. Thank you!

tengqm · 2024-11-26T00:59:05Z

/approve

pacoxu · 2024-11-26T06:04:01Z

/lgtm
/hold
/cc @kannon92
to take a look as well if you have time.

Feel free to remove the hold once other DRA feature owners have reviewed.

pacoxu · 2024-11-26T06:08:05Z

content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md

+will be added to each container status, within the `.status` for each Pod. The `allocatedResourcesStatus`
+field reports health information for each device assigned to the container.
+
+For a failed Pod, or or where you suspect a fault, you can use this status to understand whether


pacoxu · 2024-11-26T06:11:55Z

content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md

+There are cases when devices fail or are shut down. The responsibility of the DRA plugin
+in this case is to notify the kubelet about the situation using the `WatchResources` API.
+
+Pods that were assigned to the failed devices will continue be assigned to this device.


Suggested change

-Pods that were assigned to the failed devices will continue be assigned to this device.
+Pods that were assigned to the failed devices will continue be assigned to these devices.

k8s-ci-robot · 2024-11-26T20:29:42Z

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

shannonxtreme

This is also not blocking feedback, but I think that it will help readability. If you can implement, please do, either in this PR or another.

For docs:

/lgtm

shannonxtreme · 2024-11-26T21:12:06Z

content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md

+It is typical that code relying on the device will start failing and Pod may get
+into Failed phase if `restartPolicy` for the Pod was not `Always` or enter the crash loop
+otherwise.


Suggested change

-It is typical that code relying on the device will start failing and Pod may get
-into Failed phase if `restartPolicy` for the Pod was not `Always` or enter the crash loop
-otherwise.
+Code that relies on the devices will usually fail. Pods that don't set the `restartPolicy`
+field to `Always` might enter the `Failed` phase. Pods that do set the `restartPolicy`
+field to `Always` might enter a crash loop.

Merge this into the previous paragraph to make that a single "problem statement"

shannonxtreme · 2024-11-26T21:14:22Z

content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md

+By enabling the feature gate `ResourceHealthStatus`, the field `allocatedResourcesStatus`
+will be added to each container status, within the `.status` for each Pod. The `allocatedResourcesStatus`
+field reports health information for each device assigned to the container.


Suggested change

-By enabling the feature gate `ResourceHealthStatus`, the field `allocatedResourcesStatus`
-will be added to each container status, within the `.status` for each Pod. The `allocatedResourcesStatus`
-field reports health information for each device assigned to the container.
+If you enable the `ResourceHealthStatus` feature gate, the `allocatedResourcesStatus` field
+is added to each entry in the `status.containerStatuses` field of the Pod specification. The `allocatedResourcesStatus` field reports health information for each device assigned to the
+container.

shannonxtreme · 2024-11-26T21:15:43Z

content/en/docs/concepts/scheduling-eviction/dynamic-resource-allocation.md

+For a failed Pod, or or where you suspect a fault, you can use this status to understand whether
+the Pod behavior may be associated with device failure. For example, if an accelerator is reporting
+an over-temperature event, the `allocatedResourcesStatus` field may be able to report this.


Suggested change

-For a failed Pod, or or where you suspect a fault, you can use this status to understand whether
-the Pod behavior may be associated with device failure. For example, if an accelerator is reporting
-an over-temperature event, the `allocatedResourcesStatus` field may be able to report this.
+For a failed Pod, or where you suspect a fault, you can use the `allocatedResourcesStatus`
+field to understand whether the Pod behavior might be associated with device failure. For
+example, if an accelerator is reporting an over-temperature event, the
+`allocatedResourcesStatus` field might help you to identify the event.
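
For readers of the doc, a rough sketch of the check this paragraph describes might help. It walks `status.containerStatuses` of a Pod object and lists every device that is not reported as healthy. The inner field names (`name`, `resources`, `resourceID`, `health`) and the `Healthy` value are assumptions based on the KEP and should be checked against the API reference. The function name `unhealthy_devices` and the sample Pod are hypothetical.

```python
from typing import Any


def unhealthy_devices(pod: dict[str, Any]) -> list[tuple[str, str, str, str]]:
    """Return (container, resource name, resource ID, health) for each
    device whose reported health is anything other than "Healthy".

    Assumes the Pod was parsed from JSON and that the entries inside
    allocatedResourcesStatus use the field names below (an assumption).
    """
    findings = []
    status = pod.get("status", {})
    for container in status.get("containerStatuses", []):
        # The field is absent when the feature gate is off or when the
        # container has no devices assigned, so default to an empty list.
        for resource in container.get("allocatedResourcesStatus", []):
            for device in resource.get("resources", []):
                health = device.get("health", "Unknown")
                if health != "Healthy":
                    findings.append((
                        container.get("name", ""),
                        resource.get("name", ""),
                        device.get("resourceID", ""),
                        health,
                    ))
    return findings


# Hypothetical Pod status, for illustration only.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "trainer",
                "allocatedResourcesStatus": [
                    {
                        "name": "claim:gpu-claim",
                        "resources": [
                            {"resourceID": "gpu-0", "health": "Healthy"},
                            {"resourceID": "gpu-1", "health": "Unhealthy"},
                        ],
                    }
                ],
            },
            {"name": "sidecar"},
        ]
    }
}

for finding in unhealthy_devices(sample_pod):
    print(finding)
```

The input could come from `kubectl get pod <name> -o json` parsed with `json.loads`. Treating a missing `health` value as `Unknown` is a choice made for this sketch, not documented behavior.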

shannonxtreme · 2024-11-26T21:21:14Z

(the typos that @pacoxu pointed out might be good to fix before submission)

reylejano · 2024-11-26T21:22:54Z

The typos can be addressed in a follow-up PR
/approve

k8s-ci-robot · 2024-11-26T21:23:01Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: reylejano, tengqm

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~content/en/docs/OWNERS~~ [reylejano,tengqm]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

haircommander · 2024-11-26T21:42:37Z

#48861 rebases and addresses comments

pohly · 2024-11-26T21:49:47Z

/lgtm cancel

Because of https://github.com/kubernetes/website/pull/48861/files#r1859269441

sftim · 2024-11-27T14:06:25Z

/approve cancel

There are pending changes

@pacoxu when we hold PRs here, we prefer to state under what conditions someone could remove the hold
Could you add that detail in a comment?

DRA devices health status

ffd715d

k8s-ci-robot added this to the 1.32 milestone Oct 25, 2024

k8s-ci-robot added the language/en label (Issues or PRs related to English language) Oct 25, 2024

k8s-ci-robot requested a review from mickeyboxell October 25, 2024 16:59

k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) Oct 25, 2024

k8s-ci-robot requested a review from pohly October 25, 2024 16:59

k8s-ci-robot added the size/S label (Denotes a PR that changes 10-29 lines, ignoring generated files.) Oct 25, 2024

pacoxu mentioned this pull request Nov 8, 2024

Add Resource Health Status to the Pod Status for Device Plugin and DRA kubernetes/enhancements#4680

Open

10 tasks

k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Nov 26, 2024

k8s-ci-robot requested a review from kannon92 November 26, 2024 06:04

k8s-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) Nov 26, 2024

k8s-ci-robot assigned pacoxu Nov 26, 2024

k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) Nov 26, 2024

pacoxu reviewed Nov 26, 2024

k8s-ci-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) Nov 26, 2024

shannonxtreme reviewed Nov 26, 2024

k8s-ci-robot assigned shannonxtreme Nov 26, 2024

haircommander mentioned this pull request Nov 26, 2024

DRA pod status (KEP 4680) #48861

Closed

k8s-ci-robot assigned pohly Nov 26, 2024

k8s-ci-robot removed the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) Nov 26, 2024

k8s-ci-robot requested review from pacoxu and shannonxtreme November 26, 2024 21:49

SergeyKanzhelev closed this by deleting the head repository Dec 20, 2024
