Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRA pod status (KEP 4680) #48547

Closed

Conversation

SergeyKanzhelev
Copy link
Member

@SergeyKanzhelev SergeyKanzhelev commented Oct 25, 2024

Description

This is a placeholder PR for KEP: kubernetes/enhancements#4680 (alpha2)

@k8s-ci-robot k8s-ci-robot added this to the 1.32 milestone Oct 25, 2024
@k8s-ci-robot k8s-ci-robot added the language/en Issues or PRs related to English language label Oct 25, 2024
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Oct 25, 2024
@k8s-ci-robot k8s-ci-robot requested a review from pohly October 25, 2024 16:59
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 25, 2024
Copy link

netlify bot commented Oct 25, 2024

👷 Deploy Preview for kubernetes-io-vnext-staging processing.

Name Link
🔨 Latest commit ffd715d
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-io-vnext-staging/deploys/671bce75186e6100080f31dc

Copy link

netlify bot commented Oct 25, 2024

Pull request preview available for checking

Built without sensitive environment variables

Name Link
🔨 Latest commit ffd715d
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-io-main-staging/deploys/671bce75b1f4ac0008d67f1a
😎 Deploy Preview https://deploy-preview-48547--kubernetes-io-main-staging.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@tengqm
Copy link
Contributor

tengqm commented Oct 26, 2024

Doc looks good to me.
Pls provide link(s) to the upstream PR(s) so that people can help track the progress of this feature.

@hacktivist123
Copy link
Contributor

Hello @pacoxu & @tengqm 👋! I'm reaching out from the Docs team. Just checking in as we approach Docs Freeze on Tuesday November 26th 18:00 PDT. This documentation appears to still be under review. To meet the Docs Freeze, this PR must have a technical review as well as lgtm and approve labels applied, without any unaddressed comments or concerns from SIG Docs. Thank you!

@tengqm
Copy link
Contributor

tengqm commented Nov 26, 2024

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 26, 2024
@pacoxu
Copy link
Member

pacoxu commented Nov 26, 2024

/lgtm
/hold
/cc @kannon92
to take a look as well if you have time.

Feel free to remove the hold once other dra feature owners reviewed.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 26, 2024
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 26, 2024
will be added to each container status, within the `.status` for each Pod. The `allocatedResourcesStatus`
field reports health information for each device assigned to the container.

For a failed Pod, or or where you suspect a fault, you can use this status to understand whether
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or or

There are cases when devices fail or are shut down. The responsibility of the DRA plugin
in this case is to notify the kubelet about the situation using the `WatchResources` API.

Pods that were assigned to the failed devices will continue be assigned to this device.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Pods that were assigned to the failed devices will continue be assigned to this device.
Pods that were assigned to the failed devices will continue be assigned to these devices.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 26, 2024
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link
Contributor

@shannonxtreme shannonxtreme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also not blocking feedback, but I think that it will help readability. If you can implement, please do, either in this PR or another.

For docs:

/lgtm

Comment on lines +321 to +323
It is typical that code relying on the device will start failing and Pod may get
into Failed phase if `restartPolicy` for the Pod was not `Always` or enter the crash loop
otherwise.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
It is typical that code relying on the device will start failing and Pod may get
into Failed phase if `restartPolicy` for the Pod was not `Always` or enter the crash loop
otherwise.
Code that relies on the devices will usually fail. Pods that don't set the `restartPolicy`
field to `Always` might enter the `Failed` phase. Pods that do set the `restartPolicy`
field to `Always` might enter a crash loop.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge this into the previous paragraph to make that a single "problem statement"

Comment on lines +325 to +327
By enabling the feature gate `ResourceHealthStatus`, the field `allocatedResourcesStatus`
will be added to each container status, within the `.status` for each Pod. The `allocatedResourcesStatus`
field reports health information for each device assigned to the container.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
By enabling the feature gate `ResourceHealthStatus`, the field `allocatedResourcesStatus`
will be added to each container status, within the `.status` for each Pod. The `allocatedResourcesStatus`
field reports health information for each device assigned to the container.
If you enable the `ResourceHealthStatus` feature gate, the `allocatedResourcesStatus` field
is added to each entry in the `status.containerStatuses` field of the Pod specification. The `allocatedResourcesStatus` field reports health information for each device assigned to the
container.

Comment on lines +329 to +331
For a failed Pod, or or where you suspect a fault, you can use this status to understand whether
the Pod behavior may be associated with device failure. For example, if an accelerator is reporting
an over-temperature event, the `allocatedResourcesStatus` field may be able to report this.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For a failed Pod, or or where you suspect a fault, you can use this status to understand whether
the Pod behavior may be associated with device failure. For example, if an accelerator is reporting
an over-temperature event, the `allocatedResourcesStatus` field may be able to report this.
For a failed Pod, or where you suspect a fault, you can use the `allocatedResourcesStatus`
field to understand whether the Pod behavior might be associated with device failure. For
example, if an accelerator is reporting an over-temperature event, the
`allocatedResourcesStatus` field might help you to identify the event.

@shannonxtreme
Copy link
Contributor

(the typos that @pacoxu pointed out might be good to fix before submission)

@reylejano
Copy link
Member

The typos can be addressed in a follow-up PR
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: reylejano, tengqm

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@haircommander
Copy link
Contributor

#48861 rebases and address comments

@pohly
Copy link
Contributor

pohly commented Nov 26, 2024

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 26, 2024
@sftim
Copy link
Contributor

sftim commented Nov 27, 2024

/approve cancel

There are pending changes

@pacoxu when we hold PRs here, we prefer to state under what conditions someone could remove the hold
Could you add that detail in a comment?

@SergeyKanzhelev SergeyKanzhelev closed this by deleting the head repository Dec 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. language/en Issues or PRs related to English language needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants