Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Device plugin failures: KEP docs #47029

Merged

Conversation

SergeyKanzhelev
Copy link
Member

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. sig/node Categorizes an issue or PR as relevant to SIG Node. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. language/en Issues or PRs related to English language labels Jun 29, 2024
@k8s-ci-robot k8s-ci-robot added the sig/docs Categorizes an issue or PR as relevant to SIG Docs. label Jun 29, 2024
@SergeyKanzhelev SergeyKanzhelev changed the base branch from main to dev-1.31 June 29, 2024 06:52
@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 29, 2024
Copy link

netlify bot commented Jun 29, 2024

Pull request preview available for checking

Built without sensitive environment variables

Name Link
🔨 Latest commit 780f166
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-io-main-staging/deploys/667faeee41ce650007f6b4f1
😎 Deploy Preview https://deploy-preview-47029--kubernetes-io-main-staging.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Copy link

netlify bot commented Jun 29, 2024

Pull request preview available for checking

Built without sensitive environment variables

Name Link
🔨 Latest commit bb5d11b
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-io-main-staging/deploys/66a921eae8ded0000801d0a3
😎 Deploy Preview https://deploy-preview-47029--kubernetes-io-main-staging.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@Princesso
Copy link
Contributor

Hello @SergeyKanzhelev 👋 please take a look at Documenting for a release - PR Ready for Review to get your PR ready for review before Tuesday July 16th 2024 18:00 PST. Thank you!

@Princesso
Copy link
Contributor

Hi @SergeyKanzhelev, a gentle reminder that tomorrow is the deadline for having your Docs PR ready for review. Please take a look at Documenting for a release - PR Ready for Review to get your PR ready for review before tomorrow, Tuesday, July 16th, 2024 18:00 PST.

@hacktivist123
Copy link
Contributor

hacktivist123 commented Jul 23, 2024

Hello @SergeyKanzhelev 👋! I'm reaching out from the Docs team. Just checking in as we approach Docs Freeze on Tuesday, July 30th 18:00 PDT. This documentation appears to still be under review. To meet the Docs Freeze, this PR must have a technical review as well as lgtm and approve labels applied, without any unaddressed comments or concerns from SIG Docs. The status of this enhancement is marked as at risk for docs freeze.

Thank you!

@hacktivist123
Copy link
Contributor

/sig node

@katcosgrove
Copy link
Contributor

Hi, @SergeyKanzhelev. This PR is very, very far beyond its deadlines. Release Docs has tried to reach you repeatedly. The Docs Ready for Review deadline was last week, and the Docs Freeze is in 5 days. If you do not have documentation in this PR today and begin reviews, we will be forced to consider pulling this enhancement from the release. Please acknowledge.

@kubernetes/sig-node-leads

@SergeyKanzhelev SergeyKanzhelev force-pushed the device-plugin-failures branch from 6e0ba11 to 6b542f0 Compare July 25, 2024 19:30
@k8s-ci-robot k8s-ci-robot removed the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jul 25, 2024
Copy link

netlify bot commented Jul 25, 2024

👷 Deploy Preview for kubernetes-io-vnext-staging processing.

Name Link
🔨 Latest commit bb5d11b
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-io-vnext-staging/deploys/66a921eae854fb0008f53981

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 25, 2024
@SergeyKanzhelev SergeyKanzhelev changed the title [WIP] Device plugin failures: KEP docs Device plugin failures: KEP docs Jul 25, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 25, 2024
@kannon92
Copy link
Contributor

/cc @ffromani

TBH I don't have as much understanding about how scheduling works with device plugins. Maybe @ffromani can also review?

Otherwise, it looks good to me.

@k8s-ci-robot k8s-ci-robot requested a review from ffromani July 25, 2024 19:35
Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

from sig-node. Content looks correct and reflects KEP and implementation, minor nonblocking comment inside

@Princesso
Copy link
Contributor

Hello @SergeyKanzhelev,

v1.31 Doc Lead here. This PR has been marked as at risk for Doc Freeze.

The Doc Reviewers have given feedback on it, however, an update is yet to be made on it, based on feedback.

Please note that the Doc freeze deadline is tomorrow, and if the PR is not merged by tomorrow, you will need to file an exception or this enhancement will be removed from this release. Thank you!

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 29, 2024
@k8s-ci-robot k8s-ci-robot requested a review from ffromani July 29, 2024 16:38
Copy link
Contributor

@mrunalp mrunalp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 29, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 18451a0b4ae73242e5b211e8f16bc700cac8bc84

@ffromani
Copy link
Contributor

LGTM content-wise from sig-node perspective (I see @mrunalp already added the tag)

Copy link
Contributor

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

Docs feedback; this is close to being OK for alpha, but I do recommend some edits.

Comment on lines 196 to 197
This status can be used to understand whether pod failure or misbehavior was associated
with the device failure.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing here:

Suggested change
This status can be used to understand whether pod failure or misbehavior was associated
with the device failure.
For a failed Pod, or or where you suspect a fault, you can use this status to understand whether
the Pod behavior may be associated with device failure. For example, if an accelerator is reporting
an over-temperature event, the `allocatedResourcesStatus` field may be able to report this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that example look correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(would like confirmation I guessed right)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The device plugin API doesn't have provision to report unhealthy reason. The device will thus report to the kubelet just healthy/unhealthy. The current user-facing API (in the pod object) also report just healthy/unhealty. So the example is realistic, but in the pod status the user won't actually find "unhealthy because overheating" in 1.31, but just "unhealthy". @SergeyKanzhelev to keep me honest here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK cool. Polishing this up post-merge is fine.

How do we feel about a separate controller annotating the Pod with more detail vs. reporting an Event vs. neither of those?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sftim IMO this is great input for the next stages of this work!

@Princesso
Copy link
Contributor

Princesso commented Jul 30, 2024

Hello @SergeyKanzhelev

v1.31 Doc Lead here. This PR is still at risk for Doc Freeze because it is yet to be merged.
Please implement the changes required by the reviewers to have another round of reviews.

Kindly note that if the PR is not merged by 18:00 PDT today, you will need to file an exception for the associated enhancement to be a part of this milestone. Thank you!

Copy link
Contributor

@katcosgrove katcosgrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick pass for grammar

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 30, 2024
@k8s-ci-robot k8s-ci-robot requested a review from mrunalp July 30, 2024 16:36
Copy link
Contributor

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

contentwise from sig-node perspective

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 30, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 84c0677cdd1dc57d57c92fe8714a9f85824d779c

@sftim
Copy link
Contributor

sftim commented Jul 30, 2024

/label tide/merge-method-squash

If you squash this to 1 commit keeping the same merge base and final tree, Prow keeps the LGTM too. And it's tidier in the commit history.

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Jul 30, 2024
Copy link
Contributor

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold
/approve

I recommend a local squash, then forced push (git push --force-with-lease), then unhold to merge the now squashed commits.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 30, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrunalp, sftim

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 30, 2024
@sftim
Copy link
Contributor

sftim commented Jul 30, 2024

OK to unhold if this has been squashed, or if we decide to skip that because of deadlines.

@SergeyKanzhelev SergeyKanzhelev force-pushed the device-plugin-failures branch from d8d2d87 to bb5d11b Compare July 30, 2024 17:24
@SergeyKanzhelev
Copy link
Member Author

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 30, 2024
@k8s-ci-robot k8s-ci-robot merged commit cdd0f61 into kubernetes:dev-1.31 Jul 30, 2024
6 checks passed
@k8s-ci-robot k8s-ci-robot added this to the 1.31 milestone Jul 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. language/en Issues or PRs related to English language lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/docs Categorizes an issue or PR as relevant to SIG Docs. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
Development

Successfully merging this pull request may close these issues.

10 participants