Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update CDI for device plugins KEP for GA graduation #4446

Merged
merged 1 commit into from
Feb 7, 2024

Conversation

bart0sh
Copy link
Contributor

@bart0sh bart0sh commented Jan 26, 2024

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels Jan 26, 2024
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 26, 2024
@bart0sh
Copy link
Contributor Author

bart0sh commented Jan 26, 2024

/cc @klueska

@k8s-ci-robot k8s-ci-robot requested a review from klueska January 26, 2024 23:50
Copy link
Contributor

@kannon92 kannon92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bart0sh bart0sh force-pushed the PR014-CDI-device-plugin-GA branch from 9bbd2ef to 6697eb3 Compare January 29, 2024 19:49
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 29, 2024
@bart0sh
Copy link
Contributor Author

bart0sh commented Jan 29, 2024

Could you fill in this section as part of this update?

https://github.com/kubernetes/enhancements/blob/47ea8f211b3720741d6218d9affe884e24555348/keps/sig-node/4009-add-cdi-devices-to-device-plugin-api/README.md#e2e-tests

Thanks for pointing this out. Added the links, but DevicePlugin tests are failing. I'm considering running CDI-related tests separately. Will update the links when it's done.

Copy link
Contributor

@kannon92 kannon92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should have coverage for both containerd/crio? I've seen some container runtime related features have this as a requirement for GA. I think CDI is supported in both crio and containerd but not sure about test coverage.

@bart0sh
Copy link
Contributor Author

bart0sh commented Jan 30, 2024

@kannon92 Yes, I think adding CRI-O job would make sense. It would be even easier than Containerd job as CDI is enabled in CRI-O out of the box.

@kannon92
Copy link
Contributor

kannon92 commented Feb 1, 2024

/assign @johnbelamaric

or a shadow!

@bart0sh
Copy link
Contributor Author

bart0sh commented Feb 1, 2024

@klueska @SergeyKanzhelev @mrunalp Can you lgtm and approve, please?

@mrunalp
Copy link
Contributor

mrunalp commented Feb 3, 2024

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 3, 2024
Links to tests:
- TBD: Will fill-in by code freeze
Links to test grid:
- https://testgrid.k8s.io/sig-node-containerd#e2e-cos-device-plugin-gpu
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to fix them and add CRI-O tests.

Copy link
Contributor Author

@bart0sh bart0sh Feb 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aojea All DevicePlugin test cases except one (unrrelated to CDI) pass locally:

$ make test-e2e-node FOCUS='Device Plugin' SKIP='\[Flaky\]' PARALLELISM=1
...
Summarizing 1 Failure:
  [FAIL] [sig-node] Device Plugin [Feature:DevicePluginProbe] [NodeFeature:DevicePluginProbe] [Serial] DevicePlugin [Serial] [Disruptive] [BeforeEach] Keeps device plugin assignments across node reboots (no pod restart, no device plugin re-registration) [sig-node, Feature:DevicePluginProbe, NodeFeature:DevicePluginProbe, Serial, Disruptive]
  k8s.io/kubernetes/test/e2e/framework/pod/pod_client.go:106

Ran 10 of 559 Specs in 859.812 seconds
FAIL! -- 9 Passed | 1 Failed | 0 Pending | 549 Skipped
--- FAIL: TestE2eNode (859.85s)
FAIL

Ginkgo ran 1 suite in 14m20.00390501s

Test infra seems to be broken as https://testgrid.k8s.io/sig-node-containerd#e2e-cos-device-plugin-gpu fails to start device plugins tests with this error:

Last output from querying API server follows:
-----------------------------------------------------
*   Trying 34.168.10.172:443...
* connect to 34.168.10.172 port 443 failed: Connection refused
* Failed to connect to 34.168.10.172 port 443 after 39 ms: Couldn't connect to server
* Closing connection 0
curl: (7) Failed to connect to 34.168.10.172 port 443 after 39 ms: Couldn't connect to server

From the first look it looks like it uses external IP instead of internal one.

Anyway, my point is that this should be fixed before this feature goes GA, but it shouldn't prevent to merge this PR.

Considering that Device Plugins are already GA and most test cases are not related to this KEP, would it make sense to create new jobs to test only this feature?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is your call really, I just pointed out that the links to demonstrate the test health were red ... tindependently of the KEP I think you should have coverage on device plugins too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they definitely need to be fixed. I've created an issue for that: kubernetes/test-infra#31849

And I'm going to separate CDI-related tests at least until this feature goes GA. Will update both links when it's done.


Links to k8s-triage for tests:
- TBD: Will fill-in by code freeze
- https://storage.googleapis.com/k8s-triage/index.html?test=DevicePlugin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bart0sh
Copy link
Contributor Author

bart0sh commented Feb 4, 2024

Could you fill in this section as part of this update?
https://github.com/kubernetes/enhancements/blob/47ea8f211b3720741d6218d9affe884e24555348/keps/sig-node/4009-add-cdi-devices-to-device-plugin-api/README.md#e2e-tests

Thanks for pointing this out. Added the links, but DevicePlugin tests are failing. I'm considering running CDI-related tests separately. Will update the links when it's done.

@aojea ^^^^

@johnbelamaric
Copy link
Member

/approve

for PRR. Obviously the tests Antonio pointed out need to be working before release.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bart0sh, johnbelamaric, mrunalp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 7, 2024
@k8s-ci-robot k8s-ci-robot merged commit 5f3b394 into kubernetes:master Feb 7, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.30 milestone Feb 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
Development

Successfully merging this pull request may close these issues.

6 participants