diff --git a/keps/prod-readiness/sig-node/4639.yaml b/keps/prod-readiness/sig-node/4639.yaml
index d057927a9ec4..de3487931e8d 100644
--- a/keps/prod-readiness/sig-node/4639.yaml
+++ b/keps/prod-readiness/sig-node/4639.yaml
@@ -1,3 +1,5 @@
 kep-number: 4639
 alpha:
   approver: "@deads2k"
+beta:
+  approver: "@deads2k"
diff --git a/keps/sig-node/4639-oci-volume-source/README.md b/keps/sig-node/4639-oci-volume-source/README.md
index 44bc7b071480..e19d48d88ed7 100644
--- a/keps/sig-node/4639-oci-volume-source/README.md
+++ b/keps/sig-node/4639-oci-volume-source/README.md
@@ -107,6 +107,9 @@ tags, and then generate with `hack/update-toc.sh`.
       - [Integration tests](#integration-tests)
       - [e2e tests](#e2e-tests)
   - [Graduation Criteria](#graduation-criteria)
+    - [Alpha](#alpha)
+    - [Beta](#beta)
+    - [GA](#ga)
   - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
   - [Version Skew Strategy](#version-skew-strategy)
 - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -144,20 +147,20 @@ checklist items _must_ be updated for the enhancement to be released.
 Items marked with (R) are required *prior to targeting to a milestone / release*.
-- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
-- [ ] (R) KEP approvers have approved the KEP status as `implementable`
-- [ ] (R) Design details are appropriately documented
-- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [x] (R) KEP approvers have approved the KEP status as `implementable`
+- [x] (R) Design details are appropriately documented
+- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
   - [ ] e2e Tests for all Beta API Operations (endpoints)
   - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
   - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
-- [ ] (R) Graduation criteria is in place
-  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Production readiness review completed
-- [ ] (R) Production readiness review approved
-- [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [x] (R) Graduation criteria is in place
+  - [x] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [x] (R) Production readiness review completed
+- [x] (R) Production readiness review approved
+- [x] "Implementation History" section is up-to-date for milestone
+- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
-- ``: `` - ``
+- `pkg/kubelet/images`: `2-10-2024` - `83.8`
+- `pkg/kubelet/kuberuntime`: `2-10-2024` - `66.6`
 ##### Integration tests
@@ -779,6 +783,7 @@ https://storage.googleapis.com/k8s-triage/index.html
 -->
 - :
+SIG node does not typically write e2e tests
 ##### e2e tests
@@ -794,6 +799,11 @@ We expect no non-infra related flakes in the last month as a GA graduation crite
 - :
+No tests exist yet, but a combination of e2e and e2e_node tests will be added for beta:
+- Test that the feature works
+- Test that GC applies to image volumes correctly
+- Test that volumes can be shared among different containers
+
 ### Graduation Criteria
+#### Alpha
+
+- Initial implementation added
+- CRI implementations add support
+
+#### Beta
+
+- Add support for subpath volumes
+- Expand unit and e2e tests
+
+#### GA
+
+- Multiple examples of real world uses
+- Allowing time for feedback
+
+
 ### Upgrade / Downgrade Strategy
+The kube-apiserver, kubelet, and CRI implementation all have to support the feature for it to work end to end.
+If any of these components is downgraded or has the feature turned off, the volume will not be mounted.
+
 ### Version Skew Strategy
+Same as above: all the components must have support for it to work.
+Specifically, every component will ignore or filter the field if it doesn't recognize/support it.
+
 ## Production Readiness Review Questionnaire
 Yes, unit tests for the alpha release for each component. End-to-end (serial
@@ -1008,12 +1040,15 @@ rollout.
 Similarly, consider large clusters and how enablement/disablement will rollout across nodes.
 -->
+The kube-apiserver must support it, and every node with a scheduled pod that attempts to mount image volumes must have a kubelet and CRI that support it.
+If any of them don't, the volume won't be enabled.
+
+Since this field only applies on a per-node basis, once the control plane agrees on the feature gates, any kubelet with the feature on will have access to it, without needing
+to wait for other workers.
+
 ###### What specific metrics should inform a rollback?
-
+A sharp increase in the high-latency buckets of `kubelet_image_pull_duration_seconds_bucket`, potentially indicating registry unavailability or delays.
 ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
@@ -1023,12 +1058,16 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
 are missing a bunch of machinery and tooling and can't do that now.
 -->
+Not yet, but they will be.
+
 ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+N/A
+
 ### Monitoring Requirements
+No metric was added for alpha.
+For beta, a metric can be added to report that an image volume was successfully mounted by the kubelet.
+
 ###### How can someone using this feature know that it is working for their instance?
-- [ ] Metrics
-  - Metric name:
-  - [Optional] Aggregation method:
-  - Components exposing the metric:
+- [x] Metrics
+  - Metric name: `kubelet_image_pull_duration_seconds_bucket`
+  - Components exposing the metric: kubelet
+- [x] Metrics
+  - Metric name: `pod_start_sli_duration_seconds`
+  - Components exposing the metric: kubelet
 - [ ] Other (treat as last resort)
   - Details:
+Note: there is currently no well-defined SLI[1] for stateful pods, and it's likely this feature will drastically affect that if a big
+image needs to be pulled from a registry.
+
+1: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md#footnote3
+
 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+A metric showing the volume type, so admins can find when their image volumes were requested, and
+potentially a success/error metric showing whether volumes were mounted.
+
+
 ### Dependencies
+A CRI implementation with support, and an available registry that has the requested image.
+
 ### Scalability
+No new API calls, but there will be new image pulls from an OCI registry.
+
 ###### Will enabling / using this feature result in introducing new API types?
+A new type of volume: image.
+
 ###### Will enabling / using this feature result in any new calls to the cloud provider?
+Potentially for credential plugins to give node image pull credentials, but it doesn't need to.
+
 ###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+A new type of volume in the pods, which will have the same size as other volume types.
+
 ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+Not technically, because stateful pods don't have a defined SLI.
+
 ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+CPU/Memory will be used when pulling an image volume, proportional to the size of the image.
+
 ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+Yes, in the same way a large image used to run a container could use these resources (so no additional risk).
+
 ### Troubleshooting
+Registry unavailability
+- Failures can be found in the pod's events and look like image pull failures
+- No tests at the time of writing
+
 ###### What steps should be taken if SLOs are not being met to determine the problem?
+At the time of writing, check the pod events. If metrics are added for image pull issues, then checking those would help as well.
+
 ## Implementation History
+- 16-05-2024 Issue opened
+- 21-06-2024 KEP merged, targeted at Alpha
+- 2-10-2024 KEP updated to beta
+
+
 ## Drawbacks