4639: bump to beta
Signed-off-by: Peter Hunt <[email protected]>
haircommander committed Oct 2, 2024
1 parent ed5d0b1 commit 4e7aec4
Showing 3 changed files with 109 additions and 22 deletions.
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-node/4639.yaml
@@ -1,3 +1,5 @@
kep-number: 4639
alpha:
approver: "@deads2k"
beta:
approver: "@deads2k"
127 changes: 106 additions & 21 deletions keps/sig-node/4639-oci-volume-source/README.md
@@ -107,6 +107,9 @@ tags, and then generate with `hack/update-toc.sh`.
- [Integration tests](#integration-tests)
- [e2e tests](#e2e-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
- [GA](#ga)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
@@ -144,20 +147,20 @@ checklist items _must_ be updated for the enhancement to be released.

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [x] (R) KEP approvers have approved the KEP status as `implementable`
- [x] (R) Design details are appropriately documented
- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- [ ] e2e Tests for all Beta API Operations (endpoints)
- [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
- [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
- [x] (R) Graduation criteria is in place
- [x] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [x] (R) Production readiness review completed
- [x] (R) Production readiness review approved
- [x] "Implementation History" section is up-to-date for milestone
- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

<!--
**Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
@@ -759,7 +762,8 @@ This can inform certain test coverage improvements that we want to do before
extending the production code to implement this enhancement.
-->

- `<package>`: `<date>` - `<test coverage>`
- `pkg/kubelet/images`: `2-10-2024` - `83.8`
- `pkg/kubelet/kuberuntime`: `2-10-2024` - `66.6`

##### Integration tests

@@ -779,6 +783,7 @@ https://storage.googleapis.com/k8s-triage/index.html
-->

- <test>: <link to test coverage>
SIG Node does not typically write e2e tests.

##### e2e tests

@@ -794,6 +799,11 @@ We expect no non-infra related flakes in the last month as a GA graduation crite

- <test>: <link to test coverage>

No tests exist yet, but a combination of e2e and e2e_node tests will be added for beta:
- Test that the feature works
- Test that GC applies to image volumes correctly
- Test that volumes can be shared among different containers
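
The sharing scenario in the last item above could be exercised with a pod along these lines. This is a minimal sketch, not the planned test itself: the pod name, container names, images, mount path, and the registry host `registry.example.com` are placeholders chosen for illustration, while the `image` volume source with its `reference` and `pullPolicy` fields is the API introduced by this KEP at alpha.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: image-volume-shared            # placeholder name
spec:
  containers:
  - name: reader-a                     # placeholder container
    image: registry.example.com/app:latest
    volumeMounts:
    - name: artifact
      mountPath: /artifact
  - name: reader-b                     # second container mounting the same volume
    image: registry.example.com/app:latest
    volumeMounts:
    - name: artifact
      mountPath: /artifact
  volumes:
  - name: artifact
    image:
      reference: registry.example.com/artifact:v1   # placeholder OCI reference
      pullPolicy: IfNotPresent
```

A test built on this would check that both containers see the same contents under `/artifact`.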

### Graduation Criteria

<!--
@@ -858,6 +868,22 @@ in back-to-back releases.
- Deprecate the flag
-->

#### Alpha

- Initial implementation added
- CRI implementations add support

#### Beta

- Add support for subpath volumes
- Expand unit and e2e tests
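
As a rough illustration of the subpath criterion, the sketch below assumes that subpath support would be exposed through the existing `volumeMounts[].subPath` field; this section does not confirm that, so treat it as an assumption. The pod name, images, paths, and registry host are likewise placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: image-volume-subpath           # placeholder name
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest          # placeholder image
    volumeMounts:
    - name: artifact
      mountPath: /models
      subPath: models                  # assumed: mount only this directory of the image
  volumes:
  - name: artifact
    image:
      reference: registry.example.com/artifact:v1   # placeholder OCI reference
      pullPolicy: IfNotPresent
```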

#### GA

- Multiple examples of real world uses
- Allowing time for feedback


### Upgrade / Downgrade Strategy

<!--
@@ -872,6 +898,9 @@ enhancement:
cluster required to make on upgrade, in order to make use of the enhancement?
-->

The kube-apiserver, the kubelet, and the CRI implementation must all support the feature for it to work end to end.
If any of these components is downgraded or has the feature turned off, the volume will not be mounted.

### Version Skew Strategy

<!--
@@ -887,6 +916,9 @@ enhancement:
CRI or CNI may require updating that component before the kubelet.
-->

As with upgrades and downgrades, every component must support the feature for it to work.
Specifically, any component that does not recognize or support the field will ignore or filter it.

## Production Readiness Review Questionnaire

<!--
@@ -984,7 +1016,7 @@ Additionally, for features that are introducing a new API field, unit tests that
are exercising the `switch` of feature gate itself (what happens if I disable a
feature gate after having objects written with the new field) are also critical.
You can take a look at one potential example of such test in:
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
https://github.com/kubernetes/kubernetes/pull/97058/files#diff-7826f7adbc1996a05ab52e3f5f02429e94b68ce6bce0dc534d1be636154fded3R246-R282
-->

Yes, unit tests for the alpha release for each component. End-to-end (serial
@@ -1008,12 +1040,15 @@ rollout. Similarly, consider large clusters and how enablement/disablement
will rollout across nodes.
-->

The kube-apiserver must support the feature, and every node running a pod that mounts an image volume must have a kubelet and a CRI implementation that support it.
If any of them does not, the volume will not be mounted.

Because the field applies on a per-node basis, once the control plane agrees on the feature gates, any kubelet with the feature enabled can use it without
waiting for other worker nodes.
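
For reference, turning the feature on for a single kubelet might look like the fragment below. It is a sketch that assumes the feature gate is named `ImageVolume` and that the node is configured through a kubelet configuration file; the kube-apiserver needs the same gate enabled, and the CRI implementation must support image volumes independently of this setting.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  ImageVolume: true   # assumed gate name for this KEP; enable on kube-apiserver as well
```

Nodes whose kubelet keeps the gate off are unaffected, which is why enablement can proceed node by node.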

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->
A sharp increase in the higher-latency buckets of `kubelet_image_pull_duration_seconds_bucket`, which could indicate registry unavailability or delays.
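
One way to watch for this signal is a Prometheus alerting rule over the histogram. The rule below is only a sketch: the group and alert names, the 10-minute rate window, the 300-second threshold, and the 15-minute hold time are illustrative assumptions to tune per cluster, not values from this KEP.

```yaml
groups:
- name: image-volume-rollback-signals        # hypothetical group name
  rules:
  - alert: KubeletImagePullsSlow             # hypothetical alert name
    # p99 image pull duration per kubelet over the last 10 minutes
    expr: |
      histogram_quantile(
        0.99,
        sum by (le, instance) (rate(kubelet_image_pull_duration_seconds_bucket[10m]))
      ) > 300
    for: 15m                                 # assumed hold time before firing
    labels:
      severity: warning
    annotations:
      summary: "p99 image pull duration on {{ $labels.instance }} is above the assumed threshold"
```

Because the metric covers all image pulls, not just image volumes, a firing alert should be cross-checked against pod events before rolling back.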

###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

@@ -1023,12 +1058,16 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

Not yet, but they will be.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

<!--
Even if applying deprecation policies, they may still surprise some users.
-->

N/A

### Monitoring Requirements

<!--
@@ -1046,6 +1085,9 @@ checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->

No metric was added for alpha.
For beta, a metric can be added to the kubelet to report that an image volume was successfully mounted.

###### How can someone using this feature know that it is working for their instance?

<!--
@@ -1065,6 +1107,9 @@ Recall that end users cannot usually observe component logs or access metrics.
- [ ] Other (treat as last resort)
- Details:

TODO(haircommander): what's the best way to check this? If the kubelet ignores the field in the pod spec, then KAS could request that a
pod be created with an image volume and report that intent, but the kubelet might not have actually mounted it.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

<!--
@@ -1088,20 +1133,31 @@ question.
Pick one more of these and delete the rest.
-->

- [ ] Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- [x] Metrics
- Metric name: `kubelet_image_pull_duration_seconds_bucket`
- Components exposing the metric: kubelet
- [x] Metrics
- Metric name: `pod_start_sli_duration_seconds`
- Components exposing the metric: kubelet
- [ ] Other (treat as last resort)
- Details:

Note: there is currently no well-defined SLI[1] for stateful pods, and this feature is likely to affect it drastically when a large
image needs to be pulled from a registry.

1: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md#footnote3

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

<!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->

A metric that exposes the volume type, so admins can see when image volumes are requested.
Potentially also a success/error metric showing whether volumes were mounted.


### Dependencies

<!--
@@ -1125,6 +1181,8 @@ and creating new ones, as well as about cluster-level services (e.g. DNS):
- Impact of its degraded performance or high-error rates on the feature:
-->

A CRI implementation that supports the feature, and an available registry that hosts the requested image.

### Scalability

<!--
@@ -1152,6 +1210,8 @@ Focusing mostly on:
heartbeats, leader election, etc.)
-->

No new API calls, but there will be new image pulls from an OCI registry.

###### Will enabling / using this feature result in introducing new API types?

<!--
@@ -1161,6 +1221,8 @@ Describe them, providing:
- Supported number of objects per namespace (for namespace-scoped objects)
-->

A new type of volume: `image`.

###### Will enabling / using this feature result in any new calls to the cloud provider?

<!--
@@ -1169,6 +1231,8 @@ Describe them, providing:
- Estimated increase:
-->

Potentially: credential plugins may call the cloud provider to give the node image pull credentials, but the feature does not require it.

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

<!--
@@ -1178,6 +1242,8 @@ Describe them, providing:
- Estimated amount of new objects: (e.g., new Object X for every existing Pod)
-->

A new type of volume in pods, which will have the same size as other volume types.

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

<!--
@@ -1189,6 +1255,8 @@ Think about adding additional work or introducing new steps in between
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
-->

Not technically, because stateful pods don't have a defined SLI.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

<!--
@@ -1201,6 +1269,8 @@ Think through this both in small and large cases, again with respect to the
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
-->

CPU/Memory will be used when pulling an image volume, proportional to the size of the image.

###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

<!--
@@ -1213,6 +1283,8 @@ Are there any tests that were run/should be run to understand performance charac
and validate the declared limits?
-->

Yes, in the same way that a large image used to run a container could use these resources (so there is no additional risk).

### Troubleshooting

<!--
@@ -1228,6 +1300,8 @@ details). For now, we leave it here.

###### How does this feature react if the API server and/or etcd is unavailable?

Pods cannot be created, so the feature will not be accessible.

###### What are other known failure modes?

<!--
@@ -1243,8 +1317,14 @@ For each of them, fill in the following information by copying the below templat
- Testing: Are there any tests for failure mode? If not, describe why.
-->

Registry unavailability
- Failures can be found in the pod's events and look like image pull failures.
- No tests exist at the time of writing.

###### What steps should be taken if SLOs are not being met to determine the problem?

At the time of writing, check the pod events. If metrics are added for image pull issues, then checking those would help as well.

## Implementation History

<!--
@@ -1258,6 +1338,11 @@ Major milestones might include:
- when the KEP was retired or superseded
-->

- 16-05-2024 Issue opened
- 21-06-2024 KEP merged, targeted at Alpha
- 2-10-2024 KEP updated to beta


## Drawbacks

<!--
2 changes: 1 addition & 1 deletion keps/sig-node/4639-oci-volume-source/kep.yaml
@@ -46,7 +46,7 @@ stage: alpha
# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.31"
latest-milestone: "v1.32"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
