diff --git a/keps/prod-readiness/sig-node/4639.yaml b/keps/prod-readiness/sig-node/4639.yaml new file mode 100644 index 00000000000..d057927a9ec --- /dev/null +++ b/keps/prod-readiness/sig-node/4639.yaml @@ -0,0 +1,3 @@ +kep-number: 4639 +alpha: + approver: "@deads2k" diff --git a/keps/sig-node/4639-oci-volume-source/README.md b/keps/sig-node/4639-oci-volume-source/README.md new file mode 100644 index 00000000000..36716f6303d --- /dev/null +++ b/keps/sig-node/4639-oci-volume-source/README.md @@ -0,0 +1,1392 @@ + +# KEP-4639: OCI VolumeSource + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Story 3](#story-3) + - [Story 4](#story-4) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Vocabulary: OCI Images, Artifacts, and Objects](#vocabulary-oci-images-artifacts-and-objects) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Kubernetes API](#kubernetes-api) + - [Kubelet and Container Runtime Interface (CRI) support for OCI artifacts](#kubelet-and-container-runtime-interface-cri-support-for-oci-artifacts) + - [kubelet](#kubelet) + - [Pull Policy](#pull-policy) + - [Registry authentication](#registry-authentication) + - [CRI](#cri) + - [Container Runtimes](#container-runtimes) + - [Filesystem representation](#filesystem-representation) + - [SELinux](#selinux) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [No enhancement](#no-enhancement) + - [KEP 1495: Volume Populators](#kep-1495-volume-populators) + - [Custom CSI Plugin](#custom-csi-plugin) + - [Advantages of In-Tree OCI VolumeSource](#advantages-of-in-tree-oci-volumesource) + - [Conclusion](#conclusion) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+  - [ ] e2e Tests for all Beta API Operations (endpoints)
+  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [ ] (R) Graduation criteria is in place
+  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) Production readiness review completed
+- [ ] (R) Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+The proposed enhancement adds a new `VolumeSource` to Kubernetes that supports OCI images and/or OCI artifacts.
+This allows users to package files and share them among containers in a pod without including them in the main image,
+thereby reducing vulnerabilities and simplifying image creation.
+
+While OCI images are well-supported by Kubernetes and the CRI, extending support to
+OCI artifacts involves recognizing additional media types within container
+runtimes, implementing custom lifecycle management, resolving the registry
+referrers pattern for artifacts, and ensuring appropriate validation and
+security measures.
+
+## Motivation
+
+Supporting OCI images and artifacts directly as a `VolumeSource` allows
+Kubernetes to focus on OCI standards and to store and distribute any content
+using OCI registries. This allows the project to grow into use cases which go
+beyond running particular images.
+
+### Goals
+
+- Introduce a new `VolumeSource` type that allows mounting OCI images and/or artifacts.
+- Simplify the process of sharing files among containers in a pod.
+- Provide a runtime guideline for how artifact files and directories should be
+  mounted.
+
+### Non-Goals
+
+- This proposal does not aim to replace existing `VolumeSource` types.
+- This proposal does not address other use cases for OCI objects beyond directory sharing among containers in a pod.
+- Mounting thousands of images and artifacts in a single pod.
+- The enhancement leaves the single-file use case out for now and restricts the
+  mount output to directories.
+- The runtimes (CRI-O, containerd, others) will have to agree on the
+  implementation of how artifacts are manifested as directories. We don't want
+  to over-specify selection based on media types or other attributes now and can
+  consider that for later.
+  - We don't want to tie too strongly to how models are hosted on a particular
+    provider so we are flexible to adapt to different ways models and their
+    configurations are stored.
+  - If some file, say a VM format such as a qcow file, is stored as an
+    artifact, we don't want the runtime to be the entity responsible for
+    interpreting and correctly processing it to its final consumable state.
+    That could be delegated to the consumer or perhaps to some hooks and is
+    out of scope for alpha.
+- Manifest list use cases are left out for now and will be restricted to
+  matching architecture like we do today for images. In the future (if there are
+  use cases) we will consider support for lists with items separated by
+  quantization or format or other attributes. However, that is out of scope for
+  now as it is easily worked around by creating a different image/artifact for
+  each instance/format/quantization of a model.
+
+## Proposal
+
+We propose to add a new `VolumeSource` that supports OCI images and/or artifacts. This `VolumeSource` will allow users to mount an OCI object
+directly into a pod, making the files within the OCI object accessible to the containers without the need to include them in the main image,
+while allowing them to be hosted in OCI-compatible registries.
+
+### User Stories (Optional)
+
+#### Story 1
+
+As a Kubernetes user, I want to share a configuration file among multiple containers in a pod without including the file in my main image, so that I can
+minimize security risks and image size.
+
+Besides that, I want:
+- to package this file in an OCI object to take advantage of OCI distribution.
+- the image to be downloaded with the same credentials that the kubelet uses for other images.
+- to be able to use image pull secrets when downloading the image if it comes from a registry that requires them.
+- to be able to update the configuration if the artifact is referenced by a
+  moving tag like `latest`. To achieve that, I just have to restart the pod and
+  specify a `pullPolicy` of `Always`.
+
+#### Story 2
+
+As a DevOps engineer, I want to package and distribute binary artifacts using
+OCI image and distribution specification standards and mount them directly into
+my Kubernetes pods, so that I can streamline my CI/CD pipeline. This allows me to
+maintain a small set of base images by attaching the CI/CD artifacts to them.
+Besides that, I want to package the artifacts in an OCI object to take advantage
+of OCI distribution.
+
+#### Story 3
+
+As a data scientist, MLOps engineer, or AI developer, I want to mount large
+language model weights or machine learning model weights in a pod alongside a
+model-server, so that I can efficiently serve them without including them in the
+model-server container image. I want to package these in an OCI object to take
+advantage of OCI distribution and ensure efficient model deployment. This allows
+separating the model specifications/content from the executables that process
+them.
+
+#### Story 4
+
+As a security engineer, I want to use a public image for a malware scanner and
+mount in a volume of private (commercial) malware signatures, so that I can load
+those signatures without baking my own combined image (which might not be
+allowed by the copyright on the public image). Those files work regardless of
+the OS or version of the scanning software.
+
+### Notes/Constraints/Caveats (Optional)
+
+- This enhancement assumes that the cluster has access to the OCI registry.
+- The implementation must handle image pull secrets and other registry authentication mechanisms.
+- Performance considerations must be taken into account, especially for large images or artifacts.
+
+### Vocabulary: OCI Images, Artifacts, and Objects
+
+**OCI Image ([spec](https://github.com/opencontainers/image-spec/blob/main/spec.md)):**
+  - A container image that conforms to the Open Container Initiative (OCI) Image Specification.
+    It includes a filesystem bundle and metadata required to run a container.
+  - Consists of multiple layers (each layer being a tarball), a manifest (which lists the layers), and a config file
+    (which provides configuration data such as environment variables, entry points, etc.).
+  - **Use Case:** Used primarily for packaging and distributing containerized applications.
+
+**OCI Artifact ([guidance](https://github.com/opencontainers/image-spec/blob/main/artifacts-guidance.md)):**
+  - An artifact describes any content that is stored and distributed using the OCI image format.
+    It includes not just container images but also other types of content like Helm charts, WASM modules, machine learning models, etc.
+  - Artifacts use the same image manifest and layer structure but may contain different types of data
+    within those layers. The artifact manifest can have media types that differ from those in standard container images.
+  - **Use Case:** Allows the distribution of non-container content using the same infrastructure and tools developed for OCI images.
+
+**OCI Object:**
+  - Umbrella term encompassing both OCI images and OCI artifacts. It represents
+    any object that conforms to the OCI specifications for storage and
+    distribution and can be represented as a file or filesystem by an OCI container runtime.
+
+### Risks and Mitigations
+
+- **Security Risks:**
+  - Allowing direct mounting of OCI images introduces potential attack
+    vectors. Mitigation includes thorough security reviews and limiting access
+    to trusted registries. Limiting to OCI artifacts (non-runnable content)
+    and read-only mode will lessen the security risk.
+  - Path traversal attacks are a high risk for introducing security
+    vulnerabilities. Container runtimes should re-use their existing
+    implementations to merge layers as well as securely join symbolic links in
+    the container storage to prevent such issues.
+- **Compatibility Risks:** Existing webhooks that apply policies to the images used by a pod will need to be updated to expect images to be specified as a `VolumeSource`.
+- **Performance Risks:** Large images or artifacts could impact performance. Mitigation includes optimizations in the implementation and providing
+  guidance on best practices for users.
+
+## Design Details
+
+The new `VolumeSource` will be defined in the Kubernetes API, and the implementation will involve updating components (CRI, Kubelet)
+to support this source type. Key design aspects include:
+
+- API changes to introduce the new `VolumeSource` type.
+- Modifications to the Kubelet to handle mounting OCI images and artifacts.
+- Handling image pull secrets and registry authentication (see the sketch after
+  this list).
+- The regular OCI images (that are used to create container rootfs today) can
+  be set up similarly as a directory and mounted as a volume source.
+- For OCI artifacts, we want to convert and represent them as a directory with
+  files. A single file could also be nested inside a directory.
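+The following is a minimal, hypothetical sketch of how these aspects come together from a
+user's perspective: it combines the proposed `oci` volume with the existing pod-level
+`imagePullSecrets` mechanism to pull a private artifact. The registry, the artifact
+reference, and the secret name (`regcred`) are placeholders, and the field names follow
+the API proposed in the next section.
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: private-artifact-example
+spec:
+  # The same credential handling as for container images is reused for the OCI volume.
+  imagePullSecrets:
+  - name: regcred
+  volumes:
+  - name: signatures
+    oci:
+      # Placeholder reference to a private OCI artifact (see Story 4).
+      reference: "registry.example.com/malware-signatures:2024-06"
+      # Re-resolve the moving tag whenever the pod is (re)started.
+      pullPolicy: Always
+  containers:
+  - name: scanner
+    image: public.example.com/scanner:latest
+    volumeMounts:
+    # The artifact content is exposed as a read-only directory.
+    - name: signatures
+      mountPath: /signatures
+      readOnly: true
+```
+
+Because the kubelet reuses the credential logic it already has for container images, no
+additional configuration beyond the usual pull secret is expected for this flow.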
+
+### Kubernetes API
+
+The following code snippet illustrates the proposed API change:
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: example-pod
+spec:
+  volumes:
+  - name: oci-volume
+    oci:
+      reference: "example.com/my-image:latest"
+      pullPolicy: IfNotPresent
+  containers:
+  - name: my-container
+    image: busybox
+    volumeMounts:
+    - mountPath: /data
+      name: oci-volume
+```
+
+This means we extend the [`VolumeSource`](https://github.com/kubernetes/kubernetes/blob/7b359a2f9e1ff5cdc49cfcc4e350e9d796f502c0/staging/src/k8s.io/api/core/v1/types.go#L49)
+by:
+
+```go
+// Represents the source of a volume to mount.
+// Only one of its members may be specified.
+type VolumeSource struct {
+  // …
+
+  // oci represents an OCI object pulled and mounted on the kubelet's host machine
+  // +featureGate=OCIVolume
+  // +optional
+  OCI *OCIVolumeSource `json:"oci,omitempty" protobuf:"bytes,30,opt,name=oci"`
+}
+```
+
+And add the corresponding `OCIVolumeSource` type:
+
+```go
+// OCIVolumeSource represents an OCI volume resource.
+type OCIVolumeSource struct {
+  // Required: Image or artifact reference to be used
+  Reference string `json:"reference,omitempty" protobuf:"bytes,1,opt,name=reference"`
+
+  // Policy for pulling OCI objects
+  // Defaults to IfNotPresent
+  // +optional
+  PullPolicy PullPolicy `json:"pullPolicy,omitempty" protobuf:"bytes,2,opt,name=pullPolicy,casttype=PullPolicy"`
+}
+```
+
+The same will apply to [`pkg/apis/core/types.VolumeSource`](https://github.com/kubernetes/kubernetes/blob/7b359a2f9e1ff5cdc49cfcc4e350e9d796f502c0/pkg/apis/core/types.go#L58),
+which is the internal API compared to the external one from staging. The [API
+validation](https://github.com/kubernetes/kubernetes/blob/7b359a2f9e1ff5cdc49cfcc4e350e9d796f502c0/pkg/apis/core/validation/validation.go)
+will be extended to disallow the `subPath`/`subPathExpr` fields as well as to
+make the `reference` field mandatory:
+
+```go
+// …
+
+if source.OCI != nil {
+  if numVolumes > 0 {
+    allErrs = append(allErrs, field.Forbidden(fldPath.Child("oci"), "may not specify more than 1 volume type"))
+  } else {
+    numVolumes++
+    allErrs = append(allErrs, validateOCIVolumeSource(source.OCI, fldPath.Child("oci"))...)
+  }
+}
+
+// …
+```
+
+```go
+func validateOCIVolumeSource(oci *core.OCIVolumeSource, fldPath *field.Path) field.ErrorList {
+  allErrs := field.ErrorList{}
+  if len(oci.Reference) == 0 {
+    allErrs = append(allErrs, field.Required(fldPath.Child("reference"), ""))
+  }
+  allErrs = append(allErrs, validatePullPolicy(oci.PullPolicy, fldPath.Child("pullPolicy"))...)
+  return allErrs
+}
+```
+
+```go
+// …
+
+// Disallow subPath/subPathExpr for OCI volumes
+if v, ok := volumes[mnt.Name]; ok && v.OCI != nil {
+  if mnt.SubPath != "" {
+    allErrs = append(allErrs, field.Invalid(idxPath.Child("subPath"), mnt.SubPath, "not allowed in OCI volume sources"))
+  }
+  if mnt.SubPathExpr != "" {
+    allErrs = append(allErrs, field.Invalid(idxPath.Child("subPathExpr"), mnt.SubPathExpr, "not allowed in OCI volume sources"))
+  }
+}
+
+// …
+```
+
+### Kubelet and Container Runtime Interface (CRI) support for OCI artifacts
+
+Kubelet and the Container Runtime Interface (CRI) currently handle OCI images. To support OCI artifacts,
+potential enhancements may be required:
+
+**Extended Media Type Handling in the container runtime:**
+  - Update container runtimes to recognize and handle new media types associated with OCI artifacts.
+  - Ensure that pulling and storing these artifacts is as efficient and secure as with OCI images.
+
+**Lifecycling and Garbage Collection:**
+  - Reuse the existing kubelet logic for managing the lifecycle of OCI objects.
+  - Extend the existing image garbage collection so that it does not remove an
+    OCI volume image if a pod is still referencing it.
+
+**Artifact-Specific Configuration:**
+  - Introduce new configuration options to handle the unique requirements of different types of OCI artifacts.
+
+**Artifacts as Subject Referrers:**
+  - Introduce new options to reference an image and to filter by artifact type
+    for the artifact(s) to be mounted.
+  - Certain types of OCI artifacts include a subject reference. That reference
+    identifies the artifact/image to which this artifact refers. For example, a
+    signature artifact could refer to a platform index, for certifying the
+    platform images, or to an SBOM artifact that refers to a platform matched
+    image. These artifacts may or may not be located on the same
+    registry/repository. The new referrers API allows for discovering
+    artifacts from a requested repository.
+  - How Kubernetes and especially runtimes should support OCI referrers is not
+    part of the alpha feature and will be considered in future graduations.
+
+**Validation:**
+  - Extend validation and security checks to cover new artifact types.
+  - Disallow `subPath`/`subPathExpr` mounting through the API validation.
+
+**Storage Optimization in the container runtime:**
+  - Develop optimized storage solutions tailored for different artifact types,
+    potentially integrating with existing storage solutions or introducing new mechanisms.
+
+#### kubelet
+
+While the container runtime will be responsible for pulling and storing the OCI
+objects in the same way as for images, the kubelet still has to manage their
+full lifecycle. This means that some parts of the existing kubelet code can
+be reused, for example:
+
+- The logic for ensuring that an image exists on the node:
+  https://github.com/kubernetes/kubernetes/blob/39c6bc3/pkg/kubelet/images/image_manager.go#L102
+- The retrieval of available secrets for a pod:
+  https://github.com/kubernetes/kubernetes/blob/39c6bc3/pkg/kubelet/kubelet_pods.go#L988
+
+##### Pull Policy
+
+While `imagePullPolicy` works at the container level, the introduced
+`pullPolicy` is a pod-level construct. This means that we can support the same
+values `IfNotPresent`, `Always` and `Never`, but will only pull once per pod.
+
+Technically, this means that OCI objects need to be pulled in [`SyncPod`](https://github.com/kubernetes/kubernetes/blob/b498eb9/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L1049)
+at the pod level and not during [`EnsureImageExists`](https://github.com/kubernetes/kubernetes/blob/b498eb9/pkg/kubelet/images/image_manager.go#L102)
+before the container gets started.
+
+If users want to re-pull artifacts when referencing moving tags like `latest`,
+then they need to restart / evict the pod.
+
+The [AlwaysPullImages](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#alwayspullimages)
+admission plugin needs to respect the pull policy as well and has to set the
+field accordingly.
+
+##### Registry authentication
+
+For registry authentication, the same logic will be used as for container
+images.
+
+#### CRI
+
+The CRI API is already capable of managing container images [via the `ImageService`](https://github.com/kubernetes/cri-api/blob/3a66d9d/pkg/apis/runtime/v1/api.proto#L146-L161).
+Those RPCs will be re-used for managing OCI artifacts, while the [`ImageSpec`](https://github.com/kubernetes/cri-api/blob/3a66d9d/pkg/apis/runtime/v1/api.proto#L798-L813)
+as well as [`PullImageResponse`](https://github.com/kubernetes/cri-api/blob/3a66d9d/pkg/apis/runtime/v1/api.proto#L1530-L1534)
+will be extended to mount the OCI object to a local path:
+
+```protobuf
+
+// ImageSpec is an internal representation of an image.
+message ImageSpec {
+    // …
+
+    // Indicate that the OCI object should be mounted.
+    bool mount = 20;
+
+    // SELinux label to be used.
+    string mount_label = 21;
+}
+
+message PullImageResponse {
+    // …
+
+    // Absolute local path where the OCI object got mounted.
+    string mountpoint = 2;
+}
+```
+
+This allows re-using the existing kubelet logic for managing the OCI objects,
+with the caveat that the new `VolumeSource` won't be isolated in a dedicated
+plugin as part of the existing [volume manager](https://github.com/kubernetes/kubernetes/tree/6d0aab2/pkg/kubelet/volumemanager).
+
+The added `mount_label` allows the kubelet to support SELinux contexts.
+
+The kubelet will use the `mountpoint` on container creation
+(by calling the `CreateContainer` RPC) to indicate the additional required volume mount ([`ContainerConfig.Mount`](https://github.com/kubernetes/cri-api/blob/3a66d9d/pkg/apis/runtime/v1/api.proto#L1102))
+from the runtime. The runtime needs to ensure that mount exists and also manage its
+lifecycle, for example by removing the bind mount on container removal.
+
+The kubelet tracks the information about which OCI object is used by which
+sandbox and therefore manages their lifecycle.
+
+The proposal also considers smaller CRI changes, for example to add a list of
+mounted volume paths to the `ImageStatusResponse.Image` message returned by the
+`ImageStatus` RPC. This allows providing the right amount of information between
+the kubelet and the runtime to ensure that no context gets lost in restart
+scenarios.
+
+The overall flow for container creation will look like this:
+
+```mermaid
+sequenceDiagram
+    participant K as kubelet
+    participant C as Container Runtime
+    Note left of K: During pod sync
+    Note over K,C: CRI
+    K->>+C: RPC: PullImage
+    Note right of C: Pull and mount<br />OCI object
+    C-->>-K: PullImageResponse.Mountpoint
+    Note left of K: Add mount points<br />to container<br />creation request
+    K->>+C: RPC: CreateContainer
+    Note right of C: Add bind mounts<br />from object mount<br />point to container
+    C-->>-K: CreateContainerResponse
+```
+
+1. **Kubelet Initiates Image Pull**:
+   - During pod setup, the kubelet initiates the pull for the OCI object based on the volume source.
+   - The kubelet passes the necessary indicator to mount the object to the container runtime.
+
+2. **Runtime Handles Mounting**:
+   - The container runtime mounts the OCI object as a filesystem using the metadata provided by the kubelet.
+   - The runtime returns the mount point information to the kubelet.
+
+3. **Redirection of the Mountpoint**:
+   - The kubelet uses the returned mount point to build the container creation request for each container using that mount.
+   - The kubelet initiates the container creation and the runtime creates the required bind mounts to the target location.
+     This is the currently implemented behavior for all other mounts and should require no actual container runtime code change.
+
+4. **Lifecycle Management**:
+   - The container runtime manages the lifecycle of the mounts, ensuring they are created during pod setup and cleaned up upon sandbox removal.
+
+5. **Tracking and Coordination**:
+   - The kubelet and runtime coordinate to track pods requesting mounts to avoid removing containers with volumes in use.
+   - During image garbage collection, the runtime provides the kubelet with the necessary mount information to ensure proper cleanup.
+
+6. **SELinux Context Handling**:
+   - The runtime applies SELinux labels to the volume mounts based on the security context provided by the kubelet, ensuring consistent enforcement of security policies.
+
+7. **Pull Policy Implementation**:
+   - The `pullPolicy` at the pod level will determine when the OCI object is pulled, with options for `IfNotPresent`, `Always`, and `Never`.
+     - `IfNotPresent`: Prevents redundant pulls and uses existing images when available.
+     - `Always`: Ensures the latest images are used, for example, with development and testing environments.
+     - `Never`: Ensures only pre-pulled images are used, for example, in air-gapped or controlled environments.
+
+8. **Security and Performance Optimization**:
+   - Implement thorough security checks to mitigate risks such as path traversal attacks.
+   - Optimize performance for handling large OCI artifacts, including caching strategies and efficient retrieval methods.
+
+#### Container Runtimes
+
+Container runtimes need to support the new `mount` field; otherwise, the
+feature cannot be used. The kubelet will verify that the returned `mountpoint`
+actually exists on disk to check the feature availability, because Protobuf will
+strip the field in a backwards-compatible way for older runtimes. Pods using the
+new `VolumeSource` combined with an unsupported container runtime version will
+fail to run on the node.
+
+For security reasons, volume mounts should set the `noexec` and `ro`
+(read-only) options by default.
+
+##### Filesystem representation
+
+Container runtimes are expected to return a `mountpoint`, which is a single
+directory containing the unpacked (in case of tarballs) and merged layer files
+from the image or artifact. If an OCI artifact has multiple layers (in the same
+way as for container images), then the runtime is expected to merge them
+together. Duplicate files from distinct layers will be overwritten by the
+higher-indexed layer.
+
+Runtimes are expected to be able to handle layers as tarballs (like they do for
+images right now) as well as plain single files.
How the runtimes implement the +expected output and which media types they want to support is deferred to them +for now. Kubernetes only defines the expected output as a single directory +containing the (unpacked) content. + +###### Example using ORAS + +Assuming the following directory structure: + +```console +./ +├── dir/ +│ └── file +└── file +``` + +```console +$ cat dir/file +layer0 + +$ cat file +layer1 +``` + +Then we can manually create two distinct layers by: + +```bash +tar cfvz layer0.tar dir +tar cfvz layer1.tar file +``` + +We also need a `config.json`, ideally indicating the requested architecture: + +```bash +jq --null-input '.architecture = "amd64" | .os = "linux"' > config.json +``` + +Now using [ORAS](https://oras.land) to push the distinct layers: + +```bash +oras push --config config.json:application/vnd.oci.image.config.v1+json \ + localhost:5000/image:v1 \ + layer0.tar:application/vnd.oci.image.layer.v1.tar+gzip \ + layer1.tar:application/vnd.oci.image.layer.v1.tar+gzip +``` + +```console +✓ Uploaded layer1.tar 129/129 B 100.00% 73ms + └─ sha256:0c26e9128651086bd9a417c7f0f3892e3542000e1f1fe509e8fcfb92caec96d5 +✓ Uploaded application/vnd.oci.image.config.v1+json 47/47 B 100.00% 126ms + └─ sha256:4a2128b14c6c3699084cd60f24f80ae2c822f9bd799b24659f9691cbbfccae6b +✓ Uploaded layer0.tar 166/166 B 100.00% 132ms + └─ sha256:43ceae9994ffc73acbbd123a47172196a52f7d1d118314556bac6c5622ea1304 +✓ Uploaded application/vnd.oci.image.manifest.v1+json 752/752 B 100.00% 40ms + └─ sha256:7728cb2fa5dc31ad8a1d05d4e4259d37c3fc72e1fbdc0e1555901687e34324e9 +Pushed [registry] localhost:5000/image:v1 +ArtifactType: application/vnd.oci.image.config.v1+json +Digest: sha256:7728cb2fa5dc31ad8a1d05d4e4259d37c3fc72e1fbdc0e1555901687e34324e9 +``` + +The resulting manifest looks like: + +```bash +oras manifest fetch localhost:5000/image:v1 | jq . 
+``` + +```json +{ + "schemaVersion": 2, + "mediaType": "application/vnd.oci.image.manifest.v1+json", + "config": { + "mediaType": "application/vnd.oci.image.config.v1+json", + "digest": "sha256:4a2128b14c6c3699084cd60f24f80ae2c822f9bd799b24659f9691cbbfccae6b", + "size": 47 + }, + "layers": [ + { + "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", + "digest": "sha256:43ceae9994ffc73acbbd123a47172196a52f7d1d118314556bac6c5622ea1304", + "size": 166, + "annotations": { + "org.opencontainers.image.title": "layer0.tar" + } + }, + { + "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip", + "digest": "sha256:0c26e9128651086bd9a417c7f0f3892e3542000e1f1fe509e8fcfb92caec96d5", + "size": 129, + "annotations": { + "org.opencontainers.image.title": "layer1.tar" + } + } + ], + "annotations": { + "org.opencontainers.image.created": "2024-06-14T07:49:06Z" + } +} +``` + +The container runtime can now pull the artifact with the `mount = true` CRI +field set, for example using an experimental [`crictl pull --mount` flag](https://github.com/kubernetes-sigs/cri-tools/compare/master...saschagrunert:oci-volumesource-poc): + +```bash +sudo crictl pull --mount localhost:5000/image:v1 +``` + +```console +Image is up to date for localhost:5000/image@sha256:7728cb2fa5dc31ad8a1d05d4e4259d37c3fc72e1fbdc0e1555901687e34324e9 +Image mounted to: /var/lib/containers/storage/overlay/7ee9a1dcea9f152b10590871e55e485b249cd42ea912111ff9f99ab663c1001a/merged +``` + +And the returned `mountpoint` contains the unpacked layers as directory tree: + +```bash +sudo tree /var/lib/containers/storage/overlay/7ee9a1dcea9f152b10590871e55e485b249cd42ea912111ff9f99ab663c1001a/merged +``` + +```console +/var/lib/containers/storage/overlay/7ee9a1dcea9f152b10590871e55e485b249cd42ea912111ff9f99ab663c1001a/merged +├── dir +│   └── file +└── file + +2 directories, 2 files +``` + +```console +$ sudo cat /var/lib/containers/storage/overlay/7ee9a1dcea9f152b10590871e55e485b249cd42ea912111ff9f99ab663c1001a/merged/dir/file +layer0 + +$ sudo cat /var/lib/containers/storage/overlay/7ee9a1dcea9f152b10590871e55e485b249cd42ea912111ff9f99ab663c1001a/merged/file +layer1 +``` + +ORAS (and other tools) are also able to push multiple files or directories +within a single layer. This should be supported by container runtimes in the +same way. + +##### SELinux + +Traditionally, the container runtime is responsible of applying SELinux labels +to volume mounts, which are inherited from the `securityContext` of the pod or +container. Relabeling volume mounts can be time-consuming, especially when there +are many files on the volume. + +If the following criteria are met, then the kubelet will use the `mount_label` +field in the CRI to apply the right SELinux label to the mount. + +- The operating system must support SELinux +- The Pod must have at least `seLinuxOptions.level` assigned in the + `PodSecurityContext` or all volume using containers must have it set in their + `SecurityContexts`. Kubernetes will read the default user, role and type from + the operating system defaults (typically `system_u`, `system_r` and `container_t`). + +### Test Plan + + + +[ ] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. 
+ +##### Prerequisite testing updates + + + +##### Unit tests + + + + + +- ``: `` - `` + +##### Integration tests + + + + + +- : + +##### e2e tests + + + +- : + +### Graduation Criteria + + + +### Upgrade / Downgrade Strategy + + + +### Version Skew Strategy + + + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + + + +###### How can this feature be enabled / disabled in a live cluster? + + + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: OCIVolume + - Components depending on the feature gate: + - kube-apiserver (API validation) + - kubelet (volume mount) + +###### Does enabling the feature change any default behavior? + + + +Yes, it makes the new `VolumeSource` API functional. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + + +Yes, by disabling the feature gate. Existing workloads will not be affected by +the change. + +To clear old volumes, all workloads using the `VolumeSource` needs to be +recreated after restarting the kubelets. The kube-apiserver does only the API +validation whereas the kubelets serve the implementation. This means means that +a restart of the kubelet as well as the workload would be enough to disable the +feature functionality. + +###### What happens if we reenable the feature if it was previously rolled back? + +It will make the API functional again. If the feature gets re-enabled only for a +subset of kubelets and a user runs a scalable deployment or daemonset, then the +volume source will be only available for some pod instances. + +###### Are there any tests for feature enablement/disablement? + + + +Yes, unit tests for the alpha release for each component. End-to-end (serial +node) tests will be targeted for beta. + +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + + +###### What specific metrics should inform a rollback? + + + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +###### How can someone using this feature know that it is working for their instance? + + + +- [ ] Events + - Event Reason: +- [ ] API .status + - Condition name: + - Other field: +- [ ] Other (treat as last resort) + - Details: + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? + + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [ ] Metrics + - Metric name: + - [Optional] Aggregation method: + - Components exposing the metric: +- [ ] Other (treat as last resort) + - Details: + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + + + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + + +###### Will enabling / using this feature result in introducing new API types? + + + +###### Will enabling / using this feature result in any new calls to the cloud provider? 
+ + + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + + +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +###### What are other known failure modes? + + + +###### What steps should be taken if SLOs are not being met to determine the problem? + +## Implementation History + + + +## Drawbacks + + + +## Alternatives + +### No enhancement + +Currently, a shared volume approach can be used. This involves packaging file to share in an image that includes a shell in its base layer. +An init container can be used to copy files from an image to a shared volume using shell commands. This volume can be made accessible to all +containers in the pod. + +An OCI VolumeSource eliminates the need for a shell and an init container by allowing the direct mounting of OCI images as volumes, +making it easier to modularize. For example, in the case of LLMs and model-servers, it is useful to package them in separate images, +so various models can plug into the same model-server image. An OCI VolumeSource not only simplifies file copying but also allows +container native distribution, authentication, and version control for files. + +### [KEP 1495: Volume Populators](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1495-volume-populators) + +The volume-populators API extension allows you to populate a volume with data from an external data source when the volume is created. +This is a good solution for restoring a volume from a snapshot or initializing a volume with data from a database backup. However, it does not +address the desire to use OCI distribution, versioning, and signing for mounted data. + +The proposed in-tree OCI VolumeSource provides a direct and integrated approach to mount OCI artifacts, leveraging the existing OCI +infrastructure for packaging, distribution, and security. + +### Custom CSI Plugin + +See [https://github.com/warm-metal/container-image-csi-driver](https://github.com/warm-metal/container-image-csi-driver) + +An out-of-tree CSI plugin can provide flexibility and modularity, but there are trade-offs to consider: + + - Complexity of managing an external CSI plugin. This includes handling the installation, configuration, and updates of the CSI driver, which adds + an additional operational burden. For a generic, vendor-agnostic, and widely-adopted solution this would not make sense. + - Supporting the image pull secrets as well as credentials provider will be tricky and needs to be reimplemented with the separate API calls. + - External CSI plugins implement their own lifecycle management and garbage collection mechanisms, + yet these already exist in-tree for OCI images. + - The kubelet has max parallel image pull constant to maintain the reasonable + load on a disk and network. This will not be respected by CSI driver and + the only point of integration may be if we move this constant down to + runtime. + - The kubelet has GC logic that is not cleaning up images immediately in + case they will be reused. 
Also GC logic has it's own thresholds and + behavior on eviction. It will be nice to have those integrated. + - The kubelet exposes metrics on image pulls and we have KEP in place to + improve it even further. Having CSI exposing those metrics will require + customer to integrate with one more source of data. + - Performance: There is additional overhead with an out-of-tree CSI plugin, especially in scenarios requiring frequent image pulls + or large volumes of data. + +### Advantages of In-Tree OCI VolumeSource + +1. **Leverage Existing Mechanisms:** + - **No New Data Types or Objects:** OCI images are already a core part of the Kubernetes ecosystem. Extending support for OCI artifacts, many of + the same mechanisms will be reused. This ensures consistency and reduces complexity, as both adhere to the same OCI image format. + - **Existing Lifecycle Management and Garbage Collection:** Kubernetes has efficient lifecycle management and garbage collection mechanisms for + volumes and container images. The in-tree OCI VolumeSource will utilize these existing mechanisms. + +2. **Integration with Kubernetes:** + - **Optimal Performance:** Deep integration with the scheduler and kubelet ensures optimal performance and + resource management. This integration allows the OCI VolumeSource to benefit from all existing optimizations and features. + - **Unified Interface:** Users interact with a consistent and unified interface for managing volumes, reducing the learning curve and + potential for configuration errors. + +3. **Simplified Maintenance and Updates:** + - **Core Project Maintenance:** In-tree features are maintained and updated as part of the core project. It makes sense + for widely-used and vendor agnostic features to utilize the core testing infrastructure, release cycles, and security updates. + +### Conclusion + +The in-tree implementation of an OCI VolumeSource offers significant advantages by leveraging existing core mechanisms, +ensuring deep integration, and simplifying management. This approach avoids the complexity, duplication, and other potential inefficiencies +of out-of-tree CSI plugins, providing a more reliable solution for mounting OCI images and artifacts. diff --git a/keps/sig-node/4639-oci-volume-source/kep.yaml b/keps/sig-node/4639-oci-volume-source/kep.yaml new file mode 100644 index 00000000000..e53effcbcaf --- /dev/null +++ b/keps/sig-node/4639-oci-volume-source/kep.yaml @@ -0,0 +1,67 @@ +title: OCI objects as VolumeSource +kep-number: 4639 +authors: + - "@sallyom" + - "@saschagrunert" +owning-sig: sig-node +participating-sigs: + - sig-node + - sig-storage +status: implementable +creation-date: 2024-05-17 +reviewers: + - "@BigVan" + - "@ChaoyiHuang" + - "@SergeyKanzhelev" + - "@aojea" + - "@arewm" + - "@cgwalters" + - "@gnufied" + - "@haircommander" + - "@humblec" + - "@jsafrane" + - "@kfox1111" + - "@ktarplee" + - "@liu-cong" + - "@mariasalcedo" + - "@mikebrow" + - "@mrunalp" + - "@rchincha" + - "@samuelkarp" + - "@sftim" + - "@smarterclayton" + - "@sudo-bmitch" + - "@syed" + - "@terrytangyuan" + - "@vsoch" + - "@xing-yang" +approvers: + - "@sig-node-leads" + - "@sig-storage-leads" + - "@mrunalp" + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.31" + +# The milestone at which this feature was, or is targeted to be, at each stage. 
+milestone: + alpha: "v1.31" + beta: "v1.32" + stable: "v1.33" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: OCIVolume + components: + - kube-apiserver + - kubelet +disable-supported: true + +# The following PRR answers are required at beta release +metrics: []