diff --git a/keps/sig-storage/1698-generic-ephemeral-volumes/README.md b/keps/sig-storage/1698-generic-ephemeral-volumes/README.md
new file mode 100644
index 000000000000..dd606d1fc70a
--- /dev/null
+++ b/keps/sig-storage/1698-generic-ephemeral-volumes/README.md
@@ -0,0 +1,566 @@

# KEP-1698: generic ephemeral inline volumes

- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Persistent Memory as DRAM replacement for memcached](#persistent-memory-as-dram-replacement-for-memcached)
    - [Local LVM storage as scratch space](#local-lvm-storage-as-scratch-space)
    - [Read-only access to volumes with data](#read-only-access-to-volumes-with-data)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Example](#example)
- [Design Details](#design-details)
  - [PVC meta data](#pvc-meta-data)
  - [Preventing accidental collision with existing PVCs](#preventing-accidental-collision-with-existing-pvcs)
  - [Pod events](#pod-events)
  - [Feature gate](#feature-gate)
  - [Modifying volumes](#modifying-volumes)
  - [Late binding](#late-binding)
  - [Test Plan](#test-plan)
  - [Graduation Criteria](#graduation-criteria)
    - [Alpha -> Beta Graduation](#alpha---beta-graduation)
    - [Beta -> GA Graduation](#beta---ga-graduation)
  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
  - [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature enablement and rollback](#feature-enablement-and-rollback)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
  - [Monitoring requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Alternatives](#alternatives)

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [X] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [X] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

This KEP proposes an alternative mechanism for specifying volumes
inside a pod spec. Those inline volumes then get converted to normal
volume claim objects for provisioning, so storage drivers do not need
to be modified.

The lifecycle of those volumes is ephemeral: they will get deleted
once the pod terminates.
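
For illustration only, such a volume might be declared inside a pod
spec roughly as follows; the exact API shape is defined in the
[proposal](#proposal) below and the storage class name is just an
example:

```
volumes:
- name: scratch
  ephemeral:
    volumeClaimTemplate:
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: "scratch-storage-class"
        resources:
          requests:
            storage: 1Gi
```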

Because this approach is expected to be slower than [CSI ephemeral inline
volumes](https://github.com/kubernetes/enhancements/issues/596), both
approaches will need to be supported.

## Motivation

The current CSI ephemeral inline volume feature has demonstrated that
ephemeral inline volumes are a useful concept, for example to provide
additional scratch space for a pod. Several CSI drivers support it
now, most of them in addition to normal volumes (see
https://kubernetes-csi.github.io/docs/drivers.html).

However, the original design was intentionally focused on
lightweight, local volumes. It is not a good fit for volumes provided
by a more traditional storage system because:
- The normal API for selecting volume parameters (like size and
  storage class) is not supported.
- Integration into storage capacity aware pod scheduling is
  challenging and would depend on extending the CSI ephemeral inline
  volume API.
- CSI drivers need to be adapted and have to take over some of the
  work normally done by Kubernetes, like tracking of orphaned volumes.

### Goals

- Volumes can be specified inside the pod spec ("inline").
- Volumes are created for specific pods and deleted after the pod
  terminates ("ephemeral").
- A normal, unmodified storage driver can be selected via a storage class.
- The volume will be created using the normal storage provisioning
  mechanism, without having to modify the driver or its deployment.
- Storage capacity tracking can also be enabled for such volumes.

### Non-Goals

- This will not replace CSI ephemeral inline volumes because the
  goals for those (lightweight local volumes) are not a good fit
  for the approach proposed here.
- These inline volumes will always be ephemeral. If making them persistent
  were allowed, some additional controller would be needed to manage them
  after pod termination, in which case it would probably be simpler to
  also create them separately.

## Proposal

A new volume source will be introduced:

```
type EphemeralVolumeSource struct {
	// VolumeClaimTemplate will be created as a stand-alone PVC
	// to provision the volume.
	// Required, must not be nil.
	VolumeClaimTemplate *PersistentVolumeClaim

	// ReadOnly specifies whether the volume gets mounted read-only.
	ReadOnly bool
}
```

This mimics a
[`PersistentVolumeClaimVolumeSource`](https://pkg.go.dev/k8s.io/api/core/v1?tab=doc#PersistentVolumeClaimVolumeSource),
except that it contains a `PersistentVolumeClaim` instead of
referencing one by name. A full `PersistentVolumeClaim` is used for
consistency with the StatefulSet API.

`VolumeClaimTemplate` is defined as a pointer because in the future there
might be alternative ways to specify how the ephemeral
volume gets provisioned, in which case `nil` will become a valid value.

It gets embedded in the existing [`VolumeSource`
struct](https://pkg.go.dev/k8s.io/api/core/v1?tab=doc#VolumeSource)
as another alternative to the existing ways of providing the
volume, like
[`CSIVolumeSource`](https://pkg.go.dev/k8s.io/api/core/v1?tab=doc#CSIVolumeSource)
and
[`PersistentVolumeClaimVolumeSource`](https://pkg.go.dev/k8s.io/api/core/v1?tab=doc#PersistentVolumeClaimVolumeSource).

```
type VolumeSource struct {
	...
	CSI *CSIVolumeSource
	PersistentVolumeClaim *PersistentVolumeClaimVolumeSource
	...
	Ephemeral *EphemeralVolumeSource
	...
}
```

A new controller in `kube-controller-manager` is responsible for
creating a new PVC for each such ephemeral inline volume.
It does that:
- with a deterministic name that is a concatenation of pod name and
  the `Volume.Name` of the volume,
- in the namespace of the pod,
- with the pod as owner.

Kubernetes already prevents adding, removing or updating volumes in a
pod, and the ownership relationship ensures that volumes get deleted
when the pod gets deleted, so the new controller only needs to take
care of creating missing PVCs. A sketch of this logic is shown at the
end of this section, just before the [example](#example).

When the [volume scheduling
library](https://github.com/kubernetes/kubernetes/tree/v1.18.0/pkg/controller/volume/scheduling)
inside the kube-scheduler or kubelet encounters such a volume inside a
pod, it determines the name of the PVC and proceeds as it currently
does for a `PersistentVolumeClaimVolumeSource`.

### User Stories

#### Persistent Memory as DRAM replacement for memcached

Recent releases of memcached added [support for using Persistent
Memory](https://memcached.org/blog/persistent-memory/) (PMEM) instead
of normal DRAM. When deploying memcached through one of the app
controllers, `EphemeralVolumeSource` makes it possible to request a volume
of a certain size from a CSI driver like
[PMEM-CSI](https://github.com/intel/pmem-csi).

#### Local LVM storage as scratch space

Applications working with data sets that exceed the RAM size can
request local storage with performance characteristics or a size that
normal Kubernetes `EmptyDir` volumes cannot provide. For example,
[TopoLVM](https://github.com/cybozu-go/topolvm) was written for that
purpose.

#### Read-only access to volumes with data

Provisioning a volume might result in a non-empty volume:
- [restoring a snapshot](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#volume-snapshot-and-restore-volume-from-snapshot-support)
- [cloning a volume](https://kubernetes.io/docs/concepts/storage/volume-pvc-datasource)
- [generic data populators](https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/20200120-generic-data-populators.md)

For those, it might make sense to mount the volume read-only inside
the pod to prevent accidental modification of the data. For example,
the goal might be to just retrieve the data and/or copy it elsewhere.

### Risks and Mitigations

Enabling this feature allows users to create PVCs indirectly if they can
create pods, even if they do not have permission to create PVCs
directly. Cluster administrators must be made aware of this. If this
does not fit their security model, they can disable the feature
through its feature gate.

In addition, with a new
[`FSType`](https://github.com/kubernetes/kubernetes/blob/1fb0dd4ec5134014e466509163152112626d52c3/pkg/apis/policy/types.go#L278-L309)
it will be possible to limit the usage of this volume source via the
[PodSecurityPolicy
(PSP)](https://kubernetes.io/docs/concepts/policy/pod-security-policy/#volumes-and-file-systems).

The normal quota for PVCs in a namespace still applies, so
even if users are allowed to use this new mechanism, they cannot use
it to circumvent other policies.
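
To make the controller behavior described above more concrete, here is a
minimal, hypothetical sketch in Go. It assumes the `EphemeralVolumeSource`
type from the proposal; the function names `newPVC` and `podOwnsPVC` are
illustrative and not part of the actual implementation:

```
package ephemeral

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newPVC sketches how the new controller could construct the PVC for
// one ephemeral inline volume of a pod. The real implementation also
// needs informers, a work queue, error handling, etc.
func newPVC(pod *v1.Pod, vol *v1.Volume) *v1.PersistentVolumeClaim {
	template := vol.Ephemeral.VolumeClaimTemplate
	pvc := &v1.PersistentVolumeClaim{
		// Meta data and spec are copied from the template ...
		ObjectMeta: *template.ObjectMeta.DeepCopy(),
		Spec:       *template.Spec.DeepCopy(),
	}
	// ... except for name and namespace: the deterministic name is the
	// concatenation of pod name and volume name (with a "-" separator,
	// as in the example below), the namespace is the one of the pod.
	pvc.Name = pod.Name + "-" + vol.Name
	pvc.Namespace = pod.Namespace
	// The pod becomes the owner, which ensures that the PVC gets
	// garbage-collected when the pod gets deleted.
	isTrue := true
	pvc.OwnerReferences = []metav1.OwnerReference{{
		APIVersion:         "v1",
		Kind:               "Pod",
		Name:               pod.Name,
		UID:                pod.UID,
		Controller:         &isTrue,
		BlockOwnerDeletion: &isTrue,
	}}
	return pvc
}

// podOwnsPVC sketches the check that the volume scheduling library and
// kubelet must perform before using a PVC: in case of a name conflict
// with a pre-existing PVC, that PVC must be ignored.
func podOwnsPVC(pod *v1.Pod, pvc *v1.PersistentVolumeClaim) bool {
	return metav1.IsControlledBy(pvc, pod)
}
```

The ownership check at the end is what protects against name conflicts
with pre-existing PVCs (see [Preventing accidental collision with
existing PVCs](#preventing-accidental-collision-with-existing-pvcs)).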

## Example

Here is a full example of a higher-level object that uses a generic ephemeral
inline volume:

```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd-elasticsearch
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: fluentd-elasticsearch
  template:
    metadata:
      labels:
        name: fluentd-elasticsearch
    spec:
      containers:
      - name: fluentd-elasticsearch
        image: quay.io/fluentd_elasticsearch/fluentd:v2.5.2
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: scratch
          mountPath: /scratch
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: scratch
        ephemeral:
          volumeClaimTemplate:
            metadata:
              labels:
                type: fluentd-elasticsearch-volume
            spec:
              accessModes: [ "ReadWriteOnce" ]
              storageClassName: "scratch-storage-class"
              resources:
                requests:
                  storage: 1Gi
```

The DaemonSet controller will create pods with names like
`fluentd-elasticsearch-b96sd` and the new controller will then add a
PVC called `fluentd-elasticsearch-b96sd-scratch` for that pod.

## Design Details

### PVC meta data

The namespace is the same as for the pod and the name is the
concatenation of pod name and the `Volume.Name`. Because volume names
must be unique within a pod, this maps each ephemeral volume of a pod
to its own PVC. Users must take care not to exceed the length limit
for object names; otherwise the volumes cannot be created.

Other meta data is copied from the `VolumeClaimTemplate.ObjectMeta`.

### Preventing accidental collision with existing PVCs

The new controller will only create missing PVCs. It neither needs to
delete (handled by garbage collection) nor update (`PodSpec.Volumes`
is immutable) existing PVCs. Therefore there is no risk that the
new controller will accidentally modify unrelated PVCs.

The volume scheduling library and kubelet must check that a PVC is
owned by the pod before using it. Otherwise they must ignore it. This
ensures that a volume is not used accidentally for a pod in case of a
name conflict.

### Pod events

Errors that occur while creating the PVC will be reported as events
for the pod and thus be visible to the user. This includes quota
violations and name collisions.

### Feature gate

The `GenericEphemeralVolumes` feature gate controls whether:
- the new controller is active in the `kube-controller-manager`,
- new pods can be created with an `EphemeralVolumeSource`,
- components that specifically act upon an `EphemeralVolumeSource`
  (scheduler, kubelet, etc.), in contrast to those that merely copy it
  (like the statefulset controller), accept the new volume source.

Existing pods with such a volume will not be started when the feature
gate is off.

### Modifying volumes

Once the PVC for an ephemeral volume has been created, it can be updated
directly like other PVCs. The new controller will not interfere with that
because it never updates PVCs. This can be used to control features
like volume resizing.

### Late binding

Ideally, provisioning should use late binding (aka
`WaitForFirstConsumer`). The initial implementation assumes that the
user is taking care of that by selecting that provisioning mode in the
storage class for the PVC, as shown in the sketch below.

Later, `kube-scheduler` and `external-provisioner` can be changed to
automatically enable late binding for PVCs which are owned by a pod.
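
For illustration, a storage class that selects late binding for the
PVCs in the example above might look like this; the provisioner name is
a placeholder for whatever CSI driver is deployed in the cluster:

```
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: scratch-storage-class
provisioner: scratch.csi.example.com
volumeBindingMode: WaitForFirstConsumer
```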

### Test Plan

- Unit tests will be added for the API change.
- Unit tests will cover the functionality of the controller, similar
  to
  https://github.com/kubernetes/kubernetes/blob/v1.18.2/pkg/controller/volume/persistentvolume/pv_controller_test.go.
- Unit tests need to cover the positive case (feature enabled) as
  well as the negative cases (feature disabled or feature used incorrectly).
- A new [storage test
  suite](https://github.com/kubernetes/kubernetes/blob/2b2cf8df303affd916bbeda8c2184b023f6ee53c/test/e2e/storage/testsuites/base.go#L84-L94)
  will be added which tests ephemeral volume creation, usage and deletion
  in combination with all drivers that support dynamic volume
  provisioning.

### Graduation Criteria

#### Alpha -> Beta Graduation

- Gather feedback from developers and surveys
- Errors emitted as pod events
- Decide whether `CSIVolumeSource` (in beta at the moment) should be
  merged with `EphemeralVolumeSource`
- Tests are in Testgrid and linked in KEP

#### Beta -> GA Graduation

- 3 examples of real world usage
- Downgrade tests and scalability tests
- Allowing time for feedback

### Upgrade / Downgrade Strategy

When downgrading to a cluster which does not support generic ephemeral inline
volumes (either because the feature gate is disabled or because of an actual
version downgrade), pods using such volumes will no longer be started.

### Version Skew Strategy

As with downgrades, having some of the relevant components at an older
version will prevent pods from starting.

## Production Readiness Review Questionnaire

### Feature enablement and rollback

* **How can this feature be enabled / disabled in a live cluster?**
  - [X] Feature gate
    - Feature gate name: GenericEphemeralVolumes
    - Components depending on the feature gate:
      - kube-apiserver
      - kube-controller-manager
      - kubelet

* **Does enabling the feature change any default behavior?**
  If users are allowed to create pods but not PVCs, then generic ephemeral
  inline volumes grant them the ability to create PVCs indirectly. Cluster
  admins must take that into account in their permission model.

* **Can the feature be disabled once it has been enabled (i.e. can we rollback
  the enablement)?**
  Yes, by disabling the feature gate. Existing pods with generic ephemeral
  inline volumes that haven't started yet will not be able to start up anymore,
  because kubelet does not know what to do with the volume. It also will not
  know how to unmount such volumes, which would cause pods to get stuck, so
  nodes should be drained before removing the feature gate in kubelet.

* **What happens if we reenable the feature if it was previously rolled back?**
  Pods that got stuck will work again.

* **Are there any tests for feature enablement/disablement?**
  Yes, unit tests for the apiserver and kubelet.

### Rollout, Upgrade and Rollback Planning

Will be added before the transition to beta.

* **How can a rollout fail? Can it impact already running workloads?**

* **What specific metrics should inform a rollback?**

* **Were upgrade and rollback tested? Was upgrade->downgrade->upgrade path tested?**

* **Is the rollout accompanied by any deprecations and/or removals of features,
  APIs, fields of API types, flags, etc.?**

### Monitoring requirements

Will be added before the transition to beta.

* **How can an operator determine if the feature is in use by workloads?**

* **What are the SLIs (Service Level Indicators) an operator can use to
  determine the health of the service?**

* **What are the reasonable SLOs (Service Level Objectives) for the above SLIs?**

* **Are there any missing metrics that would be useful to have to improve
  observability of this feature?**

### Dependencies

Will be added before the transition to beta.

* **Does this feature depend on any specific services running in the cluster?**

### Scalability

Will be added before the transition to beta.

* **Will enabling / using this feature result in any new API calls?**

* **Will enabling / using this feature result in introducing new API types?**

* **Will enabling / using this feature result in any new calls to cloud
  provider?**

* **Will enabling / using this feature result in increasing size or count
  of the existing API objects?**

* **Will enabling / using this feature result in increasing time taken by any
  operations covered by [existing SLIs/SLOs][]?**

* **Will enabling / using this feature result in non-negligible increase of
  resource usage (CPU, RAM, disk, IO, ...) in any components?**

### Troubleshooting

Will be added before the transition to beta.

* **How does this feature react if the API server and/or etcd is unavailable?**

* **What are other known failure modes?**

* **What steps should be taken if SLOs are not being met to determine the problem?**

[supported limits]: https://git.k8s.io/community/sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

## Implementation History

- Kubernetes 1.19: alpha (tentative)

## Alternatives

The alternative to creating PVC objects is to modify all components
that currently interact with PVCs such that they can work both with
stand-alone PVC objects (as they do now) and with PVCs embedded inside
pods. The downside is that this would no longer work with unmodified
CSI deployments because extensions in the CSI external-provisioner
would be needed.

Some of the current usages of PVCs would become a bit unusual (status
updates inside a pod spec) or tricky (references from a PV to its PVC).

The advantage is that no automatically created PVCs are
needed. However, other controllers also create user-visible objects
(statefulset -> pod and PVC, deployment -> replicaset -> pod), so this
concept is familiar to users.
diff --git a/keps/sig-storage/1698-generic-ephemeral-volumes/kep.yaml b/keps/sig-storage/1698-generic-ephemeral-volumes/kep.yaml
new file mode 100644
index 000000000000..2435ed3d41a9
--- /dev/null
+++ b/keps/sig-storage/1698-generic-ephemeral-volumes/kep.yaml
@@ -0,0 +1,30 @@
title: generic ephemeral inline volumes
kep-number: 1698
authors:
  - "@pohly"
owning-sig: sig-storage
participating-sigs:
  - sig-storage
status: provisional
creation-date: 2020-04-17
reviewers:
  - "@jsafrane"
approvers:
  - "@saad-ali"
stage: alpha
latest-milestone: "v1.19"
milestone:
  alpha: "v1.19"
  beta: ""
  stable: ""
feature-gate:
  name: GenericEphemeralVolumes
  components:
    - kube-apiserver
    - kube-controller-manager
    - kubelet
disable-supported: true

# The following PRR answers are required at beta release
#metrics:
# - my_feature_metric