From 9d4ff2e6b4832db625801a38331595c4b5fd7f34 Mon Sep 17 00:00:00 2001
From: Patrick Ohly
Date: Fri, 17 Apr 2020 19:42:07 +0200
Subject: [PATCH] initial generic inline volumes KEP

---
 .../1698-generic-inline-volumes/README.md     | 410 ++++++++++++++++++
 .../1698-generic-inline-volumes/kep.yaml      |  13 +
 2 files changed, 423 insertions(+)
 create mode 100644 keps/sig-storage/1698-generic-inline-volumes/README.md
 create mode 100644 keps/sig-storage/1698-generic-inline-volumes/kep.yaml

diff --git a/keps/sig-storage/1698-generic-inline-volumes/README.md b/keps/sig-storage/1698-generic-inline-volumes/README.md
new file mode 100644
index 000000000000..901d180c8755
--- /dev/null
+++ b/keps/sig-storage/1698-generic-inline-volumes/README.md
@@ -0,0 +1,410 @@
+
+# KEP-1698: generic inline volumes
+
+
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories (optional)](#user-stories-optional)
+    - [Persistent Memory as DRAM replacement for memcached](#persistent-memory-as-dram-replacement-for-memcached)
+    - [Local LVM storage as scratch space](#local-lvm-storage-as-scratch-space)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Test Plan](#test-plan)
+  - [Graduation Criteria](#graduation-criteria)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+  - [Version Skew Strategy](#version-skew-strategy)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+
+
+## Release Signoff Checklist
+
+
+
+- [ ] Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] KEP approvers have approved the KEP status as `implementable`
+- [ ] Design details are appropriately documented
+- [ ] Test plan is in place, giving consideration to SIG Architecture and SIG Testing
input
+- [ ] Graduation criteria is in place
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+This KEP proposes an alternative mechanism for specifying volumes
+inside a pod spec. Those inline volumes then get converted to normal
+volume claim objects for provisioning, so storage drivers do not need
+to be modified.
+
+Because this is expected to be slower than [CSI ephemeral inline
+volumes](https://github.com/kubernetes/enhancements/issues/596), both
+approaches will need to be supported.
+
+
+## Motivation
+
+The current CSI ephemeral inline volume feature has demonstrated that
+ephemeral inline volumes are a useful concept, for example to provide
+additional scratch space for a pod. Several CSI drivers support it
+now, most of them in addition to normal volumes (see
+https://kubernetes-csi.github.io/docs/drivers.html).
+
+However, the original design was intentionally focused on
+light-weight, local volumes. It is not a good fit for volumes provided
+by a more traditional storage system because:
+- The normal API for selecting volume parameters (like size and
+  storage class) is not supported.
+- Integration into storage capacity aware pod scheduling is
+  challenging and would depend on extending the CSI ephemeral inline
+  volume API.
+- CSI drivers need to be adapted and have to take over some of the
+  work normally done by Kubernetes, like tracking of orphaned volumes.
+
+
+### Goals
+
+- Volumes can be specified inside the pod spec ("inline").
+- Volumes are created for specific pods and deleted after the pod
+  terminates ("ephemeral").
+- A normal, unmodified storage driver can be selected via a storage class.
+- The volume will be created using the normal storage provisioning
+  mechanism, without having to modify the driver or its deployment.
+- Storage capacity tracking can also be enabled for such inline
+  volumes.
+
+### Non-Goals
+
+- This will not replace CSI ephemeral inline volumes.
+- Inline volumes could also be kept alive after their pod terminates,
+  but that is only useful if some higher-level logic then takes care
+  of deletion. For now this is out of scope.
+
+## Proposal
+
+A new volume source will be introduced:
+
+```
+type InlineVolumeSource struct {
+    VolumeClaimTemplate PersistentVolumeClaim
+    ReadOnly            bool
+}
+```
+
+When the [volume scheduling
+library](https://github.com/kubernetes/kubernetes/tree/v1.18.0/pkg/controller/volume/scheduling)
+inside the kube-scheduler encounters such a volume inside a pod, it
+creates a PVC inside the same namespace as the pod. The name of that
+PVC is a concatenation of the pod name and the `Volume.Name` of the volume
+and thus unique for the pod and each volume in that pod. Care must be
+taken by the user to not exceed the length limit for object names,
+otherwise volumes cannot be created.
+
+The `VolumeClaimTemplate.ObjectMeta.Name` and
+`VolumeClaimTemplate.ObjectMeta.Namespace` are ignored. Labels from
+`VolumeClaimTemplate.ObjectMeta.Labels` are copied.
+
+
+<<[UNRESOLVED @pohly ]>>
+
+Using a full-blown `PersistentVolumeClaim` instead of just
+`PersistentVolumeClaimSpec` might be overkill. The approach above
+follows the example set by statefulset.
+
+<<[/UNRESOLVED]>>
+
+When creating that PVC, the pod is set as owner. That ensures that the
+volume will be deleted automatically when the pod gets deleted.
+
+When a PVC with that name already exists, it is only used if it has
+the pod as owner.
Otherwise it is left unmodified and the pod cannot
+start because the volume cannot be provisioned. This covers the case
+where there is some accidental conflict with some unrelated PVC.
+
+When kubelet is asked to start a pod with such a volume, it uses the
+same code as for `PersistentVolumeClaimVolumeSource`, with the only
+exception that the claim name is computed dynamically.
+
+<<[UNRESOLVED @pohly ]>>
+
+Which parts of a PersistentVolumeClaimSpec are immutable? Do we need to
+support updating the `VolumeClaimTemplate` by copying the mutable
+fields into the PVC?
+
+<<[/UNRESOLVED]>>
+
+<<[UNRESOLVED @pohly ]>>
+
+Ideally, the storage class should use late binding. Do we want to
+leave that to the user (no further changes needed) or change the late
+binding check in kube-scheduler and external-provisioner so that PVCs
+with a pod as owner are always treated as "late binding"?
+
+<<[/UNRESOLVED]>>
+
+### User Stories (optional)
+
+#### Persistent Memory as DRAM replacement for memcached
+
+Recent releases of memcached added [support for using Persistent
+Memory](https://memcached.org/blog/persistent-memory/) (PMEM) instead
+of normal DRAM. When deploying memcached through one of the app
+controllers, `InlineVolumeSource` makes it possible to request a volume
+of a certain size from a CSI driver like
+[PMEM-CSI](https://github.com/intel/pmem-csi).
+
+#### Local LVM storage as scratch space
+
+Applications working with data sets that exceed the RAM size can
+request local storage with performance characteristics or size that is
+not met by the normal Kubernetes `EmptyDir` volumes. For example,
+[TopoLVM](https://github.com/cybozu-go/topolvm) was written for that
+purpose.
+
+### Risks and Mitigations
+
+Enabling this feature allows users to create PVCs indirectly if they can
+create pods, even if they do not have permission to create them
+directly. Cluster administrators must be made aware of this.
If this
+does not fit their security model, they can disable the feature
+through the feature gate that will be added for the feature.
+
+Alternatively, a label on a namespace could be used to disable the
+feature just for that namespace.
+
+## Design Details
+
+
+
+### Test Plan
+
+
+
+### Graduation Criteria
+
+
+
+### Upgrade / Downgrade Strategy
+
+
+
+### Version Skew Strategy
+
+
+
+## Implementation History
+
+
+
+## Drawbacks
+
+
+
+## Alternatives
+
+The alternative to creating the PVC is to modify components that
+currently interact with a PVC such that they can work with stand-alone
+PVC objects (like they do now) and with the embedded PVCs inside
+pods. The downside is that this then no longer works with unmodified
+CSI deployments because extensions in the CSI external-provisioner
+will be needed.
+
+Some of the current usages of PVC will become a bit unusual (status
+update inside pod spec) or tricky (references from PV to PVC).
+
+The advantage is that no automatically created PVCs are
+needed. However, other controllers also create user-visible objects
+(statefulset -> pod and PVC, deployment -> replicaset -> pod), so this
+concept is familiar to users.
diff --git a/keps/sig-storage/1698-generic-inline-volumes/kep.yaml b/keps/sig-storage/1698-generic-inline-volumes/kep.yaml
new file mode 100644
index 000000000000..57435f50725e
--- /dev/null
+++ b/keps/sig-storage/1698-generic-inline-volumes/kep.yaml
@@ -0,0 +1,13 @@
+title: generic inline volumes
+kep-number: 1698
+authors:
+  - "@pohly"
+owning-sig: sig-storage
+participating-sigs:
+  - sig-storage
+status: provisional
+creation-date: 2020-04-17
+reviewers:
+  - TBD
+approvers:
+  - TBD