generic ephemeral inline volumes KEP #1701
Conversation
Force-pushed: 9d4ff2e → 081d0c5
/assign @saad-ali
During a design review meeting (recording: https://www.youtube.com/watch?v=ANAlOZ5GhCs) we agreed to move ahead with investigating this because it looks promising. We identified the need to clarify where the PVC creation will be handled. Options:
I'll investigate that.
Meeting Notes:
I think we can rule out this option exactly for this reason: while "admission controllers may sometimes have side effects" (https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/), they also need a "corresponding reclamation or reconciliation process" because pod creation comes later. Not being able to rely on the ownership relation seems complicated.
Let's focus on that.
@jsafrane: I've finished updating the document. If I remember correctly, we volunteered you as reviewer because you said that you like this approach, right? 😛 Can you have a look?
Force-pushed: 98ab835 → 052ea44
I am still not sure about naming...
> A new volume source will be introduced:
>
> ```
> typedef InlineVolumeSource struct {
> ```
`typedef` → `type` :-)
> ```
> typedef InlineVolumeSource struct {
>     VolumeClaimTemplate PersistentVolumeClaim
> ```
IMO, we will need a separate type, `PersistentVolumeClaimTemplate`, but that can be solved during API review.
The current approach follows how volumes are created in the stateful set. But yes, let's discuss this during the API review.
> ```
> typedef InlineVolumeSource struct {
>     VolumeClaimTemplate PersistentVolumeClaim
>     ReadOnly bool
> ```
By default, newly provisioned volumes are empty and `ReadOnly` does not make sense. Is it for cloned volumes / snapshot restore? Please describe somewhere below how `ReadOnly` can be useful.
The driver itself or a mechanism like the image populator might create the volume with data, which then might be supposed to be left unmodified. Cloning is indeed another example.
But I'm making this up on-the-fly 😁 The original motivation was to have feature parity with `PersistentVolumeClaimVolumeSource`, without having considered all implications.
It's definitely useful to write this down, so let me add a use-case.
> Alternatively, a label on a namespace could be used to disable the feature just for that namespace.
Follow PodSecurityPolicy (PSP), where the admin can allow/deny volume types used in a pod. Just add a new enum value for `InlineVolumeSource` to `PSP.Spec.Volumes`.
https://kubernetes.io/docs/concepts/policy/pod-security-policy/#volumes-and-file-systems
Good idea, I hadn't thought of that.
> The `GenericInlineVolumes` feature gate controls whether:
> - the new controller is active in the `kube-controller-manager`,
> - new pods can be created with a `InlineVolumeSource`,
> - kubelet and scheduler accept such volumes.
Add the attach/detach controller, PV/PVC protection controller, resize controller, VolumePluginMgr and all volume plugins. Basically everything that touches `Pod.Spec.Volumes` must be updated.
Code that merely copies `Pod.Spec.Volumes` IMHO shouldn't need changes: if it gets an `InlineVolumeSource`, it can pass that through and let the components that really have to do something special with it check the feature gate.
Let's check during the implementation phase what changes are needed where. For now I rephrased that line item a bit to be more open-ended.
> downgrade), pods using generic inline volumes will no longer be scheduled or started.
If I remember the scheduler correctly, you can't prevent a pod with an unknown VolumeSource from scheduling. The pod gets stuck in kubelet, where the VolumeManager can't find a volume plugin for the unknown VolumeSource and reports errors.
I've removed the "be scheduled".
> Ideally, provisioning should use late binding (aka `WaitForFirstConsumer`). The initial implementation assumes that the user is taking care of that by selecting that provisioning mode in the storage class for the PVC.
>
> Later, `kube-scheduler` and `external-provisioner` can be changed to automatically enable late binding for PVCs which are owned by a pod.
How does late binding relate to this feature? While it's a good thing to use, especially when using topology, this KEP will work even when immediate binding is used for storage that does not have any topology.
I couldn't find a reason why late binding should not be used. The volume is clearly going to be created for use by this one pod, so using late binding seems like the right thing to do (don't create volumes unless the pod can start, better scheduling) in all cases.
By enabling late binding for inline volumes, we reduce the number of storage classes which have to be defined in the cluster. It might not be relevant, though, if all classes already have late binding enabled.
> ## Design Details
>
> ### PVC meta data
Please mention somewhere that the newly created PVC is subject to namespace quota like any other PVC and does not let the user cheat.
Probably belongs under risks and mitigations. I've added something there.
Related to the above, please note that the new controller reports any errors on PVC creation as events on the Pod. This way, the user can see, for example, quota errors.
Added under implementation details, together with name collision.
> A new volume source will be introduced:
>
> ```
> typedef InlineVolumeSource struct {
> ```
Naming is hard. How is this `InlineVolumeSource` different from other inline volume sources (say, `ISCSIVolumeSource`)? The name should imply that it is ephemeral (won't persist data beyond the Pod lifetime)... `EphemeralVolume`?
Including "ephemeral" in the name forever rules out the possibility of making the volumes persistent. I don't think that will be useful, but I'd prefer not to paint us into a corner here.
The more important aspect is that this inline volume uses the claim mechanism and normal provisioning. How about `ClaimVolumeSource`? It is more accurate, but doesn't roll off the tongue as easily.
`VolumeClaimVolumeSource` is another option, albeit with `Volume` in the name twice.
I've pushed an update.
> Yes, by disabling the feature gates. Existing pods will continue to run and volumes will be deleted once they stop. Existing pods with generic inline
> Existing pods will continue to run and volumes will be deleted once they stop.

If someone just restarts kubelet with the feature gate disabled, it won't be able to unmount inline volumes from running Pods. I haven't found documentation on how to actually flip a feature gate in a running cluster. Can you please add a note that the node must be drained before flipping the feature gate on kubelet, just to be on the safe side?
> Avoid getting hung up on specific details and instead aim to get the goals of the KEP clarified and merged quickly. The best way to do this is to just start with the high-level sections and fill out details incrementally in subsequent PRs.
Can you check the ones that are complete?
I wasn't sure which those are. The only safe one seems to be "Enhancement issue in release milestone, which links to KEP dir". I've marked that.
"Design details are appropriately documented" I think is okay, but perhaps someone else should confirm that?
"Test plan is in place" (yes), "giving consideration to SIG Architecture and SIG Testing input (no?)" - do they need to review?
"Graduation criteria is in place" - done.
> - [ ] "Implementation History" section is up-to-date for milestone
> - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
> - [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Can you check the ones already done?
> ```
> type EphemeralVolumeSource struct {
>     // Required. Defined as a pointer because in the future there
>     // might be alternative ways to specify how the ephemeral
>     // volume gets provisioned.
> ```
Specify what it means if VolumeClaimTemplate is nil.
Currently it must not be nil. I've reworded a bit.
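A minimal sketch of the "currently must not be nil" rule being discussed. The function name, field set and error text below are illustrative only, not the actual Kubernetes validation code:

```go
package main

import (
	"errors"
	"fmt"
)

type PersistentVolumeClaimTemplate struct{} // placeholder for the template type

// EphemeralVolumeSource as proposed: the template is a pointer so that
// alternative provisioning mechanisms can be added later, but for now
// this one field is required.
type EphemeralVolumeSource struct {
	VolumeClaimTemplate *PersistentVolumeClaimTemplate
	ReadOnly            bool
}

// validateEphemeral rejects a source whose template was left nil.
func validateEphemeral(src *EphemeralVolumeSource) error {
	if src.VolumeClaimTemplate == nil {
		return errors.New("ephemeral.volumeClaimTemplate: required field is not set")
	}
	return nil
}

func main() {
	fmt.Println(validateEphemeral(&EphemeralVolumeSource{})) // error: required field is not set
	fmt.Println(validateEphemeral(&EphemeralVolumeSource{
		VolumeClaimTemplate: &PersistentVolumeClaimTemplate{},
	})) // <nil>
}
```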
> ```
> type VolumeSource struct {
>     ...
>     CSI *CSIVolumeSource
>     PersistentVolumeClaim *PersistentVolumeClaimVolumeSource
> ```
Add comments above to describe what this is and to differentiate it from `Ephemeral` below.
I was assuming that readers are familiar with the existing API. Do I really need to explain it?
I've reworded the introduction a bit and added links.
> Recent releases of memcached added [support for using Persistent Memory](https://memcached.org/blog/persistent-memory/) (PMEM) instead of normal DRAM. When deploying memcached through one of the app controllers, `InlineVolumeSource` makes it possible to request a volume
s/InlineVolumeSource/EphemeralVolumeSource
Fixed, also below.
> For those, it might make sense to mount the volume read-only inside the pod to prevent accidental modification of the data.
I see that "mounting the volume read-only inside the pod to prevent accidental modification of the data" is a valid use case, but in the examples given above, such as restoring a volume from a snapshot, the restore is usually done because the original volume is corrupted. After the restore, however, the volume will continue to be used by the application, so it cannot be read-only.
I think a valid case would be to have a pod created with read-only volumes created from snapshots, and then copy that data to someplace else, such as a backup device.
Added " For example, the goal might be to just retrieve the data and/or copy it elsewhere."
> The `GenericInlineVolumes` feature gate controls whether:
> - the new controller is active in the `kube-controller-manager`,
> - new pods can be created with a `InlineVolumeSource`,
s/InlineVolumeSource/EphemeralVolumeSource
> The `GenericInlineVolumes` feature gate controls whether:
> - the new controller is active in the `kube-controller-manager`,
> - new pods can be created with a `InlineVolumeSource`,
> - anything that specifically acts upon an `InlineVolumeSource` (scheduler,
s/InlineVolumeSource/EphemeralVolumeSource
> ### Feature gate
>
> The `GenericInlineVolumes` feature gate controls whether:
Should this be called GenericEphemeralVolume now?
Okay. Do I now also need to rename the KEP to match?
Yes
I don't agree (see my other comment), but if that's what it takes to get the KEP merged... done.
> kubelet, etc.) instead of merely copying it (statefulset controller) accepts the new volume source.
>
> Existing pods with such a volumes will not be scheduled respectively
s/such a volumes/such volumes/ or "such a volume"
> ### Modifying volumes
>
> Once the PVC for an inline volume has been created, it can be updated
s/inline/ephemeral
> When downgrading to a cluster which does not support generic inline volumes (either by disabling the feature flag or an actual version downgrade), pods using generic inline volumes will no longer be
s/inline/ephemeral
This one too.
Updated. I consider the full name of the feature "generic ephemeral inline volumes". Both are relevant aspects of the volumes, so only mentioning one of them is arbitrary. "Generic" has to be added to distinguish from the existing "CSI ephemeral inline volumes".
In the API, "inline" is redundant because it is implied by the usage, so there it gets dropped.
> * **How can this feature be enabled / disabled in a live cluster?**
>   - [X] Feature gate
>     - Feature gate name: GenericInlineVolumes
GenericEphemeralVolumes
>   - kubelet
>
> * **Does enabling the feature change any default behavior?**
>   If users are allowed to create pods but not PVCs, then generic inline volumes
s/inline/ephemeral
This one is missed.
> [supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
> [existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
Remove sections that you have not filled out.
Done.
> ```yaml
> beta: ""
> stable: ""
> feature-gate:
>   name: GenericInlineVolumes
> ```
GenericEphemeralVolumes
I've updated the KEP and fixed kep.yaml. Please have another look.
> * **Can the feature be disabled once it has been enabled (i.e. can we rollback the enablement)?**
>   Yes, by disabling the feature gates. Existing pods with generic inline
This one is missed: s/inline/ephemeral
> A new controller in `kube-controller-manager` is responsible for creating new PVCs for each such inline volume. It does that:
Here's another one: s/inline/ephemeral
> ## Example
>
> Here is a full example for a higher-level object that uses ephemeral inline volumes:
"inline" can be removed?
> ### Upgrade / Downgrade Strategy
>
> When downgrading to a cluster which does not support generic inline
This one is missed.
> Yes, by disabling the feature gates. Existing pods with generic inline volumes that haven't started yet will not be able to start up anymore, because kubelet does not know what to do with the volume. It also will not know how to unmount generic inline volumes which would cause pods to get stuck, so nodes
s/inline/ephemeral
@xing-yang I've renamed the KEP, please check again.
> ```yaml
> title: generic inline volumes
> ```
ephemeral
Thanks, fixed.
> A new controller in `kube-controller-manager` is responsible for creating new PVCs for each such inline volume. It does that:
ephemeral
Fixed.
Let's hope this is the last one 😄 🤞
Can you squash your commits?
Force-pushed: 93f33e1 → 0362074
Done, force-pushed.
> ```yaml
>   hostPath:
>     path: /var/log
> - name: scratch
>   ephemeral:
> ```
In general, this all looks good. The only thing I'm not sure of is the naming here.
I would have loved to have the volume type ephemeral for the local inline ephemeral use case we have now. But I was told it had to hang off of the CSI driver.
Having some ephemeral volumes live under csi and some under volumeSource would be hard to explain to users, I think.
At some point, maybe pod API v2 whenever that happens, the csi driver should probably fold back out so it looks like it's more at the root of the volumeSource struct. As all drivers but configmap/secret/emptydir/hostpath will be moving to CSI, having csi in the API kind of stops making sense at some point.
All in all, I'm not against this. I think in the long run it's more usable. But I think in the intermediate time it's kind of confusing to end users. Perhaps we should design the API here such that it could be used for all kinds of ephemeral storage, and treat the CSI version as an alias and deprecate it? In my mind, the local ephemeral drivers are just an optimization (shortcut) of the usage of general ephemeral inline volumes. So to a user they should ideally look the same; they shouldn't have to care that they are different.
> But I think in the intermediate time it's kind of confusing to end users. Perhaps we should design the API here such that it could be used for all kinds of ephemeral storage, and treat the CSI version as an alias and deprecate it?
This is part of the long-term plan and the reason why `VolumeClaimTemplate` is a pointer: we can add a new, alternative pointer to a struct with the values for a "CSI ephemeral inline volume", allow that as an alternative, and then, with a suitable deprecation period, remove the `CSI` entry in `VolumeSource` in favor of this unified `Ephemeral` entry.
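The pointer-based union described here could be sketched roughly as follows. All names are hypothetical placeholders, and the `CSI` alternative inside `EphemeralVolumeSource` is explicitly not part of this KEP; it just stands in for a possible future unification:

```go
package main

import (
	"errors"
	"fmt"
)

type PersistentVolumeClaimTemplate struct{} // placeholder
type CSIVolumeSource struct{}               // hypothetical future alternative

// EphemeralVolumeSource as a union of pointers: exactly one of the
// alternatives must be set. Today only VolumeClaimTemplate exists.
type EphemeralVolumeSource struct {
	VolumeClaimTemplate *PersistentVolumeClaimTemplate
	CSI                 *CSIVolumeSource
}

// validateOneOf enforces the usual Kubernetes union rule: exactly one
// member of the union may be non-nil.
func validateOneOf(src EphemeralVolumeSource) error {
	n := 0
	if src.VolumeClaimTemplate != nil {
		n++
	}
	if src.CSI != nil {
		n++
	}
	if n != 1 {
		return errors.New("exactly one ephemeral volume mechanism must be specified")
	}
	return nil
}

func main() {
	fmt.Println(validateOneOf(EphemeralVolumeSource{
		VolumeClaimTemplate: &PersistentVolumeClaimTemplate{},
	})) // <nil>
	fmt.Println(validateOneOf(EphemeralVolumeSource{})) // error
}
```

Adding a new member to such a union is backward compatible at the API level, which is what makes the deprecation path sketched above possible.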
I can see how that might work with the raw API, but what about the YAML representation? How will it know what type of object to generate?
Ya, we might need a wrapper struct here, `PersistentVolumeClaimTemplate`, to disambiguate and be consistent with the existing `PodTemplate`.
The YAML example was broken, the API itself was fine. The correct example is:

```yaml
volumes:
- name: varlog
  hostPath:
    path: /var/log
- name: scratch
  ephemeral:
    volumeClaimTemplate:
      metadata:
        labels:
          type: fluentd-elasticsearch-volume
      spec:
```

I.e. the `volumeClaimTemplate` field name was missing. Fixed.
> Existing pods with such a volume will not be started when the feature gate is off.
>
> ### Modifying volumes
What about updating the volume template in a higher level controller like deployment or statefulset? Would that cause a new rollout of the deployment/statefulset?
I would expect that to cause an update, as it's just a change from the deployment/statefulset/daemonset/etc.'s perspective. I think that's also a good behavior.
Lifecycle should be tied to the pod. If higher level controller deletes pod and creates a new one, we should delete PVC and create a new one.
At some point, the controller which created the pods has to delete the existing ones and create them anew, because the `Volumes` field in the pod spec is immutable (I checked). The new pods then also start with new PVCs/PVs.
I have not checked whether a deployment can update volumes in the replicaset.
```yaml
owning-sig: sig-storage
participating-sigs:
- sig-storage
status: provisional
```
Is this ready to be updated to implementable?
Is this to go in as alpha next, or as beta?
I hope so. Let me make the change, otherwise it will miss 1.19.
Left some questions about details. But overall this looks good to me! Nice work @pohly
Given that we are close to merging, I amended the commit instead of creating a new one. The diff is:
```diff
diff --git a/keps/sig-storage/1698-generic-ephemeral-volumes/README.md b/keps/sig-storage/1698-generic-ephemeral-volumes/README.md
index dd606d1..4772f55 100644
--- a/keps/sig-storage/1698-generic-ephemeral-volumes/README.md
+++ b/keps/sig-storage/1698-generic-ephemeral-volumes/README.md
@@ -140,3 +140,3 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 This KEP proposes an alternative mechanism for specifying volumes
-inside a pod spec. Those inline volumes then get converted to normal
+inside a pod spec. Those ephemeral volumes then get converted to normal
 volume claim objects for provisioning, so storage drivers do not need
@@ -144,6 +144,10 @@ to be modified.
-The lifecycle of those volumes is ephemeral: they will get deleted
-once the pod terminates.
+The lifecycle of those volumes is ephemeral: they will get created
+when the pod is created (for immediate [volume
+binding](https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode))
+or when the pod starts (for `WaitForFirstConsumer` volume binding),
+and deleted once the pod terminates.
-Because this is expected to be slower than [CSI ephemeral inline
+Because volume creation and deletion is expected to be slower than
+with [CSI ephemeral inline
 volumes](https://github.com/kubernetes/enhancements/issues/596), both
@@ -181,2 +185,6 @@ by a more traditional storage system because:
 - Storage capacity tracking can be enabled also for such volumes.
+- Eventual API alignment with CSI ephemeral inline volumes, as well as
+  current ephemeral in-tree volumes (EmptyDir, etc.), to ensure
+  consistency across ephemeral volume APIs and minimize user
+  confusion.
@@ -200,3 +208,3 @@ type EphemeralVolumeSource struct {
 	// Required, must not be nil.
-	VolumeClaimTemplate *PersistentVolumeClaim
+	VolumeClaimTemplate *PersistentVolumeClaimTemplate
 	ReadOnly bool
@@ -207,5 +215,10 @@ This mimics a
 [`PersistentVolumeClaimVolumeSource`](https://pkg.go.dev/k8s.io/api/core/v1?tab=doc#PersistentVolumeClaimVolumeSource),
-except that it contains a `PersistentVolumeClaim` instead of
-referencing one by name. A full `PersistentVolumeClaim` is used for
-consistency with the StatefulSet API.
+except that it contains a `PersistentVolumeClaimTemplate` instead of
+referencing a `PersistentVolumeClaim` by name. The content of this
+`PersistentVolumeClaimTemplate` is identical to
+`PersistentVolumeClaim` and just uses a different name to clarify the
+intend, which is to serve as template for creating a volume claim
+object. Embedding a full object is similar to the StatefulSet API and
+allows storing labels and annotations. The name follows the example
+set by [`PodTemplate`](https://pkg.go.dev/k8s.io/api/core/v1?tab=doc#PodTemplate).
@@ -368,4 +381,4 @@ new controller will accidentally modify unrelated PVCs.
-The volume scheduling library and kubelet must check that a PVC is
-owned by the pod before using it. Otherwise they must ignored it. This
+The volume scheduling library and kubelet must verify that the PVC has
+an owner reference to the pod before using it. Otherwise they must ignored it. This
 ensures that a volume is not used accidentally for a pod in case of a
@@ -430,2 +443,5 @@ automatically enable late binding for PVCs which are owned by a pod.
 merged with `EphemeralVolumeSource`
+- Decide whether in-tree ephemeral volume sources, like EmptyDir (GA
+  already), should also be added EphemeralVolumeSource for sake of API
+  consistency
 - Tests are in Testgrid and linked in KEP
diff --git a/keps/sig-storage/1698-generic-ephemeral-volumes/kep.yaml b/keps/sig-storage/1698-generic-ephemeral-volumes/kep.yaml
index 2435ed3..bb1b683 100644
--- a/keps/sig-storage/1698-generic-ephemeral-volumes/kep.yaml
+++ b/keps/sig-storage/1698-generic-ephemeral-volumes/kep.yaml
@@ -7,3 +7,3 @@ participating-sigs:
 - sig-storage
-status: provisional
+status: implementable
 creation-date: 2020-04-17
```
0362074 to 7e9ce8b Compare
This KEP proposes an alternative mechanism for specifying volumes inside a pod spec. Those inline volumes then get converted to normal volume claim objects for provisioning, so storage drivers do not need to be modified. The lifecycle of those volumes is ephemeral: they will get deleted once the pod terminates. Because this is expected to be slower than CSI ephemeral inline volumes, both approaches will need to be supported.
7e9ce8b to a080b55 Compare
That last change hadn't made it into the squashed commit and was still sitting in my tree - sorry! I've squashed it and force-pushed, now the commit is a080b55.
Thanks @pohly /lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly, saad-ali. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
This describes an alternative, more general approach for specifying volumes inside a pod spec ("inline"), with a lifecycle of the volumes that is tied to the lifecycle of the pod ("ephemeral").
Enhancement issue: #1698