CSI: support for in-line volumes in pods. #2273
Conversation
// VolumeSource represents the source location of a volume to attach.
// Only CSIVolumeSource can be specified.
// +optional
VolumeSource *v1.VolumeSource
why is this data denormalized instead of referencing the pod?
Because the pod can be deleted when the external A/D controller needs to detach a volume. The current CSI implementation expects that the PV is not deleted (and uses finalizers for that). I think it would be too much to add finalizers to pods.
what indicates whether an inline volume requires an attach/detach phase? many potential uses of inline CSI volumes only require mount/unmount
the object references in v1.VolumeSource (secret, endpoints, etc) are ambiguous when this is embedded in an object located outside the source pod's namespace.
what indicates whether an inline volume requires an attach/detach phase? many potential uses of inline CSI volumes only require mount/unmount
We currently have no answer here; all CSI volumes (both in-line and PVs) must go through the external attacher and have a VolumeAttachment. Solving this is probably for another proposal (and we need a smart idea for how to do it: the A/D controller currently does not know which drivers are deployed to the cluster or which of them need attach/detach).
the object references in v1.VolumeSource (secret, endpoints, etc) are ambiguous when this is embedded in an object located outside the source pod's namespace.
Touche!
- We could add a pointer to the pod to VolumeAttachment, but that complicates things when two pods share the same volume and the one that's referenced dies and the user deletes its secrets. Having two pods share the same volume but use different Secrets looks quite bizarre, though.
- We could copy the secrets to VolumeAttachment.
- We could change CSI not to require per-volume secrets at detach time. IMO, no in-tree volume plugin needs them; all of them need secrets either at attach or at mount time.
- We could change CSI not to require per-volume secrets at detach time. IMO, no in-tree volume plugin needs them; all of them need secrets either at attach or at mount time.
This may not be true when Cinder is updated to process secrets introduced in kubernetes/kubernetes#63826
@dims, do you expect that the Secrets introduced there will be used to detach volumes? If so, how do you make sure that the Secrets are available at detach time? Especially for an in-line volume in a pod, the user can delete the Secrets at any time.
}
```

* A/D controller **copies the whole `VolumeSource`** from `Pod` into `VolumeAttachment`. This allows the external CSI attacher to detach volumes for deleted pods without keeping any internal database of attached VolumeSources.
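For context, here is a minimal sketch of the `VolumeAttachmentSource` this bullet implies; only the `VolumeSource` field quoted in the diff above is new, and the rest mirrors the existing `storage.k8s.io` API as I understand it (a sketch, not an authoritative definition):

```go
package storage

import (
	v1 "k8s.io/api/core/v1"
)

// VolumeAttachmentSource, as this proposal would extend it (sketch only).
type VolumeAttachmentSource struct {
	// Name of the persistent volume to attach.
	// +optional
	PersistentVolumeName *string

	// VolumeSource represents the source location of a volume to attach.
	// Only CSIVolumeSource can be specified.
	// +optional
	VolumeSource *v1.VolumeSource
}
```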
- what will create VolumeAttachment objects corresponding to pods?
- what will clean up VolumeAttachment objects after pods are deleted?
- how will the VolumeAttachment objects be named? is it derived from the pod namespace and name?
- what is the behavior if two pods both contain identical VolumeSources?
what will create VolumeAttachment objects corresponding to pods?
what will clean up VolumeAttachment objects after pods are deleted?
A/D controller. There is no difference there from attaching PVs.
how will the VolumeAttachment objects be named? is it derived from the pod namespace and name?
It will be the same as for PVs, i.e. CSI driver name + CSI volume handle + node name, all hashed to conform to naming rules (see the sketch after this exchange).
what is the behavior if two pods both contain identical VolumeSources?
They both get the volume, assuming the volume is ReadWriteMany. And it will work even if one of the pods uses a PV and the other has the same volume inlined.
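A sketch of that naming scheme, assuming a SHA-256 hash (the exact hash and the "csi-" prefix are assumptions; the comment above only says the three parts are hashed):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// getAttachmentName derives a deterministic VolumeAttachment object name from
// the CSI driver name, the CSI volume handle and the node name, hashed so the
// result conforms to Kubernetes object-naming rules.
func getAttachmentName(volumeHandle, csiDriverName, nodeName string) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s%s%s", volumeHandle, csiDriverName, nodeName)))
	return fmt.Sprintf("csi-%x", sum)
}

func main() {
	// A pod with an in-line volume and a pod using a PV that resolves to the
	// same driver/handle/node produce the same name, so they share one
	// VolumeAttachment object.
	fmt.Println(getAttachmentName("vol-0123", "example.csi.driver.io", "node-1"))
}
```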
It will be the same as PVs, i.e. CSI driver name + CSI volume handle + node name, all hashed to conform to naming rules.
CSI volume handle in a PV is considered trustworthy, because creating a PV is a highly privileged operation. The entity creating the PV is assumed to create unique volume handles, right?
CSI volume handle in a pod is not particularly trustworthy. What happens if two pods in different namespaces (scheduled to the same node) both specify the same CSI driver and volume handle, but different VolumeAttributes, fstype, etc? which one wins and gets its content in the single VolumeAttachment object?
The first one wins. It's the same as when two users create a non-CSI in-line volume, such as Ceph RBD or AWS EBS.
Which brings another topic: we should extend PSP / SCC to allow/disallow individual CSI drivers inlined in pods, similar to Flex:
https://kubernetes.io/docs/concepts/policy/pod-security-policy/#flexvolume-drivers
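A hedged sketch of what such a PSP extension could look like, modeled on the existing `allowedFlexVolumes` field; `AllowedCSIDrivers` and its type are assumed names for illustration, not an existing API:

```go
package policy

// PodSecurityPolicySpec with a hypothetical AllowedCSIDrivers field,
// analogous to AllowedFlexVolumes.
type PodSecurityPolicySpec struct {
	// ... existing PodSecurityPolicySpec fields elided ...

	// AllowedCSIDrivers lists the CSI drivers that pods may use in-line.
	// An empty list would disallow all in-line CSI volumes.
	// +optional
	AllowedCSIDrivers []AllowedCSIDriver
}

// AllowedCSIDriver names a single CSI driver that may be used in-line in pods.
type AllowedCSIDriver struct {
	// Name of the driver, e.g. "drivers.k8s.io/image-driver".
	Name string
}
```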
Added PSP change.
* Using the whole `VolumeSource` allows us to re-use `VolumeAttachment` for any other in-line volume in the future. We provide validation that this `VolumeSource` contains only `CSIVolumeSource` to clearly state that only CSI is supported now.
* TBD: `CSIVolumeSource` would be enough...
* External CSI attacher must be extended to process either `PersistentVolumeName` or `VolumeSource`.
* Since an in-line volume in a pod can refer to a secret in the same namespace as the pod, **the external attacher must get permissions to read any Secrets in any namespace**.
That determination should be made by the deployer of the attacher. It may not be expected to service all namespaces, and may not require user-specified attach secrets.
Added a note about this.
@liggitt, I got one idea about the secrets for in-line volumes.
This way, users may delete any Secrets in their namespaces; the attacher keeps a copy. This needs to be carefully documented, and admins must make sure that the namespace with the attacher is safe enough. Would that be a way?
Hmm, and the external attacher should also copy secrets from PVs, in case the same volume is used in-line in another pod.
Currently, CSI can be used only though PersistentVolume object. All other persistent volume sources support in-line volumes in Pods, CSI should be no exception. There are two main drivers:
* We want to move away from in-tree volume plugins to CSI, as designed in a separate proposal https://github.com/kubernetes/community/pull/2199/. In-line volumes should use CSI too.
* CSI drivers can be used to provide Secrets-like volumes to pods, e.g. providing secrets from a remote vault. We don't want to force users to create PVs for each secret, we should allow to use them in-line in pods as regular Secrets or Secrets-like Flex volumes.
You hint at it a bit, but might be good to explicitly add it as a motivation:
- Reach feature parity with the FlexVolume API and allow for its deprecation.
Added a new motivation item.
// VolumeHandle is the unique volume name returned by the CSI volume
// plugin’s CreateVolume to refer to the volume on all subsequent calls.
// Required.
There needs to be a mechanism to make this optional. CSI allows a driver to not implement CreateVolume, in which case VolumeHandle could be generated by kubelet itself.
I'll rephrase the comment. The VolumeHandle is a required field in the structure, but it is not required that this ID comes from CreateVolume; that's just a suggestion for how to get the ID.
Ah, I was confusing the two. Even still, the use case I see for one of the drivers looks like the following. There isn't really any benefit in requiring VolumeHandle, I think:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx:1.13-alpine
      ports:
        - containerPort: 80
      volumeMounts:
        - name: data
          mountPath: /usr/share/nginx/html
  volumes:
    - name: data
      csi:
        driver: drivers.k8s.io/image-driver
        image: kfox1111/misc:test
There is no `CSIVolumeSource.Image`. If `kfox1111/misc:test` is a unique volume ID (and it looks like it uniquely identifies the source of the data to mount), it should be used as `VolumeHandle`.
Hmm... I haven't figured out the actual mapping to k8s. Reading through the pv->csi api spec, I think it actually should be:
volumes:
  - name: data
    csi:
      driver: drivers.k8s.io/image-driver
      volumeAttributes:
        image: kfox1111/misc:test
I have a prototype csi driver working here: https://github.com/kfox1111/drivers/blob/volume-driver/pkg/image/nodeserver.go#L143
It uniquely identifies the container image to use for the volume. The volume identifier, though, should match the pod's lifecycle, so the easiest thing I can think of is to use the pod ID as the volumeHandle, or maybe just a random ID (but consistent for the pod's lifecycle).
For things that follow pod lifecycle, VolumeHandle doesn't really make sense for the user to specify. It is supposed to be unique, but you could have multiple instances of the same pod on the same host. If the user specified something, then it could conflict between the pods on the same host.
- I added VolumeHandlePrefix to make sure the CSI volume ID is (optionally) unique for each volume in every pod.
- There is a discussion in the CSI community to include a "Workload" specification in NodePublishRequest. Kubernetes will transport the Pod name / namespace / UID and whatnot, so the CSI driver can (optionally) tailor the volume for the specific pod.
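A rough sketch of how such pod information could be handed to a driver at NodePublish time; the function and key names below are assumptions for illustration, not a settled CSI or Kubernetes API:

```go
package csiapi

import v1 "k8s.io/api/core/v1"

// podPublishContext collects the pod identity fields that Kubernetes could
// pass along with NodePublishVolume, per the "Workload" idea above.
func podPublishContext(pod *v1.Pod) map[string]string {
	return map[string]string{
		"pod.name":      pod.Name,
		"pod.namespace": pod.Namespace,
		"pod.uid":       string(pod.UID),
	}
}
```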
Where does the VolumeHandlePrefix come from? Is it specified in the pod spec? Should it come from the driver registrar (https://github.com/kubernetes-csi/driver-registrar) sidecar instead? Then the packager of the CSI driver could tell k8s which type it expects, so the user won't have to specify it most of the time.
VolumeHandlePrefix comes in the pod spec: `pod.spec.volumes[x].csi.volumeHandlePrefix`.
So under my current understanding of the spec, the theoretical CSI emptyDir example would need to be:
volumes:
  - name: foo
    csi:
      driver: drivers.k8s.io/emptyDir-driver
      volumeHandle: foo
      volumeHandlePrefix: pod
  - name: bar
    csi:
      driver: drivers.k8s.io/emptyDir-driver
      volumeHandle: bar
      volumeHandlePrefix: pod
That's quite verbose, and a mis-copied entry could have bad results. It really feels like volumeHandlePrefix, or at least its default, should come from the driver somehow.
Alternatively, could VolumeHandlePrefix be made optional and default to Pod (that should be a safe default), and VolumeHandle be optional and default to the volume's name (guaranteed to be unique within a pod)? That would restore the ease of use of any CSI driver where you want to map the volume lifecycle to the pod, while still allowing you to specify the exact volumes you want to mount when you need to.
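Here is a minimal sketch of that defaulting under the assumptions above; all identifiers (the constant, the reduced `CSIVolumeSource`, and the function) are hypothetical and only illustrate the suggestion:

```go
package csiapi

// PodVolumeHandlePrefix is a hypothetical constant naming the "pod" prefix mode.
const PodVolumeHandlePrefix = "pod"

// CSIVolumeSource is a reduced, hypothetical view of the proposed pod API,
// just enough to illustrate the defaulting.
type CSIVolumeSource struct {
	Driver             string
	VolumeHandle       *string
	VolumeHandlePrefix *string
}

// defaultCSIVolumeSource applies the suggested defaults: the prefix defaults
// to the pod scope and the handle defaults to the pod-unique volume name, so
// the verbose example above could shrink to just the driver name.
func defaultCSIVolumeSource(volumeName string, src *CSIVolumeSource) {
	if src.VolumeHandlePrefix == nil {
		prefix := PodVolumeHandlePrefix
		src.VolumeHandlePrefix = &prefix
	}
	if src.VolumeHandle == nil {
		handle := volumeName
		src.VolumeHandle = &handle
	}
}
```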
Any thoughts?
Author: @jsafrane

## Goal
* Define API and high level design for in-line CSI volumes in Pod
After reading through the spec once, there may be two different use cases wrapped up into one here.
- Secret-like volumes: per-pod volumes.
- More traditional volume drivers: volume create/delete/attach/etc.
While both need solving, the vast majority of the difficult design/coding, I think, is around use case number 2.
Since use case number one may be much easier to implement and would garner a bunch of drivers quickly, should we consider breaking the spec into two, to allow the first use case to be implemented faster than the second?
Implementation may come in two phases; however, we should have a solid design for both cases.
+1, but I'm just hoping to see at least part 1 implemented in 1.12. If the design of part 2 would push implementation out to 1.13, then maybe we should separate them. If not, then let's keep it all together.
Looks like we very well may miss the 1.13 window even for part 1, because we're still debating how to solve part 2. Can we please set aside part 2 for now?
Topology support in Kubernetes is only provided at the PV layer, so in-line CSI volumes won't be able to support topology.
That's IMO OK; no in-line volume supports any volume topology. Anyone who creates a pod with a topology-constrained volume must set the right node selectors and affinities in the pod itself.
## Motivation
Currently, CSI can be used only though PersistentVolume object. All other persistent volume sources support in-line volumes in Pods, CSI should be no exception. There are two main drivers:
Currently, CSI can be used only though PersistentVolume object. All other persistent volume sources support in-line volumes in Pods, CSI should be no exception. There are three main drivers:
nit: s/though/through/
fixed
* Secrets for in-line volumes must be in the same namespace as the pod that contains the volume. Users can delete them before the volume is detached. We deliberately choose to let the external attacher fail when such a Secret cannot be found at detach time and keep the volume attached, reporting errors about missing Secrets to the user.
* Since access to in-line volumes can be configured by `PodSecurityPolicy` (see below), we expect that the cluster admin gives access to CSI drivers that require secrets at detach time only to educated users who know they should not delete Secrets used in volumes.
* The number of CSI drivers that require Secrets on detach is probably very limited. No in-tree Kubernetes volume plugin requires them on detach.
* We will provide clear documentation that using in-line volumes with drivers that require credentials on detach may leave orphaned attached volumes that Kubernetes is not able to detach. It's up to the cluster admin to decide whether using such a CSI driver is worth it.
Curious: if it happens, what would be the recovery steps for a cluster admin to clear this condition? It cannot stay orphaned forever.
I think this is a good compromise.
Maybe we should specify finer-grained behaviour with respect to pod security policy, though. Maybe that belongs in a different spec. Being able to whitelist particular CSI drivers, such as image-driver, while making a driver such as nfs-driver PV-only, could be handy. Looking at the pod security policy, 'volumes' is currently a string array, so it would only allow/disallow all of CSI. Perhaps the PSP volume list could ignore just csi and use the ".volume.csi.driver" info instead.
Never mind. I just noticed it was already in one of the more recent versions.
@sbezverk, there already is a metric of failed detaches, and an admin can look at `VolumeAttachments` with failed detach.
Added PSP update to allow/deny attachable in-line CSI volumes, as they could abuse nodes by deleting Secrets needed for detach.
Force-pushed 6b5b1f8 to e5dc5ee: Updates with new discussed changes
[APPROVALNOTIFIER] This PR is NOT APPROVED.
Updated with
* `AutomaticVolumeHandleGeneration`
* `NoVolumeHandleGeneration`

When the driver is configured with `CSIDriverSpec.VolumeHandleMode = AutomaticVolumeHandleGeneration` and the volumeHandle is not specified, the Kubelet will automatically generate the volume handle to be sent to the CSI driver. The generated value will be a combination of podUID and pod namespace.
Hmm... I don't think this is unique enough.
It should be possible to launch two different image driver volume instances in a pod with different images. But with podUID and pod namespace, it would end up being the same value for both volumes with this scheme.
What about podUID and Volume Name? I think podUID is unique even over different namespaces, and volume name is guaranteed unique in the pod spec.
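A sketch of the generation scheme suggested here (pod UID plus volume name, rather than pod UID plus namespace); the separator and exact format are assumptions:

```go
package csiapi

import "fmt"

// generateVolumeHandle builds an automatically generated CSI volume handle
// from the pod UID and the volume name within the pod spec. Pod UIDs are
// cluster-unique and volume names are unique within a pod, so the pair stays
// unique even when two instances of the same pod land on one node.
func generateVolumeHandle(podUID, volumeName string) string {
	return fmt.Sprintf("%s-%s", podUID, volumeName)
}
```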
Ok thanks for pointing this out.
Driver string

// VolumeHandle is the unique ID of the volume. It is the volume ID used in
// all CSI calls, optionally with a prefix based on VolumeHandlePrefix
VolumeHandlePrefix isn't in the spec anymore.
/milestone v1.13
Adding this to the 1.13 milestone since it seems to be related to kubernetes/enhancements#603
@jsafrane @vladimirvivien @saad-ali @msau42 What is the status of this proposal and the associated PR? The last I looked at the API PR it had a significant compat problem. This spec has outstanding comments. Is it going to progress? When will it be ready for re-review? If it needs my attention, I need to know. :)
We have had some setbacks in the PR. It is really difficult to cover all the use cases in one PR, I think; it's getting to be a very big patch set. Can we implement the emptyDir-style CSI driver plugin stuff first, as it is much simpler to implement? It would allow for enough use cases to start merging common plumbing such as the pod security policy framework. We can start finding the issues with the approach while it's still alpha, and we can work on the finer details of the other use cases. I think emptyDir-style pod in-line support is very close, so it could still maybe get into 1.13; we can work out the details for the rest of the volume use cases in 1.14 or 1.15. This would still allow significant new use cases to be covered, such as vault-style drivers, image-distribution-style drivers, a CSI replacement for the native git driver, etc. Currently, these are just not possible.
Hi @thockin, there is progress. But we've been focusing on implementation, which is revealing some issues we're working through here: kubernetes/kubernetes#68232 (which will change some of the design). If we can't get these issues resolved by Monday, this may not go in 1.13.
I think we have come up with a plan to get a solid enough base implemented this weekend for the 1.13 alpha functionality.
Will there be any path to integration with local-ephemeral-storage metrics? Or thoughts on whether that'd make sense?
In the initial alpha implementation, it will not be integrated, but before beta we should have a design. Thanks for bringing that up!
(BTW, this PR has been replaced by a KEP. Latest update here.)
This PR has been replaced by KEP kubernetes/enhancements#790. It will be going alpha in 1.15.
@vladimirvivien @jsafrane if the KEP is a full replacement, please close this PR.
@jsafrane I can't close it.
/close
@msau42: Closed this PR.
Current CSI implementation supports only PersistentVolumes. This proposal adds support for in-line CSI volumes in pods.
/sig storage
/assign @davidz627 @saad-ali @vladimirvivien @liggitt