CSI: support for in-line volumes in pods. #2273

Closed
wants to merge 7 commits

Conversation

jsafrane
Member

The current CSI implementation supports only PersistentVolumes. This proposal adds support for in-line CSI volumes in pods.

/sig storage

/assign @davidz627 @saad-ali @vladimirvivien @liggitt

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 15, 2018
@k8s-ci-robot k8s-ci-robot added the sig/storage Categorizes an issue or PR as relevant to SIG Storage. label Jun 15, 2018
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 15, 2018
@k8s-ci-robot k8s-ci-robot requested review from childsb and saad-ali June 15, 2018 13:57
@k8s-ci-robot k8s-ci-robot added the kind/design Categorizes issue or PR as related to design. label Jun 15, 2018
// VolumeSource represents the source location of a volume to attach.
// Only CSIVolumeSource can be specified.
// +optional
VolumeSource *v1.VolumeSource
Member

why is this data denormalized instead of referencing the pod?

Member Author

Because the pod can already be deleted by the time the external A/D controller needs to detach a volume. The current CSI implementation expects that the PV is not deleted (and uses finalizers for that). I think it would be too much to add finalizers to pods.
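
For context, the resulting API shape would look roughly like this (a sketch based on the quoted hunk; the placement in `VolumeAttachmentSource` is my reading of the diff, not authoritative):

```go
// VolumeAttachmentSource represents a volume that should be attached.
// Exactly one member may be set (sketch mirroring the quoted hunk above;
// v1 refers to k8s.io/api/core/v1).
type VolumeAttachmentSource struct {
	// Name of the persistent volume to attach.
	// +optional
	PersistentVolumeName *string

	// VolumeSource represents the source location of a volume to attach.
	// Only CSIVolumeSource can be specified.
	// +optional
	VolumeSource *v1.VolumeSource
}
```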

Member

what indicates whether an inline volume requires an attach/detach phase? many potential uses of inline CSI volumes only require mount/unmount

Member

the object references in v1.VolumeSource (secret, endpoints, etc) are ambiguous when this is embedded in an object located outside the source pod's namespace.

Member Author

what indicates whether an inline volume requires an attach/detach phase? many potential uses of inline CSI volumes only require mount/unmount

We currently have no answer here; all CSI volumes (both in-line and PVs) must go through the external attacher and have a VolumeAttachment. Solving this is probably a matter for another proposal (and we need a smart idea for how to do it: the A/D controller currently does not know which drivers are deployed to the cluster and which of them need attach/detach).

the object references in v1.VolumeSource (secret, endpoints, etc) are ambiguous when this is embedded in an object located outside the source pod's namespace.

Touche!

  1. We could add a pointer to the pod to VolumeAttachment, but that complicates things when two pods share the same volume, the referenced one dies, and the user deletes its Secrets. Having two pods share the same volume with different Secrets looks quite bizarre, though.

  2. We could copy the secrets to VolumeAttachment.

  3. We could change CSI not to require per-volume secrets at Detach time. IMO, no in-tree volume plugin needs them at detach; all of them need secrets either at attach or at mount time.

Member Author

  1. We could change CSI not to require per-volume secrets at Detach time. IMO, no in-tree volume plugin needs them at detach; all of them need secrets either at attach or at mount time.

This may not be true when Cinder is updated to process secrets introduced in kubernetes/kubernetes#63826

@dims, do you expect that the Secrets introduced there will be used to detach volumes? If so, how do you make sure that the Secrets are available at detach time? Especially for an in-line volume in a pod, the user can delete the Secrets at any time.


* A/D controller **copies whole `VolumeSource`** from `Pod` into `VolumeAttachment`. This allows external CSI attacher to detach volumes for deleted pods without keeping any internal database of attached VolumeSources.
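
For illustration only, a `VolumeAttachment` carrying a copied in-line source could look roughly like this (the apiVersion, object names and field casing are assumptions on my side, not part of the proposal text):

```yaml
apiVersion: storage.k8s.io/v1beta1
kind: VolumeAttachment
metadata:
  name: csi-7a3f...              # deterministic hash; see the naming discussion below
spec:
  attacher: com.example.csi-driver
  nodeName: node-1
  source:
    # copied verbatim from pod.spec.volumes[x], instead of persistentVolumeName
    volumeSource:
      csi:
        driver: com.example.csi-driver
        volumeHandle: vol-0123456789
```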
Member

  • what will create VolumeAttachment objects corresponding to pods?
  • what will clean up VolumeAttachment objects after pods are deleted?
  • how will the VolumeAttachment objects be named? is it derived from the pod namespace and name?
  • what is the behavior if two pods both contain identical VolumeSources?

@jsafrane (Member Author), Jun 15, 2018

what will create VolumeAttachment objects corresponding to pods?
what will clean up VolumeAttachment objects after pods are deleted?

The A/D controller, in both cases. There is no difference there from attaching PVs.

how will the VolumeAttachment objects be named? is it derived from the pod namespace and name?

It will be the same as for PVs, i.e. CSI driver name + CSI volume handle + node name, all hashed to conform to naming rules.

what is the behavior if two pods both contain identical VolumeSources?

They both get the volume, assuming the volume is ReadWriteMany. It will work even if one of the pods uses a PV and the other has the same volume in-lined.
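
A minimal sketch of how such a deterministic name could be derived, assuming a sha256 hash and a `csi-` prefix (both are illustrative choices, not taken from the proposal):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// attachmentName hashes the CSI driver name, the CSI volume handle and the
// node name into a string that conforms to Kubernetes object naming rules.
func attachmentName(driver, volumeHandle, nodeName string) string {
	sum := sha256.Sum256([]byte(driver + volumeHandle + nodeName))
	return fmt.Sprintf("csi-%x", sum)
}

func main() {
	fmt.Println(attachmentName("com.example.csi-driver", "vol-0123", "node-1"))
}
```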

@liggitt (Member), Jun 15, 2018


It will be the same as for PVs, i.e. CSI driver name + CSI volume handle + node name, all hashed to conform to naming rules.

CSI volume handle in a PV is considered trustworthy, because creating a PV is a highly privileged operation. The entity creating the PV is assumed to create unique volume handles, right?

CSI volume handle in a pod is not particularly trustworthy. What happens if two pods in different namespaces (scheduled to the same node) both specify the same CSI driver and volume handle, but different VolumeAttributes, fstype, etc? which one wins and gets its content in the single VolumeAttachment object?

Member Author


The first one wins. It's the same as when two users create a non-CSI in-line volume, such as Ceph RBD or AWS EBS.

Which brings up another topic: we should extend PSP / SCC to allow/disallow individual CSI drivers in-lined in pods, similar to Flex:

https://kubernetes.io/docs/concepts/policy/pod-security-policy/#flexvolume-drivers

Member Author


Added PSP change.

* Using whole `VolumeSource` allows us to re-use `VolumeAttachment` for any other in-line volume in the future. We provide validation that this `VolumeSource` contains only `CSIVolumeSource` to clearly state that only CSI is supported now.
* TBD: `CSIVolumeSource` would be enough...
* External CSI attacher must be extended to process either `PersistentVolumeName` or `VolumeSource`.
* Since in-line volume in a pod can refer to a secret in the same namespace as the pod, **external attacher must get permissions to read any Secrets in any namespace**.
Member


That determination should be made by the deployer of the attacher. It may not be expected to service all namespaces, and may not require user-specified attach secrets.

Member Author


Added a note about this.

@jsafrane
Member Author

@liggitt, I have an idea about the secrets for in-line volumes.

  1. The CSI volume plugin adds a reference to the pod to VolumeAttachment (as a field; the exact API is TBD for now).

  2. The external attacher reads the VolumeSource to attach from the VolumeAttachment, reads the referenced Secrets (using the given namespace) and calls CSI to attach the volume.

  3. If the attach succeeded, the external attacher creates a new Secret in the namespace where the attacher runs. It uses VolumeAttachment.Name as Secret.Name, so it's unique and can be found easily.

    The secret must be copied by the external attacher and not by the A/D controller, because the A/D controller does not know in which namespace the attacher runs.

  4. When the pod is deleted, the external attacher deletes the secret after successful detach.

This way, users may delete any Secrets in their namespaces; the attacher keeps a copy. This needs to be carefully documented, and admins must make sure that the namespace with the attacher is safe enough. Would that work?
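
To make step 3 concrete, the copy kept by the attacher could look roughly like this (the namespace, name and data are made up for illustration only):

```yaml
apiVersion: v1
kind: Secret
metadata:
  # same name as the VolumeAttachment, so the attacher can find it at detach time
  name: csi-7a3f0c2b91d44e5aa1b2c3d4e5f60718
  # the namespace where the external attacher itself runs
  namespace: csi-attacher
type: Opaque
data:
  # copied verbatim from the user's Secret in the pod namespace
  password: c2VjcmV0
```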

@jsafrane
Member Author

Hmm, and the external attacher should also copy secrets from PVs, in case the same volume is used in-line in another pod.

Currently, CSI can be used only though PersistentVolume object. All other persistent volume sources support in-line volumes in Pods, CSI should be no exception. There are two main drivers:
* We want to move away from in-tree volume plugins to CSI, as designed in a separate proposal https://github.com/kubernetes/community/pull/2199/. In-line volumes should use CSI too.
* CSI drivers can be used to provide Secrets-like volumes to pods, e.g. providing secrets from a remote vault. We don't want to force users to create PVs for each secret, we should allow to use them in-line in pods as regular Secrets or Secrets-like Flex volumes.


You hint at it a bit, but might be good to explicitly add it as a motivation:

  • Reach feature parity with FlexVolume api and allow for its deprecation.

Member Author


Added a new motivation item.


// VolumeHandle is the unique volume name returned by the CSI volume
// plugin’s CreateVolume to refer to the volume on all subsequent calls.
// Required.


There needs to be a mechanism to make this optional. CSI allows a driver to not implement CreateVolume, in which case VolumeHandle could be generated by kubelet itself.

Member Author


I'll rephrase the comment. The VolumeHandle is a required field in the structure, but it is not required that this ID comes from CreateVolume; that's just a suggestion for how to get the ID.


Ah, I was confusing the two. Even so, the use case I see for one of the drivers looks like the following; there isn't really any benefit in requiring VolumeHandle, I think:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx:1.13-alpine
    ports:
    - containerPort: 80
    volumeMounts:
    - name: data
      mountPath: /usr/share/nginx/html
  volumes:
  - name: data
    csi:
      driver: drivers.k8s.io/image-driver
      image: kfox1111/misc:test
```

Member Author


There is no CSIVolumeSource.Image. If kfox1111/misc:test is a unique volume ID (and it looks like it uniquely identifies the source of the data to mount), it should be used as the VolumeHandle.

@kfox1111, Jun 20, 2018


Hmm... I haven't figured out the actual mapping to k8s. Reading through the PV->CSI API spec, I think it should actually be:

```yaml
volumes:
  - name: data
    csi:
      driver: drivers.k8s.io/image-driver
      volumeAttributes:
        image: kfox1111/misc:test
```

I have a prototype CSI driver working here: https://github.com/kfox1111/drivers/blob/volume-driver/pkg/image/nodeserver.go#L143

It uniquely identifies the container image to use for the volume. The volume identifier, though, should match the pod's lifecycle, so the easiest thing I can think of is to use the pod UID as the volumeHandle, or maybe just a random ID (but one that stays consistent for the pod's lifetime).

For things that follow the pod lifecycle, it doesn't really make sense for the user to specify the VolumeHandle. It is supposed to be unique, but you could have multiple instances of the same pod spec on the same host; if the user specified something, it could conflict between those pods.

Member Author


  1. I added VolumeHandlePrefix to make sure the CSI volume ID is (optionally) unique for each volume in every pod.
  2. There is a discussion in the CSI community about including a "Workload" specification in NodePublishRequest. Kubernetes would transport the Pod name / namespace / UID and whatnot, so the CSI driver can (optionally) tailor the volume for the specific pod.


Where does the VolumeHandlePrefix come from? Is it specified in the pod spec? Should it come from the driver registrar (https://github.com/kubernetes-csi/driver-registrar) sidecar instead? Then the packager of the CSI driver could tell k8s which type it expects, so the user won't have to specify it most of the time.

Member Author


VolumeHandlePrefix comes from the pod spec: pod.spec.volumes[x].csi.volumeHandlePrefix.


So, under my current understanding of the spec, the theoretical CSI emptyDir example would need to be:

```yaml
volumes:
  - name: foo
    csi:
      driver: drivers.k8s.io/emptyDir-driver
      volumeHandle: foo
      volumeHandlePrefix: pod
  - name: bar
    csi:
      driver: drivers.k8s.io/emptyDir-driver
      volumeHandle: bar
      volumeHandlePrefix: pod
```

That's quite verbose, and a miscopied entry could have bad results. It really feels like volumeHandlePrefix, or at least its default, should come from the driver somehow.

Alternatively, could VolumeHandlePrefix be made optional and default to Pod (which should be a safe default), and VolumeHandle be made optional and default to the volume's name (guaranteed to be unique within a pod)? That would restore the ease of use of any CSI driver where you want to map the volume lifecycle to the pod, and still allow specifying the exact volumes you want to mount when you need to.
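
With those defaults, the emptyDir-style example above would shrink to something like this (a sketch, assuming the defaults suggested here were adopted):

```yaml
volumes:
  - name: foo
    csi:
      driver: drivers.k8s.io/emptyDir-driver
  - name: bar
    csi:
      driver: drivers.k8s.io/emptyDir-driver
```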


Any thoughts?

Author: @jsafrane

## Goal
* Define API and high level design for in-line CSI volumes in Pod


After reading through the spec once, there may be two different use cases wrapped up into one here.

  1. Secret-like, per-pod volumes.
  2. More traditional volume drivers: volume create/delete/attach/etc.

While both need solving, the vast majority of difficult design/coding I think is around use case number 2.

Since use case number one may be much easier to implement, and would garner a bunch of drivers quickly, should we consider breaking the spec into two, to allow the first use case to be implemented faster than the second?

Member Author


Implementation may come in two phases, however we should have solid design for both cases.


+1, but I'm just hoping to see at least part 1 implemented in 1.12. If the design of part 2 would push implementation out to 1.13, then maybe we should separate them. If not, then let's keep it all together.


Looks like we may very well miss the 1.13 window even for part 1, because we're still debating how to solve part 2. Can we please set aside part 2 for now?

@msau42
Copy link
Member

msau42 commented Jun 19, 2018

Topology support in Kubernetes is only provided at the PV layer, so inline CSI volumes won't be able to support topology.

@jsafrane
Copy link
Member Author

So inline CSI volumes won't be able to support topology.

That's OK in my opinion; no in-line volume supports any volume topology. Anyone who creates a pod with a topology-constrained volume must set the right node selectors and affinities in the pod itself.

@jsafrane
Copy link
Member Author

jsafrane commented Jun 21, 2018

I extended the proposal to deal with the security concerns raised by @liggitt. As a result, CSIVolumeSource was extended with VolumeHandlePrefix, which generates unique volume IDs so users can't access each other's ephemeral secret-like CSI volumes.

It also answers questions raised by @kfox1111.


## Motivation
Currently, CSI can be used only though PersistentVolume object. All other persistent volume sources support in-line volumes in Pods, CSI should be no exception. There are two main drivers:
Currently, CSI can be used only though PersistentVolume object. All other persistent volume sources support in-line volumes in Pods, CSI should be no exception. There are three main drivers:


nit: s/though/through/

Member Author


fixed

* Secrets for in-line volumes must be in the same namespace as the pod that contains the volume. Users can delete them before the volume is detached. We deliberately choose to let the external attacher fail when such a Secret cannot be found at detach time and keep the volume attached, reporting errors about the missing Secrets to the user.
* Since access to in-line volumes can be configured by `PodSecurityPolicy` (see below), we expect that cluster admin gives access to CSI drivers that require secrets at detach time only to educated users that know they should not delete Secrets used in volumes.
* Number of CSI drivers that require Secrets on detach is probably very limited. No in-tree Kubernetes volume plugin requires them on detach.
* We will provide clear documentation that using in-line volumes with drivers that require credentials on detach may leave orphaned attached volumes that Kubernetes is not able to detach. It's up to the cluster admin to decide if using such CSI driver is worth it.


Curious: if it happens, what would the recovery steps be for a cluster admin to clear this condition? It cannot stay orphaned forever.


I think this is a good compromise.

Maybe we should specify finer-grained behaviour with respect to PodSecurityPolicy, though; maybe that belongs in a different spec. Being able to whitelist particular CSI drivers, such as image-driver, while making a driver such as nfs-driver PV-only could be handy. Looking at PodSecurityPolicy, 'volumes' is currently a string array, so it would only allow/disallow all of CSI. Perhaps the PSP volume list could ignore just csi and use the ".volume.csi.driver" info instead.
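
Something along these lines, where `allowedCSIDrivers` is a hypothetical field mirroring the existing `allowedFlexVolumes`; it does not exist in the PodSecurityPolicy API today and is shown only to illustrate the idea:

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: csi-image-driver-only
spec:
  seLinux:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  # coarse-grained: allow the csi volume type at all
  volumes:
    - csi
  # hypothetical finer-grained whitelist, analogous to allowedFlexVolumes
  allowedCSIDrivers:
    - driver: drivers.k8s.io/image-driver
```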


Never mind. I just noticed it was already in one of the more recent versions.

Member Author


@sbezverk, there is already a metric of failed detaches, and the admin can look at VolumeAttachments with failed detaches.

@jsafrane
Member Author

Added a PSP update to allow/deny attachable in-line CSI volumes, as users could otherwise abuse nodes by deleting Secrets needed for detach.

@jsafrane jsafrane force-pushed the csi-inline branch 2 times, most recently from 6b5b1f8 to e5dc5ee Compare June 22, 2018 09:33
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: saad-ali

If they are not already assigned, you can assign the PR to them by writing /assign @saad-ali in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jsafrane
Member Author

Updated with CSIDriver.Spec.VolumeHandleMode to configure generation of VolumeHandles for in-line volumes (see the sketch after the list):

* `AutomaticVolumeHandleGeneration`
* `NoVolumeHandleGeneration`
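
A sketch of how that could be expressed on the CSIDriver object (the API group/version shown is an assumption; only `volumeHandleMode` and its values come from this proposal):

```yaml
apiVersion: csi.storage.k8s.io/v1alpha1
kind: CSIDriver
metadata:
  name: drivers.k8s.io/image-driver
spec:
  # let kubelet generate volume handles for in-line volumes of this driver
  volumeHandleMode: AutomaticVolumeHandleGeneration
```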

When the driver is configured with `CSIDriverSpec.VolumeHandleMode = AutomaticVolumeHandleGeneration` and the volumeHandle is not specified, the Kubelet will automatically generate the volume handle to be sent to the CSI driver. The generated value will be a combination of podUID and pod namespace.


Hmm... I don't think this is unique enough.

It should be possible to launch two different image driver volume instances in a pod with different images. But with podUID and pod namespace, it would end up being the same value for both volumes with this scheme.

What about podUID and the volume name? I think podUID is unique even across namespaces, and the volume name is guaranteed to be unique within the pod spec.
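
A tiny sketch of that suggested scheme (the helper name and separator are illustrative assumptions):

```go
package main

import "fmt"

// generatedVolumeHandle combines the pod UID (unique across namespaces) with
// the volume's name (unique within the pod spec), as suggested above.
func generatedVolumeHandle(podUID, volumeName string) string {
	return fmt.Sprintf("%s/%s", podUID, volumeName)
}

func main() {
	fmt.Println(generatedVolumeHandle("8c2a7e3b-0f51-4c55-9c7e-2d1a3b4c5d6e", "data"))
}
```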

Member

Ok thanks for pointing this out.

Driver string

// VolumeHandle is the unique ID of the volume. It is the volume ID used in
// all CSI calls, optionally with a prefix based on VolumeHandlePrefix


VolumeHandlePrefix isn't in the spec anymore.

@AishSundar

/milestone v1.13

Adding this to the 1.13 milestone since it seems to be related to kubernetes/enhancements#603.

@thockin
Member

thockin commented Nov 9, 2018

@jsafrane @vladimirvivien @saad-ali @msau42

What is the status of this proposal and the associated PR? The last time I looked at the API PR, it had a significant compat problem. This spec has outstanding comments.

Is it going to progress? When will it be ready for re-review? If it needs my attention, I need to know. :)

@kfox1111

kfox1111 commented Nov 9, 2018

We have had some setbacks in the PR. It is really difficult to cover all the use cases in one PR, I think; it's getting to be a very big patch set. Can we implement the emptyDir-style CSI driver plugin support first, as it is much simpler to implement? It would allow for enough use cases to start merging common plumbing, like the pod security policy framework. We can start finding issues with the approach while it's still alpha, and work on the finer details of the other use cases. I think emptyDir-style pod in-line support is very close, so it could still maybe get into 1.13; we can work out the details for the rest of the volume use cases in 1.14 or 1.15. This would still cover significant new use cases, such as vault-style drivers, image-distribution-style drivers, a CSI replacement for the native git driver, etc. Currently, these are just not possible.

@vladimirvivien
Member

Hi @thockin

There is progress, but we've been focusing on the implementation, which is revealing some issues we're working through here: kubernetes/kubernetes#68232 (which will change some of the design). If we can't get these issues resolved by Monday, this may not go into 1.13.

@kfox1111

kfox1111 commented Nov 9, 2018

I think we have come up with a plan to get a solid enough base implemented this weekend for the 1.13 alpha functionality.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 7, 2019
@corentone

Will there be any path to integration with local-ephemeral-storage metrics? Or thoughts on whether that would make sense?
https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#local-ephemeral-storage

@msau42
Member

msau42 commented Feb 22, 2019

In the initial alpha implementation, it will not be integrated, but before beta we should have a design. Thanks for bringing that up!

@msau42
Member

msau42 commented Feb 22, 2019

(BTW, this PR has been replaced by a KEP. Latest update here.)

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 24, 2019
@vladimirvivien
Member

vladimirvivien commented Apr 12, 2019

This PR has been replaced by KEP kubernetes/enhancements#790. It will be going alpha in 1.15.

@davidz627
Contributor

@vladimirvivien @jsafrane if the KEP is a full replacement please close this PR

@vladimirvivien
Member

@jsafrane I can't close it.

@msau42
Member

msau42 commented Apr 22, 2019

/close

@k8s-ci-robot
Contributor

@msau42: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
