Investigate CSI driver for local volume #824
/area local-volume
Yeah, we need fast fail for volumes that do not belong to a node.
Maybe a new volume plugin interface for this, just like the way we did for …
A downside is that we cannot support multiple backends (SSD, HDD, ...) in the same provisioner.
For 1), we may be able to use the GetNodeInfo call + a flag to csi-external-provisioner to indicate that we're running the provisioner per topology instance, so CreateVolume should be filtered.

For 5), here's the CSI call. From that we need to figure out how to represent capacity in K8s for non-node objects. We may need a new object for this because the topology keys returned by CSI can be anything.
Why is this? The provisioner is selected based on provisioner name, not StorageClass, so a single plugin should be able to handle both SSDs and HDDs.
Mainly due to capacity reporting: the capacity of SSD/HDD should be reported separately, while currently the CSI GetCapacity interface can only return a single value. Maybe we can consider wrapping the storage kinds as …
Ugh... That limitation might have been an oversight. Let's see if we can add it to CSI.
I like the idea; the only concern is that people may get confused about the usage of topology in claims (SCs). The two are the same only for local volumes, so I guess it's not so common.
Here is the CSI issue to add StorageClass info to GetCapacity: container-storage-interface/spec#55 (comment)

For 1), it may be sufficient for the first phase to run a special external-provisioner only for local volumes. If more plugins need such a feature, we can consider a generic method later.
StorageClass parameters are available in CSI GetCapacity arguments already: https://github.com/container-storage-interface/spec/blob/master/spec.md#getcapacity
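As an illustration of how a plugin could use those parameters for the SSD/HDD case discussed above, here is a minimal Go sketch against the CSI Go bindings. The `kind` parameter name and the `freeBytes` bookkeeping are assumptions for illustration, not any existing driver's behavior:

```go
// Rough sketch only: a ControllerServer fragment that reports capacity per
// storage kind, keyed off an assumed StorageClass parameter named "kind".
package main

import (
	"context"
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

type controller struct {
	// free bytes per backend, e.g. refreshed from `vgs` output elsewhere.
	freeBytes map[string]int64
}

func (c *controller) GetCapacity(ctx context.Context, req *csi.GetCapacityRequest) (*csi.GetCapacityResponse, error) {
	kind := req.GetParameters()["kind"] // carried over from the StorageClass parameters
	if kind == "" {
		kind = "hdd" // assumed default backend
	}
	return &csi.GetCapacityResponse{AvailableCapacity: c.freeBytes[kind]}, nil
}

func main() {
	c := &controller{freeBytes: map[string]int64{"ssd": 100 << 30, "hdd": 500 << 30}}
	resp, _ := c.GetCapacity(context.Background(), &csi.GetCapacityRequest{
		Parameters: map[string]string{"kind": "ssd"},
	})
	fmt.Println(resp.GetAvailableCapacity()) // capacity of the "ssd" backend only
}
```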
@jieyu has started a CSI-compatible LVM local storage driver here: https://github.com/mesosphere/csilvm. He's open to adding a Kubernetes directory for k8s-specific deployment configs to that. In the future, we can consider moving that to the CSI org.
I looked into the LVM driver @saad-ali mentioned above; overall it's a great job. …

To support file path as input, I can think of two ways: …

Personally I'm more inclined to the first approach. WDYT?
@jieyu, would you be open to us making changes to the lvm driver to address the 3 requirements listed above?
We're open to those suggestions. For 2), that's what's specified in the CSI spec.
That's what I cared about most. If the field is designed for the free capacity left, then it's not in line with our expectations. @msau42, maybe we can push the capacity response definition forward to include both total capacity and free capacity?

For 3), in k8s we have the case that multiple pods need the same volume. Thus, we need to mount the source path to a global path in …

For 1), in our production environment we have already met scenarios that need multiple storage backends (SSD, HDD, ...) on the same node.
Definitely open to suggestions, yes.
By "driver" do you mean "single instance of a plugin"? So multiple instances of the We support hdds or ssds by listing the individual devices that comprise the volume group as command-line options when launching the plugin instance. For two volume groups, We've found having a single VG per plugin instance to be a wonderful simplification to the code. It also makes error reporting in the CO simple as the CO can trivially tie any errors it receives to a specific volume group.
The pods use case seems very reasonable and I'd be very happy to have your help with supporting it.
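To make the one-instance-per-VG model above concrete, here is a hypothetical launcher sketch in Go. The flag names and the `vgcreate`-at-startup behavior are illustrative assumptions, not csilvm's actual CLI:

```go
// Hypothetical sketch, not csilvm's actual CLI: one plugin instance per volume
// group, with the member devices passed as command-line options.
package main

import (
	"flag"
	"log"
	"os/exec"
	"strings"
)

func main() {
	vg := flag.String("volume-group", "vg_ssd", "volume group managed by this instance (illustrative flag)")
	devices := flag.String("devices", "/dev/sdb,/dev/sdc", "comma-separated devices forming the group (illustrative flag)")
	flag.Parse()

	// Create the VG from the listed devices; a real plugin would first check
	// whether the group already exists (e.g. via `vgs`).
	args := append([]string{*vg}, strings.Split(*devices, ",")...)
	if out, err := exec.Command("vgcreate", args...).CombinedOutput(); err != nil {
		log.Fatalf("vgcreate failed: %v: %s", err, out)
	}
	// ...start the gRPC server for this single-VG instance here...
}
```

For two volume groups you would simply launch two such processes with different flags.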
Hmm, do you have an overall implementation besides the driver, i.e. provisioner, attacher, etc.?

For the k8s implementation, I think if we have each driver handle a group, we'll have to deploy a provisioner and attacher for each of them accordingly. And for failure recovery, the drivers will probably be deployed as pods (containers), rather than simply installed and launched. So I think we'll have multiple instances.

Personally, I'd prefer the concept of "driver" to represent a higher level of distinction: whether it's Cinder, Ceph or LVM, but not the groups behind LVM.
In Mesos each instance of a CSI plugin corresponds to a Storage Local Resource Provider (SLRP) which takes care of managing that plugin instance's lifecycle and communicates with the plugin according to the CSI spec. I believe the SLRP corresponds to both provisioner/attacher, but I'm not very familiar with k8s.
That sounds likely, given that today we launch a SLRP per volume group (per CSI plugin instance, technically.)
Makes sense.
Agreed. It sounds like we're on the same page regarding terminology. @jieyu, given your greater familiarity with k8s and Mesos, what do you think?
@gpaul @lichuqiang I got a bit lost here. What are the concrete proposed changes regarding the csilvm repo?
The 3 points above; the discussion mainly focuses on the 1st point, i.e. whether to support multiple volume groups in a driver.
re: (2.), if I understand correctly you agree that the plugin implements (2.) correctly? (From #824 (comment))

re: (3.), that sounds very cool. I'd be happy for any help on this.

re: (1.), I'm not yet convinced that this is necessary, and it adds significant complexity to the plugin. I'm currently opposed to making the default …

That said, if we can keep the logic sufficiently well-separated while maintaining the very simple single-VG-per-process mode, I think it will be OK. In that event, I'd like to generate two binaries: a simple, easy-to-debug-and-reason-about one that manages a single volume group, the way we do currently, and a new one that adds the multiplexing layer on top of the core functionality exposed by the other. This does not mean the second will need to literally …
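To sketch what that multiplexing layer might look like (purely illustrative: the `volumeGroup` parameter name and the `singleVG` interface are assumptions, not the actual csilvm code):

```go
// Illustrative only: a thin layer that owns several single-VG backends and
// routes each CreateVolume by an assumed "volumeGroup" parameter.
package lvmmux

import (
	"context"
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// singleVG stands in for the existing one-VG-per-process core.
type singleVG interface {
	CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error)
}

// mux dispatches requests to the backend owning the requested volume group.
type mux struct {
	backends map[string]singleVG // keyed by volume group name
}

func (m *mux) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	vg := req.GetParameters()["volumeGroup"] // illustrative parameter name
	backend, ok := m.backends[vg]
	if !ok {
		return nil, fmt.Errorf("unknown volume group %q", vg)
	}
	return backend.CreateVolume(ctx, req)
}
```

The single-VG binary would then just be the core without this wrapper, which matches the two-binary (or flag-gated) idea above.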
Taking it as a compatible CSI driver, I can understand your concern. In a cluster with multiple types of storage (CSI and other storage not implemented via CSI), the provisioner needs to be able to tell whether it's responsible for an incoming volume creation request. And if we have each driver handle a single group, that means the provisioner needs to be aware of the driver info of the request. Unfortunately, the provisioner currently makes use of a shared controller lib; it filters requests according to plugin names (like CSI, GCE, AWS, ...) and won't be aware of CSI-specific things.

Also, I really don't like the idea of having each provisioner take care of a single driver. In k8s, the provisioner needs to watch for changes in data, so each provisioner means an extra connection to the apiserver, and this might affect performance when the cluster is large.
Those are good points, thanks for explaining. In that case I'm leaning towards adding this according to the preliminary design I outlined in my comment: …

Is this work something you would consider contributing? I would love to have good k8s support in the CSILVM plugin.
Sounds good to me. So I think we don't actually need two separate binaries; just let it decide whether to load the multiplexing layer according to flags at startup. Is that what you mean?
One more detail to iron out. IIUC, the csilvm driver accepts a list of devices and creates the VG from them. Can it handle situations like: …
Also regarding 2), I think we need to have two capacities reported in CSI: currently available, and total (including used). In general, Kubernetes does not use currently available capacity because it adds significant scheduling latency to have to wait for the available capacity to be updated.
xref some relevant discussion regarding total vs. available capacity from …
The problem I see with only reporting available capacity is that you will need to call GetCapacity before every CreateVolume request, and you won't have accurate available capacity reported if you have multiple CreateVolume requests outstanding in parallel. So I think we need to add back a total_capacity field so that COs have the option to be smarter with capacity tracking.
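As a toy illustration of the kind of capacity tracking a CO could do if a total were available (note that total_capacity is the proposed addition above, not a field in the current spec; the tracker below is an assumption, not Kubernetes code):

```go
// Sketch: a CO-side tracker that subtracts its own in-flight reservations from
// a once-reported total, instead of re-polling GetCapacity before every
// CreateVolume.
package main

import (
	"fmt"
	"sync"
)

type capacityTracker struct {
	mu       sync.Mutex
	total    int64 // reported once by the plugin (the proposed total_capacity)
	reserved int64 // sum of volumes the CO has created or has in flight
}

// reserve returns false when the requested size no longer fits.
func (t *capacityTracker) reserve(size int64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.reserved+size > t.total {
		return false
	}
	t.reserved += size
	return true
}

func main() {
	t := &capacityTracker{total: 100 << 30}
	fmt.Println(t.reserve(60 << 30)) // true
	fmt.Println(t.reserve(60 << 30)) // false: would exceed total
}
```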
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@lichuqiang are you still interested in this?
/remove-lifecycle stale

We are still interested in supporting this in Kubernetes. cc @cofyc who is doing some prototyping
/assign
@cofyc I am also interested in doing a prototype deployment on k8s. Do you have anything which could be used to deploy csilvm on a non-production cluster to play around with yet?
@cofyc @jdef @lichuqiang I have been doing some experimenting with integrating csilvm with Kubernetes. I have found what may be a fairly significant issue. The behavior I am seeing is that the actual Linux LV is created on a random node in the K8s cluster which is running csilvm, not on the actual Node where the Pod gets scheduled. The good news is that if I manually steer the Pod to the node with the storage, it works, so some basics seem to be ok.

I think what is going on is that CreateVolume is a Controller gRPC call, not a Node call. For something like an EBS volume, this is fine, because any node can create an EBS volume and then a different node can mount it later. For Linux LVs, this is obviously a no-go. I think if the Linux LV creation is moved from CreateVolume/DeleteVolume to NodeStageVolume/NodeUnstageVolume then this issue will be resolved. Does this make sense to you?

One concern I have, which I honestly don't see a way around, is that the GetCapacity call is at the Controller level and there doesn't seem to be any corresponding call at the Node level. Given this setup, I'm concerned that what might happen in Kubernetes is that a Pod could get scheduled to a Node and then Kubernetes will be unable to provision the local volume it wants, because it made the scheduling decision without regard to the amount of local storage left on that node.
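To make the proposed move concrete, here is a rough Go fragment of what LV creation inside NodeStageVolume might look like. The volume group name, the `size` volume-context key, and the shell-outs to `lvcreate`/`mkfs`/`mount` are simplifying assumptions, not current csilvm behavior:

```go
// Sketch of the idea only: create the LV lazily in NodeStageVolume so it
// always lands on the node where the pod was scheduled.
package lvmnode

import (
	"context"
	"fmt"
	"os/exec"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

type nodeServer struct {
	vg string // volume group managed by this node-local instance
}

func (n *nodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error) {
	lv := req.GetVolumeId()
	dev := fmt.Sprintf("/dev/%s/%s", n.vg, lv)

	// Create the LV on *this* node; size assumed to arrive via volume context.
	size := req.GetVolumeContext()["size"] // illustrative key, e.g. "10G"
	if err := exec.Command("lvcreate", "-n", lv, "-L", size, n.vg).Run(); err != nil {
		return nil, fmt.Errorf("lvcreate: %w", err)
	}
	// Format and mount at the staging (global) path; a real implementation
	// would skip mkfs when a filesystem already exists and make every step idempotent.
	if err := exec.Command("mkfs.ext4", dev).Run(); err != nil {
		return nil, fmt.Errorf("mkfs: %w", err)
	}
	if err := exec.Command("mount", dev, req.GetStagingTargetPath()).Run(); err != nil {
		return nil, fmt.Errorf("mount: %w", err)
	}
	return &csi.NodeStageVolumeResponse{}, nil
}
```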
For supporting local volumes, I think the controller services will have to be running in the daemonset too. And the csi-provisioner probably can't be used as is; it would have to be modified to only trigger provisioning if the selected node matches the node it's running on.

But I think the main blocking issue, as you pointed out, is supporting the GetCapacity call. We don't have a way to represent capacity for general topology in Kubernetes, so that has to be designed. In addition, the current semantics of the CSI GetCapacity call are not architecturally compatible with how resources/capacity are surfaced and used by the Kubernetes scheduler. Some more discussion here.

I unfortunately haven't had time to think deeply about this and how other non-local storage systems could be supported, and would greatly appreciate it if someone could do some deeper investigation here.
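For the csi-provisioner modification mentioned above (only trigger provisioning when the selected node matches), the core check could be as small as the following sketch. The helper and the NODE_NAME environment variable are illustrative, not the actual external-provisioner code; the annotation is the one written during delayed binding:

```go
// Illustrative helper, not the real csi-provisioner code: a per-node
// provisioner instance only acts on claims the scheduler assigned to its node.
package provisionercheck

import (
	"os"

	v1 "k8s.io/api/core/v1"
)

// Annotation written on the PVC during delayed (WaitForFirstConsumer) binding.
const selectedNodeAnnotation = "volume.kubernetes.io/selected-node"

func shouldProvision(pvc *v1.PersistentVolumeClaim) bool {
	selected := pvc.Annotations[selectedNodeAnnotation]
	// NODE_NAME is assumed to be injected into the sidecar via the downward API.
	return selected != "" && selected == os.Getenv("NODE_NAME")
}
```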
As a bit of background, the use case I am interested in getting working is an ephemeral local volume backed by LVM on a Kubernetes Node. See https://kubernetes-csi.github.io/docs/ephemeral-local-volumes.html. If we configure the storage class using the lvm provisioner to have volumeBindingMode set to WaitForFirstConsumer, then it seems like all that is needed is for NodeStageVolume to create the LV and format the file system, and then for NodePublishVolume to bind mount from the global location used in NodeStageVolume to the pod directory. Of course this still leaves the obvious hole that the Kubernetes scheduler has no idea if there is actually enough storage on the node to ensure these Node gRPC calls have any chance of success, but I'm setting that issue aside for a moment.
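For reference, a StorageClass for that setup expressed with client-go types might look like the sketch below; the provisioner name is a placeholder, not a real driver:

```go
// Illustrative only: a late-binding StorageClass for a node-local LVM driver.
package main

import (
	"fmt"

	storagev1 "k8s.io/api/storage/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	mode := storagev1.VolumeBindingWaitForFirstConsumer
	sc := storagev1.StorageClass{
		ObjectMeta:        metav1.ObjectMeta{Name: "local-lvm"},
		Provisioner:       "lvm.csi.example.com", // placeholder driver name
		VolumeBindingMode: &mode,                 // volume is only created once a pod is scheduled
	}
	fmt.Printf("%s -> %s\n", sc.Name, *sc.VolumeBindingMode)
}
```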
I would very much like to push this forward. @matthias50 I don't think we should let volume creation piggy-back on NodeStageVolume, at least not in the long run. It sounds like a sensible short-term solution to your use case, but longer-term I think it will lead to confusion.
@msau42 I'd be happy to contribute research / design / code / general effort. I am slowly grokking how CSI and storage in general works in k8s (my relevant background is CSI in Mesos) so any pointers, links to open discussions or design docs relating to this work would be much appreciated. |
Hi @matthias50 @gpaul, I have been working on Kubernetes local storage and volume scheduling over the last year. If you have any questions about k8s, just ask me. My next goal is to support dynamic provisioning for local storage and other topology-aware storage types. We can work together. I'm reading the csilvm project and related discussions and docs and writing a demo to demonstrate the design. I'm also writing a proposal based on @msau42's design. I'll share it when it's ready for review.
That's great to hear, @cofyc! Looking forward to hearing more.
@gpaul

> @matthias50 I don't think we should let volume creation piggy-back on NodeStageVolume, at least not in the long run. It sounds like a sensible short-term solution to your use case, but longer-term I think it will lead to confusion.

Do you have another suggestion as to a way to go about this? Right now, I don't see any other way to make sure that the volume is provisioned from the underlying storage on the node where it needs to be published.
No I don't, sorry. I'm hoping a design crystallizes out of the discussion here: …
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
This doc seems related, though I don't see that it links to a related KEP: https://docs.google.com/document/d/1WtX2lRJjZ03RBdzQIZY3IOvmoYiF5JxDX35-SsCIAfg/edit?disco=AAAADMBYnPc&ts=5d15db36&usp=comment_email_document&usp_dm=false
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
cc @lichuqiang