OpenEBS node-disk-manager #736
@jsafrane I have a very good opinion of the node disk manager effort from OpenEBS. That's one of the reasons I got involved in some of the discussions around it with the OpenEBS folks, hence the mentions of the Gluster operator and such in the design proposal. In reality, node disk discovery and handling is a concern common to all of these components. cc @kmova @umamukkara @epowell
I took a quick glance and it looks promising as an end-to-end disk management solution for distributed storage applications. The metrics and health monitoring aspects look very useful, and it should solve the issue of managing disks for DaemonSet-based providers. I'm trying to think about how this could be integrated with the local-pv-provisioner from two angles:
For both use cases, I think there are still some challenges around categorizing disks into StoragePools that would need to be ironed out. IIUC, ndm creates a Disk object for every block device in the system, so it would be up to the StoragePool implementation to further filter which Disks to use. And an implementation MUST filter the disks, otherwise it could end up stepping on the root filesystem or on other K8s volume plugins. Filtering on Disk.Vendor + Disk.Model may be sufficient if you want all similar disks to be in the same StoragePool (a rough sketch of such a filter follows this comment). The challenges I see are about how to support more advanced disk configurations:
The local PV provisioner didn't solve this and instead required users to prep and categorize the disks beforehand. While I can see some of the simpler use cases being simplified by ndm, I'm not sure what the best way is to solve the more advanced ones.
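As a rough illustration of the vendor/model filtering mentioned above, here is a minimal sketch of what a StoragePool with an explicit disk filter might look like. The kind, group, and field names (`diskSelector`, `poolType`) are hypothetical, not NDM's actual schema:

```yaml
# Hypothetical StoragePool CR; field names are illustrative only.
apiVersion: openebs.io/v1alpha1
kind: StoragePool
metadata:
  name: fast-ssd-pool
spec:
  # Only consider disks discovered on this node.
  nodeSelector:
    kubernetes.io/hostname: node-1
  # Claim only Disk objects matching these attributes; anything that does
  # not match (e.g. the root disk) is left untouched.
  diskSelector:
    vendor: "SAMSUNG"
    model: "MZ7LM480HCHP"
  poolType: striped
```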
@msau42 @jsafrane Thanks for the review and inputs! @humblec and I have been discussing how to keep the disk inventory and StoragePool implementation generic so it can be used in multiple scenarios. We have made some progress on the following (the design PR will be updated shortly):
We definitely need more help/feedback on the advanced use cases and the API design.
@kmova I'm wondering if it would be simpler to use PVs as your disk inventory instead of a new Disk CRD object. The advantage is that you can reuse the existing PVC/PV implementation to handle dynamic provisioning and attaching volumes to nodes.
@kmova @msau42 I am trying to understand the slide titled "Complementing Local PV". Currently the local provisioner crawls through the discovery directory to find volumes to create PVs for. It appears that with NDM one could add another form of discovery, where NDM uses its own discovery mechanism to create local PVs. This seems like a useful enhancement to me, assuming it's adding another mechanism and not replacing the existing one. As for the question about using Disk CRs, I would like to better understand what information they actually store. I assume that to support operations like unplugging and moving disks, the Disk CR stores more information than one would put in a local PV. Its life cycle might also be a bit different from a PV's as a result. If that is the case, then keeping the Disk CR might make sense. Again, I need to understand what information is in the CR and how it is used.
@msau42 - When using PVs in place of a new Disk CR, I ran into the following challenges:
Another consideration was the usability/operations perspective: for example, management tools around Kubernetes like Weave Scope could represent these disks as visual elements, with the ability to blink them, fetch iostats, etc.
@dhirajh The Disk CR can store static details of the device, and in addition it can carry dynamic attributes and monitored metrics.
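An illustrative sketch of such a Disk object follows; the exact NDM schema may differ, and the field names below are assumptions used only to show the kind of static details and monitored metrics being discussed:

```yaml
# Illustrative Disk CR (field names are assumptions, not NDM's exact schema).
apiVersion: openebs.io/v1alpha1
kind: Disk
metadata:
  name: disk-node1-sdb
  labels:
    kubernetes.io/hostname: node-1
spec:
  path: /dev/sdb
  capacity:
    storage: 4000787030016        # bytes
  details:
    vendor: SEAGATE
    model: ST4000NM0023
    serialNumber: Z1Z8ABCD
    firmwareRevision: "0004"
status:
  state: Active
  # Dynamic attributes / monitored metrics
  temperature: 34                 # degrees Celsius
  smartStatus: OK
  ioErrors: 0
```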
@kmova I think it's still possible to use PVs for inventory management. You don't necessarily have to mount the PVC directly in the pod spec. If supporting all kinds of volume types, such as cloud block storage, is on your future roadmap, then the PVC abstraction could also provide dynamic provisioning and disk attachment capabilities.
@msau42 - IIUC the PVs can be created by ndm, and the additional disk attributes could be added under annotations (or maybe under an extended spec?). To cover the case in kubernetes/kubernetes#58569, the pod can still mount "/dev", and the configuration can specify the PV objects it can use, which will carry the path information. I like the idea of using the PVC abstraction for dynamic provisioning. How do we get the PVs attached to the node without adding them to a Deployment/app spec?
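A minimal sketch of what such an ndm-created PV could look like, assuming the local volume plugin with `volumeMode: Block`; the `ndm.openebs.io/*` annotation keys for the extra disk attributes are hypothetical:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: disk-node1-sdb
  annotations:
    ndm.openebs.io/model: ST4000NM0023   # hypothetical annotation keys
    ndm.openebs.io/serial: Z1Z8ABCD
spec:
  capacity:
    storage: 3726Gi
  volumeMode: Block                      # expose the raw device, not a filesystem
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-disks
  local:
    path: /dev/sdb                       # the path information lives in the PV
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["node-1"]
```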
Getting the PVs attached to the node is the hard part because it is tied to Pod scheduling. Having a Pod per PV is probably not going to scale, and you have to handle cases like the pod getting evicted. I'm not sure if leveraging the VolumeAttachment object would work; it may conflict with or confuse the Attach/Detach controller.
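For reference, the existing VolumeAttachment object (shown here with today's storage.k8s.io/v1 API) is created per attach operation and reconciled by the Attach/Detach controller, which is why repurposing it for always-present local disks could be confusing:

```yaml
apiVersion: storage.k8s.io/v1
kind: VolumeAttachment
metadata:
  name: example-attachment
spec:
  attacher: ebs.csi.aws.com              # the driver expected to perform the attach
  nodeName: node-1
  source:
    persistentVolumeName: pvc-0a1b2c3d   # the PV being attached
```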
@msau42 I do think that expanding the VolumeAttachment object for local storage/disk handling could complicate things. It would be better to have this on another/new API object or on a custom CRD like ndm currently has. If a custom CRD for a disk object is not optimal, we may think about a new API object for disk/local storage handling, IMO. @kmova I feel we should also have topology information attached to the Disk objects.
@humblec - yes we can get the topology labels from the node where the disks are discovered and attach them to the Disk objects. Example:
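A minimal sketch, assuming the node's standard topology labels are copied verbatim onto the Disk object at discovery time (the apiVersion/kind are the same illustrative ones used above):

```yaml
apiVersion: openebs.io/v1alpha1
kind: Disk
metadata:
  name: disk-node1-sdc
  labels:
    # Copied from the node that discovered the disk.
    kubernetes.io/hostname: node-1
    failure-domain.beta.kubernetes.io/zone: us-east-1a
    failure-domain.beta.kubernetes.io/region: us-east-1
spec:
  path: /dev/sdc
```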
In addition, based on feedback, I have included the ability to fetch additional information describing how disks are attached (via internal bus, HBA, SAS expanders, etc.). This information can be used when provisioning latency-sensitive pools.
Agreed, I don't see a great way to handle attached disk types without always forcing some Pod to be on the node. I think the Disk CRD could work fine if you only plan on supporting local disks. But since other volume types were mentioned in the roadmap, I was trying to envision how things like provisioning and attaching could be supported without having to reimplement volume plugins and much of the Kubernetes volume subsystem.
As an alternative data point, I spoke a bit with @dhirajh about how they deploy Ceph in their datacenter. He mentioned that they use StatefulSets, and each replica (OSD) manages just one local PV. All replicas use the same class and capacity of disk, and instead there is a higher-level operator that manages multiple StatefulSets and balances them across fault domains (i.e., racks). This operator is in charge of making sure that capacity is equal across fault domains, and it can scale up each StatefulSet when more Ceph capacity is requested. With this architecture, they don't need their Ceph pods to manage multiple disks, and a disk failure is contained to a single replica, so they can use PVCs directly. For cases where nodes have different numbers of disks and capacities, the operator can create more StatefulSets to use them.
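A sketch of that pattern, assuming a local StorageClass and a placeholder OSD image; the higher-level operator would create one such StatefulSet per fault domain and scale it as capacity is added:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ceph-osd-rack-a          # one StatefulSet per fault domain (rack)
spec:
  serviceName: ceph-osd-rack-a
  replicas: 8                    # one replica (OSD) per disk in this rack
  selector:
    matchLabels:
      app: ceph-osd
      rack: rack-a
  template:
    metadata:
      labels:
        app: ceph-osd
        rack: rack-a
    spec:
      containers:
      - name: osd
        image: example.com/ceph-osd:latest   # placeholder image
        volumeDevices:                       # raw block device handed to the OSD
        - name: osd-disk
          devicePath: /dev/osd
  volumeClaimTemplates:          # exactly one local PV per replica
  - metadata:
      name: osd-disk
    spec:
      accessModes: ["ReadWriteOnce"]
      volumeMode: Block
      storageClassName: local-disks
      resources:
        requests:
          storage: 3726Gi
```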
Thanks @msau42, that's a good data point. I will add this to the design document. Along with this, I will also gather additional details on use cases where the storage pods need multiple PVs, and on the expected behaviour when using SPDK to access disks.
@msau42 @kmova IMO there are a good number of use cases where a storage pod needs more than one local PV. For example, sometimes the storage pod has to keep its own metadata in one PV and use other PVs for data volumes or for serving volume create requests. From another angle, one local PV may not be sufficient to serve all the PVC requests coming from Kubernetes users. At least in Gluster, we support around 1000 volumes from a 3-node Gluster cluster; just attaching one disk and carving space out of it may not be sufficient.
In Gluster's case, it would also be fairly heavyweight to have one GlusterFS pod per device on a node. Not to mention that it would limit per-node scale-out expansion, which is one of Gluster's core features.
The issue was subsequently marked stale, then rotten, and finally closed by fejta-bot after continued inactivity.
What's our opinion about OpenEBS node disk manager (NDM)?
https://github.com/openebs/node-disk-manager
openebs-archive/node-disk-manager#1
https://docs.google.com/presentation/d/1XcCWQL_WfhGzNjIlnL1b0kpiCvqKaUtEh9XXU2gypn4/
We could probably save some effort on both sides if we cooperate. For example, NDM's StoragePool idea looks like our LVM-based dynamic provisioner. And I personally like the automated discovery of local disks, which I'd need in order to deploy Gluster or Ceph on top of local PVs.
To me it seems that NDM is trying to solve a similar use case to ours; it's just more focused on the installation/discovery of the devices to consume as PVs, while Kubernetes focuses on the runtime aspects of how to use the local devices (i.e., schedule and run pods). IMO, it would make sense to merge NDM with our local provisioner, or at least make the integration as easy as possible for both sides.
/area local-volume
@ianchakeres @msau42 @davidz627 @cofyc @dhirajh @humblec ?
(did I forget anyone?)