GetCapacityResponse should contain "total capacity" #301

Open · saad-ali opened this issue Nov 6, 2018 · 9 comments

Comments

@saad-ali (Member) commented Nov 6, 2018

GetCapacityResponse should contain "total capacity" in addition to available_capacity so that caller can make decisions about provisioning.
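
A minimal sketch of what this might look like, written here as a Go struct mirroring the generated CSI bindings; the total_capacity field name is the proposal under discussion, not settled spec wording:

```go
// Sketch only: the proposed shape of the response, not the actual generated code.
package csisketch

// GetCapacityResponse mirrors the existing CSI message with the proposed extra field.
type GetCapacityResponse struct {
	// AvailableCapacity: capacity (in bytes) currently free for provisioning
	// volumes that satisfy the request parameters. This field exists today.
	AvailableCapacity int64
	// TotalCapacity (proposed): overall capacity (in bytes) of the pool backing
	// this storage class / topology segment, independent of how much of it is
	// currently allocated.
	TotalCapacity int64
}
```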

@gnufied (Contributor) commented Nov 6, 2018

But GetCapacity is a controller call; without attaching the volume somewhere, it may be tough for the storage provider to say how much capacity is available. For example, I don't recall the AWS EBS API supporting returning available capacity when describing a volume.

@msau42 commented Nov 6, 2018

This GetCapacity response is used for determining how much capacity is available for provisioning. It's not for reporting how much capacity is used/available in a single volume.

Some more detail on the motivation for being able to report a total capacity. If a plugin only reports the current available capacity, that limits the performance of the controller consuming that information (a rough sketch of the resulting flow follows this list):

  • Additional scheduling latency, because you need to wait for the RPC round trip for the available capacity to be updated after a CreateVolume() call
  • Parallelism of scheduling and provisioning is limited due to this round-trip dependency
  • Attempts to cache requested capacity for "in-flight" CreateVolume operations are challenging when you consider plugin restarts. It's not clear to an observer whether the reported available capacity already includes the outstanding requests or not.
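
A rough sketch of that round-trip dependency, assuming a hypothetical CO-side loop holding a csi.ControllerClient (this is not actual Kubernetes or Mesos code):

```go
// Hypothetical CO-side helper: with only available_capacity, every CreateVolume
// must be followed by a fresh GetCapacity before the next placement decision can
// trust the cached number, which serializes scheduling.
package sketch

import (
	"context"
	"log"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
)

func scheduleSerially(ctx context.Context, ctrl csi.ControllerClient, reqs []*csi.CreateVolumeRequest) {
	for _, req := range reqs {
		if _, err := ctrl.CreateVolume(ctx, req); err != nil {
			log.Printf("CreateVolume %q failed: %v", req.Name, err)
			continue
		}
		// The previously reported available capacity is now stale; block on a
		// round trip before the next decision can be made safely.
		resp, err := ctrl.GetCapacity(ctx, &csi.GetCapacityRequest{Parameters: req.Parameters})
		if err != nil {
			log.Printf("GetCapacity failed: %v", err)
			continue
		}
		log.Printf("available_capacity after %q: %d bytes", req.Name, resp.AvailableCapacity)
	}
}
```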

@jdef (Member) commented Nov 6, 2018

Thanks for providing additional context @msau42.

It's still not clear to me how including the "total" capacity helps resolve these things. If the CO can't reason about "available" capacity due to parallel operations then it's not obvious to me how having the "total" capacity helps: the CO is still in the same position re: being unable to reason about operations executing in parallel or that may/may-not have completed after a plugin restart.

Given that storage provisioning/quota policy parameters are likely governed by the backend storage system itself (and invisible to the CO), I think that relying on "stable" cached values for "total" capacity is probably fraught with error for some set of backends. I suppose the same could be said of "available" capacity - caching this value for very long might not be a very good idea.

@msau42 commented Nov 6, 2018

With total capacity reported, the CO can keep track of what volumes it has created and what is outstanding. Plugin restarts are fine because that base number doesn't change, and the rest of the information can be persisted and reconstructed as needed. However, with only available capacity, we can't tell how many of the volumes we know about are accounted for in the reported capacity.

This does have the limitation that it assumes the reported total capacity is completely allocated to the CO cluster, and not shared with other clusters or allocated out of band.
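
A minimal sketch of that accounting, assuming the proposed total_capacity field exists (the tracker type and field names here are illustrative):

```go
// CO-side capacity accounting: free space is derived from the stable total
// minus volumes the CO knows about minus in-flight requests, so a plugin
// restart does not invalidate the arithmetic.
package sketch

type capacityTracker struct {
	totalBytes     int64 // reported by the plugin (proposed total_capacity)
	allocatedBytes int64 // sum of sizes of volumes the CO has created (persisted, reconstructable)
	inFlightBytes  int64 // sum of sizes of outstanding CreateVolume requests
}

// free is the CO's own estimate of provisionable capacity; it only changes when
// the CO creates or deletes volumes, or when the reported total changes
// (e.g. an administrator grows the pool).
func (t *capacityTracker) free() int64 {
	return t.totalBytes - t.allocatedBytes - t.inFlightBytes
}
```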

@msau42 commented Nov 7, 2018

Let me try to convey the difficulty with an example.

  1. When the plugin is being initialized, we can query it for the available capacity, and it returns 100 GB out of 500.
  2. Say there are some 50 volume creation operations all in flight
  3. At the same time an administrator decides to add more available capacity to the storage backend.
  4. Also at the same time, volumes are getting deleted and their capacity will be added back to the pool.

As a CO, when I periodically query the plugin for available capacity, how do I know which operations have been accounted for in the number that the plugin gives me? There is a timing delay where the CO's view can be out of sync with the plugin's view.

If there was a total capacity field, then it wouldn't matter what the plugin's view of 2) or 4) is. The CO can calculate available capacity based only on its own view of the allocated volumes and operations in flight. When it queries the plugin for capacity, a change in total capacity means something like 3) occurred, which is also not as frequent an event as 2) or 4).

Let me know if this makes any more sense.

@jdef (Member) commented Nov 7, 2018 via email

@jhdedrick commented

Agree with jdef. Storage class parameters will become richer over time, and it will be the job of the backend to optimally map volume requests to available "generalized capacity" - by which I mean not just storage capacity, but IOPs, network bandwidth, and many other constrained resources. Trying to report capacity as a single value isn't going to be meaningful.

@cofyc commented May 24, 2019

The question is whether the capacity can be cached by the CO for volume scheduling or not.

From my understanding, there are two things here:

  1. Adhere to one storage allocation way in one storage class

To calculate the available capacity, we need to know how the storage (e.g. a VG or a Ceph image pool) is consumed (e.g. linear, raid1, or filesystems with a fixed filesystem-to-block-size ratio).

If there are multiple ways to consume the backend storage for one storage class, the available_capacity cannot be calculated for that storage class either.

If there is only one way to allocate a volume from a storage class, then we can report the capacity for this storage class. The CO can cache it to make scheduling decisions.

I think this is the only way to do volume scheduling; otherwise, the capacity cannot be calculated and used by the CO. The CO can only assume all nodes have enough capacity, which is the current state of volume scheduling in Kubernetes.

The storage driver or plugin can support carving multiple types of volumes (e.g. linear and raid1 volumes in LVM), but for each storage class, it should support only one allocation type.

  2. It's best to report total capacity to improve the experience of volume scheduling

Described by @msau42 above. Without total capacity per topology segment per storage class (for local storage, each node is a topology segment), the scheduler may make bad decisions when its state is out of sync with the storage backends.

It is possible to recover by rescheduling. However, the scheduler may not choose the best-fit node when its state is out of sync (e.g. the storage of the best-fit node is occupied by terminating PVs).

As I clarified in 1), if we have available capacity per topology segment per storage class, there is only one allocation way for that storage class, so we can calculate the total capacity in most cases.

For linear volumes, it's easy. For raid1 volumes, LVM will allocate some space for metadata, and the ratio compared to total space is not fixed. For these kinds of volumes, we can reserve some space for metadata, e.g. <volume count limit> * copies * sizeof(extent). The number of copies determines the allocation way, which cannot be updated after the storage class is created. The volume count limit can be hard-coded or specified in storage class parameters too.
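
A worked version of that reservation, with illustrative numbers (the 4 MiB extent size and the 1000-volume limit are assumptions for the example, not recommendations):

```go
// Reserve <volume count limit> * copies * sizeof(extent) for raid1 metadata and
// report the remainder as the total allocatable capacity for the class.
package main

import "fmt"

func main() {
	const (
		vgSize           int64 = 500 << 30 // 500 GiB volume group
		extentSize       int64 = 4 << 20   // assumed extent size: 4 MiB
		copies           int64 = 2         // raid1 => 2 copies
		volumeCountLimit int64 = 1000      // hard-coded or taken from storage class parameters
	)
	reserved := volumeCountLimit * copies * extentSize
	total := vgSize - reserved
	fmt.Printf("reserved %d bytes for metadata, report total capacity = %d bytes\n", reserved, total)
}
```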

To report total capacity, there are two ways:

  • Return total capacity in the available_capacity field when a special parameter is given. Only one of available or total capacity is needed for volume scheduling.
  • Add a total_capacity field

I suggest adding a total_capacity field.

I think total_capacity should be the total allocatable capacity of the backend storage, using the allocation way specified in the GetCapacity request.

If we adhere to one allocation way per storage class, the CO can cache the reported available capacity per storage class to do dynamic volume provisioning. If the driver has a way to report total capacity, the experience will be better.

@gpaul commented May 24, 2019

If there is only one way to allocate a volume from a storage class, then we can report the capacity for this storage class. The CO can cache it to make scheduling decisions.

Right, in Mesos we cache this result for brief periods, and re-query every CSI plugin instance, for every StorageClass (we call them profiles), at a short interval (10s or 30s, I can't recall) to remain reasonably up to date.

The storage driver or plugin can support carving multiple types of volumes (e.g. linear and raid1 volumes in LVM), but for each storage class, it should support only one allocation type.

This makes sense: a StorageClass definition is transformed into GetCapacityRequest.Parameters. Multiple StorageClass'es result in the CO sending multiple GetCapacity RPCs to the same CSI plugin instance and getting back a potentially different value for available_capacity for each StorageClass.
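
A minimal sketch of that per-StorageClass pattern; the class names and parameters here are made up for illustration:

```go
// Each StorageClass/profile becomes a distinct set of GetCapacityRequest.Parameters
// sent to the same plugin instance, which may report a different number per class.
package sketch

import (
	"context"
	"log"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
)

func pollPerClass(ctx context.Context, ctrl csi.ControllerClient) {
	classes := map[string]map[string]string{
		"fast-raid1":  {"type": "raid1"},
		"bulk-linear": {"type": "linear"},
	}
	for name, params := range classes {
		resp, err := ctrl.GetCapacity(ctx, &csi.GetCapacityRequest{Parameters: params})
		if err != nil {
			log.Printf("GetCapacity for %s failed: %v", name, err)
			continue
		}
		log.Printf("%s: available_capacity=%d", name, resp.AvailableCapacity)
	}
}
```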

If we adhere to one allocation way per storage class, the CO can cache the reported available capacity per storage class to do dynamic volume provisioning. If the driver has a way to report total capacity, the experience will be better.

It sounds like the design you are proposing assumes the following limitation: available capacity for a given StorageClass cannot change at runtime other than through creating or deleting a volume of that StorageClass.

This is problematic.

  1. Given two StorageClasses, StorageClass[raid1] and StorageClass[raid0]: creating a volume of StorageClass[raid1] results in a RAID-1 LV being created on a specific VG. That implicitly reduces the available_capacity (or increases the used capacity, if you rely on total_capacity instead) of StorageClass[raid0].

  2. This implicit relation between StorageClass'es and the Create/DeleteVolume RPCs needs to live somewhere.

2.1 It can live in the CSI plugin. In Mesos we chose to delegate that knowledge to the CSI plugin: we do no capacity calculation and instead reissue the GetCapacity RPC for every StorageClass, for every CSI plugin instance, at regular intervals.

2.2 It can live in the CO. The CO can encode knowledge of how Create/DeleteVolume influences the available_capacity of a StorageClass as well as related StorageClasses.

You can try to sidestep the issue by restricting every instance of a CSI plugin (e.g., every LVM Volume Group) to a single StorageClass. In that case, you will still make incorrect calculations of available_capacity as the per-volume overhead is not accounted for: 1x10GiB volume uses less storage than 2x5GiB volumes, since the 2 volumes each have some metadata overhead in addition to their 5GiB available volume size. The reason is that the CreateVolume RPC does not interpret the volume size as "the amount by which available_capacity is reduced", but rather "what is the addressable size of the resulting volume".

I imagine that Create/DeleteVolume alone are also not enough to model correctly: Create/DeleteSnapshot and the soon-to-be-introduced volume resize functionality also have unexpected impact on capacity.

It is still possible to encode all this wisdom into the CO, but it would have to be done for every kind of CSI plugin, which defeats some of the purpose of the CSI specification.

Another issue with the CO calculating available_capacity as total_capacity - volume sizes, which I think is more fundamental, is what to do in the case of a CSI plugin that performs inline data compression. In that case, not even the CSI plugin will know how much available_capacity will be left after it creates a volume, as not all data compresses equally well.

I think the proper solution must be to issue GetCapacity calls and to accept the reality that the CO does not have a perfect view of the amount of available capacity at any given time, but devise strategies to reduce that delta.

One such strategy is to periodically poll CSI plugins for available_capacity for every StorageClass. This does not scale well with the number of StorageClasses.

Such a strategy could be tempered by only requesting available capacity if some capacity-changing RPC like Create/Delete/ResizeVolume or Create/DeleteSnapshot has been performed against that CSI plugin instance since available capacity was last requested.

This has the disadvantage of CO state becoming outdated in the case where the CSI plugin instance's backing storage increases/decreases out-of-band, such as when the administrator extends the LVM VG or adds more disks to a Ceph installation, etc. Perhaps that's OK.
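
A minimal sketch of that tempered polling (the names here are hypothetical, not Mesos or Kubernetes code): the CO marks the cache dirty whenever it issues a capacity-changing RPC against a plugin instance and only re-queries GetCapacity when the flag is set, accepting that out-of-band changes stay invisible until the next capacity-changing call.

```go
package sketch

import (
	"context"
	"sync"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
)

type capacityCache struct {
	mu        sync.Mutex
	dirty     bool
	available int64
}

// markDirty is called after Create/DeleteVolume, Create/DeleteSnapshot and
// resize RPCs against this plugin instance.
func (c *capacityCache) markDirty() {
	c.mu.Lock()
	c.dirty = true
	c.mu.Unlock()
}

// refreshIfDirty re-queries available capacity only if something may have
// changed it since the last query.
func (c *capacityCache) refreshIfDirty(ctx context.Context, ctrl csi.ControllerClient, params map[string]string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.dirty {
		return nil
	}
	resp, err := ctrl.GetCapacity(ctx, &csi.GetCapacityRequest{Parameters: params})
	if err != nil {
		return err
	}
	c.available = resp.AvailableCapacity
	c.dirty = false
	return nil
}
```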
