
Storage Capacity Constraints for Pod Scheduling #1353

Conversation

@pohly (Contributor) commented Nov 4, 2019

This KEP explains how CSI drivers can expose how much capacity a storage system has available via the API server. This information can then be used by the Kubernetes Pod scheduler to make more intelligent decisions when placing:

  • pods with ephemeral inline volumes
  • pods with persistent volumes that haven't been created yet (WaitForFirstConsumer binding mode)

Enhancement issue: #1472

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 4, 2019
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Nov 4, 2019
@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch from fb6c15e to 89ec735 on November 4, 2019 06:55
@pohly (Contributor, Author) commented Nov 4, 2019

@cdickmann just created #1347, which is a similar proposal but with a different API and different goals. We learned about each other's work last week and agreed that the best approach for reconciling the two proposals would be to create KEP PRs and then use the KEP process to discuss them.

The goal has to be to define one API extension which works for both purposes (Kubernetes scheduler here, more intelligent operators in #1347).

@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch from 89ec735 to 2560793 on November 4, 2019 07:00
@cdickmann commented:

If I compare this KEP to ours (#1347, hopefully my CLA issue is resolved today, waiting for tech support) then I see some key differences:

  • We assume that storage can have more diverse topology, e.g. tracking individual disks or disk groups in a host, or tracking multiple independent storage arrays attached to the same node. In all these cases, the capacity is really on a StoragePool and a node has (or doesn't have) access to it. IMHO that is more powerful and a superset of what is proposed here.
  • Beyond capacity, IMHO we may need more fields in the future, e.g. health and other properties of a StoragePool.
  • This KEP proposes K8s scheduler changes to take advantage of the additional available information.

I feel it is useful to separate two topics:

  • Surfacing up information about the storage. @pohly Do you see anything that you would like K8s to surface up which isn't covered by our KEP?
  • Having K8s leverage them for scheduling. We decided to scope this out of our KEP. @pohly Could you imagine building your scheduler logic on top of the StoragePool exposed by our KEP?

@pohly (Contributor, Author) commented Nov 4, 2019

IMHO that is more powerful and a superset of what is proposed here.

True. The downside is that it is also more complicated and can't be implemented without extensions to the CSI standard. The proposal here was meant to work with just the existing mechanisms (CSI topology, v1.NodeSelector to represent that in Kubernetes).

Beyond capacity, IMHO we may need more fields in the future, e.g. health and other properties of a StoragePool.

Agreed.

Do you see anything that you would like K8s to surface up which isn't covered by our KEP?
Could you imagine building your scheduler logic on top of the StoragePool exposed by our KEP?

I still need to think about your proposal and how the Kubernetes scheduler itself could make use of it with a CSI driver that hasn't been modified.

@pohly (Contributor, Author) commented Nov 4, 2019

IMHO that is more powerful and a superset of what is proposed here.

True.

After going through your proposal once more I am not so sure about that anymore. It seems to be focused exclusively on a flow where a PVC is somehow tied to a storage pool and then gets provisioned with CreateVolume first, before any pod using that volume.

The generic late-binding case (Pod uses PVC which refers to a storage class and WaitForFirstConsumer is used) cannot be handled because that generic PVC doesn't reference a storage pool, so Kubernetes doesn't have the information about storage capacity that it needs for choosing a node for the pod. Ephemeral inline volumes also cannot be handled because those don't involve CreateVolume.

@pohly (Contributor, Author) commented Nov 5, 2019

@pohly Could you imagine building your scheduler logic on top of the StoragePool exposed by our KEP?

As explained in my previous comment, #1353 (comment), I think the answer is no. Let me turn the question around, can you imagine basing your extension on the revised API in this KEP (see below)? 😄

Obviously the original API was too focused on just capacity tracking. Thanks to your KEP additional use cases became clearer and I now tried to come up with a more general API that can be extended to also cover those - see e0e7c43. At this point, it's a 1:1 renaming of what I had before plus some additional flexibility regarding what information must be provided. In this PR I'd prefer to keep it at that level to ensure that it remains small enough to make progress.

If you think that this goes in the right direction, then I could try to come up with a revision of your KEP that is based on this one.

This is a problem for deployments that do support it, but haven’t been
able to provide the necessary information yet. TODO: extend
CSIDriverInfo with a field that tells the scheduler to retry later
when information is missing? Avoid the situation by (Somehow? How?!)
@pohly (Contributor, Author):

I might have found a solution for this:

  • the driver deployment creates the top-level storage info object
  • external-provisioner just updates it while the object exists
  • removing the driver also removes the object

This way the scheduler knows that if there is an object, it has to schedule based on the information contained in it. The information might still be stale or incomplete, though.

@pohly (Contributor, Author):

Or even simpler: add a CSIDriver.Status field with all the dynamic information.
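
For illustration, a minimal sketch of what such a status could carry, written as hypothetical Go types that mirror the YAML example appearing later in this discussion. This shape was eventually dropped again in favor of separate objects, so treat all names here as assumptions rather than the final API:

package storage

import "k8s.io/apimachinery/pkg/api/resource"

// Hypothetical only: dynamic capacity information embedded in a
// CSIDriver.Status, updated by external-provisioner while the object exists.
type CSIDriverStatus struct {
	// One entry per storage pool known to the driver.
	Storage []CSIStoragePoolStatus
}

type CSIStoragePoolStatus struct {
	// Name of the pool, chosen by the driver deployment.
	Name string
	// Nodes which have access to this pool.
	NodeList []string
	// Available capacity, broken down by storage class.
	Classes []CSIStorageByClass
}

type CSIStorageByClass struct {
	StorageClassName string
	Capacity         *resource.Quantity
}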

@pohly (Contributor, Author):

I switched over to that. Conceptually it makes sense to me. But other APIs also would work.

@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch from 16a69b7 to fc24335 on November 12, 2019 12:25
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 13, 2019
@cdickmann commented:

I think between our 2 KEPs we are converging. Our use cases have some overlap, but are not 100% the same. Yet, at the core of our solutions we have a concept of a StoragePool, which represents a logical grouping of storage which is accessible by a set of nodes, and which has metadata like capacity. Patrick's KEP has a nice idea where capacity is reported per-storage-class because the storage class can actually change how much capacity gets used, and I agree with this idea. We can build our KEP on top of the same K8s API that Patrick is proposing, even though we likely need to extend it, and we also need CSI changes.

Patrick covers the following cases that we do not need for our use case, but that don't conflict:

  • K8s scheduler changes that consume the storage pool capacity
  • Concepts of <fallback> and pseudo-storage-classes

Our KEP covers the following which isn't in Patrick's KEP:

  • Decouple Node and StoragePool topology, e.g. offer the ability to present multiple StoragePools per node, and StoragePools-to-Compute accessibility which don't follow the general node topology. This in turn requires our KEP to amend CSI.
  • Motivation and guidance around how K8s operators of Storage applications would leverage StoragePools for their own placement
  • A way of facilitating an application choosing which StoragePool to provision against, as it will consider more than just capacity

So we can easily converge here, and to me the path forward comes down to community guidance on:

  • Should we make a single KEP? Or should we make one core KEP which introduces StoragePool, and 2 extensions for the 2 use cases? Or just go with the 2 KEPs we have right now? I have no strong preference.
  • Should StoragePool be a first class K8s object (like we suggest), or be embedded in a larger CSIDriver K8s object (like Patrick suggests)? I have no strong preference.
  • Should we put the new K8s logic into a new sidecar or into external provisioner? I have no strong preference.

Maybe @saad-ali , @thockin, and others can give their guidance on these questions?

// cluster. That subset can be identified either via NodeTopology or
// NodeList, but not both. If neither is set, the pool is assumed
// to be available in the entire cluster.
type CSIStoragePool struct {
@NickrenREN (Contributor), Nov 19, 2019:

For local storage in a very large cluster, is it too long for users to read the information in one API object? It seems we need to have a CSIStoragePool struct for each node.
And also, is it reasonable to organize the storage pool by storage class? I agree it is good for network storage, but it seems not that good for local storage.

@pohly (Contributor, Author):

For local storage in a very large cluster, is it too long for users to read the information in one API object?

That may become an issue, yes.

It seems we need to have a CSIStoragePool struct for each node.

But that's then not meant to be readable by users, is it? As discussed in #1347 (comment), the name of those objects most likely will have to be generated such that they don't mean anything.

What about storing the information as proposed in this KEP, but then not show it in kubectl describe? kubectl describe pods does that for spec.affinity. It's still available to users when using kubectl get -o yaml. I'm not sure how that is implemented and whether it is a good idea, but I at least wanted to bring it up... 🤷‍♂️

And also, is it reasonable to organize the storage pool by storage class? I agree it is good for network storage, but it seems not that good for local storage.

I don't follow here. Are you saying that local storage can ignore storage classes? That's not necessarily the case. The example where available usable storage depends on storage class parameters is LVM with striped vs. mirrored mode selected via the storage class. That is local storage.

@NickrenREN (Contributor):

But that's then not meant to be readable by users, is it?

Yes, exactly.

What about storing the information as proposed in this KEP, but then not show it in kubectl describe? kubectl describe pods does that for spec.affinity. It's still available to users when using kubectl get -o yaml

I guess users may want to know the topology info just like capacity and allocatable in node status.

Are you saying that local storage can ignore storage classes? That's not necessarily the case. The example where available usable storage depends on storage class parameters is LVM with striped vs. mirrored mode selected via the storage class. That is local storage.

Yeah, actually, for local storage, we just need one storage class with some necessary parameters in the cluster. If so, as proposed here, each node with local storage is a storage pool, and their storage classes may all be the same. I am not sure if it is reasonable, it just feels weird.
It seems StorageClass and StoragePool do not have a strict mapping relationship. Why do we need to add StorageClass to the StoragePool object?

@NickrenREN (Contributor):

Or could we limit the StoragePool more strictly, and make it a 1:1 mapping relationship with StorageClass?

@pohly (Contributor, Author):

I guess users may want to know the topology info just like capacity and allocatable in node status.

So you are arguing that we should put storage information into the Node.Status? That only makes sense for one particular use case (local storage). I don't think we should develop a special solution just for that.

Yeah, actually, for local storage, we just need one storage class with some necessary parameters in the cluster.

That may be the case for some drivers, but not all. We need a solution that is flexible enough to cover different scenarios. A driver with multiple different storage classes and capacity that depends on the storage class is one of those scenarios.

If a driver does not need that flexibility, it can produce information with just storageClassName: <fallback>, as in the "local storage" example:

status:
  storage:
  - classes:
    - capacity: 256G
      storageClassName: <fallback>
    name: node-1
    nodeList:
    - node-1

Or could we limit the StoragePool more strictly, and make it a 1:1 mapping relationship with StorageClass?

Isn't that the special case above?

@pohly (Contributor, Author), Nov 25, 2019:

Especially user-friendly when using kubectl describe...

But how does the user find the right object to call kubectl describe for? How user-friendly is that part going to be?

Since our cluster is quite large, we care about the performance a lot.

That part is indeed crucial. But it also depends on the usage pattern of the performance-critical parts. So let's consider that aspect for the worst case, node-local storage. With the current proposal (CSIDriver.Status), there will be one producer per node, sending updates to its part of the status via a patch operation. There's one consumer (the scheduler). It had better not create a watch, because otherwise it'll constantly get updates even when there's nothing to schedule. Instead, it can retrieve the CSIDriver each time it is needed. I'm assuming here that scheduling pods with storage is going to be rare compared to scheduling pods without storage.

With CSIStoragePool, the writers will send a complete Status update for one object. There's still only one consumer (or two in your case?). That scheduler now must retrieve all objects with a certain CSIStoragePool.Spec.DriverName value each time it wants to schedule a pod with a volume using that driver. Because individual objects are smaller, a watch might make more sense here, although the argument about getting updates when they aren't needed still holds.

Does that sound right?

It's not obvious to me which approach will lead to better performance. I was hoping that some apiserver experts have already measured something like this and could share their experience. If that doesn't happen, then I have to proceed with prototyping both APIs and writing some microbenchmark to get some real numbers.

@NickrenREN (Contributor), Nov 26, 2019:

But how does the user find the right object to call kubectl describe for? How user-friendly is that part going to be?

For local storage, its (CSIStoragePool) name can be set to node name

Does that sound right?

Hmm, sorry for the unclear expression. What I want to clarify here is: we modify the k8s scheduler part. We create several schedulers and make them work in parallel. Each scheduler is only responsible for one partition (pods with a specific NodeSelector and a subset of nodes with a specific label), and schedulers cannot share resources (nodes).
For predicate functions, we need to make sure Allocatable storage >= Pod(PVC) Request.
If we store all local storage info in one object, all schedulers need to look it up, and therefore we need a lock or something like that. But if we create separate objects, each scheduler only cares about some of the CSIStoragePool objects; since schedulers cannot share node resources, they can work in parallel and do not need locks.

I know the solution is not general enough. Schedulers will have to share some other information. But at least for local storage, it is OK. For network storage, we can shard the storage pools manually and make it fit for multiple schedulers too.

@pohly (Contributor, Author):

For local storage, its (CSIStoragePool) name can be set to node name

This only works under the assumption that only one CSI driver is providing capacity information for that node. But indeed, if this assumption holds, then individual objects may work better. Thanks for clarifying this.

@msau42 (Member), Nov 27, 2019:

I spoke with Jordan and he agrees that if the objects will end up scaling with the number of nodes, then it is preferable to have a separate object. The main issues with a single object:

  • any update to that single object, no matter how small, ends up sending the whole API object to all the watchers. This will not scale in large clusters.
  • when an informer gets an update, it gets the whole object; you can't see only the small delta that has changed, so you end up needing to iterate through the entire object to process a small update.

Separating it out into separate objects means more work on the consumer end, especially to build up caches, but it substantially reduces API server load in large clusters.

@pohly (Contributor, Author):

Both of these points apply to a usage pattern that does not necessarily apply for the implementation outlined in this KEP. But as @NickrenREN explained, they might want to use the API in a different way where they do apply, so I'm fine with switching to individual objects. I'll update the KEP accordingly.

@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch 3 times, most recently from fb9e14b to 0401cb1 on November 22, 2019 14:29
@msau42 (Member) commented Nov 27, 2019

@kubernetes/sig-storage-pr-reviews
/assign @msau42 @jsafrane

Before we start going into API details, I think we should focus on consolidating the two keps. And I think it would first be good to get agreement on the use cases and user journeys before we dive into API.

@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch from 07a1619 to 5c43a13 on May 13, 2020 20:06
@pohly (Contributor, Author) commented May 13, 2020

Can the discussions about ephemeral inline volumes all be moved to this new KEP #1701? In this KEP, just add one line to reference that new KEP? I think that will make this KEP more focused and easier to read.

Agreed, it allowed me to remove quite a few special cases. Now we just need to finish "generic inline volumes"... 😁

- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Contributor:

Can you check the ones already done?

@pohly (Contributor, Author):

I'm not sure which ones I can consider done. I've marked those where I am relatively sure.


### Non-Goals

* Only CSI drivers will be supported.
Contributor:

The way this sentence is written, it is a "Goal", not a "Non-Goal" :).

@pohly (Contributor, Author):

Updated. I always struggle with avoiding double-negation in this section.


- Gather feedback from developers and users
- Integration with [Cluster Autoscaler](https://github.com/kubernetes/autoscaler)
- Generic solution for identifying storage pools
Contributor:

Can you reference this KEP here (#1347)?

Also add that we need to figure out how to do spreading. Maybe it is covered by storage pools but it should be called out.

@pohly (Contributor, Author):

Done.

@xing-yang (Contributor) commented:

Here's the recording for the meeting on 5/13: https://www.youtube.com/watch?v=Xj7ebO-Hqrw

capacity information, store it in the API server, and then use that
information in the scheduler. That information then flows
through different components:
1. storage backend
Contributor:

nit: s/storage/Storage/

@pohly (Contributor, Author):

Fixed.


### API

#### CSIStorageCapacity
Contributor:

I still think this name is not right and CSIStoragePlacement is a better name, but we can discuss it during the implementation.

@pohly (Contributor, Author):

"Placement" of pods or storage? The former is what the API is used for, the later sounds odd. But yes, let's discuss separately.

//
// +optional
Capacity *resource.Quantity
}
Contributor:

NodeTopology, StorageClassName, Capacity should still be in the Status as they can change.

@pohly (Contributor, Author):

I consider "Capacity" as the only mutable field. If topology or storage classes change, then new objects get created and those that are no longer relevant get removed.

This is part of the move away from CSIStoragePool. That entity had varying topology, but CSIStorageCapacity objects are defined by the input parameters of GetCapacity (driver, storage class, topology) and provide the output value (capacity).
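
To make the "input parameters vs. output value" split concrete, here is a minimal sketch of the object shape being discussed. Field names follow this thread; the exact type of the topology field and the spec/status placement of the capacity were still open at this point, so everything here is illustrative:

package storage

import (
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Sketch: one CSIStorageCapacity object per GetCapacity call, i.e. per
// (driver, storage class, topology segment) combination.
type CSIStorageCapacity struct {
	metav1.TypeMeta
	metav1.ObjectMeta
	Spec CSIStorageCapacitySpec
}

type CSIStorageCapacitySpec struct {
	// Input parameters of the GetCapacity call; if these change, a new
	// object gets created and the obsolete one gets removed.
	DriverName       string
	StorageClassName string
	NodeTopology     *metav1.LabelSelector // assumption: some form of node selector

	// Output of the GetCapacity call; the only field expected to be updated
	// in place. (Whether it lives in the spec or a status was part of this
	// discussion.)
	Capacity *resource.Quantity
}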


// The storage class name of the StorageClass which provided the
// additional parameters for the GetCapacity call.
StorageClassName string
Contributor:

Is this required or optional?

@pohly (Contributor, Author):

Now it is required. I was proposing an optimization earlier ("Storage class parameters that never affect capacity") where it would have been optional, but that was seen as too complicated.

// might get added. If not set, the scheduler will simply ignore the
// CSIStorageCapacity object.
//
// +optional
Contributor:

Shouldn't this field be required? Without capacity, how is CSIStorageCapacity useful?

@pohly (Contributor, Author):

This is meant to enable future extensions. Once we introduce "total capacity" and "maximum volume size", this API would become:

Capacity *resource.Quantity
TotalCapacity *resource.Quantity
MaximumVolumeSize *resource.Quantity

At least one will be required then.

Let me change the "optional" into "required", because right now it is.

5GiB each, so accepting the node for two such volumes could be a false
positive.

More promising might be to add prioritization of nodes based on how
Contributor:

By saying "prioritization of nodes", it seems that you are talking about local storage here?

@pohly (Contributor, Author):

Networked storage might also be available only on a subset of the cluster. So prioritization may also be applicable there.


* **Is the rollout accompanied by any deprecations and/or removals of features,
APIs, fields of API types, flags, etc.?**
Even if applying deprecation policies, they may still surprise some users.
Contributor:

I think the question is by introducing this CSIStorageCapacity feature, will any existing features get deprecated or removed.

* **How can an operator determine if the feature is in use by workloads?**
Ideally, this should be a metric. Operations against Kubernetes API (e.g.
checking if there are objects with field X set) may be last resort. Avoid
logs or events for this purpose.
Contributor:

I think they can find out this feature is being used by querying CSIStorageCapacity objects?

- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high error rates on the feature:

Contributor:

It seems that you have copied some paragraphs in the example KEP but didn't make any changes to them. I think if they are not relevant, then they should just be removed.

@pohly (Contributor, Author):

Yes, I copied everything but left the part that says "must be completed when targeting beta graduation" for later without making changes there. I can take that out.


[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos

Contributor:

In general, this readiness section needs more work. If the example paragraphs copied from the example KEP are not relevant, please remove them. Either write down your answers related to this feature or leave them blank if not relevant.

@pohly (Contributor, Author):

I kept the headlines and added comments saying "Will be added before the transition to beta."

@xing-yang (Contributor) commented:

The verify failure is from "./hack/verify-kep-metadata.sh", not the typo check. That means there's some formatting issue with kep.yaml.

type CSIStorageCapacitySpec struct {
// The CSI driver that provides the storage.
// This must be the string returned by the CSI GetPluginName() call.
DriverName string
@pohly (Contributor, Author):

Strictly speaking, the driver name is redundant here. It was a left-over from the time when storage class was optional (CSI ephemeral volumes, the optimization for "storage class does not affect capacity").

But after having thought a bit about it, I think it is worthwhile keeping because it enables the listing of all objects for one driver, something that the external-provisioner has to do.

// Depending on how the driver is implemented, this might be the total
// size of the available storage which is only available when allocating
// multiple smaller volumes ("total available capacity") or the
// actual size that a volume may have ("maximum volume size").
Member:

+1

to rely on that information when making scheduling decisions that
involve volumes that need to be created by the driver.

If not set, the scheduler makes such decisions without considering
Member:

Or set to false?

@pohly (Contributor, Author):

Yes, good catch.

@pohly (Contributor, Author) commented May 15, 2020

I've updated the KEP, please have another look.

jurisdiction) availability may still be limited to a subset of the
nodes in a cluster.

#### Custom schedulers
Member:

IIUC, this use case is about being able to leverage the new API to build a custom scheduler extension if the current implementation doesn't suit a plugin's needs? If that's the case, then you may need to disable the built-in plugin? @ahg-g is that possible through a config, or would the scheduler have to be rebuilt?

@pohly (Contributor, Author):

When choosing a custom scheduler for a pod through the PodSpec.SchedulerName, only the custom scheduler is going to dispatch the pod, so there's no need to disable the feature in the default scheduler. It's still going to be enabled in case that some other pod uses the default scheduler and has volumes, though.

matches the storage pool(s) that it has access to, with granularity
that matches the most restrictive pool.

For example, if the driver runs in a node with zone/region/rack
Member:

I'm not sure I understand this example. Can we have something more concrete?

Let's start with a simple example. Say we have a storage driver that supports zonal disks, and the cluster has nodes across many zones a, b, and c. Each NodeGetInfo call from the driver daemonset is going to return topology "zone" = "a", "b", or "c", and report it in the CSINode object.

I believe this example is trying to convey what to do if a driver returns different topology keys on different nodes? For example, for nodeA, it returns "zone" = "a", but for nodeB, it returns ("zone" = "A" AND "region" = "X").

Can we just treat nodeB's topology as another unique topology value different from nodeA's topology, and expect the driver to be able to report an appropriate value?

@pohly (Contributor, Author):

Suppose the storage is local to a region. There's storage with label region=west and storage with region=east. Now in each region, we have data centers. As long as NodeGetInfo returns region=east or region=west for each node, depending on where it is located, the proposed algorithm works.

Now suppose that in addition to storage in each region, each data center also has storage. Some storage classes select the region storage, some others the faster storage in the datacenter.

The storage in a New York data center then has "region=east,center=NewYork" as topology, and NodeGetInfo for a node in that center has to return that.

The proposed algorithm will then still work: it will use region+center for the topologies and create CSIStorageCapacity objects with that topology also for region storage.

What isn't okay is when NodeGetInfo only reports region, because that is not detailed enough for the storage in the centers.
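
As a sketch using the CSI Go bindings (the topology keys and node name are made up for this example), the node service of the driver on a node in the New York data center would answer NodeGetInfo along these lines:

package driver

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// nodeGetInfo illustrates the requirement above: the node reports both the
// region and the data center, so that capacity for per-region storage and
// per-center storage can both be matched against it.
func nodeGetInfo(ctx context.Context, req *csi.NodeGetInfoRequest) (*csi.NodeGetInfoResponse, error) {
	return &csi.NodeGetInfoResponse{
		NodeId: "node-nyc-1",
		AccessibleTopology: &csi.Topology{
			Segments: map[string]string{
				"example.com/region": "east",
				"example.com/center": "NewYork",
			},
		},
	}, nil
}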

Member:

Thanks for clarifying. I think the original example is confusing because it seems like zone encompasses a region, which is the reverse of what most cloud providers do. Can we change the example to use a more clear hierarchy, like region and rack?


#### With central controller, generic solution

For all other cases, a new CSI call to enumerate storage topology
Member:

This section is very vague. Is there an example of such a use case that won't work with above two? Let's leave it out of the proposal or put it under open issues if we don't have a concrete plan here.

@pohly (Contributor, Author):

I can take it out until we identify a driver where the proposed approach doesn't work.

Member:

Yes let's remove it for now

- compute the union of the topology segments from these
`CSINodeDriver` entries.

For each entry in that union, one `CSIStorageCapacity` object is
Member:

How is the name of the object generated to ensure uniqueness? Does it need to be deterministic so we can update it later?

@pohly (Contributor, Author):

The name can't be deterministic. We have to use a UUID for the name, similar to how PVs are named.

Updating is still possible because external-provisioner has to know about all objects for its driver and then can match by topology and storage class name.
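
For illustration, two ways such a non-deterministic name could be produced (the "csisc-" prefix is just an example, not part of the proposal):

package provisioner

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/uuid"
)

// Option 1: let the API server append a random suffix on create.
func objectMetaWithGenerateName() metav1.ObjectMeta {
	return metav1.ObjectMeta{GenerateName: "csisc-"}
}

// Option 2: generate a UUID client-side, similar to how dynamically
// provisioned PVs get "pvc-<uid>" names (except that here there is no
// PVC UID to reuse).
func objectMetaWithUUID() metav1.ObjectMeta {
	return metav1.ObjectMeta{Name: "csisc-" + string(uuid.NewUUID())}
}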

Member:

PV names are deterministically generated so that if we call CreateVolume multiple times for the same PVC, it will get called with the same name.

Searching through every object for every predicate call, or every update in external-provisioner is not ideal, especially for a local volume driver that's going to have an entry per node. Is there a way we cache the mappings (maybe keyed by node) to improve the lookups?

@pohly (Contributor, Author):

PV names are deterministically generated so that if we call CreateVolume multiple times for the same PVC, it will get called with the same name.

That relies on the unique ID of the PVC. We have no such UID here, just a bunch of attributes. We could hash them, but then we'll have to deal with hash collisions.

Internally, both external-provisioner and kube-scheduler will have to use caching and more efficient data structures to find the information that they need.
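
A sketch of the kind of in-memory index that could mean (all names here are made up; the actual data structures are an implementation detail of external-provisioner and the scheduler):

package capacitycache

import "sync"

// capacityKey identifies one CSIStorageCapacity object by its immutable
// attributes; the topology is reduced to a canonical string (e.g. sorted
// "key=value" pairs) so it can be used as a map key.
type capacityKey struct {
	storageClassName string
	topology         string
}

// capacityCache is the sort of index that could be maintained on top of an
// informer, so that lookups do not require scanning every object.
type capacityCache struct {
	mu      sync.RWMutex
	objects map[capacityKey]string // key -> object name
}

func (c *capacityCache) lookup(class, topology string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	name, ok := c.objects[capacityKey{storageClassName: class, topology: topology}]
	return name, ok
}

func (c *capacityCache) update(class, topology, name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.objects == nil {
		c.objects = make(map[capacityKey]string)
	}
	c.objects[capacityKey{storageClassName: class, topology: topology}] = name
}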

name + nodes + storage class) and then update that one
instead. Obsolete objects need to be removed.

To ensure that `CSIStorageCapacity` objects get removed when the driver
Member:

What if topologies disappear while the driver is still running? For example, an entire rack gets decommissioned? There probably still has to be some logic that scans the current topologies from CSINode and reconciles with existing StorageCapacity objects.

@pohly (Contributor, Author):

Yes, that's one of the situations where external-provisioner has to remove stale objects ("While external-provisioner runs, it needs to update and potentially delete CSIStorageCapacity objects: ... when nodes change (for central provisioning)").

While external-provisioner runs, it needs to update and potentially
delete `CSIStorageCapacity` objects:
- when nodes change (for central provisioning)
- when storage classes change (for persistent volumes)
Member:

what does "for persistent volumes" mean in this context?

@pohly (Contributor, Author):

Let me remove it. It's a left-over from the time when ephemeral volumes were treated as a special case.

delete `CSIStorageCapacity` objects:
- when nodes change (for central provisioning)
- when storage classes change (for persistent volumes)
- when volumes were created or deleted (for central provisioning)
Member:

how come this is not a concern for local volume provisioning?

@pohly (Contributor, Author):

It is. I'll remove the condition.

The lookup sequence will be:
- find the `CSIDriver` object for the driver
- check whether it has `CSIDriver.spec.storageCapacity` enabled
- find all `CSIStorageCapacity` objects that have the right spec
Member:

Would there be multiple valid objects, or just one?

@pohly (Contributor, Author):

There should only be one, otherwise there's conflicting information. We can make the scheduler check for that error, if we want.
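
Putting the lookup sequence together, a simplified sketch of the scheduler-side check, with stand-in types instead of the real API objects and requested sizes in bytes:

package scheduler

// Stand-in types; only the fields needed for the sketch are shown.
type csiDriver struct {
	name            string
	storageCapacity bool // CSIDriver.spec.storageCapacity
}

type csiStorageCapacity struct {
	driverName       string
	storageClassName string
	// Simplified stand-in for NodeTopology: labels a node must carry.
	nodeTopology  map[string]string
	capacityBytes int64
}

func topologyMatches(required, nodeLabels map[string]string) bool {
	for k, v := range required {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

// nodeHasCapacity: if the driver opted in, at least one matching object with
// sufficient capacity must exist for the node to pass the filter.
func nodeHasCapacity(driver csiDriver, objects []csiStorageCapacity, class string, nodeLabels map[string]string, requestedBytes int64) bool {
	if !driver.storageCapacity {
		// Driver did not opt in: schedule without considering capacity.
		return true
	}
	for _, c := range objects {
		if c.driverName != driver.name || c.storageClassName != class {
			continue
		}
		if !topologyMatches(c.nodeTopology, nodeLabels) {
			continue
		}
		// A capacity that is too small is treated as unusable and ignored.
		if c.capacityBytes >= requestedBytes {
			return true
		}
	}
	return false
}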

capacity that is too small is considered unusable at the moment and
ignored.

Each volume gets checked separately, independently of other volumes
Member:

As a best effort, can we at least sum up the capacities for multiple volumes in the same Pod? This can at least reduce the chance of running into partial binding failures.

As a local volume example, if the node says it has "50GB" left, and a pod requests two local volumes of "50GB" each, then we should be fairly confident that the pod definitely cannot fit.

Contributor:

That requires summing up the capacity of all the requested volumes and making a decision at the same time? But the create volume requests may not all come at exactly the same time.

@pohly (Contributor, Author):

As a local volume example, if the node says it has "50GB" left, and a pod requests two local volumes of "50GB" each, then we should be fairly confident that the pod definitely cannot fit.

This makes assumptions about the semantic of "capacity", something that we wanted to avoid. It also assumes that "capacity" == "total capacity". It's erring on the wrong side if the driver reports "maximum volume size".

Let's first build something that doesn't use heuristics and then check how well it works before introducing additional logic, okay?
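
To spell out the ambiguity: the same reported number leads to opposite conclusions depending on its meaning. A tiny sketch with illustrative numbers only:

package capacity

// If "capacity" means total available capacity, summing the requests and
// rejecting the node is safe: two 50GB volumes cannot fit into 50GB.
func fitsAssumingTotalCapacity(reportedBytes int64, requests []int64) bool {
	var sum int64
	for _, r := range requests {
		sum += r
	}
	return sum <= reportedBytes
}

// If "capacity" means maximum volume size, the backend might still provide
// both 50GB volumes, so rejecting based on the sum would be a false negative;
// only the per-volume check is safe.
func fitsAssumingMaxVolumeSize(reportedBytes int64, requests []int64) bool {
	for _, r := range requests {
		if r > reportedBytes {
			return false
		}
	}
	return true
}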


### CSI drivers without topology support

To simplify the implementation of external-provisioner, [topology
Member:

I'm confused. Topology is a capability in CSI so drivers have to opt-in to support this.

@pohly (Contributor, Author):

A certain @msau42 said that we can assume that CSI drivers support topology when they want to use storage capacity tracking 😝

I added this text in commit 86cf87d and removed some external-provisioner modes in response to your comment in https://github.com/kubernetes/enhancements/pull/1353/files#r369745671:

I'm not sure we should have so many different modes of operation. This will make maintaining the provisioner very complex.

I think we should try to simplify things and require that the plugin supports Topology in order to use this feature, and only have two modes of operation "local" and
"not local".

Member:

Yes that part makes sense. I'm confused by the later sentence that talks about a driver implementing topology when they don't need it. What is the use case for a driver wanting the capacity feature when they don't support topology? The scheduler is not going to make any different decisions if capacity is equal across all nodes.

@pohly (Contributor, Author):

Ah, now I see what you mean. Indeed, what the heck was I thinking of here? I don't remember. Let me remove the second sentence and if we find a use-case for it again, we can add back something better.

pohly added a commit to pohly/enhancements that referenced this pull request May 18, 2020
This is an attempt to capture all points that were raised during
review of kubernetes#1353.
@msau42 (Member) left a comment:

Has the KEP been updated with all the comments? When I look at the diff I still see the older text.


approach that integration into Kubernetes is managed as part of the
CSI driver deployment, ideally without having to modify the CSI driver
itself. This will work without changes for some storage systems while
others may have to support [a new CSI API
Member:

Let's remove this part about a generic solution until we have a more concrete use case.


#### CSIDriver.spec.storageCapacity

A new field `storageCapacity` of type `boolean` with default `false`
Member:

Do we need this new field? Or would the presence of the CSIStorageCapacity object be sufficient?

I guess without the field, then rollout would potentially not be as smooth? With the field, you could upgrade the DaemonSet to the new version and only flip the driver bit when there is capacity reported across all nodes.

@pohly (Contributor, Author):

Do we need this new field? Or would the presence of the CSIStorageCapacity object be sufficient?

Suppose there is not a single CSIStorageCapacity for the driver: is that because the driver does not support the feature, or because capacity information has not been created yet (driver still starting up)? The scheduler cannot distinguish between the two cases.

If it assumes that "no objects means no capacity", then it'll stop scheduling for drivers which don't support the feature.

If it assumes that "no objects means feature not supported", then it'll schedule instead of waiting for information. That will work eventually (it's just another variant of "acting on incomplete or out-dated information" and the recovery mechanisms need to deal with it), though, so perhaps that flag can indeed be dropped?

I guess without the field, then rollout would potentially not be as smooth? With the field, you could upgrade the DaemonSet to the new version and only flip the driver bit when there is capacity reported across all nodes.

I was thinking of a simpler deployment where the driver just gets installed once through a set of yamls and then Kubernetes itself waits for the driver to provide some capacity information, without an admin or operator implementing such an additional post-install step.
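
For reference, the field under discussion is a small addition to CSIDriverSpec. A sketch (the field name and false default come from the KEP text; the pointer encoding is an assumption):

package storage

// Sketch of the proposed addition to CSIDriverSpec.
type CSIDriverSpec struct {
	// ... existing fields such as attachRequired and podInfoOnMount ...

	// StorageCapacity indicates that the driver deployment will create
	// CSIStorageCapacity objects with capacity information and that the
	// scheduler should rely on that information when placing pods which
	// use volumes of this driver. Defaults to false.
	StorageCapacity *bool
}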

@pohly (Contributor, Author), Jun 3, 2020:

I now think that we don't need the new field, without any loss of functionality. Here's what we can do instead:

  • Support the manual creation of CSIStorageCapacity objects by ensuring that external-provisioner only deletes objects that it also created. This was implied by the design, but not called out explicitly yet. I expect that this will be useful also for the autoscaler because then a cluster admin can use such manually created objects to define (local) storage that currently non-existent nodes will have access to.
  • The scheduler blocks pod scheduling if it finds at least one CSIStorageCapacity object for the CSI driver and none that has enough capacity.
  • CSI driver deployments create one CSIStorageCapacity object, for example with the CSI driver name as name, if and only if the deployment is going to produce storage capacity information. That object can define zero available storage, topology doesn't matter. All that matters is that the object exists and has the CSI driver name.

@msau42: what do you think?

Member:

This may make upgrading a local storage driver in a large cluster difficult, say 5000 nodes. While the rolling update is happening, the scheduler will go from 5000 node choices quickly down to a handful. This could cause a lot of scheduling failures while this upgrade is happening. Ideally, we would want some way to switch over the behavior after the upgrade is complete.

@pohly (Contributor, Author), Jun 5, 2020:

@msau42: This is about upgrading from a driver deployment which has no support for storage capacity to one which does, right? In that case, storage capacity constraints will kick in once the first CSIStorageCapacity object appears, and you are worried that it would be better to let the driver continue with creating such objects for a while.

Let me augment my proposal to cover this: to opt into storage capacity scheduling, a driver deployment must create a CSIStorageCapacity with the driver name as object name. Whoever deploys the driver can then decide to create that object immediately or wait a bit.

@pohly (Contributor, Author):

We discussed this today and agreed that the CSIDriverSpec.StorageCapacity is the better API, so we'll keep the KEP as originally proposed and merged.


A new field `storageCapacity` of type `boolean` with default `false`
in
[CSIDriver.spec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.16/#csidriverspec-v1beta1-storage-k8s-io)
Member:

It's v1 as of 1.18.

@pohly (Contributor, Author):

Updated the link.

because the reported capacity is smaller than the size of a
volume. This will work better when CSI drivers implement `GetCapacity`
such that they consider constraints like fragmentation and report the
size that the largest volume can have at the moment.
Member:

Let's reference the CSI open issue. We need to resolve this ambiguity before beta.

This is a slight deviation from the CSI spec, which defines that
"`accessible_topology` specifies where the *node* is accessible
from". For node-local storage there is no difference, but for
network-attached storage there might be one.
Member:

I'm not sure what the problem is for network-attached storage. Could you give an example? This KEP is only exposing capacity information per topology where topology of the node aligns with the storage system's topology.

We're not handling scenarios where the storage system has a topology independent of Kubernetes nodes.

@msau42 (Member) commented May 18, 2020

Nevermind, I think I was looking at an older diff

@pohly (Contributor, Author) commented May 19, 2020

Various replies; GitHub no longer lets me reply directly to the comments:

Let's reference the CSI open issue. We need to resolve this ambiguity before beta.

Link added.

I think the original example is confusing because it seems like zone encompasses a region, which is the reverse of what most cloud providers do. Can we change the example to use a
more clear hierarchy, like region and rack?

Okay, done.

I'm not sure what the problem is for network-attached storage. Could you give an example? This KEP is only exposing capacity information per topology where topology of the node aligns with the storage system's topology.

I added a remark for the example that "If it uses region/rack, then the proposed approach below will
still work, but at the cost of creating redundant CSIStorageCapacity objects" and removed the comment about "deviation from the CSI spec". I still feel that we expect more from the driver than required by the spec, but it's not in violation of the spec, so it's fine.

@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch from c742793 to 0a3853f on May 19, 2020 12:58
A proposal for increasing the chance that the Kubernetes scheduler
picks a suitable node for pods using volumes that have capacity
limits.
@pohly pohly force-pushed the storage-capacity-constraints-for-pod-scheduling branch from 0a3853f to 556497a on May 19, 2020 13:03
@pohly (Contributor, Author) commented May 19, 2020

I've squashed the commits now

  • because we are (hopefully!) close to merging
  • this PR has reached exactly 50 commits, which is a nice, round number 😁

Today's changes were:

diff --git a/keps/sig-storage/1472-storage-capacity-tracking/README.md b/keps/sig-storage/1472-storage-capacity-tracking/README.md
index c2b874e..5898bb5 100644
--- a/keps/sig-storage/1472-storage-capacity-tracking/README.md
+++ b/keps/sig-storage/1472-storage-capacity-tracking/README.md
@@ -480,5 +480,5 @@ status:
 A new field `storageCapacity` of type `boolean` with default `false`
 in
-[CSIDriver.spec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.16/#csidriverspec-v1beta1-storage-k8s-io)
+[CSIDriver.spec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#csidriverspec-v1-storage-k8s-io)
 indicates whether a driver deployment will create `CSIStorageCapacity`
 objects with capacity information and wants the Kubernetes scheduler
@@ -506,5 +506,5 @@ The CSI spec up to and including the current version 1.2 just
 specifies that ["the available
 capacity"](https://github.com/container-storage-interface/spec/blob/314ac542302938640c59b6fb501c635f27015326/lib/go/csi/csi.pb.go#L2548-L2554)
-is to be returned by the driver. It is left open whether that means
+is to be returned by the driver. It is [left open](https://github.com/container-storage-interface/spec/issues/432) whether that means
 that a volume of that size can be created. This KEP uses the reported
 capacity to rule out pools which clearly have insufficient storage
@@ -540,13 +540,11 @@ matches the storage pool(s) that it has access to, with granularity
 that matches the most restrictive pool.
 
-For example, if the driver runs in a node with zone/region/rack
-topology and has access to one per-zone pool and one per-zone/region
-pool, then the driver should report the zone/region of the second pool
-as its topology.
-
-This is a slight deviation from the CSI spec, which defines that
-"`accessible_topology` specifies where the *node* is accessible
-from". For node-local storage there is no difference, but for
-network-attached storage there might be one.
+For example, if the driver runs in a node with region/rack topology
+and has access to per-region storage as well as per-rack storage, then
+the driver should report topology with region/rack as its keys. If it
+only has access to per-region storage, then it should just use region
+as key. If it uses region/rack, then the proposed approach below will
+still work, but at the cost of creating redundant `CSIStorageCapacity`
+objects.
 
 Assuming that a CSI driver meets the above requirement and enables
@@ -969,7 +967,5 @@ scheduler extender varies between clusters.
 To simplify the implementation of external-provisioner, [topology
 support](https://kubernetes-csi.github.io/docs/topology.html) is
-expected from a CSI driver. A driver which does not really need
-topology support can add it simply by always returning the same static
-`NodeGetInfoResponse.AccessibleTopology`.
+expected from a CSI driver.

Full history before squashing is here: https://github.com/pohly/enhancements/commits/storage-capacity-2020-05-19

@msau42 (Member) commented May 19, 2020

/lgtm
/approve
Thanks for working through all the comments and adapting all the feedback!

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2020
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 19, 2020
@msau42 (Member) commented May 19, 2020

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 19, 2020
@k8s-ci-robot k8s-ci-robot merged commit 4c9fc3c into kubernetes:master May 19, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone May 19, 2020
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/design Categorizes issue or PR as related to design. kind/feature Categorizes issue or PR as related to a new feature. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/storage Categorizes an issue or PR as relevant to SIG Storage. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.