Storage Capacity Constraints for Pod Scheduling #1353
Conversation
Force-pushed fb6c15e to 89ec735.
@cdickmann just created #1347, which is a similar proposal but with a different API and different goals. We learned about each other's work last week and agreed that the best approach for reconciling the two proposals would be to create KEP PRs and then use the KEP process to discuss them. The goal has to be to define one API extension which works for both purposes (the Kubernetes scheduler here, more intelligent operators in #1347).
Force-pushed 89ec735 to 2560793.
keps/sig-storage/20191031-storage-capacity-constraints-for-pod-scheduling.md
Outdated
If I compare this KEP to ours (#1347, hopefully my CLA issue is resolved today, waiting for tech support), then I see some key differences:
I feel it is useful to separate two topics:
True. The downside is that it is also more complicated and can't be implemented without extensions to the CSI standard. The proposal here was meant to work with just the existing mechanisms (CSI topology, …).
Agreed.
I still need to think about your proposal and how the Kubernetes scheduler itself could make use of it with a CSI driver that hasn't been modified.
After going through your proposal once more, I am not so sure about that anymore. It seems to be focused exclusively on a flow where a PVC is somehow tied to a storage pool and then gets provisioned from that pool. The generic late-binding case (a Pod uses a PVC which refers to a storage class with the WaitForFirstConsumer binding mode) doesn't seem to be covered.
As explained in my previous comment, #1353 (comment), I think the answer is no. Let me turn the question around: can you imagine basing your extension on the revised API in this KEP (see below)? 😄 Obviously the original API was too focused on just capacity tracking. Thanks to your KEP, additional use cases became clearer, and I have now tried to come up with a more general API that can be extended to also cover those - see e0e7c43. At this point, it's a 1:1 renaming of what I had before plus some additional flexibility regarding what information must be provided. In this PR I'd prefer to keep it at that level to ensure that it remains small enough to make progress. If you think that this goes in the right direction, then I could try to come up with a revision of your KEP that is based on this one.
keps/sig-storage/20191031-storage-capacity-constraints-for-pod-scheduling.md
Outdated
This is a problem for deployments that do support it, but haven’t been
able to provide the necessary information yet. TODO: extend
CSIDriverInfo with a field that tells the scheduler to retry later
when information is missing? Avoid the situation by (Somehow? How?!)
I might have found a solution for this:
- the driver deployment creates the top-level storage info object
- external-provisioner just updates it while the object exists
- removing the driver also removes the object
This way the scheduler knows that if there is an object, it has to schedule based on the information contained in it. The information might still be stale or incomplete, though.
Or even simpler: add a CSIDriver.Status field with all the dynamic information.
I switched over to that. Conceptually it makes sense to me. But other APIs would also work.
Force-pushed 16a69b7 to fc24335.
I think between our 2 KEPs we are converging. Our use cases have some overlap, but are not 100% the same. Yet, at the core of our solutions we have a concept of a StoragePool, which represents a logical grouping of storage which is accessible by a set of nodes, and which has metadata like capacity. Patrick's KEP has a nice idea where capacity is reported per-storage-class because the storage class can actually change how much capacity gets used, and I agree with this idea. We can build our KEP on top of the same K8s API that Patrick is proposing, even though we likely need to extend it, and we also need CSI changes. Patrick covers the following cases that we do not need for our use case, but that don't conflict:
Our KEP covers the following which isn't in Patrick's KEP:
So we can easily converge here, and to me the path forward comes down to community guidance on:
Maybe @saad-ali, @thockin, and others can give their guidance on these questions?
// cluster. That subset can be identified either via NodeTopology or
// NodeList, but not both. If neither is set, the pool is assumed
// to be available in the entire cluster.
type CSIStoragePool struct {
For local storage in a very large cluster, is it too long for users to read the information in one API object? It seems we need to have a CSIStoragePool struct for each node.
And also, is it reasonable to organize the storage pool by storage class? I agree it is good for network storage, but it seems not that good for local storage.
For local storage in a very large cluster, is it too long for users to read the information in one API object?
That may become an issue, yes.
It seems we need to have a CSIStoragePool struct for each node.
But that's then not meant to be readable by users, is it? As discussed in #1347 (comment), the name of those objects most likely will have to be generated such that they don't mean anything.
What about storing the information as proposed in this KEP, but then not show it in kubectl describe? kubectl describe pods does that for spec.affinity. It's still available to users when using kubectl get -o yaml. I'm not sure how that is implemented and whether it is a good idea, but I at least wanted to bring it up... 🤷♂️
And also, is it reasonable to organize the storage pool by storage class? I agree it is good for network storage, but it seems not that good for local storage.
I don't follow here. Are you saying that local storage can ignore storage classes? That's not necessarily the case. The example where available usable storage depends on storage class parameters is LVM with striped vs. mirrored mode selected via the storage class. That is local storage.
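To make that concrete, here is a minimal sketch of what such a pair of storage classes could look like; the driver name and the "mode" parameter are invented for the example, not taken from any real driver:

```yaml
# Hypothetical LVM-based CSI driver: the "mode" parameter changes how much
# raw capacity a volume of a given size consumes (mirroring doubles it).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lvm-striped
provisioner: lvm.csi.example.org        # illustrative driver name
parameters:
  mode: striped
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: lvm-mirrored
provisioner: lvm.csi.example.org
parameters:
  mode: mirrored                        # consumes twice the raw capacity per volume
volumeBindingMode: WaitForFirstConsumer
```

With classes like these, a GetCapacity call for the same topology segment can legitimately return different values depending on which class is passed in.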
But that's then not meant to be readable by users, is it?
Yes, exactly.
What about storing the information as proposed in this KEP, but then not show it in kubectl describe? kubectl describe pods does that for spec.affinity. It's still available to users when using kubectl get -o yaml
I guess users may want to know the topology info just like capacity and allocatable in node status.
Are you saying that local storage can ignore storage classes? That's not necessarily the case. The example where available usable storage depends on storage class parameters is LVM with striped vs. mirrored mode selected via the storage class. That is local storage.
Yeah, actually, for local storage we just need one storage class with some necessary parameters in the cluster. If so, as proposed here, each node with local storage is a storage pool... their storage class may be all the same. I am not sure whether that is reasonable, it just feels weird.
It seems StorageClass and StoragePool do not have a strict mapping relation. Why do we need to add StorageClass to the StoragePool object?
Or could we limit the StoragePool more strictly, and make it a 1:1 mapping with StorageClass?
I guess users may want to know the topology info just like capacity and allocatable in node status.
So you are arguing that we should put storage information into the Node.Status? That only makes sense for one particular use case (local storage). I don't think we should develop a special solution just for that.
Yeah, actually, for local storage, we just need one storage class with some necessary parameters in the cluster.
That may be the case for some drivers, but not all. We need a solution that is flexible enough to cover different scenarios. A driver with multiple different storage classes and capacity that depends on the storage class is one of those scenarios.
If a driver does not need that flexibility, it can produce information with just storageClassName: <fallback>, as in the "local storage" example:
status:
  storage:
  - classes:
    - capacity: 256G
      storageClassName: <fallback>
    name: node-1
    nodeList:
    - node-1
Or could we limit the StoragePool more strictly, and make it a 1:1 mapping with StorageClass?
Isn't that the special case above?
Especially user-friendly when using kubectl describe...
But how does the user find the right object to call kubectl describe for? How user-friendly is that part going to be?
Since our cluster is quite large, we care about the performance a lot.
That part is indeed crucial. But it also depends on the usage pattern of the performance-critical parts. So let's consider that aspect for the worst case, node-local storage. With the current proposal (CSIDriver.Status), there will be one producer per node, sending updates to its part of the status via a patch operation. There's one consumer (the scheduler). It should preferably not create a watch, because otherwise it'll constantly get updates even when there's nothing to schedule. Instead, it can retrieve the CSIDriver each time it is needed. I'm assuming here that scheduling pods with storage is going to be rare compared to scheduling pods without storage.

With CSIStoragePool, the writers will send a complete Status update for one object. There's still only one consumer (or two in your case?). That scheduler now must retrieve all objects with a certain CSIStoragePool.Spec.DriverName value each time it wants to schedule a pod with a volume using that driver. Because individual objects are smaller, a watch might make more sense here, although the argument about getting updates when they aren't needed still holds.
Does that sound right?
It's not obvious to me which approach will lead to better performance. I was hoping that some apiserver experts have already measured something like this and could share their experience. If that doesn't happen, then I have to proceed with prototyping both APIs and writing some microbenchmark to get some real numbers.
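To make the comparison easier to follow, here is a rough sketch of the two shapes being discussed, reusing the field names from the yaml example above; the layouts, the driver name, and the storage class name are all illustrative only:

```yaml
# Option 1 (CSIDriver.Status): a single object, each node patches its own entry.
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.driver.io
status:                               # hypothetical status section
  storage:
  - name: node-1
    nodeList: [node-1]
    classes:
    - storageClassName: fast
      capacity: 256G
  - name: node-2
    nodeList: [node-2]
    classes:
    - storageClassName: fast
      capacity: 512G
---
# Option 2 (CSIStoragePool): one small object per pool/node, updated independently.
apiVersion: storage.k8s.io/v1alpha1   # hypothetical group/version
kind: CSIStoragePool
metadata:
  name: pool-node-2                   # the real name would be generated
spec:
  driverName: example.csi.driver.io
  nodeList: [node-2]
status:
  classes:
  - storageClassName: fast
    capacity: 512G
```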
But how does the user find the right object to call kubectl describe for? How user-friendly is that part going to be?
For local storage, its (CSIStoragePool) name can be set to node name
Does that sound right?
Hmm, sorry for the unclear expression. What I want to clarify here is: we modify the k8s scheduler part. We create several schedulers and make them work in parallel. Each scheduler is only responsible for one partition (pods with a specific NodeSelector and a subset of the nodes with a specific label), and the schedulers cannot share resources (nodes).
For the predicate functions, we need to make sure allocatable storage >= Pod (PVC) request.
If we store all local storage info in one object, all schedulers need to look it up, therefore we need a lock or something like that. But if we create separate objects, each scheduler only cares about a subset of the CSIStoragePool objects; since the schedulers cannot share node resources, they can work in parallel and do not need locks.
I know the solution is not general enough. The schedulers will have to share some other information. But at least for local storage it is OK. For network storage, we can shard the storage pools manually and make them fit multiple schedulers too.
For local storage, its (CSIStoragePool) name can be set to node name
This only works under the assumption that only one CSI driver is providing capacity information for that node. But indeed, if this assumption holds, then individual objects may work better. Thanks for clarifying this.
I spoke with Jordan and he agrees that if the objects will end up scaling with the number of nodes, then it is preferable to have separate objects. The main issues with a single object:
- any update to that single object, no matter how small, ends up sending the whole API object to all the watchers. This will not scale in large clusters.
- when an informer gets an update, it gets the whole object, you can't see only the small delta that has changed, so you end up needing to iterate through the entire object to process a small update.
Separating it out into separate objects means more work on the consumer end, especially to build up caches, but it substantially reduces API server load in large clusters.
Both of these points apply to a usage pattern that does not necessarily occur with the implementation outlined in this KEP. But as @NickrenREN explained, they might want to use the API in a different way where they do apply, so I'm fine with switching to individual objects. I'll update the KEP accordingly.
Force-pushed fb9e14b to 0401cb1.
Force-pushed 07a1619 to 5c43a13.
Agreed, it allowed me to remove quite a few special cases. Now we just need to finish "generic inline volumes"... 😁
keps/sig-storage/1472/README.md
Outdated
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
Can you check the ones already done?
I'm not sure which ones I can consider done. I've marked those where I am relatively sure.
keps/sig-storage/1472/README.md
Outdated
### Non-Goals

* Only CSI drivers will be supported.
The way this sentence is written, it is a "Goal", not a "Non-Goal" :).
Updated. I always struggle with avoiding double-negation in this section.
keps/sig-storage/1472/README.md
Outdated
- Gather feedback from developers and users
- Integration with [Cluster Autoscaler](https://github.com/kubernetes/autoscaler)
- Generic solution for identifying storage pools
Can you reference this KEP here (#1347)?
Also add that we need to figure out how to do spreading. Maybe it is covered by storage pools but it should be called out.
Done.
Here's the recording of the meeting on 5/13: https://www.youtube.com/watch?v=Xj7ebO-Hqrw
keps/sig-storage/1472/README.md
Outdated
capacity information, store it in the API server, and then use that
information in the scheduler. That information then flows
through different components:
1. storage backend
nit: s/storage/Storage
Fixed.
keps/sig-storage/1472/README.md
Outdated
### API

#### CSIStorageCapacity
I still think this name is not right and CSIStoragePlacement is a better name, but we can discuss it during the implementation.
"Placement" of pods or storage? The former is what the API is used for, the later sounds odd. But yes, let's discuss separately.
keps/sig-storage/1472/README.md
Outdated
//
// +optional
Capacity *resource.Quantity
}
NodeTopology, StorageClassName, Capacity should still be in the Status as they can change.
I consider "Capacity" as the only mutable field. If topology or storage classes change, then new objects get created and those that are no longer relevant get removed.
This is part of the move away from CSIStoragePool. That entity had varying topology, but CSIStorageCapacity objects are defined by the input parameters of GetCapacity (driver, storage class, topology) and provide the output value (capacity).
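As a rough illustration of that split (the spec/status layout, group/version and exact field names follow the draft under discussion and are assumptions, not the final API):

```yaml
apiVersion: storage.k8s.io/v1alpha1    # assumption
kind: CSIStorageCapacity
metadata:
  name: csisc-a1b2c3                   # generated, carries no meaning
spec:
  driverName: example.csi.driver.io    # GetCapacity input: driver
  storageClassName: lvm-striped        # GetCapacity input: storage class parameters
  nodeTopology:                        # GetCapacity input: topology segment
    matchLabels:
      topology.example.org/rack: rack-1
status:
  capacity: 256Gi                      # GetCapacity output, the only mutable field
```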
keps/sig-storage/1472/README.md
Outdated
// The storage class name of the StorageClass which provided the
// additional parameters for the GetCapacity call.
StorageClassName string
Is this required or optional?
Now it is required. I was proposing an optimization earlier ("Storage class parameters that never affect capacity") where it would have been optional, but that was seen as too complicated.
keps/sig-storage/1472/README.md
Outdated
// might get added. If not set, the scheduler will simply ignore the
// CSIStorageCapacity object.
//
// +optional
Shouldn't this field be required? Without capacity, how is CSIStorageCapacity useful?
This is meant to enable future extensions. Once we introduce "total capacity" and "maximum volume size", this API would become:
    Capacity          *resource.Quantity
    TotalCapacity     *resource.Quantity
    MaximumVolumeSize *resource.Quantity
At least one will be required then.
Let me change the "optional" into "required", because right now it is.
keps/sig-storage/1472/README.md
Outdated
5GiB each, so accepting the node for two such volumes could be a false
positive.

More promising might be to add prioritization of nodes based on how
By saying "prioritization of nodes", it seems that you are talking about local storage here?
Networked storage might also be available only on a subset of the cluster. So prioritization may also be applicable there.
keps/sig-storage/1472/README.md
Outdated
* **Is the rollout accompanied by any deprecations and/or removals of features,
APIs, fields of API types, flags, etc.?**
Even if applying deprecation policies, they may still surprise some users.
I think the question is by introducing this CSIStorageCapacity feature, will any existing features get deprecated or removed.
keps/sig-storage/1472/README.md
Outdated
* **How can an operator determine if the feature is in use by workloads?**
Ideally, this should be a metric. Operations against Kubernetes API (e.g.
checking if there are objects with field X set) may be last resort. Avoid
logs or events for this purpose.
I think they can find out whether this feature is being used by querying CSIStorageCapacity objects?
keps/sig-storage/1472/README.md
Outdated
- Usage description:
- Impact of its outage on the feature:
- Impact of its degraded performance or high error rates on the feature:
It seems that you have copied some paragraphs in the example KEP but didn't make any changes to them. I think if they are not relevant, then they should just be removed.
Yes, I copied everything but left the part that says "must be completed when targeting beta graduation" for later without making changes there. I can take that out.
keps/sig-storage/1472/README.md
Outdated
[supported limits]: https://git.k8s.io/community//sig-scalability/configs-and-limits/thresholds.md
[existing SLIs/SLOs]: https://git.k8s.io/community/sig-scalability/slos/slos.md#kubernetes-slisslos
In general, this readiness section needs more work. If the example paragraphs copied from the example KEP are not relevant, please remove them. Either write down your answers related to this feature or leave them blank if not relevant.
I kept the headlines and added comments saying "Will be added before the transition to beta."
The verify failure is from "./hack/verify-kep-metadata.sh", not a typo. That means there's some formatting issue with kep.yaml.
keps/sig-storage/1472/README.md
Outdated
type CSIStorageCapacitySpec struct {
// The CSI driver that provides the storage.
// This must be the string returned by the CSI GetPluginName() call.
DriverName string
Strictly speaking, the driver name is redundant here. It was a left-over from the time when storage class was optional (CSI ephemeral volumes, the optimization for "storage class does not affect capacity").
But after having thought a bit about it, I think it is worthwhile keeping because it enables the listing of all objects for one driver, something that the external-provisioner has to do.
keps/sig-storage/1472/README.md
Outdated
// Depending on how the driver is implemented, this might be the total
// size of the available storage which is only available when allocating
// multiple smaller volumes ("total available capacity") or the
// actual size that a volume may have ("maximum volume size").
+1
keps/sig-storage/1472/README.md
Outdated
to rely on that information when making scheduling decisions that
involve volumes that need to be created by the driver.

If not set, the scheduler makes such decisions without considering
Or set to false?
Yes, good catch.
I've updated the KEP, please have another look.
keps/sig-storage/1472/README.md
Outdated
jurisdiction) availability may still be limited to a subset of the
nodes in a cluster.

#### Custom schedulers
IIUC, this use case is about being able to leverage the new API to build a custom scheduler extension if the current implementation doesn't suit a plugin's needs? If that's the case, then you may need to disable the built-in plugin? @ahg-g is that possible through a config, or would the scheduler have to be rebuilt?
When choosing a custom scheduler for a pod through PodSpec.SchedulerName, only the custom scheduler is going to dispatch the pod, so there's no need to disable the feature in the default scheduler. It's still going to be enabled in case some other pod uses the default scheduler and has volumes, though.
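For reference, a pod opts into a custom scheduler purely through that field; everything else in this sketch is an arbitrary example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  schedulerName: my-custom-scheduler   # only this scheduler dispatches the pod
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: example-pvc           # volume whose capacity the scheduler considers
```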
keps/sig-storage/1472/README.md
Outdated
matches the storage pool(s) that it has access to, with granularity
that matches the most restrictive pool.

For example, if the driver runs in a node with zone/region/rack
I'm not sure I understand this example. Can we have something more concrete?
Let's start with a simple example. Say we have a storage driver that supports zonal disks, and the cluster has nodes across many zones a, b, and c. Each NodeGetInfo call from the driver daemonset is going to return topology "zone" = "a", "b", or "c", and report it in the CSINode object.
I believe this example is trying to convey what to do if a driver returns different topology keys on different nodes? For example, for nodeA, it returns "zone" = "a", but for nodeB, it returns ("zone" = "A" AND "region" = "X").
Can we just treat nodeB's topology as another unique topology value different from nodeA's topology, and expect the driver to be able to report an appropriate value?
Suppose the storage is local to a region. There's storage with label region=west and storage with region=east. Now in each region we have data centers. As long as NodeGetInfo returns region=east or region=west for each node, depending on where it is located, the proposed algorithm works.
Now suppose that in addition to storage in each region, each data center also has storage. Some storage classes select the region storage, others the faster storage in the data center.
The storage in a New York data center then has "region=east,center=NewYork" as topology, and NodeGetInfo for a node in that center has to return that.
The proposed algorithm will then still work: it will use region+center for the topologies and create CSIStorageCapacity objects with that topology also for region storage.
What isn't okay is when NodeGetInfo only reports region, because that is not detailed enough for the storage in the centers.
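A small sketch of what that means on the Kubernetes side (driver name and topology keys are invented): the NodeGetInfo response of a node in the New York data center ends up as topology keys in its CSINode entry, and those keys define the granularity at which capacity objects get created.

```yaml
apiVersion: storage.k8s.io/v1
kind: CSINode
metadata:
  name: node-ny-1
spec:
  drivers:
  - name: example.csi.driver.io
    nodeID: node-ny-1
    topologyKeys:                     # must include the most restrictive level
    - topology.example.org/region     # value for this node would be "east"
    - topology.example.org/center     # value for this node would be "NewYork"
```

Capacity objects for the region-wide storage are then also created per region+center combination, which is redundant but still correct.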
Thanks for clarifying. I think the original example is confusing because it seems like zone encompasses a region, which is the reverse of what most cloud providers do. Can we change the example to use a clearer hierarchy, like region and rack?
keps/sig-storage/1472/README.md
Outdated
#### With central controller, generic solution

For all other cases, a new CSI call to enumerate storage topology
This section is very vague. Is there an example of such a use case that won't work with the above two? Let's leave it out of the proposal or put it under open issues if we don't have a concrete plan here.
I can take it out until we identify a driver where the proposed approach doesn't work.
Yes let's remove it for now
keps/sig-storage/1472/README.md
Outdated
- compute the union of the topology segments from these
  `CSINodeDriver` entries.

For each entry in that union, one `CSIStorageCapacity` object is
How is the name of the object generated to ensure uniqueness? Does it need to be deterministic so we can update it later?
The name can't be deterministic. We have to use a UUID for the name, similar to how PVs are named.
Updating is still possible because external-provisioner has to know about all objects for its driver and then can match by topology and storage class name.
PVs names are deterministically generated so that if we call CreateVolume multiple times for the same PVC, it will get called with the same name.
Searching through every object for every predicate call, or every update in external-provisioner is not ideal, especially for a local volume driver that's going to have an entry per node. Is there a way we cache the mappings (maybe keyed by node) to improve the lookups?
PVs names are deterministically generated so that if we call CreateVolume multiple times for the same PVC, it will get called with the same name.
That relies on the unique ID of the PVC. We have no such UID here, just a bunch of attributes. We could hash them, but then we'll have to deal with hash collisions.
Internally, both external-provisioner and kube-scheduler will have to use caching and more efficient data structures to find the information that they need.
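As a sketch of such a non-deterministic name: the KEP text talks about UUID-based names similar to PVs; the standard metadata.generateName mechanism shown here is just one way to get a meaningless unique name, and all other values are invented:

```yaml
apiVersion: storage.k8s.io/v1alpha1    # assumption
kind: CSIStorageCapacity
metadata:
  generateName: csisc-                 # apiserver appends a random suffix
spec:
  # consumers never look an object up by name; they match on these attributes
  driverName: example.csi.driver.io
  storageClassName: lvm-striped
  nodeTopology:
    matchLabels:
      topology.example.org/rack: rack-1
```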
keps/sig-storage/1472/README.md
Outdated
name + nodes + storage class) and then update that one
instead. Obsolete objects need to be removed.

To ensure that `CSIStorageCapacity` objects get removed when the driver
What if topologies disappear while the driver is still running? For example, an entire rack gets decommissioned? There probably still has to be some logic that scans the current topologies from CSINode and reconciles with existing StorageCapacity objects.
Yes, that's one of the situations where external-provisioner has to remove stale objects ("While external-provisioner runs, it needs to update and potentially delete CSIStorageCapacity objects: ... when nodes change (for central provisioning)").
keps/sig-storage/1472/README.md
Outdated
While external-provisioner runs, it needs to update and potentially
delete `CSIStorageCapacity` objects:
- when nodes change (for central provisioning)
- when storage classes change (for persistent volumes)
what does "for persistent volumes" mean in this context?
Let me remove it. It's a left-over from the time when ephemeral volumes were treated as a special case.
keps/sig-storage/1472/README.md
Outdated
delete `CSIStorageCapacity` objects:
- when nodes change (for central provisioning)
- when storage classes change (for persistent volumes)
- when volumes were created or deleted (for central provisioning)
how come this is not a concern for local volume provisioning?
It is. I'll remove the condition.
keps/sig-storage/1472/README.md
Outdated
The lookup sequence will be:
- find the `CSIDriver` object for the driver
- check whether it has `CSIDriver.spec.storageCapacity` enabled
- find all `CSIStorageCapacity` objects that have the right spec
Would there be multiple valid objects, or just one?
There should only be one, otherwise there's conflicting information. We can make the scheduler check for that error, if we want.
keps/sig-storage/1472/README.md
Outdated
capacity that is too small is considered unusable at the moment and
ignored.

Each volume gets checked separately, independently of other volumes
As a best effort, can we at least sum up the capacities for multiple volumes in the same Pod? This can at least reduce the chance of running into partial binding failures.
As a local volume example, if the node says it has "50GB" left, and a pod requests two local volumes of "50GB" each, then we should be fairly confident that the pod definitely cannot fit.
That requires summing up the capacity of all the requested volumes and making a decision at the same time? But the create volume requests may not all come at exactly the same time.
As a local volume example, if the node says it has "50GB" left, and a pod requests two local volumes of "50GB" each, then we should be fairly confident that the pod definitely cannot fit.
This makes assumptions about the semantic of "capacity", something that we wanted to avoid. It also assumes that "capacity" == "total capacity". It's erring on the wrong side if the driver reports "maximum volume size".
Let's first build something that doesn't use heuristics and then check how well it works before introducing additional logic, okay?
keps/sig-storage/1472/README.md
Outdated
### CSI drivers without topology support

To simplify the implementation of external-provisioner, [topology
I'm confused. Topology is a capability in CSI so drivers have to opt-in to support this.
A certain @msau42 said that we can assume that CSI drivers support topology when they want to use storage capacity tracking 😝
I added this text in commit 86cf87d and removed some external-provisioner modes in response to your comment in https://github.com/kubernetes/enhancements/pull/1353/files#r369745671:
I'm not sure we should have so many different modes of operation. This will make maintaining the provisioner very complex.
I think we should try to simplify things and require that the plugin supports Topology in order to use this feature, and only have two modes of operation, "local" and "not local".
Yes that part makes sense. I'm confused by the later sentence that talks about a driver implementing topology when they don't need it. What is the use case for a driver wanting the capacity feature when they don't support topology? The scheduler is not going to make any different decisions if capacity is equal across all nodes.
Ah, now I see what you mean. Indeed, what the heck was I thinking of here? I don't remember. Let me remove the second sentence and if we find a use-case for it again, we can add back something better.
This is an attempt to capture all points that were raised during review of kubernetes#1353.
Has the KEP been updated with all the comments? When I look at the diff I still see the older text.
keps/sig-storage/1472/README.md
Outdated
approach that integration into Kubernetes is managed as part of the
CSI driver deployment, ideally without having to modify the CSI driver
itself. This will work without changes for some storage systems while
others may have to support [a new CSI API
Let's remove this part about a generic solution until we have a more concrete use case.
keps/sig-storage/1472/README.md
Outdated
#### CSIDriver.spec.storageCapacity

A new field `storageCapacity` of type `boolean` with default `false`
Do we need this new field? Or would the presence of the CSIStorageCapacity object be sufficient?
I guess without the field, then rollout would potentially not be as smooth? With the field, you could upgrade the DaemonSet to the new version and only flip the driver bit when there is capacity reported across all nodes.
Do we need this new field? Or would the presence of the CSIStorageCapacity object be sufficient?
Suppose there is not a single CSIStorageCapacity for the driver: is that because the driver does not support the feature, or because capacity information has not been created yet (driver still starting up)? The scheduler cannot distinguish between the two cases.
If it assumes that "no objects means no capacity", then it'll stop scheduling for drivers which don't support the feature.
If it assumes that "no objects means feature not supported", then it'll schedule instead of waiting for information. That will work eventually (it's just another variant of "acting on incomplete or out-dated information" and the recovery mechanisms need to deal with it), though, so perhaps that flag can indeed be dropped?
I guess without the field, then rollout would potentially not be as smooth? With the field, you could upgrade the DaemonSet to the new version and only flip the driver bit when there is capacity reported across all nodes.
I was thinking of a simpler deployment where the driver just gets installed once through a set of yamls and then Kubernetes itself waits for the driver to provide some capacity information, without an admin or operator implementing such an additional post-install step.
I now think that we don't need the new field, without any loss of functionality. Here's what we can do instead:
- Support the manual creation of CSIStorageCapacity objects by ensuring that external-provisioner only deletes objects that it also created. This was implied by the design, but not called out explicitly yet. I expect that this will be useful also for the autoscaler because then a cluster admin can use such manually created objects to define (local) storage that currently non-existent nodes will have access to.
- The scheduler blocks pod scheduling if it finds at least one CSIStorageCapacity object for the CSI driver and none that has enough capacity.
- CSI driver deployments create one CSIStorageCapacity object, for example with the CSI driver name as name, if and only if the deployment is going to produce storage capacity information. That object can define zero available storage, topology doesn't matter. All that matters is that the object exists and has the CSI driver name.
@msau42: what do you think?
This may make upgrading a local storage driver in a large cluster difficult, say 5000 nodes. While the rolling update is happening, the scheduler will go from 5000 node choices quickly down to a handful. This could cause a lot of scheduling failures while this upgrade is happening. Ideally, we would want some way to switch over the behavior after the upgrade is complete.
@msau42: This is about upgrading from a driver deployment which has no support for storage capacity to one which does, right? In that case, storage capacity constraints will kick in once the first CSIStorageCapacity object appears, and you are worried that it would be better to let the driver continue with creating such objects for a while.

Let me augment my proposal to cover this: to opt into storage capacity scheduling, a driver deployment must create a CSIStorageCapacity with the driver name as object name. Whoever deploys the driver can then decide to create that object immediately or wait a bit.
We discussed this today and agreed that the CSIDriverSpec.StorageCapacity is the better API, so we'll keep the KEP as originally proposed and merged.
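So the opt-in remains a single flag on the driver object, roughly like this (the driver name is made up; storageCapacity is the field proposed by this KEP):

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: example.csi.driver.io
spec:
  attachRequired: false
  podInfoOnMount: false
  storageCapacity: true                # tell the scheduler to wait for capacity info
```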
keps/sig-storage/1472/README.md
Outdated
A new field `storageCapacity` of type `boolean` with default `false` in
[CSIDriver.spec](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.16/#csidriverspec-v1beta1-storage-k8s-io)
It's v1 as of 1.18.
Updated the link.
keps/sig-storage/1472/README.md
Outdated
because the reported capacity is smaller than the size of a
volume. This will work better when CSI drivers implement `GetCapacity`
such that they consider constraints like fragmentation and report the
size that the largest volume can have at the moment.
Let's reference the CSI open issue. We need to resolve this ambiguity before beta.
keps/sig-storage/1472/README.md
Outdated
This is a slight deviation from the CSI spec, which defines that
"`accessible_topology` specifies where the *node* is accessible
from". For node-local storage there is no difference, but for
network-attached storage there might be one.
I'm not sure what the problem is for network-attached storage. Could you give an example? This KEP is only exposing capacity information per topology where topology of the node aligns with the storage system's topology.
We're not handling scenarios where the storage system has a topology independent of Kubernetes nodes.
Nevermind, I think I was looking at an older diff.
Various replies; GitHub no longer lets me reply directly to the comments:
Link added.
Okay, done.
I added a remark for the example that "If it uses region/rack, then the proposed approach below will …"
Force-pushed c742793 to 0a3853f.
A proposal for increasing the chance that the Kubernetes scheduler picks a suitable node for pods using volumes that have capacity limits.
Force-pushed 0a3853f to 556497a.
I've squashed the commits now.
Today's changes were:
Full history before squashing is here: https://github.com/pohly/enhancements/commits/storage-capacity-2020-05-19
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: msau42, pohly
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/hold cancel
This KEP explains how CSI drivers can expose how much capacity a storage system has available via the API server. This information can then be used by the Kubernetes Pod scheduler to make more intelligent decisions when placing pods with volumes that use the WaitForFirstConsumer binding mode.

Enhancement issue: #1472