scale up: tests and support for pods with volumes #3887
Conversation
// unbound PVC with late binding can be scheduled only if the
// node candidates are known to have capacity available.
// This can be configured for node candidates by giving them
// special labels (TODO: how?) and then manually creating
This is the key conceptual question that I have: when cluster autoscaler creates node candidates, is it configurable which node labels those have?
This is not just relevant for the CSIStorageCapacity object below, but also for topology-aware pod scheduling. Suppose a PV has been created with topology constraints as provided by a CSI driver. How can the scheduler know that the node candidate is in the same topology segment as the existing volume? That's based on node labels, so those must get set somehow.
CA will copy an existing node to use as a node template. Any topology labels should be copied as well (CA assumes a single NodeGroup is made of nodes that are identical for scheduling purposes, ie. they're in one zone).
For scale from 0 it's up to cloudprovider TemplateNodeInfo to provide a template node that has all the relevant labels.
Looks like I need to read up on node groups. Do you perhaps have a link to documentation about how to configure those?
CA will copy an existing node to use as a node template.
How does it pick that node?
Any topology labels should be copied as well (CA assumes a single NodeGroup is made of nodes that are identical for scheduling purposes, ie. they're in one zone).
Hmm, I can imagine that it might be more complex. For example, the same zone (= data center) might be able to provide machines with different hardware features (for example, optional CPU instructions like AVX) and those features might be exposed as node labels.
When considering node-local storage, each node is in its own topology zone. Cloning the labels of an existing node would be wrong for the labels set for a CSI driver on that node.
Looks like I need to read up on node groups. Do you perhaps have a link to documentation about how to configure those?
A NodeGroup is an internal cluster autoscaler concept for a resizable set of identical nodes. How it maps to underlying infrastructure will be different for each cloud provider (ex. on AWS a NodeGroup is an ASG, on GCP it's a ManagedInstanceGroup) and the way to set it up will also differ by cloudprovider. The instructions for most providers can be found in cluster-autoscaler/cloudprovider/<provider_name>/README.md.
How does it pick that node?
Randomly. The assumption is that a NodeGroup is a set of identical nodes, so it doesn't matter which one CA uses as a template for representing future nodes.
Hmm, I can imagine that it might be more complex. For example, the same zone (= data center) might be able to provide machines with different hardware features (for example, optional CPU instructions like AVX) and those features might be exposed as node labels.
Two nodes using different hardware should be part of different NodeGroups. Cluster Autoscaler can autoscale a cluster with multiple NodeGroups.
A very high level description of how CA works is (a rough Go sketch follows this list):
- A cluster is made of different NodeGroups, which represent some sort of resizable collection of VMs in the underlying cloud. All CA ever does to actuate its decisions is resize a NodeGroup. The expectation is that the underlying cloud will in that case provide more identical VMs and those VMs will self-register in the cluster.
- A cluster can be made of multiple NodeGroups. Some or all of those may be autoscaled.
- If multiple NodeGroups are autoscaled, CA will run a separate binpacking simulation for each of them.
- To perform a simulation it will create an in-memory Node object that represents a new node that can be added to the cluster. It will then use embedded scheduler code to "schedule" pending pods on this node in memory. CA will keep adding more in-memory nodes (copies of the original template node) until all the pending pods can be "scheduled". The outcome of this simulation is how many nodes would need to be added to a particular NodeGroup to make all the pending pods schedulable.
- Once the simulation is done, CA will reset it by deleting all in-memory nodes and erasing any "scheduling" decisions it took. It will then run the simulation for the next NodeGroup. This process is repeated for each NodeGroup in the cluster.
- A configurable heuristic (the "expander") will look at the results of the simulations and pick one of them. CA will increase the target size of the selected NodeGroup (i.e. ask the underlying cloud for more VMs).
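To make the simulation loop above more concrete, here is a deliberately simplified sketch. It is not cluster-autoscaler code: the real CA uses the embedded scheduler framework rather than a CPU-only fit check, and all type and function names here are stand-ins for illustration.

```go
package main

import "fmt"

// Pod and Node are simplified stand-ins for the real Kubernetes objects.
type Pod struct {
	Name     string
	MilliCPU int64
}

type Node struct {
	Name                string
	AllocatableMilliCPU int64
	usedMilliCPU        int64
}

// nodesNeeded runs a toy binpacking simulation for one NodeGroup: it keeps
// adding in-memory copies of the template node until every pending pod has
// been "scheduled" somewhere, then reports how many nodes were added.
func nodesNeeded(template Node, pending []Pod) int {
	var nodes []*Node
	for _, pod := range pending {
		placed := false
		for _, n := range nodes {
			if n.usedMilliCPU+pod.MilliCPU <= n.AllocatableMilliCPU {
				n.usedMilliCPU += pod.MilliCPU
				placed = true
				break
			}
		}
		if !placed {
			// Add another copy of the template node and place the pod on it.
			n := template
			n.usedMilliCPU = pod.MilliCPU
			nodes = append(nodes, &n)
		}
	}
	// The in-memory nodes and "scheduling" decisions are discarded before the
	// next NodeGroup is simulated; only the count survives.
	return len(nodes)
}

func main() {
	template := Node{Name: "template", AllocatableMilliCPU: 4000}
	pending := []Pod{
		{Name: "a", MilliCPU: 3000},
		{Name: "b", MilliCPU: 3000},
		{Name: "c", MilliCPU: 1000},
	}
	fmt.Println("nodes to add:", nodesNeeded(template, pending)) // prints: nodes to add: 2
}
```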
When considering node-local storage, each node is in its own topology zone. Cloning the labels of an existing node would be wrong for the labels set for a CSI driver on that node.
I think I'm missing some context here. Is the CSIStorageCapacity object for a node or for some class of identical nodes?
Note how the algorithm described above doesn't work with features like local PVs (since the scheduler code takes PVs from the informer and not from the snapshot, there is no way for CA to inject an in-memory PV object that represents the local volume on an in-memory node).
I think I'm missing some context here. Is the CSIStorageCapacity object for a node or for some class of identical nodes?
That depends on the storage system. For network-attached storage, it may be for multiple nodes in a certain zone because all those nodes share the same underlying storage. For local storage, each node has its own storage and, once the node is up, will have CSIStorageCapacity objects that represent the storage on that node.
If all nodes in a NodeGroup are alike and have the same amount of storage capacity, then one CSIStorageCapacity object can describe those nodes. We just have to take care that objects for nodes that don't exist yet and nodes that exist are not in conflict with each other. That can be achieved by labeling the node candidates differently - not something that CA currently supports, but it doesn't sound hard to add.
This is then using the existing APIs of the volume binder (informers for CSIStorageCapacity, PVC, Node, CSIDriver, etc.). If we change those APIs, then other solutions are also possible.
Note how the algorithm described above doesn't work with features like local PVs (since scheduler code takes PVs from informer and not snapshot, there is no way for CA to inject an in-memory PV object that represents the local volume on in-memory node).
This is where we need different solutions for different kinds of storage systems. For local storage managed by a CSI driver with dynamic provisioning, there won't be any such local PVs and there won't be a need to inject fake PVs. Instead, we need to inform the scheduler about capacity on the non-existent nodes. I was focusing on that approach.
I don't know which approach will be more common in clusters where CA will be used. 🤷
// objects that were added when creating the fake client. Without
// this wait, those objects won't be visible via the informers when
// the test runs.
synced := informerFactory.WaitForCacheSync(stop)
I'm worried about this without a timeout. I've seen cases of WaitForCacheSync() hanging ~forever.
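One possible way to bound the wait, sketched here as an illustration rather than as the code in this PR: derive the stop channel from a context with a timeout so that WaitForCacheSync cannot block forever.

```go
package informerutil

import (
	"context"
	"fmt"
	"time"

	"k8s.io/client-go/informers"
)

// waitForCacheSyncWithTimeout closes the stop channel once the timeout
// expires, so the underlying WaitForCacheSync call is guaranteed to return.
func waitForCacheSyncWithTimeout(factory informers.SharedInformerFactory, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	stop := make(chan struct{})
	go func() {
		<-ctx.Done()
		close(stop)
	}()

	for informerType, ok := range factory.WaitForCacheSync(stop) {
		if !ok {
			return fmt.Errorf("informer %v did not sync within %v", informerType, timeout)
		}
	}
	return nil
}
```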
I don't think unittests prove much in this case. I think CA integration with PVs is fundamentally broken:
None of this will be a problem for a unittest, given that we can easily control what's in the informer during the test and we only run once, so the problems with isolation won't have a chance to manifest. I think the ultimate solution involves refactoring relevant code in k/k to expose interfaces that allow control over internal state (similarly to how the snapshot finally gave CA a way to control the list of pods and nodes perceived by the scheduler, giving us a chance to fix problems with InterPodAffinity) and then making CA understand volumes enough to account for them in simulations. Unfortunately this is unlikely to happen anytime soon, so in the meantime coming up with a list of cases that are and aren't supported by CA would be great, but I'd like a more comprehensive proof than a unittest for those cases.
I'm starting to agree. I was asked to show how CSIStorageCapacity will affect CA, but the problem seems to be much bigger than just that.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle stale
I intend to come back to this by setting up some real environment with cluster autoscaler and some CSI driver where I can try out this idea.
I ran an experiment with a real cluster on Azure. The cluster with 1.21.1 was created with:
When it runs, it has two nodes from different pools:
The worker pool is set up for autoscaling (the tags above). I ran the cluster autoscaler under a debugger like this:
The config has:
Authorization was based on the
With the Azure cluster above, I can reproduce the problem that the autoscaler does not scale up when pods need additional storage. I used the CSI hostpath driver with a fake 100Gi of "fast" storage per node, deployed from https://github.com/pohly/csi-driver-host-path/commits/cluster-autoscaler such that each worker pool node has one instance of the driver.
Then I created this example stateful set:
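The manifest itself is not preserved in this thread. As an illustration only, a stateful set of the kind described might look roughly like this, assuming the csi-hostpath-fast storage class from the capacity example further down and a volume request large enough that two replicas cannot share one 100Gi node; all names and sizes here are made up:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-app
spec:
  serviceName: example-app
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: csi-hostpath-fast
      resources:
        requests:
          storage: 60Gi
```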
When I created that, one pod ran. When scaling up the stateful set, the second pod stayed pending and the autoscaler did not add a node for it.
The key question was what the simulated node looks like. Its CSI topology label (topology.hostpath.csi/node) was copied verbatim from the existing worker node. This incorrect value has the effect that CSIStorageCapacity objects for the existing node also appear to apply to the simulated node, while there is no way to describe the storage that a not-yet-existing node would have.
I added a command line option that allows replacing labels in template nodes:
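The invocation below is taken from the commit message further down in this PR; it rewrites the CSI driver's per-host topology label into a value that only template nodes carry:

```
--replace-labels ';^topology.hostpath.csi/node=aks-workerpool.*;topology.hostpath.csi/node=aks-workerpool-template;'
```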
Now an administrator can tell the autoscaler that new nodes in that node group will have new storage available when they get created:
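The corresponding CSIStorageCapacity object, also from the commit message below, matches only that template label and therefore describes future nodes without conflicting with capacity published for existing nodes:

```sh
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1beta1
kind: CSIStorageCapacity
metadata:
  name: aks-workerpool-fast-storage
  namespace: kube-system
capacity: 100Gi
maximumVolumeSize: 100Gi
nodeTopology:
  matchLabels:
    # This never matches a real node, only the node templates
    # inside cluster-autoscaler.
    topology.hostpath.csi/node: aks-workerpool-template
storageClassName: csi-hostpath-fast
EOF
```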
And with that, autoscaler scales up:
The CSI driver pod starts first, creates the new volume, and then the example app also starts on the new node. It's not all perfect yet, though: the autoscaler asks for one more node than necessary, and at first the extra node did not seem to get removed again.
I'll debug both further, but in the meantime let me share the modifications that I have right now. @MaciekPytel: is this going in the right direction?
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: pohly. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
cluster-autoscaler/core/scale_up.go (outdated)
@@ -322,7 +323,7 @@ func computeExpansionOption(context *context.AutoscalingContext, podEquivalenceG
 // false if it didn't and error if an error occurred. Assumes that all nodes in the cluster are
 // ready and in sync with instance groups.
 func ScaleUp(context *context.AutoscalingContext, processors *ca_processors.AutoscalingProcessors, clusterStateRegistry *clusterstate.ClusterStateRegistry, unschedulablePods []*apiv1.Pod,
-	nodes []*apiv1.Node, daemonSets []*appsv1.DaemonSet, nodeInfos map[string]*schedulerframework.NodeInfo, ignoredTaints taints.TaintKeySet) (*status.ScaleUpStatus, errors.AutoscalerError) {
+	nodes []*apiv1.Node, daemonSets []*appsv1.DaemonSet, nodeInfos map[string]*schedulerframework.NodeInfo, ignoredTaints taints.TaintKeySet, labelReplacements replace.Replacements) (*status.ScaleUpStatus, errors.AutoscalerError) {
ignoredTaints and labelReplacements always get passed around together. I was wondering whether it would be worth creating a struct that holds both. Then if something else needs to be added in the future, the change will be more limited.
This new parameter promptly led to several merge conflicts. While cleaning those up I'll implement this idea.
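A hypothetical shape for that struct, with an invented name (the field types are the ones visible in the diff above, not code from the PR):

```go
package config

import (
	"k8s.io/autoscaler/cluster-autoscaler/utils/replace"
	"k8s.io/autoscaler/cluster-autoscaler/utils/taints"
)

// NodeTemplateTweaks groups everything that modifies how a template node is
// derived from an existing node; adding another knob later then only changes
// this struct instead of every function signature along the scale-up call chain.
type NodeTemplateTweaks struct {
	IgnoredTaints     taints.TaintKeySet
	LabelReplacements replace.Replacements
}
```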
labelReplacements = func() *replace.Replacements {
	repl := &replace.Replacements{}
	flag.Var(repl, "replace-labels", "Specifies one or more regular expression replacements of the form ;<regexp>;<replacement>; (any other character as separator also allowed) which get applied one after the other to labels of a node to form a template node. Labels are represented as a single string with <key>=<value>. If the key is empty after replacement, the label gets removed.")
	return repl
}()
Naming and the format of the parameter are up for debate. Once we have an agreement, I'll add documentation and tests in utils/replace. This is following how regex replace is specified for sed.
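For illustration only (not code from the PR), this is how such a ;regexp;replacement; rule transforms the label string of an existing node into the one a template node should carry; the node name below is made up:

```go
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// One replacement rule from the experiment, split into its regexp and
	// replacement parts (the ';' separators stripped).
	re := regexp.MustCompile(`^topology.hostpath.csi/node=aks-workerpool.*`)
	replacement := "topology.hostpath.csi/node=aks-workerpool-template"

	// Label of a hypothetical existing node, in the <key>=<value> form
	// mentioned in the flag description.
	label := "topology.hostpath.csi/node=aks-workerpool-12345678-vmss000000"

	fmt.Println(re.ReplaceAllString(label, replacement))
	// Output: topology.hostpath.csi/node=aks-workerpool-template
}
```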
I added documentation to the FAQ.md.
That happens after the first new node comes online. At that time, the CSI driver is not running yet and thus the node is assumed to have no storage. When autoscaler runs a scale up check during that time, it concludes that it still has no suitable nodes for the pod which needed a new volume, therefore it asks for another one. That stops once the CSI driver is up and the pending pod gets scheduled. Perhaps we need a "node start up delay", i.e. some additional delay after nodes appear before they are considered online?
I just wasn't waiting long enough... it gets removed again after the 10m scale-down-delay-after-add.
Regarding CustomResourcesProcessor - that works if we can tell apart a node that is waiting for a CSI driver from one that is not (i.e. we need to be able to tell if a given node is fully initialized or not). For example, in the case of GPUs we assume any node that will eventually have a GPU will be started with a provider-specific "gpu label" (for example in GCE).
The criteria could be:
- there is CSIStorageCapacity configured for a template node derived from the node, i.e. the admin expects to get storage capacity
- there is no CSIStorageCapacity object for the actual node
One concern that was brought up is that an admin needs to configure this. To recap, setting this up depends on knowledge about:
- the CSI driver and the topology labels it uses
- the storage capacity that new nodes in a node pool will have
- how the cluster autoscaler works and how it is configured
An extension of the CSIDriver object might help with the first point, but I am not sure whether the resulting API can be generic enough to be useful for anything besides the autoscaler. The CSI driver cannot help with the second point in "scale from zero" scenarios where the driver isn't running on any node of a node pool. If it is running on some node already, then perhaps it could publish information for that node as if it was unused in a separate CSIStorageCapacity object. How the autoscaler really works in a cluster depends on its version and who built it - there's not exactly "one" autoscaler, right? So the last point would depend on standardizing on one way of configuring it.
The criteria could be
- there is CSIStorageCapacity configured for a template node derived from the node, i.e. the admin expects to get storage capacity
- there is no CSIStorageCapacity object for the actual node

I implemented that in a new commit, with a slightly more generalized check: there might be more than one expected CSIStorageCapacity object for a node when there are several CSI drivers and/or storage classes.
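A minimal sketch of that check, with simplified stand-in types rather than the PR's actual implementation:

```go
package processors

// capacityKey identifies one expected CSIStorageCapacity entry, e.g. by the
// storage class (and, implicitly, the CSI driver) it belongs to.
type capacityKey struct {
	StorageClassName string
}

// nodeHasStorageCapacity reports whether every capacity entry configured for
// the node group's template node has also been published for the actual node.
// Until that is true, the node is still waiting for its CSI driver and should
// not be treated as ready; otherwise the autoscaler may scale up again for
// pods that could run on this node once the driver has started.
func nodeHasStorageCapacity(expectedForTemplate, publishedForNode map[capacityKey]bool) bool {
	for key := range expectedForTemplate {
		if !publishedForNode[key] {
			return false
		}
	}
	return true
}
```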
// This isn't particularly useful at the moment. Perhaps we can accept
// a context and then our caller can decide about a suitable deadline.
stopChannel := make(chan struct{})
This is the same behavior as before in NewSchedulerBasedPredicateChecker, i.e. it's not worse - but also not better.
// CleanUp frees resources.
CleanUp()
}
In contrast to CustomResourcesProcessor, there's nothing specific to CSI in this interface. A better solution might be to introduce a generic FilterReadyNodes interface and then support more than one of those in AutoscalingProcessors. CustomResourcesProcessor then becomes an extension of that new interface which just adds GetNodeResourceTargets.
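To make the idea concrete, a rough sketch of such a generic interface follows; the names and simplified types are invented here and are not part of the PR:

```go
package processors

// Node is a stand-in for *apiv1.Node, just enough for this sketch.
type Node struct{ Name string }

// ReadyNodeFilter is the generic piece suggested above: any processor that can
// declare "this node is not fully initialized yet" implements it. The existing
// CustomResourcesProcessor would extend it with GetNodeResourceTargets, while a
// CSI processor would only implement the filtering part.
type ReadyNodeFilter interface {
	// FilterOutUnready returns the subset of readyNodes that this processor
	// still considers ready; allNodes is passed for context.
	FilterOutUnready(allNodes, readyNodes []*Node) []*Node

	// CleanUp frees resources.
	CleanUp()
}
```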
@@ -309,15 +318,21 @@ func buildAutoscaler() (core.Autoscaler, error) {
	autoscalingOptions := createAutoscalingOptions()
	kubeClient := createKubeClient(getKubeConfig())
	eventsKubeClient := createKubeClient(getKubeConfig())
	informerFactory := informers.NewSharedInformerFactory(kubeClient, 0)
As described in the commit message, creating and starting the informer factory needs to be moved up the call chain because both the scheduler predicate and the processor need a CSIStorageCapacity informer and ideally should use the same one.
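A rough sketch of that setup; the constructors for the two consumers are placeholders (commented out), only the client-go calls are real:

```go
package main

import (
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build one shared informer factory near the top of the call chain.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	stopCh := make(chan struct{})
	factory := informers.NewSharedInformerFactory(client, 0)

	// Both the scheduler-based predicate checker and the CSI processor would
	// be handed this same factory so they share one CSIStorageCapacity
	// informer (hypothetical constructors):
	// predicateChecker := newPredicateChecker(factory)
	// csiProcessor := newCSIProcessor(factory)

	// Start informers only after all consumers have registered their needs.
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
}
```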
I read through this again and it generally makes sense to me. I'm not an expert on the CSI system, but I think what is documented here would work with the clusterapi provider.
The one thing I'm not sure about is whether we would need more information to be available about the individual instances behind the nodes in the node group. I don't think we would, as it seems the labels and taints are doing most of the work here.
cc @jingxu97
Whether a pod has unbound volumes influences scheduling decisions and thus the scale up decisions in cluster autoscaler. These three new test cases cover:
- a pod with an unbound pvc using late binding -> can scale up
- the same with the storage capacity feature enabled -> cannot scale up without CSIStorageCapacity
- the same with manually configured CSIStorageCapacity -> can scale up
The assumption that all node labels except for the hostname label can be copied verbatim does not hold for CSI drivers which manage local storage: those drivers have a topology label where the value also depends on the hostname. It might be the same as the Kubernetes hostname, but that cannot be assumed.

To solve this, search/replace with regular expressions can be defined to modify those labels. This then can be used to inform the autoscaler about available capacity on new nodes:

```
--replace-labels ';^topology.hostpath.csi/node=aks-workerpool.*;topology.hostpath.csi/node=aks-workerpool-template;'
```

```sh
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1beta1
kind: CSIStorageCapacity
metadata:
  name: aks-workerpool-fast-storage
  namespace: kube-system
capacity: 100Gi
maximumVolumeSize: 100Gi
nodeTopology:
  matchLabels:
    # This never matches a real node, only the node templates
    # inside cluster-autoscaler.
    topology.hostpath.csi/node: aks-workerpool-template
storageClassName: csi-hostpath-fast
EOF
```
When a new node becomes ready, a CSI driver is not going to be running on it immediately. This can cause the cluster autoscaler to scale up once more because of pending pods that can run on that new node once the driver is ready.

The actual check is about CSIStorageCapacity. By comparing the published information about the new node against the information for a template node, we can determine whether the CSI driver is done with starting up on the node.

The new CSI processor needs information about existing CSIStorageCapacity objects in the cluster, just like the scheduler predicate. Both can share the same informer. For that to work, managing the informer factory must be moved up the call chain so that the setup code for both can use the same factory.
@MaciekPytel: I would like to propose this PR for merging. I can't think of any simpler way of solving this because of the inherent complexity outlined above, and no-one else has come up with any suggestions either. The alternative to this PR is to explain to users that autoscaling doesn't work together with storage capacity tracking. But I think the proposal here is better than giving up on the combination completely. If there is anything that I can do to make this PR acceptable, then please let me know.
@pohly: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I replied on kubernetes/enhancements#3233 with the high-level issues I see with this approach. This approach can technically work for some use-cases, but it relies on the cluster admin having in-depth knowledge of CA and I don't think it scales to large clusters. I really think we should ask "how can we change the CSIStorageCapacity API / implementation in k8s to facilitate a good autoscaling story", rather than just "how do we make CA work with the current implementation".
The conclusion in kubernetes/enhancements#3233 was that this PR here is not sufficient for autoscaling and therefore should not be merged.

/close

To recap:
This can only be achieved by making some assumptions about how a CSI driver works and by a CSI driver providing more information than currently required. Before someone picks this up, we need:
@pohly: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Whether a pod has unbound volumes influences scheduling decisions and thus the scale up decisions in cluster autoscaler.

This PR:
- adds scale-up test cases for pods with volumes
- adds a --replace-labels option for modifying the labels of template nodes
- adds a processor that treats a new node as still initializing until the expected CSIStorageCapacity information has been published for it

These three new test cases cover:
- a pod with an unbound PVC using late binding -> can scale up
- the same with the storage capacity feature enabled -> cannot scale up without CSIStorageCapacity
- the same with manually configured CSIStorageCapacity -> can scale up

Using this in practice depends on having different CSI topology labels for template nodes and additional configuration for each CSI driver and its storage classes. This is something that cluster administrators must take care of.