cluster-autoscaler: support modifying node labels
The assumption that all node labels except for the hostname label can be copied
verbatim does not hold for CSI drivers which manage local storage: those
drivers have a topology label where the value also depends on the hostname. It
might be the same as the Kubernetes hostname, but that cannot be assumed.

To solve this, search/replace with regular expressions can be defined to modify
those labels. This then can be used to inform the autoscaler about available
capacity on new nodes:

   --replace-labels ';^topology.hostpath.csi/node=aks-workerpool.*;topology.hostpath.csi/node=aks-workerpool-template;'

   kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1beta1
kind: CSIStorageCapacity
metadata:
  name: aks-workerpool-fast-storage
  namespace: kube-system
capacity: 100Gi
maximumVolumeSize: 100Gi
nodeTopology:
  matchLabels:
    # This never matches a real node, only the node templates
    # inside cluster-autoscaler.
    topology.hostpath.csi/node: aks-workerpool-template
storageClassName: csi-hostpath-fast
EOF
pohly committed Sep 21, 2021
1 parent 9b533c3 commit 48cc845
Showing 9 changed files with 241 additions and 23 deletions.
94 changes: 94 additions & 0 deletions cluster-autoscaler/FAQ.md
@@ -33,6 +33,7 @@ this document:
* [How can I prevent Cluster Autoscaler from scaling down a particular node?](#how-can-i-prevent-cluster-autoscaler-from-scaling-down-a-particular-node)
* [How can I configure overprovisioning with Cluster Autoscaler?](#how-can-i-configure-overprovisioning-with-cluster-autoscaler)
* [How can I enable/disable eviction for a specific DaemonSet](#how-can-i-enabledisable-eviction-for-a-specific-daemonset)
* [How can I enable autoscaling for Pods with volumes?](#how-can-i-enable-autoscaling-for-pods-with-volumes)
* [Internals](#internals)
* [Are all of the mentioned heuristics and timings final?](#are-all-of-the-mentioned-heuristics-and-timings-final)
* [How does scale-up work?](#how-does-scale-up-work)
@@ -461,6 +462,99 @@ This annotation has no effect on pods that are not a part of any DaemonSet.
****************
### How can I enable autoscaling for Pods with volumes?

For network-attached storage, autoscaling works as long as the storage system
does not run out of space for new volumes. Cluster Autoscaler has no support
for automatically increasing storage pools when that happens.

For storage that is local to nodes, the situation is different. Solutions
depend on the specific scenario.

#### Dynamic provisioning with immediate binding

When volumes are provisioned dynamically through a storage class with immediate
[binding](https://kubernetes.io/docs/concepts/storage/storage-classes/#volume-binding-mode),
scheduling, and thus the code in Cluster Autoscaler that simulates it, just
waits for the volumes to be created. If creating them depends on new nodes,
scale-up is not triggered. It is better to use the `WaitForFirstConsumer`
binding mode.

#### Dynamic provisioning with delayed binding

When the binding mode is `WaitForFirstConsumer`, volume provisioning starts
when the first Pod tries to use a PersistentVolumeClaim. The scheduler
chooses a candidate node and the provisioner then tries to create
the volume on that node.

When Cluster Autoscaler considers whether a Pod waiting for such a volume
could run on a new node, the outcome depends on whether storage capacity
tracking, a beta feature for CSI drivers since Kubernetes 1.21, is enabled.
When it is disabled, the volume binder assumes that all nodes are
suitable. This may cause Cluster Autoscaler to create new nodes from a pool
that doesn't actually have local storage.

When it is enabled, additional configuration of Cluster Autoscaler is
needed to inform it how much storage new nodes of a pool will have. This must
be done for each CSI driver and each storage class of that driver. Suppose
there is a `csi-lvm-fast` storage class for a fictional LVM CSI driver and a
node pool where node names are `aks-workerpool-<unique id>`. The following
command creates a CSIStorageCapacity object which states that new nodes
will have a certain amount of free storage:
```
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1beta1
kind: CSIStorageCapacity
metadata:
  # The name does not matter. It just has to be unique
  # inside the namespace.
  name: aks-workerpool-csi-lvm-fast-storage
  # Other namespaces also work. As this object is owned by the
  # cluster administrator, kube-system is a good choice.
  namespace: kube-system
# Capacity and maximumVolumeSize must be the same as what the CSI driver
# will report for new nodes with unused storage.
capacity: 100Gi
maximumVolumeSize: 100Gi
nodeTopology:
  matchLabels:
    # This never matches a real node, only the node templates
    # inside Cluster Autoscaler.
    topology.lvm.csi/node: aks-workerpool-template
storageClassName: csi-lvm-fast
EOF
```
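To see why the label value in `matchLabels` matters, the following minimal Go sketch models the capacity check in a simplified way. The `storageCapacity` type and the `matches` and `hasCapacity` functions are hypothetical stand-ins, not the scheduler's or Cluster Autoscaler's actual code, and the sketch assumes the label key `topology.lvm.csi/node` of the fictional LVM driver. A template whose label still carries a real node name finds no matching object, while a template with the rewritten value does:

```go
package main

import "fmt"

// storageCapacity is a simplified, hypothetical stand-in for the fields of a
// CSIStorageCapacity object that matter for this illustration.
type storageCapacity struct {
	StorageClassName string
	MatchLabels      map[string]string
	CapacityGiB      int64
}

// matches reports whether every selector entry is present on the node
// with exactly the same value.
func matches(selector, nodeLabels map[string]string) bool {
	for k, v := range selector {
		if got, ok := nodeLabels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// hasCapacity reports whether some capacity object for the storage class
// applies to the node and is large enough for the requested size.
func hasCapacity(caps []storageCapacity, class string, nodeLabels map[string]string, requestGiB int64) bool {
	for _, c := range caps {
		if c.StorageClassName == class && matches(c.MatchLabels, nodeLabels) && c.CapacityGiB >= requestGiB {
			return true
		}
	}
	return false
}

func main() {
	caps := []storageCapacity{{
		StorageClassName: "csi-lvm-fast",
		MatchLabels:      map[string]string{"topology.lvm.csi/node": "aks-workerpool-template"},
		CapacityGiB:      100,
	}}

	// Labels copied verbatim from an existing node (assumed node name).
	copied := map[string]string{"topology.lvm.csi/node": "aks-workerpool-12345"}
	// Labels after the value was rewritten for the template.
	rewritten := map[string]string{"topology.lvm.csi/node": "aks-workerpool-template"}

	fmt.Println(hasCapacity(caps, "csi-lvm-fast", copied, 10))    // false
	fmt.Println(hasCapacity(caps, "csi-lvm-fast", rewritten, 10)) // true
}
```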
Now Cluster Autoscaler must be configured to change the node template labels
such that the volume capacity check looks at that object instead of the one for
the node from which the template was created. This is done with a command line
flag that enables regular expression matching and replacement for labels,
similar to `sed -e s/foo-.*/foo-template/`:
```
--replace-labels ';^topology.lvm.csi/node=aks-workerpool.*;topology.lvm.csi/node=aks-workerpool-template;'
```
`topology.lvm.csi/node` in this example is the label that gets added when the
LVM CSI driver is registered on a node by kubelet. For local storage, the value
of that label is usually the host name, which is what must be modified.

For scaling up from zero, the `topology.lvm.csi/node=aks-workerpool-template`
label must be added to the configuration for the node pool. How to do this
depends on the cloud provider.

TODO: describe how to avoid over-provisioning

#### Static provisioning

When using something like the [Local Persistence Volume Static
Provisioner](https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner),
new PersistentVolumes are created when new nodes are added. Those
PersistentVolumes may then be used to satisfy unbound volume claims that
prevented Pods from running earlier.

Cluster Autoscaler currently has no support for this scenario.

# Internals
### Are all of the mentioned heuristics and timings final?

7 changes: 7 additions & 0 deletions cluster-autoscaler/config/autoscaling_options.go
@@ -18,6 +18,8 @@ package config

import (
"time"

"k8s.io/autoscaler/cluster-autoscaler/utils/replace"
)

// GpuLimits define lower and upper bound on GPU instances of given type in cluster
@@ -145,6 +147,11 @@ type AutoscalingOptions struct {
MaxBulkSoftTaintTime time.Duration
// IgnoredTaints is a list of taints to ignore when considering a node template for scheduling.
IgnoredTaints []string
// LabelReplacements is a list of regular expressions and their replacements that get applied
// to labels of existing nodes when creating node templates. The string that the regular
// expression is matched against is "<key>=<value>". If the value part is empty or missing
// after the transformation, the label gets removed.
LabelReplacements replace.Replacements
// BalancingExtraIgnoredLabels is a list of labels to additionally ignore when comparing if two node groups are similar.
// Labels in BasicIgnoredLabels and the cloud provider-specific ignored labels are always ignored.
BalancingExtraIgnoredLabels []string

7 changes: 3 additions & 4 deletions cluster-autoscaler/core/scale_up.go
@@ -24,7 +24,6 @@ import (
"time"

"k8s.io/autoscaler/cluster-autoscaler/core/utils"
"k8s.io/autoscaler/cluster-autoscaler/utils/taints"

appsv1 "k8s.io/api/apps/v1"
apiv1 "k8s.io/api/core/v1"
@@ -322,7 +321,7 @@ func computeExpansionOption(context *context.AutoscalingContext, podEquivalenceG
// false if it didn't and error if an error occurred. Assumes that all nodes in the cluster are
// ready and in sync with instance groups.
func ScaleUp(context *context.AutoscalingContext, processors *ca_processors.AutoscalingProcessors, clusterStateRegistry *clusterstate.ClusterStateRegistry, unschedulablePods []*apiv1.Pod,
nodes []*apiv1.Node, daemonSets []*appsv1.DaemonSet, nodeInfos map[string]*schedulerframework.NodeInfo, ignoredTaints taints.TaintKeySet) (*status.ScaleUpStatus, errors.AutoscalerError) {
nodes []*apiv1.Node, daemonSets []*appsv1.DaemonSet, nodeInfos map[string]*schedulerframework.NodeInfo, nodeTransformation *utils.NodeTransformation) (*status.ScaleUpStatus, errors.AutoscalerError) {
// From now on we only care about unschedulable pods that were marked after the newest
// node became available for the scheduler.
if len(unschedulablePods) == 0 {
@@ -496,7 +495,7 @@ func ScaleUp(context *context.AutoscalingContext, processors *ca_processors.Auto

// If possible replace candidate node-info with node info based on created node group. The latter
// one should be more in line with nodes which will be created by node group.
mainCreatedNodeInfo, err := utils.GetNodeInfoFromTemplate(createNodeGroupResult.MainCreatedNodeGroup, daemonSets, context.PredicateChecker, ignoredTaints)
mainCreatedNodeInfo, err := utils.GetNodeInfoFromTemplate(createNodeGroupResult.MainCreatedNodeGroup, daemonSets, context.PredicateChecker, nodeTransformation)
if err == nil {
nodeInfos[createNodeGroupResult.MainCreatedNodeGroup.Id()] = mainCreatedNodeInfo
} else {
@@ -510,7 +509,7 @@ func ScaleUp(context *context.AutoscalingContext, processors *ca_processors.Auto
}

for _, nodeGroup := range createNodeGroupResult.ExtraCreatedNodeGroups {
nodeInfo, err := utils.GetNodeInfoFromTemplate(nodeGroup, daemonSets, context.PredicateChecker, ignoredTaints)
nodeInfo, err := utils.GetNodeInfoFromTemplate(nodeGroup, daemonSets, context.PredicateChecker, nodeTransformation)

if err != nil {
klog.Warningf("Cannot build node info for newly created extra node group %v; balancing similar node groups will not work; err=%v", nodeGroup.Id(), err)

14 changes: 9 additions & 5 deletions cluster-autoscaler/core/static_autoscaler.go
@@ -31,6 +31,7 @@ import (
"k8s.io/autoscaler/cluster-autoscaler/config"
"k8s.io/autoscaler/cluster-autoscaler/context"
core_utils "k8s.io/autoscaler/cluster-autoscaler/core/utils"
coreutils "k8s.io/autoscaler/cluster-autoscaler/core/utils"
"k8s.io/autoscaler/cluster-autoscaler/estimator"
"k8s.io/autoscaler/cluster-autoscaler/expander"
"k8s.io/autoscaler/cluster-autoscaler/metrics"
@@ -75,7 +76,7 @@ type StaticAutoscaler struct {
processors *ca_processors.AutoscalingProcessors
processorCallbacks *staticAutoscalerProcessorCallbacks
initialized bool
ignoredTaints taints.TaintKeySet
nodeTransformation coreutils.NodeTransformation
}

type staticAutoscalerProcessorCallbacks struct {
@@ -159,7 +160,10 @@ func NewStaticAutoscaler(
processors: processors,
processorCallbacks: processorCallbacks,
clusterStateRegistry: clusterStateRegistry,
ignoredTaints: ignoredTaints,
nodeTransformation: coreutils.NodeTransformation{
IgnoredTaints: ignoredTaints,
LabelReplacements: opts.LabelReplacements,
},
}
}

@@ -277,7 +281,7 @@ func (a *StaticAutoscaler) RunOnce(currentTime time.Time) errors.AutoscalerError
return typedErr.AddPrefix("Initialize ClusterSnapshot")
}

nodeInfosForGroups, autoscalerError := a.processors.TemplateNodeInfoProvider.Process(autoscalingContext, readyNodes, daemonsets, a.ignoredTaints)
nodeInfosForGroups, autoscalerError := a.processors.TemplateNodeInfoProvider.Process(autoscalingContext, readyNodes, daemonsets, &a.nodeTransformation)
if autoscalerError != nil {
klog.Errorf("Failed to get node infos for groups: %v", autoscalerError)
return autoscalerError.AddPrefix("failed to build node infos for node groups: ")
@@ -429,7 +433,7 @@ func (a *StaticAutoscaler) RunOnce(currentTime time.Time) errors.AutoscalerError
scaleUpStart := time.Now()
metrics.UpdateLastTime(metrics.ScaleUp, scaleUpStart)

scaleUpStatus, typedErr = ScaleUp(autoscalingContext, a.processors, a.clusterStateRegistry, unschedulablePodsToHelp, readyNodes, daemonsets, nodeInfosForGroups, a.ignoredTaints)
scaleUpStatus, typedErr = ScaleUp(autoscalingContext, a.processors, a.clusterStateRegistry, unschedulablePodsToHelp, readyNodes, daemonsets, nodeInfosForGroups, &a.nodeTransformation)

metrics.UpdateDurationFromStart(metrics.ScaleUp, scaleUpStart)

@@ -739,7 +743,7 @@ func (a *StaticAutoscaler) obtainNodeLists(cp cloudprovider.CloudProvider) ([]*a
// our normal handling for booting up nodes deal with this.
// TODO: Remove this call when we handle dynamically provisioned resources.
allNodes, readyNodes = a.processors.CustomResourcesProcessor.FilterOutNodesWithUnreadyResources(a.AutoscalingContext, allNodes, readyNodes)
allNodes, readyNodes = taints.FilterOutNodesWithIgnoredTaints(a.ignoredTaints, allNodes, readyNodes)
allNodes, readyNodes = taints.FilterOutNodesWithIgnoredTaints(a.nodeTransformation.IgnoredTaints, allNodes, readyNodes)
return allNodes, readyNodes, nil
}


28 changes: 21 additions & 7 deletions cluster-autoscaler/core/utils/utils.go
@@ -32,12 +32,19 @@ import (
"k8s.io/autoscaler/cluster-autoscaler/utils/errors"
"k8s.io/autoscaler/cluster-autoscaler/utils/gpu"
"k8s.io/autoscaler/cluster-autoscaler/utils/labels"
"k8s.io/autoscaler/cluster-autoscaler/utils/replace"
"k8s.io/autoscaler/cluster-autoscaler/utils/taints"
schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// NodeTransformation contains settings for creating node templates.
type NodeTransformation struct {
IgnoredTaints taints.TaintKeySet
LabelReplacements replace.Replacements
}

// GetNodeInfoFromTemplate returns NodeInfo object built based on TemplateNodeInfo returned by NodeGroup.TemplateNodeInfo().
func GetNodeInfoFromTemplate(nodeGroup cloudprovider.NodeGroup, daemonsets []*appsv1.DaemonSet, predicateChecker simulator.PredicateChecker, ignoredTaints taints.TaintKeySet) (*schedulerframework.NodeInfo, errors.AutoscalerError) {
func GetNodeInfoFromTemplate(nodeGroup cloudprovider.NodeGroup, daemonsets []*appsv1.DaemonSet, predicateChecker simulator.PredicateChecker, nodeTransformation *NodeTransformation) (*schedulerframework.NodeInfo, errors.AutoscalerError) {
id := nodeGroup.Id()
baseNodeInfo, err := nodeGroup.TemplateNodeInfo()
if err != nil {
@@ -55,7 +62,7 @@ func GetNodeInfoFromTemplate(nodeGroup cloudprovider.NodeGroup, daemonsets []*ap
}
fullNodeInfo := schedulerframework.NewNodeInfo(pods...)
fullNodeInfo.SetNode(baseNodeInfo.Node())
sanitizedNodeInfo, typedErr := SanitizeNodeInfo(fullNodeInfo, id, ignoredTaints)
sanitizedNodeInfo, typedErr := SanitizeNodeInfo(fullNodeInfo, id, nodeTransformation)
if typedErr != nil {
return nil, typedErr
}
@@ -102,9 +109,9 @@ func DeepCopyNodeInfo(nodeInfo *schedulerframework.NodeInfo) (*schedulerframewor
}

// SanitizeNodeInfo modifies nodeInfos generated from templates to avoid using duplicated host names
func SanitizeNodeInfo(nodeInfo *schedulerframework.NodeInfo, nodeGroupName string, ignoredTaints taints.TaintKeySet) (*schedulerframework.NodeInfo, errors.AutoscalerError) {
func SanitizeNodeInfo(nodeInfo *schedulerframework.NodeInfo, nodeGroupName string, nodeTransformation *NodeTransformation) (*schedulerframework.NodeInfo, errors.AutoscalerError) {
// Sanitize node name.
sanitizedNode, err := sanitizeTemplateNode(nodeInfo.Node(), nodeGroupName, ignoredTaints)
sanitizedNode, err := sanitizeTemplateNode(nodeInfo.Node(), nodeGroupName, nodeTransformation)
if err != nil {
return nil, err
}
@@ -123,19 +130,26 @@ func SanitizeNodeInfo(nodeInfo *schedulerframework.NodeInfo, nodeGroupName strin
return sanitizedNodeInfo, nil
}

func sanitizeTemplateNode(node *apiv1.Node, nodeGroup string, ignoredTaints taints.TaintKeySet) (*apiv1.Node, errors.AutoscalerError) {
func sanitizeTemplateNode(node *apiv1.Node, nodeGroup string, nodeTransformation *NodeTransformation) (*apiv1.Node, errors.AutoscalerError) {
newNode := node.DeepCopy()
nodeName := fmt.Sprintf("template-node-for-%s-%d", nodeGroup, rand.Int63())
newNode.Name = nodeName
newNode.Labels = make(map[string]string, len(node.Labels))
for k, v := range node.Labels {
if k != apiv1.LabelHostname {
if nodeTransformation != nil {
// Removing labels is not supported, only
// modifying them.
k, v = nodeTransformation.LabelReplacements.ApplyToPair(k, v)
}
newNode.Labels[k] = v
} else {
newNode.Labels[k] = nodeName
}
}
newNode.Name = nodeName
newNode.Spec.Taints = taints.SanitizeTaints(newNode.Spec.Taints, ignoredTaints)
if nodeTransformation != nil {
newNode.Spec.Taints = taints.SanitizeTaints(newNode.Spec.Taints, nodeTransformation.IgnoredTaints)
}
return newNode, nil
}
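
As a standalone illustration of the label loop above, the following sketch uses plain maps instead of `apiv1.Node`. `templateLabels` and the `transform` callback are hypothetical names introduced for this example. The hostname label is always set to the template name, every other label is passed through the optional transformation, and a `nil` transformation copies labels verbatim.

```go
package main

import "fmt"

// hostnameLabel is the value of apiv1.LabelHostname.
const hostnameLabel = "kubernetes.io/hostname"

// templateLabels builds the label set for a template node. It is a
// hypothetical, simplified version of the loop in sanitizeTemplateNode.
func templateLabels(labels map[string]string, templateName string, transform func(k, v string) (string, string)) map[string]string {
	out := make(map[string]string, len(labels))
	for k, v := range labels {
		if k == hostnameLabel {
			// The hostname label always follows the template node name.
			out[k] = templateName
			continue
		}
		if transform != nil {
			k, v = transform(k, v)
		}
		out[k] = v
	}
	return out
}

func main() {
	labels := map[string]string{
		hostnameLabel:           "aks-workerpool-12345",
		"topology.lvm.csi/node": "aks-workerpool-12345",
	}
	// Assumed transformation for the example: rewrite one topology label.
	rewrite := func(k, v string) (string, string) {
		if k == "topology.lvm.csi/node" {
			return k, "aks-workerpool-template"
		}
		return k, v
	}
	fmt.Println(templateLabels(labels, "template-node-for-pool-1", rewrite))
}
```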


10 changes: 9 additions & 1 deletion cluster-autoscaler/main.go
@@ -46,6 +46,7 @@ import (
"k8s.io/autoscaler/cluster-autoscaler/simulator"
"k8s.io/autoscaler/cluster-autoscaler/utils/errors"
kube_util "k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes"
"k8s.io/autoscaler/cluster-autoscaler/utils/replace"
"k8s.io/autoscaler/cluster-autoscaler/utils/units"
"k8s.io/autoscaler/cluster-autoscaler/version"
kube_client "k8s.io/client-go/kubernetes"
@@ -170,7 +171,13 @@ var (
regional = flag.Bool("regional", false, "Cluster is regional.")
newPodScaleUpDelay = flag.Duration("new-pod-scale-up-delay", 0*time.Second, "Pods less than this old will not be considered for scale-up.")

ignoreTaintsFlag = multiStringFlag("ignore-taint", "Specifies a taint to ignore in node templates when considering to scale a node group")
ignoreTaintsFlag = multiStringFlag("ignore-taint", "Specifies a taint to ignore in node templates when considering to scale a node group")
labelReplacements = func() *replace.Replacements {
repl := &replace.Replacements{}
flag.Var(repl, "replace-labels", "Specifies one or more regular expression replacements of the form ;<regexp>;<replacement>; (any other character as separator also allowed) which get applied one after the other to labels of a node to form a template node. Labels are represented as a single string with <key>=<value>. If the key is empty after replacement, the label gets removed.")
return repl
}()

balancingIgnoreLabelsFlag = multiStringFlag("balancing-ignore-label", "Specifies a label to ignore in addition to the basic and cloud-provider set of labels when comparing if two node groups are similar")
awsUseStaticInstanceList = flag.Bool("aws-use-static-instance-list", false, "Should CA fetch instance types in runtime or use a static list. AWS only")
concurrentGceRefreshes = flag.Int("gce-concurrent-refreshes", 1, "Maximum number of concurrent refreshes per cloud object type.")
@@ -249,6 +256,7 @@ func createAutoscalingOptions() config.AutoscalingOptions {
Regional: *regional,
NewPodScaleUpDelay: *newPodScaleUpDelay,
IgnoredTaints: *ignoreTaintsFlag,
LabelReplacements: *labelReplacements,
BalancingExtraIgnoredLabels: *balancingIgnoreLabelsFlag,
KubeConfigPath: *kubeConfigFile,
NodeDeletionDelayTimeout: *nodeDeletionDelayTimeout,
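
The `;<regexp>;<replacement>;` syntax from the help text of `--replace-labels` can be parsed with a custom `flag.Value`. The sketch below is an assumption about how such parsing might look, not the `replace` package from this commit. The `replacement` and `replacements` types are hypothetical, the first byte of each rule is taken as the separator, and only single-byte separators are handled.

```go
package main

import (
	"flag"
	"fmt"
	"regexp"
	"strings"
)

// replacement is one compiled search/replace rule.
type replacement struct {
	re   *regexp.Regexp
	repl string
}

// replacements is a hypothetical flag.Value that accumulates rules.
type replacements []replacement

// Compile-time check that *replacements implements flag.Value.
var _ flag.Value = (*replacements)(nil)

func (r *replacements) String() string {
	return fmt.Sprintf("%d replacement(s)", len(*r))
}

// Set parses one or more rules of the form <sep><regexp><sep><replacement><sep>,
// where <sep> is the first byte of each rule.
func (r *replacements) Set(s string) error {
	for s != "" {
		sep := s[:1]
		parts := strings.SplitN(s[1:], sep, 3)
		if len(parts) < 3 {
			return fmt.Errorf("expected %s<regexp>%s<replacement>%s, got %q", sep, sep, sep, s)
		}
		re, err := regexp.Compile(parts[0])
		if err != nil {
			return err
		}
		*r = append(*r, replacement{re: re, repl: parts[1]})
		s = parts[2] // whatever follows the closing separator is the next rule
	}
	return nil
}

func main() {
	var rules replacements
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	fs.Var(&rules, "replace-labels", "search/replace rules applied to <key>=<value> label strings")

	args := []string{"--replace-labels",
		";^topology.lvm.csi/node=aks-workerpool.*;topology.lvm.csi/node=aks-workerpool-template;"}
	if err := fs.Parse(args); err != nil {
		fmt.Println("error:", err)
		return
	}

	// Apply the rules one after the other to a label string.
	label := "topology.lvm.csi/node=aks-workerpool-12345"
	for _, rule := range rules {
		label = rule.re.ReplaceAllString(label, rule.repl)
	}
	fmt.Println(label) // topology.lvm.csi/node=aks-workerpool-template
}
```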

@@ -27,7 +27,6 @@ import (
"k8s.io/autoscaler/cluster-autoscaler/simulator"
"k8s.io/autoscaler/cluster-autoscaler/utils/errors"
kube_util "k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes"
"k8s.io/autoscaler/cluster-autoscaler/utils/taints"
schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"

klog "k8s.io/klog/v2"
@@ -51,7 +50,7 @@ func (p *MixedTemplateNodeInfoProvider) CleanUp() {
}

// Process returns the nodeInfos set for this cluster
func (p *MixedTemplateNodeInfoProvider) Process(ctx *context.AutoscalingContext, nodes []*apiv1.Node, daemonsets []*appsv1.DaemonSet, ignoredTaints taints.TaintKeySet) (map[string]*schedulerframework.NodeInfo, errors.AutoscalerError) {
func (p *MixedTemplateNodeInfoProvider) Process(ctx *context.AutoscalingContext, nodes []*apiv1.Node, daemonsets []*appsv1.DaemonSet, nodeTransformation *utils.NodeTransformation) (map[string]*schedulerframework.NodeInfo, errors.AutoscalerError) {
// TODO(mwielgus): This returns map keyed by url, while most code (including scheduler) uses node.Name for a key.
// TODO(mwielgus): Review error policy - sometimes we may continue with partial errors.
result := make(map[string]*schedulerframework.NodeInfo)
@@ -78,7 +77,7 @@ func (p *MixedTemplateNodeInfoProvider) Process(ctx *context.AutoscalingContext,
if err != nil {
return false, "", err
}
sanitizedNodeInfo, err := utils.SanitizeNodeInfo(nodeInfo, id, ignoredTaints)
sanitizedNodeInfo, err := utils.SanitizeNodeInfo(nodeInfo, id, nodeTransformation)
if err != nil {
return false, "", err
}
@@ -122,7 +121,7 @@ func (p *MixedTemplateNodeInfoProvider) Process(ctx *context.AutoscalingContext,

// No good template, trying to generate one. This is called only if there are no
// working nodes in the node groups. By default CA tries to use a real-world example.
nodeInfo, err := utils.GetNodeInfoFromTemplate(nodeGroup, daemonsets, ctx.PredicateChecker, ignoredTaints)
nodeInfo, err := utils.GetNodeInfoFromTemplate(nodeGroup, daemonsets, ctx.PredicateChecker, nodeTransformation)
if err != nil {
if err == cloudprovider.ErrNotImplemented {
continue