[CENG-663] VPA PCI labels #78

tedcm · 2023-02-16T20:25:42Z

Which component this PR applies to?

Vertical-pod-autoscaler

What type of PR is this?

/kind documentation

What this PR does / why we need it:

Adds "pci" labels to VPA PRs, requiring at least one non-contributing reviewer.

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

This is meant to isolate, contain and wrap access to our custom resource (storageclass/local-data) and label (/local-storage) in a single place, separated in our processors/datadog/ namespace. This avoids spreading local names and constraints across the code base.

The main functional change compared to upstream is the "storageclass/local-data" hack, meant to support pods claiming local-data (no-provisioner) volumes. filterOutSchedulable (+test) is mostly upstream's (copied for later modifications), lightly modified to work with older clusters where taint-based eviction isn't enabled. The frontend is meant to hook in future sub-processors for pods. This gives a foothold for some local-only improvements without touching upstream code; plans include:

* Runaway upscale prevention
* Schedulable metrics (pending pods pressure, long pending pods)
* Pod label selectors (for dedicated autoscaler instances)
* Possibly: deprioritize pods from cronjobs

Hooking it in the least intrusive way we could.

The goals of that PodListProcessor are twofold:

* Lower pressure on runonce loops by evaluating long pending pods less frequently
* More importantly: free autoscaler cycles to recover from scaledown cooldown, so nodes created for pods causing infinite upscales get gc'ed rather than filling a cluster (and those pods are slowed down)

The delay penalty could be made progressive (e.g. 2m, then 5m, then 10m, etc.), but for now a static value makes evaluating benefits and impacts easier. Metrics to come in a follow-up PR.

This is meant to replace a patch that was setting new nodes as NotReady until their lvp pod was there, with something bound to and contained in our podsListProcessor, not touching autoscaler core or cloudproviders at all anymore. The decision is entirely based on the local-data:true label: no guessing or per-cloud-provider instance type allowlists are needed anymore. Instead of being set NotReady, the new nodes that just joined are now considered schedulable for pods requesting local-data once they are ready, which naturally prevents spurious re-upscales. For now the change is restricted to new local-data nodes that became ready less than 5 minutes ago, as we're assessing wider impact. The downside is that we need to modify the clusterSnapshot content before filterOutSchedulable runs scheduler predicates with those nodes, which happens later, also in our own podsListProcessor.

And stop trying to make that a nodeinfo processor: this was an attempt to follow upstream suggestions, but excessively intrusive (not rebase-friendly) for a feature we might keep local/forked for a long time. We're also submitting a "nodeinfos provider processor" to upstream, which (if accepted) will help integrate that kind of change much more cleanly (e.g. not breaking tests, not touching core/). For now, let's assume we might not have an upstream processor entry point for a long time.

The new option `--node-infos-processor-podtemplate` will be used to enable support for the PodTemplate processor. The PodTemplate processor extracts, from PodTemplate resources, the Pods that should be considered as Daemonset Pods. This solution allows custom Daemonset controllers to have their workloads considered as Daemonset workloads.

The podTemplateProcess watches `PodTemplates` with a specific label in any namespace. From a `PodTemplate`, the processor generates a `Pod` that will be considered a Daemonset Pod by the cluster-autoscaler.

The PodTemplate processor is plugged into the Datadog NodeInfosProcessors to benefit from the cache mechanism, which limits the simulation processing overhead. It also limits possible merge conflicts with the upstream cluster-autoscaler code base.

Various cloudproviders' `NodeGroupForNode()` implementations (including aws, azure, and gce) can return a `nil` error _and_ a `nil` nodegroup. E.g. we're seeing AWS return that on failed upscales on live clusters. Checking that `deleteCreatedNodesWithErrors` doesn't return an error is not enough to safely dereference the nodegroup (as returned by `NodeGroupForNode()`) by calling nodegroup.Id(). In that situation, logging and returning early seems the safest option, to give various caches (e.g. clusterstateregistry's and cloud provider's) the opportunity to eventually converge.
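The guard described above can be sketched as follows. This is an illustration under assumptions, not the patched code: `nodeGroup`, `lookupFunc`, `fakeGroup` and `nodeGroupID` are hypothetical stand-ins for the cloudprovider types. The reflect check is there because an interface holding a typed nil pointer does not compare equal to `nil`.

```go
package main

import (
	"errors"
	"log"
	"reflect"
)

// nodeGroup is a hypothetical, minimal stand-in for a cloud provider node group.
type nodeGroup interface {
	Id() string
}

// lookupFunc mimics the shape of NodeGroupForNode(): it may return (nil, nil).
type lookupFunc func(nodeName string) (nodeGroup, error)

// fakeGroup is a pointer-receiver implementation, used to show the typed-nil case.
type fakeGroup struct{ id string }

func (g *fakeGroup) Id() string { return g.id }

var errNoNodeGroup = errors.New("no node group found for node")

// isNilGroup reports whether ng is nil, including a nil pointer wrapped in the interface.
func isNilGroup(ng nodeGroup) bool {
	if ng == nil {
		return true
	}
	v := reflect.ValueOf(ng)
	return v.Kind() == reflect.Ptr && v.IsNil()
}

// nodeGroupID logs and returns early instead of dereferencing a nil node group.
func nodeGroupID(lookup lookupFunc, nodeName string) (string, error) {
	ng, err := lookup(nodeName)
	if err != nil {
		return "", err
	}
	if isNilGroup(ng) {
		log.Printf("no node group for node %q, skipping", nodeName)
		return "", errNoNodeGroup
	}
	return ng.Id(), nil
}
```

Returning a sentinel error keeps the caller's early-return path explicit, so the next loop iteration can retry once the caches have converged.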

Brings a few recent Standard_L*s_v3, Standard_HB120 and Standard_NC* instance types.

/!\ This is an unfortunately unavoidable change to vendor/, not meant to stay indefinitely. The implementation tries to be non-intrusive and to avoid refactoring, to ease future rebases; that change should be removed when the cluster-autoscaler doesn't need to support Kubernetes clusters < k8s v1.24 anymore.

The topology spread constraints' skew accounting changed slightly, which can lead cluster-autoscaler (>= 1.24) to leave pending pods on k8s clusters < 1.24.

When evaluating node options for a pending pod having topology spread constraints, Kubernetes used (and continues) to inventory all the possible topology domains (e.g. zones us-east-1a, us-east-1b, etc., which it will try to use in a balanced way, with respect to the configured skew) by listing nodes running pods matching the provided labelSelector, and filtering out those that don't pass the tested pod's nodeAffinities.

But when computing the number of instances per topology domain to evaluate skew, the Kubernetes scheduler (< 1.24) used to count all nodes having pods that match the labelSelector, irrespective of their conformance to the tested pod's nodeAffinity. This changed with Kubernetes commit 935cbc8e625e6f175a44e4490fecc7a25edf6d45 (refactored later on), which I think is part of k8s v1.24: now the scheduler also filters out nodes that don't match the tested pod's nodeAffinities when counting pods per topology domain (computing the skew).

Since the cluster-autoscaler 1.24 uses upstream's scheduler framework, it inherited that behaviour, and this can lead to diverging evaluations vs the k8s scheduler (if the cluster's scheduler is < 1.24): a node could be considered schedulable by the autoscaler (not triggering an upscale) while the k8s scheduler would consider that it doesn't satisfy skew constraints.

One example that can trigger that situation would be a deployment changing its affinities (e.g. to move to a new set of nodes) while older pods/nodes are already at maximum skew tolerance (e.g. slightly unbalanced).

For instance, in that situation, we have a deployment configured like so:

```
labels:
  app: myapp
replicas: 4
topologySpreadConstraints:
- labelSelector:
    matchLabels:
      app: myapp
  maxSkew: 1
  topologyKey: zone
  whenUnsatisfiable: DoNotSchedule
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchExpressions:
      - key: nodeset
        operator: In
        values: [foo]
```

At first the application could end up distributed on nodeset=foo nodes as such:

```
node1 nodeset=foo zone=1
  pod-1 app=myapp
node2 nodeset=foo zone=2
  pod-2 app=myapp
node3 nodeset=foo zone=3
  pod-3 app=myapp
node4 nodeset=foo zone=1  # again
  pod-4 app=myapp
node5 nodeset=bar zone=1  not used yet because doesn't match nodeaffinity
node6 nodeset=bar zone=2  unschedulable (eg. full, or cordoned, ...)
node7 nodeset=bar zone=3  unschedulable (eg. full, or cordoned, ...)
```

Then the application's affinity is updated to e.g. `values: [bar]` and creates a new pod, as part of its rollout (or is upscaled).

With the older scheduler's `podtopologyspread` predicate, we'd count:

```
zone 1: 2 app=myapp pods
zone 2: 1 app=myapp pod
zone 3: 1 app=myapp pod
```

so we can't use node5 on zone 1, because we're already hitting the `maxSkew: 1` budget (we have one excess pod) on that zone: we need a nodeset=bar upscale on zone 2 or 3.

The newer scheduler, on the other hand, would only count pods running on nodeset=bar nodes to compute skew, which would give:

```
zone 1: 0 app=myapp pods
zone 2: 0 app=myapp pods
zone 3: 0 app=myapp pods
```

which means the new pod can use any node, including the already available node5: no need for an upscale.
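The two counting behaviours can be reproduced with a deliberately simplified model. This is not the scheduler's code: the `node` type, `countPerZone` and `fits` are hypothetical, node affinity is reduced to a single `nodeset` equality check, and the skew test is reduced to comparing the candidate zone's count (plus the new pod) against the least loaded zone.

```go
package main

// node is a hypothetical, simplified view of a node from the example.
type node struct {
	Name    string
	Nodeset string
	Zone    string
	Pods    int // number of pods matching the labelSelector (app=myapp)
}

// countPerZone counts matching pods per zone. Zones (topology domains) are
// inventoried from nodes matching the pod's affinity in both modes.
// With filterByAffinity=false, pods on all nodes are counted (scheduler < 1.24
// in this model); with true, only pods on affinity-matching nodes are counted.
func countPerZone(nodes []node, wantNodeset string, filterByAffinity bool) map[string]int {
	counts := map[string]int{}
	for _, n := range nodes {
		if n.Nodeset == wantNodeset {
			if _, ok := counts[n.Zone]; !ok {
				counts[n.Zone] = 0 // register the domain
			}
		}
	}
	for _, n := range nodes {
		if _, ok := counts[n.Zone]; !ok {
			continue // zone is not an eligible domain
		}
		if filterByAffinity && n.Nodeset != wantNodeset {
			continue
		}
		counts[n.Zone] += n.Pods
	}
	return counts
}

// fits reports whether adding one pod to zone keeps the skew within maxSkew.
func fits(counts map[string]int, zone string, maxSkew int) bool {
	lowest := counts[zone]
	for _, c := range counts {
		if c < lowest {
			lowest = c
		}
	}
	return counts[zone]+1-lowest <= maxSkew
}
```

Fed with the seven nodes from the example and `wantNodeset` set to `bar`, the unfiltered count rejects zone 1 while the filtered count accepts it, which is the divergence between the autoscaler and an older cluster scheduler.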

The skewer library's cache is re-created at every call, which puts pressure on the Azure API and slows down cluster-autoscaler startup by two minutes on my small (120 nodes, 300 VMSS) test cluster. This was hitting the API twice on cache miss to look for non-promo instance types (even when the instance name doesn't end with "_Promo").

First draft to support lvm storage (topolvm)

This commit adds the possibility to define extended resources for a node group on GCE, so that the cluster-autoscaler can account for them when taking scaling decisions. This is done through the `extended_resources` key inside the AUTOSCALER_ENV_VARS variable set on a MIG template. Signed-off-by: Mayeul Blanzat <[email protected]>

…add more tests

* Malformed extended resource definition should not fail the template building function. Instead, log the error and ignore extended resources
* Remove useless existence check
* Add tests around the extractExtendedResourcesFromKubeEnv function
* Add a test case to verify that malformed extended resource definition does not fail the template build function

Signed-off-by: Mayeul Blanzat <[email protected]>
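The "log and ignore" behaviour for malformed definitions could look like the sketch below. It rests on assumptions: the `name=count,name=count` value format, the plain integer quantities (instead of Kubernetes resource quantities), and the `parseExtendedResources` and `buildCapacity` helpers are all illustrative, not the functions touched by this commit.

```go
package main

import (
	"fmt"
	"log"
	"strconv"
	"strings"
)

// parseExtendedResources parses a definition such as "example.com/gpu=2,example.com/fpga=1".
// The format and integer quantities are assumptions made for this sketch.
func parseExtendedResources(def string) (map[string]int64, error) {
	out := map[string]int64{}
	if def == "" {
		return out, nil
	}
	for _, item := range strings.Split(def, ",") {
		name, val, ok := strings.Cut(item, "=")
		if !ok || name == "" {
			return nil, fmt.Errorf("malformed extended resource %q", item)
		}
		n, err := strconv.ParseInt(val, 10, 64)
		if err != nil {
			return nil, fmt.Errorf("invalid quantity for %q: %w", name, err)
		}
		out[name] = n
	}
	return out, nil
}

// buildCapacity is a hypothetical template-building step: a malformed
// definition is logged and ignored rather than failing the whole build.
func buildCapacity(base map[string]int64, extendedDef string) map[string]int64 {
	capacity := map[string]int64{}
	for k, v := range base {
		capacity[k] = v
	}
	ext, err := parseExtendedResources(extendedDef)
	if err != nil {
		log.Printf("ignoring extended resources: %v", err)
		return capacity
	}
	for k, v := range ext {
		capacity[k] = v
	}
	return capacity
}
```

Keeping the parse error inside `buildCapacity` means a typo in one node group's definition only loses the extended resources for that template, while the base capacity is still usable for scaling decisions.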

…esource-support-in-gce Cherry-pick: add extended resource support in GCE

There's a small window between the time the ASG list is refreshed (happens every minute) and the time expired or new instance-type cache entries are fetched again from the ASGs' LaunchConfigurations or LaunchTemplates. An ASG's LC might have been replaced during that window, in which case attempts to refresh that ASG's instance type would use the stale LC name we got when we last listed ASGs, possibly deleted since then.

DescribeLaunchConfigurations does not err if some of the provided LaunchConfigurationNames are missing from the result set. Which is fine, as we can cache what we could retrieve, and try again/converge the missing entries once we retry with a refreshed ASG list (at most 1 minute in the future), avoiding collecting everything again (expensive API calls).

The issue is that getInstanceTypesForAsgs() (the only place we call getInstanceTypeByLaunchConfigNames (-> DescribeLaunchConfigurations) from, itself called for missing cache entries) would set entries for each ASG, irrespective of getInstanceTypeByLaunchConfigNames()'s result set size; so we can end up caching empty ("") instance types. This causes getAsgTemplate failures ('ASG %q uses the unknown EC2 instance type ""') and degenerates into the cluster-autoscaler aborting its main loop cycle for as long as the bogus entries remain in cache.

On that topic: getInstanceTypeForAsg was swallowing getInstanceTypesForAsgs' error message, which doesn't help with diagnostics.

Expander requests' payloads can be rather heavy under upscale pressure, as they compound all candidate options and the unschedulable pods that could fit each option. Expander responses are a subset of the requests' payload items. We allow ourselves to send arbitrary payload sizes (gRPC `defaultClientMaxSendMessageSize` is `math.MaxInt32`), but we're prone to dropping expander servers' responses on the floor, due to the `4MiB` `defaultClientMaxReceiveMessageSize`. The arbitrary 128MiB value is meant to be huge (enough to support e.g. several dozen fat 1MiB pods) but not unlimited. Let me know if you'd rather see that turned into a command line flag, or another value. Also logging possible gRPC call errors, as that is of great help in diagnosing this kind of issue.

bpineau and others added 23 commits May 31, 2022 17:07

[local] Hook in Datadog's podlistprocessor

8411596

Hooking it in the least intrusive way we could.

[local] Hook in template only nodeinfos provider

4874d6d

[local] Add NodeInfos PodTemplate processor implementation

2fedbb3

The podTemplateProcess watches `PodTemplates` with a specific label in any namespace. From a `PodTemplate`, the processor generates a `Pod` that will be considered a Daemonset Pod by the cluster-autoscaler.

Update Azure instance-types

3c061e6

Brings a few recent Standard_L*s_v3, Standard_HB120 and Standard_NC* instance types.

[local] Support topolvm/openebs storage for scaling decisions

c248329

Merge pull request #52 from DataDog/dhenkel/support-lvm-storage

8448683

First draft to support lvm storage (topolvm)

Merge pull request #64 from DataDog/mayeul/cherry-pick/add-extended-r…

b447a03

…esource-support-in-gce Cherry-pick: add extended resource support in GCE

Create labeler.yml

55e6ce7

Create labeler.yml

2e02a68

tedcm requested a review from lallydd February 16, 2023 20:25

tedcm self-assigned this Feb 16, 2023
