Cherry pick #3437 onto 1.18 - Avoid unwanted VMSS VMs caches invalidation #3521

marwanad · 2020-09-16T19:22:28Z

Cherry picks #3437

`fetchAutoAsgs()` is called at regular intervals, fetches a list of VMSS, then calls `Register()` to cache each of them. That registration function tells the caller whether that VMSS' cache is outdated (when the provided VMSS, supposedly fresh, differs from the one held in cache) and replaces the existing cache entry with the provided VMSS (which in effect requires a forced refresh, since that ScaleSet struct is passed by `fetchAutoAsgs` with a nil lastRefresh time and an empty instanceCache).

To detect changes, `Register()` uses a `reflect.DeepEqual()` between the provided and the cached VMSS, which always finds them different: cached VMSS were enriched with instance lists (while the provided one is blank, fresh from a simple vmss.list call). That DeepEqual is also fragile because the compared structs contain mutexes (that may be held or not) and refresh timestamps, attributes that shouldn't be relevant to the comparison.

As a consequence, every `Register()` call causes an indirect cache invalidation and a costly refresh (VMSS VMs List). The number of `Register()` calls is directly proportional to the number of VMSS attached to the cluster, and can easily trigger ARM API throttling. With a large number of VMSS, that throttling prevents `fetchAutoAsgs` from ever succeeding (and cluster-autoscaler from starting).
For example:

```
I0807 16:55:25.875907 153 azure_scale_set.go:344] GetScaleSetVms: starts
I0807 16:55:25.875915 153 azure_scale_set.go:350] GetScaleSetVms: scaleSet.Name: a-testvmss-10, vmList: []
E0807 16:55:25.875919 153 azure_scale_set.go:352] VirtualMachineScaleSetVMsClient.List failed for a-testvmss-10: &{true 0 2020-08-07 17:10:25.875447854 +0000 UTC m=+913.985215807 azure cloud provider throttled for operation VMSSVMList with reason "client throttled"}
E0807 16:55:25.875928 153 azure_manager.go:538] Failed to regenerate ASG cache: Retriable: true, RetryAfter: 899s, HTTPStatusCode: 0, RawError: azure cloud provider throttled for operation VMSSVMList with reason "client throttled"
F0807 16:55:25.875934 153 azure_cloud_provider.go:167] Failed to create Azure Manager: Retriable: true, RetryAfter: 899s, HTTPStatusCode: 0, RawError: azure cloud provider throttled for operation VMSSVMList with reason "client throttled"
goroutine 28 [running]:
```

Of the [`ScaleSet` struct attributes](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_scale_set.go#L74-L89) (manager, sizes, mutexes, refresh timestamps), only the sizes are relevant to that comparison. `curSize` is not strictly necessary, but comparing it triggers earlier refreshes of the instance caches.
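To illustrate the difference between the two comparisons, here is a minimal sketch. It is not the code from this PR: the `scaleSet` type, its field names, and the `sizesEqual` helper are hypothetical, simplified stand-ins for the real `ScaleSet` struct. It shows why `reflect.DeepEqual()` reports a freshly listed entry as different from an enriched cached one, while a comparison restricted to name and sizes does not.

```go
package main

import (
	"fmt"
	"reflect"
	"sync"
	"time"
)

// scaleSet is a hypothetical, simplified stand-in for the provider's
// ScaleSet struct; field names are assumptions made for illustration.
type scaleSet struct {
	name    string
	minSize int
	maxSize int
	curSize int64

	// Bookkeeping state that should not take part in change detection.
	sizeMutex   sync.Mutex
	lastRefresh time.Time
	instances   []string // stands in for the instance cache
}

// sizesEqual (hypothetical helper) compares only the identity and size
// attributes, ignoring mutexes, timestamps and cached instances.
func sizesEqual(a, b *scaleSet) bool {
	return a.name == b.name &&
		a.minSize == b.minSize &&
		a.maxSize == b.maxSize &&
		a.curSize == b.curSize
}

func main() {
	// Cached entry: already enriched with a refresh time and instances.
	cached := &scaleSet{
		name: "a-testvmss-10", minSize: 1, maxSize: 10, curSize: 3,
		lastRefresh: time.Now(),
		instances:   []string{"vm-0", "vm-1", "vm-2"},
	}
	// Fresh entry: same sizes, but zero refresh time and no instances.
	fresh := &scaleSet{name: "a-testvmss-10", minSize: 1, maxSize: 10, curSize: 3}

	// DeepEqual also walks the bookkeeping fields, so the two entries differ.
	fmt.Println("DeepEqual:", reflect.DeepEqual(cached, fresh))
	// The size-only comparison treats them as the same scale set.
	fmt.Println("sizesEqual:", sizesEqual(cached, fresh))
}
```

With a comparison of this shape, a registration would only report a change (and drop the cached instances) when a size actually differs, rather than on every call.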

feiskyer

/lgtm
/approve

k8s-ci-robot · 2020-09-17T00:07:41Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: feiskyer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/cloudprovider/azure/OWNERS~~ [feiskyer]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

bpineau and others added 2 commits September 16, 2020 12:17

call in the nodegroup API to avoid type assertion errors

e146e3e

k8s-ci-robot added cncf-cla: yes size/M labels Sep 16, 2020

k8s-ci-robot requested review from feiskyer and nilo19 September 16, 2020 19:22

marwanad closed this Sep 16, 2020

marwanad reopened this Sep 16, 2020

feiskyer reviewed Sep 17, 2020

k8s-ci-robot assigned feiskyer Sep 17, 2020

k8s-ci-robot added the lgtm label Sep 17, 2020

k8s-ci-robot added the approved label Sep 17, 2020

k8s-ci-robot merged commit 3e51cc7 into kubernetes:cluster-autoscaler-release-1.18 Sep 17, 2020

marwanad deleted the cherry-pick-3437-1.18 branch September 18, 2020 22:44
