Avoid unwanted VMSS VMs caches invalidations #3437

bpineau · 2020-08-18T13:28:02Z

fetchAutoAsgs() is called at regular intervals, fetches a list of VMSS, then calls Register() to cache each of them. That registration function tells the caller whether that VMSS's cache is outdated (when the provided VMSS, supposedly fresh, differs from the one held in cache) and replaces the existing cache entry with the provided VMSS (which in effect forces a refresh, since that ScaleSet struct is passed by fetchAutoAsgs with a nil lastRefresh time and an empty instanceCache).

To detect changes, Register() uses reflect.DeepEqual() between the provided and the cached VMSS, which always finds them different: cached VMSS are enriched with instance lists (while the provided one is blank, fresh from a simple vmss.list call). That DeepEqual is also fragile because the compared structs contain mutexes (that may be held or not) and refresh timestamps, attributes that shouldn't be relevant to the comparison.
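The effect described above can be reproduced with a minimal sketch. The `scaleSet` type and the `freshFromList`, `enrich` and `looksChanged` helpers below are hypothetical, simplified stand-ins and not the actual cluster-autoscaler code; they only assume that a cached entry carries a refresh timestamp and an instance list that a freshly listed one lacks.

```go
package main

import (
	"fmt"
	"reflect"
	"time"
)

// scaleSet is a simplified, hypothetical stand-in for the real ScaleSet struct.
type scaleSet struct {
	name          string
	minSize       int
	maxSize       int
	lastRefresh   time.Time
	instanceCache []string
}

// freshFromList mimics a scale set built right after a plain VMSS list call:
// no refresh time and no instances yet.
func freshFromList(name string, minSize, maxSize int) *scaleSet {
	return &scaleSet{name: name, minSize: minSize, maxSize: maxSize}
}

// enrich mimics what a cache refresh does to an already registered scale set.
func enrich(s *scaleSet, instances []string) {
	s.instanceCache = instances
	s.lastRefresh = time.Now()
}

// looksChanged is the DeepEqual-based change detection under discussion.
func looksChanged(cached, provided *scaleSet) bool {
	return !reflect.DeepEqual(cached, provided)
}

func main() {
	cached := freshFromList("a-testvmss-10", 1, 10)
	enrich(cached, []string{"vm-0", "vm-1"})

	provided := freshFromList("a-testvmss-10", 1, 10)

	// The configuration is identical, yet the comparison reports a change
	// because of the timestamp and the instance list.
	fmt.Println(looksChanged(cached, provided)) // prints "true"
}
```

Two freshly listed entries with the same sizes do compare equal; it is the enrichment of the cached copy that makes every comparison fail.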

As a consequence, every Register() call causes an indirect cache invalidation and a costly refresh (VMSS VMs List). The number of Register() calls is directly proportional to the number of VMSS attached to the cluster, and can easily trigger ARM API throttling.

With a large number of VMSS, that throttling prevents fetchAutoAsgs from ever succeeding (and cluster-autoscaler from starting), e.g.:

I0807 16:55:25.875907     153 azure_scale_set.go:344] GetScaleSetVms: starts
I0807 16:55:25.875915     153 azure_scale_set.go:350] GetScaleSetVms: scaleSet.Name: a-testvmss-10, vmList: []
E0807 16:55:25.875919     153 azure_scale_set.go:352] VirtualMachineScaleSetVMsClient.List failed for a-testvmss-10: &{true 0 2020-08-07 17:10:25.875447854 +0000 UTC m=+913.985215807 azure cloud provider throttled for operation VMSSVMList with reason "client throttled"}
E0807 16:55:25.875928     153 azure_manager.go:538] Failed to regenerate ASG cache: Retriable: true, RetryAfter: 899s, HTTPStatusCode: 0, RawError: azure cloud provider throttled for operation VMSSVMList with reason "client throttled"
F0807 16:55:25.875934     153 azure_cloud_provider.go:167] Failed to create Azure Manager: Retriable: true, RetryAfter: 899s, HTTPStatusCode: 0, RawError: azure cloud provider throttled for operation VMSSVMList with reason "client throttled"

Of the ScaleSet struct attributes (manager, sizes, mutexes, refresh timestamps; see https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_scale_set.go#L74-L89), only sizes are relevant to that comparison. curSize is not strictly necessary, but comparing it will provide early instance cache refreshes (please tell me if you'd rather not have this here).
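A size-only comparison could look roughly like the sketch below. It is an illustration under assumptions, not the code of this PR: `scaleSet`, `registry`, `newRegistry`, `sameSizes` and `register` are hypothetical names, and the real Register() has more responsibilities than shown here.

```go
package main

import "fmt"

// scaleSet is a simplified, hypothetical stand-in for the real ScaleSet struct.
type scaleSet struct {
	name          string
	minSize       int
	maxSize       int
	curSize       int64
	instanceCache []string
}

// sameSizes compares only the attributes that matter for change detection.
func sameSizes(a, b *scaleSet) bool {
	return a.minSize == b.minSize && a.maxSize == b.maxSize && a.curSize == b.curSize
}

// registry is a hypothetical cache of registered scale sets, keyed by name.
type registry struct {
	sets map[string]*scaleSet
}

func newRegistry() *registry {
	return &registry{sets: map[string]*scaleSet{}}
}

// register reports whether the cache changed. An existing entry with the same
// sizes is kept as-is, so its instance cache survives the call.
func (r *registry) register(s *scaleSet) bool {
	if existing, ok := r.sets[s.name]; ok && sameSizes(existing, s) {
		return false
	}
	r.sets[s.name] = s
	return true
}

func main() {
	r := newRegistry()
	fmt.Println(r.register(&scaleSet{name: "a-testvmss-10", minSize: 1, maxSize: 10, curSize: 2})) // prints "true"

	// Simulate an instance cache refresh on the registered entry.
	r.sets["a-testvmss-10"].instanceCache = []string{"vm-0", "vm-1"}

	// Same sizes: nothing is replaced and the instance cache is preserved.
	fmt.Println(r.register(&scaleSet{name: "a-testvmss-10", minSize: 1, maxSize: 10, curSize: 2})) // prints "false"
	fmt.Println(len(r.sets["a-testvmss-10"].instanceCache))                                        // prints "2"

	// A different curSize replaces the entry, which forces an early refresh.
	fmt.Println(r.register(&scaleSet{name: "a-testvmss-10", minSize: 1, maxSize: 10, curSize: 3})) // prints "true"
}
```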

Example on a cluster with 219 attached VMSS (the previous version was unthrottled at 12:29, the modified cluster-autoscaler was rolled out at 12:38).


k8s-ci-robot · 2020-08-18T13:28:10Z

Welcome @bpineau!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

nilo19 · 2020-08-18T14:10:15Z

/lgtm

marwanad · 2020-08-18T17:40:05Z

@bpineau Thanks for this fix! On that note, the implementation of the caches within azure_scale_set.go isn't ideal because essentially every scaling group has its own cache of the size, which makes the batched List VMSS calls that retrieve the size redundant. I'm planning on redesigning this flow.

/lgtm
/approve

k8s-ci-robot · 2020-08-18T17:40:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marwanad

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/cloudprovider/azure/OWNERS~~ [marwanad]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

bpineau · 2020-08-19T15:06:01Z

Thanks @marwanad !
Very true, that makes two sources of truth for vmss sizes.

One good aspect of the VMSS List (as opposed to the many VMSS VMs Lists) is that it's constant time (one call every vmssCacheTTL seconds), while we're suffering from the many VMSS VMs List calls, O(number of VMSS).
In that sense, it's a cheaper way to check VMSS freshness. Is that something informing the redesign?

I'm working on adding an optional jitter to spread those vmssvm.list calls, which should help us survive until the redesign lands.
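For illustration only, one way such a jitter could work is to add a random extra delay to each scale set's refresh interval, so the list calls do not all fire together. The `jitteredTTL` and `spread` functions and their parameters are hypothetical and are not taken from any actual patch; the 15s base and 5s jitter in `main` are arbitrary example values.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredTTL returns base plus a pseudo-random extra delay in [0, maxJitter).
// A non-positive maxJitter disables the jitter.
func jitteredTTL(base, maxJitter time.Duration, rng *rand.Rand) time.Duration {
	if maxJitter <= 0 {
		return base
	}
	return base + time.Duration(rng.Int63n(int64(maxJitter)))
}

// spread computes one jittered TTL per scale set. The seed is explicit so a
// run can be reproduced.
func spread(n int, base, maxJitter time.Duration, seed int64) []time.Duration {
	rng := rand.New(rand.NewSource(seed))
	ttls := make([]time.Duration, n)
	for i := range ttls {
		ttls[i] = jitteredTTL(base, maxJitter, rng)
	}
	return ttls
}

func main() {
	for i, ttl := range spread(3, 15*time.Second, 5*time.Second, 1) {
		fmt.Printf("vmss-%d refreshes after %v\n", i, ttl)
	}
}
```

The jitter only spreads the calls over time. The total number of calls stays proportional to the number of scale sets.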

marwanad · 2020-08-19T16:43:17Z

@bpineau

Yup, so the redesign will address VMSS LIST and have the size source of truth in a single global cache. This is going to be O(1) every vmssCacheTTL. Today, unfortunately it's O(numberOfVMSS) every vmssCacheTTL.

This acts as a guard for the VMSS VM cache as well, in a sense, because if the sizes are the same, there's no need to fetch new VM instances.
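As a rough sketch of that idea, and not the planned design: a single cache could be refreshed by one batched list call per TTL and report which scale sets changed size, so that only those need their VM instances fetched again. All names below (`sizeLister`, `globalCache`, `newGlobalCache`, `refresh`) are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// sizeLister is a hypothetical stand-in for one batched VMSS list call that
// returns the current size of every scale set, keyed by name.
type sizeLister func() map[string]int64

// globalCache is a hypothetical single source of truth for scale set sizes.
type globalCache struct {
	list  sizeLister
	sizes map[string]int64
}

func newGlobalCache(list sizeLister) *globalCache {
	return &globalCache{list: list, sizes: map[string]int64{}}
}

// refresh performs one list call, whatever the number of scale sets, and
// returns the sorted names of the scale sets that are new or changed size.
// Only those would need a VM instance refresh.
func (c *globalCache) refresh() []string {
	latest := c.list()
	changed := []string{}
	updated := make(map[string]int64, len(latest))
	for name, size := range latest {
		if old, ok := c.sizes[name]; !ok || old != size {
			changed = append(changed, name)
		}
		updated[name] = size // copy, so later mutations by the lister do not leak in
	}
	sort.Strings(changed)
	c.sizes = updated
	return changed
}

func main() {
	current := map[string]int64{"vmss-a": 2, "vmss-b": 5}
	cache := newGlobalCache(func() map[string]int64 { return current })

	fmt.Println(cache.refresh()) // prints "[vmss-a vmss-b]"
	fmt.Println(cache.refresh()) // prints "[]"

	current["vmss-b"] = 6
	fmt.Println(cache.refresh()) // prints "[vmss-b]"
}
```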

bpineau · 2020-08-19T17:39:30Z

@marwanad Amazing! That redesign will be very useful for us, and a great relief!

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 18, 2020

k8s-ci-robot requested review from marwanad and nilo19 August 18, 2020 13:28

k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Aug 18, 2020

k8s-ci-robot assigned nilo19 Aug 18, 2020

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 18, 2020

k8s-ci-robot assigned marwanad Aug 18, 2020

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 18, 2020

k8s-ci-robot merged commit 5b69017 into kubernetes:master Aug 18, 2020

This was referenced Sep 16, 2020

Cherry pick #3437 onto 1.19 - Avoid unwanted VMSS VMs caches invalidations #3520

Merged

Cherry pick #3437 onto 1.18 - Avoid unwanted VMSS VMs caches invalidation #3521

Merged

k8s-ci-robot added a commit that referenced this pull request Sep 17, 2020

Merge pull request #3521 from marwanad/cherry-pick-3437-1.18

3e51cc7

Cherry pick #3437 onto 1.18 - Avoid unwanted VMSS VMs caches invalidation

k8s-ci-robot added a commit that referenced this pull request Sep 17, 2020

Merge pull request #3520 from marwanad/cherry-pick-3437-1.19

bba156c

Cherry pick #3437 onto 1.19 - Avoid unwanted VMSS VMs caches invalidations
