
gce: concurrent zonal List()s + opportunistic basename fill #4058

Merged (1 commit into kubernetes:master on May 21, 2021)

Conversation

@bpineau (Contributor) commented May 5, 2021

FetchAllMigs (an unfiltered InstanceGroupManagers.List) is costly as it's not
bounded to MIGs attached to the current cluster, but rather lists all MIGs
in the project/zone, and therefore equally affects all clusters in that
project/zone. Running the calls concurrently over the region's zones (so at
most 4 concurrent API calls, about once per minute) contains that impact.

findMigsInRegion might be scoped to the current cluster (by name pattern),
but also benefits from the same improvement, as it's also costly and
called at each refreshInterval (1 min).
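
A minimal sketch of that fan-out idea, with hypothetical helper names (not the PR's actual code): each zone's InstanceGroupManagers.List call runs in its own goroutine and results are merged under a mutex, so the per-zone latencies overlap instead of adding up.

```go
package main

import (
	"fmt"
	"sync"
)

// Mig is a simplified stand-in for the provider's MIG type.
type Mig struct {
	Name string
	Zone string
}

// listMigsInZone is a placeholder for the real per-zone
// InstanceGroupManagers.List call.
func listMigsInZone(zone string) ([]Mig, error) {
	return []Mig{{Name: "example-mig", Zone: zone}}, nil
}

// fetchAllMigsConcurrently issues one List call per zone in parallel and
// merges the results; with a regional cluster that is at most a handful
// of concurrent calls.
func fetchAllMigsConcurrently(zones []string) ([]Mig, error) {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		allMigs  []Mig
		firstErr error
	)
	for _, zone := range zones {
		wg.Add(1)
		go func(zone string) {
			defer wg.Done()
			migs, err := listMigsInZone(zone)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = err
				}
				return
			}
			allMigs = append(allMigs, migs...)
		}(zone)
	}
	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	return allMigs, nil
}

func main() {
	migs, _ := fetchAllMigsConcurrently([]string{"us-central1-a", "us-central1-b", "us-central1-c"})
	fmt.Println(migs)
}
```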

Also: we're calling the GCE mig.Get() API again for each MIG (at ~300ms per
API call, in my tests), sequentially and with the global cache lock held
(when updateClusterState -> ... -> GetMigForInstance kicks in). Yet we
already get that bit of information (the MIG's basename) from any other
mig.Get or mig.List call, like the one fetching target sizes. Leveraging
this helps significantly on large fleets (for instance, it shaves 8 minutes
off startup time on the large cluster I tested on).
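
A rough sketch of the opportunistic fill, with hypothetical type and method names: whenever a mig.Get or mig.List response already carries the instance basename, it is stored in a small cache so GetMigForInstance does not need a dedicated Get call for it later.

```go
package gce

import "sync"

// GceRef identifies a MIG; a simplified stand-in for the provider's type.
type GceRef struct {
	Project, Zone, Name string
}

// migBasenameCache remembers each MIG's instance basename once any API
// response has revealed it.
type migBasenameCache struct {
	mu        sync.Mutex
	basenames map[GceRef]string
}

func newMigBasenameCache() *migBasenameCache {
	return &migBasenameCache{basenames: map[GceRef]string{}}
}

// setMigBasename records the basename seen in a mig.Get/mig.List response
// (for example the call that fetches target sizes).
func (c *migBasenameCache) setMigBasename(ref GceRef, basename string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.basenames[ref] = basename
}

// getMigBasename returns the cached basename; the caller falls back to a
// real mig.Get only when ok is false.
func (c *migBasenameCache) getMigBasename(ref GceRef) (basename string, ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	basename, ok = c.basenames[ref]
	return
}
```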

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 5, 2021
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 5, 2021
@MaciekPytel (Contributor)

Opportunistic basename fill only improves the duration of the very first loop I think, because we never invalidate the basename cache. Is that correct? Or am I missing some path where we actually do invalidate it?

The code looks good to me, but I'd like to ask how you tested it and what level of improvement you have seen outside of the first loop (or is it 8 minutes every loop? In that case I'd like to understand why it helps so much).

@bpineau (Contributor, Author) commented May 17, 2021

That's correct, the baseName change only helps at startup time. Then that cache lives forever.

As for the tests: I first ran this (built with go build -race) on a pair of small clusters (then filled them with hundreds of MIGs) for about a week; I've now been running it for over 5 days on a dozen clusters ranging from a few dozen nodes to a few thousand.

About the level of improvement: I manually measured the startup time benefits (helped by both the basename change and the concurrency) on a large cluster with 2k MIGs+ITs attached, but didn't check other clusters (I can do that if you want me to).
Non-startup improvements (at every loop, due to the concurrent refreshes, not the basename change) look like this, for instance:

On a 1k nodes / 600 MIGs, not very busy cluster (in a project hosting ~5k MIGs overall, 3 zones, where InstanceGroupManagers.List() takes 10-12s for each zone):
[graph: cluster-autoscaler main loop duration quantiles (p50, p90, p99)]

On a 3k+ nodes / 2k MIGs, quite active cluster (also bumped concurrency at the end of 05/11, but that only helped p99):
[graph: cluster-autoscaler main loop duration quantiles (p50, p90, p99)]

By the way, the higher percentiles remain high and unstable on the larger clusters: that's probably because any single MIG change causes (at the next minute tick) a complete and blocking instanceRefToMigRef cache invalidation, followed by a full instance-list refresh for each MIG. That's via Refresh -> forceRefresh -> fetchAutoMigs -> [changed] -> RegenerateInstancesCache() -> gc.instanceRefToMigRef = make(...).

Perhaps RegenerateInstancesCache() could take a []GceRef slice as an argument and update just those entries (plus gc entries pointing to a MIG unknown to gc.getMigRefs()), and only after it has fetched the instances, rather than starting with a full invalidation? The background hourly RegenerateInstancesCache() goroutine spawned by CreateGceManager could then pass all known MIGs. Unless you suggest otherwise, I'm considering looking into this too.
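
A rough sketch of that idea, with hypothetical names and simplified types (not an actual cluster-autoscaler API): fetch the fresh instance lists first, then swap in only the entries for the MIGs that were passed, so a failed call or an unrelated MIG change never wipes the whole instanceRefToMigRef map.

```go
package gce

import "sync"

// GceRef is a simplified stand-in for the provider's MIG/instance reference.
type GceRef struct {
	Project, Zone, Name string
}

// GceCache holds the instance -> MIG mapping discussed above.
type GceCache struct {
	mu                  sync.Mutex
	instanceRefToMigRef map[GceRef]GceRef
}

// regenerateInstancesCacheFor refreshes the mapping only for the given MIGs.
// listMigInstances is a placeholder for the real per-MIG instance listing
// call. The real change would presumably also prune entries whose MIG is no
// longer known to gc.getMigRefs().
func (gc *GceCache) regenerateInstancesCacheFor(migs []GceRef, listMigInstances func(GceRef) ([]GceRef, error)) error {
	// Fetch everything first, so the existing cache stays usable on error.
	fresh := make(map[GceRef]GceRef)
	refreshed := make(map[GceRef]bool, len(migs))
	for _, mig := range migs {
		refreshed[mig] = true
		instances, err := listMigInstances(mig)
		if err != nil {
			return err
		}
		for _, inst := range instances {
			fresh[inst] = mig
		}
	}

	gc.mu.Lock()
	defer gc.mu.Unlock()
	// Drop only the entries belonging to the refreshed MIGs, then re-add the
	// fresh ones; entries for other MIGs are left untouched.
	for inst, mig := range gc.instanceRefToMigRef {
		if refreshed[mig] {
			delete(gc.instanceRefToMigRef, inst)
		}
	}
	for inst, mig := range fresh {
		gc.instanceRefToMigRef[inst] = mig
	}
	return nil
}
```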

@MaciekPytel (Contributor)

/lgtm
/approve

Thanks, that seems like some thorough testing. I don't think faster startup is all that critical, but the loop time improvement is very nice.

Regarding your other ideas, I'd need to dig into the code some more, but the overall idea seems reasonable. We've optimized the GCE provider while testing it with something like 100 MIGs, but I'm not aware of any prior optimization work targeting a case of 1k+ MIGs. So the impact of the forced cache refresh was much lower and there was probably not enough incentive to optimize it.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 21, 2021
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bpineau, MaciekPytel


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 21, 2021
@k8s-ci-robot k8s-ci-robot merged commit b0948c7 into kubernetes:master May 21, 2021