
gce: concurrent zonal List()s + opportunistic basename fill #4058

Merged (1 commit into kubernetes:master on May 21, 2021)

Conversation

@bpineau (Contributor) commented May 5, 2021

FetchAllMigs (an unfiltered InstanceGroupManagers.List) is costly as it's not
bounded to MIGs attached to the current cluster, but rather lists all MIGs
in the project/zone, and therefore equally affects all clusters in that
project/zone. Running the calls concurrently over the region's zones (so at
most 4 concurrent API calls, about once per minute) contains that impact.

findMigsInRegion might be scoped to the current cluster (by name pattern),
but also benefits from the same improvement, as it's also costly and
called at each refreshInterval (1 min).
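
A minimal sketch of that fan-out idea, with hypothetical helper names (not the PR's actual code): each zone's InstanceGroupManagers.List call runs in its own goroutine and results are merged under a mutex, so the per-zone latencies overlap instead of adding up.

```go
package main

import (
	"fmt"
	"sync"
)

// Mig is a simplified stand-in for the provider's MIG type.
type Mig struct {
	Name string
	Zone string
}

// listMigsInZone is a placeholder for the real per-zone
// InstanceGroupManagers.List call.
func listMigsInZone(zone string) ([]Mig, error) {
	return []Mig{{Name: "example-mig", Zone: zone}}, nil
}

// fetchAllMigsConcurrently issues one List call per zone in parallel and
// merges the results; with a regional cluster that is at most a handful
// of concurrent calls.
func fetchAllMigsConcurrently(zones []string) ([]Mig, error) {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		allMigs  []Mig
		firstErr error
	)
	for _, zone := range zones {
		wg.Add(1)
		go func(zone string) {
			defer wg.Done()
			migs, err := listMigsInZone(zone)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = err
				}
				return
			}
			allMigs = append(allMigs, migs...)
		}(zone)
	}
	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	return allMigs, nil
}

func main() {
	migs, _ := fetchAllMigsConcurrently([]string{"us-central1-a", "us-central1-b", "us-central1-c"})
	fmt.Println(migs)
}
```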

Also: we're calling the GCE mig.Get() API again for each MIG (at ~300ms per
API call, in my tests), sequentially and with the global cache lock held
(when updateClusterState -> ... -> GetMigForInstance kicks in). Yet we
already get that bit of information (the MIG's basename) from any other
mig.Get or mig.List call, like the one fetching target sizes. Leveraging
this helps significantly on large fleets (for instance, it shaves 8 minutes
off startup time on the large cluster I tested on).
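
A rough sketch of the opportunistic fill, with hypothetical type and method names: whenever a mig.Get or mig.List response already carries the instance basename, it is stored in a small cache so GetMigForInstance does not need a dedicated Get call for it later.

```go
package gce

import "sync"

// GceRef identifies a MIG; a simplified stand-in for the provider's type.
type GceRef struct {
	Project, Zone, Name string
}

// migBasenameCache remembers each MIG's instance basename once any API
// response has revealed it.
type migBasenameCache struct {
	mu        sync.Mutex
	basenames map[GceRef]string
}

func newMigBasenameCache() *migBasenameCache {
	return &migBasenameCache{basenames: map[GceRef]string{}}
}

// setMigBasename records the basename seen in a mig.Get/mig.List response
// (for example the call that fetches target sizes).
func (c *migBasenameCache) setMigBasename(ref GceRef, basename string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.basenames[ref] = basename
}

// getMigBasename returns the cached basename; the caller falls back to a
// real mig.Get only when ok is false.
func (c *migBasenameCache) getMigBasename(ref GceRef) (basename string, ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	basename, ok = c.basenames[ref]
	return
}
```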

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 5, 2021
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 5, 2021
@MaciekPytel (Contributor)

Opportunistic basename fill only improves the duration of the very first loop I think, because we never invalidate the basename cache. Is that correct? Or am I missing some path where we actually do invalidate it?

The code looks good to me, but I'd like to ask how you tested it and what level of improvement you have seen outside of the first loop (or is it 8 minutes every loop? In that case I'd like to understand why it helps so much).

@bpineau (Contributor, Author) commented May 17, 2021

That's correct, the baseName change only helps at startup time. Then that cache lives forever.

As for the tests: I first ran this (built with go build -race) on a pair of small clusters (then filled them with hundreds of MIGs) for about a week; I've now been running it for over 5 days on a dozen clusters ranging from a few dozen nodes to a few thousand.

About the level of improvement: I manually measured the startup time benefits (helped by both the basename change and the concurrency) on a large cluster with 2k MIGs+ITs attached, but didn't check other clusters (I can do that if you want me to).
Non-startup improvements (at every loop, due to the concurrent refreshes, not the basename change) look like this, for instance:

On a 1k nodes / 600 MIGs, not very busy cluster (in a project hosting ~5k MIGs overall, 3 zones, where InstanceGroupManagers.List() takes 10-12s for each zone):
[graph: cluster-autoscaler main loop duration quantiles (p50, p90, p99)]

On a 3k+ nodes / 2k MIGs, quite active cluster (also bumped concurrency at the end of 05/11, but that only helped p99):
[graph: cluster-autoscaler main loop duration quantiles (p50, p90, p99)]

By the way, the higher percentiles remain high and unstable on the larger clusters: that's probably because any single MIG change causes (at the next minute tick) a complete and blocking instanceRefToMigRef cache invalidation, followed by a full instance-list refresh for each MIG. That's via Refresh -> forceRefresh -> fetchAutoMigs -> [changed] -> RegenerateInstancesCache() -> gc.instanceRefToMigRef = make(...).

Perhaps RegenerateInstancesCache() could take a []GceRef slice as an argument and update just those entries (plus gc entries pointing to a MIG unknown to gc.getMigRefs()), and only after it has fetched the instances, rather than starting with a full invalidation? The background hourly RegenerateInstancesCache() goroutine spawned by CreateGceManager could then pass all known MIGs. Unless you suggest otherwise, I'm considering looking into this too.
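
A rough sketch of that idea, with hypothetical names and simplified types (not an actual cluster-autoscaler API): fetch the fresh instance lists first, then swap in only the entries for the MIGs that were passed, so a failed call or an unrelated MIG change never wipes the whole instanceRefToMigRef map.

```go
package gce

import "sync"

// GceRef is a simplified stand-in for the provider's MIG/instance reference.
type GceRef struct {
	Project, Zone, Name string
}

// GceCache holds the instance -> MIG mapping discussed above.
type GceCache struct {
	mu                  sync.Mutex
	instanceRefToMigRef map[GceRef]GceRef
}

// regenerateInstancesCacheFor refreshes the mapping only for the given MIGs.
// listMigInstances is a placeholder for the real per-MIG instance listing
// call. The real change would presumably also prune entries whose MIG is no
// longer known to gc.getMigRefs().
func (gc *GceCache) regenerateInstancesCacheFor(migs []GceRef, listMigInstances func(GceRef) ([]GceRef, error)) error {
	// Fetch everything first, so the existing cache stays usable on error.
	fresh := make(map[GceRef]GceRef)
	refreshed := make(map[GceRef]bool, len(migs))
	for _, mig := range migs {
		refreshed[mig] = true
		instances, err := listMigInstances(mig)
		if err != nil {
			return err
		}
		for _, inst := range instances {
			fresh[inst] = mig
		}
	}

	gc.mu.Lock()
	defer gc.mu.Unlock()
	// Drop only the entries belonging to the refreshed MIGs, then re-add the
	// fresh ones; entries for other MIGs are left untouched.
	for inst, mig := range gc.instanceRefToMigRef {
		if refreshed[mig] {
			delete(gc.instanceRefToMigRef, inst)
		}
	}
	for inst, mig := range fresh {
		gc.instanceRefToMigRef[inst] = mig
	}
	return nil
}
```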

@MaciekPytel (Contributor)

/lgtm
/approve

Thanks, that seems like some thorough testing. I don't think faster startup is all that critical, but the loop time improvement is very nice.

Regarding your other ideas, I'd need to dig into the code some more, but the overall idea seems reasonable. We've optimized the GCE provider while testing it with something like 100 MIGs, but I'm not aware of any prior optimization work targeting a case of 1k+ MIGs. So the impact of the forced cache refresh was much lower and there was probably not enough incentive to optimize it.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 21, 2021
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bpineau, MaciekPytel


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 21, 2021
@k8s-ci-robot k8s-ci-robot merged commit b0948c7 into kubernetes:master May 21, 2021