gce: concurrent zonal List()s + opportunistic basename fill #4058
Conversation
FetchAllMigs (unfiltered InstanceGroupManagers.List) is costly: it isn't bounded to MIGs attached to the current cluster, but rather lists all MIGs in the project/zone, and therefore equally affects all clusters in that project/zone. Running the calls concurrently over the region's zones (so at most 4 concurrent API calls, about once per minute) contains that impact.

findMigsInRegion might be scoped to the current cluster (name pattern), but it benefits from the same improvement, as it's also costly and called at each refreshInterval (1 min).

Also: we're calling the GCE mig.Get() API again for each MIG (at ~300ms per API call, in my tests), sequentially and with the global cache lock held (when updateClusterState -> ... -> GetMigForInstance kicks in). Yet we already get that bit of information (the MIG's basename) from any other mig.Get or mig.List call, like the one fetching target sizes. Leveraging this helps significantly on large fleets (for instance, it shaves 8 min off startup time on the large cluster I tested on). Both ideas are sketched below.
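A minimal sketch of the concurrent zonal listing idea, assuming a google.golang.org/api/compute/v1 client; the function name listMigsConcurrently and its shape are illustrative, not the actual autoscaler code:

```go
package gcesketch

import (
	"context"
	"sync"

	compute "google.golang.org/api/compute/v1"
)

// listMigsConcurrently issues one InstanceGroupManagers.List call per zone in
// parallel, so a region with 4 zones costs roughly one API round-trip of wall
// time instead of four sequential ones. Hypothetical helper, for illustration.
func listMigsConcurrently(ctx context.Context, svc *compute.Service, project string, zones []string) ([]*compute.InstanceGroupManager, error) {
	var (
		mu   sync.Mutex // guards migs and errs across goroutines
		wg   sync.WaitGroup
		migs []*compute.InstanceGroupManager
		errs []error
	)
	for _, zone := range zones {
		wg.Add(1)
		go func(zone string) {
			defer wg.Done()
			// Pages walks every page of the zonal list result.
			err := svc.InstanceGroupManagers.List(project, zone).Pages(ctx,
				func(page *compute.InstanceGroupManagerList) error {
					mu.Lock()
					migs = append(migs, page.Items...)
					mu.Unlock()
					return nil
				})
			if err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(zone)
	}
	wg.Wait()
	if len(errs) > 0 {
		return nil, errs[0]
	}
	return migs, nil
}
```

Concurrency is naturally bounded by the number of zones in the region (at most 4 goroutines here, about once per minute), so the extra parallelism adds no meaningful QPS pressure.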
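And a sketch of the opportunistic basename fill, under the same assumptions (same package and imports as above); the basenameCache type is hypothetical, standing in for the provider's MIG cache:

```go
// basenameCache records base instance names opportunistically, so the hot
// path (GetMigForInstance under the global cache lock) needs no extra
// per-MIG Get call. Illustrative type, not the actual autoscaler cache.
type basenameCache struct {
	mu        sync.Mutex
	basenames map[string]string // MIG self-link URL -> base instance name
}

// fillFrom harvests basenames from a list response that was fetched anyway
// (e.g. while reading target sizes).
func (c *basenameCache) fillFrom(migs []*compute.InstanceGroupManager) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, m := range migs {
		if m.BaseInstanceName != "" {
			c.basenames[m.SelfLink] = m.BaseInstanceName
		}
	}
}

// basename returns the cached value; only on a miss would the caller fall
// back to a dedicated mig.Get() (~300ms per call in the author's tests).
func (c *basenameCache) basename(migURL string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	name, ok := c.basenames[migURL]
	return name, ok
}
```

Note that nothing in this sketch invalidates entries, so a cached basename lives for the process lifetime; that is exactly the behavior the review comment below asks about.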
Opportunistic basename fill only improves the duration of the very first loop, I think, because we never invalidate the basename cache. Is that correct, or am I missing some path where we actually do invalidate it? The code looks good to me, but I'd like to ask how you tested it and what level of improvement you've seen outside of the first loop (or is it 8 min every loop? In that case I'd like to understand why it helps so much).
/lgtm Thanks, that seems like some thorough testing. I don't think faster startup is all that critical, but the loop time improvement is very nice. Regarding your other ideas: I'd need to dig into the code some more, but the overall approach seems reasonable. We've optimized the GCE provider while testing it with something like 100 MIGs, but I'm not aware of any prior optimization work targeting a case of 1k+ MIGs. So the impact of forced cache refresh was much lower, and there was probably not enough incentive to optimize it.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bpineau, MaciekPytel. The full list of commands accepted by this bot can be found here; the pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.