
async create and update for machine pools #1067

Merged
merged 2 commits into from
Jan 23, 2021

Conversation

devigned
Contributor

@devigned devigned commented Dec 5, 2020

What type of PR is this?
/kind feature

What this PR does / why we need it:
Currently, all of the reconcilers call Azure and wait for the operation to complete. At times that is the right thing to do, but most of the time it's the equivalent of an app's UI freezing because you have used the UI thread to fetch data, leaving your user to wonder why and when the software will react. This PR is a WIP that will make the reaction time of MachinePools in CAPZ drastically faster and, possibly, more resilient.

[screenshot]

related to: #819

Special notes for your reviewer:
I've intentionally not changed the version upgrade / VMSS instance update behavior. I'll follow up with another PR which will add min-available replicas, max-surge, and drain behaviors. That should finish off #819.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Enable asynchronous reconciliation for AzureMachinePools

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 5, 2020
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Dec 5, 2020
@devigned
Contributor Author

devigned commented Dec 5, 2020

@alexeldeib notice that half of the time in reconcile is now spent in resourceSkus. We should discuss a good way to cache that data rather than fetching it on each reconcile.

Contributor

@CecileRobertMichon CecileRobertMichon left a comment


This looks great! I'll have to take an in-depth look on Monday, but I'm excited to see where it's going. One question that came to mind: how will this affect the number of API calls, especially for subscriptions running many clusters with many scale sets? Before, we'd make one call for create and wait for the operation to be done. Now we might check in a few times before the actual operation is done, so a single update might be 3-4x the calls, which could add up with many scale sets. I do see one improvement in not updating the scale set unless something actually changed, though, so that's good.

@CecileRobertMichon
Contributor

One more question: in the case of machinepools, there are only two resources involved: vmss and role assignments. If you were to create a machinepool with system assigned identity, does it mean the role assignment create would fail and requeue until the scale set has finished provisioning? In the case of an AzureMachine where there are a lot more services involved, would every resource that has a dependency on the previous one fail if the dependencies aren't ready? I don't see this necessarily being a problem but maybe we can be smart about it and only attempt to provision if we know that the dependent resources have finished provisioning by looking at their future/state? But then that means CAPZ needs to recreate the ARM dependency graph. Or maybe we just need to improve logging so the logs aren't full of noisy "vm not found" ARM errors.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 5, 2020
@devigned
Contributor Author

devigned commented Dec 5, 2020

I added an LRU cache for the resource SKUs. The resulting performance is greatly improved; see the Jaeger output below. Notice there is no call to refresh SKU data. A "quick" reconcile for an AzureMachinePool is now roughly 200 ms in the middle of a deploy.

[screenshot: Jaeger trace of an AzureMachinePool reconcile]
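The caching approach can be sketched as a small LRU keyed by SKU name. This is only an illustration of the idea under discussion, not the actual CAPZ implementation; all names below (`skuCache`, `newSKUCache`, `entry`) are hypothetical.

```go
package main

import (
	"container/list"
	"fmt"
)

// skuCache is a minimal LRU sketch: keep resourceSkus results in memory so
// each reconcile does not re-fetch them from the Azure API.
type skuCache struct {
	cap   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // SKU name -> element holding *entry
}

type entry struct {
	key string
	val string
}

func newSKUCache(capacity int) *skuCache {
	return &skuCache{cap: capacity, order: list.New(), items: map[string]*list.Element{}}
}

// Get returns a cached value and marks it most recently used.
func (c *skuCache) Get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).val, true
}

// Put inserts or refreshes a value, evicting the least recently used
// entry when the cache is full.
func (c *skuCache) Put(key, val string) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).val = val
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.cap {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
	c.items[key] = c.order.PushFront(&entry{key: key, val: val})
}

func main() {
	c := newSKUCache(2)
	c.Put("Standard_D2s_v3", "sku-data-1")
	c.Put("Standard_D4s_v3", "sku-data-2")
	c.Get("Standard_D2s_v3")               // touch, so D4 becomes least recently used
	c.Put("Standard_D8s_v3", "sku-data-3") // evicts D4
	_, ok := c.Get("Standard_D4s_v3")
	fmt.Println(ok) // false: evicted
}
```

The same structure also makes it easy to bolt on eviction policies later (the TTL question discussed further down this thread).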

/cc @alexeldeib

@devigned
Contributor Author

devigned commented Dec 5, 2020

> This looks great! I'll have to take an in-depth look on Monday, but I'm excited to see where it's going. One question that came to mind: how will this affect the number of API calls, especially for subscriptions running many clusters with many scale sets? Before, we'd make one call for create and wait for the operation to be done. Now we might check in a few times before the actual operation is done, so a single update might be 3-4x the calls, which could add up with many scale sets. I do see one improvement in not updating the scale set unless something actually changed, though, so that's good.

That is an excellent question! Though the basis for concern is a little misplaced. When we issue a long-running operation to Azure, as CAPZ does in master, we ask Azure to create a resource, which starts a long-running operation. We get back a 202 (sometimes a 201) saying the request was accepted, but not completed. We then rely on the SDK to poll the API for a result, often many times, e.g. every 10s, until a terminal provisioning status is reached. Unfortunately, Azure has no way to call us back when it's done building the resource. The Azure SDK is not very smart about how often it polls. It is configurable, but at this point we don't do anything with the polling configuration.

In this PR, I have it set to requeue after 15s, which is probably too quick for VMSS anyway. I think we may end up making fewer API calls over time than by using the SDK to poll the long-running operation.

In the future, I think we could get really clever with this functionality and provide some data-driven targets. For example, we could determine the statistical distribution of VM provision time, know that the controller is creating a VM, then set up a requeue at each 20th percentile (P20, P40, ...) with the idea that we'd likely have 2 or 3 requests per VM creation. We could do this per resource and update the data set as time moves forward. This could be a huge win for CAPZ users.

WDYT, @CecileRobertMichon?

@devigned
Contributor Author

devigned commented Dec 5, 2020

> One more question: in the case of machinepools, there are only two resources involved: vmss and role assignments. If you were to create a machinepool with system assigned identity, does it mean the role assignment create would fail and requeue until the scale set has finished provisioning? In the case of an AzureMachine where there are a lot more services involved, would every resource that has a dependency on the previous one fail if the dependencies aren't ready? I don't see this necessarily being a problem but maybe we can be smart about it and only attempt to provision if we know that the dependent resources have finished provisioning by looking at their future/state? But then that means CAPZ needs to recreate the ARM dependency graph. Or maybe we just need to improve logging so the logs aren't full of noisy "vm not found" ARM errors.

This is an interesting artifact of how I structured the async behavior. When we reach a point in reconciliation where we are going to wait on Azure, the service returns an OperationNotDoneError. This short-circuits the reconciliation: when we encounter an error, we wrap it and send it up the stack immediately. We do not progress; we halt and return. This behavior is desirable because it stops us from progressing to resources that we may not be able to complete yet. This PR would not be possible without the uniform handling of transient errors which was recently merged.

There is way too much noise in the logs from transient errors right now as we log them as actual errors. I don't think they should be errors, but rather a category of expected behavior where the operator should stay calm and reconcile on. Thus, we should remove them from the log.

@alexeldeib
Contributor

Looks like great stuff so far :)

> We then rely on the SDK to poll the API for a result, often many times, e.g. every 10s, until a terminal provisioning status is reached. Unfortunately, Azure has no way to call us back when it's done building the resource. The Azure SDK is not very smart about how often it polls. It is configurable, but at this point we don't do anything with the polling configuration.

I wonder if it's worth extracting that ourselves into one http client (using some of the autorest helpers?) and then using it as the base for all the service clients.

@devigned
Contributor Author

devigned commented Dec 7, 2020

> I wonder if it's worth extracting that ourselves into one http client (using some of the autorest helpers?) and then using it as the base for all the service clients.

We could / should. I'll send you a proposal I worked on, unrelated to this project. The thing is, the behaviors will be domain-specific. For example, a vanilla marketplace-image VM with a given SKU has different start/delete latency distributions than our specialized use case. The helpers would be useful if we can provide operation type, resource type, and latency distribution. Without that specific input, it would probably be too generic to be helpful.

> note that this introduces a subtle edge case, where the cached SKU data is not refreshed.

Great point! I'll work on evicting a SKU from the cache after a max TTL. Any thoughts on how long the TTL should be?

@alexeldeib
Contributor

> Any thoughts on how long the TTL should be?

The easiest way might not be a TTL, but e.g. refreshing the cache for a specific subscription when you hit an unexpected error? That might be too aggressive, though. Anyway, I doubt restrictions change per subscription that often; if it becomes an issue, maybe we can revisit.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 9, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 9, 2020
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 10, 2020
@devigned devigned force-pushed the async-mp branch 2 times, most recently from 6f938b5 to 58b5330 Compare December 11, 2020 01:13
@devigned
Contributor Author

@CecileRobertMichon I've rebased off of master with the VMSS test changes.

Contributor

@CecileRobertMichon CecileRobertMichon left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 16, 2021
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 16, 2021
@devigned
Contributor Author

/retest

@nader-ziada
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 20, 2021
@devigned
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 21, 2021
@k8s-ci-robot
Contributor

k8s-ci-robot commented Jan 22, 2021

@devigned: The following test failed, say /retest to rerun all failed tests:

Test name: pull-cluster-api-provider-azure-apidiff
Commit: 87561e0
Rerun command: /test pull-cluster-api-provider-azure-apidiff

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@CecileRobertMichon
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 23, 2021
@k8s-ci-robot k8s-ci-robot merged commit e7ffa8e into kubernetes-sigs:master Jan 23, 2021
@k8s-ci-robot k8s-ci-robot added this to the v0.4.11 milestone Jan 23, 2021
@CecileRobertMichon
Contributor

/pony

@k8s-ci-robot
Contributor

@CecileRobertMichon: [pony image]

In response to this:

/pony

