
async create and update for machine pools #1067

Merged
merged 2 commits into from
Jan 23, 2021

Conversation

devigned
Contributor

@devigned devigned commented Dec 5, 2020

What type of PR is this?
/kind feature

What this PR does / why we need it:
Currently, all of the reconcilers call Azure and wait for the operation to complete. At times that is the right thing to do, but most of the time it's the equivalent of an app's UI freezing because you have used the UI thread to fetch data, leaving your user to wonder why and when the software will react. This PR is a WIP that will make the reaction time of MachinePools in CAPZ drastically faster and, possibly, more resilient.

[screenshot]

related to: #819

Special notes for your reviewer:
I've intentionally not changed the version upgrade / VMSS instance update behavior. I'll follow up with another PR which will add min-available replicas, max-surge, and drain behaviors. That should finish off #819.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Enable asynchronous reconciliation for AzureMachinePools

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 5, 2020
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Dec 5, 2020
@devigned
Contributor Author

devigned commented Dec 5, 2020

@alexeldeib notice that half of the time in reconcile is now spent in resourceSkus. We should discuss a good way to cache that data rather than fetching it on each reconcile.

Contributor

@CecileRobertMichon CecileRobertMichon left a comment


This looks great! I'll have to take an in-depth look on Monday, but I'm excited to see where it's going. One question that came to mind: how will this affect the number of API calls, especially for subscriptions running many clusters with many scale sets? Before, we'd make one call for create and wait for the operation to be done. Now we might check in a few times before the actual operation is done, so a single update might be 3-4x the calls, which could add up with many scale sets. I do see one improvement in not updating the scale set unless something actually changed, though, so that's good.

@CecileRobertMichon
Contributor

One more question: in the case of machinepools, there are only two resources involved: vmss and role assignments. If you were to create a machinepool with system assigned identity, does it mean the role assignment create would fail and requeue until the scale set has finished provisioning? In the case of an AzureMachine where there are a lot more services involved, would every resource that has a dependency on the previous one fail if the dependencies aren't ready? I don't see this necessarily being a problem but maybe we can be smart about it and only attempt to provision if we know that the dependent resources have finished provisioning by looking at their future/state? But then that means CAPZ needs to recreate the ARM dependency graph. Or maybe we just need to improve logging so the logs aren't full of noisy "vm not found" ARM errors.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 5, 2020
@devigned
Contributor Author

devigned commented Dec 5, 2020

I added an LRU cache for the resource SKUs. The resulting performance is greatly improved; see the Jaeger output below. Notice there is no call to refresh SKU data. A "quick" reconcile for an AzureMachinePool is now roughly 200 ms in the middle of a deploy.

[screenshot: Jaeger trace of an AzureMachinePool reconcile]
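The caching approach can be sketched as a small LRU keyed by SKU name. This is only an illustration of the idea under discussion, not the actual CAPZ implementation; all names below (`skuCache`, `newSKUCache`, `entry`) are hypothetical.

```go
package main

import (
	"container/list"
	"fmt"
)

// skuCache is a minimal LRU sketch: keep resourceSkus results in memory so
// each reconcile does not re-fetch them from the Azure API.
type skuCache struct {
	cap   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // SKU name -> element holding *entry
}

type entry struct {
	key string
	val string
}

func newSKUCache(capacity int) *skuCache {
	return &skuCache{cap: capacity, order: list.New(), items: map[string]*list.Element{}}
}

// Get returns a cached value and marks it most recently used.
func (c *skuCache) Get(key string) (string, bool) {
	el, ok := c.items[key]
	if !ok {
		return "", false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).val, true
}

// Put inserts or refreshes a value, evicting the least recently used
// entry when the cache is full.
func (c *skuCache) Put(key, val string) {
	if el, ok := c.items[key]; ok {
		el.Value.(*entry).val = val
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.cap {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
	c.items[key] = c.order.PushFront(&entry{key: key, val: val})
}

func main() {
	c := newSKUCache(2)
	c.Put("Standard_D2s_v3", "sku-data-1")
	c.Put("Standard_D4s_v3", "sku-data-2")
	c.Get("Standard_D2s_v3")               // touch, so D4 becomes least recently used
	c.Put("Standard_D8s_v3", "sku-data-3") // evicts D4
	_, ok := c.Get("Standard_D4s_v3")
	fmt.Println(ok) // false: evicted
}
```

The same structure also makes it easy to bolt on eviction policies later (the TTL question discussed further down this thread).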

/cc @alexeldeib

@devigned
Contributor Author

devigned commented Dec 5, 2020

> This looks great! I'll have to take an in-depth look on Monday, but I'm excited to see where it's going. One question that came to mind: how will this affect the number of API calls, especially for subscriptions running many clusters with many scale sets? Before, we'd make one call for create and wait for the operation to be done. Now we might check in a few times before the actual operation is done, so a single update might be 3-4x the calls, which could add up with many scale sets. I do see one improvement in not updating the scale set unless something actually changed, though, so that's good.

That is an excellent question! Though the basis for concern is a little misplaced. When we issue a long-running operation to Azure, as CAPZ does in master, we ask Azure to create a resource, which starts a long-running operation. We get back a 202 (sometimes a 201) saying the request was accepted, but not completed. We then rely on the SDK to poll the API for a result, often many times, e.g. every 10s, until a terminal provisioning status is reached. Unfortunately, Azure has no way to call us back when it's done building the resource. The Azure SDK is not very smart about how often it polls. It is configurable, but at this point we don't do anything with the polling configuration.

In this PR, I have it set to requeue after 15s, which is probably too quick for VMSS anyway. I think we may end up making fewer API calls over time than by using the SDK to poll the long-running operation.

In the future, I think we could get really clever with this functionality and provide some data-driven targets. For example, we could determine the statistical distribution of VM provision time, know that the controller is creating a VM, then set up a requeue at each 20th percentile (P20, P40, ...) with the idea that we'd likely have 2 or 3 requests per VM creation. We could do this per resource and update the data set as time moves forward. This could be a huge win for CAPZ users.

WDYT, @CecileRobertMichon?

@devigned
Contributor Author

devigned commented Dec 5, 2020

> One more question: in the case of machinepools, there are only two resources involved: vmss and role assignments. If you were to create a machinepool with system assigned identity, does it mean the role assignment create would fail and requeue until the scale set has finished provisioning? In the case of an AzureMachine where there are a lot more services involved, would every resource that has a dependency on the previous one fail if the dependencies aren't ready? I don't see this necessarily being a problem but maybe we can be smart about it and only attempt to provision if we know that the dependent resources have finished provisioning by looking at their future/state? But then that means CAPZ needs to recreate the ARM dependency graph. Or maybe we just need to improve logging so the logs aren't full of noisy "vm not found" ARM errors.

This is an interesting artifact of how I structured the async behavior. When we reach a point in reconciliation where we are going to wait on Azure, the service returns an OperationNotDoneError. This short-circuits the reconciliation: when we encounter an error, we wrap it and send it up the stack immediately. We do not progress; we halt and return. This behavior is desirable because it stops us from progressing to resources that we may not be able to complete yet. This PR would not be possible without the uniform handling of transient errors which was recently merged.

There is way too much noise in the logs from transient errors right now as we log them as actual errors. I don't think they should be errors, but rather a category of expected behavior where the operator should stay calm and reconcile on. Thus, we should remove them from the log.

@alexeldeib
Contributor

Looks like great stuff so far :)

> We then rely on the SDK to poll the API for a result, often many times, e.g. every 10s, until a terminal provisioning status is reached. Unfortunately, Azure has no way to call us back when it's done building the resource. The Azure SDK is not very smart about how often it polls. It is configurable, but at this point we don't do anything with the polling configuration.

I wonder if it's worth extracting that ourselves into one http client (using some of the autorest helpers?) and then using it as the base for all the service clients.

@devigned
Contributor Author

devigned commented Dec 7, 2020

> I wonder if it's worth extracting that ourselves into one http client (using some of the autorest helpers?) and then using it as the base for all the service clients.

We could / should. I'll send you a proposal I worked on, unrelated to this project. The thing is, the behaviors will be domain-specific. For example, a vanilla marketplace-image VM with a given SKU has different start/delete latency distributions than our specialized use case. The helpers would be useful if we can provide operation type, resource type, and latency distribution. Without that specific input, it would probably be too generic to be helpful.

> note that this introduces a subtle edge case, where the cached SKU data is not refreshed.

Great point! I'll work on evicting a SKU from the cache after a max TTL. Any thoughts on how long the TTL should be?

@alexeldeib
Contributor

> Any thoughts on how long the TTL should be?

The easiest way might not be a TTL, but e.g. refreshing the cache for a specific subscription when you hit an unexpected error? That might be too aggressive, though. Anyway, I doubt restrictions change per subscription that often; if it becomes an issue, maybe we can revisit.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 9, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 9, 2020
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 10, 2020
@devigned devigned force-pushed the async-mp branch 2 times, most recently from 6f938b5 to 58b5330 Compare December 11, 2020 01:13
@devigned
Contributor Author

@CecileRobertMichon I've rebased off of master with the VMSS test changes.

Contributor

@CecileRobertMichon CecileRobertMichon left a comment


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 16, 2021
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 16, 2021
@devigned
Contributor Author

/retest

@nader-ziada
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 20, 2021
@devigned
Contributor Author

/retest

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 21, 2021
@k8s-ci-robot
Contributor

k8s-ci-robot commented Jan 22, 2021

@devigned: The following test failed, say /retest to rerun all failed tests:

Test name: pull-cluster-api-provider-azure-apidiff
Commit: 87561e0
Rerun command: /test pull-cluster-api-provider-azure-apidiff

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@CecileRobertMichon
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 23, 2021
@k8s-ci-robot k8s-ci-robot merged commit e7ffa8e into kubernetes-sigs:master Jan 23, 2021
@k8s-ci-robot k8s-ci-robot added this to the v0.4.11 milestone Jan 23, 2021
@CecileRobertMichon
Contributor

/pony

@k8s-ci-robot
Contributor

@CecileRobertMichon: [pony image]

In response to this:

/pony

