async create and update for machine pools #1067
Conversation
@alexeldeib I notice half of the time in reconcile is now spent fetching resource SKUs. We should discuss a good way to cache that data rather than fetching it on each reconcile.
This looks great! I'll have to take an in-depth look on Monday, but I'm excited to see where it's going. One question that came to mind: how will this affect the number of API calls, especially for subscriptions running many clusters with many scale sets? Before, we'd make one call for create and wait for the operation to be done. Now, we might check in a few times before the actual operation is done, so a single update might be 3-4x the calls, which could add up with many scale sets. I do see one improvement in not updating the scale set unless something actually changed, though, so that's good.
One more question: in the case of machine pools, there are only two resources involved: VMSS and role assignments. If you were to create a machine pool with system-assigned identity, does it mean the role assignment create would fail and requeue until the scale set has finished provisioning? In the case of an AzureMachine, where there are a lot more services involved, would every resource that has a dependency on the previous one fail if the dependencies aren't ready? I don't see this necessarily being a problem, but maybe we can be smart about it and only attempt to provision if we know that the dependent resources have finished provisioning by looking at their future/state? But then that means CAPZ needs to recreate the ARM dependency graph. Or maybe we just need to improve logging so the logs aren't full of noisy "vm not found" ARM errors.
I added an LRU cache for the resource SKUs. The resulting performance is greatly improved. See the Jaeger output below. Notice there is no call to refresh SKU data. A "quick" reconcile for an AzureMachinePool is now roughly 200 ms in the middle of a deploy. /cc @alexeldeib
That is an excellent question! Though the basis for concern is a little misplaced. When we issue a long-running operation to Azure, as CAPZ does in master, we ask Azure to create a resource, which starts a long-running operation. We get back a 202 (sometimes 201) saying the request was accepted, but not completed. We then rely on the SDK to poll the API for a result, often many times, like every 10s, until a terminal provisioning status is reached. Unfortunately, Azure has no way to call us back when it's done building the resource. The Azure SDK is not very smart about how often it polls. It is configurable, but at this point we don't do anything with the polling configuration. In this PR, I have it set to requeue after 15s, which is probably too quick for VMSS anyway. I think we might be making fewer API calls over time than using the SDK to poll the long-running operation. In the future, I think we could get really clever with this functionality and provide some data-driven targets. For example, we could determine the statistical distribution of VM provision times, know that the controller is creating a VM, then set up a requeue at each 20th percentile, with the idea that we'd likely have 2 or 3 requests per VM creation. We could do this per resource and update the data set as time moves forward. This could be a huge win for CAPZ users. WDYT, @CecileRobertMichon?
This is an interesting artifact of how I structured the async behavior. When we reach a point in reconciliation where we are going to wait on Azure, the service returns a transient error rather than blocking. There is way too much noise in the logs from transient errors right now, as we log them as actual errors. I don't think they should be errors, but rather a category of expected behavior where the operator should stay calm and reconcile on. Thus, we should remove them from the log.
Looks like great stuff so far :)
Wonder if it's worth extracting that ourselves into a single HTTP client (using some of the autorest helpers?) and then using that as the base for all service clients.
We could / should. I'll send you a proposal I had worked on unrelated to this project. The thing is, the behaviors will be domain-specific. For example, starts and deletes of a vanilla marketplace-image VM with a given SKU have different latency distributions than our specialized use case. The helpers would be useful if we can provide the operation type, resource type, and latency distribution. Without that specific input, it would probably be too generic to be helpful.
Great point! I'll work on evicting a SKU from the cache after a max TTL. Any thoughts on how long the TTL should be?
The easiest way might not be a TTL, but e.g. refreshing the cache for a specific subscription if you hit an unexpected error? But that might be too aggressive. Anyway, I doubt restrictions change per subscription that often; if it becomes an issue, maybe we can revisit.
@CecileRobertMichon I've rebased off of master with the VMSS test changes.
/lgtm
/retest
/lgtm
/retest
@devigned: The following test failed.
/lgtm
/pony
What type of PR is this?
/kind feature
What this PR does / why we need it:
Currently, all of the reconcilers call Azure and wait for the operation to complete. At times, this is the right thing to do, but most of the time it's the equivalent of an app's UI freezing because you've used the UI thread to fetch some data, leaving the user wondering why and when the software will react. This PR is a WIP that will make the reaction time of MachinePools in CAPZ drastically faster and possibly more resilient.
related to: #819
Special notes for your reviewer:
I've intentionally not changed the version upgrade / VMSS instance update behavior. I'll follow up with another PR which will add min-available replicas, max-surge, and drain behaviors. That should finish off #819.
TODOs:
Release note: