POC: Parallel reconciliation of azure machine services #1369
Conversation
@shysank: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
(Force-pushed from 63df722 to 878b0c7.)
ar.submit(ctx, s.networkInterfacesSvc)
ar.submit(ctx, s.availabilitySetsSvc)
if err := ar.wait(); err != nil {
	return err
}
What about dependencies between these resources?
Independent resources are submitted to the asyncReconciler, which processes them in parallel; whenever there's a dependency, we wait for the in-flight tasks to finish before proceeding.
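For readers following along, here is a minimal sketch of what an asyncReconciler like the one in the diff above could look like. The type and method names are assumptions inferred from the snippet, not the actual POC implementation:

```go
// Hypothetical sketch of an asyncReconciler with submit/wait semantics,
// inferred from the diff above; not the actual POC code.
package machine

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// serviceReconciler is an assumed shape for the per-resource services
// (networkInterfacesSvc, availabilitySetsSvc, ...) the controller holds.
type serviceReconciler interface {
	Reconcile(ctx context.Context) error
}

// asyncReconciler fans out independent service reconciles and collects errors.
type asyncReconciler struct {
	group *errgroup.Group
	ctx   context.Context
}

func newAsyncReconciler(ctx context.Context) *asyncReconciler {
	g, gctx := errgroup.WithContext(ctx)
	return &asyncReconciler{group: g, ctx: gctx}
}

// submit schedules one service reconcile on its own goroutine.
func (ar *asyncReconciler) submit(_ context.Context, svc serviceReconciler) {
	ar.group.Go(func() error {
		return svc.Reconcile(ar.ctx)
	})
}

// wait blocks until every submitted reconcile finishes and returns the first
// error, acting as the barrier before dependent resources are reconciled.
func (ar *asyncReconciler) wait() error {
	return ar.group.Wait()
}
```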
I see where this is going and it's not exactly what was meant by async reconciliation. Let me try to explain.
When the Azure SDK for Go does a PUT, PATCH, and sometimes DELETE to Azure Resource Manager, a long-running operation is started. The Azure SDK makes the PUT request, then polls the resource until it determines the resource is in a "terminal" state (succeeded or failed); only then does the SDK return to the client. This is what we refer to when we say a synchronous request.
What you are doing in this PR is a parallel set of synchronous requests.
I think what was intended by the async reconciliation of machine resources is more similar to what happens in the AMPM pull request in ScaleSets, where the Azure SDK is used to make the PUT, a future for the long-running operation is captured and stored on the status of the reconciling resource, and then used in subsequent reconciles to asynchronously reconcile the resource.
wdyt?
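A simplified model of the future-based pattern described above, under stated assumptions: azureFuture, machineStatus, and startPut are stand-ins, and the real implementation stores a serialized future on the resource's status and uses the Azure SDK's polling machinery, which is not shown here:

```go
// Simplified, hypothetical model of "capture the future, requeue, check later".
package machine

import (
	"context"
	"errors"
)

// azureFuture is a stand-in handle for a long-running Azure operation.
type azureFuture interface {
	// Done reports whether the operation has reached a terminal state.
	Done(ctx context.Context) (bool, error)
}

// machineStatus stands in for the reconciled resource's status.
type machineStatus struct {
	LongRunningOperation azureFuture
}

// errNotDone signals the controller to requeue instead of blocking.
var errNotDone = errors.New("operation not done, requeue")

// reconcileVM starts the PUT once, captures the future on status, and on
// later reconciles only checks whether the operation has finished.
func reconcileVM(ctx context.Context, status *machineStatus, startPut func(context.Context) (azureFuture, error)) error {
	if status.LongRunningOperation == nil {
		f, err := startPut(ctx) // returns quickly; Azure continues the work
		if err != nil {
			return err
		}
		status.LongRunningOperation = f
		return errNotDone
	}

	done, err := status.LongRunningOperation.Done(ctx)
	if err != nil {
		return err
	}
	if !done {
		return errNotDone
	}
	status.LongRunningOperation = nil // terminal state reached; resource is reconciled
	return nil
}
```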
I think we are trying to optimize for two different reasons here: 1) time to do a full reconciliation, 2) not blocking resources when reconciliation takes long. #1181 was more aimed at solving (1), which might improve the overall machine creation time and is more useful where resource creation does not take much time. OTOH, #1067 solves (2) and is more suitable for resources that take a long time to create (vm, vmss, etc.). Maybe we can use a combination of both? wdyt?
Definitely could do both. The trick is tracking the state of multiple in-flight operations. Furthermore, with (2), the start of the PUT or PATCH returns in roughly 100ms, so parallelizing these activities isn't a huge gain. Also, some reconciles actually gate on a previous reconcile. Sending bad requests to Azure is a good way of inflating the number of API calls and possibly causing a busy signal from the Azure APIs.
@devigned By combination I meant having (1) for a group of resources and (2) for other resources (e.g. VM reconciliation in this PR could use (2)) in the same reconciliation loop. Having said that, I agree that it adds complexity to the code which may not be proportional to the gains. I'll leave this PR open for a while in case anyone else has feedback, and then close it. wdyt?
/hold
This might be a useful library to define the items that are serial and those that are parallel: https://github.com/johnsiilver/orchestration/tree/main/linear
Thanks David! It looks very similar to what we are trying to do here. I'll hack around and see how it goes.
@nader-ziada Agree with @devigned, the PR is still very much incomplete. I was going to try out the suggested library but didn't have time this week. I'll try to do it sometime next week (it still won't be mergeable, just to get feedback) and then we can close it.
@devigned fwiw I'm not totally sure about this: "parallelizing these activities isn't a huge gain". If you fire off one async request and have to requeue multiple times for the future to complete, it feels like a waste not to start multiple async requests at once. I just skimmed the AMP PR and it looks like you'll end up looping multiple times before the future is actually done (see cluster-api-provider-azure/azure/services/scalesets/scalesets.go, lines 141 to 143 in b30b86f).
Looking at the code, roleAssignments + vmExtensions seems like a natural pair where parallel initialization of the requests would help, followed by checking the futures in e.g. a waitgroup or DAG or something; see the sketch below.
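A rough sketch of that combination: initiate the role assignment and VM extension PUTs concurrently (each call only starts the long-running operation and returns a handle), then collect the handles so later reconciles can check them. lroFuture, startRoleAssignments, and startVMExtensions are hypothetical stand-ins, not existing CAPZ or Azure SDK types:

```go
// Hypothetical: start two independent PUTs concurrently, collect their futures.
package machine

import (
	"context"
	"sync"
)

// lroFuture is a stand-in handle for a long-running Azure operation.
type lroFuture interface {
	Done(ctx context.Context) (bool, error)
}

type startFunc func(context.Context) (lroFuture, error)

// startDependentResources kicks off each PUT on its own goroutine and
// returns the captured futures once all PUTs have been accepted.
func startDependentResources(ctx context.Context, startRoleAssignments, startVMExtensions startFunc) ([]lroFuture, error) {
	starts := []startFunc{startRoleAssignments, startVMExtensions}

	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		futures  []lroFuture
		firstErr error
	)

	for _, start := range starts {
		wg.Add(1)
		go func(start startFunc) {
			defer wg.Done()
			f, err := start(ctx) // returns once the PUT is accepted, not when it completes
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = err
				}
				return
			}
			futures = append(futures, f)
		}(start)
	}

	wg.Wait()
	return futures, firstErr
}
```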
@alexeldeib I agree with your assessment. In the case where resources can be created concurrently, it would be an improvement. Perhaps I was not as nuanced as I should have been. If you check out the office hours from yesterday, https://youtu.be/IDCDPuEMkR8?t=497, you'll hear a more nuanced discussion. The tl;dr: in machine provisioning, there are really 2, maybe 3, groups of resources for worker nodes: the nic, the VM, then the ext and role assignment. Each group needs to be provisioned serially, but resources can be created concurrently inside a group. The yield for this set of resources is minimal since, as you pointed out, only the ext and the role assignment can be done concurrently. The biggest win for user experience would be async updates as the VM provisions. For the sake of time and complexity, we should proceed by first making the VM create async, then evaluate to see if we need further performance improvements.
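To make the grouping concrete, a rough sketch of the ordering described above, with each group provisioned serially and concurrency allowed inside a group. The reconcile functions here are placeholders, not actual CAPZ code:

```go
// Hypothetical ordering: nic, then VM, then ext + role assignment in parallel.
package machine

import (
	"context"

	"golang.org/x/sync/errgroup"
)

type step func(context.Context) error

func reconcileWorkerNode(ctx context.Context, reconcileNIC, reconcileVM, reconcileExtensions, reconcileRoleAssignment step) error {
	// Groups run one after another; steps inside a group run concurrently.
	groups := [][]step{
		{reconcileNIC},                                 // group 1
		{reconcileVM},                                  // group 2 depends on the NIC
		{reconcileExtensions, reconcileRoleAssignment}, // group 3 depends on the VM
	}

	for _, group := range groups {
		g, gctx := errgroup.WithContext(ctx)
		for _, s := range group {
			s := s // capture loop variable (needed before Go 1.22)
			g.Go(func() error { return s(gctx) })
		}
		if err := g.Wait(); err != nil {
			return err
		}
	}
	return nil
}
```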
@shysank: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@shysank does this need to be closed? |
yes, we can close this in favor of #1610. |
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
This is a POC to reconcile, in parallel, the Azure machine services that are independent of each other. If the concept is approved, we could extend this to other reconcilers.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1181
Special notes for your reviewer:
This PR is not meant to be merged. If the concept is approved, there should be a follow-up PR with a proper implementation along with tests.
TODOs:
Release note: