Parallel reconciliation of azure resources #1181

Closed

shysank opened this issue Feb 17, 2021 · 16 comments
Labels

  • kind/feature Categorizes issue or PR as related to a new feature.
  • lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@shysank (Contributor) commented Feb 17, 2021:

/kind feature
/kind explore

What steps did you take and what happened:

We reconcile a number of resources serially in the azurecluster and azuremachine reconcilers. Some of these resources have no dependency on each other and could be reconciled simultaneously. For example, after a VM is reconciled, vmextensions, tags, and roleassignments could all be reconciled at the same time, potentially saving some time.

Describe the solution you'd like

Reconcile mutually independent services in parallel using goroutines and channels.
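
For illustration only, a minimal sketch of what this could look like; the service interface and reconcileInParallel helper below are hypothetical, not actual CAPZ code:

package example

import "context"

// service stands in for an azure service reconciler (vmextensions, tags,
// roleassignments, ...).
type service interface {
    Reconcile(ctx context.Context) error
}

// reconcileInParallel fans out one goroutine per independent service and
// collects the results over a channel, returning the first error seen.
func reconcileInParallel(ctx context.Context, services ...service) error {
    errs := make(chan error, len(services))
    for _, svc := range services {
        svc := svc // capture loop variable
        go func() {
            errs <- svc.Reconcile(ctx)
        }()
    }
    var firstErr error
    for range services {
        if err := <-errs; err != nil && firstErr == nil {
            firstErr = err
        }
    }
    return firstErr
}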

Anything else you would like to add:

This is more of an exploration than an actual requirement.

Environment:

  • cluster-api-provider-azure version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot (Contributor):

@shysank: The label(s) kind/explore cannot be applied, because the repository doesn't have them

In response to this:

/kind feature
/kind explore

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the kind/feature label on Feb 17, 2021
shysank changed the title from "Parallel reconciliation of individual azure resources" to "Parallel reconciliation of azure resources" on Feb 17, 2021
@CecileRobertMichon (Contributor):

Another thing we should consider doing more of is #1067 for other resources. A lot of these create operations are long running, and we don't need to block everything else while the resource is being created. We can start the create, exit the controller loop right away, and come back later to check whether the operation is done. This also lets us deal with a lot more objects concurrently: imagine you're trying to create 500 nodes; having the controller kickstart each create and then circle back is a lot more efficient than having to wait for each one to finish before exiting the controller loop.
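
A rough sketch of that flow (the asyncClient and futureStore types are made up for illustration; the real approach in #1067 uses the Azure SDK's long-running-operation futures):

package example

import (
    "context"
    "time"

    ctrl "sigs.k8s.io/controller-runtime"
)

// futureHandle represents a saved reference to an in-flight Azure operation.
type futureHandle string

// asyncClient stands in for a client that can start a long-running create and
// later check whether it has completed.
type asyncClient interface {
    BeginCreate(ctx context.Context, spec string) (futureHandle, error)
    IsDone(ctx context.Context, h futureHandle) (bool, error)
}

// futureStore stands in for wherever in-flight operation state is persisted
// (e.g. on the object's status) between reconciles.
type futureStore interface {
    Get(name string) (futureHandle, bool)
    Set(name string, h futureHandle)
    Delete(name string)
}

func reconcileResource(ctx context.Context, c asyncClient, s futureStore, name, spec string) (ctrl.Result, error) {
    h, inFlight := s.Get(name)
    if !inFlight {
        // Kick off the long-running create and return right away instead of
        // blocking the controller loop until Azure finishes.
        newHandle, err := c.BeginCreate(ctx, spec)
        if err != nil {
            return ctrl.Result{}, err
        }
        s.Set(name, newHandle)
        return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    }
    done, err := c.IsDone(ctx, h)
    if err != nil {
        return ctrl.Result{}, err
    }
    if !done {
        // Still creating; circle back on a later reconcile.
        return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
    }
    // The operation finished; clear the saved state and continue as normal.
    s.Delete(name)
    return ctrl.Result{}, nil
}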

@shysank (Contributor, Author) commented Feb 17, 2021:

Yeah, the ideal solution is probably a combination of both. We could use #1067 for long-running operations (e.g. VM create) and goroutines for relatively shorter ones.

Another option is to time out any operation that runs longer than, say, 5 seconds, update the future data, and return control. That would mean only the reconciliation during create is async; subsequent reconciliations would mostly be synchronous.
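
A sketch of that cutoff idea, assuming a made-up runWithCutoff helper and leaving out how the future data is persisted:

package example

import (
    "context"
    "errors"
    "time"
)

// errStillRunning signals that the operation outlived the cutoff and should be
// checked again on a later reconcile.
var errStillRunning = errors.New("operation still in progress")

// runWithCutoff waits for op for at most timeout. If op finishes in time its
// result is returned; otherwise we stop waiting so the reconciler can save the
// operation's future data and return control.
func runWithCutoff(ctx context.Context, timeout time.Duration, op func(context.Context) error) error {
    done := make(chan error, 1)
    go func() { done <- op(ctx) }()

    select {
    case err := <-done:
        return err
    case <-time.After(timeout):
        // The Azure-side operation keeps going; we just stop blocking on it.
        return errStillRunning
    }
}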

@fejta-bot:

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jun 14, 2021
@fejta-bot:

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jul 14, 2021
@shysank (Contributor, Author) commented Jul 14, 2021:

/remove-lifecycle rotten

k8s-ci-robot removed the lifecycle/rotten label on Jul 14, 2021
nader-ziada modified the milestones: v0.5.x, v0.5 on Aug 26, 2021
@arschles (Contributor):

My $0.02 is that it would be helpful to use goroutines for all operations. If that were done:

  • all ongoing work could be tracked and reconciled uniformly using tools like errgroup (see the sketch below)
  • a generic workflow engine could be built that knows how to parallelize or serialize an arbitrary group of operations requested by the user

This kind of thing would build on #1067. Async operations can still happen, but the synchronous boundary would be the goroutine: if you run N goroutines, their termination point is when all operations are done, including synchronous ones.
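
As a concrete illustration of the errgroup point above (the reconcile function signature is hypothetical; this is not the workflow-engine API, which is sketched next):

package example

import (
    "context"

    "golang.org/x/sync/errgroup"
)

// reconcileIndependent runs a set of independent reconcile functions in
// parallel and waits for all of them, returning the first error encountered.
func reconcileIndependent(ctx context.Context, reconcilers ...func(context.Context) error) error {
    g, ctx := errgroup.WithContext(ctx)
    for _, reconcile := range reconcilers {
        reconcile := reconcile // capture loop variable
        g.Go(func() error {
            return reconcile(ctx)
        })
    }
    // Wait blocks until every goroutine has finished; it returns the first
    // non-nil error and cancels ctx for the remaining ones.
    return g.Wait()
}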

Here's what such an API could look like:

package workflow

import (
    "context"

    "github.com/google/uuid" // assumed UUID package
)

type operation struct {
    // the action to take
    action func() error
    // the unique ID of this operation
    id uuid.UUID
    // if set, run this operation after the one identified by this ID
    after uuid.UUID
}

func newAzureMachineOperation(after operation) operation {
    //...
}

// other constructors for operations go here...

type workflowEngine interface {
    // submitOperation queues an operation to run as part of the given workflow.
    submitOperation(ctx context.Context, workflowID uuid.UUID, op operation)
    // finalize waits for all submitted operations in the workflow to finish
    // and returns any errors they produced.
    finalize(ctx context.Context, workflowID uuid.UUID) []error
}

If a single instance of workflowEngine were passed to all reconcilers, each would use a single UUID (e.g. its correlation ID) to submit all of its operations and would call finalize when it's time to record the final result of the reconcile.
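
For example, a reconciler might use this roughly as follows (continuing the same hypothetical package; reconcileVM, reconcileTags, and reconcileExtensions are placeholder functions, not real CAPZ helpers):

// reconcileMachine submits its operations under one correlation ID and
// finalizes at the end to collect the results.
func reconcileMachine(ctx context.Context, engine workflowEngine, correlationID uuid.UUID,
    reconcileVM, reconcileTags, reconcileExtensions func() error) []error {

    vmOp := operation{id: uuid.New(), action: reconcileVM}
    engine.submitOperation(ctx, correlationID, vmOp)

    // Tags and extensions only depend on the VM, not on each other, so the
    // engine is free to run them in parallel once vmOp completes.
    engine.submitOperation(ctx, correlationID, operation{
        id: uuid.New(), after: vmOp.id, action: reconcileTags,
    })
    engine.submitOperation(ctx, correlationID, operation{
        id: uuid.New(), after: vmOp.id, action: reconcileExtensions,
    })

    // Record the final result of the reconcile once everything has finished.
    return engine.finalize(ctx, correlationID)
}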

@arschles (Contributor):

Thanks @CecileRobertMichon - that was very illuminating! Is it too late to comment on the proposal?

@CecileRobertMichon (Contributor):

@arschles not too late (PR #1610 has not merged yet), but the proposal has already merged, so any amendments to the design would have to go through a new proposal PR.

@arschles (Contributor) commented Sep 7, 2021:

@CecileRobertMichon great, thanks for letting me know! I'm gonna work to get #1575 done and then I'll drop in and see where this effort is, and how I can help 👍

CecileRobertMichon modified the milestones: v0.5, next on Oct 28, 2021
@k8s-triage-robot:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jan 26, 2022
@shysank (Contributor, Author) commented Jan 26, 2022:

@CecileRobertMichon now that we are following the async pattern for reconcile/delete of services, can we close this one?

@k8s-triage-robot:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Feb 25, 2022
@k8s-triage-robot:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot (Contributor):

@k8s-triage-robot: Closing this issue.

