This repository has been archived by the owner on May 6, 2022. It is now read-only.

Only do work for instances from a single queue #1074

Merged
merged 1 commit into from
Aug 1, 2017

Conversation

pmorie
Contributor

@pmorie pmorie commented Jul 28, 2017

Fixes the race condition uncovered during #1017 by only doing work for instances from a single work queue. Instances are now added to the polling queue in a rate-limited manner, and instead of triggering a re-reconcile of the instance directly, the polling queue adds the instance's key to the main instance work queue. This means that only a single goroutine will do work on an instance at a time.

Fixes #780

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 28, 2017
@pmorie
Contributor Author

pmorie commented Jul 28, 2017

@vaikas-google your review appreciated, since we've talked about this a bunch before.

@pmorie
Contributor Author

pmorie commented Jul 28, 2017

In a follow-up, I will add integration tests for more permutations.

@pmorie pmorie requested review from MHBauer and vaikas July 28, 2017 06:01
// Since polling is rate-limited, it is not possible to check whether the
// instance is in the polling queue.
//
// TODO: add a way to peak into rate-limited adds that are still pending,
Contributor

Are you going to address this before merging a PR?

Contributor Author

Nah, that is a change to client-go

Contributor Author

I am going to add a couple additional integration tests in this PR, though.

Contributor

super nit, s/peak/peek/

Contributor Author

Added

@pmorie
Contributor Author

pmorie commented Jul 28, 2017

We should think about how we want to rate-limit the polling queue. There's a fast-slow rate limiter that makes n fast attempts before switching to slow attempts that seems suitable.

@pmorie pmorie force-pushed the instance-polling branch 2 times, most recently from 861fb7f to d56da9b Compare August 1, 2017 04:47
@pmorie
Contributor Author

pmorie commented Aug 1, 2017

@nilebox test added to PR; we now have integration test coverage for:

  • async provision/deprovision
  • sync provision/deprovision
  • failed provision

@pmorie pmorie requested review from arschles and nilebox August 1, 2017 12:53
// queue for instances. It is used to trigger polling for the status of an
// async operation on an instance and is called by the worker servicing the
// instance polling queue. After requeueInstanceForPoll exits, the worker
// forgets the key from the polling queue, so the controller must call
Contributor

This comment makes me a little worried, since the division of labor here is split between a couple of components.

Contributor

It does seem hard to trace data through the queue, but I'm not sure if we can do anything about it at this point without a large refactor. @pmorie @vaikas-google what are your thoughts RE cleaning up the flow? Either way, I think that work would be outside the scope.

Contributor Author

what kind of refactor do you have in mind? we could probably do some method moves that make it clearer, but I think this is the best mechanism we currently have to solve this problem.

@vaikas vaikas added the LGTM1 label Aug 1, 2017
@pmorie pmorie force-pushed the instance-polling branch from d56da9b to 119f2ab Compare August 1, 2017 18:56
Contributor

@arschles arschles left a comment

@pmorie looks good overall. I made a few comments requesting issues to track future work to improve tests.

// Since polling is rate-limited, it is not possible to check whether the
// instance is in the polling queue.
//
// TODO: add a way to peek into rate-limited adds that are still pending,
Contributor

is there an issue for this? if not, can you create one? same with below

@arschles arschles added the LGTM2 label Aug 1, 2017
err = c.continuePollingInstance(instance)
if err != nil {
	return err
}
return fmt.Errorf("last operation not completed (still in progress) for %v/%v", instance.Namespace, instance.Name)
Contributor

We don't need this error anymore, do we? Since we're calling continuePollingInstance to re-add the key to the polling queue, we could return nil here and the instance will still be reprocessed.

@kibbles-n-bytes
Contributor

kibbles-n-bytes commented Aug 1, 2017

Overall the architecture looks fine to me for now to fix the race condition. I have some further questions about the rate limiting, but nothing that should block this going in. I'll merge this and rebase #1067.

@kibbles-n-bytes kibbles-n-bytes merged commit a6e80ea into kubernetes-retired:master Aug 1, 2017
kibbles-n-bytes pushed a commit to kibbles-n-bytes/service-catalog that referenced this pull request Aug 7, 2017
kibbles-n-bytes pushed a commit to kibbles-n-bytes/service-catalog that referenced this pull request Aug 11, 2017
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. LGTM1 LGTM2
6 participants