Only do work for instances from a single queue #1074
Conversation
@vaikas-google your review appreciated, since we've talked about this a bunch before.
In a follow-up, I will add integration tests for more permutations.
// Since polling is rate-limited, it is not possible to check whether the
// instance is in the polling queue.
//
// TODO: add a way to peak into rate-limited adds that are still pending,
Are you going to address this before merging a PR?
Nah, that is a change to client-go
I am going to add a couple additional integration tests in this PR, though.
super nit, s/peak/peek/
Added
We should think about how we want to rate-limit the polling queue. There's a fast-slow rate limiter that makes n fast attempts before switching to slow attempts that seems suitable.
Force-pushed from 861fb7f to d56da9b
@nilebox test added to PR; we now have integration test coverage for:
// queue for instances. It is used to trigger polling for the status of an
// async operation on an instance and is called by the worker servicing the
// instance polling queue. After requeueInstanceForPoll exits, the worker
// forgets the key from the polling queue, so the controller must call
This comment makes me a little worried, since the division of labor here is split between a couple of components.
It does seem hard to trace data through the queue, but I'm not sure if we can do anything about it at this point without a large refactor. @pmorie @vaikas-google what are your thoughts RE cleaning up the flow? Either way, I think that work would be outside the scope.
what kind of refactor do you have in mind? we could probably do some method moves that make it clearer, but I think this is the best mechanism we currently have to solve this problem.
Force-pushed from d56da9b to 119f2ab
@pmorie looks good overall. I made a few comments requesting issues to track future work to improve tests.
// Since polling is rate-limited, it is not possible to check whether the
// instance is in the polling queue.
//
// TODO: add a way to peek into rate-limited adds that are still pending,
is there an issue for this? if not, can you create one? same with below
There is, kubernetes/kubernetes#49783
err = c.continuePollingInstance(instance)
if err != nil {
	return err
}
return fmt.Errorf("last operation not completed (still in progress) for %v/%v", instance.Namespace, instance.Name)
We don't need this error anymore, do we? Since we're calling continuePollingInstance to re-add the key to the polling queue, we could return nil here and the instance will still be reprocessed.
Overall the architecture looks fine to me for now to fix the race condition. I have some further questions about the rate limiting, but nothing that should block this going in. I'll merge this and rebase #1067.
…tired#1074)" This reverts commit a6e80ea.
Fixes the race condition uncovered during #1017 by only doing work for instances from a single work queue. Instances are now added to the polling queue in a rate-limited manner, and instead of the polling queue triggering a re-reconcile of the instance, it adds the instance's key to the main instance work queue. This means that only a single goroutine will do work on an instance at a time.
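The flow described above can be sketched with plain channels. This is a hypothetical illustration, not the actual client-go workqueue code: the polling worker (`startPollingWorker` is an invented name) never reconciles an instance itself, it only forwards keys back into the main queue, so only the main queue's worker ever touches an instance.

```go
package main

import (
	"fmt"
	"time"
)

// startPollingWorker forwards every key from the polling queue into the main
// queue after a small delay (a stand-in for the rate-limited re-add). The
// polling worker does no reconciliation of its own.
func startPollingWorker(pollQueue, mainQueue chan string) {
	go func() {
		for key := range pollQueue {
			time.Sleep(10 * time.Millisecond) // placeholder for rate limiting
			mainQueue <- key
		}
	}()
}

func main() {
	pollQueue := make(chan string, 10)
	mainQueue := make(chan string, 10)
	startPollingWorker(pollQueue, mainQueue)

	// An async operation is in progress, so the instance key goes onto the
	// polling queue instead of being reconciled by a second goroutine.
	pollQueue <- "default/my-instance"

	// The single main worker picks it up and does the actual reconcile,
	// so no two goroutines ever work on the same instance concurrently.
	key := <-mainQueue
	fmt.Println("reconciling", key)
}
```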
Fixes #780