consul: fix deadlock in check-based restarts #5975
Conversation
Fixes #5395
Alternative to #5957

Make task restarting asynchronous when handling check-based restarts. This matches the pre-0.9 behavior where TaskRunner.Restart was an asynchronous signal. The check-based restarting code was not designed to handle blocking in TaskRunner.Restart. 0.9 made it reentrant and could easily overwhelm the buffered update chan and deadlock.

Many thanks to @byronwolfman for his excellent debugging, PR, and reproducer!

I created this alternative as changing the functionality of TaskRunner.Restart has a much larger impact. This approach reverts to old known-good behavior and minimizes the number of places changes are made.
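For orientation, the shape of that change is roughly the following. This is a minimal sketch rather than Nomad's actual code: the TaskRestarter interface, the standard-library logger, and the 30-second timeout are stand-ins for Nomad's internal types and tuning.

```go
package checkwatcher

import (
	"context"
	"log"
	"time"
)

// TaskRestarter is a stand-in for the task runner; only the Restart
// call matters for this sketch.
type TaskRestarter interface {
	Restart(ctx context.Context, reason string, failure bool) error
}

// asyncRestart mirrors the fire-and-forget call shown in the diff below:
// the restart runs in its own goroutine, so the check watcher's Run loop
// never blocks waiting for it to finish.
func asyncRestart(ctx context.Context, logger *log.Logger, task TaskRestarter, reason string) {
	// Bound the restart so a wedged task runner cannot leak this goroutine forever.
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	if err := task.Restart(ctx, reason, true); err != nil {
		// With no caller left to return to, errors can only be logged.
		logger.Printf("failed to restart task: %v", err)
	}
}
```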
lgtm - i have a question about the test but merge away.
@@ -194,6 +195,28 @@ func TestCheckWatcher_Healthy(t *testing.T) {
}
}

// TestCheckWatcher_Unhealthy asserts unhealthy tasks are restarted exactly once.
func TestCheckWatcher_Unhealthy(t *testing.T) {
I'm missing some context for this test - does it trigger the deadlock issue here? Is it a relatively easy thing to test for?
No, this test just asserts checks are only restarted once.
I added a new test for the deadlock in 1763672 and confirmed it does cause the deadlock before my changes.
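To see why the synchronous 0.9 behavior can wedge, here is a self-contained toy model of the interaction. This is not the test added in 1763672; the channel size and messages are invented purely for illustration, and running it trips Go's runtime deadlock detector.

```go
package main

import "fmt"

// Toy model of the 0.9 deadlock: the loop that drains the buffered update
// channel also performs the (now synchronous) restart, and the restart
// itself produces more updates than the buffer can hold.
func main() {
	updateCh := make(chan string, 1) // small buffer, like the watcher's update chan

	// A check goes unhealthy.
	updateCh <- "check unhealthy"

	// Stand-in for a synchronous TaskRunner.Restart: restarting the task
	// re-registers its checks, which are reported back on updateCh.
	restart := func() {
		updateCh <- "unwatch old check" // fits in the buffer
		updateCh <- "watch new check"   // blocks: the only reader is the caller
	}

	// The watcher's Run loop: the only reader of updateCh.
	for update := range updateCh {
		fmt.Println("handling:", update)
		if update == "check unhealthy" {
			restart() // blocks inside the loop, so nothing drains updateCh => deadlock
		}
	}
}
```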
// Error restarting
return false
}
go asyncRestart(ctx, c.logger, c.task, event)
Question: does it make sense to use a semaphore or the channel-blocking technique used elsewhere, so that we don't call task.Restart concurrently, and so that if we get a spike of Restart calls we only restart once?
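For reference, the kind of guard being suggested could look roughly like the sketch below; restartGuard and its try-acquire are assumptions for illustration, not code from this PR.

```go
package checkwatcher

import "context"

// restartGuard is a one-slot semaphore: at most one restart of a task runs
// at a time, and extra restart requests during that window are dropped.
type restartGuard chan struct{}

func newRestartGuard() restartGuard { return make(restartGuard, 1) }

// tryRestart attempts a non-blocking acquire. If another restart is already
// in flight, a spike of additional restart requests collapses into that one.
func (g restartGuard) tryRestart(ctx context.Context, restart func(context.Context) error) bool {
	select {
	case g <- struct{}{}:
		go func() {
			defer func() { <-g }() // release when the restart finishes
			_ = restart(ctx)
		}()
		return true
	default:
		return false // a restart is already running; skip this one
	}
}
```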
Good question!
The checkWatcher.Run loop removes a check after Restart is called, so the same task won't be restarted more than once (until it completes the restart and re-registers the check).
In 0.8 TR.Restart just ticked a chan and so was async without having to create a new goroutine. This seemed like the least risky way of replicating that behavior.
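That "ticked a chan" behavior is essentially a one-slot signal channel. A rough sketch of the shape, with illustrative names rather than the actual 0.8 code:

```go
package taskrunner

// restarter holds a one-slot signal channel, roughly how a pre-0.9
// Restart could signal asynchronously without spawning a goroutine.
type restarter struct {
	restartCh chan struct{}
}

func newRestarter() *restarter {
	return &restarter{restartCh: make(chan struct{}, 1)}
}

// Restart just "ticks" the channel and returns immediately; duplicate
// ticks while a restart is already pending coalesce into one.
func (r *restarter) Restart() {
	select {
	case r.restartCh <- struct{}{}:
	default: // a restart is already queued
	}
}

// run is the task runner's main loop: it consumes the tick and performs
// the actual (possibly slow) restart without the caller ever blocking.
func (r *restarter) run(doRestart func()) {
	for range r.restartCh {
		doRestart()
	}
}
```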
Tasks that fail in a tight loop (check_restart.grace=0 and restart.delay=0) could in theory spin up lots of goroutines, but the goroutines for a single task should rarely if ever overlap, and restarting a task already involves creating a lot of resources more expensive than a goroutine.

That being said, I hate this "fire and forget" pattern, so I'm open to ideas as long as they can't block checkWatcher.Run/checkRestart.apply. (checkWatcher should probably be refactored to separate Watch/Unwatch mutations from check watching, but that seemed way too big a risk for a point release.)
Co-Authored-By: Mahmood Ali <[email protected]>
Something that occurs to me about this approach is that because we're still going through the new hooks (even if async), we get held up by the shutdown delay. I'm not sure if there's an easy way around this, but speaking selfishly for our own cluster, we'd really like it if restarts triggered by failed checks did not have to wait through their shutdown delay before restarting.
Good catch! It appears that in 0.8 the shutdown delay only applied to shutdowns, not restarts, so this is a behavior change in 0.9. I think your logic is actually better than either 0.8 or 0.9! It seems like ShutdownDelay should be honored on "healthy" restarts like Consul template changes, but not on failure-induced restarts, as your PR implemented. It would be easy to make 0.9 match the 0.8 behavior, but I'm leaning toward reopening and accepting your PR - #5957 - over this one. Will run it by the team.
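To make that distinction concrete, here is a hypothetical sketch of honoring the delay only on healthy restarts; none of these names come from Nomad's code, and #5957 may implement this differently.

```go
package taskrunner

import (
	"context"
	"time"
)

// waitShutdownDelay is a hypothetical helper: healthy restarts (e.g. a
// consul-template change) wait out the configured shutdown delay so load
// balancers can drain, while failure-induced check restarts skip it.
func waitShutdownDelay(ctx context.Context, delay time.Duration, failure bool) {
	if failure || delay <= 0 {
		return // unhealthy task: restart immediately
	}
	select {
	case <-time.After(delay):
	case <-ctx.Done():
	}
}
```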
@byronwolfman As a team we decided to push forward with this conservative PR that's focused on fixing the critical deadlock issue. I opened #5980 to track the shutdown delay enhancement, and we would merge PRs for that after 0.9.4 is released! Your existing PR (#5957) would still work as a fix to the new shutdown delay issue, but I think we would want to drop passing the bool through all of those functions. Thanks again for your hard work on this issue! Sorry we're being so conservative with the 0.9.4 release, but we really want 0.9.x to stabilize and put all new features into 0.10.0.
@schmichael Thanks for following up! That seems like a sensible decision to me. Our shop is likely going to run with our custom patch for now just so that we can roll out 0.9.3, since we're really keen on the 0.9.x headline features. Passing a bool through a million functions works to skip the shutdown delay, but seems fragile in the face of future dev work. The issue has been captured now, so I think we can look forward to being back on an official binary the next time we upgrade our cluster. I'll see if I can't put up a PR or two to address the QoL issues since this PR solves the deadlock, but golang is still a real trip for me so it might be best left to the experts. :)
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.