
consul: fix deadlock in check-based restarts #5975

Merged
schmichael merged 3 commits into master on Jul 18, 2019

Conversation

schmichael
Member

Fixes #5395
Alternative to #5957

Make task restarting asynchronous when handling check-based restarts.
This matches the pre-0.9 behavior where TaskRunner.Restart was an
asynchronous signal. The check-based restarting code was not designed
to handle blocking in TaskRunner.Restart. 0.9 made it reentrant and
could easily overwhelm the buffered update chan and deadlock.

Many thanks to @byronwolfman for his excellent debugging, PR, and
reproducer!

I created this alternative as changing the functionality of
TaskRunner.Restart has a much larger impact. This approach reverts to
old known-good behavior and minimizes the number of places changes are
made.
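
To make the shape of the fix concrete, below is a minimal, self-contained Go sketch of the fire-and-forget approach (the real call site is the go asyncRestart(ctx, c.logger, c.task, event) line shown in the diff further down). The TaskRestarter interface, the Restart signature, the logger type, the timeout, and the slowTask fake are all assumptions made for illustration, not Nomad's actual definitions.

package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"os"
	"time"
)

// TaskRestarter is an assumed stand-in for the task handle the check watcher
// holds; Nomad's real interface lives in command/agent/consul and may differ.
type TaskRestarter interface {
	Restart(ctx context.Context, event string, failure bool) error
}

// asyncRestart wraps the potentially blocking Restart call so it can run in
// its own goroutine: the check watcher's event loop keeps draining check
// updates instead of deadlocking when the task runner's update chan fills up.
func asyncRestart(ctx context.Context, logger *log.Logger, task TaskRestarter, event string) {
	// Bound the attempt so a stuck restart cannot leak this goroutine forever.
	// The timeout value is illustrative, not Nomad's actual behavior.
	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()

	// failure=true because check-based restarts are failure-induced.
	if err := task.Restart(ctx, event, true); err != nil {
		logger.Printf("failed to restart task: %v", err)
	}
}

// slowTask simulates a task whose Restart blocks for a while, which is the
// situation that used to stall the check watcher's loop.
type slowTask struct{}

func (slowTask) Restart(ctx context.Context, event string, failure bool) error {
	select {
	case <-time.After(200 * time.Millisecond):
		fmt.Println("restarted:", event)
		return nil
	case <-ctx.Done():
		return errors.New("restart canceled")
	}
}

func main() {
	logger := log.New(os.Stderr, "", log.LstdFlags)

	// The restart happens in a separate goroutine, mirroring the
	// `go asyncRestart(...)` call in the diff, so the caller stays responsive.
	go asyncRestart(context.Background(), logger, slowTask{}, "check became unhealthy")

	fmt.Println("check watcher loop keeps running")
	time.Sleep(time.Second)
}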

@notnoop
Contributor

LGTM - I have a question about the test, but merge away.

command/agent/consul/check_watcher.go (outdated review thread)
@@ -194,6 +195,28 @@ func TestCheckWatcher_Healthy(t *testing.T) {
}
}

// TestCheckWatcher_Unhealthy asserts unhealthy tasks are restarted exactly once.
func TestCheckWatcher_Unhealthy(t *testing.T) {

Contributor

I'm missing some context for this test - does it trigger the deadlock issue here? Is it a relatively easy thing to test for?

@schmichael
Member Author · Jul 18, 2019

No, this test just asserts checks are only restarted once.

I added a new test for the deadlock in 1763672 and confirmed that it hits the deadlock when run against the code before my changes.
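
As a rough illustration of the "restarted exactly once" property being asserted: the sketch below uses a restartRecorder fake and an inline loop invented for this example; it is not the real TestCheckWatcher_Unhealthy or its helpers.

package sketch

import "testing"

// restartRecorder stands in for a fake task: it only records how many times
// Restart was invoked.
type restartRecorder struct {
	restarts int
}

func (r *restartRecorder) Restart() { r.restarts++ }

// TestUnhealthyRestartsOnce sketches the property described above: several
// consecutive unhealthy check results must trigger exactly one restart,
// because the watcher stops watching the check once a restart is issued.
func TestUnhealthyRestartsOnce(t *testing.T) {
	task := &restartRecorder{}

	watching := true
	for i := 0; i < 5; i++ { // five unhealthy poll results in a row
		if watching {
			task.Restart()
			watching = false // check removed until the restarted task re-registers it
		}
	}

	if task.restarts != 1 {
		t.Fatalf("expected exactly 1 restart, got %d", task.restarts)
	}
}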

// Error restarting
return false
}
go asyncRestart(ctx, c.logger, c.task, event)
Contributor

Question: does it make sense to use a semaphore or the channel-blocking technique used elsewhere, so that we don't call task.Restart concurrently and, if we get a spike of restart applies, we only restart once?

@schmichael
Member Author

Good question!

The checkWatcher.Run loop removes a check after Restart is called, so the same task won't be restarted more than once (until it completes the restart and re-registers the check).

In 0.8 TR.Restart just ticked a chan and so was async without having to create a new goroutine. This seemed like the least risky way of replicating that behavior.

Tasks that fail in a tight loop (check_restart.grace=0 and restart.delay=0) could in theory spin up lots of goroutines, but the goroutines for a single task should rarely if ever overlap and restarting a task already involves creating a lot of resources more expensive than a goroutine.

That being said I hate this "fire and forget" pattern, so I'm open to ideas as long as they can't block checkWatcher.Run/checkRestart.apply. (checkWatcher should probably be refactored to separate Watch/Unwatch mutations from check watching, but that seemed way too big a risk for a point release)
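
For reference, here is a rough sketch of the pre-0.9 "tick a chan" pattern described above: signaling a restart is a non-blocking send on a 1-buffered channel, so callers never block and a burst of requests collapses into a single pending restart. The taskRunner type below is invented for illustration and is far simpler than Nomad's real TaskRunner.

package main

import (
	"fmt"
	"time"
)

// taskRunner sketches the pre-0.9 restart signal: a 1-buffered channel that
// the runner's own loop services.
type taskRunner struct {
	restartCh chan struct{}
}

func newTaskRunner() *taskRunner {
	return &taskRunner{restartCh: make(chan struct{}, 1)}
}

// Restart signals a restart without blocking the caller and without
// spawning a goroutine; duplicate requests are collapsed.
func (tr *taskRunner) Restart() {
	select {
	case tr.restartCh <- struct{}{}:
	default: // a restart is already pending; drop the duplicate
	}
}

// run is the task runner's own loop, which performs the actual restarts.
func (tr *taskRunner) run() {
	for range tr.restartCh {
		fmt.Println("restarting task")
		time.Sleep(100 * time.Millisecond) // stand-in for the real restart work
	}
}

func main() {
	tr := newTaskRunner()
	go tr.run()

	// A spike of restart requests (e.g. from a check watcher) never blocks
	// and results in at most one queued restart at a time.
	for i := 0; i < 5; i++ {
		tr.Restart()
	}
	time.Sleep(time.Second)
}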

@byronwolfman
Contributor

Something that occurs to me about this approach is that because we're still going through the new hooks (even if async), we get held up by the shutdown delay. I'm not sure if there's an easy way around this, but speaking selfishly for our own cluster, we'd really like it if restarts due to check restarts did not have to wait through their shutdown delay before restarting.

@schmichael
Member Author

Something that occurs to me about this approach is that because we're still going through the new hooks (even if async), we get held up by the shutdown delay.
-- @byronwolfman

Good catch! It appears in 0.8 the shutdown delay only applied to shutdowns, not restarts. So this is a behavior change in 0.9.

I think your logic is actually better than either 0.8 or 0.9! It seems like ShutdownDelay should be honored on "healthy" restarts like Consul template changes, but not on failure-induced restarts like your PR implemented.

It would be easy to make 0.9 match the 0.8 behavior, but I'm leaning toward reopening and accepting your PR - #5957 - over this one. Will run it by the team.
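
For clarity, a small sketch of the rule being floated here, honoring ShutdownDelay only on healthy restarts; the function and parameter names are hypothetical, and this is not Nomad's actual task runner hook code.

package sketch

import (
	"context"
	"time"
)

// waitShutdownDelay sketches the proposed behavior: a configured ShutdownDelay
// is honored for "healthy" restarts (e.g. Consul template re-renders), while
// restarts caused by a failed check proceed immediately.
func waitShutdownDelay(ctx context.Context, shutdownDelay time.Duration, causedByFailure bool) {
	if causedByFailure || shutdownDelay <= 0 {
		return // failure-induced restarts skip the delay
	}
	select {
	case <-time.After(shutdownDelay):
	case <-ctx.Done(): // stop waiting if the task is being killed anyway
	}
}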

@schmichael
Member Author

@byronwolfman As a team we decided to push forward with this conservative PR that's focused on fixing the critical deadlock issue. I opened #5980 to track the shutdown delay enhancement, and we can merge PRs for that after 0.9.4 is released!

Your existing PR (#5957) would still work as a fix to the new shutdown delay issue, but I think we would want to drop passing the failure bool all the way into ServiceClient.RemoveTask. We can always discuss things like that on your PR if you choose to resubmit it as a fix for #5798

Thanks again for your hard work on this issue! Sorry we're being so conservative with the 0.9.4 release, but we really want 0.9.x to stabilize and put all new features into 0.10.0.

@byronwolfman
Contributor

@schmichael Thanks for following up! That seems like a sensible decision to me. Our shop is likely going to run with our custom patch for now just so that we can roll out 0.9.3, since we're really keen on the 0.9.x headline features. Passing a bool through a million functions works to skip the shutdown delay, but seems fragile in the face of future dev work. The issue has been captured now, so I think we can look forward to being back on an official binary the next time we upgrade our cluster.

I'll see if I can't put up a PR or two to address the QoL issues since this PR solves the deadlock, but golang is still a real trip for me so it might be best left to the experts. :)

@schmichael merged commit 858d18d into master on Jul 18, 2019
@schmichael deleted the b-check-watcher-deadlock branch on Jul 18, 2019 at 20:13
schmichael added a commit that referenced this pull request Jul 19, 2019
Successfully merging this pull request may close these issues: job still running though received kill signal (#5395)