consul: fix deadlock in check-based restarts #5975
Conversation
Fixes #5395
Alternative to #5957

Make task restarting asynchronous when handling check-based restarts. This matches the pre-0.9 behavior where TaskRunner.Restart was an asynchronous signal. The check-based restarting code was not designed to handle blocking in TaskRunner.Restart. 0.9 made it reentrant and could easily overwhelm the buffered update chan and deadlock.

Many thanks to @byronwolfman for his excellent debugging, PR, and reproducer!

I created this alternative as changing the functionality of TaskRunner.Restart has a much larger impact. This approach reverts to old known-good behavior and minimizes the number of places changes are made.
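For orientation, the shape of that change is roughly the following. This is a minimal sketch rather than Nomad's actual code: the TaskRestarter interface, the standard-library logger, and the 30-second timeout are stand-ins for Nomad's internal types and tuning.

```go
package checkwatcher

import (
	"context"
	"log"
	"time"
)

// TaskRestarter is a stand-in for the task runner; only the Restart
// call matters for this sketch.
type TaskRestarter interface {
	Restart(ctx context.Context, reason string, failure bool) error
}

// asyncRestart mirrors the fire-and-forget call shown in the diff below:
// the restart runs in its own goroutine, so the check watcher's Run loop
// never blocks waiting for it to finish.
func asyncRestart(ctx context.Context, logger *log.Logger, task TaskRestarter, reason string) {
	// Bound the restart so a wedged task runner cannot leak this goroutine forever.
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	if err := task.Restart(ctx, reason, true); err != nil {
		// With no caller left to return to, errors can only be logged.
		logger.Printf("failed to restart task: %v", err)
	}
}
```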
lgtm - i have a question about the test but merge away.
@@ -194,6 +195,28 @@ func TestCheckWatcher_Healthy(t *testing.T) {
}
}

// TestCheckWatcher_Unhealthy asserts unhealthy tasks are restarted exactly once.
func TestCheckWatcher_Unhealthy(t *testing.T) {
I'm missing some context for this test - does it trigger the deadlock issue here? Is it a relatively easy thing to test for?
No, this test just asserts checks are only restarted once.
I added a new test for the deadlock in 1763672 and confirmed it does cause the deadlock before my changes.
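To see why the synchronous 0.9 behavior can wedge, here is a self-contained toy model of the interaction. This is not the test added in 1763672; the channel size and messages are invented purely for illustration, and running it trips Go's runtime deadlock detector.

```go
package main

import "fmt"

// Toy model of the 0.9 deadlock: the loop that drains the buffered update
// channel also performs the (now synchronous) restart, and the restart
// itself produces more updates than the buffer can hold.
func main() {
	updateCh := make(chan string, 1) // small buffer, like the watcher's update chan

	// A check goes unhealthy.
	updateCh <- "check unhealthy"

	// Stand-in for a synchronous TaskRunner.Restart: restarting the task
	// re-registers its checks, which are reported back on updateCh.
	restart := func() {
		updateCh <- "unwatch old check" // fits in the buffer
		updateCh <- "watch new check"   // blocks: the only reader is the caller
	}

	// The watcher's Run loop: the only reader of updateCh.
	for update := range updateCh {
		fmt.Println("handling:", update)
		if update == "check unhealthy" {
			restart() // blocks inside the loop, so nothing drains updateCh => deadlock
		}
	}
}
```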
// Error restarting
return false
}
go asyncRestart(ctx, c.logger, c.task, event)
Question: does it make sense to use a semaphore or the channel-blocking technique used elsewhere, so that we don't call task.Restart concurrently, and so that if we get a spike of Restart calls we only restart once?
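For reference, the kind of guard being suggested could look roughly like the sketch below; restartGuard and its try-acquire are assumptions for illustration, not code from this PR.

```go
package checkwatcher

import "context"

// restartGuard is a one-slot semaphore: at most one restart of a task runs
// at a time, and extra restart requests during that window are dropped.
type restartGuard chan struct{}

func newRestartGuard() restartGuard { return make(restartGuard, 1) }

// tryRestart attempts a non-blocking acquire. If another restart is already
// in flight, a spike of additional restart requests collapses into that one.
func (g restartGuard) tryRestart(ctx context.Context, restart func(context.Context) error) bool {
	select {
	case g <- struct{}{}:
		go func() {
			defer func() { <-g }() // release when the restart finishes
			_ = restart(ctx)
		}()
		return true
	default:
		return false // a restart is already running; skip this one
	}
}
```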
Good question!
The checkWatcher.Run loop removes a check after Restart is called, so the same task won't be restarted more than once (until it completes the restart and re-registers the check).
In 0.8 TR.Restart just ticked a chan and so was async without having to create a new goroutine. This seemed like the least risky way of replicating that behavior.
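That "ticked a chan" behavior is essentially a one-slot signal channel. A rough sketch of the shape, with illustrative names rather than the actual 0.8 code:

```go
package taskrunner

// restarter holds a one-slot signal channel, roughly how a pre-0.9
// Restart could signal asynchronously without spawning a goroutine.
type restarter struct {
	restartCh chan struct{}
}

func newRestarter() *restarter {
	return &restarter{restartCh: make(chan struct{}, 1)}
}

// Restart just "ticks" the channel and returns immediately; duplicate
// ticks while a restart is already pending coalesce into one.
func (r *restarter) Restart() {
	select {
	case r.restartCh <- struct{}{}:
	default: // a restart is already queued
	}
}

// run is the task runner's main loop: it consumes the tick and performs
// the actual (possibly slow) restart without the caller ever blocking.
func (r *restarter) run(doRestart func()) {
	for range r.restartCh {
		doRestart()
	}
}
```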
Tasks that fail in a tight loop (check_restart.grace=0 and restart.delay=0) could in theory spin up lots of goroutines, but the goroutines for a single task should rarely if ever overlap, and restarting a task already involves creating a lot of resources more expensive than a goroutine.

That being said, I hate this "fire and forget" pattern, so I'm open to ideas as long as they can't block checkWatcher.Run/checkRestart.apply. (checkWatcher should probably be refactored to separate Watch/Unwatch mutations from check watching, but that seemed way too big a risk for a point release.)
Co-Authored-By: Mahmood Ali <[email protected]>
Something that occurs to me about this approach is that because we're still going through the new hooks (even if async), we get held up by the shutdown delay. I'm not sure if there's an easy way around this, but speaking selfishly for our own cluster, we'd really like it if restarts triggered by failed checks did not have to wait through their shutdown delay before restarting.
Good catch! It appears that in 0.8 the shutdown delay only applied to shutdowns, not restarts, so this is a behavior change in 0.9. I think your logic is actually better than either 0.8 or 0.9! It seems like ShutdownDelay should be honored on "healthy" restarts like Consul template changes, but not on failure-induced restarts, as your PR implemented. It would be easy to make 0.9 match the 0.8 behavior, but I'm leaning toward reopening and accepting your PR - #5957 - over this one. Will run it by the team.
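To make that distinction concrete, here is a hypothetical sketch of honoring the delay only on healthy restarts; none of these names come from Nomad's code, and #5957 may implement this differently.

```go
package taskrunner

import (
	"context"
	"time"
)

// waitShutdownDelay is a hypothetical helper: healthy restarts (e.g. a
// consul-template change) wait out the configured shutdown delay so load
// balancers can drain, while failure-induced check restarts skip it.
func waitShutdownDelay(ctx context.Context, delay time.Duration, failure bool) {
	if failure || delay <= 0 {
		return // unhealthy task: restart immediately
	}
	select {
	case <-time.After(delay):
	case <-ctx.Done():
	}
}
```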
@byronwolfman As a team we decided to push forward with this conservative PR that's focused on fixing the critical deadlock issue. I opened #5980 to track the shutdown delay enhancement, and we would merge PRs for that after 0.9.4 is released! Your existing PR (#5957) would still work as a fix to the new shutdown delay issue, but I think we would want to drop passing the bool through all of those functions. Thanks again for your hard work on this issue! Sorry we're being so conservative with the 0.9.4 release, but we really want 0.9.x to stabilize and put all new features into 0.10.0.
@schmichael Thanks for following up! That seems like a sensible decision to me. Our shop is likely going to run with our custom patch for now just so that we can roll out 0.9.3, since we're really keen on the 0.9.x headline features. Passing a bool through a million functions works to skip the shutdown delay, but seems fragile in the face of future dev work. The issue has been captured now, so I think we can look forward to being back on an official binary the next time we upgrade our cluster. I'll see if I can't put up a PR or two to address the QoL issues since this PR solves the deadlock, but golang is still a real trip for me so it might be best left to the experts. :)
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.