Deadlock demo in acquireShards #5824

Closed

Groxx wants to merge 4 commits

Conversation

@Groxx (Contributor) commented Mar 29, 2024

Demo for #5823: a flawed construct arising from the combination of #2888 (concurrently acquire shards) and #3134 (handle shutdown).

It's pretty obviously incorrect:

```
❯ go test -v -test.run TestDeadlock ./service/history/shard
=== RUN   TestDeadlock
    controller_test.go:193: time remaining: 9m59.498006333s, avg runtime: 499.157614ms, longest: 501.214333ms, successes: 100, fails: 0
    controller_test.go:193: time remaining: 9m58.954002708s, avg runtime: 542.559293ms, longest: 543.647833ms, successes: 100, fails: 0
    controller_test.go:193: time remaining: 9m58.434399375s, avg runtime: 518.730977ms, longest: 519.447708ms, successes: 100, fails: 0
    controller_test.go:193: time remaining: 9m57.905938917s, avg runtime: 527.407791ms, longest: 528.353042ms, successes: 100, fails: 0
...
    controller_test.go:193: time remaining: 9m50.650997167s, avg runtime: 498.004745ms, longest: 499.985292ms, successes: 100, fails: 0
    controller_test.go:193: time remaining: 9m50.172357417s, avg runtime: 477.521516ms, longest: 478.509375ms, successes: 100, fails: 0
    controller_test.go:193: time remaining: 9m49.658607167s, avg runtime: 512.926921ms, longest: 513.583542ms, successes: 100, fails: 0
    controller_test.go:193: time remaining: 9m49.151183708s, avg runtime: 504.016801ms, longest: 507.16975ms, successes: 100, fails: 0
    controller_test.go:193: time remaining: 9m48.671873833s, avg runtime: 478.087117ms, longest: 479.052208ms, successes: 100, fails: 0
    controller_test.go:147: took too long
    controller_test.go:193: time remaining: 9m38.669626958s, avg runtime: 542.430156ms, longest: 544.350417ms, successes: 99, fails: 1
```

Note that ~10s passed between the last two progress lines, though it's reliably ~0.5s normally, and the CPU is idle during that time.

A cause-sequence that leads to this, starting from a full work channel (sketched in code right after this list):

  • worker thread: worker pulls from chan
  • acquire thread: at the same time as the above, the shard loop pushes to refill the chan
  • acquire thread: checks shutdown, nope
  • acquire thread: push to full chan, blocks
  • other thread: shutdown occurs
  • worker thread: worker checks shutdown, yes, returns
  • and now there are no workers to pull from the full chan, so the push blocks forever
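Stripped to its essentials, that shape looks roughly like the following. This is a minimal standalone sketch, not the actual controller code: the names (`acquireShards`, `stopped`, the 1-slot buffer) are made up, and a sleep exaggerates the gap between the worker's pull and its shutdown check so the hang reproduces on most runs rather than only occasionally.

```
// deadlock_demo.go - a minimal standalone sketch of the flawed shape
// (illustrative names only, not the actual controller code).
package main

import (
	"sync"
	"sync/atomic"
	"time"
)

func acquireShards(shardIDs []int, stopped *atomic.Bool) {
	work := make(chan int, 1) // too-small buffer: sends can block

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // worker thread
		defer wg.Done()
		for id := range work { // worker pulls from chan
			// Exaggerated gap between the pull and the shutdown check; in the
			// real code this window is tiny, which is why the hang is rare.
			time.Sleep(10 * time.Millisecond)
			if stopped.Load() { // worker checks shutdown, yes, returns
				return
			}
			_ = id // "acquire" the shard
		}
	}()

	for _, id := range shardIDs { // acquire thread
		if stopped.Load() { // checks shutdown, nope
			break
		}
		work <- id // push to full chan, blocks
	}
	close(work)
	wg.Wait()
}

func main() {
	var stopped atomic.Bool
	go func() { // other thread: shutdown occurs mid-acquire
		time.Sleep(5 * time.Millisecond)
		stopped.Store(true)
	}()
	acquireShards(make([]int, 1000), &stopped)
}
```

Run it and the program never finishes: the main goroutine is stuck on the push, no worker remains to pull, and the Go runtime eventually reports "all goroutines are asleep - deadlock!".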

The fix is rather easy: buffer correctly, so the acquire thread can never block on a push and therefore cannot fail to notice the shutdown (a sketch of the buffered version follows the output below).
Then the test runs like:

```
    controller_test.go:205: time remaining: 8m33.72483275s, avg runtime: 12.304837ms, longest: 55.83725ms, successes: 100, fails: 0
    controller_test.go:205: time remaining: 8m33.683416958s, avg runtime: 17.070645ms, longest: 37.5345ms, successes: 100, fails: 0
    controller_test.go:205: time remaining: 8m33.634528958s, avg runtime: 15.512894ms, longest: 43.902167ms, successes: 100, fails: 0
    controller_test.go:205: time remaining: 8m33.565585208s, avg runtime: 18.193592ms, longest: 67.813334ms, successes: 100, fails: 0
    controller_test.go:205: time remaining: 8m33.50871675s, avg runtime: 10.90086ms, longest: 54.84125ms, successes: 100, fails: 0
    controller_test.go:205: time remaining: 8m33.452797208s, avg runtime: 9.594426ms, longest: 52.924791ms, successes: 100, fails: 0
    controller_test.go:205: time remaining: 8m33.394790083s, avg runtime: 15.320305ms, longest: 55.232875ms, successes: 100, fails: 0
    controller_test.go:205: time remaining: 8m33.33406525s, avg runtime: 20.083535ms, longest: 55.97175ms, successes: 100, fails: 0
    controller_test.go:205: time remaining: 8m33.273468375s, avg runtime: 31.289924ms, longest: 57.83125ms, successes: 100, fails: 0
```

Goroutine-swapping and atomics add up a fair bit in these micro-benchmarks: the fixed version is >20x faster. That's inconsequential in the real code, though.
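Concretely, continuing the sketch above (same imports and made-up names), the buffered version only changes the channel's capacity:

```
// Same made-up names as the sketch above; only the buffer size changes.
// With room for every shard, the acquire thread's sends can never block, so
// it always reaches its next shutdown check (or simply finishes).
func acquireShardsBuffered(shardIDs []int, stopped *atomic.Bool) {
	work := make(chan int, len(shardIDs)) // buffer holds all possible work

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for id := range work {
			if stopped.Load() {
				return // anything left in the buffer is simply never processed
			}
			_ = id // "acquire" the shard
		}
	}()

	for _, id := range shardIDs {
		if stopped.Load() {
			break
		}
		work <- id // never blocks: the buffer can hold every shard
	}
	close(work)
	wg.Wait()
}
```

If shutdown lands between the acquire thread's check and its push, the worst case is an item or two left unprocessed in the buffer; nothing can hang.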

Groxx added a commit that referenced this pull request Mar 29, 2024
Fixing the deadlock demonstrated in #5824.

I decided to move the channel-writing entirely before consuming so it's a bit more accidental-change-resistant: some kinds of simple incorrect changes will lead to an _immediate_ deadlock every time, rather than a random chance of one.

And if someone _does_ want to move it after and go back to a smaller buffer, more code will have to be changed, so hopefully people will pay more attention to the concurrency risks involved.

More generally, the atomic shutdown stuff is _highly_ prone to causing this kind of error because there's no way to wait on it safely, and I would really love for us to get rid of it.
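For context on that last point: the usual Go alternative to an atomic shutdown flag is a channel that is closed exactly once on shutdown. A closed channel can be waited on in a select, so a push can abort the moment shutdown happens instead of blocking past it. A minimal sketch of that pattern (illustrative only, not the controller's actual API; assumes only the standard sync package):

```
// A waitable shutdown signal: a channel closed exactly once on shutdown.
// Illustrative only, not the controller's actual API.
type shutdownSignal struct {
	once sync.Once
	ch   chan struct{}
}

func newShutdownSignal() *shutdownSignal {
	return &shutdownSignal{ch: make(chan struct{})}
}

// Shutdown is safe to call more than once.
func (s *shutdownSignal) Shutdown() { s.once.Do(func() { close(s.ch) }) }

// Done returns a channel that is closed once shutdown has been requested.
func (s *shutdownSignal) Done() <-chan struct{} { return s.ch }

// send pushes id to work, or gives up immediately if shutdown happens first,
// so the sender can never block past the shutdown.
func send(work chan<- int, id int, s *shutdownSignal) bool {
	select {
	case work <- id:
		return true
	case <-s.Done():
		return false
	}
}
```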
@Groxx (Contributor, Author) commented Mar 29, 2024

Closing, as this is not intended to be merged.

@Groxx Groxx closed this Mar 29, 2024