Deadlock demo in acquireShards #5824

Closed

Groxx wants to merge 4 commits

Conversation

@Groxx (Contributor) commented Mar 29, 2024

Demo for #5823: a flawed construct arising from the combination of #2888 (concurrently acquire shards) and #3134 (handle shutdown).

It's pretty obviously incorrect:

```
❯ go test -v -test.run TestDeadlock ./service/history/shard
=== RUN   TestDeadlock
    controller_test.go:193: time remaining: 9m59.498006333s, avg runtime: 499.157614ms, longest: 501.214333ms, successes: 100, fails: 0
    controller_test.go:193: time remaining: 9m58.954002708s, avg runtime: 542.559293ms, longest: 543.647833ms, successes: 100, fails: 0
    controller_test.go:193: time remaining: 9m58.434399375s, avg runtime: 518.730977ms, longest: 519.447708ms, successes: 100, fails: 0
    controller_test.go:193: time remaining: 9m57.905938917s, avg runtime: 527.407791ms, longest: 528.353042ms, successes: 100, fails: 0
...
    controller_test.go:193: time remaining: 9m50.650997167s, avg runtime: 498.004745ms, longest: 499.985292ms, successes: 100, fails: 0
    controller_test.go:193: time remaining: 9m50.172357417s, avg runtime: 477.521516ms, longest: 478.509375ms, successes: 100, fails: 0
    controller_test.go:193: time remaining: 9m49.658607167s, avg runtime: 512.926921ms, longest: 513.583542ms, successes: 100, fails: 0
    controller_test.go:193: time remaining: 9m49.151183708s, avg runtime: 504.016801ms, longest: 507.16975ms, successes: 100, fails: 0
    controller_test.go:193: time remaining: 9m48.671873833s, avg runtime: 478.087117ms, longest: 479.052208ms, successes: 100, fails: 0
    controller_test.go:147: took too long
    controller_test.go:193: time remaining: 9m38.669626958s, avg runtime: 542.430156ms, longest: 544.350417ms, successes: 99, fails: 1
```

Note that ~10s passed between the last two progress lines, though it's reliably ~0.5s normally, and the CPU is idle during that time.

A cause-sequence that leads to this, starting from a full work channel (sketched in code right after this list):

  • worker thread: worker pulls from chan
  • acquire thread: at the same time as the above, the shard loop pushes to refill the chan
  • acquire thread: checks shutdown, nope
  • acquire thread: push to full chan, blocks
  • other thread: shutdown occurs
  • worker thread: worker checks shutdown, yes, returns
  • and now there are no workers to pull from the full chan, so the push blocks forever
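Stripped to its essentials, that shape looks roughly like the following. This is a minimal standalone sketch, not the actual controller code: the names (`acquireShards`, `stopped`, the 1-slot buffer) are made up, and a sleep exaggerates the gap between the worker's pull and its shutdown check so the hang reproduces on most runs rather than only occasionally.

```
// deadlock_demo.go - a minimal standalone sketch of the flawed shape
// (illustrative names only, not the actual controller code).
package main

import (
	"sync"
	"sync/atomic"
	"time"
)

func acquireShards(shardIDs []int, stopped *atomic.Bool) {
	work := make(chan int, 1) // too-small buffer: sends can block

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // worker thread
		defer wg.Done()
		for id := range work { // worker pulls from chan
			// Exaggerated gap between the pull and the shutdown check; in the
			// real code this window is tiny, which is why the hang is rare.
			time.Sleep(10 * time.Millisecond)
			if stopped.Load() { // worker checks shutdown, yes, returns
				return
			}
			_ = id // "acquire" the shard
		}
	}()

	for _, id := range shardIDs { // acquire thread
		if stopped.Load() { // checks shutdown, nope
			break
		}
		work <- id // push to full chan, blocks
	}
	close(work)
	wg.Wait()
}

func main() {
	var stopped atomic.Bool
	go func() { // other thread: shutdown occurs mid-acquire
		time.Sleep(5 * time.Millisecond)
		stopped.Store(true)
	}()
	acquireShards(make([]int, 1000), &stopped)
}
```

Run it and the program never finishes: the main goroutine is stuck on the push, no worker remains to pull, and the Go runtime eventually reports "all goroutines are asleep - deadlock!".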

The fix is rather easy: buffer correctly, so the acquire thread can never block on a push and therefore cannot fail to notice the shutdown (a sketch of the buffered version follows the output below).
Then the test runs like:

```
    controller_test.go:205: time remaining: 8m33.72483275s, avg runtime: 12.304837ms, longest: 55.83725ms, successes: 100, fails: 0
    controller_test.go:205: time remaining: 8m33.683416958s, avg runtime: 17.070645ms, longest: 37.5345ms, successes: 100, fails: 0
    controller_test.go:205: time remaining: 8m33.634528958s, avg runtime: 15.512894ms, longest: 43.902167ms, successes: 100, fails: 0
    controller_test.go:205: time remaining: 8m33.565585208s, avg runtime: 18.193592ms, longest: 67.813334ms, successes: 100, fails: 0
    controller_test.go:205: time remaining: 8m33.50871675s, avg runtime: 10.90086ms, longest: 54.84125ms, successes: 100, fails: 0
    controller_test.go:205: time remaining: 8m33.452797208s, avg runtime: 9.594426ms, longest: 52.924791ms, successes: 100, fails: 0
    controller_test.go:205: time remaining: 8m33.394790083s, avg runtime: 15.320305ms, longest: 55.232875ms, successes: 100, fails: 0
    controller_test.go:205: time remaining: 8m33.33406525s, avg runtime: 20.083535ms, longest: 55.97175ms, successes: 100, fails: 0
    controller_test.go:205: time remaining: 8m33.273468375s, avg runtime: 31.289924ms, longest: 57.83125ms, successes: 100, fails: 0
```

Goroutine-swapping and atomics add up a fair bit in these micro-benchmarks: the fixed version is >20x faster. That's inconsequential in the real code, though.
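Concretely, continuing the sketch above (same imports and made-up names), the buffered version only changes the channel's capacity:

```
// Same made-up names as the sketch above; only the buffer size changes.
// With room for every shard, the acquire thread's sends can never block, so
// it always reaches its next shutdown check (or simply finishes).
func acquireShardsBuffered(shardIDs []int, stopped *atomic.Bool) {
	work := make(chan int, len(shardIDs)) // buffer holds all possible work

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for id := range work {
			if stopped.Load() {
				return // anything left in the buffer is simply never processed
			}
			_ = id // "acquire" the shard
		}
	}()

	for _, id := range shardIDs {
		if stopped.Load() {
			break
		}
		work <- id // never blocks: the buffer can hold every shard
	}
	close(work)
	wg.Wait()
}
```

If shutdown lands between the acquire thread's check and its push, the worst case is an item or two left unprocessed in the buffer; nothing can hang.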

Groxx added a commit that referenced this pull request Mar 29, 2024
Fixing the deadlock demonstrated in #5824.

I decided to move the channel-writing entirely before consuming so it's a bit more accidental-change-resistant: some kinds of simple incorrect changes will lead to an _immediate_ deadlock every time, rather than a random chance of one.

And if someone _does_ want to move it after and go back to a smaller buffer, more code will have to be changed, so hopefully people will pay more attention to the concurrency risks involved.

More generally, the atomic shutdown stuff is _highly_ prone to causing this kind of error because there's no way to wait on it safely, and I would really love for us to get rid of it.
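For context on that last point: the usual Go alternative to an atomic shutdown flag is a channel that is closed exactly once on shutdown. A closed channel can be waited on in a select, so a push can abort the moment shutdown happens instead of blocking past it. A minimal sketch of that pattern (illustrative only, not the controller's actual API; assumes only the standard sync package):

```
// A waitable shutdown signal: a channel closed exactly once on shutdown.
// Illustrative only, not the controller's actual API.
type shutdownSignal struct {
	once sync.Once
	ch   chan struct{}
}

func newShutdownSignal() *shutdownSignal {
	return &shutdownSignal{ch: make(chan struct{})}
}

// Shutdown is safe to call more than once.
func (s *shutdownSignal) Shutdown() { s.once.Do(func() { close(s.ch) }) }

// Done returns a channel that is closed once shutdown has been requested.
func (s *shutdownSignal) Done() <-chan struct{} { return s.ch }

// send pushes id to work, or gives up immediately if shutdown happens first,
// so the sender can never block past the shutdown.
func send(work chan<- int, id int, s *shutdownSignal) bool {
	select {
	case work <- id:
		return true
	case <-s.Done():
		return false
	}
}
```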
@Groxx (Contributor, Author) commented Mar 29, 2024

Closing, as this is not intended to be merged.

@Groxx Groxx closed this Mar 29, 2024