-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: deadlock stopping server because of stopper<->lease acquisition cycle #63761
Comments
From the snippets above, I'm not sure I completely follow how we're deadlocking. We seem to be respecting the Store.mu < Replica.Mu ordering we want in both threads. Agreed we can do something better with rejecting tasks/closing closers with Store.mu, but how do I convince myself it's actually a deadlock we're seeing here? |
Oh, doh, s.mu != Store.mu. |
It wants to rlock s.mu while it adds the task, to ensure it we don't do it after a Stop(). cockroach/pkg/util/stop/stopper.go Lines 421 to 423 in 0c9f64f
I'm still not sure about this deadlock, how do I reproduce it? The locks seem fine to me, just looks like the Closers can take a while sometimes. I was unable to reproduce it with your branch at #63672.
|
Don't the linked stack traces show a deadlock between a closer trying to visit a replica and that replica trying to start a stopper task (a lease acquisition)? Was my explanation of the top clear / which part doesn't hold up?
I was producing it pretty easily at some iteration of the test #63672. I'm sorry I didn't keep a snapshot of that state; I thought that the stacks I captured are clear enough but maybe I was wrong. |
I don't think it does, no. The stacks do show that the lease acquisition thread is blocked on the closer thread, but that's what we want to happen (things are being closed, and we don't want new lease acquisition tasks to get started). It's also not clear that these are the same replica (we can't tell; the memory address of the replica in the lease acquisition thread, The stacks simply show that the closer thread is waiting on some |
Closing the stopper does not trigger lease acquisitions. But if a lease acquisition was triggered by something else just before the the stopper is asked to stop -> that's the deadlock. The lease acquisition takes an exclusive lock on |
Ack, missed that. Also forgot about
|
Fixes cockroachdb#63761. As of cockroachdb#61279, it was possible for us to deadlock due to inconsistent lock orderings between Stopper.mu and Replica.mu. We were previously holding onto Stopper.mu while executing all closers, including those that may acquire other locks. Because closers can be defined anywhere (and may consequently grab any in-scope lock), we should order Stopper.mu to come after all other locks in the system. The closer added in cockroachdb#61279 iterated over all non-destroyed replicas, locking Replica.mu to check for the replica's destroy status (in Store.VisitReplicas). This deadlocked with the lease acquisition code path that first grabs an exclusive lock over Replica.mu (see InitOrJoinRequest), and uses the stopper to kick off an async task acquiring the lease. The stopper internally locks Stopper.mu to check whether or not it was already stopped. Release note: None
Fixes cockroachdb#63761. As of cockroachdb#61279, it was possible for us to deadlock due to inconsistent lock orderings between Stopper.mu and Replica.mu. We were previously holding onto Stopper.mu while executing all closers, including those that may acquire other locks. Because closers can be defined anywhere (and may consequently grab any in-scope lock), we should order Stopper.mu to come after all other locks in the system. The closer added in cockroachdb#61279 iterated over all non-destroyed replicas, locking Replica.mu to check for the replica's destroy status (in Store.VisitReplicas). This deadlocked with the lease acquisition code path that first grabs an exclusive lock over Replica.mu (see InitOrJoinRequest), and uses the stopper to kick off an async task acquiring the lease. The stopper internally locks Stopper.mu to check whether or not it was already stopped. Release note: None
63867: stop: break deadlock between Stopper.mu and Replica.mu r=irfansharif a=irfansharif Fixes #63761. As of #61279, it was possible for us to deadlock due to inconsistent lock orderings between Stopper.mu and Replica.mu. We were previously holding onto Stopper.mu while executing all closers, including those that may acquire other locks. Because closers can be defined anywhere (and may consequently grab any in-scope lock), we should order Stopper.mu to come after all other locks in the system. The closer added in #61279 iterated over all non-destroyed replicas, locking Replica.mu to check for the replica's destroy status (in Store.VisitReplicas). This deadlocked with the lease acquisition code path that first grabs an exclusive lock over Replica.mu (see InitOrJoinRequest), and uses the stopper to kick off an async task acquiring the lease. The stopper internally locks Stopper.mu to check whether or not it was already stopped. Release note: None Co-authored-by: irfan sharif <[email protected]>
We appear to have the following deadlock:
The stopper wants to stop; it runs all the closers under s.mu. A closer that was recently added wants to visit all the replicas. Visiting a replica briefly rlocks its
r.mu
to check if it's been destroyed.The lease acquisition tries to start a task, and starting tasks wants an rlock on
s.mu
just to figure out whether the task should be refused@irfansharif you've added the closer in #61279, so I'll let you figure out who has to give. It seems to me that rejecting new tasks in the stopper can probably be done in a lockfree way. I also wonder whether the closers actually need to run under
s.mu
- particularly under a write lock. Perhaps we can copy them out of the lock.Stacks
The text was updated successfully, but these errors were encountered: