-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stop: break deadlock between Stopper.mu and Replica.mu #63867
stop: break deadlock between Stopper.mu and Replica.mu #63867
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)
pkg/kv/kvserver/store.go, line 474 at r1 (raw file):
// obeyed: baseQueue.mu < Replica.raftMu < Replica.readOnlyCmdMu < Store.mu // < Replica.mu < Replica.unreachablesMu < Store.coalescedMu // < Store.scheduler.mu < Stopper.mu.
I don't think including stopper.mu
in this list is a good idea... That's a library lock not directly related to replica processing.
pkg/util/stop/stopper.go, line 477 at r1 (raw file):
s.mu.Lock() closers := s.mu.closers defer s.mu.Unlock()
mmm I don't think you've fixed what you wanted to fix... We're still calling the closer under the lock. I think you don't want to defer
the unlock any more.
In fact I think you don't even need to take this lock any more, based on the same argument below about how new closers can't be added any more (and the freezing was done by this thread ).
pkg/util/stop/stopper.go, line 479 at r1 (raw file):
defer s.mu.Unlock() // We can't hold on to Stopper.mu while executing closers. Closers are
s/Stopper.mu/s.mu (or just "the lock").
pkg/util/stop/stopper.go, line 481 at r1 (raw file):
// We can't hold on to Stopper.mu while executing closers. Closers are // allowed to be defined anywhere, and are allowed to grab any lock // available. As far as lock ordering is concerned, Stopper.mu comes after
nit: what does "any lock available" mean? I think this note is generally too verbose; just say "run the closers without holding the lock". Avoiding calling callbacks with locks held is a very common thing.
48eedde
to
0c2383f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)
pkg/kv/kvserver/store.go, line 474 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
I don't think including
stopper.mu
in this list is a good idea... That's a library lock not directly related to replica processing.
Done.
pkg/util/stop/stopper.go, line 477 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
mmm I don't think you've fixed what you wanted to fix... We're still calling the closer under the lock. I think you don't want to
defer
the unlock any more.In fact I think you don't even need to take this lock any more, based on the same argument below about how new closers can't be added any more (and the freezing was done by this thread ).
Done.
pkg/util/stop/stopper.go, line 479 at r1 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
s/Stopper.mu/s.mu (or just "the lock").
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)
Fixes cockroachdb#63761. As of cockroachdb#61279, it was possible for us to deadlock due to inconsistent lock orderings between Stopper.mu and Replica.mu. We were previously holding onto Stopper.mu while executing all closers, including those that may acquire other locks. Because closers can be defined anywhere (and may consequently grab any in-scope lock), we should order Stopper.mu to come after all other locks in the system. The closer added in cockroachdb#61279 iterated over all non-destroyed replicas, locking Replica.mu to check for the replica's destroy status (in Store.VisitReplicas). This deadlocked with the lease acquisition code path that first grabs an exclusive lock over Replica.mu (see InitOrJoinRequest), and uses the stopper to kick off an async task acquiring the lease. The stopper internally locks Stopper.mu to check whether or not it was already stopped. Release note: None
0c2383f
to
79922e4
Compare
bors r+ |
Build succeeded: |
Fixes #63761. As of #61279, it was possible for us to deadlock due to
inconsistent lock orderings between Stopper.mu and Replica.mu. We were
previously holding onto Stopper.mu while executing all closers,
including those that may acquire other locks. Because closers can be
defined anywhere (and may consequently grab any in-scope lock), we
should order Stopper.mu to come after all other locks in the system.
The closer added in #61279 iterated over all non-destroyed replicas,
locking Replica.mu to check for the replica's destroy status (in
Store.VisitReplicas). This deadlocked with the lease acquisition code
path that first grabs an exclusive lock over Replica.mu (see
InitOrJoinRequest), and uses the stopper to kick off an async task
acquiring the lease. The stopper internally locks Stopper.mu to check
whether or not it was already stopped.
Release note: None