Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stop: break deadlock between Stopper.mu and Replica.mu #63867

Merged
merged 1 commit into from
Apr 20, 2021

Conversation

irfansharif
Copy link
Contributor

Fixes #63761. As of #61279, it was possible for us to deadlock due to
inconsistent lock orderings between Stopper.mu and Replica.mu. We were
previously holding onto Stopper.mu while executing all closers,
including those that may acquire other locks. Because closers can be
defined anywhere (and may consequently grab any in-scope lock), we
should order Stopper.mu to come after all other locks in the system.

The closer added in #61279 iterated over all non-destroyed replicas,
locking Replica.mu to check for the replica's destroy status (in
Store.VisitReplicas). This deadlocked with the lease acquisition code
path that first grabs an exclusive lock over Replica.mu (see
InitOrJoinRequest), and uses the stopper to kick off an async task
acquiring the lease. The stopper internally locks Stopper.mu to check
whether or not it was already stopped.

Release note: None

@irfansharif irfansharif requested review from andreimatei and tbg April 19, 2021 19:14
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)


pkg/kv/kvserver/store.go, line 474 at r1 (raw file):

	// obeyed: baseQueue.mu < Replica.raftMu < Replica.readOnlyCmdMu < Store.mu
	// < Replica.mu < Replica.unreachablesMu < Store.coalescedMu
	// < Store.scheduler.mu < Stopper.mu.

I don't think including stopper.mu in this list is a good idea... That's a library lock not directly related to replica processing.


pkg/util/stop/stopper.go, line 477 at r1 (raw file):

	s.mu.Lock()
	closers := s.mu.closers
	defer s.mu.Unlock()

mmm I don't think you've fixed what you wanted to fix... We're still calling the closer under the lock. I think you don't want to defer the unlock any more.

In fact I think you don't even need to take this lock any more, based on the same argument below about how new closers can't be added any more (and the freezing was done by this thread ).


pkg/util/stop/stopper.go, line 479 at r1 (raw file):

	defer s.mu.Unlock()

	// We can't hold on to Stopper.mu while executing closers. Closers are

s/Stopper.mu/s.mu (or just "the lock").


pkg/util/stop/stopper.go, line 481 at r1 (raw file):

	// We can't hold on to Stopper.mu while executing closers. Closers are
	// allowed to be defined anywhere, and are allowed to grab any lock
	// available. As far as lock ordering is concerned, Stopper.mu comes after

nit: what does "any lock available" mean? I think this note is generally too verbose; just say "run the closers without holding the lock". Avoiding calling callbacks with locks held is a very common thing.

@irfansharif irfansharif force-pushed the 210419.deadlock-store-close branch from 48eedde to 0c2383f Compare April 19, 2021 19:45
Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)


pkg/kv/kvserver/store.go, line 474 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't think including stopper.mu in this list is a good idea... That's a library lock not directly related to replica processing.

Done.


pkg/util/stop/stopper.go, line 477 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

mmm I don't think you've fixed what you wanted to fix... We're still calling the closer under the lock. I think you don't want to defer the unlock any more.

In fact I think you don't even need to take this lock any more, based on the same argument below about how new closers can't be added any more (and the freezing was done by this thread ).

Done.


pkg/util/stop/stopper.go, line 479 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

s/Stopper.mu/s.mu (or just "the lock").

Done.

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei and @tbg)

Fixes cockroachdb#63761. As of cockroachdb#61279, it was possible for us to deadlock due to
inconsistent lock orderings between Stopper.mu and Replica.mu. We were
previously holding onto Stopper.mu while executing all closers,
including those that may acquire other locks. Because closers can be
defined anywhere (and may consequently grab any in-scope lock), we
should order Stopper.mu to come after all other locks in the system.

The closer added in cockroachdb#61279 iterated over all non-destroyed replicas,
locking Replica.mu to check for the replica's destroy status (in
Store.VisitReplicas). This deadlocked with the lease acquisition code
path that first grabs an exclusive lock over Replica.mu (see
InitOrJoinRequest), and uses the stopper to kick off an async task
acquiring the lease. The stopper internally locks Stopper.mu to check
whether or not it was already stopped.

Release note: None
@irfansharif irfansharif force-pushed the 210419.deadlock-store-close branch from 0c2383f to 79922e4 Compare April 20, 2021 04:46
@irfansharif
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Apr 20, 2021

Build succeeded:

@craig craig bot merged commit d094f02 into cockroachdb:master Apr 20, 2021
@irfansharif irfansharif deleted the 210419.deadlock-store-close branch April 20, 2021 05:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kvserver: deadlock stopping server because of stopper<->lease acquisition cycle
3 participants