
kvserver: deadlock stopping server because of stopper<->lease acquisition cycle #63761

Closed
andreimatei opened this issue Apr 15, 2021 · 7 comments · Fixed by #63867
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker

Comments

@andreimatei
Contributor

andreimatei commented Apr 15, 2021

We appear to have the following deadlock:

The stopper wants to stop; it runs all the closers under s.mu. A closer that was recently added wants to visit all the replicas. Visiting a replica briefly rlocks its r.mu to check if it's been destroyed.

The lease acquisition tries to start a task, and starting a task wants an rlock on s.mu just to figure out whether the task should be refused.

@irfansharif you've added the closer in #61279, so I'll let you figure out who has to give. It seems to me that rejecting new tasks in the stopper can probably be done in a lock-free way. I also wonder whether the closers actually need to run under s.mu - particularly under a write lock. Perhaps we can copy them out of the lock.

Stacks
goroutine 11 [semacquire]:
sync.runtime_SemacquireMutex(0xc00206f994, 0xc002884500, 0x0)
	/home/andrei/goroot/src/runtime/sema.go:71 +0x47
sync.(*RWMutex).RLock(...)
	/home/andrei/goroot/src/sync/rwmutex.go:50
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*storeReplicaVisitor).Visit(0xc002884540, 0xc0028650b0)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/kv/kvserver/store.go:372 +0x1db
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).VisitReplicas(...)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/kv/kvserver/store.go:2070
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processRaft.func2()
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/kv/kvserver/store_raft.go:624 +0x9b
github.com/cockroachdb/cockroach/pkg/util/stop.CloserFn.Close(0xc00218d8c0)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/util/stop/stopper.go:110 +0x25
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop(0xc00073ea80, 0x55e39e0, 0xc00021e018)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/util/stop/stopper.go:478 +0x268
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServerLocked(0xc0007bb200, 0x0)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/testutils/testcluster/testcluster.go:165 +0x70
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).stopServers(0xc0007bb200, 0x55e39e0, 0xc00021e018)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/testutils/testcluster/testcluster.go:116 +0x1cf
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start.func2()
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/testutils/testcluster/testcluster.go:335 +0x45
github.com/cockroachdb/cockroach/pkg/util/stop.CloserFn.Close(0xc0006d5e40)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/util/stop/stopper.go:110 +0x25
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop(0xc00073e680, 0x55e39e0, 0xc00021e010)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/util/stop/stopper.go:478 +0x268
github.com/cockroachdb/cockroach/pkg/kv/kvserver_test.TestRejectedLeaseDoesntDictateClosedTimestamp(0xc000483080)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/kv/kvserver/replica_closedts_test.go:680 +0x1b46
testing.tRunner(0xc000483080, 0x4f13230)
	/home/andrei/goroot/src/testing/testing.go:1123 +0xef
created by testing.(*T).Run
	/home/andrei/goroot/src/testing/testing.go:1168 +0x2b3

goroutine 2083 [semacquire]:
sync.runtime_SemacquireMutex(0xc00073eaa4, 0xc001e62200, 0x0)
	/home/andrei/goroot/src/runtime/sema.go:71 +0x47
sync.(*RWMutex).RLock(...)
	/home/andrei/goroot/src/sync/rwmutex.go:50
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).runPrelude(0xc00073ea80, 0x4853100)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/util/stop/stopper.go:416 +0xdd
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask(0xc00073ea80, 0x55e39a0, 0xc00297ea00, 0xc0004d8540, 0x35, 0xc001ed66e0, 0x1, 0x1)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/util/stop/stopper.go:339 +0x79
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*pendingLeaseRequest).requestLeaseAsync(0xc00206fa90, 0x55e3a60, 0xc001ebf260, 0x100000001, 0x1676230a00000001, 0x0, 0x1676230ac22faf2a, 0x212, 0x0, 0x100000001, ...)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/kv/kvserver/replica_range_lease.go:344 +0x3f6
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*pendingLeaseRequest).InitOrJoinRequest(0xc00206fa90, 0x55e3a60, 0xc001ebf260, 0x100000001, 0x1, 0x0, 0x0, 0x0, 0x0, 0x100000001, ...)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/kv/kvserver/replica_range_lease.go:279 +0x988
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).requestLeaseLocked(0xc00206f600, 0x55e3a60, 0xc001ebf260, 0x0, 0x0, 0x0, 0x100000001, 0x1, 0x0, 0x0, ...)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/kv/kvserver/replica_range_lease.go:738 +0x3f8
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).redirectOnOrAcquireLeaseForRequest.func1(0xc00206f600, 0x55e3a60, 0xc001ebf260, 0xc0030e53b0, 0x0, 0x0, 0xc0030e53d0, 0xc0021870a0, 0x0, 0x0, ...)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/kv/kvserver/replica_range_lease.go:1125 +0x305
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).redirectOnOrAcquireLeaseForRequest(0xc00206f600, 0x55e3a60, 0xc001ebf260, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/kv/kvserver/replica_range_lease.go:1151 +0x365
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).redirectOnOrAcquireLease(...)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/kv/kvserver/replica_range_lease.go:1056
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).ensureClosedTimestampStarted(0xc00206f600, 0x55e3a60, 0xc001ebf260, 0xc002187098)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/kv/kvserver/replica_rangefeed.go:671 +0x73
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).RangeFeed(0xc00206f600, 0xc000dd00c0, 0x56371a0, 0xc0006defb0, 0x0)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/kv/kvserver/replica_rangefeed.go:153 +0x194
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).RangeFeed(0xc002042000, 0xc000dd00c0, 0x56371a0, 0xc0006defb0, 0x0)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/kv/kvserver/store.go:2502 +0x11e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).RangeFeed(0xc000848070, 0xc000dd00c0, 0x56371a0, 0xc0006defb0, 0x10)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/kv/kvserver/stores.go:216 +0xce
github.com/cockroachdb/cockroach/pkg/server.(*Node).RangeFeed(0xc000e88c00, 0xc000dd00c0, 0x56371a0, 0xc0006defb0, 0xc000e88c00, 0xc0020efbf0)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/server/node.go:1011 +0x54
github.com/cockroachdb/cockroach/pkg/roachpb._Internal_RangeFeed_Handler(0x47d4860, 0xc000e88c00, 0x562bfe0, 0xc000dd0000, 0x0, 0x0)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/roachpb/api.pb.go:8055 +0x10b
github.com/cockroachdb/cockroach/pkg/util/tracing.StreamServerInterceptor.func1(0x47d4860, 0xc000e88c00, 0x562bfe0, 0xc000dd0000, 0xc0010378e0, 0x4f13a98, 0x0, 0x0)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/util/tracing/grpc_interceptor.go:169 +0x5a9
google.golang.org/grpc.getChainStreamHandler.func1(0x47d4860, 0xc000e88c00, 0x562bfe0, 0xc000dd0000, 0x455d1c0, 0xc00297e9c0)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/vendor/google.golang.org/grpc/server.go:1302 +0xdd
github.com/cockroachdb/cockroach/pkg/rpc.NewServer.func2(0x47d4860, 0xc000e88c00, 0x562bfe0, 0xc000dd0000, 0xc0010378e0, 0xc00297e9c0, 0xc00297e9c0, 0x2)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/rpc/context.go:182 +0x96
google.golang.org/grpc.getChainStreamHandler.func1(0x47d4860, 0xc000e88c00, 0x562bfe0, 0xc000dd0000, 0x0, 0x0)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/vendor/google.golang.org/grpc/server.go:1302 +0xdd
github.com/cockroachdb/cockroach/pkg/rpc.kvAuth.streamInterceptor(0x47d4860, 0xc000e88c00, 0x562bfe0, 0xc000dd0000, 0xc0010378e0, 0xc00297e980, 0x455d1c0, 0xc00297e980)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/pkg/rpc/auth.go:86 +0xa8
google.golang.org/grpc.chainStreamServerInterceptors.func1(0x47d4860, 0xc000e88c00, 0x562bfe0, 0xc000dd0000, 0xc0010378e0, 0x4f13a98, 0x55e3a60, 0xc001ebf110)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/vendor/google.golang.org/grpc/server.go:1288 +0xbd
google.golang.org/grpc.(*Server).processStreamingRPC(0xc000b648c0, 0x564d7c0, 0xc001c88d80, 0xc001bb0b00, 0xc000ead050, 0x74d89c0, 0x0, 0x0, 0x0)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/vendor/google.golang.org/grpc/server.go:1434 +0x522
google.golang.org/grpc.(*Server).handleStream(0xc000b648c0, 0x564d7c0, 0xc001c88d80, 0xc001bb0b00, 0x0)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/vendor/google.golang.org/grpc/server.go:1507 +0xc9c
google.golang.org/grpc.(*Server).serveStreams.func1.2(0xc0037e5130, 0xc000b648c0, 0x564d7c0, 0xc001c88d80, 0xc001bb0b00)
	/home/andrei/src/github.com/cockroachdb/cockroach-2/vendor/google.golang.org/grpc/server.go:843 +0xa5
created by google.golang.org/grpc.(*Server).serveStreams.func1
	/home/andrei/src/github.com/cockroachdb/cockroach-2/vendor/google.golang.org/grpc/server.go:841 +0x1fd
@andreimatei andreimatei added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. GA-blocker branch-release-21.1 labels Apr 15, 2021
@irfansharif
Contributor

From the snippets above, I'm not sure I completely follow how we're deadlocking. We seem to be respecting the Store.mu < Replica.mu ordering we want in both threads. Agreed we can do something better with rejecting tasks/closing closers with Store.mu, but how do I convince myself it's actually a deadlock we're seeing here?

@irfansharif
Contributor

irfansharif commented Apr 19, 2021

Oh, doh, s.mu != Store.mu.

@irfansharif
Contributor

starting tasks wants an rlock on s.mu just to figure out whether the task should be refused

It wants to rlock s.mu while it adds the task, to ensure we don't do it after a Stop().

// NB: we run this under the read lock to ensure that `refuseRLocked()` cannot
// change until the task is registered. If we didn't do this, we'd run the
// risk of starting a task after a successful call to Stop().

I'm still not sure about this deadlock, how do I reproduce it? The locks seem fine to me, just looks like the Closers can take a while sometimes. I was unable to reproduce it with your branch at #63672.

$ make stress PKG=./pkg/kv/kvserver TESTS=TestRejectedLeaseDoesntDictateClosedTimestamp TESTTIMEOUT=1m
...
5761 runs so far, 0 failures, over 7m40s

@andreimatei
Contributor Author

I'm still not sure about this deadlock

Don't the linked stack traces show a deadlock between a closer trying to visit a replica and that replica trying to start a stopper task (a lease acquisition)? Was my explanation at the top clear? Which part doesn't hold up?

how do I reproduce it?

I was reproducing it pretty easily at some iteration of the test in #63672. I'm sorry I didn't keep a snapshot of that state; I thought the stacks I captured were clear enough, but maybe I was wrong.

@irfansharif
Contributor

Don't the linked stack traces show a deadlock between a closer trying to visit a replica and that replica trying to start a stopper task (a lease acquisition)?

I don't think it does, no. The stacks do show that the lease acquisition thread is blocked on the closer thread, but that's what we want to happen (things are being closed, and we don't want new lease acquisition tasks to get started). It's also not clear that these are the same replica (we can't tell: the memory address of the replica in the lease acquisition thread, 0xc00206f600, does not appear in the other stack trace). Closing the stopper, and visiting all the replicas when doing so, does not trigger lease acquisition requests. So I'm not sure about this being a deadlock.

The stacks simply show that the closer thread is waiting on some repl.mu to be unlocked, which could be held by some other goroutine (not captured in the stack trace snippet). Perhaps some funkiness with the earlier iteration of the test?

@andreimatei
Contributor Author

Closing the stopper, and visiting all the replicas when doing so, does not trigger lease acquisition requests. So I'm not sure about this being a deadlock.

Closing the stopper does not trigger lease acquisitions. But if a lease acquisition was triggered by something else just before the stopper is asked to stop -> that's the deadlock. The lease acquisition takes an exclusive lock on r.mu and then wants the stopper lock. The closer runs under stopper.mu and wants a read lock on the replica. The ordering of these two locks is reversed.

@irfansharif
Contributor

irfansharif commented Apr 19, 2021

Ack, missed that. Also forgot about TAGS=deadlock. Did we stop running CI with this?

$ make test PKG=./pkg/kv/kvserver TESTS=TestRejectedLeaseDoesntDictateClosedTimestamp TAGS=deadlock
...
POTENTIAL DEADLOCK: Inconsistent locking. saw this ordering in one goroutine:
happened before
replica_application_result.go:281 kvserver.(*Replica).handleLeaseResult { r.mu.Lock() } <<<<<
replica_application_state_machine.go:1156 kvserver.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult { sm.r.handleLeaseResult(ctx, newLease, rResult.PriorReadSummary) }
replica_application_state_machine.go:1076 kvserver.(*replicaStateMachine).ApplySideEffects { shouldAssert, isRemoved := sm.handleNonTrivialReplicatedEvalResult(ctx, cmd.replicatedResult()) }
apply/cmd.go:196 apply.mapCheckedCmdIter { applied, err := fn(iter.CurChecked()) }
apply/task.go:291 apply.(*Task).applyOneBatch { appliedIter, err := mapCheckedCmdIter(stagedIter, t.sm.ApplySideEffects) }
apply/task.go:247 apply.(*Task).ApplyCommittedEntries { if err := t.applyOneBatch(ctx, iter); err != nil { }
replica_raft.go:796 kvserver.(*Replica).handleRaftReadyRaftMuLocked { err := appTask.ApplyCommittedEntries(ctx) }
replica_raft.go:459 kvserver.(*Replica).handleRaftReady { return r.handleRaftReadyRaftMuLocked(ctx, inSnap) }
store_raft.go:523 kvserver.(*Store).processReady { stats, expl, err := r.handleRaftReady(ctx, noSnap) }
scheduler.go:284 kvserver.(*raftScheduler).worker { s.processor.processReady(ctx, id) }
../../util/stop/stopper.go:351 stop.(*Stopper).RunAsyncTask.func1 { f(ctx) }

happened after
../../util/stop/stopper.go:416 stop.(*Stopper).runPrelude { s.mu.RLock() } <<<<<
../../util/stop/stopper.go:339 stop.(*Stopper).RunAsyncTask { if !s.runPrelude() { }
replica_proposal.go:505 kvserver.(*Replica).leasePostApplyLocked { _ = r.store.stopper.RunAsyncTask(ctx, "lease-triggers", func(ctx context.Context) { }
replica_application_result.go:283 kvserver.(*Replica).handleLeaseResult { r.leasePostApplyLocked(ctx, }
replica_application_state_machine.go:1156 kvserver.(*replicaStateMachine).handleNonTrivialReplicatedEvalResult { sm.r.handleLeaseResult(ctx, newLease, rResult.PriorReadSummary) }
replica_application_state_machine.go:1076 kvserver.(*replicaStateMachine).ApplySideEffects { shouldAssert, isRemoved := sm.handleNonTrivialReplicatedEvalResult(ctx, cmd.replicatedResult()) }
apply/cmd.go:196 apply.mapCheckedCmdIter { applied, err := fn(iter.CurChecked()) }
apply/task.go:291 apply.(*Task).applyOneBatch { appliedIter, err := mapCheckedCmdIter(stagedIter, t.sm.ApplySideEffects) }
apply/task.go:247 apply.(*Task).ApplyCommittedEntries { if err := t.applyOneBatch(ctx, iter); err != nil { }
replica_raft.go:796 kvserver.(*Replica).handleRaftReadyRaftMuLocked { err := appTask.ApplyCommittedEntries(ctx) }
replica_raft.go:459 kvserver.(*Replica).handleRaftReady { return r.handleRaftReadyRaftMuLocked(ctx, inSnap) }
store_raft.go:523 kvserver.(*Store).processReady { stats, expl, err := r.handleRaftReady(ctx, noSnap) }
scheduler.go:284 kvserver.(*raftScheduler).worker { s.processor.processReady(ctx, id) }
../../util/stop/stopper.go:351 stop.(*Stopper).RunAsyncTask.func1 { f(ctx) }

in another goroutine: happened before
../../util/stop/stopper.go:475 stop.(*Stopper).Stop { s.mu.Lock() } <<<<<
../../testutils/testcluster/testcluster.go:165 testcluster.(*TestCluster).stopServerLocked { tc.mu.serverStoppers[idx].Stop(context.TODO()) }
../../testutils/testcluster/testcluster.go:116 testcluster.(*TestCluster).stopServers { tc.stopServerLocked(i) }
../../testutils/testcluster/testcluster.go:335 testcluster.(*TestCluster).Start.func2 { tc.stopper.AddCloser(stop.CloserFn(func() { tc.stopServers(context.TODO()) })) }
../../util/stop/stopper.go:110 stop.CloserFn.Close { f() }
../../util/stop/stopper.go:478 stop.(*Stopper).Stop { c.Close() }
replica_closedts_test.go:651 kvserver_test.TestRejectedLeaseDoesntDictateClosedTimestamp { } }

happened after
replica_init.go:251 kvserver.(*Replica).isInitializedRLocked { return r.mu.state.Desc.IsInitialized() } <<<<<
store.go:2070 kvserver.(*Store).processRaft.func2 { v.Visit(visitor) }
store_raft.go:624 kvserver.(*Store).processRaft.func2 { s.VisitReplicas(func(r *Replica) (more bool) { }
../../util/stop/stopper.go:110 stop.CloserFn.Close { f() }
../../util/stop/stopper.go:478 stop.(*Stopper).Stop { c.Close() }
../../testutils/testcluster/testcluster.go:165 testcluster.(*TestCluster).stopServerLocked { tc.mu.serverStoppers[idx].Stop(context.TODO()) }
../../testutils/testcluster/testcluster.go:116 testcluster.(*TestCluster).stopServers { tc.stopServerLocked(i) }
../../testutils/testcluster/testcluster.go:335 testcluster.(*TestCluster).Start.func2 { tc.stopper.AddCloser(stop.CloserFn(func() { tc.stopServers(context.TODO()) })) }
../../util/stop/stopper.go:110 stop.CloserFn.Close { f() }
../../util/stop/stopper.go:478 stop.(*Stopper).Stop { c.Close() }
replica_closedts_test.go:651 kvserver_test.TestRejectedLeaseDoesntDictateClosedTimestamp { } }

irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 19, 2021
Fixes cockroachdb#63761. As of cockroachdb#61279, it was possible for us to deadlock due to
inconsistent lock orderings between Stopper.mu and Replica.mu. We were
previously holding onto Stopper.mu while executing all closers,
including those that may acquire other locks. Because closers can be
defined anywhere (and may consequently grab any in-scope lock), we
should order Stopper.mu to come after all other locks in the system.

The closer added in cockroachdb#61279 iterated over all non-destroyed replicas,
locking Replica.mu to check for the replica's destroy status (in
Store.VisitReplicas). This deadlocked with the lease acquisition code
path that first grabs an exclusive lock over Replica.mu (see
InitOrJoinRequest), and uses the stopper to kick off an async task
acquiring the lease. The stopper internally locks Stopper.mu to check
whether or not it was already stopped.

Release note: None
craig bot pushed a commit that referenced this issue Apr 20, 2021
63867: stop: break deadlock between Stopper.mu and Replica.mu r=irfansharif a=irfansharif

Fixes #63761. As of #61279, it was possible for us to deadlock due to
inconsistent lock orderings between Stopper.mu and Replica.mu. We were
previously holding onto Stopper.mu while executing all closers,
including those that may acquire other locks. Because closers can be
defined anywhere (and may consequently grab any in-scope lock), we
should order Stopper.mu to come after all other locks in the system.

The closer added in #61279 iterated over all non-destroyed replicas,
locking Replica.mu to check for the replica's destroy status (in
Store.VisitReplicas). This deadlocked with the lease acquisition code
path that first grabs an exclusive lock over Replica.mu (see
InitOrJoinRequest), and uses the stopper to kick off an async task
acquiring the lease. The stopper internally locks Stopper.mu to check
whether or not it was already stopped.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
@craig craig bot closed this as completed in 79922e4 Apr 20, 2021