raft: Give up leadership if the store is wedged #2057

aaronlehmann · 2017-03-24T22:46:50Z

There have been a few bugs in the past that involved deadlocks where the
store lock was held indefinitely. These bugs are particularly bad
because the manager continues to send heartbeats as normal, so there's
no opportunity for a leader election to replace the stuck leader.

Add a method to MemoryStore that returns true if the lock has been held
for more than 30 seconds. Check this method in the raft implementation,
and use TransferLeadership when the store gets stuck.

cc @cyli

Fixes #1658

codecov · 2017-03-24T22:57:23Z

Codecov Report

Merging #2057 into master will decrease coverage by 0.08%.
The diff coverage is 48.27%.

@@            Coverage Diff             @@
##           master    #2057      +/-   ##
==========================================
- Coverage   54.33%   54.24%   -0.09%     
==========================================
  Files         111      111              
  Lines       19323    19352      +29     
==========================================
- Hits        10499    10498       -1     
- Misses       7559     7590      +31     
+ Partials     1265     1264       -1


cyli · 2017-03-27T18:53:49Z

manager/state/store/memory.go

+}
+
+func (m *timedMutex) Lock() {
+	m.lockedAt.Store(time.Now())


Should this be defer m.lockedAt.Store(time.Now())? Otherwise, if the lock is wedged (i.e. m.Mutex.Lock() will block), then whenever something else calls timedMutex.Lock(), the lockedAt value will be reset, and we won't necessarily be able to tell that something is wedged.

Fair enough. I had imagined that these states would involve everything that tries to get the lock getting stuck, but I guess there could be cases where new goroutines get created continuously and try to acquire the lock.
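
The ordering concern in this thread can be reproduced with a small sketch. It is illustrative only: `stampedMutex`, its `stampFirst` switch, and `stampMovedByWaiter` are hypothetical names, and the code does not claim to match the final diff. One Go detail is worth keeping in mind: the arguments of a deferred call are evaluated when the `defer` statement executes, so `defer m.lockedAt.Store(time.Now())` would store a timestamp taken before the lock was acquired. The sketch therefore stamps explicitly after acquiring.

```go
package main

import (
	"sync"
	"sync/atomic"
	"time"
)

// stampedMutex records a timestamp in Lock. With stampFirst set it uses the
// ordering from the reviewed hunk (stamp, then acquire); otherwise it
// acquires first and stamps afterwards.
type stampedMutex struct {
	mu         sync.Mutex
	lockedAt   atomic.Value // holds a time.Time
	stampFirst bool
}

func (m *stampedMutex) Lock() {
	if m.stampFirst {
		m.lockedAt.Store(time.Now()) // runs even if the next line blocks forever
		m.mu.Lock()
		return
	}
	m.mu.Lock()
	m.lockedAt.Store(time.Now()) // runs only once the lock is actually held
}

func (m *stampedMutex) Unlock() { m.mu.Unlock() }

// LockedAt returns the last recorded timestamp, or the zero time if none.
func (m *stampedMutex) LockedAt() time.Time {
	t, _ := m.lockedAt.Load().(time.Time)
	return t
}

// stampMovedByWaiter holds the lock, lets a second goroutine block in Lock,
// and reports whether that blocked waiter changed the recorded timestamp.
func stampMovedByWaiter(stampFirst bool) bool {
	m := &stampedMutex{stampFirst: stampFirst}
	m.Lock() // plays the role of the wedged holder
	before := m.LockedAt()
	time.Sleep(10 * time.Millisecond) // make later timestamps distinguishable

	started := make(chan struct{})
	done := make(chan struct{})
	go func() {
		close(started)
		m.Lock() // blocks until the holder unlocks
		m.Unlock()
		close(done)
	}()
	<-started
	time.Sleep(50 * time.Millisecond) // give the waiter time to reach Lock

	moved := !m.LockedAt().Equal(before)
	m.Unlock()
	<-done
	return moved
}
```

With `stampFirst` set, the blocked waiter moves the timestamp forward, so a check based on lock age would keep seeing a recently taken lock. With the acquire-then-stamp ordering, the timestamp stays at the moment the stuck holder took the lock.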

cyli · 2017-03-27T19:01:59Z

manager/state/raft/testutils/testutils.go

+		ClockSource:      clockSource,
+		TLSCredentials:   securityConfig.ClientTLSCreds,
+		KeyRotator:       keyRotator,
+		DisableStackDump: true,


Do we want to disable all stack dumps for tests? It might be useful to keep stack dumps for tests that aren't meant to wedge, and disable them only for the single new test that is meant to always wedge (TestRaftWedgedManager).

I don't think this situation has come up before with the raft unit tests - there is just not much concurrent store usage involved. We'll still get the stack traces within integration tests, where I think an issue would be far more likely.

But yeah, I can make this specific to TestRaftWedgedManager.

Ah ok, makes sense. In that case this is not as important then - up to you whether you want to make the change or not.
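
For readers unfamiliar with the option under discussion, here is a rough sketch of how a flag like `DisableStackDump` might gate a goroutine dump, assuming the dump is produced when the wedge is detected. The `wedgeOptions` struct and the `stackDumpOnWedge` function are hypothetical; only the field name `DisableStackDump` appears in the diff above. `runtime.Stack` is the standard library call for capturing stack traces.

```go
package main

import "runtime"

// wedgeOptions is a hypothetical options struct; only the DisableStackDump
// field name is taken from the diff in this thread.
type wedgeOptions struct {
	DisableStackDump bool
}

// stackDumpOnWedge returns a dump of all goroutine stacks, or nil when dumps
// are disabled (for example in a test that wedges the store on purpose).
func stackDumpOnWedge(opts wedgeOptions) []byte {
	if opts.DisableStackDump {
		return nil
	}
	buf := make([]byte, 1<<20)    // 1 MiB; longer dumps are truncated
	n := runtime.Stack(buf, true) // true = include every goroutine
	return buf[:n]
}
```

Setting the flag only in the test that wedges deliberately keeps the dump available everywhere else, which is the trade-off discussed above.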

There have been a few bugs in the past that involved deadlocks where the store lock was held indefinitely. These bugs are particularly bad because the manager continues to send heartbeats as normal, so there's no opportunity for a leader election to replace the stuck leader.

Add a method to MemoryStore that returns true if the lock has been held for more than 30 seconds. Check this method in the raft implementation, and use TransferLeadership when the store gets stuck.

Signed-off-by: Aaron Lehmann <[email protected]>

aaronlehmann · 2017-03-27T20:59:30Z

Addressed comments. Also changed the timeout used by tests to 3 seconds, to avoid possible false positives.

cyli

LGTM

Out of curiosity, I had added another assertion to the new test to see if the next round of leader election can still happen while the one node is still wedged (I just restarted the new leader and waited for the cluster to be ready again), and it seemed to work, but I wasn't sure about the reason.

Is this because responding to another node's votes does not require a lock on the memory store?

aaronlehmann · 2017-03-27T22:17:44Z

Yes, that's correct.

dongluochen · 2017-03-29T00:14:11Z

LGTM

aaronlehmann · 2017-03-29T00:15:31Z

Thanks for the reviews.

cyli reviewed Mar 27, 2017

aaronlehmann force-pushed the wedged-managers branch from 0b8bc63 to 0141555 on March 27, 2017 20:59

cyli approved these changes Mar 27, 2017

aaronlehmann merged commit b4c4309 into moby:master Mar 29, 2017

aaronlehmann deleted the wedged-managers branch March 29, 2017 00:15
