Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: Give up leadership if the store is wedged #2057

Merged
merged 1 commit into from
Mar 29, 2017

Conversation

aaronlehmann
Copy link
Collaborator

There have been a few bugs in the past that involved deadlocks where the
store lock was held indefinitely. These bugs are particularly bad
because the manager continues to send heartbeats as normal, so there's
no opportunity for a leader election to replace the stuck leader.

Add a method to MemoryStore that returns true if the lock has been held
for more than 30 seconds. Check this method in the raft implementation,
and use TransferLeadership when the store gets stuck.

cc @cyli

Fixes #1658

@codecov
Copy link

codecov bot commented Mar 24, 2017

Codecov Report

Merging #2057 into master will decrease coverage by 0.08%.
The diff coverage is 48.27%.

@@            Coverage Diff             @@
##           master    #2057      +/-   ##
==========================================
- Coverage   54.33%   54.24%   -0.09%     
==========================================
  Files         111      111              
  Lines       19323    19352      +29     
==========================================
- Hits        10499    10498       -1     
- Misses       7559     7590      +31     
+ Partials     1265     1264       -1

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 970b45a...0141555. Read the comment docs.

}

func (m *timedMutex) Lock() {
m.lockedAt.Store(time.Now())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be defer m.lockedAt.Store(time.Now())? Otherwise, if the lock is wedged (e.g. m.Mutex.Lock()) will block, then so long as something else attempts to call timedMutex.Lock(), lockedAt value will be reset and we won't necessarily be able to tell if something's wedged?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. I had imagined that these states would involve everything that tries to get the lock getting stuck, but I guess there could be cases where new goroutines get created continuously and try to acquire the lock.

ClockSource: clockSource,
TLSCredentials: securityConfig.ClientTLSCreds,
KeyRotator: keyRotator,
DisableStackDump: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to disable all stack dumps for tests? It might be useful to have a stack dump for tests that aren't meant to wedge, and just disable it for the single new test that is meant to always wedge (TestRaftWedgedManager)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this situation has come up before with the raft unit tests - there is just not much concurrent store usage involved. We'll still get the stack traces within integration tests, where I think an issue would be far more likely.

But yeah, I can make this specific to TestRaftWedgedManager.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, makes sense. In that case this is not as important then - up to you whether you want to make the change or not.

There have been a few bugs in the past that involved deadlocks where the
store lock was held indefinitely. These bugs are particularly bad
because the manager continues to send heartbeats as normal, so there's
no opportunity for a leader election to replace the stuck leader.

Add a method to MemoryStore that returns true if the lock has been held
for more than 30 second. Check this method in the raft implementation,
and use TransferLeadership when the store gets stuck.

Signed-off-by: Aaron Lehmann <[email protected]>
@aaronlehmann
Copy link
Collaborator Author

Addressed comments. Also changed the timeout used by tests to 3 seconds, to avoid possible false positives.

Copy link
Contributor

@cyli cyli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Out of curiosity, I had added another assertion to the new test to see if the next round of leader election can still happen while the one node is still wedged (I just restarted the new leader and waited for the cluster to be ready again), and it seemed to work, but I wasn't sure about the reason.

Is this because responding to another node's votes does not require a lock on the memory store?

@aaronlehmann
Copy link
Collaborator Author

Yes, that's correct.

@dongluochen
Copy link
Contributor

LGTM

@aaronlehmann
Copy link
Collaborator Author

Thanks for the reviews.

@aaronlehmann aaronlehmann merged commit b4c4309 into moby:master Mar 29, 2017
@aaronlehmann aaronlehmann deleted the wedged-managers branch March 29, 2017 00:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants