raft: Give up leadership if the store is wedged #2057
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2057 +/- ##
==========================================
- Coverage 54.33% 54.24% -0.09%
==========================================
Files 111 111
Lines 19323 19352 +29
==========================================
- Hits 10499 10498 -1
- Misses 7559 7590 +31
+ Partials 1265 1264 -1
Continue to review full report at Codecov.
manager/state/store/memory.go (outdated)

}

func (m *timedMutex) Lock() {
	m.lockedAt.Store(time.Now())
Should this be defer m.lockedAt.Store(time.Now())? Otherwise, if the lock is wedged (e.g. m.Mutex.Lock() blocks), then as long as something else keeps calling timedMutex.Lock(), the lockedAt value will be reset and we won't necessarily be able to tell that something is wedged.
Fair enough. I had imagined that these states would involve everything that tries to get the lock getting stuck, but I guess there could be cases where new goroutines get created continuously and try to acquire the lock.
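To make the resolution concrete, here is a minimal sketch of a timedMutex along these lines. It is an assumption-laden illustration rather than the exact swarmkit code: the atomic.Value storage, the Unlock behavior, and the LockedAt accessor are inferred from the discussion above.

package store

import (
	"sync"
	"sync/atomic"
	"time"
)

// timedMutex records when it was last acquired so that a watchdog can
// detect a lock that has been held for a suspiciously long time.
type timedMutex struct {
	sync.Mutex
	lockedAt atomic.Value // always holds a time.Time
}

func (m *timedMutex) Lock() {
	m.Mutex.Lock()
	// Record the timestamp only after the lock is actually acquired.
	// Storing it before would let callers blocked on a wedged lock keep
	// resetting it, which is exactly the concern raised above.
	m.lockedAt.Store(time.Now())
}

func (m *timedMutex) Unlock() {
	// Clear the timestamp so an unlocked mutex never appears wedged.
	m.lockedAt.Store(time.Time{})
	m.Mutex.Unlock()
}

// LockedAt returns the time the mutex was last acquired, or the zero
// time.Time if it is not currently held.
func (m *timedMutex) LockedAt() time.Time {
	lockedTimestamp := m.lockedAt.Load()
	if lockedTimestamp == nil {
		return time.Time{}
	}
	return lockedTimestamp.(time.Time)
}

Using defer m.lockedAt.Store(time.Now()) before m.Mutex.Lock(), as suggested above, has the same effect: the timestamp is written only once the lock is actually held.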
ClockSource: clockSource,
TLSCredentials: securityConfig.ClientTLSCreds,
KeyRotator: keyRotator,
DisableStackDump: true,
Do we want to disable all stack dumps for tests? It might be useful to have a stack dump for tests that aren't meant to wedge, and just disable it for the single new test that is meant to always wedge (TestRaftWedgedManager).
I don't think this situation has come up before with the raft unit tests - there is just not much concurrent store usage involved. We'll still get the stack traces within integration tests, where I think an issue would be far more likely.
But yeah, I can make this specific to TestRaftWedgedManager.
Ah ok, makes sense. In that case this is not as important - up to you whether you want to make the change or not.
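As a hypothetical illustration of the compromise reached here (the struct and helper below are not the actual swarmkit test utilities), the flag can be threaded through per test so that only the deliberately wedging test opts out of stack dumps:

// nodeTestOptions mirrors only the field relevant to this discussion; the
// real options struct also carries ClockSource, TLSCredentials, KeyRotator, etc.
type nodeTestOptions struct {
	DisableStackDump bool
}

// newNodeTestOptions keeps stack dumps enabled by default. Only a test that
// is expected to wedge the store (TestRaftWedgedManager in this PR) passes
// disableStackDump=true, so an unexpected deadlock in any other test still
// produces a stack trace for debugging.
func newNodeTestOptions(disableStackDump bool) nodeTestOptions {
	return nodeTestOptions{DisableStackDump: disableStackDump}
}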
There have been a few bugs in the past that involved deadlocks where the store lock was held indefinitely. These bugs are particularly bad because the manager continues to send heartbeats as normal, so there's no opportunity for a leader election to replace the stuck leader. Add a method to MemoryStore that returns true if the lock has been held for more than 30 seconds. Check this method in the raft implementation, and use TransferLeadership when the store gets stuck.
Signed-off-by: Aaron Lehmann <[email protected]>
Force-pushed from 0b8bc63 to 0141555.
Addressed comments. Also changed the timeout used by tests to 3 seconds, to avoid possible false positives.
LGTM
Out of curiosity, I had added another assertion to the new test to see if the next round of leader election can still happen while the one node is still wedged (I just restarted the new leader and waited for the cluster to be ready again), and it seemed to work, but I wasn't sure about the reason.
Is this because responding to another node's votes does not require a lock on the memory store?
Yes, that's correct.
LGTM
Thanks for the reviews.
There have been a few bugs in the past that involved deadlocks where the
store lock was held indefinitely. These bugs are particularly bad
because the manager continues to send heartbeats as normal, so there's
no opportunity for a leader election to replace the stuck leader.
Add a method to MemoryStore that returns true if the lock has been held for more than 30 seconds. Check this method in the raft implementation, and use TransferLeadership when the store gets stuck.

cc @cyli
Fixes #1658
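A rough sketch of the flow the description lays out. This is hedged: Wedged, wedgeTimeout, updateLock, isLeader, and transferLeadership are illustrative names and may not match the actual swarmkit identifiers.

// Wedged reports whether the store's write lock has been held longer than
// wedgeTimeout (30 seconds per the description above; the tests in this PR
// use a 3-second timeout instead).
func (s *MemoryStore) Wedged() bool {
	lockedAt := s.updateLock.LockedAt()
	if lockedAt.IsZero() {
		return false // lock is not currently held
	}
	return time.Since(lockedAt) > wedgeTimeout
}

// Called periodically from the raft node's run loop. A wedged leader keeps
// sending heartbeats, so no election would ever replace it; explicitly
// transferring leadership lets a healthy member take over.
func (n *Node) checkStoreWedged(ctx context.Context) {
	if n.isLeader() && n.memoryStore.Wedged() {
		if err := n.transferLeadership(ctx); err != nil {
			log.G(ctx).WithError(err).Error("store is wedged, but transferring leadership failed")
		}
	}
}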