If a manager gets stuck with the store lock held (for example, as in #1651), this does not affect heartbeats, and there won't be a new leader election. It's a very bad failure mode because you end up with a stuck leader that doesn't get replaced. We should consider ways to reduce the impact of this situation.
Perhaps the leader should stop sending heartbeats if the store lock has been held for too long?
Or perhaps other managers should periodically poll a noop RPC endpoint on the leader that takes the store lock, and start a leader election if it times out?
cc @aluzzardi @LK4D4