Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft.LeaderCh() always deliver latest transition #384

Merged
merged 1 commit into from
Feb 11, 2020

Conversation

notnoop
Copy link
Contributor

@notnoop notnoop commented Jan 23, 2020

This makes leaderCh always deliver the latest leadership transition
rather than dropping messages somewhat unexpectedly

Managing leadership transition is quite fragile and easy to get wrong.
Prior to this, LeaderCh() is unreliable and is prone to misuse.
For example, both Consul[1] and Nomad[2] started with LeaderCh and
ended up switching to manage their own channels after hitting issues.
But managing reliable leadership channels is tricky; both Consul[3] and
Nomad[4] had deadlock bugs related to leadership flagging!

Here, we aim to improve the library ergonomics for future library
applications as well as Nomad, by having stricter guarantees of
raft.LeaderCh. Namely, receivers are guaranteed to receive a
notification if leadership transition event occured, and slow receivers
will be guaranteed to receive the latest view since their last receive*.

This seems like a reasonable interface for practical purposes. If a
leadership flapped many times but but ended as a non-leader ultimately,
the receiver should just recored the final step down without attempting
to handle the intermediary states events.

One edge case is if a leader lost but then regained leadership quickly. The
application may need to issue a Barrier call and reconcile its state.
Thus, the library should still send notification events even if final
leadership value matches last delivered value.

  • note that due to concurrency, last received value might still be
    stale, if raft just recognized transition event but didn't send yet.
    Receivers on a loop will get the event in subsequent receive.

[1] hashicorp/consul#2911
[2] hashicorp/nomad@c2b7ce9
[3] hashicorp/consul#6852
[4] hashicorp/nomad#4749

@notnoop notnoop self-assigned this Jan 23, 2020
This makes leaderCh always deliver the latest leadership transition
rather than dropping messages somewhat unexpectedly

Managing leadership transition is quite fragile and easy to get wrong.
Prior to this, `LeaderCh()` is unreliable and is prone to misuse.
For example, both Consul[1] and Nomad[2] started with LeaderCh and
ended up switching to manage their own channels after hitting issues.
But managing reliable leadership channels is tricky; both Consul[3] and
Nomad[4] had deadlock bugs related to leadership flagging!

Here, we aim to improve the library ergonomics for future library
applications as well as Nomad, by having stricter guarantees of
`raft.LeaderCh`.  Namely, receivers are guaranteed to receive a
notification if leadership transition event occured, and slow receivers
will be guaranteed to receive the latest view since their last receive*.

This seems like a reasonable interface for practical purposes.  If a
leadership flapped many times but but ended as a non-leader ultimately,
the receiver should just recored the final step down without attempting
to handle the intermediary states events.

One edge case is if a leader lost but then regained leadership quickly. The
application may need to issue a Barrier call and reconcile its state.
Thus, the library should still send notification events even if final
leadership value matches last delivered value.

* note that due to concurrency, last received value might still be
stale, if raft just recognized transition event but didn't send yet.
Receivers on a loop will get the event in subsequent receive.
@notnoop notnoop force-pushed the f-stricter-leaderch-semantics branch from 9699196 to 9e800fe Compare January 23, 2020 16:07
Copy link
Member

@hanshasselberg hanshasselberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started going through the code to understand your PR better and one thing that confused me is the difference between leaderCh and NotifyCh. I found a comment in consul code https://github.com/hashicorp/consul/blob/b3cf47c861f9a957e66011e7d466304f9bd93434/agent/consul/leader.go#L50-L53:

        // We use the notify channel we configured Raft with, NOT Raft's
        // leaderCh, which is only notified best-effort. Doing this ensures
        // that we get all notifications in order, which is required for
        // cleanup and to ensure we never run multiple leader loops.

So far I think your changes mitigate this problem because this channel no longer has the old value but always the latest one. Which means it would be preferable to use that instead of NotifyCh. Will keep thinking about it.

// transition has occured.
//
// If receivers aren't ready for the signal, signals may drop and only the
// latest leadership transition. For example, if a receiver receives subsequent
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// latest leadership transition. For example, if a receiver receives subsequent
// latest leadership transition will be received. For example, if a receiver receives consecutive

Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 Interesting one Mahmood.

I can see how this is more reliable than before but I still wonder if it's prone to misuse/errors?

The barrier example you gave is one case where this has a pretty big gotcha - if we see consecutive true when already in leader state the app probably needs to still rescind leadership and take it up again in order to get back into a good state where we need to not only be sure FSM is in sync but also any other state the leader has setup based on the FSM and believes no one else can make writes without it knowing might need to be recreated from the FSM. In Consul for example, thinks like the Connect CA state are setup during establishLeadership and only modified on updates to CA config. It wouldn't be safe to just keep running even if we issued a new barrier on the FSM, we'd explicitly need to exit the leader loop and re-enter it again to ensure everything was in correct state.

Maybe this change makes that possible at least but I think you'd need to explicitly treat every message delivered as "you need to change state now" even if that's transitioning to follower and back to leader again to ensure everything is correct. That's hinted at in the comment you added but I think if we keep this we should spell it out more concretely.

On the other hand, if Consul and Nomad have already given up on trying to use this method, perhaps we should instead just document it as Deprecated and impossible to use correctly and describe the better alternative instead?

@notnoop
Copy link
Contributor Author

notnoop commented Jan 24, 2020

@banks Thanks for the review - yeah, I can spill the need to handle transition more specifically. In Nomad land, I implemented this approach, just outside the library in https://github.com/hashicorp/nomad/pull/6977/files (would love your review there too :P). I was hoping that we'd simply nomad's logic by incorporating the leadership channel behavior here and rely on this instead. Based on our experience, requiring clients to configure their reliable channels properly is tricky.

Another alternative that may relieve consumers more and be less prone to misuse is to common and "best practices" bits of Consul or Nomad's monitorLoop and leaderLoop into the library and expose a LeadershipUpdateDelegate for application specific logic. If this is something worth checking, I can implement it.

I'm motivated to address as much of this in raft library, because managing leadership transitions is very subtle now - and exposing a better api will hopefully avoid having other applications (Vault?!) rediscover Consul/Nomad bugs and easies fixing them globally.

@banks
Copy link
Member

banks commented Jan 24, 2020 via email

@notnoop
Copy link
Contributor Author

notnoop commented Jan 25, 2020

Perhaps encapsulating the way Consul and Nomad do this via notifyCh in the
lib like you suggest is a better improvement?

Maybe - Defining the right interface isn't obvious. Nomad and Consul monitor leadership loop have other concerns (reconciling Serf, tracking members, and enabling ACL progress). I'm unclear what the right abstraction is here.

Let's not make the perfect be the enemy of the better. This change makes leaderCh useful (previous version was unusable), simplify Nomad and Consul implementation, and doesn't preclude us from adding a delegate in future. This will be an improvement for Nomad; Consul would benefit too: its current approach of arbitrary increasing buffer limit is susceptible to two issues: first, it relies on a manually-set channel buffering value for safety without grantees of how fast channel is dequeued; also in extreme flapping, it does redundant unnecessary transitions (e.g. if a server gains and relinquished leadership few times and ultimately became a non-voter, it's safe to step down without ever calling establishLeadership).

Alternatively, if we do want to punt, let's simply remove the current leaderCh - it's a ticking bomb for bugs - and properly document how applications can monitor leadership properly.

@hanshasselberg
Copy link
Member

I was wondering if this change would be technically speaking backwards incompatible. leaderCh was dropping messages before and still does, just other messages. From the outside the message dropping was unpredictable and now we are conscious about it. From an API perspective I don't see why this change would be a problem.

This change makes leaderCh useful (previous version was unusable)

agreed.

also in extreme flapping, it does redundant unnecessary transitions

I was thinking that if we would drain notifyCh before doing anything, we could get the latest state as well and act on it. Avoiding unnecessary transitions.

Copy link
Contributor

@schristoff schristoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed in Raft Force - approved with the intention to look into alternatives in the future.

@notnoop notnoop merged commit 365023d into master Feb 11, 2020
@notnoop notnoop deleted the f-stricter-leaderch-semantics branch February 11, 2020 19:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants