Recovery mode in a new force_restart_server call/3 #308
Conversation
Thanks for this. I've taken a look and my main concern is that this allows you to change the effective membership on a temporary basis only (unless you do a membership change, at which point the filtered cluster ends up persisting).

Say, in a cluster A, B, C, D, E, nodes D and E form a partitioned minority. You force-boot them with a filtered cluster of [D, E] while [A, B, C] keep making independent progress. Run for a while, then connectivity is restored — and then what happens? You now have two very different histories that cannot be merged (you cannot simply restart D and E without the filter), so you are going to have to manually pick one history, delete any nodes with a different one, and remove and re-add them.

Now, the same applies to my PR #306, but in that case the membership change is made permanent in a manually changed member that you then use to seed the rest of the cluster from. This could also be used to restore availability on a minority side. The point is, I don't see any benefit in having the membership temporary, as there is no way to remove the filter safely once it is in place.
If the goal is to list nodes that the cluster won't contact on boot, we should rename the setting to something like "nodes expected to be unavailable".
@kjnilsson and I had a different approach in mind but never finished it. I wouldn't say it was drastically different so we can see if this PR can be polished to a mergeable state.
```diff
@@ -84,7 +84,8 @@
     query_index := non_neg_integer(),
     queries_waiting_heartbeats := queue:queue({non_neg_integer(), consistent_query_ref()}),
     pending_consistent_queries := [consistent_query_ref()],
-    commit_latency => 'maybe'(non_neg_integer())
+    commit_latency => 'maybe'(non_neg_integer()),
+    filter_nodes => 'maybe'([node()])
```
It's not clear from the name which nodes are being filtered, whether these nodes act as filters, or what kind of filtering will be performed.
```erlang
            Cluster0;
        _ ->
            maps:filter(fun ({Name, Node}, _) ->
                                Res = lists:member(Node, Filter),
                                Res orelse ?INFO("~p is filtered on node: ~s",
                                                 [Name, Node]),
```
I find this log message to be confusing. "[machine name] is filtered on node […]" does not really tell the user about the consequences.
I am reluctant to merge this without further discussing the issues I outlined in my comment above.
Thank you for reviewing these changes. We will definitely change any kind of naming to a better one as needed. :)

Our primary goal is to recover service with as many messages recovered as possible, after it has been made sure that the lost nodes cannot be recovered. That is, we have a side channel to verify that those nodes are forever lost. After the cluster is recovered from a minority, we would also like to grow the cluster back to its original size (with or without the original node names). It would be good to be able to restore from any number of minority nodes; in a 5-node cluster this may mean 1 or 2 nodes.

The `filter_nodes` parameter in this PR expects the list of nodes which are considered still part of the cluster. The thinking here is that from RabbitMQ we can provide the list of nodes we still consider cluster members.

I checked #306 out as well, and I think I understood what you are trying to do there. I think it could also be changed to accept a list of valid nodes, therefore electing the most up-to-date follower, though as I understand it, for that we would need to keep the commit index and similar data instead of resetting it. The reason we did not really want to do this is that we would have to coordinate the changes with all "up" nodes somehow, and in this PR we rely on Mnesia for that. I am guessing we would need a multi-step procedure of setting the new cluster members on all peers and then forcing an election.

What we were also going for is easy usage from RabbitMQ. As you can see, you can just `forget_cluster_node` each down node, then restart the nodes with this change, recovering the queues. This should (in theory) protect against "lost" nodes rejoining the cluster, though in our testing we ran into some issues with that, and nodes rejoined the Mnesia cluster automatically.

You are correct that this change would "remember" these lost nodes forever. We could not come up with a good solution to that. Maybe we could have a command to force a cluster member change into the Ra log? We would like to ask your opinion and would be happy to implement it. We'd prefer that the lost node be forgotten: in our testing, if a node which was "lost" came back, some definitely weird behaviour of re-electing leaders was seen. I think the cluster will need to reject these peers in some way.

A drawback of our changes is that they require a restart, but we think that is acceptable in this scenario.
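The recovery workflow described above can be sketched roughly as follows. This is a hypothetical transcript, not tested output: the node names are illustrative, and the behaviour after restart depends on this PR being applied, which is not released functionality.

```shell
# On a surviving node (say rabbit@nodeD), after confirming via a side
# channel that the lost nodes (A, B, C) can never return:

# 1. Forget each permanently lost cluster member.
rabbitmqctl forget_cluster_node rabbit@nodeA
rabbitmqctl forget_cluster_node rabbit@nodeB
rabbitmqctl forget_cluster_node rabbit@nodeC

# 2. Restart the surviving nodes; with this PR applied, the Ra members
#    would be force-restarted with the remaining Mnesia cluster members
#    acting as the filter_nodes inclusion list.
rabbitmqctl stop_app
rabbitmqctl start_app
```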
Hi @kjnilsson, do you have any suggestions on how to proceed with this? Maybe we could lift your strategy of setting the cluster nodes with a call, but instead of resetting the current state, we could keep it? That would make it possible to elect the most up-to-date member.
Closing, as we already have a recovery mode that works OK.
Proposed Changes
The Raft protocol itself is built around the notion of a leader, which handles cluster change requests and is elected by the majority that forms the cluster.

There is a use case where the majority of nodes is gone (during a network split or outage) and we still need the data back, especially in the scenario where we have inter-datacenter communication and our requirement is to keep the minority group running after the incident.
This PR introduces the following calls:

```erlang
force_restart_server(System, ServerId, FilterNodes)
force_restart_server(System, ServerId, FilterNodes, AddConfig)
```

These indicate in the client codebase that the cluster is forcefully reduced to the setup covered by the `FilterNodes` inclusion list. We tried to make a minimal adjustment so as not to break the general approach behind the library code and the Ra protocol itself. The state machine is left deterministic, and the whole improvement is built behind the `filter_nodes` setting, which is passed through the mutable config during the restart. The overhead is minimal, since `validate_cluster/1` is only called three times: during recovery, when receiving a snapshot, or when processing a cluster change command.

This PR also includes an `enq_drain_recovery` test SUITE, in addition to the default `enq_drain_basic`, which plays the split-and-recovery scenario through nemesis. The respective documentation is added (plus small fixes), as well as one more sanity test.
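A minimal usage sketch of the proposed calls follows. The node, system, and server names are hypothetical, and the exact module placement of `force_restart_server/3` is an assumption; this illustrates the intent under this PR, not a released API.

```erlang
%% Cluster {A, B, C, D, E}; A, B and C are permanently lost.
%% On each surviving node, restart the local member with an inclusion
%% list so the effective cluster is reduced to [D, E]:
Filter = ['ra@nodeD', 'ra@nodeE'],
ok = ra:force_restart_server(default, {my_machine, 'ra@nodeD'}, Filter),
ok = ra:force_restart_server(default, {my_machine, 'ra@nodeE'}, Filter).
```

Note that, as discussed in the review thread above, the reduction is temporary: the filter only persists once a subsequent membership change is made.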
Types of Changes

What types of changes does your code introduce to this project? Put an `x` in the boxes that apply.

Checklist

Put an `x` in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask on the mailing list. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

- [ ] I have read the `CONTRIBUTING.md` document

Further Comments
P.S. We have seen the quite fresh and interesting improvement from @kjnilsson, but we still need a multi-DC setup running, which would allow us to handle quorum queues in a multi-node setup as well.

We also took this issue into account.