Recovery mode in a new force_restart_server call/3 #308
Conversation
Thanks for this. I've taken a look and my main concern is that this allows you to change the effective membership on a temporary basis only (unless you do a membership change, at which point the filtered cluster ends up persisting).

Say, in a cluster A, B, C, D, E, nodes D and E form a partitioned minority. You force-boot them with a filtered cluster of [D, E] while [A, B, C] keep making independent progress. Run for a while, then connectivity is restored — and then what happens? You now have two very different histories that cannot be merged (you cannot simply restart D and E without the filter), so you are going to have to manually pick one history, delete any nodes with a different one, and remove and re-add them.

Now, the same applies to my PR #306, but in that case the membership change is made permanent in a manually changed member that you then use to seed the rest of the cluster from. This could also be used to restore availability on a minority side. The point is, I don't see any benefit in having the membership temporary, as there is no way to remove the filter safely once it is in place.
If the goal is to list nodes that the cluster won't contact on boot, we should rename the setting to something like "nodes expected to be unavailable".
@kjnilsson and I had a different approach in mind but never finished it. I wouldn't say it was drastically different so we can see if this PR can be polished to a mergeable state.
```diff
@@ -84,7 +84,8 @@
     query_index := non_neg_integer(),
     queries_waiting_heartbeats := queue:queue({non_neg_integer(), consistent_query_ref()}),
     pending_consistent_queries := [consistent_query_ref()],
-    commit_latency => 'maybe'(non_neg_integer())
+    commit_latency => 'maybe'(non_neg_integer()),
+    filter_nodes => 'maybe'([node()])
```
It's not clear from the name which nodes are being filtered, whether these nodes act as filters, or what kind of filtering will be performed.
```erlang
            Cluster0;
        _ ->
            maps:filter(fun ({Name, Node}, _) ->
                                Res = lists:member(Node, Filter),
                                Res orelse ?INFO("~p is filtered on node: ~s",
                                                 [Name, Node]),
```
I find this log message to be confusing. "[machine name] is filtered on node […]" does not really tell the user about the consequences.
I am reluctant to merge this without further discussing the issues I outlined in my comment above.
Thank you for reviewing these changes. We will definitely change any kind of naming to a better one as needed. :)

Our primary goal is to recover service with as many messages recovered as possible, after it has been made sure that the lost nodes cannot be recovered. That is, we have a side channel to verify that those nodes are forever lost. After the cluster is recovered from a minority, we would also like to grow the cluster back to its original size (with or without the original node names). It would be good to be able to restore from any number of minority nodes; in a 5-node cluster this may mean 1 or 2 nodes.

The `filter_nodes` parameter in this PR expects the list of nodes which are considered still part of the cluster. The thinking here is that from RabbitMQ we can provide the list of nodes we still consider cluster members.

I checked #306 out as well, and I think I understood what you are trying to do there. I think it could also be changed to accept a list of valid nodes, therefore electing the most up-to-date follower, though as I understand it, for that we would need to keep the commit index and similar data instead of resetting it. The reason we did not really want to do this is that we would have to coordinate the changes with all "up" nodes somehow, and in this PR we rely on Mnesia for that. I am guessing we would need a multi-step procedure of setting the new cluster members on all peers and then forcing an election.

What we were also going for is easy usage from RabbitMQ. As you can see, you can just `forget_cluster_node` each down node, then restart the nodes with this change, recovering the queues. This should (in theory) protect against "lost" nodes rejoining the cluster, though in our testing we ran into some issues with that, and nodes rejoined the Mnesia cluster automatically.

You are correct that this change would "remember" these lost nodes forever. We could not come up with a good solution to that. Maybe we could have a command to force a cluster member change into the Ra log? We would like to ask your opinion and would be happy to implement it. We'd prefer that the lost node be forgotten: in our testing, if a node which was "lost" came back, some definitely weird behaviour of re-electing leaders was seen. I think the cluster will need to reject these peers in some way.

A drawback of our changes is that they require a restart, but we think that is acceptable in this scenario.
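The recovery workflow described above can be sketched roughly as follows. This is a hypothetical transcript, not tested output: the node names are illustrative, and the behaviour after restart depends on this PR being applied, which is not released functionality.

```shell
# On a surviving node (say rabbit@nodeD), after confirming via a side
# channel that the lost nodes (A, B, C) can never return:

# 1. Forget each permanently lost cluster member.
rabbitmqctl forget_cluster_node rabbit@nodeA
rabbitmqctl forget_cluster_node rabbit@nodeB
rabbitmqctl forget_cluster_node rabbit@nodeC

# 2. Restart the surviving nodes; with this PR applied, the Ra members
#    would be force-restarted with the remaining Mnesia cluster members
#    acting as the filter_nodes inclusion list.
rabbitmqctl stop_app
rabbitmqctl start_app
```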
Hi @kjnilsson, do you have any suggestions on how to proceed with this? Maybe we could lift your strategy of setting the cluster nodes with a call, but instead of resetting the current state, we could keep it? That would make it possible to elect the most up-to-date member.
Closing, as we already have a recovery mode that works OK.
Proposed Changes
The Raft protocol itself is built around the notion of a leader, which handles cluster change requests and is elected by the majority that forms the cluster.

There is a use case where the majority of nodes is gone (during a network split or outage) and we still need the data back, especially in the scenario where we have inter-datacenter communication and our requirement is to keep the minority group running after the incident.
This PR introduces the following calls:

```erlang
force_restart_server(System, ServerId, FilterNodes)
force_restart_server(System, ServerId, FilterNodes, AddConfig)
```

These indicate in the client codebase that the cluster is forcefully reduced to the setup covered by the `FilterNodes` inclusion list. We tried to make a minimal adjustment so as not to break the general approach behind the library code and the Ra protocol itself. The state machine is left deterministic, and the whole improvement is built behind the `filter_nodes` setting, which is passed through the mutable config during the restart. The overhead is minimal, since `validate_cluster/1` is only called three times: during recovery, when receiving a snapshot, or when processing a cluster change command.

This PR also includes an `enq_drain_recovery` test SUITE, in addition to the default `enq_drain_basic`, which plays the split-and-recovery scenario through nemesis. The respective documentation is added (plus small fixes), as well as one more sanity test.
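A minimal usage sketch of the proposed calls follows. The node, system, and server names are hypothetical, and the exact module placement of `force_restart_server/3` is an assumption; this illustrates the intent under this PR, not a released API.

```erlang
%% Cluster {A, B, C, D, E}; A, B and C are permanently lost.
%% On each surviving node, restart the local member with an inclusion
%% list so the effective cluster is reduced to [D, E]:
Filter = ['ra@nodeD', 'ra@nodeE'],
ok = ra:force_restart_server(default, {my_machine, 'ra@nodeD'}, Filter),
ok = ra:force_restart_server(default, {my_machine, 'ra@nodeE'}, Filter).
```

Note that, as discussed in the review thread above, the reduction is temporary: the filter only persists once a subsequent membership change is made.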
Types of Changes

What types of changes does your code introduce to this project? Put an `x` in the boxes that apply.

Checklist

Put an `x` in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask on the mailing list. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

- [ ] I have read the `CONTRIBUTING.md` document

Further Comments
P.S. We have seen the quite fresh and interesting improvement from @kjnilsson, but we still need a multi-DC setup running, which would allow us to handle quorum queues in a multi-node setup as well.

We also took this issue into account.