Add elasticsearch-node detach-cluster tool #37979

andrershov · 2019-01-29T15:31:02Z

This commit adds the second part of the elasticsearch-node tool: the
detach-cluster command, in addition to the unsafe-bootstrap command.
It also changes the semantics of unsafe-bootstrap, which now
changes the clusterUUID.
The procedure for running the elasticsearch-node tool is therefore as follows:

1. Stop all nodes in the cluster.
2. Pick the master-eligible node with the highest (term, version) pair and
   run the unsafe-bootstrap command on it. If no master-eligible nodes
   survived, skip this step.
3. Run the detach-cluster command on the remaining surviving nodes.
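
As a rough illustration of step 2, the sketch below picks the node with the highest (term, version) pair. It assumes the pair is compared lexicographically (term first, then version). `SurvivingNode` and `BootstrapCandidatePicker` are hypothetical names made up for this example and are not part of the PR.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified view of a surviving master-eligible node's persisted state.
final class SurvivingNode {
    final String name;
    final long term;
    final long version;

    SurvivingNode(String name, long term, long version) {
        this.name = name;
        this.term = term;
        this.version = version;
    }
}

final class BootstrapCandidatePicker {
    // Assumption: (term, version) pairs are ordered by term first, then by version.
    static final Comparator<SurvivingNode> BY_TERM_THEN_VERSION =
        Comparator.<SurvivingNode>comparingLong(n -> n.term).thenComparingLong(n -> n.version);

    // Returns the node to run unsafe-bootstrap on, or empty if no
    // master-eligible node survived (in which case the step is skipped).
    static Optional<SurvivingNode> pick(List<SurvivingNode> survivors) {
        return survivors.stream().max(BY_TERM_THEN_VERSION);
    }

    public static void main(String[] args) {
        List<SurvivingNode> survivors = Arrays.asList(
            new SurvivingNode("node-a", 4, 120),
            new SurvivingNode("node-b", 5, 7),
            new SurvivingNode("node-c", 5, 9));
        String choice = pick(survivors).map(n -> n.name).orElse("none (skip unsafe-bootstrap)");
        System.out.println("unsafe-bootstrap candidate: " + choice);
    }
}
```

Note that a higher term wins even when another node has a larger version, which is why node-a is not chosen in the sample data.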

The detach-cluster command makes the following changes to the node metadata:

- Sets clusterUUIDCommitted to false.
- Sets currentTerm and term to 0.
- Removes voting tombstones and sets the voting configurations to the special
  constant MUST_JOIN_ELECTED_MASTER, which prevents initial cluster
  bootstrap.
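
The metadata changes listed above can be modelled with a small stand-alone class. This is only a sketch: `NodeMetadataSketch` and its fields are hypothetical simplifications, not the real `MetaData`, `Manifest`, or `CoordinationMetaData` classes. The only value taken from the PR is the `_must_join_elected_master_` id.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical, flattened stand-in for the pieces of node metadata that detach-cluster touches.
final class NodeMetadataSketch {
    // Same node id that the PR uses for VotingConfiguration.MUST_JOIN_ELECTED_MASTER.
    static final Set<String> MUST_JOIN_ELECTED_MASTER =
        Collections.singleton("_must_join_elected_master_");

    final String clusterUUID;
    final boolean clusterUUIDCommitted;
    final long currentTerm;
    final long term;
    final Set<String> votingConfiguration;
    final Set<String> votingTombstones;

    NodeMetadataSketch(String clusterUUID, boolean clusterUUIDCommitted, long currentTerm, long term,
                       Set<String> votingConfiguration, Set<String> votingTombstones) {
        this.clusterUUID = clusterUUID;
        this.clusterUUIDCommitted = clusterUUIDCommitted;
        this.currentTerm = currentTerm;
        this.term = term;
        this.votingConfiguration = votingConfiguration;
        this.votingTombstones = votingTombstones;
    }

    // Returns a detached copy: UUID marked uncommitted, terms reset to 0,
    // tombstones dropped, and the voting configuration replaced by the marker.
    NodeMetadataSketch detach() {
        return new NodeMetadataSketch(clusterUUID, false, 0L, 0L,
            MUST_JOIN_ELECTED_MASTER, Collections.<String>emptySet());
    }

    public static void main(String[] args) {
        NodeMetadataSketch before = new NodeMetadataSketch("old-uuid", true, 7L, 7L,
            Collections.singleton("master-1"), Collections.singleton("master-2"));
        NodeMetadataSketch after = before.detach();
        System.out.println(after.clusterUUIDCommitted + " " + after.currentTerm + " " + after.votingConfiguration);
    }
}
```

Because no real node can have the marker id, a node in this state cannot win an election on its own and can only join a master that has already been elected.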

An abstract base class, ElasticsearchNodeCommand, is introduced because
UnsafeBootstrapMasterCommand and DetachClusterCommand have a lot in
common.
This commit also adds an "ordinal" parameter to both commands, because it is
impossible to write an integration test otherwise.
Special handling for the MUST_JOIN_ELECTED_MASTER case is introduced in
ClusterFormationFailureHelper.
Tests for both commands reside in ElasticsearchNodeCommandIT (renamed
from UnsafeBootstrapMasterIT).

elasticmachine · 2019-01-29T15:31:10Z

Pinging @elastic/es-distributed

ywelsch

Thanks for the PR @andrershov. I've left some initial comments. Can we also have a test where a data node rejoins a freshly bootstrapped cluster (i.e. when all master-eligible nodes were lost)?
Also, can you adapt the existing tests (testDanglingIndices + ???) to show that this allows us to port them to Zen2?

ywelsch · 2019-01-30T09:34:26Z

server/src/main/java/org/elasticsearch/cluster/coordination/DetachClusterCommand.java

+            "--------------------------------------------------------------------------\n" +
+                    "\n" +
+                    "You should run this tool only if you have permanently lost all\n" +
+                    "your master-eligible nodes, and you cannot restore the cluster\n" +


We also recommend using it when a majority of the master-eligible nodes has been lost, no?

See below "or you have already run elasticsearch-node unsafe-bootstrap ...". Probably @DaveCTurner will have to come up with better wording.

Looks ok to me.

ywelsch · 2019-01-30T09:36:09Z

server/src/main/java/org/elasticsearch/cluster/coordination/DetachClusterCommand.java

+                    "your master-eligible nodes, and you cannot restore the cluster\n" +
+                    "from a snapshot, or you have already run `elasticsearch-node unsafe-bootstrap`\n" +
+                    "on master-eligible node that formed cluster with this node.\n" +
+                    "This tool can result in arbitrary data loss and should be\n" +


perhaps: Usage of this tool can result in data loss and should be a means of last resort.

This sentence is just copied from unsafe-bootstrap command

sure, that still doesn't make it great :)

I think the original is ok, but suggest this as an alternative:

This tool can cause arbitrary data loss and its use should be your last resort.

ywelsch · 2019-01-30T09:37:07Z

server/src/main/java/org/elasticsearch/cluster/coordination/DetachClusterCommand.java

+        final MetaData newMetaData = MetaData.builder(metaData)
+                .version(0)
+                .coordinationMetaData(coordinationMetaData)
+                .clusterUUID(MetaData.UNKNOWN_CLUSTER_UUID)


we can keep the cluster uuid, and just set clusterUUIDCommitted to false

ywelsch · 2019-01-30T09:48:01Z

server/src/main/java/org/elasticsearch/cluster/coordination/UnsafeBootstrapMasterCommand.java

-            MetaData.FORMAT.cleanupOldFiles(manifest.getGlobalGeneration(), dataPaths);
-            throw new ElasticsearchException(WRITE_METADATA_EXCEPTION_MSG, e);
-        }
+        long newCurrentTerm = manifest.getCurrentTerm() + 1;


Is incrementing the term necessary given that we changed the cluster uuid? The node will anyway bump its term when electing itself. Let's only do the minimum changes required to the state.

ywelsch · 2019-01-30T09:54:59Z

server/src/main/java/org/elasticsearch/cluster/coordination/DetachClusterCommand.java

+                .lastCommittedConfiguration(CoordinationMetaData.VotingConfiguration.MUST_JOIN_ELECTED_MASTER)
+                .build();
+        final MetaData newMetaData = MetaData.builder(metaData)
+                .version(0)


why set this to 0? Is this necessary?

ywelsch · 2019-01-30T09:55:44Z

server/src/main/java/org/elasticsearch/cluster/coordination/DetachClusterCommand.java

+                .clusterUUIDCommitted(false)
+                .build();
+
+        writeNewMetaData(terminal, manifest, 0, 0, metaData, newMetaData, dataPaths);


we can keep the cluster state version, and just set the term to 0.

UnsafeBootstrap - do not increment current term
DetachCluster - do not reset MetaData and cluster state versions, do not change clusterUUID

Replaces testIndexImportedFromDataOnlyNodesIfMasterLostDataFolder and testDanglingIndices

andrershov · 2019-01-30T12:45:48Z

@ywelsch Thank you for the review!
I've addressed your comments, except wording.
Regarding tests, the two tests that you meant are testDanglingIndices and testIndexImportedFromDataOnlyNodesIfMasterLostDataFolder. In my opinion they are the same, and one of them should be removed.
Taking this into account, I've implemented ElasticsearchNodeCommandIT.testAllMasterEligibleNodesFailedDanglingIndexImport, which replaces both testDanglingIndices and testIndexImportedFromDataOnlyNodesIfMasterLostDataFolder.
I think ElasticsearchNodeCommandIT is a better place for this test because it requires a detachCluster invocation, and ElasticsearchNodeCommandIT is specifically designed for this case and makes executing the detachCluster command a one-liner.
Please let me know what you think.

DaveCTurner

I suggested some small changes to wording

DaveCTurner · 2019-01-30T15:45:38Z

server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationMetaData.java

@@ -325,6 +325,8 @@ public String toString() {
    public static class VotingConfiguration implements Writeable, ToXContentFragment {

        public static final VotingConfiguration EMPTY_CONFIG = new VotingConfiguration(Collections.emptySet());
+        public static final VotingConfiguration MUST_JOIN_ELECTED_MASTER = new VotingConfiguration(Collections.singleton(
+                "_must_join_elected_master_"));


I think this deserves a special case in ClusterFormationFailureHelper (and its tests) as it will yield a somewhat strange message as it is:

master not discovered or elected yet, an election requires a node with id [_must_join_elected_master_], have discovered [] which is not a quorum; discovery will continue using [] from hosts providers and [...] from last-known cluster state; node term 0, last-accepted version 0 in term 0

I suggest:

master not discovered yet and this node was detached from its previous cluster, have discovered []; discovery will continue using [] from hosts providers and [...] from last-known cluster state; node term 0, last-accepted version 0 in term 0
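
A special case along these lines could look roughly like the sketch below. It is a simplified stand-in and not the actual ClusterFormationFailureHelper code: `FormationFailureMessageSketch` and `describe` are hypothetical names, and the messages are shortened to the part that differs between the two cases.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical helper showing only the branch on the MUST_JOIN_ELECTED_MASTER marker.
final class FormationFailureMessageSketch {
    // Same node id that the PR uses for VotingConfiguration.MUST_JOIN_ELECTED_MASTER.
    static final Set<String> MUST_JOIN_ELECTED_MASTER =
        Collections.singleton("_must_join_elected_master_");

    static String describe(Set<String> votingConfiguration, Set<String> discoveredNodes) {
        if (MUST_JOIN_ELECTED_MASTER.equals(votingConfiguration)) {
            // Detached node: do not mention the placeholder id or quorums at all.
            return "master not discovered yet and this node was detached from its previous cluster, "
                + "have discovered " + discoveredNodes;
        }
        // Simplified version of the generic message for an ordinary voting configuration.
        return "master not discovered or elected yet, an election requires a node with id "
            + votingConfiguration + ", have discovered " + discoveredNodes + " which is not a quorum";
    }

    public static void main(String[] args) {
        System.out.println(describe(MUST_JOIN_ELECTED_MASTER, Collections.<String>emptySet()));
        System.out.println(describe(Collections.singleton("node-1"), Collections.<String>emptySet()));
    }
}
```

Checking for the marker before building the generic message keeps the placeholder node id out of anything shown to the user.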

DaveCTurner · 2019-01-30T15:56:33Z

server/src/main/java/org/elasticsearch/cluster/coordination/DetachClusterCommand.java

+                    "your master-eligible nodes, and you cannot restore the cluster\n" +
+                    "from a snapshot, or you have already run `elasticsearch-node unsafe-bootstrap`\n" +
+                    "on master-eligible node that formed cluster with this node.\n" +
+                    "This tool can result in arbitrary data loss and should be\n" +


I think the original is ok, but suggest this as an alternative:

This tool can cause arbitrary data loss and its use should be your last resort.

DaveCTurner · 2019-01-30T15:57:06Z

server/src/main/java/org/elasticsearch/cluster/coordination/DetachClusterCommand.java

+            "--------------------------------------------------------------------------\n" +
+                    "\n" +
+                    "You should run this tool only if you have permanently lost all\n" +
+                    "your master-eligible nodes, and you cannot restore the cluster\n" +


Looks ok to me.

DaveCTurner · 2019-01-30T15:57:14Z

server/src/main/java/org/elasticsearch/cluster/coordination/DetachClusterCommand.java

+                    "You should run this tool only if you have permanently lost all\n" +
+                    "your master-eligible nodes, and you cannot restore the cluster\n" +
+                    "from a snapshot, or you have already run `elasticsearch-node unsafe-bootstrap`\n" +
+                    "on master-eligible node that formed cluster with this node.\n" +


Suggested change

"on master-eligible node that formed cluster with this node.\n" +

"on a master-eligible node that formed a cluster with this node.\n" +

DaveCTurner · 2019-01-30T15:57:50Z

server/src/main/java/org/elasticsearch/cluster/coordination/DetachClusterCommand.java

+                    "Do you want to proceed?\n";
+
+    public DetachClusterCommand() {
+        super("Detaches this node from the cluster with old UUID, allowing it to join new cluster");


Suggested change

super("Detaches this node from the cluster with old UUID, allowing it to join new cluster");

super("Detaches this node from its cluster, allowing it to unsafely join a new cluster");

andrershov · 2019-01-30T17:15:36Z

@DaveCTurner thanks for your review! I've addressed ClusterFormationFailureHelper message in a122221 and all wording in 80ad224.
@DaveCTurner and @ywelsch it's ready for the next pass.

ywelsch

LGTM. Can you also address the TODO in testCannotJoinClusterWithDifferentUUID with this?

DaveCTurner

LGTM2

# Conflicts: # server/src/test/java/org/elasticsearch/cluster/coordination/UnsafeBootstrapMasterIT.java

andrershov · 2019-01-31T20:30:46Z

run elasticsearch-ci/2

andrershov · 2019-02-01T07:05:58Z

run elasticsearch-ci/2

andrershov · 2019-02-01T09:40:38Z

run elasticsearch-ci/2

# Conflicts: # server/src/test/java/org/elasticsearch/cluster/coordination/UnsafeBootstrapMasterIT.java

* master: (36 commits)
  Ensure joda compatibility in custom date formats (elastic#38171)
  Do not compute cardinality if the `terms` execution mode does not use `global_ordinals` (elastic#38169)
  Do not set timeout for IndexRequests in GatewayIndexStateIT (elastic#38147)
  Zen2ify testMasterFailoverDuringIndexingWithMappingChanges (elastic#38178)
  SQL: [Docs] Add limitation for aggregate functions on scalars (elastic#38186)
  Add elasticsearch-node detach-cluster command (elastic#37979)
  Add tests for fractional epoch parsing (elastic#38162)
  Enable bw tests for elastic#37871 and elastic#38032. (elastic#38167)
  Clear send behavior rule in CloseWhileRelocatingShardsIT (elastic#38159)
  Fix testCorruptedIndex (elastic#38161)
  Add finalReduce flag to SearchRequest (elastic#38104)
  Forbid negative field boosts in analyzed queries (elastic#37930)
  Remove AtomiFieldData#getLegacyFieldValues (elastic#38087)
  Universal cluster bootstrap method for tests with autoMinMasterNodes=false (elastic#38038)
  Fix FullClusterRestartIT.testHistoryUUIDIsAdded (elastic#38098)
  Replace joda time in ingest-common module (elastic#38088)
  Fix eclipse config for ssl-config (elastic#38096)
  Don't load global ordinals with the `map` execution_hint (elastic#37833)
  Relax fault detector in some disruption tests (elastic#38101)
  Fix java time epoch date formatters (elastic#37829)
  ...

colings86 · 2019-02-07T11:36:59Z

Please remember to add all relevant labels (area label, version label(s) and change type label) on all PRs and please look for this as part of reviews. The release note generation process is made much harder when PRs are not labelled correctly.

detach-cluster tool

67a4fba

andrershov added the >enhancement and :Distributed Coordination/Cluster Coordination labels Jan 29, 2019

andrershov requested review from ywelsch and DaveCTurner January 29, 2019 15:31

Remove terminal log line

f205006

ywelsch mentioned this pull request Jan 29, 2019

A new cluster coordination layer #32006

Closed

61 tasks

ywelsch suggested changes Jan 30, 2019

Andrey Ershov added 3 commits January 30, 2019 12:37

Make minimal disruption to cluster state

5572a83

UnsafeBootstrap - do not increment current term
DetachCluster - do not reset MetaData and cluster state versions, do not change clusterUUID

testAllMasterEligibleNodesFailed

28940db

Extend testAllMasterEligibleNodesFailed

d1c3c2b

Replaces testIndexImportedFromDataOnlyNodesIfMasterLostDataFolder and testDanglingIndices

Merge branch 'master' into zen2_detach_cluster

699cfc0

DaveCTurner reviewed Jan 30, 2019

Andrey Ershov added 2 commits January 30, 2019 18:06

Special message for must join elected master in ClusterFormationHelper

a122221

Wording

80ad224

andrershov requested review from ywelsch and DaveCTurner January 31, 2019 10:17

ywelsch approved these changes Jan 31, 2019

DaveCTurner approved these changes Jan 31, 2019

Andrey Ershov added 3 commits January 31, 2019 16:56

Address TODO in CoordinatorTests

b12f50b

Merge branch 'master' into zen2_detach_cluster

2f63523

# Conflicts: # server/src/test/java/org/elasticsearch/cluster/coordination/UnsafeBootstrapMasterIT.java

Fix merge, remove min master nodes setting from tests

e1cfd09

Andrey Ershov added 2 commits February 1, 2019 11:37

Merge branch 'master' into zen2_detach_cluster

cbcf6c5

# Conflicts: # server/src/test/java/org/elasticsearch/cluster/coordination/UnsafeBootstrapMasterIT.java

Use internalCluster.setBootstrapMasterNodeIndex

f84d472

andrershov merged commit bda5914 into elastic:master Feb 1, 2019

colings86 added the v7.0.0-beta1 label Feb 7, 2019
