Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add elasticsearch-node detach-cluster tool #37979

Merged
merged 13 commits into from
Feb 1, 2019

Conversation

andrershov
Copy link
Contributor

@andrershov andrershov commented Jan 29, 2019

This commit adds the second part of elasticsearch-node tool -
detach-cluster command in addition to unsafe-bootstrap command.
Also, this commit changes the semantics of unsafe-bootstrap, now
unsafe-bootstrap changes clusterUUID.
So the algorithm of running elasticsearch-node tool is the following:

  1. Stop all nodes in the cluster.
  2. Pick master-eligible node with the highest (term, version) pair and
    run the unsafe-bootstrap command on it. If there are no survived
    master-eligible nodes - skip this step.
  3. Run detach-cluster command on the remaining survived nodes.

Detach cluster makes the following changes to the node metadata:

  1. Sets clusterUUID commited to false.
  2. Sets currentTerm and term to 0.
  3. Removes voting tombstones and sets voting configurations to special
    constant MUST_JOIN_ELECTED_MASTER, that prevents initial cluster
    bootstrap.

ElasticsearchNodeCommand base abstract class is introduced, because
UnsafeBootstrapMasterCommand and DetachClusterCommand have a lot in
common.
Also, this commit adds "ordinal" parameter to both commands, because it's
impossible to write IT otherwise.
For MUST_JOIN_ELECTED_MASTER case special handling is introduced in
ClusterFormationFailureHelper.
Tests for both commands reside in ElasticsearchNodeCommandIT (renamed
from UnsafeBootstrapMasterIT).

@andrershov andrershov added >enhancement :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Jan 29, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@ywelsch ywelsch mentioned this pull request Jan 29, 2019
61 tasks
Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @andrershov. I've left some initial comments. Can we also have a test where we have a data node rejoin a freshly bootstrapped cluster (i.e. when all master-eligible nodes were lost).
Also, can you adapt the existing tests (testDanglingIndices + ???) to show that this allows us to port them to Zen2?

"--------------------------------------------------------------------------\n" +
"\n" +
"You should run this tool only if you have permanently lost all\n" +
"your master-eligible nodes, and you cannot restore the cluster\n" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also recommend to use it when having lost a majority of master-eligible nodes, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See below "or you have already run elasticsearch-node unsafe-bootstrap ...". Probably @DaveCTurner will have to come up with better wording.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok to me.

"your master-eligible nodes, and you cannot restore the cluster\n" +
"from a snapshot, or you have already run `elasticsearch-node unsafe-bootstrap`\n" +
"on master-eligible node that formed cluster with this node.\n" +
"This tool can result in arbitrary data loss and should be\n" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps: Usage of this tool can result in data loss and should be a means of last resort.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sentence is just copied from unsafe-bootstrap command

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, that still doesn't make it great :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the original is ok, but suggest this as an alternative:

This tool can cause arbitrary data loss and its use should be your last resort.

final MetaData newMetaData = MetaData.builder(metaData)
.version(0)
.coordinationMetaData(coordinationMetaData)
.clusterUUID(MetaData.UNKNOWN_CLUSTER_UUID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can keep the cluster uuid, and just set clusterUUIDCommitted to false

MetaData.FORMAT.cleanupOldFiles(manifest.getGlobalGeneration(), dataPaths);
throw new ElasticsearchException(WRITE_METADATA_EXCEPTION_MSG, e);
}
long newCurrentTerm = manifest.getCurrentTerm() + 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is incrementing the term necessary given that we changed the cluster uuid? The node will anyway bump its term when electing itself. Let's only do the minimum changes required to the state.

.lastCommittedConfiguration(CoordinationMetaData.VotingConfiguration.MUST_JOIN_ELECTED_MASTER)
.build();
final MetaData newMetaData = MetaData.builder(metaData)
.version(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why set this to 0? Is this necessary?

.clusterUUIDCommitted(false)
.build();

writeNewMetaData(terminal, manifest, 0, 0, metaData, newMetaData, dataPaths);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can keep the cluster state version, and just set the term to 0.

Andrey Ershov added 3 commits January 30, 2019 12:37
UnsafeBootstrap - do not increment current term
DetachCluster - do not reset MetaData and cluster state versions,
do not change clusterUUID
Replaces testIndexImportedFromDataOnlyNodesIfMasterLostDataFolder and
testDanglingIndices
@andrershov
Copy link
Contributor Author

@ywelsch Thank you for the review!
I've addressed your comments, except wording.
Regarding tests, the two tests that you meant are testDanglingIndices and testIndexImportedFromDataOnlyNodesIfMasterLostDataFolder in my opinion they are the same and one of them should be removed.
Taking this into account, I've implemented ElasticsearchNodeCommandIT.testAllMasterEligibleNodesFailedDanglingIndexImport which replaces both testDanglingIndices and testIndexImportedFromDataOnlyNodesIfMasterLostDataFolder.
I think, ElasticsearchNodeCommandIT is a better place for this test because it requires detachCluster invocation and ElasticsearchNodeCommandIT is specifically designed for this case and makes detachCluster command execution one-liner.
Please let me know what do you think.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested some small changes to wording

@@ -325,6 +325,8 @@ public String toString() {
public static class VotingConfiguration implements Writeable, ToXContentFragment {

public static final VotingConfiguration EMPTY_CONFIG = new VotingConfiguration(Collections.emptySet());
public static final VotingConfiguration MUST_JOIN_ELECTED_MASTER = new VotingConfiguration(Collections.singleton(
"_must_join_elected_master_"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this deserves a special case in ClusterFormationFailureHelper (and its tests) as it will yield a somewhat strange message as it is:

master not discovered or elected yet, an election requires a node with id [_must_join_elected_master_], have discovered [] which is not a quorum; discovery will continue using [] from hosts providers and [...] from last-known cluster state; node term 0, last-accepted version 0 in term 0

I suggest:

master not discovered yet and this node was detached from its previous cluster, have discovered []; discovery will continue using [] from hosts providers and [...] from last-known cluster state; node term 0, last-accepted version 0 in term 0

"your master-eligible nodes, and you cannot restore the cluster\n" +
"from a snapshot, or you have already run `elasticsearch-node unsafe-bootstrap`\n" +
"on master-eligible node that formed cluster with this node.\n" +
"This tool can result in arbitrary data loss and should be\n" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the original is ok, but suggest this as an alternative:

This tool can cause arbitrary data loss and its use should be your last resort.

"--------------------------------------------------------------------------\n" +
"\n" +
"You should run this tool only if you have permanently lost all\n" +
"your master-eligible nodes, and you cannot restore the cluster\n" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok to me.

"You should run this tool only if you have permanently lost all\n" +
"your master-eligible nodes, and you cannot restore the cluster\n" +
"from a snapshot, or you have already run `elasticsearch-node unsafe-bootstrap`\n" +
"on master-eligible node that formed cluster with this node.\n" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"on master-eligible node that formed cluster with this node.\n" +
"on a master-eligible node that formed a cluster with this node.\n" +

"Do you want to proceed?\n";

public DetachClusterCommand() {
super("Detaches this node from the cluster with old UUID, allowing it to join new cluster");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
super("Detaches this node from the cluster with old UUID, allowing it to join new cluster");
super("Detaches this node from its cluster, allowing it to unsafely join a new cluster");

@andrershov
Copy link
Contributor Author

@DaveCTurner thanks for your review! I've addressed ClusterFormationFailureHelper message in a122221 and all wording in 80ad224.
@DaveCTurner and @ywelsch it's ready for the next pass.

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Can you also address the TODO in testCannotJoinClusterWithDifferentUUID with this?

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM2

Andrey Ershov added 3 commits January 31, 2019 16:56
# Conflicts:
#	server/src/test/java/org/elasticsearch/cluster/coordination/UnsafeBootstrapMasterIT.java
@andrershov
Copy link
Contributor Author

run elasticsearch-ci/2

2 similar comments
@andrershov
Copy link
Contributor Author

run elasticsearch-ci/2

@andrershov
Copy link
Contributor Author

run elasticsearch-ci/2

Andrey Ershov added 2 commits February 1, 2019 11:37
# Conflicts:
#	server/src/test/java/org/elasticsearch/cluster/coordination/UnsafeBootstrapMasterIT.java
@andrershov andrershov merged commit bda5914 into elastic:master Feb 1, 2019
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Feb 1, 2019
* master: (36 commits)
  Ensure joda compatibility in custom date formats (elastic#38171)
  Do not compute cardinality if the `terms` execution mode does not use `global_ordinals` (elastic#38169)
  Do not set timeout for IndexRequests in GatewayIndexStateIT (elastic#38147)
  Zen2ify testMasterFailoverDuringIndexingWithMappingChanges (elastic#38178)
  SQL: [Docs] Add limitation for aggregate functions on scalars (elastic#38186)
  Add elasticsearch-node detach-cluster command (elastic#37979)
  Add tests for fractional epoch parsing (elastic#38162)
  Enable bw tests for elastic#37871 and elastic#38032. (elastic#38167)
  Clear send behavior rule in CloseWhileRelocatingShardsIT (elastic#38159)
  Fix testCorruptedIndex (elastic#38161)
  Add finalReduce flag to SearchRequest (elastic#38104)
  Forbid negative field boosts in analyzed queries (elastic#37930)
  Remove AtomiFieldData#getLegacyFieldValues (elastic#38087)
  Universal cluster bootstrap method for tests with autoMinMasterNodes=false (elastic#38038)
  Fix FullClusterRestartIT.testHistoryUUIDIsAdded (elastic#38098)
  Replace joda time in ingest-common module (elastic#38088)
  Fix eclipse config for ssl-config (elastic#38096)
  Don't load global ordinals with the `map` execution_hint (elastic#37833)
  Relax fault detector in some disruption tests (elastic#38101)
  Fix java time epoch date formatters (elastic#37829)
  ...
@colings86
Copy link
Contributor

Please remember to add all relevant labels (area label, version label(s) and change type label) on all PRs and please look for this as part of reviews. The release note generation process is made much harder when PRs are not labelled correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants