Fix RareClusterStateIT Cancelling Publication too Early #51429

Merged
merged 1 commit into from
Jan 24, 2020

Conversation

Member

@original-brownbear original-brownbear commented Jan 24, 2020

Wait for the cluster to settle down, with the same accepted version on all nodes, before
executing and cancelling the request. Otherwise a slow cluster state (CS) accept on one node can make
that node fall behind, so that it is sent the full CS because of the diff-version mismatch, which breaks the mechanics of this test.

i.e. avoiding this situation:

[2020-01-23T11:26:39,636][DEBUG][o.e.c.c.PublicationTransportHandler] [node_t1] resending full cluster state to node {node_t0}{EepGi72fSguZAbAbXr4DPg}{hzfbHTkWTHqL2eX1Ig-_wA}{127.0.0.1}{127.0.0.1:35891}{dim} reason org.elasticsearch.transport.RemoteTransportException: [node_t0][127.0.0.1:35891][internal:cluster/coordination/publish_state]; org.elasticsearch.cluster.IncompatibleClusterStateVersionException: Expected diff for version 7 with uuid DkERyyF6QM28qJjg6gMSUA got version 9 and uuid 1IYFeZdnRIGqItz6VcFRXg
[2020-01-23T11:26:39,637][DEBUG][o.e.c.c.PublicationTransportHandler] [node_t0] received full cluster state version [9] with size [670]
[2020-01-23T11:26:39,652][DEBUG][o.e.g.PersistedClusterStateService] [node_t1] writing cluster state took [0ms]; wrote global metadata [false] and metadata for [1] indices and skipped [0] unchanged indices
[2020-01-23T11:26:39,652][DEBUG][o.e.g.PersistedClusterStateService] [node_t0] writing cluster state took [0ms]; wrote global metadata [false] and metadata for [1] indices and skipped [0] unchanged indices
[2020-01-23T11:26:39,652][DEBUG][o.e.c.c.CoordinationState] [node_t0] handleCommit: ignored commit request due to version mismatch (term 1, expected: [9], actual: [8])
[2020-01-23T11:26:39,653][DEBUG][o.e.c.c.C.CoordinatorPublication] [node_t1] ApplyCommitResponseHandler: [{node_t0}{EepGi72fSguZAbAbXr4DPg}{hzfbHTkWTHqL2eX1Ig-_wA}{127.0.0.1}{127.0.0.1:35891}{dim}] failed
org.elasticsearch.transport.RemoteTransportException: [node_t0][127.0.0.1:35891][internal:cluster/coordination/commit_state]

Closes #51308
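
The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Elasticsearch code: the class `DiffPublicationSketch`, its methods, and the node name `node_t2` are hypothetical, and the base version 8 for the diff is assumed for illustration. It shows that a node whose accepted version lags behind the diff's base version cannot apply the diff and has to be sent the full state, which is the path the test must not hit.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of diff-based cluster state publication.
// Hypothetical names throughout; this is not Elasticsearch code.
class DiffPublicationSketch {

    // A follower that remembers only the version of the last state it accepted.
    static final class Node {
        final String name;
        long acceptedVersion;

        Node(String name, long acceptedVersion) {
            this.name = name;
            this.acceptedVersion = acceptedVersion;
        }

        // A diff applies only on top of the exact version it was computed against.
        boolean acceptDiff(long baseVersion, long newVersion) {
            if (acceptedVersion != baseVersion) {
                return false;
            }
            acceptedVersion = newVersion;
            return true;
        }

        // The full state can be applied regardless of what the node held before.
        void acceptFull(long newVersion) {
            acceptedVersion = newVersion;
        }
    }

    // Sends a diff (baseVersion -> newVersion) to every node and falls back to the
    // full state on a mismatch. Returns the names of nodes that needed the fallback.
    static List<String> publish(List<Node> nodes, long baseVersion, long newVersion) {
        List<String> resentFull = new ArrayList<>();
        for (Node node : nodes) {
            if (!node.acceptDiff(baseVersion, newVersion)) {
                node.acceptFull(newVersion);
                resentFull.add(node.name);
            }
        }
        return resentFull;
    }

    // The precondition a test would wait for before it starts publishing.
    static boolean allOnVersion(List<Node> nodes, long version) {
        for (Node node : nodes) {
            if (node.acceptedVersion != version) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("node_t0", 7)); // slow node, one version behind
        nodes.add(new Node("node_t2", 8)); // hypothetical second follower
        System.out.println("settled on 8: " + allOnVersion(nodes, 8));
        System.out.println("full state resent to: " + publish(nodes, 8, 9));
    }
}
```

In this model, checking `allOnVersion` before calling `publish` corresponds to the wait this PR adds: once every node holds the same version, no node needs the full-state fallback.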

@original-brownbear original-brownbear added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.0.0 v7.7.0 labels Jan 24, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

@@ -370,11 +376,9 @@ public void testDelayedMappingPropagationOnReplica() throws Exception {

// Now make sure the indexing request finishes successfully
disruption.stopDisrupting();
assertBusy(() -> {
Member Author

These busy asserts made no sense: we never resend the requests we're asserting on here, so all they do is make a failed test run take an extra 10s.
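
To make the point about the extra time concrete, here is a minimal sketch of a retry-until-timeout assertion helper. It is an assumption-laden illustration, not the test framework's implementation: `BusyAssertSketch` and `retryUntilPasses` are hypothetical names, and the backoff values are arbitrary. When the asserted condition can never change, because nothing is resent, the helper can only fail, and it does so after waiting out the whole timeout.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical retry helper, written only to illustrate the behaviour of a busy assert.
class BusyAssertSketch {

    // Runs the assertion repeatedly until it passes or the timeout elapses.
    // On timeout, the most recent AssertionError is rethrown.
    static void retryUntilPasses(Runnable assertion, long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        long sleepMillis = 1;
        while (true) {
            try {
                assertion.run();
                return; // passed, stop retrying
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline) {
                    throw e; // out of time, surface the last failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while retrying", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 50); // capped exponential backoff
        }
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        try {
            // A condition that can never become true: retrying cannot help.
            retryUntilPasses(() -> {
                throw new AssertionError("state will not change");
            }, 200, TimeUnit.MILLISECONDS);
        } catch (AssertionError e) {
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            System.out.println("failed after " + elapsedMs + " ms: " + e.getMessage());
        }
    }
}
```

With a 10 second timeout in place of the 200 ms used here, a static failing condition costs the full 10 seconds before the test reports the failure, which is why a plain assertion is the better fit once the state is final.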

Contributor

@ywelsch ywelsch left a comment

LGTM. Thanks!

@original-brownbear
Member Author

Thanks Yannick!

@original-brownbear original-brownbear merged commit 3322df3 into elastic:master Jan 24, 2020
@original-brownbear original-brownbear deleted the 51308-logging branch January 24, 2020 17:08
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jan 24, 2020
original-brownbear added a commit that referenced this pull request Jan 24, 2020
rjernst pushed a commit that referenced this pull request Feb 20, 2020
@rjernst
Member

rjernst commented Feb 20, 2020

I've backported this to 7.6 as well, to address the same failure there (https://gradle-enterprise.elastic.co/s/j7bvgrn7srrie).

@rjernst rjernst added the v7.6.1 label Feb 20, 2020
@original-brownbear original-brownbear restored the 51308-logging branch August 6, 2020 18:37
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >test Issues or PRs that are addressing/adding tests v7.6.1 v7.7.0 v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

[CI] failure in RareClusterStateIT.testDelayedMappingPropagationOnReplica
5 participants