[CI] failure in RareClusterStateIT.testDelayedMappingPropagationOnReplica #51308

Closed
astefan opened this issue Jan 22, 2020 · 3 comments · Fixed by #51429
Labels: :Distributed Indexing/Distributed, >test-failure

Comments

@astefan
Contributor

astefan commented Jan 22, 2020

I've seen earlier issues about similar failures in this test, and I'm not sure this one is a genuine failure, but I'm opening it so that someone more familiar with the code can investigate. It doesn't reproduce locally. CC @original-brownbear

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.6+matrix-java-periodic/ES_BUILD_JAVA=openjdk13,ES_RUNTIME_JAVA=openjdk13,nodes=general-purpose/14/console
[7.5.3] https://gradle-enterprise.elastic.co/s/cjz5do6vk2pkq
[6.8.7] https://gradle-enterprise.elastic.co/s/kgeuak6okd3ew
https://gradle-enterprise.elastic.co/s/decph63z42af6

java.lang.AssertionError
	at __randomizedtesting.SeedInfo.seed([DDE5C7E2534A19E2:A19B062D03AD2201]:0)
	at org.junit.Assert.fail(Assert.java:86)
	at org.junit.Assert.assertTrue(Assert.java:41)
	at org.junit.Assert.assertFalse(Assert.java:64)
	at org.junit.Assert.assertFalse(Assert.java:74)
	at org.elasticsearch.cluster.coordination.RareClusterStateIT.testDelayedMappingPropagationOnReplica(RareClusterStateIT.java:371)
REPRODUCE WITH: ./gradlew ':server:integTest' --tests "org.elasticsearch.cluster.coordination.RareClusterStateIT.testDelayedMappingPropagationOnReplica" \
  -Dtests.seed=DDE5C7E2534A19E2 \
  -Dtests.security.manager=true \
  -Dtests.locale=zh-Hans-CN \
  -Dtests.timezone=America/Kentucky/Louisville \
  -Dcompiler.java=13 \
  -Druntime.java=13
@astefan astefan added >test-failure Triaged test failures from CI :Distributed Indexing/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. labels Jan 22, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Distributed)

@original-brownbear original-brownbear self-assigned this Jan 22, 2020
@astefan
Contributor Author

astefan commented Jan 22, 2020

Actually, after running this with -Dtests.iters=500, it did fail at around the 110th iteration.

@original-brownbear
Member

This is relatively easy to reproduce by adding a wait in org.elasticsearch.gateway.PersistedClusterStateService.Writer#writeIncrementalStateAndCommit. If the node whose cluster state update thread is blocked takes even a trivial amount of time to commit the cluster state (CS), it ends up being sent the full CS, and our hacky publication cancelling stops working. I'll try to find a fix for this tomorrow :)
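
To illustrate the failure mode described above, here is a minimal toy model of diff-based publication. It is only a sketch under stated assumptions: `PublicationSketch`, `Node`, `acceptDiff`, `acceptFull` and `publish` are hypothetical names made up for this example and are not Elasticsearch classes or methods. The one idea it borrows from the thread is that a diff applies only on top of the version it was computed against, so a node that is still committing the previous version gets the full state instead.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of diff-based cluster state publication; all names are hypothetical. */
final class PublicationSketch {

    /** A follower node that remembers the last cluster state version it accepted. */
    static final class Node {
        long acceptedVersion;
        final List<String> received = new ArrayList<>();

        Node(long acceptedVersion) {
            this.acceptedVersion = acceptedVersion;
        }

        /** A diff only applies on top of the exact version it was computed against. */
        boolean acceptDiff(long baseVersion, long newVersion) {
            if (acceptedVersion != baseVersion) {
                return false; // version mismatch: the sender has to fall back to the full state
            }
            acceptedVersion = newVersion;
            received.add("diff");
            return true;
        }

        /** The full state can be applied regardless of what the node accepted before. */
        void acceptFull(long newVersion) {
            acceptedVersion = newVersion;
            received.add("full");
        }
    }

    /** Sends newVersion as a diff against baseVersion, retrying with the full state on mismatch. */
    static String publish(Node node, long baseVersion, long newVersion) {
        if (node.acceptDiff(baseVersion, newVersion)) {
            return "diff";
        }
        node.acceptFull(newVersion);
        return "full";
    }

    public static void main(String[] args) {
        Node upToDate = new Node(5);
        Node lagging = new Node(4); // still busy committing version 5
        System.out.println(publish(upToDate, 5, 6)); // prints "diff"
        System.out.println(publish(lagging, 5, 6));  // prints "full"
    }
}
```

In this model the lagging node receives a different kind of message than the test expects, which is one way to picture why a test that relies on intercepting a particular publication could break.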

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 24, 2020
Wait for the cluster to have settled down and have the same accepted version on all nodes before
executing and cancelling request so that a slow CS accept on one node doesn't make it fall behind
and then get sent the full CS because of the diff-version mismatch, breaking the mechanics of this test.

Closes elastic#51308
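
The commit message describes waiting until every node reports the same accepted version before the test proceeds. The following is a rough, self-contained sketch of that idea using simple polling. It is not the code from the actual fix: `SettleSketch`, `allSame` and `awaitSameVersion` are hypothetical names, and each node is modelled as a plain `LongSupplier` that returns its accepted version.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Polls until every node reports the same accepted version; all names are hypothetical. */
final class SettleSketch {

    /** True when all suppliers currently report the same version. */
    static boolean allSame(List<LongSupplier> nodes) {
        return nodes.stream().mapToLong(LongSupplier::getAsLong).distinct().count() <= 1;
    }

    /** Polls every 10 ms; returns false on timeout or if the waiting thread is interrupted. */
    static boolean awaitSameVersion(List<LongSupplier> nodes, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!allSame(nodes)) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // the nodes did not converge in time
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong master = new AtomicLong(7);
        AtomicLong slow = new AtomicLong(6); // one version behind

        // Simulate the slow node finishing its commit a little later.
        Thread catchUp = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            slow.set(7);
        });
        catchUp.start();

        List<LongSupplier> nodes = List.of(master::get, slow::get);
        boolean settled = awaitSameVersion(nodes, 5_000);
        catchUp.join();
        System.out.println("settled=" + settled);
    }
}
```

Only once a check like this returns true would the test go on to issue and cancel its request, so that no node is behind at the moment the next publication is sent.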
original-brownbear added a commit that referenced this issue Jan 24, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 24, 2020
original-brownbear added a commit that referenced this issue Jan 24, 2020
rjernst pushed a commit that referenced this issue Feb 20, 2020