[CI] failure in RareClusterStateIT.testDelayedMappingPropagationOnReplica #51308

astefan · 2020-01-22T13:39:59Z

I've seen earlier issues about similar failures here, and I'm not sure this one is actually valid, but I'm opening it so that someone more familiar with the code can investigate. Doesn't reproduce locally. CC @original-brownbear

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.6+matrix-java-periodic/ES_BUILD_JAVA=openjdk13,ES_RUNTIME_JAVA=openjdk13,nodes=general-purpose/14/console
[7.5.3] https://gradle-enterprise.elastic.co/s/cjz5do6vk2pkq
[6.8.7] https://gradle-enterprise.elastic.co/s/kgeuak6okd3ew
https://gradle-enterprise.elastic.co/s/decph63z42af6

java.lang.AssertionError
	at __randomizedtesting.SeedInfo.seed([DDE5C7E2534A19E2:A19B062D03AD2201]:0)
	at org.junit.Assert.fail(Assert.java:86)
	at org.junit.Assert.assertTrue(Assert.java:41)
	at org.junit.Assert.assertFalse(Assert.java:64)
	at org.junit.Assert.assertFalse(Assert.java:74)
	at org.elasticsearch.cluster.coordination.RareClusterStateIT.testDelayedMappingPropagationOnReplica(RareClusterStateIT.java:371)

REPRODUCE WITH: ./gradlew ':server:integTest' --tests "org.elasticsearch.cluster.coordination.RareClusterStateIT.testDelayedMappingPropagationOnReplica" \
  -Dtests.seed=DDE5C7E2534A19E2 \
  -Dtests.security.manager=true \
  -Dtests.locale=zh-Hans-CN \
  -Dtests.timezone=America/Kentucky/Louisville \
  -Dcompiler.java=13 \
  -Druntime.java=13


elasticmachine · 2020-01-22T13:40:02Z

Pinging @elastic/es-distributed (:Distributed/Distributed)

astefan · 2020-01-22T13:41:53Z

Actually, after running this with -Dtests.iters=500, it did fail around the 110th test run.

original-brownbear · 2020-01-23T21:40:18Z

This is relatively easy to reproduce by adding a wait in org.elasticsearch.gateway.PersistedClusterStateService.Writer#writeIncrementalStateAndCommit. If the node whose cluster state update thread is blocked takes even a trivial amount of time before committing the cluster state (CS), it ends up being sent the full CS, and our hacky publication cancelling stops working. I'll try to find a fix for this tomorrow :)
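To make the failure mode concrete, here is a deliberately simplified model of the diff-versus-full-state choice. It is only a sketch: `DiffOrFullState`, `Payload`, and `choosePayload` are hypothetical names, and the real publication code in Elasticsearch is more involved than a single version comparison.

```java
// Simplified, hypothetical model; not the actual Elasticsearch publication logic.
public class DiffOrFullState {

    enum Payload { DIFF, FULL_STATE }

    // A diff built on top of diffBaseVersion can only be applied by a node whose
    // last accepted version is exactly that base. Any other node needs the full state.
    static Payload choosePayload(long diffBaseVersion, long nodeAcceptedVersion) {
        return nodeAcceptedVersion == diffBaseVersion ? Payload.DIFF : Payload.FULL_STATE;
    }

    public static void main(String[] args) {
        // Node is up to date with the base of the diff.
        System.out.println(choosePayload(41, 41)); // prints DIFF
        // Node was slow to accept version 41 and is still on 40.
        System.out.println(choosePayload(41, 40)); // prints FULL_STATE
    }
}
```

In this model, a node that lags by even one version falls into the `FULL_STATE` branch, which is the situation the comment above describes.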

Wait for the cluster to settle down and have the same accepted version on all nodes before executing and cancelling the request, so that a slow CS accept on one node doesn't make it fall behind and then get sent the full CS because of the diff-version mismatch, which breaks the mechanics of this test. Closes elastic#51308
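The idea of waiting for a common accepted version can be sketched roughly as follows. This is an illustration under assumptions, not the code from the fix: `AcceptedVersionWait` and both of its methods are hypothetical, and each node is modelled as a plain `LongSupplier` that returns its last accepted version.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

// Hypothetical stand-in for the test-side wait; not the code from the actual fix.
public class AcceptedVersionWait {

    // True when every node reports the same last-accepted cluster state version.
    static boolean allNodesOnSameVersion(List<LongSupplier> nodes) {
        return nodes.stream().mapToLong(LongSupplier::getAsLong).distinct().count() <= 1;
    }

    // Polls until all nodes agree or the timeout elapses; returns whether they converged.
    static boolean awaitSameAcceptedVersion(List<LongSupplier> nodes, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (System.nanoTime() - deadline < 0) {
            if (allNodesOnSameVersion(nodes)) {
                return true;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return allNodesOnSameVersion(nodes);
    }

    public static void main(String[] args) {
        AtomicLong slowNode = new AtomicLong(41); // one node lags by a version
        List<LongSupplier> nodes = List.of(() -> 42L, () -> 42L, slowNode::get);

        // Simulate the slow node accepting version 42 a little later.
        Thread catchUp = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            slowNode.set(42);
        });
        catchUp.start();

        System.out.println("converged: " + awaitSameAcceptedVersion(nodes, 2000));
    }
}
```

Only once such a wait returns `true` would the test go on to execute and cancel the request, so that no node is still behind when the next publication starts.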

astefan added the >test-failure and :Distributed Indexing/Distributed labels Jan 22, 2020

original-brownbear self-assigned this Jan 22, 2020

original-brownbear mentioned this issue Jan 24, 2020

Fix RareClusterStateIT Cancelling Publication too Early #51429

Merged

original-brownbear closed this as completed in #51429 Jan 24, 2020

original-brownbear mentioned this issue Jan 24, 2020

Fix RareClusterStateIT Cancelling Publication too Early (#51429) #51434

Merged
