[CI] RecoveryIT.testRecoveryWithConcurrentIndexing fails on master expecting more documents #27650

Closed
dakrone opened this issue Dec 4, 2017 · 8 comments
Labels: blocker · :Distributed Indexing/Recovery (Anything around constructing a new shard, either from a local or a remote source.) · >test-failure (Triaged test failures from CI) · v6.5.0

Comments

dakrone (Member) commented Dec 4, 2017

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/931/console

FAILURE 2.01s | RecoveryIT.testRecoveryWithConcurrentIndexing <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: <60>
   >      but: was <110>
   > 	at __randomizedtesting.SeedInfo.seed([65E63137F3BA894E:E5B7F564711E5784]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.upgrades.RecoveryIT.assertCount(RecoveryIT.java:204)
   > 	at org.elasticsearch.upgrades.RecoveryIT.testRecoveryWithConcurrentIndexing(RecoveryIT.java:192)
   > 	at java.lang.Thread.run(Thread.java:748)

The test expects the first node to have 60 documents, but it has 110:

            case UPGRADED:
                updateIndexSetting(index, Settings.builder().put(INDEX_ROUTING_ALLOCATION_ENABLE_SETTING.getKey(), (String)null));
                asyncIndexDocs(index, 60, 50).get();
                ensureGreen(index);
                assertOK(client().performRequest("POST", index + "/_refresh"));
                assertCount(index, "_only_nodes:" + nodes.get(0), 60); // <-- failure here
                assertCount(index, "_only_nodes:" + nodes.get(1), 60);
                break;
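
For context, the assertCount helper in the stack trace boils down to a count request pinned to particular nodes via the preference parameter. A rough sketch, assuming the usual ESRestTestCase helpers (client(), assertOK) and the low-level REST client; the actual code in RecoveryIT may differ slightly:

    private void assertCount(final String index, final String preference, final int expectedCount) throws IOException {
        // count only on the node(s) selected by the preference string, e.g. "_only_nodes:node-0"
        final Response response = client().performRequest("GET", index + "/_count",
                Collections.singletonMap("preference", preference));
        assertOK(response);
        // the _count API responds with {"count": N, ...}
        final int actualCount = Integer.parseInt(ObjectPath.createFromResponse(response).evaluate("count").toString());
        assertThat(actualCount, equalTo(expectedCount));
    }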

Reproduces for me with:

gradle :qa:rolling-upgrade:v6.2.0-SNAPSHOT#upgradedClusterTestRunner -Dtests.seed=65E63137F3BA894E -Dtests.class=org.elasticsearch.upgrades.RecoveryIT -Dtests.method="testRecoveryWithConcurrentIndexing" -Dtests.security.manager=true -Dtests.locale=sl -Dtests.timezone=SST
dakrone added the >test-failure (Triaged test failures from CI) label Dec 4, 2017
bleskes (Contributor) commented Dec 4, 2017

This is basically a git cherry-pick failure. I fixed it, only to lose the fix when I was cleaning up the commits... Thx Lee!

bleskes closed this as completed in 0b50b31 Dec 4, 2017
imotov (Contributor) commented Sep 14, 2018

@bleskes it looks like it failed again with a very similar assertion error:

java.lang.AssertionError: 
Expected: <110>
     but: was <60>
	at __randomizedtesting.SeedInfo.seed([887030CB7438355C:821F498F69CEB96]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.elasticsearch.upgrades.RecoveryIT.assertCount(RecoveryIT.java:168)
	at org.elasticsearch.upgrades.RecoveryIT.testRecoveryWithConcurrentIndexing(RecoveryIT.java:156)
REPRODUCE WITH: ./gradlew :qa:rolling-upgrade:v5.6.12-SNAPSHOT#upgradedClusterTestRunner \
  -Dtests.seed=887030CB7438355C \
  -Dtests.class=org.elasticsearch.upgrades.RecoveryIT \
  -Dtests.method="testRecoveryWithConcurrentIndexing" \
  -Dtests.security.manager=true \
  -Dtests.locale=fr-FR \
  -Dtests.timezone=CNT \
  -Dcompiler.java=10 \
  -Druntime.java=8FIPS \
  -Djavax.net.ssl.keyStorePassword=password \
  -Djavax.net.ssl.trustStorePassword=password

CI: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java10,ES_RUNTIME_JAVA=java8fips,nodes=virtual&&linux/297/console

Should this be cherry-picked somewhere else as well?

imotov reopened this Sep 14, 2018
dnhatn assigned dnhatn and unassigned bleskes Sep 20, 2018
dnhatn (Member) commented Sep 20, 2018

I think this test fails because we roll back a replica to the global checkpoint (#33473) but do not restore its history on promotion. This should be fixed by #33616.

dnhatn (Member) commented Sep 20, 2018

Fixed by #33616.

dnhatn closed this as completed Sep 20, 2018
dnhatn (Member) commented Sep 21, 2018

This fails on 6.x: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/46/console

FAILURE 1.54s | RecoveryIT.testRecoveryWithConcurrentIndexing <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: <110>
   >      but: was <60>
   > 	at __randomizedtesting.SeedInfo.seed([AD22BB7534737F0E:2D737F26B6D7A1C4]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.upgrades.RecoveryIT.assertCount(RecoveryIT.java:168)
   > 	at org.elasticsearch.upgrades.RecoveryIT.testRecoveryWithConcurrentIndexing(RecoveryIT.java:156)

Log: testRecoveryWithConcurrentIndexing.txt.zip

dnhatn reopened this Sep 21, 2018
dnhatn added the :Distributed Indexing/Recovery (Anything around constructing a new shard, either from a local or a remote source.) label Sep 21, 2018
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed

romseygeek added a commit that referenced this issue Sep 21, 2018
Failed immediately after it had been unmuted :(

Relates to #27650
bleskes (Contributor) commented Sep 21, 2018

I think this has to do with the rollback of the engine on a replica in the case where the new primary was promoted but the existing ops don't have seq# (the primary was on an old node). In that case, no resync happens and we lose the data that wasn't committed on the replica. I discussed this with @dnhatn and he will work on fixes, but with a lower priority than CCR. We should consider this a blocker. Marking as such.

dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Oct 1, 2018
Today we use the version of a DirectoryReader as a component of the key
of IndicesRequestCache. This usage is perfectly fine since the version
is advanced every time a new change is made to the IndexWriter. In other
words, two DirectoryReaders with the same version should have the same
content. However, this invariant is only guaranteed in the context of a
single IndexWriter because the version is reset to the committed version
value when IndexWriter is re-opened.

Since elastic#33473, each IndexShard may have more than one IndexWriter, and
using the version of a DirectoryReader as a part of the cache key can
cause IndicesRequestCache to return stale cached values. For example, in
elastic#27650, we roll back the engine (i.e., re-open IndexWriter), index new
documents, refresh, then make a count request, but the search layer
mistakenly returns the count of the DirectoryReader of the previous
IndexWriter because the current DirectoryReader has the same version as
the old DirectoryReader. This is possible because these two readers come
from different IndexWriters.

This commit replaces the version with the reader cache key of
IndexReader as a component of the cache key of IndicesRequestCache.

Closes elastic#27650
Relates elastic#33473
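
A minimal sketch of what that key change amounts to, using the relevant Lucene APIs (illustrative only, not the actual IndicesRequestCache code):

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;

    final class RequestCacheKeySketch {
        // Before: the reader version is only monotonic within a single IndexWriter;
        // after an engine rollback the new IndexWriter restarts from the committed
        // version, so a fresh reader can report the same version as a stale, cached one.
        static long versionBasedComponent(DirectoryReader reader) {
            return reader.getVersion();
        }

        // After: the per-reader cache key is unique to each reader instance and is
        // never reused, even across IndexWriter re-opens.
        static IndexReader.CacheKey readerCacheKeyComponent(DirectoryReader reader) {
            return reader.getReaderCacheHelper().getKey();
        }
    }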
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Oct 1, 2018
Today we index the same number of documents (50 documents) in each round
of the rolling upgrade tests. If the actual count does not match, we
cannot tell which round is problematic.

Relates elastic#27650
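
For illustration (the exact counts chosen in the commit may differ), giving each round a distinct count lets a wrong total identify the round:

    // Hypothetical per-round counts -- illustrative only.
    // OLD round:      index 60 docs  -> running total  60
    // MIXED round:    index 45 docs  -> running total 105
    // UPGRADED round: index 50 docs  -> running total 155
    // A final count of 110 (= 155 - 45) points at the MIXED round's documents,
    // whereas with 50/50/50 a missing batch of 50 could belong to any round.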
dnhatn added a commit that referenced this issue Oct 2, 2018
Today we index the same number of documents (50 documents) in each round
of the rolling upgrade tests. If the actual count does not match, we
cannot tell which round is problematic.

Relates #27650
dnhatn added a commit that referenced this issue Oct 2, 2018
Today we index the same number of documents (50 documents) in each round
of the rolling upgrade tests. If the actual count does not match, we
cannot tell which round is problematic.

Relates #27650
dnhatn added a commit that referenced this issue Oct 4, 2018
Today we use the version of a DirectoryReader as a component of the key
of IndicesRequestCache. This usage is perfectly fine since the version
is advanced every time a new change is made to the IndexWriter. In other
words, two DirectoryReaders with the same version should have the same
content. However, this invariant is only guaranteed in the context of a
single IndexWriter because the version is reset to the committed version
value when IndexWriter is re-opened.

Since #33473, each IndexShard may have more than one IndexWriter, and
using the version of a DirectoryReader as a part of the cache key can
cause IndicesRequestCache to return stale cached values. For example, in
#27650, we roll back the engine (i.e., re-open IndexWriter), index new
documents, refresh, then make a count request, but the search layer
mistakenly returns the count of the DirectoryReader of the previous
IndexWriter because the current DirectoryReader has the same version as
the old DirectoryReader even though their documents are different. This is
possible because these two readers come from different IndexWriters.

This commit replaces the version with the reader cache key of
IndexReader as a component of the cache key of IndicesRequestCache.

Closes #27650
Relates #33473
dnhatn added a commit that referenced this issue Oct 4, 2018
Today we use the version of a DirectoryReader as a component of the key
of IndicesRequestCache. This usage is perfectly fine since the version
is advanced every time a new change is made to the IndexWriter. In other
words, two DirectoryReaders with the same version should have the same
content. However, this invariant is only guaranteed in the context of a
single IndexWriter because the version is reset to the committed version
value when IndexWriter is re-opened.

Since #33473, each IndexShard may have more than one IndexWriter, and
using the version of a DirectoryReader as a part of the cache key can
cause IndicesRequestCache to return stale cached values. For example, in
#27650, we roll back the engine (i.e., re-open IndexWriter), index new
documents, refresh, then make a count request, but the search layer
mistakenly returns the count of the DirectoryReader of the previous
IndexWriter because the current DirectoryReader has the same version as
the old DirectoryReader even though their documents are different. This is
possible because these two readers come from different IndexWriters.

This commit replaces the version with the reader cache key of
IndexReader as a component of the cache key of IndicesRequestCache.

Closes #27650
Relates #33473
kcm pushed a commit that referenced this issue Oct 30, 2018
Today we index the same number of documents (50 documents) in each round
of the rolling upgrade tests. If the actual count does not match, we
cannot tell which round is problematic.

Relates #27650
kcm pushed a commit that referenced this issue Oct 30, 2018
Today we use the version of a DirectoryReader as a component of the key
of IndicesRequestCache. This usage is perfectly fine since the version
is advanced every time a new change is made to the IndexWriter. In other
words, two DirectoryReaders with the same version should have the same
content. However, this invariant is only guaranteed in the context of a
single IndexWriter because the version is reset to the committed version
value when IndexWriter is re-opened.

Since #33473, each IndexShard may have more than one IndexWriter, and
using the version of a DirectoryReader as a part of the cache key can
cause IndicesRequestCache to return stale cached values. For example, in
#27650, we roll back the engine (i.e., re-open IndexWriter), index new
documents, refresh, then make a count request, but the search layer
mistakenly returns the count of the DirectoryReader of the previous
IndexWriter because the current DirectoryReader has the same version as
the old DirectoryReader even though their documents are different. This is
possible because these two readers come from different IndexWriters.

This commit replaces the version with the reader cache key of
IndexReader as a component of the cache key of IndicesRequestCache.

Closes #27650
Relates #33473