[CI] RecoveryIT.testRecoveryWithConcurrentIndexing fails on master expecting more documents #27650

Closed
dakrone opened this issue Dec 4, 2017 · 8 comments
Labels: blocker · :Distributed Indexing/Recovery (Anything around constructing a new shard, either from a local or a remote source.) · >test-failure (Triaged test failures from CI) · v6.5.0

Comments

dakrone (Member) commented Dec 4, 2017

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/931/console

FAILURE 2.01s | RecoveryIT.testRecoveryWithConcurrentIndexing <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: <60>
   >      but: was <110>
   > 	at __randomizedtesting.SeedInfo.seed([65E63137F3BA894E:E5B7F564711E5784]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.upgrades.RecoveryIT.assertCount(RecoveryIT.java:204)
   > 	at org.elasticsearch.upgrades.RecoveryIT.testRecoveryWithConcurrentIndexing(RecoveryIT.java:192)
   > 	at java.lang.Thread.run(Thread.java:748)

The test expects the first node to have 60 documents, but it has 110:

            case UPGRADED:
                updateIndexSetting(index, Settings.builder().put(INDEX_ROUTING_ALLOCATION_ENABLE_SETTING.getKey(), (String)null));
                asyncIndexDocs(index, 60, 50).get();
                ensureGreen(index);
                assertOK(client().performRequest("POST", index + "/_refresh"));
                assertCount(index, "_only_nodes:" + nodes.get(0), 60); // <-- failure here
                assertCount(index, "_only_nodes:" + nodes.get(1), 60);
                break;
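
For context, the assertCount helper in the stack trace boils down to a count request pinned to particular nodes via the preference parameter. A rough sketch, assuming the usual ESRestTestCase helpers (client(), assertOK) and the low-level REST client; the actual code in RecoveryIT may differ slightly:

    private void assertCount(final String index, final String preference, final int expectedCount) throws IOException {
        // count only on the node(s) selected by the preference string, e.g. "_only_nodes:node-0"
        final Response response = client().performRequest("GET", index + "/_count",
                Collections.singletonMap("preference", preference));
        assertOK(response);
        // the _count API responds with {"count": N, ...}
        final int actualCount = Integer.parseInt(ObjectPath.createFromResponse(response).evaluate("count").toString());
        assertThat(actualCount, equalTo(expectedCount));
    }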

Reproduces for me with:

gradle :qa:rolling-upgrade:v6.2.0-SNAPSHOT#upgradedClusterTestRunner -Dtests.seed=65E63137F3BA894E -Dtests.class=org.elasticsearch.upgrades.RecoveryIT -Dtests.method="testRecoveryWithConcurrentIndexing" -Dtests.security.manager=true -Dtests.locale=sl -Dtests.timezone=SST
dakrone added the >test-failure (Triaged test failures from CI) label Dec 4, 2017
bleskes (Contributor) commented Dec 4, 2017

This is basically a git cherry-pick failure. I fixed it, only to lose the fix when I was cleaning up the commits... Thx Lee!

bleskes closed this as completed in 0b50b31 Dec 4, 2017
imotov (Contributor) commented Sep 14, 2018

@bleskes it looks like it failed again with a very similar assertion error:

java.lang.AssertionError: 
Expected: <110>
     but: was <60>
	at __randomizedtesting.SeedInfo.seed([887030CB7438355C:821F498F69CEB96]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.elasticsearch.upgrades.RecoveryIT.assertCount(RecoveryIT.java:168)
	at org.elasticsearch.upgrades.RecoveryIT.testRecoveryWithConcurrentIndexing(RecoveryIT.java:156)
REPRODUCE WITH: ./gradlew :qa:rolling-upgrade:v5.6.12-SNAPSHOT#upgradedClusterTestRunner \
  -Dtests.seed=887030CB7438355C \
  -Dtests.class=org.elasticsearch.upgrades.RecoveryIT \
  -Dtests.method="testRecoveryWithConcurrentIndexing" \
  -Dtests.security.manager=true \
  -Dtests.locale=fr-FR \
  -Dtests.timezone=CNT \
  -Dcompiler.java=10 \
  -Druntime.java=8FIPS \
  -Djavax.net.ssl.keyStorePassword=password \
  -Djavax.net.ssl.trustStorePassword=password

CI: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java10,ES_RUNTIME_JAVA=java8fips,nodes=virtual&&linux/297/console

Should this be cherry-picked somewhere else as well?

imotov reopened this Sep 14, 2018
dnhatn assigned dnhatn and unassigned bleskes Sep 20, 2018
dnhatn (Member) commented Sep 20, 2018

I think this test fails because we roll back a replica to the global checkpoint (#33473) but do not restore its history on promotion. This should be fixed by #33616.

dnhatn (Member) commented Sep 20, 2018

Fixed by #33616.

dnhatn closed this as completed Sep 20, 2018
dnhatn (Member) commented Sep 21, 2018

This fails on 6.x: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/46/console

FAILURE 1.54s | RecoveryIT.testRecoveryWithConcurrentIndexing <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: 
   > Expected: <110>
   >      but: was <60>
   > 	at __randomizedtesting.SeedInfo.seed([AD22BB7534737F0E:2D737F26B6D7A1C4]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.upgrades.RecoveryIT.assertCount(RecoveryIT.java:168)
   > 	at org.elasticsearch.upgrades.RecoveryIT.testRecoveryWithConcurrentIndexing(RecoveryIT.java:156)

Log: testRecoveryWithConcurrentIndexing.txt.zip

dnhatn reopened this Sep 21, 2018
dnhatn added the :Distributed Indexing/Recovery (Anything around constructing a new shard, either from a local or a remote source.) label Sep 21, 2018
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed

romseygeek added a commit that referenced this issue Sep 21, 2018
Failed immediately after it had been unmuted :(

Relates to #27650
bleskes (Contributor) commented Sep 21, 2018

I think this has to do with the rollback of the engine on a replica in the case where the new primary was promoted but the existing ops don't have seq# (the primary was on an old node). In that case, no resync happens and we lose the data that wasn't committed on the replica. I discussed this with @dnhatn and he will work on fixes, but with a lower priority than CCR. We should consider this a blocker. Marking as such.

dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Oct 1, 2018
Today we use the version of a DirectoryReader as a component of the key
of IndicesRequestCache. This usage is perfectly fine since the version
is advanced every time a new change is made to the IndexWriter. In other
words, two DirectoryReaders with the same version should have the same
content. However, this invariant is only guaranteed in the context of a
single IndexWriter because the version is reset to the committed version
value when IndexWriter is re-opened.

Since elastic#33473, each IndexShard may have more than one IndexWriter, and
using the version of a DirectoryReader as a part of the cache key can
cause IndicesRequestCache to return stale cached values. For example, in
elastic#27650, we roll back the engine (i.e., re-open IndexWriter), index new
documents, refresh, then make a count request, but the search layer
mistakenly returns the count of the DirectoryReader of the previous
IndexWriter because the current DirectoryReader has the same version as
the old DirectoryReader. This is possible because these two readers come
from different IndexWriters.

This commit replaces the version with the reader cache key of
IndexReader as a component of the cache key of IndicesRequestCache.

Closes elastic#27650
Relates elastic#33473
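
A minimal sketch of what that key change amounts to, using the relevant Lucene APIs (illustrative only, not the actual IndicesRequestCache code):

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;

    final class RequestCacheKeySketch {
        // Before: the reader version is only monotonic within a single IndexWriter;
        // after an engine rollback the new IndexWriter restarts from the committed
        // version, so a fresh reader can report the same version as a stale, cached one.
        static long versionBasedComponent(DirectoryReader reader) {
            return reader.getVersion();
        }

        // After: the per-reader cache key is unique to each reader instance and is
        // never reused, even across IndexWriter re-opens.
        static IndexReader.CacheKey readerCacheKeyComponent(DirectoryReader reader) {
            return reader.getReaderCacheHelper().getKey();
        }
    }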
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Oct 1, 2018
Today we index the same number of documents (50 documents) in each round
of the rolling upgrade tests. If the actual count does not match, we
cannot tell which round is problematic.

Relates elastic#27650
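
For illustration (the exact counts chosen in the commit may differ), giving each round a distinct count lets a wrong total identify the round:

    // Hypothetical per-round counts -- illustrative only.
    // OLD round:      index 60 docs  -> running total  60
    // MIXED round:    index 45 docs  -> running total 105
    // UPGRADED round: index 50 docs  -> running total 155
    // A final count of 110 (= 155 - 45) points at the MIXED round's documents,
    // whereas with 50/50/50 a missing batch of 50 could belong to any round.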
dnhatn added a commit that referenced this issue Oct 2, 2018
Today we index the same number of documents (50 documents) in each round
of the rolling upgrade tests. If the actual count does not match, we
cannot tell which round is problematic.

Relates #27650
dnhatn added a commit that referenced this issue Oct 2, 2018
Today we index the same number of documents (50 documents) in each round
of the rolling upgrade tests. If the actual count does not match, we
cannot tell which round is problematic.

Relates #27650
dnhatn added a commit that referenced this issue Oct 4, 2018
Today we use the version of a DirectoryReader as a component of the key
of IndicesRequestCache. This usage is perfectly fine since the version
is advanced every time a new change is made to the IndexWriter. In other
words, two DirectoryReaders with the same version should have the same
content. However, this invariant is only guaranteed in the context of a
single IndexWriter because the version is reset to the committed version
value when IndexWriter is re-opened.

Since #33473, each IndexShard may have more than one IndexWriter, and
using the version of a DirectoryReader as a part of the cache key can
cause IndicesRequestCache to return stale cached values. For example, in
#27650, we roll back the engine (i.e., re-open IndexWriter), index new
documents, refresh, then make a count request, but the search layer
mistakenly returns the count of the DirectoryReader of the previous
IndexWriter because the current DirectoryReader has the same version as
the old DirectoryReader even though their documents are different. This is
possible because these two readers come from different IndexWriters.

This commit replaces the version with the reader cache key of
IndexReader as a component of the cache key of IndicesRequestCache.

Closes #27650
Relates #33473
dnhatn added a commit that referenced this issue Oct 4, 2018
Today we use the version of a DirectoryReader as a component of the key
of IndicesRequestCache. This usage is perfectly fine since the version
is advanced every time a new change is made to the IndexWriter. In other
words, two DirectoryReaders with the same version should have the same
content. However, this invariant is only guaranteed in the context of a
single IndexWriter because the version is reset to the committed version
value when IndexWriter is re-opened.

Since #33473, each IndexShard may have more than one IndexWriter, and
using the version of a DirectoryReader as a part of the cache key can
cause IndicesRequestCache to return stale cached values. For example, in
#27650, we roll back the engine (i.e., re-open IndexWriter), index new
documents, refresh, then make a count request, but the search layer
mistakenly returns the count of the DirectoryReader of the previous
IndexWriter because the current DirectoryReader has the same version as
the old DirectoryReader even though their documents are different. This is
possible because these two readers come from different IndexWriters.

This commit replaces the version with the reader cache key of
IndexReader as a component of the cache key of IndicesRequestCache.

Closes #27650
Relates #33473
kcm pushed a commit that referenced this issue Oct 30, 2018
Today we index the same number of documents (50 documents) in each round
of the rolling upgrade tests. If the actual count does not match, we
cannot tell which round is problematic.

Relates #27650
kcm pushed a commit that referenced this issue Oct 30, 2018
Today we use the version of a DirectoryReader as a component of the key
of IndicesRequestCache. This usage is perfectly fine since the version
is advanced every time a new change is made to the IndexWriter. In other
words, two DirectoryReaders with the same version should have the same
content. However, this invariant is only guaranteed in the context of a
single IndexWriter because the version is reset to the committed version
value when IndexWriter is re-opened.

Since #33473, each IndexShard may have more than one IndexWriter, and
using the version of a DirectoryReader as a part of the cache key can
cause IndicesRequestCache to return stale cached values. For example, in
#27650, we roll back the engine (i.e., re-open IndexWriter), index new
documents, refresh, then make a count request, but the search layer
mistakenly returns the count of the DirectoryReader of the previous
IndexWriter because the current DirectoryReader has the same version as
the old DirectoryReader even though their documents are different. This is
possible because these two readers come from different IndexWriters.

This commit replaces the version with the reader cache key of
IndexReader as a component of the cache key of IndicesRequestCache.

Closes #27650
Relates #33473