Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] IndexShardIT » testMaybeFlush #52223

Closed
hendrikmuhs opened this issue Feb 11, 2020 · 12 comments · Fixed by #52247 or #54839
Closed

[CI] IndexShardIT » testMaybeFlush #52223

hendrikmuhs opened this issue Feb 11, 2020 · 12 comments · Fixed by #52247 or #54839
Assignees
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >test-failure Triaged test failures from CI

Comments

@hendrikmuhs
Copy link

failed on master:

https://gradle-enterprise.elastic.co/s/izpnat3y374v4

expected:<0> but was:<3>
java.lang.AssertionError
: 
expected:<0> but was:<3>
Close stacktrace
at __randomizedtesting.SeedInfo.seed([ACB0D4A252C38F40:2DB52A569EBDD90C]:0)
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:834)
at org.junit.Assert.assertEquals(Assert.java:645)
at org.junit.Assert.assertEquals(Assert.java:631)
at org.elasticsearch.index.shard.IndexShardIT.testMaybeFlush(IndexShardIT.java:352)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:566)
at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1754)
at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:942)
at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:978)
at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:992)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:819)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:470)
at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:951)
at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:836)
at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:887)
at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:898)
at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:370)
at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:826)
at java.lang.Thread.run(Thread.java:834)
[2020-02-12T03:56:52,902][INFO ][o.e.i.s.IndexShardIT     ] [testMaybeFlush] before test
[2020-02-12T03:56:52,908][INFO ][o.e.c.m.MetaDataCreateIndexService] [node_s_0] [test] creating index, cause [api], templates [one_shard_index_template, random-soft-deletes-template], shards [1]/[0], mappings []
[2020-02-12T03:56:53,059][INFO ][o.e.c.r.a.AllocationService] [node_s_0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[test][0]]]).
[2020-02-12T03:56:53,133][INFO ][o.e.c.s.IndexScopedSettings] [node_s_0] [test] updating [index.translog.flush_threshold_size] from [512mb] to [135b]
[2020-02-12T03:56:53,181][INFO ][o.e.c.s.IndexScopedSettings] [node_s_0] [test] updating [index.translog.flush_threshold_size] from [512mb] to [135b]
[2020-02-12T03:56:53,188][INFO ][o.e.c.m.MetaDataMappingService] [node_s_0] [test/1CLpUW3xTsuEeLFtC4nV7g] create_mapping
[2020-02-12T03:56:53,329][INFO ][o.e.i.s.IndexShardIT     ] [testMaybeFlush] after test
[2020-02-12T03:56:53,329][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_s_0] [test/1CLpUW3xTsuEeLFtC4nV7g] deleting index
@hendrikmuhs hendrikmuhs added >test-failure Triaged test failures from CI :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. labels Feb 11, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Engine)

@dnhatn dnhatn self-assigned this Feb 11, 2020
@dnhatn
Copy link
Member

dnhatn commented Feb 11, 2020

Relates to #51905.

dnhatn added a commit that referenced this issue Feb 12, 2020
Since #51905, we use the local checkpoint of the safe commit to 
calculate the number of uncommitted operations of a translog stats. If a
periodic flush triggered by afterWriteOperation completes before we sync
translog, then the last commit is not safe. We also need to sync
translog from Engine instead of the translog so that we can advance the
safe commit.

Relates #51905
Closes #52223
@astefan
Copy link
Contributor

astefan commented Feb 17, 2020

@astefan astefan reopened this Feb 17, 2020
@ywelsch ywelsch self-assigned this Feb 19, 2020
dnhatn added a commit that referenced this issue Feb 25, 2020
We aren't able to reproduce or figure out the reason that failed this test.
This commit adds more assertions so we can narrow the scope.

Relates #52223
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Feb 26, 2020
Since elastic#51905, we use the local checkpoint of the safe commit to
calculate the number of uncommitted operations of a translog stats. If a
periodic flush triggered by afterWriteOperation completes before we sync
translog, then the last commit is not safe. We also need to sync
translog from Engine instead of the translog so that we can advance the
safe commit.

Relates elastic#51905
Closes elastic#52223
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Feb 26, 2020
We aren't able to reproduce or figure out the reason that failed this test.
This commit adds more assertions so we can narrow the scope.

Relates elastic#52223
dnhatn added a commit that referenced this issue Feb 26, 2020
Since #51905, we use the local checkpoint of the safe commit to
calculate the number of uncommitted operations of a translog stats. If a
periodic flush triggered by afterWriteOperation completes before we sync
translog, then the last commit is not safe. We also need to sync
translog from Engine instead of the translog so that we can advance the
safe commit.

Relates #51905
Closes #52223
dnhatn added a commit that referenced this issue Feb 26, 2020
We aren't able to reproduce or figure out the reason that failed this test.
This commit adds more assertions so we can narrow the scope.

Relates #52223
@ywelsch
Copy link
Contributor

ywelsch commented Mar 3, 2020

@dnhatn has added more assertions here, but there haven't been any recent failures. I'll close this for now, but please report and reopen issue on any future failures.

@ywelsch ywelsch closed this as completed Mar 3, 2020
@tlrx
Copy link
Member

tlrx commented Mar 25, 2020

This test failed today on 7.x with a similar assertion error:

java.lang.AssertionError: last commit [{translog_uuid=-eXKsvspTWe4_IpuWE2ZJA, history_uuid=Q7T02qpRT3-59zTo-2JICw, local_checkpoint=2, max_seq_no=2, max_unsafe_auto_id_timestamp=-1}]
Expected: <0>
     but: was <3>
	at __randomizedtesting.SeedInfo.seed([1D69D57C6EE18EC5:9C6C2B88A29FD889]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.elasticsearch.index.shard.IndexShardIT.testMaybeFlush(IndexShardIT.java:371)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:834)

Build scan: https://gradle-enterprise.elastic.co/s/i7xtizngehx4u

It does not reproduce locally:

 ./gradlew ':server:integTest' --tests "org.elasticsearch.index.shard.IndexShardIT.testMaybeFlush" \
  -Dtests.seed=1D69D57C6EE18EC5 \
  -Dtests.security.manager=true \
  -Dtests.locale=es-US \
  -Dtests.timezone=America/Marigot \
  -Dcompiler.java=13

@tlrx tlrx reopened this Mar 25, 2020
dnhatn added a commit that referenced this issue Apr 1, 2020
dnhatn added a commit that referenced this issue Apr 1, 2020
@dnhatn
Copy link
Member

dnhatn commented Apr 1, 2020

We still do not have enough info for this test failure. I've enabled the debug logs of the engine in this test. I am closing this issue again. Please reopen when it fails.

@dnhatn dnhatn closed this as completed Apr 1, 2020
@gwbrown
Copy link
Contributor

gwbrown commented Apr 1, 2020

This failed again in 7.x. Build scan, Jenkins

@gwbrown gwbrown reopened this Apr 1, 2020
@williamrandolph
Copy link
Contributor

And again, in 7.x: Build scan, Jenkins

@mayya-sharipova
Copy link
Contributor

And again, in 7.x: Build scan, Jenkins

@dimitris-athanasiou
Copy link
Contributor

Ang again, in 7.x: Build scan, Jenkins

@dnhatn
Copy link
Member

dnhatn commented Apr 6, 2020

I am looking at this issue now.

@dnhatn
Copy link
Member

dnhatn commented Apr 6, 2020

I've opened #54839.

dnhatn added a commit that referenced this issue Apr 7, 2020
IndexShardIT#testMaybeFlush relies on the assumption that the safe commit 
and translog deletion policy have advanced after IndexShard#sync returns . 
This assumption does not hold if there's a race with the global checkpoint sync.
Closes #52223
dnhatn added a commit that referenced this issue Apr 8, 2020
IndexShardIT#testMaybeFlush relies on the assumption that the safe commit
and translog deletion policy have advanced after IndexShard#sync returns .
This assumption does not hold if there's a race with the global checkpoint sync.

Closes #52223
dnhatn added a commit that referenced this issue Apr 8, 2020
IndexShardIT#testMaybeFlush relies on the assumption that the safe commit
and translog deletion policy have advanced after IndexShard#sync returns .
This assumption does not hold if there's a race with the global checkpoint sync.

Closes #52223
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >test-failure Triaged test failures from CI
Projects
None yet
10 participants