Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] IndexRecoveryIT.testRerouteRecovery : Paths exist that should have been deleted #32686

Closed
albertzaharovits opened this issue Aug 7, 2018 · 11 comments
Assignees
Labels
:Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. >test-failure Triaged test failures from CI v6.4.1 v7.0.0-beta1

Comments

@albertzaharovits
Copy link
Contributor

Build failures: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.4+matrix-java-periodic/ES_BUILD_JAVA=java10,ES_RUNTIME_JAVA=java10,nodes=virtual&&linux/21/console

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=java10,ES_RUNTIME_JAVA=java11,nodes=virtual&&linux/224/console

Root cause:

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=5109, name=elasticsearch[node_t0][clusterApplierService#updateTask][T#1], state=RUNNABLE, group=TGRP-IndexRecoveryIT]
	at __randomizedtesting.SeedInfo.seed([E8CEC7E4E4750104:456574B0D12D6346]:0)
Caused by: java.lang.AssertionError: Paths exist that should have been deleted: [/var/lib/jenkins/workspace/elastic+elasticsearch+6.4+matrix-java-periodic/ES_BUILD_JAVA/java10/ES_RUNTIME_JAVA/java10/nodes/virtual&&linux/server/build/testrun/integTest/J1/temp/org.elasticsearch.indices.recovery.IndexRecoveryIT_E8CEC7E4E4750104-001/tempDir-004/data/nodes/0/indices/RJoRvWIVRDyM1toTfMV2tw/0]
	at __randomizedtesting.SeedInfo.seed([E8CEC7E4E4750104]:0)
	at org.elasticsearch.env.NodeEnvironment.assertPathsDoNotExist(NodeEnvironment.java:469)
	at org.elasticsearch.env.NodeEnvironment.deleteShardDirectoryUnderLock(NodeEnvironment.java:459)
	at org.elasticsearch.env.NodeEnvironment.deleteShardDirectorySafe(NodeEnvironment.java:399)
	at org.elasticsearch.indices.IndicesService.deleteShardStore(IndicesService.java:770)
	at org.elasticsearch.indices.store.IndicesStore$ShardActiveResponseHandler.lambda$allNodesResponded$2(IndicesStore.java:288)
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$runOnApplierThread$0(ClusterApplierService.java:306)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.apply(ClusterApplierService.java:155)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:399)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:160)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:624)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1135)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:844)

It does not reproduce:

REPRODUCE WITH: ./gradlew :server:integTest \
  -Dtests.seed=B4F936D0EF2B4D07 \
  -Dtests.class=org.elasticsearch.indices.recovery.IndexRecoveryIT \
  -Dtests.method="testRerouteRecovery" \
  -Dtests.security.manager=true \
  -Dtests.locale=fr-TN \
  -Dtests.timezone=Eire
@albertzaharovits albertzaharovits added >test-failure Triaged test failures from CI :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. v7.0.0 v6.4.1 labels Aug 7, 2018
@ywelsch
Copy link
Contributor

ywelsch commented Aug 14, 2018

Relates to #29140

@javanna
Copy link
Member

javanna commented Aug 28, 2018

Here is a different test that failed with the same error: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.4+periodic/190/console

jpountz added a commit that referenced this issue Sep 5, 2018
jpountz added a commit that referenced this issue Sep 5, 2018
jpountz added a commit that referenced this issue Sep 5, 2018
@original-brownbear original-brownbear self-assigned this Sep 24, 2018
@original-brownbear
Copy link
Member

original-brownbear commented Sep 27, 2018

I figured out where this is coming from but I'm not sure how to fix it.

The issue is, that org.elasticsearch.gateway.MetaDataStateFormat#read will create a new SimpleFSDirectory for the for the _state directory without holding any lock to the shard. If this happens after the shard directory is deleted in org.elasticsearch.env.NodeEnvironment#deleteShardDirectoryUnderLock but before the assertion that the path is gone is made, then the shard path exists again containing nothing but an empty _state directory and the assertion fails.

I think ideally we should avoid creating the _state directory in org.elasticsearch.gateway.MetaDataStateFormat#read because it's a read only operation but since FSDirectory will always create its underlying path if it doesn't exist I'm not sure how to accomplish that (without adding additional locking).

@DaveCTurner
Copy link
Contributor

This is a long-standing and, unfortunately, complicated issue. The fix you propose was #19338 and an alternative was #29140 (and see linked issues there for further discussion) but neither of these fixes were satisfactory.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Sep 27, 2018
* Allow empty state directory to prevent test from failing
* Closes elastic#32686
original-brownbear added a commit that referenced this issue Sep 28, 2018
* TESTS: Relax Assertion About Deleting Shard Dir

* Allow empty state directory to prevent test from failing
* Closes #32686
@droberts195
Copy link
Contributor

A failure occurred in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=oraclelinux/38/console that suggests #34120 doesn't fix this in every case.

The error is this:

06:45:17 ERROR   0.76s J1 | IndicesLifecycleListenerIT.testIndexStateShardChanged <<< FAILURES!
06:45:17    > Throwable #1: com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=10795, name=elasticsearch[node_t0][clusterApplierService#updateTask][T#1], state=RUNNABLE, group=TGRP-IndicesLifecycleListenerIT]
06:45:17    > 	at __randomizedtesting.SeedInfo.seed([A8D19CF9DD9EFD43:7E04533A46A0D37A]:0)
06:45:17    > Caused by: java.lang.AssertionError: Paths exist that should have been deleted: [/var/lib/jenkins/workspace/elastic+elasticsearch+6.x+multijob-unix-compatibility/os/oraclelinux/server/build/testrun/integTest/J1/temp/org.elasticsearch.indices.IndicesLifecycleListenerIT_A8D19CF9DD9EFD43-001/tempDir-002/d1/nodes/0/indices/2HM7dFt2TTqo3nNIgS8seA/2, /var/lib/jenkins/workspace/elastic+elasticsearch+6.x+multijob-unix-compatibility/os/oraclelinux/server/build/testrun/integTest/J1/temp/org.elasticsearch.indices.IndicesLifecycleListenerIT_A8D19CF9DD9EFD43-001/tempDir-002/d0/nodes/0/indices/2HM7dFt2TTqo3nNIgS8seA/2]

Looking elsewhere in the log file it seems that 2HM7dFt2TTqo3nNIgS8seA/2 is the state directory, as there are messages about .../2HM7dFt2TTqo3nNIgS8seA/2/_state/state-1.st.

The repro command for this failure was:

./gradlew :server:integTest \
  -Dtests.seed=A8D19CF9DD9EFD43 \
  -Dtests.class=org.elasticsearch.indices.IndicesLifecycleListenerIT \
  -Dtests.method="testIndexStateShardChanged" \
  -Dtests.security.manager=true \
  -Dtests.locale=es-AR \
  -Dtests.timezone=America/Toronto \
  -Dcompiler.java=10 \
  -Druntime.java=8

This didn't reproduce locally.

@droberts195 droberts195 reopened this Oct 3, 2018
@DaveCTurner
Copy link
Contributor

It looks like #34120 hasn't been backported to 6.x yet. @original-brownbear could you take a look?

@original-brownbear
Copy link
Member

@DaveCTurner yea sorry about that, will do either later today or tomorrow at the very latest (sorry public holiday here today and a little busy because of it :))

@DaveCTurner
Copy link
Contributor

Oh I forgot about the holiday there today. Enjoy your day off and don't worry about this, I'll try and take care of it.

DaveCTurner pushed a commit that referenced this issue Oct 3, 2018
* TESTS: Relax Assertion About Deleting Shard Dir

* Allow empty state directory to prevent test from failing
* Closes #32686
DaveCTurner pushed a commit that referenced this issue Oct 3, 2018
* TESTS: Relax Assertion About Deleting Shard Dir

* Allow empty state directory to prevent test from failing
* Closes #32686
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Oct 3, 2018
* TESTS: Relax Assertion About Deleting Shard Dir
* Allow empty state directory to prevent test from failing
* Closes elastic#32686

Backport of elastic#34120 and some associated changes.

Co-authored-by: Armin Braun <[email protected]>
DaveCTurner added a commit that referenced this issue Oct 4, 2018
* TESTS: Relax Assertion About Deleting Shard Dir
* Allow empty state directory to prevent test from failing
* Closes #32686

Backport of #34120 and some associated changes.

Co-authored-by: Armin Braun <[email protected]>
@original-brownbear
Copy link
Member

Closing this since the backporting was handling by @DaveCTurner (thanks for that! :)).

kcm pushed a commit that referenced this issue Oct 30, 2018
* TESTS: Relax Assertion About Deleting Shard Dir

* Allow empty state directory to prevent test from failing
* Closes #32686
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. >test-failure Triaged test failures from CI v6.4.1 v7.0.0-beta1
Projects
None yet
Development

No branches or pull requests

9 participants