Zen2: Persist cluster states the old way on non-master-eligible nodes #36247

Merged (5 commits) on Dec 5, 2018

@ywelsch ywelsch commented Dec 5, 2018

The shard deletion logic (triggered by IndicesStore), which also deletes index metadata on non-master-eligible data nodes, currently races against the new cluster state persistence logic that runs when a node accepts a cluster state. One thread writes the index metadata while another deletes it, which leads to exceptions and tripped assertions (see below). This PR moves cluster state persistence on non-master-eligible nodes back to the cluster applier service, as it was in Zen1. This ensures that the index metadata deletion, which is triggered by the shard deletion logic, runs on the same thread that persists the cluster state.
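
To illustrate the idea of the fix (not the actual Elasticsearch code), here is a minimal sketch in which metadata writes and index deletions are both submitted to one single-threaded executor, so they can never interleave. The class name, the `state-N.st` file naming, and the executor standing in for the cluster applier thread are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SingleThreadedStatePersistence {
    // Hypothetical stand-in for the cluster applier thread: one thread runs every task in order.
    private final ExecutorService applier = Executors.newSingleThreadExecutor();
    private final Path indexDir;

    SingleThreadedStatePersistence(Path indexDir) {
        this.indexDir = indexDir;
    }

    // Writes a (fake) index metadata file under <indexDir>/_state on the applier thread.
    Future<?> persistIndexMetadata(long generation) {
        return applier.submit(() -> {
            Path stateDir = indexDir.resolve("_state");
            Files.createDirectories(stateDir);
            Files.write(stateDir.resolve("state-" + generation + ".st"),
                    ("generation " + generation).getBytes(StandardCharsets.UTF_8));
            return null;
        });
    }

    // Deletes the whole index directory (children first) on the same applier thread.
    Future<?> deleteIndex() {
        return applier.submit(() -> {
            if (Files.exists(indexDir)) {
                List<Path> paths;
                try (Stream<Path> walk = Files.walk(indexDir)) {
                    paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
                }
                for (Path p : paths) {
                    Files.delete(p);
                }
            }
            return null;
        });
    }

    void close() {
        applier.shutdown();
    }

    // Queues alternating writes and deletes and returns how many tasks failed.
    static int runInterleaved(int rounds) throws Exception {
        Path indexDir = Files.createTempDirectory("index-demo").resolve("index-uuid");
        SingleThreadedStatePersistence persistence = new SingleThreadedStatePersistence(indexDir);
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < rounds; i++) {
            futures.add(persistence.persistIndexMetadata(i));
            futures.add(persistence.deleteIndex());
        }
        int failures = 0;
        for (Future<?> f : futures) {
            try {
                f.get();
            } catch (ExecutionException e) {
                failures++;
            }
        }
        persistence.close();
        return failures;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("failed tasks: " + runInterleaved(25));
    }
}
```

Because each task runs to completion before the next one starts, a write never observes a half-deleted `_state` directory. If the two operations ran on separate threads instead, the write could fail in the way shown in the trace below.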

Test failure:
https://elasticsearch-ci.elastic.co/view/All/job/elastic+elasticsearch+zen2+feature-branch-periodic/72/testReport/junit/org.elasticsearch.indices.recovery/IndexPrimaryRelocationIT/testPrimaryRelocationWhileIndexing/

[2018-12-04T21:56:29,243][WARN ][o.e.g.GatewayMetaState   ] [node_td4] Exception occurred when setting last accepted state
org.elasticsearch.gateway.WriteStateException: [[test/fMoV8ZJiRqW9eSrmF0MxBA]]: failed to write index state
    at org.elasticsearch.gateway.MetaStateService.writeIndex(MetaStateService.java:229) ~[main/:?]
    at org.elasticsearch.gateway.GatewayMetaState$AtomicClusterStateWriter.writeIndex(GatewayMetaState.java:296) ~[main/:?]
    at org.elasticsearch.gateway.GatewayMetaState$WriteNewIndexMetaData.execute(GatewayMetaState.java:634) ~[main/:?]
    at org.elasticsearch.gateway.GatewayMetaState.writeIndicesMetadata(GatewayMetaState.java:364) ~[main/:?]
    at org.elasticsearch.gateway.GatewayMetaState.updateClusterState(GatewayMetaState.java:338) ~[main/:?]
    at org.elasticsearch.gateway.GatewayMetaState.setLastAcceptedState(GatewayMetaState.java:243) ~[main/:?]
    at org.elasticsearch.cluster.coordination.CoordinationState.handlePublishRequest(CoordinationState.java:332) ~[main/:?]
    at org.elasticsearch.cluster.coordination.Coordinator.handlePublishRequest(Coordinator.java:250) ~[main/:?]
    at org.elasticsearch.cluster.coordination.PublicationTransportHandler.handleIncomingPublishRequest(PublicationTransportHandler.java:412) ~[main/:?]
    at org.elasticsearch.cluster.coordination.PublicationTransportHandler.lambda$new$0(PublicationTransportHandler.java:90) ~[main/:?]
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) [main/:?]
    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1346) [main/:?]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) [main/:?]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_192]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_192]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
Caused by: org.elasticsearch.gateway.WriteStateException: exception during looking up new generation id
    at org.elasticsearch.gateway.MetaDataStateFormat.write(MetaDataStateFormat.java:225) ~[main/:?]
    at org.elasticsearch.gateway.MetaDataStateFormat.write(MetaDataStateFormat.java:209) ~[main/:?]
    at org.elasticsearch.gateway.MetaStateService.writeIndex(MetaStateService.java:224) ~[main/:?]
    ... 16 more
Caused by: java.nio.file.NoSuchFileException: /var/lib/jenkins/workspace/elastic+elasticsearch+zen2+feature-branch-periodic/server/build/testrun/integTest/J1/temp/org.elasticsearch.indices.recovery.IndexPrimaryRelocationIT_2028599FCBC75957-001/tempDir-002/data/nodes/4/indices/fMoV8ZJiRqW9eSrmF0MxBA/_state
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86) ~[?:?]
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
    at sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55) ~[?:?]
    at org.apache.lucene.mockfile.WindowsFS.getKey(WindowsFS.java:59) ~[lucene-test-framework-8.0.0-snapshot-67cdd21996.jar:8.0.0-snapshot-67cdd21996 67cdd21996f716ffb137bbcb8f826794a2632be7 - jimczi - 2018-11-22 09:58:50]
    at org.apache.lucene.mockfile.WindowsFS.onOpen(WindowsFS.java:66) ~[lucene-test-framework-8.0.0-snapshot-67cdd21996.jar:8.0.0-snapshot-67cdd21996 67cdd21996f716ffb137bbcb8f826794a2632be7 - jimczi - 2018-11-22 09:58:50]
    at org.apache.lucene.mockfile.HandleTrackingFS.callOpenHook(HandleTrackingFS.java:81) ~[lucene-test-framework-8.0.0-snapshot-67cdd21996.jar:8.0.0-snapshot-67cdd21996 67cdd21996f716ffb137bbcb8f826794a2632be7 - jimczi - 2018-11-22 09:58:50]
    at org.apache.lucene.mockfile.HandleTrackingFS.newDirectoryStream(HandleTrackingFS.java:315) ~[lucene-test-framework-8.0.0-snapshot-67cdd21996.jar:8.0.0-snapshot-67cdd21996 67cdd21996f716ffb137bbcb8f826794a2632be7 - jimczi - 2018-11-22 09:58:50]
    at java.nio.file.Files.newDirectoryStream(Files.java:525) ~[?:1.8.0_192]
    at org.elasticsearch.gateway.MetaDataStateFormat.findMaxGenerationId(MetaDataStateFormat.java:353) ~[main/:?]
    at org.elasticsearch.gateway.MetaDataStateFormat.write(MetaDataStateFormat.java:222) ~[main/:?]
    at org.elasticsearch.gateway.MetaDataStateFormat.write(MetaDataStateFormat.java:209) ~[main/:?]
    at org.elasticsearch.gateway.MetaStateService.writeIndex(MetaStateService.java:224) ~[main/:?]
    ... 16 more

triggering the following assertion:

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=3065, name=elasticsearch[node_td4][generic][T#3], state=RUNNABLE, group=TGRP-IndexPrimaryRelocationIT]
Caused by: java.lang.AssertionError
    at __randomizedtesting.SeedInfo.seed([2028599FCBC75957]:0)
    at org.elasticsearch.cluster.coordination.CoordinationState.handlePublishRequest(CoordinationState.java:333)
    at org.elasticsearch.cluster.coordination.Coordinator.handlePublishRequest(Coordinator.java:250)
    at org.elasticsearch.cluster.coordination.PublicationTransportHandler.handleIncomingPublishRequest(PublicationTransportHandler.java:412)
    at org.elasticsearch.cluster.coordination.PublicationTransportHandler.lambda$new$0(PublicationTransportHandler.java:90)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
    at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1346)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
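
The innermost cause in the first trace is `Files.newDirectoryStream` being called on an `_state` directory that no longer exists. The following sketch reproduces that failure mode deterministically with plain `java.nio.file`, by deleting the directory before listing it. In the real failure the deletion happened concurrently on another thread; the class and method names here are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

public class MissingStateDirDemo {
    // Returns true if listing a just-deleted directory throws NoSuchFileException.
    static boolean listingFailsAfterDelete() throws IOException {
        Path stateDir = Files.createTempDirectory("index-demo").resolve("_state");
        Files.createDirectory(stateDir);
        // Simulates the other thread removing the index metadata directory.
        Files.delete(stateDir);
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(stateDir)) {
            return false;
        } catch (NoSuchFileException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("listing failed: " + listingFailsAfterDelete());
    }
}
```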

@ywelsch added the >enhancement, v7.0.0, and :Distributed Coordination/Cluster Coordination labels Dec 5, 2018
@elasticmachine

Pinging @elastic/es-distributed

@andrershov andrershov left a comment

LGTM

@DaveCTurner DaveCTurner left a comment

LGTM2

@ywelsch ywelsch merged commit 0b9efff into elastic:zen2 Dec 5, 2018
@ywelsch ywelsch mentioned this pull request Dec 5, 2018
61 tasks
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement v7.0.0-beta1