
Shard stuck in findSafeCommitPoint / Commit list must not empty #57091

Closed
Aketzu opened this issue May 24, 2020 · 4 comments
Labels
blocker, >bug, :Distributed Indexing/Engine, Team:Distributed (Obsolete)

Comments


Aketzu commented May 24, 2020

Elasticsearch version (bin/elasticsearch --version): Version: 7.7.0, Build: default/deb/81a1e9eda8e6183f5237786246f6dced26a10eaf/2020-05-12T02:01:37.602180Z, JVM: 14

Plugins installed: []

JVM version (java -version): (Elasticsearch bundled) openjdk 14 2020-03-17

OS version (uname -a if on a Unix-like system): Debian 9.12, Linux elastic-n1 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

At some point a few indices simply refused to recover anymore, resulting in data loss. That obviously should not happen.

A total of 24 shards across 4 indices failed. All of them were system indices (.ml-state, .ml-notifications, .ml-anomalies-shared, .logstash) and most were very small (~1 MB; .ml-anomalies-shared was 250 MB).

Steps to reproduce:

Not entirely sure how or when exactly it happened. It might have been caused by earlier corruption or cluster problems.

I had to delete the affected indices to get reporting working again. I did take copies of the files before deletion in case anybody wants to take a look at them.

The .logstash index (restored from snapshot) reports version.created: 6010199.
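(For reference, version.created is Elasticsearch's encoded version ID. Below is a minimal decoding sketch, assuming the usual encoding of major * 1,000,000 + minor * 10,000 + revision * 100 + build; the helper class is illustrative, not part of Elasticsearch. 6010199 decodes to 6.1.1, i.e. the index was created on a pre-6.2 version.)

// Illustrative decoder for the version.created index setting (not Elasticsearch code).
// Assumed encoding: major * 1_000_000 + minor * 10_000 + revision * 100 + build.
public final class VersionCreatedDecoder {
    static String decode(int versionId) {
        int major = versionId / 1_000_000;
        int minor = (versionId % 1_000_000) / 10_000;
        int revision = (versionId % 10_000) / 100;
        return major + "." + minor + "." + revision;
    }

    public static void main(String[] args) {
        System.out.println(decode(6010199)); // prints 6.1.1
    }
}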

Timeline

19:10:01 elastic-n4: automated updates for bind9 cause Debian to toggle the interface down and up, which results in DHCP assigning a new address [which is really a facepalm]
19:10:34 elastic-n4 (master) loses communication with the rest of the cluster
19:10:36 elastic-n1 elected as new master
19:14:47 - 19:16:38 elastic-n1 busy-loops (30/sec) on monitoring exporter failures because the elastic-n2 queue is full
19:15:22 - 19:16:39 elastic-n3 reports the master as lost and elastic-n1 being elected, twice
19:25:45 elastic-n4 starts spamming elastic-n1 with join requests to the wrong IP
20:14:41 restarted elastic-n4; communications finally recover
20:25:11 .ml-state recovery from n1 to n4 fails (Commit list must not empty); index yellow
21:36:56 restarted elastic-n1
21:54:39 .ml-state recovery fails on n1 (Commit list must not empty); index red

I did a /_flush/synced at some point but don't remember exactly when; likely before the elastic-n4 restart.

Provide logs (if relevant):

[2020-05-22T20:25:11,075][WARN ][o.e.c.r.a.AllocationService] [elastic-n1.lan.vilant.com] failing shard [failed shard, shard [.ml-state][2], node[p8qQrkI8Q7ag7Ddt7o6myQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=xE6ysNXISBust7s1fw7ZMg], unassigned_info[[reason=NODE_LEFT], at[2020-05-22T16:11:36.297Z], delayed=false, details[node_left [p8qQrkI8Q7ag7Ddt7o6myQ]], allocation_status[no_attempt]], message [failed recovery], failure [RecoveryFailedException[[.ml-state][2]: Recovery failed from {elastic-n1.lan.vilant.com}{mIA4_vOvRPSHiJMWkpU-jw}{F8yJhCiXSz6w2YmK_wFQEg}{10.6.1.175}{10.6.1.175:9300}{dilmrt}{rack_id=Espoo_1, ml.machine_memory=14690267136, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elastic-n4.lan.vilant.com}{p8qQrkI8Q7ag7Ddt7o6myQ}{ymEcasqPSOqfaozi3W70SA}{10.6.0.162}{10.6.0.162:9300}{dilmrt}{rack_id=Espoo_1, ml.machine_memory=21017317376, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elastic-n1.lan.vilant.com][10.6.1.175:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[elastic-n4.lan.vilant.com][10.6.0.162:9300][internal:index/shard/recovery/prepare_translog]]; nested: IllegalArgumentException[Commit list must not empty]; ], markAsStale [true]]
org.elasticsearch.indices.recovery.RecoveryFailedException: [.ml-state][2]: Recovery failed from {elastic-n1.lan.vilant.com}{mIA4_vOvRPSHiJMWkpU-jw}{F8yJhCiXSz6w2YmK_wFQEg}{10.6.1.175}{10.6.1.175:9300}{dilmrt}{rack_id=Espoo_1, ml.machine_memory=14690267136, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elastic-n4.lan.vilant.com}{p8qQrkI8Q7ag7Ddt7o6myQ}{ymEcasqPSOqfaozi3W70SA}{10.6.0.162}{10.6.0.162:9300}{dilmrt}{rack_id=Espoo_1, ml.machine_memory=21017317376, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:247) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:292) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1139) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1139) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:244) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.7.0.jar:7.7.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [elastic-n1.lan.vilant.com][10.6.1.175:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] prepare target for translog failed
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$prepareTargetForTranslog$32(RecoverySourceHandler.java:626) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.action.ActionListener$4.onFailure(ActionListener.java:173) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[elasticsearch-7.7.0.jar:7.7.0]
        ... 6 more
Caused by: org.elasticsearch.transport.RemoteTransportException: [elastic-n4.lan.vilant.com][10.6.0.162:9300][internal:index/shard/recovery/prepare_translog]
Caused by: java.lang.IllegalArgumentException: Commit list must not empty
        at org.elasticsearch.index.engine.CombinedDeletionPolicy.findSafeCommitPoint(CombinedDeletionPolicy.java:182) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.index.store.Store.trimUnsafeCommits(Store.java:1524) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.index.engine.InternalEngine.trimUnsafeCommits(InternalEngine.java:2865) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:227) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:200) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1625) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.index.shard.IndexShard.openEngineAndSkipTranslogRecovery(IndexShard.java:1603) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.recovery.RecoveryTarget.lambda$prepareForTranslogOperations$0(RecoveryTarget.java:287) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:325) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.recovery.RecoveryTarget.prepareForTranslogOperations(RecoveryTarget.java:285) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:383) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:377) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) ~[?:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$0(SecurityServerTransportInterceptor.java:306) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.authorizeSystemUser(AuthorizationService.java:385) ~[?:?]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.authorize(AuthorizationService.java:190) ~[?:?]
        at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.lambda$inbound$1(ServerTransportFilter.java:129) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$authenticateAsync$2(AuthenticationService.java:324) ~[?:?]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$lookForExistingAuthentication$6(AuthenticationService.java:386) ~[?:?]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lookForExistingAuthentication(AuthenticationService.java:397) ~[?:?]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.authenticateAsync(AuthenticationService.java:321) ~[?:?]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.access$000(AuthenticationService.java:263) ~[?:?]
        at org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:174) ~[?:?]
        at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.inbound(ServerTransportFilter.java:120) ~[?:?]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:313) ~[?:?]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.7.0.jar:7.7.0]
        ... 3 more

https://github.com/elastic/elasticsearch/blob/v7.7.0/server/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java#L182

https://github.com/elastic/elasticsearch/blob/v7.7.0/server/src/main/java/org/elasticsearch/index/store/Store.java#L1524

@Aketzu added the >bug and needs:triage labels on May 24, 2020
@ywelsch added the :Distributed Indexing/Engine label and removed the needs:triage label on May 25, 2020
elasticmachine (Collaborator) commented May 25, 2020

Pinging @elastic/es-distributed (:Distributed/Engine)

@elasticmachine added the Team:Distributed (Obsolete) label on May 25, 2020

ywelsch (Contributor) commented May 25, 2020

Can you provide the full logs somewhere? The log entries you showed only cover the failed peer recovery (i.e. recovering a replica failed) where your cluster was yellow, not red.

Can you also provide some details about the filesystem you're using? It sounds like some files just got wiped on disk (perhaps by an external process?). Further, it would be interesting to get a directory listing from elastic-n1, in particular from the folder named after the index UUID of the .ml-state index. You can find the index UUID by looking at GET _cat/indices.


ywelsch (Contributor) commented May 26, 2020

We've identified the bug; it appears to affect indices that were created in an ES version before 6.2. Under certain conditions these indices can fail to be opened in ES 7.7.0. We're working on a fix.

dnhatn added a commit that referenced this issue May 27, 2020
If an upgraded node is restarted multiple times without flushing a new
index commit, then we will wrongly exclude all commits from the starting
commits. This bug is reproducible with these minimal steps: (1) create
an empty index on 6.1.4 with translog retention disabled, (2) upgrade
the cluster to 7.7.0, (3) restart the upgraded cluster. The problem
is that the new translog policy can trim the translog without creating a
new index commit, while the existing commit still refers to the previous
translog generation.

Closes #57091
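
The commit message above describes the failure mode: after the upgraded translog policy trims the translog without flushing a new commit, every index commit on disk still refers to a translog generation that no longer exists, so the candidate list handed to CombinedDeletionPolicy.findSafeCommitPoint ends up empty and the "Commit list must not empty" IllegalArgumentException from the logs is thrown. Below is a minimal, hypothetical Java sketch of that interaction; the CommitInfo type, the startingCommits filter, and the concrete generation numbers are illustrative stand-ins, not the real Elasticsearch implementation.

import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the failure mode described in the commit message above.
// CommitInfo stands in for a Lucene IndexCommit; only the translog generation
// each commit references is modelled here.
public final class EmptyCommitListSketch {

    record CommitInfo(long translogGeneration) {}

    // Stand-in for the filtering applied before the safe-commit lookup:
    // commits whose referenced translog generation has already been trimmed
    // away are excluded from the starting commits.
    static List<CommitInfo> startingCommits(List<CommitInfo> commits, long minRetainedTranslogGen) {
        List<CommitInfo> kept = new ArrayList<>();
        for (CommitInfo commit : commits) {
            if (commit.translogGeneration() >= minRetainedTranslogGen) {
                kept.add(commit);
            }
        }
        return kept;
    }

    // Stand-in for CombinedDeletionPolicy.findSafeCommitPoint: it requires at
    // least one candidate commit and throws otherwise.
    static CommitInfo findSafeCommitPoint(List<CommitInfo> commits) {
        if (commits.isEmpty()) {
            throw new IllegalArgumentException("Commit list must not empty");
        }
        return commits.get(commits.size() - 1); // newest eligible commit
    }

    public static void main(String[] args) {
        // The only commit on disk still points at translog generation 3 ...
        List<CommitInfo> onDisk = List.of(new CommitInfo(3));
        // ... but the translog has already been trimmed up to generation 5
        // without flushing a new commit, so every commit is excluded.
        long minRetainedTranslogGen = 5;
        CommitInfo safe = findSafeCommitPoint(startingCommits(onDisk, minRetainedTranslogGen));
        System.out.println(safe); // never reached: IllegalArgumentException above
    }
}

Running the sketch produces the same "Commit list must not empty" message that appears in the failed recoveries above.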
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue May 27, 2020
If an upgraded node is restarted multiple times without flushing a new
index commit, then we will wrongly exclude all commits from the starting
commits. This bug is reproducible with these minimal steps: (1) create
an empty index on 6.1.4 with translog retention disabled, (2) upgrade
the cluster to 7.7.0, (3) restart the upgraded cluster. The problem
is that the new translog policy can trim the translog without creating a
new index commit, while the existing commit still refers to the previous
translog generation.

Closes elastic#57091

dnhatn (Member) commented May 27, 2020

Fixed in #57187. @Aketzu Thanks for reporting the issue.

@dnhatn closed this as completed on May 27, 2020
dnhatn added a commit that referenced this issue May 27, 2020
If an upgraded node is restarted multiple times without flushing a new
index commit, then we will wrongly exclude all commits from the starting
commits. This bug is reproducible with these minimal steps: (1) create
an empty index on 6.1.4 with translog retention disabled, (2) upgrade
the cluster to 7.7.0, (3) restart the upgraded cluster. The problem
is that the new translog policy can trim the translog without creating a
new index commit, while the existing commit still refers to the previous
translog generation.

Closes #57091
dnhatn added a commit that referenced this issue May 27, 2020
If an upgraded node is restarted multiple times without flushing a new
index commit, then we will wrongly exclude all commits from the starting
commits. This bug is reproducible with these minimal steps: (1) create
an empty index on 6.1.4 with translog retention disabled, (2) upgrade
the cluster to 7.7.0, (3) restart the upgraded cluster. The problem
is that the new translog policy can trim the translog without creating a
new index commit, while the existing commit still refers to the previous
translog generation.

Closes #57091