
Shard stuck in findSafeCommitPoint / Commit list must not empty #57091

Closed
Aketzu opened this issue May 24, 2020 · 4 comments
Labels
blocker, >bug, :Distributed Indexing/Engine, Team:Distributed (Obsolete)

Comments


Aketzu commented May 24, 2020

Elasticsearch version (bin/elasticsearch --version): Version: 7.7.0, Build: default/deb/81a1e9eda8e6183f5237786246f6dced26a10eaf/2020-05-12T02:01:37.602180Z, JVM: 14

Plugins installed: []

JVM version (java -version): (Elasticsearch bundled) openjdk 14 2020-03-17

OS version (uname -a if on a Unix-like system): Debian 9.12, Linux elastic-n1 4.9.0-12-amd64 #1 SMP Debian 4.9.210-1 (2020-01-20) x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

At some point a few indices simply refused to recover anymore, resulting in data loss. That obviously should not happen.

A total of 24 shards across 4 indices failed. All of them were system indices (.ml-state, .ml-notifications, .ml-anomalies-shared, .logstash) and most were very small (~1 MB; .ml-anomalies-shared was 250 MB).

Steps to reproduce:

Not entirely sure how or when exactly it happened. It might have been caused by earlier corruption or cluster problems.

I had to delete the affected indices to get reporting working again. I did take copies of the files before deletion in case anybody wants to take a look at them.

The .logstash index (restored from snapshot) reports version.created: 6010199.
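(For reference, version.created is Elasticsearch's encoded version ID. Below is a minimal decoding sketch, assuming the usual encoding of major * 1,000,000 + minor * 10,000 + revision * 100 + build; the helper class is illustrative, not part of Elasticsearch. 6010199 decodes to 6.1.1, i.e. the index was created on a pre-6.2 version.)

// Illustrative decoder for the version.created index setting (not Elasticsearch code).
// Assumed encoding: major * 1_000_000 + minor * 10_000 + revision * 100 + build.
public final class VersionCreatedDecoder {
    static String decode(int versionId) {
        int major = versionId / 1_000_000;
        int minor = (versionId % 1_000_000) / 10_000;
        int revision = (versionId % 10_000) / 100;
        return major + "." + minor + "." + revision;
    }

    public static void main(String[] args) {
        System.out.println(decode(6010199)); // prints 6.1.1
    }
}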

Timeline

19:10:01 elastic-n4: automated updates for bind9 cause Debian to toggle the interface down and up, which results in DHCP assigning a new address [which is really a facepalm]
19:10:34 elastic-n4 (master) loses communication with the rest of the cluster
19:10:36 elastic-n1 elected as new master
19:14:47 - 19:16:38 elastic-n1 busy-loops (30/sec) on monitoring exporter failures because the elastic-n2 queue is full
19:15:22 - 19:16:39 elastic-n3 reports the master as lost and elastic-n1 being elected, twice
19:25:45 elastic-n4 starts spamming elastic-n1 with join requests to the wrong IP
20:14:41 restarted elastic-n4; communications finally recover
20:25:11 .ml-state recovery from n1 to n4 fails (Commit list must not empty); index yellow
21:36:56 restarted elastic-n1
21:54:39 .ml-state recovery fails on n1 (Commit list must not empty); index red

I did a /_flush/synced at some point but don't remember exactly when; likely before the elastic-n4 restart.

Provide logs (if relevant):

[2020-05-22T20:25:11,075][WARN ][o.e.c.r.a.AllocationService] [elastic-n1.lan.vilant.com] failing shard [failed shard, shard [.ml-state][2], node[p8qQrkI8Q7ag7Ddt7o6myQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=xE6ysNXISBust7s1fw7ZMg], unassigned_info[[reason=NODE_LEFT], at[2020-05-22T16:11:36.297Z], delayed=false, details[node_left [p8qQrkI8Q7ag7Ddt7o6myQ]], allocation_status[no_attempt]], message [failed recovery], failure [RecoveryFailedException[[.ml-state][2]: Recovery failed from {elastic-n1.lan.vilant.com}{mIA4_vOvRPSHiJMWkpU-jw}{F8yJhCiXSz6w2YmK_wFQEg}{10.6.1.175}{10.6.1.175:9300}{dilmrt}{rack_id=Espoo_1, ml.machine_memory=14690267136, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elastic-n4.lan.vilant.com}{p8qQrkI8Q7ag7Ddt7o6myQ}{ymEcasqPSOqfaozi3W70SA}{10.6.0.162}{10.6.0.162:9300}{dilmrt}{rack_id=Espoo_1, ml.machine_memory=21017317376, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elastic-n1.lan.vilant.com][10.6.1.175:9300][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[elastic-n4.lan.vilant.com][10.6.0.162:9300][internal:index/shard/recovery/prepare_translog]]; nested: IllegalArgumentException[Commit list must not empty]; ], markAsStale [true]]
org.elasticsearch.indices.recovery.RecoveryFailedException: [.ml-state][2]: Recovery failed from {elastic-n1.lan.vilant.com}{mIA4_vOvRPSHiJMWkpU-jw}{F8yJhCiXSz6w2YmK_wFQEg}{10.6.1.175}{10.6.1.175:9300}{dilmrt}{rack_id=Espoo_1, ml.machine_memory=14690267136, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elastic-n4.lan.vilant.com}{p8qQrkI8Q7ag7Ddt7o6myQ}{ymEcasqPSOqfaozi3W70SA}{10.6.0.162}{10.6.0.162:9300}{dilmrt}{rack_id=Espoo_1, ml.machine_memory=21017317376, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.lambda$doRecovery$2(PeerRecoveryTargetService.java:247) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$1.handleException(PeerRecoveryTargetService.java:292) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1139) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1139) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.transport.InboundHandler.lambda$handleException$2(InboundHandler.java:244) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.7.0.jar:7.7.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.transport.RemoteTransportException: [elastic-n1.lan.vilant.com][10.6.1.175:9300][internal:index/shard/recovery/start_recovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: Phase[1] prepare target for translog failed
        at org.elasticsearch.indices.recovery.RecoverySourceHandler.lambda$prepareTargetForTranslog$32(RecoverySourceHandler.java:626) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.action.ActionListener$4.onFailure(ActionListener.java:173) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.transport.PlainTransportFuture.handleException(PlainTransportFuture.java:97) ~[elasticsearch-7.7.0.jar:7.7.0]
        ... 6 more
Caused by: org.elasticsearch.transport.RemoteTransportException: [elastic-n4.lan.vilant.com][10.6.0.162:9300][internal:index/shard/recovery/prepare_translog]
Caused by: java.lang.IllegalArgumentException: Commit list must not empty
        at org.elasticsearch.index.engine.CombinedDeletionPolicy.findSafeCommitPoint(CombinedDeletionPolicy.java:182) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.index.store.Store.trimUnsafeCommits(Store.java:1524) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.index.engine.InternalEngine.trimUnsafeCommits(InternalEngine.java:2865) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:227) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:200) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1625) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.index.shard.IndexShard.openEngineAndSkipTranslogRecovery(IndexShard.java:1603) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.recovery.RecoveryTarget.lambda$prepareForTranslogOperations$0(RecoveryTarget.java:287) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.action.ActionListener.completeWith(ActionListener.java:325) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.recovery.RecoveryTarget.prepareForTranslogOperations(RecoveryTarget.java:285) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:383) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$PrepareForTranslogOperationsRequestHandler.messageReceived(PeerRecoveryTargetService.java:377) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) ~[?:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:225) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$0(SecurityServerTransportInterceptor.java:306) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.authorizeSystemUser(AuthorizationService.java:385) ~[?:?]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.authorize(AuthorizationService.java:190) ~[?:?]
        at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.lambda$inbound$1(ServerTransportFilter.java:129) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$authenticateAsync$2(AuthenticationService.java:324) ~[?:?]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lambda$lookForExistingAuthentication$6(AuthenticationService.java:386) ~[?:?]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.lookForExistingAuthentication(AuthenticationService.java:397) ~[?:?]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.authenticateAsync(AuthenticationService.java:321) ~[?:?]
        at org.elasticsearch.xpack.security.authc.AuthenticationService$Authenticator.access$000(AuthenticationService.java:263) ~[?:?]
        at org.elasticsearch.xpack.security.authc.AuthenticationService.authenticate(AuthenticationService.java:174) ~[?:?]
        at org.elasticsearch.xpack.security.transport.ServerTransportFilter$NodeProfile.inbound(ServerTransportFilter.java:120) ~[?:?]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:313) ~[?:?]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:264) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.7.0.jar:7.7.0]
        ... 3 more

https://github.com/elastic/elasticsearch/blob/v7.7.0/server/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java#L182

https://github.com/elastic/elasticsearch/blob/v7.7.0/server/src/main/java/org/elasticsearch/index/store/Store.java#L1524

@Aketzu added the >bug and needs:triage labels on May 24, 2020
@ywelsch added the :Distributed Indexing/Engine label and removed the needs:triage label on May 25, 2020
elasticmachine (Collaborator) commented May 25, 2020

Pinging @elastic/es-distributed (:Distributed/Engine)

@elasticmachine added the Team:Distributed (Obsolete) label on May 25, 2020

ywelsch (Contributor) commented May 25, 2020

Can you provide the full logs somewhere? The log entries you showed only cover the failed peer recovery (i.e. recovering a replica failed) where your cluster was yellow, not red.

Can you also provide some details about the filesystem you're using? It sounds like some files just got wiped on disk (perhaps by an external process?). Further, it would be interesting to get a directory listing from elastic-n1, in particular from the folder named after the index UUID of the .ml-state index. You can find the index UUID by looking at GET _cat/indices.


ywelsch (Contributor) commented May 26, 2020

We've identified the bug; it appears to affect indices that were created in an ES version before 6.2. Under certain conditions these indices can fail to be opened in ES 7.7.0. We're working on a fix.

dnhatn added a commit that referenced this issue May 27, 2020
If an upgraded node is restarted multiple times without flushing a new
index commit, then we will wrongly exclude all commits from the starting
commits. This bug is reproducible with these minimal steps: (1) create
an empty index on 6.1.4 with translog retention disabled, (2) upgrade
the cluster to 7.7.0, (3) restart the upgraded cluster. The problem
is that the new translog policy can trim the translog without creating a
new index commit, while the existing commit still refers to the previous
translog generation.

Closes #57091
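
The commit message above describes the failure mode: after the upgraded translog policy trims the translog without flushing a new commit, every index commit on disk still refers to a translog generation that no longer exists, so the candidate list handed to CombinedDeletionPolicy.findSafeCommitPoint ends up empty and the "Commit list must not empty" IllegalArgumentException from the logs is thrown. Below is a minimal, hypothetical Java sketch of that interaction; the CommitInfo type, the startingCommits filter, and the concrete generation numbers are illustrative stand-ins, not the real Elasticsearch implementation.

import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the failure mode described in the commit message above.
// CommitInfo stands in for a Lucene IndexCommit; only the translog generation
// each commit references is modelled here.
public final class EmptyCommitListSketch {

    record CommitInfo(long translogGeneration) {}

    // Stand-in for the filtering applied before the safe-commit lookup:
    // commits whose referenced translog generation has already been trimmed
    // away are excluded from the starting commits.
    static List<CommitInfo> startingCommits(List<CommitInfo> commits, long minRetainedTranslogGen) {
        List<CommitInfo> kept = new ArrayList<>();
        for (CommitInfo commit : commits) {
            if (commit.translogGeneration() >= minRetainedTranslogGen) {
                kept.add(commit);
            }
        }
        return kept;
    }

    // Stand-in for CombinedDeletionPolicy.findSafeCommitPoint: it requires at
    // least one candidate commit and throws otherwise.
    static CommitInfo findSafeCommitPoint(List<CommitInfo> commits) {
        if (commits.isEmpty()) {
            throw new IllegalArgumentException("Commit list must not empty");
        }
        return commits.get(commits.size() - 1); // newest eligible commit
    }

    public static void main(String[] args) {
        // The only commit on disk still points at translog generation 3 ...
        List<CommitInfo> onDisk = List.of(new CommitInfo(3));
        // ... but the translog has already been trimmed up to generation 5
        // without flushing a new commit, so every commit is excluded.
        long minRetainedTranslogGen = 5;
        CommitInfo safe = findSafeCommitPoint(startingCommits(onDisk, minRetainedTranslogGen));
        System.out.println(safe); // never reached: IllegalArgumentException above
    }
}

Running the sketch produces the same "Commit list must not empty" message that appears in the failed recoveries above.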
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue May 27, 2020
If an upgraded node is restarted multiple times without flushing a new
index commit, then we will wrongly exclude all commits from the starting
commits. This bug is reproducible with these minimal steps: (1) create
an empty index on 6.1.4 with translog retention disabled, (2) upgrade
the cluster to 7.7.0, (3) restart the upgraded cluster. The problem
is that the new translog policy can trim the translog without creating a
new index commit, while the existing commit still refers to the previous
translog generation.

Closes elastic#57091

dnhatn (Member) commented May 27, 2020

Fixed in #57187. @Aketzu Thanks for reporting the issue.

@dnhatn closed this as completed on May 27, 2020
dnhatn added a commit that referenced this issue May 27, 2020
If an upgraded node is restarted multiple times without flushing a new
index commit, then we will wrongly exclude all commits from the starting
commits. This bug is reproducible with these minimal steps: (1) create
an empty index on 6.1.4 with translog retention disabled, (2) upgrade
the cluster to 7.7.0, (3) restart the upgraded cluster. The problem
is that the new translog policy can trim the translog without creating a
new index commit, while the existing commit still refers to the previous
translog generation.

Closes #57091
dnhatn added a commit that referenced this issue May 27, 2020
If an upgraded node is restarted multiple times without flushing a new
index commit, then we will wrongly exclude all commits from the starting
commits. This bug is reproducible with these minimal steps: (1) create
an empty index on 6.1.4 with translog retention disabled, (2) upgrade
the cluster to 7.7.0, (3) restart the upgraded cluster. The problem
is that the new translog policy can trim the translog without creating a
new index commit, while the existing commit still refers to the previous
translog generation.

Closes #57091