
New master repeatedly reroutes and fetches shard store of recovering replica #40107

Closed
dnhatn opened this issue Mar 15, 2019 · 3 comments · Fixed by #42287
Assignees
Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes)

Comments

@dnhatn
Member

dnhatn commented Mar 15, 2019

If a master fails over while a replica is performing phase 1 of recovery, the new master will repeatedly try, but fail, to fetch the shard store of the recovering replica until that replica completes phase 1. Some consequences of this:

  • Spams the log files over and over
  • Floods the master node with many reroute tasks
  • If there's an unassigned shard (with the same shardId as the recovering replica), it won't be allocated until the recovery completes
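The feedback loop behind these symptoms can be illustrated with a toy model (a hedged sketch, not Elasticsearch code; the class, method, and the number of phase-1 rounds are invented for illustration): each failed shard-store fetch prompts the master to schedule another reroute, which prompts another fetch, until the replica finishes phase 1.

```java
// Toy model of the reroute/fetch loop described above. Not Elasticsearch code:
// RerouteLoopSketch and its parameters are hypothetical stand-ins.
public class RerouteLoopSketch {

    // Returns how many reroute rounds the master runs before the unassigned
    // copy can be allocated, given how many rounds phase 1 still needs.
    static int reroutesUntilAllocation(int phase1RoundsRemaining) {
        int reroutes = 0;
        boolean allocated = false;
        while (!allocated) {
            reroutes++; // master schedules a reroute and fetches the shard store
            boolean fetchSucceeded = phase1RoundsRemaining <= 0; // store only listable after phase 1
            if (fetchSucceeded) {
                allocated = true; // the unassigned copy can finally be allocated
            } else {
                phase1RoundsRemaining--; // warning logged, master retries
            }
        }
        return reroutes;
    }

    public static void main(String[] args) {
        // If phase 1 takes 3 more rounds, the master reroutes 4 times in total.
        System.out.println("reroutes until allocation: " + reroutesUntilAllocation(3)); // prints 4
    }
}
```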

/cc @ywelsch

@dnhatn dnhatn added the :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. label Mar 15, 2019
@dnhatn dnhatn self-assigned this Mar 15, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn dnhatn added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Mar 16, 2019
@DaveCTurner
Contributor

Relates #29140 (comment) and issues linked from there.

ywelsch added a commit that referenced this issue May 22, 2019
A shard that is undergoing peer recovery is subject to logging warnings of the form

org.elasticsearch.action.FailedNodeException: Failed node [XYZ]
...
Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in ...

These failures are actually harmless, and expected to happen while a peer recovery is ongoing (i.e.
there is an IndexShard instance, but no proper IndexCommit just yet).
As these failures are currently bubbled up to the master, they cause unnecessary reroutes and
confusion amongst users due to being logged as warnings.

Closes #40107
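The shape of the fix can be sketched as follows (a hedged, simplified model, not the actual code from #42287; `StoreListingSketch`, `listStoreMetadata`, `ShardState`, and the return type are invented stand-ins): while a shard is still recovering and has no Lucene commit yet, the store listing reports "no data" instead of propagating a node-level failure that the master would answer with another reroute.

```java
// Simplified model of the idea behind the fix: treat a missing segments_N file
// on a recovering shard as an expected empty result, not a failed-node error.
// All names here are hypothetical; this is not Elasticsearch's actual API.
import java.util.Optional;

public class StoreListingSketch {

    enum ShardState { RECOVERING, STARTED }

    static Optional<String> listStoreMetadata(ShardState state, boolean hasCommit) {
        if (!hasCommit) {
            if (state == ShardState.RECOVERING) {
                // Expected during phase 1: the IndexShard exists, but no proper
                // IndexCommit yet. Answer "empty store" so the master does not
                // log a warning and schedule an unnecessary reroute.
                return Optional.empty();
            }
            // A started shard with no commit really is an error.
            throw new IllegalStateException("no segments* file found in store");
        }
        return Optional.of("segments_1"); // placeholder for real store metadata
    }

    public static void main(String[] args) {
        System.out.println(listStoreMetadata(ShardState.RECOVERING, false).isPresent()); // prints false
        System.out.println(listStoreMetadata(ShardState.STARTED, true).orElse("none")); // prints segments_1
    }
}
```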
ywelsch added two further commits that referenced this issue May 22, 2019 (same commit message as above)
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019
…2287)

(same commit message as the May 22 commits above; closes elastic#40107)
@PhaedrusTheGreek
Contributor

Symptomatic error messages:

[2019-08-28T03:50:58,000][WARN ][o.e.g.GatewayAllocator$InternalReplicaShardAllocator] [node1] [logstash-2019.08.25][8]: failed to list shard for shard_store on node [K89X-xcdsfsdvvDDFfSs]
org.elasticsearch.action.FailedNodeException: Failed node
...
Caused by: org.elasticsearch.transport.RemoteTransportException: [node1][192.168.0.11:9302][internal:cluster/nodes/indices/shard/store[n]]
Caused by: org.elasticsearch.ElasticsearchException: Failed to list store metadata for shard [[logstash-2019.08.25][8]]
...
Caused by: java.io.FileNotFoundException: no segments* file found in store

Related conditions:

  • peer recovery is slow
  • heap utilization is high

4 participants