
New master repeatedly reroutes and fetches shard store of recovering replica #40107

Closed
dnhatn opened this issue Mar 15, 2019 · 3 comments · Fixed by #42287
Assignees
Labels
:Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes)

Comments

@dnhatn
Member

dnhatn commented Mar 15, 2019

If a master fails over while a replica is performing phase 1 of recovery, the new master will repeatedly try, but fail, to fetch the shard store of the recovering replica until that replica completes phase 1. Some consequences of this:

  • Spams the log files over and over
  • Floods the master node with many reroute tasks
  • If there's an unassigned shard (with the same shardId as the recovering replica), it won't be allocated until the recovery completes
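The feedback loop behind these symptoms can be illustrated with a toy model (a hedged sketch, not Elasticsearch code; the class, method, and the number of phase-1 rounds are invented for illustration): each failed shard-store fetch prompts the master to schedule another reroute, which prompts another fetch, until the replica finishes phase 1.

```java
// Toy model of the reroute/fetch loop described above. Not Elasticsearch code:
// RerouteLoopSketch and its parameters are hypothetical stand-ins.
public class RerouteLoopSketch {

    // Returns how many reroute rounds the master runs before the unassigned
    // copy can be allocated, given how many rounds phase 1 still needs.
    static int reroutesUntilAllocation(int phase1RoundsRemaining) {
        int reroutes = 0;
        boolean allocated = false;
        while (!allocated) {
            reroutes++; // master schedules a reroute and fetches the shard store
            boolean fetchSucceeded = phase1RoundsRemaining <= 0; // store only listable after phase 1
            if (fetchSucceeded) {
                allocated = true; // the unassigned copy can finally be allocated
            } else {
                phase1RoundsRemaining--; // warning logged, master retries
            }
        }
        return reroutes;
    }

    public static void main(String[] args) {
        // If phase 1 takes 3 more rounds, the master reroutes 4 times in total.
        System.out.println("reroutes until allocation: " + reroutesUntilAllocation(3)); // prints 4
    }
}
```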

/cc @ywelsch

@dnhatn dnhatn added the :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. label Mar 15, 2019
@dnhatn dnhatn self-assigned this Mar 15, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn dnhatn added :Distributed Coordination/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) and removed :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Mar 16, 2019
@DaveCTurner
Contributor

Relates #29140 (comment) and issues linked from there.

ywelsch added a commit that referenced this issue May 22, 2019
A shard that is undergoing peer recovery is subject to logging warnings of the form

org.elasticsearch.action.FailedNodeException: Failed node [XYZ]
...
Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in ...

These failures are actually harmless, and expected to happen while a peer recovery is ongoing (i.e.
there is an IndexShard instance, but no proper IndexCommit just yet).
As these failures are currently bubbled up to the master, they cause unnecessary reroutes and
confusion amongst users due to being logged as warnings.

Closes #40107
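The shape of the fix can be sketched as follows (a hedged, simplified model, not the actual code from #42287; `StoreListingSketch`, `listStoreMetadata`, `ShardState`, and the return type are invented stand-ins): while a shard is still recovering and has no Lucene commit yet, the store listing reports "no data" instead of propagating a node-level failure that the master would answer with another reroute.

```java
// Simplified model of the idea behind the fix: treat a missing segments_N file
// on a recovering shard as an expected empty result, not a failed-node error.
// All names here are hypothetical; this is not Elasticsearch's actual API.
import java.util.Optional;

public class StoreListingSketch {

    enum ShardState { RECOVERING, STARTED }

    static Optional<String> listStoreMetadata(ShardState state, boolean hasCommit) {
        if (!hasCommit) {
            if (state == ShardState.RECOVERING) {
                // Expected during phase 1: the IndexShard exists, but no proper
                // IndexCommit yet. Answer "empty store" so the master does not
                // log a warning and schedule an unnecessary reroute.
                return Optional.empty();
            }
            // A started shard with no commit really is an error.
            throw new IllegalStateException("no segments* file found in store");
        }
        return Optional.of("segments_1"); // placeholder for real store metadata
    }

    public static void main(String[] args) {
        System.out.println(listStoreMetadata(ShardState.RECOVERING, false).isPresent()); // prints false
        System.out.println(listStoreMetadata(ShardState.STARTED, true).orElse("none")); // prints segments_1
    }
}
```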
ywelsch added two further commits that referenced this issue May 22, 2019 (same commit message as above)
gurkankaymak pushed a commit to gurkankaymak/elasticsearch that referenced this issue May 27, 2019
…2287)

(same commit message as the May 22 commits above; closes elastic#40107)
@PhaedrusTheGreek
Contributor

Symptomatic error messages:

[2019-08-28T03:50:58,000][WARN ][o.e.g.GatewayAllocator$InternalReplicaShardAllocator] [node1] [logstash-2019.08.25][8]: failed to list shard for shard_store on node [K89X-xcdsfsdvvDDFfSs]
org.elasticsearch.action.FailedNodeException: Failed node
...
Caused by: org.elasticsearch.transport.RemoteTransportException: [node1][192.168.0.11:9302][internal:cluster/nodes/indices/shard/store[n]]
Caused by: org.elasticsearch.ElasticsearchException: Failed to list store metadata for shard [[logstash-2019.08.25][8]]
...
Caused by: java.io.FileNotFoundException: no segments* file found in store

Related conditions:

  • peer recovery is slow
  • heap utilization is high

4 participants