Primary shard recovery time slower in 6.3.2 than 5.6.7 #33198

Closed
ahadadi opened this issue Aug 28, 2018 · 2 comments
Labels
:Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source.

Comments

ahadadi commented Aug 28, 2018

Elasticsearch version: 6.3.2

Description of the problem including expected versus actual behavior:
When upgrading from 5.6.7 to 6.3.2, we've noticed that relocating a primary shard takes longer.
It seems that in 6.3.2, when relocating a primary shard to a different node, translog operations are being replayed. This happens even if the shard on the source node was successfully flushed, which means the translog does not contain any operation that is not already contained in the files being copied to the target node.
In 5.6.7 the translog is emptied when flush takes place, so translog operations are not replayed during relocation.

The expected behavior is that the recovery of a flushed shard to an empty target node will not entail translog replay, only copying files.

Steps to reproduce:

  1. Create an index with a single primary shard and no replicas.
  2. Index 1M documents.
  3. Flush the index.
  4. Relocate the index to a different node, e.g. by setting "index.routing.allocation.require._name" to a different node's name (see the sketch after this list).
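
A minimal sketch of these steps against the REST API; the index name test_index and the type name doc are placeholders, the node name matches the trace log below, and the bulk body is abbreviated to a single document:

    PUT /test_index
    { "settings": { "number_of_shards": 1, "number_of_replicas": 0 } }

    POST /test_index/doc/_bulk        (repeat until ~1M documents are indexed)
    { "index": {} }
    { "field": "value" }

    POST /test_index/_flush

    PUT /test_index/_settings
    { "index.routing.allocation.require._name": "node_td1" }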

If you set the org.elasticsearch.indices.recovery logging level to TRACE, you can see that a file-based recovery takes place: the files are transferred, and then the translog is sent and replayed on the target node:
[2018-08-28T13:10:31,099][TRACE][o.e.i.r.RecoverySourceHandler] [node_td2] [index][0][recover to node_td1] sent batch of [10083][512kb] (total: [1000000]) translog operations
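
For reference, the logger can be raised to TRACE through the cluster settings API (a sketch; a transient setting is used so it reverts on a full cluster restart):

    PUT /_cluster/settings
    { "transient": { "logger.org.elasticsearch.indices.recovery": "TRACE" } }

The ongoing recovery, including the translog operations replayed, can also be inspected with GET /test_index/_recovery?human.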

@markharwood markharwood added the :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. label Aug 28, 2018
@elasticmachine

Pinging @elastic/es-distributed

@jasontedor

We replay the translog operations so that the relocated shard has a history of operations in its translog too. This history is important for operations-based recoveries (when a shard temporarily goes offline and only needs to replay some operations to catch up). By default, we now retain 512 MB or twelve hours of translog files for this purpose. We are making some improvements here as we work on relying less on the translog for history; see for example #33190. It will be a while, though, until the behavior that PR builds on is the default. For now, you can adjust your translog retention policy, but the risk is that a shard will not have enough history in its translog for an operations-based recovery and you will have to fall back to file-based recoveries.

cc: @dnhatn
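
For reference, the retention policy mentioned above is controlled per index in 6.x by index.translog.retention.size (default 512mb) and index.translog.retention.age (default 12h). A minimal sketch of tightening it on the test index from the reproduction above, at the cost of keeping less history available for operations-based recoveries:

    PUT /test_index/_settings
    { "index.translog.retention.size": "64mb", "index.translog.retention.age": "1h" }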
