Primary shard recovery time slower in 6.3.2 than 5.6.7 #33198

Closed
ahadadi opened this issue Aug 28, 2018 · 2 comments
Labels
:Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source.

Comments

ahadadi commented Aug 28, 2018

Elasticsearch version: 6.3.2

Description of the problem including expected versus actual behavior:
When upgrading from 5.6.7 to 6.3.2, we've noticed that relocating a primary shard takes longer.
It seems that in 6.3.2, when relocating a primary shard to a different node, translog operations are being replayed. This happens even if the shard on the source node was successfully flushed, which means the translog does not contain any operation that is not already contained in the files being copied to the target node.
In 5.6.7 the translog is emptied when flush takes place, so translog operations are not replayed during relocation.

The expected behavior is that the recovery of a flushed shard to an empty target node will not entail translog replay, only copying files.

Steps to reproduce:

  1. Create an index with a single primary shard and no replicas.
  2. Index 1M documents.
  3. Flush the index.
  4. Relocate the index to a different node, e.g. by setting "index.routing.allocation.require._name" to a different node's name (see the sketch after this list).
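
A minimal sketch of these steps against the REST API; the index name test_index and the type name doc are placeholders, the node name matches the trace log below, and the bulk body is abbreviated to a single document:

    PUT /test_index
    { "settings": { "number_of_shards": 1, "number_of_replicas": 0 } }

    POST /test_index/doc/_bulk        (repeat until ~1M documents are indexed)
    { "index": {} }
    { "field": "value" }

    POST /test_index/_flush

    PUT /test_index/_settings
    { "index.routing.allocation.require._name": "node_td1" }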

If you set the org.elasticsearch.indices.recovery logging level to TRACE, you can see that a file-based recovery takes place: the files are transferred, and then the translog is sent and replayed on the target node:
[2018-08-28T13:10:31,099][TRACE][o.e.i.r.RecoverySourceHandler] [node_td2] [index][0][recover to node_td1] sent batch of [10083][512kb] (total: [1000000]) translog operations
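
For reference, the logger can be raised to TRACE through the cluster settings API (a sketch; a transient setting is used so it reverts on a full cluster restart):

    PUT /_cluster/settings
    { "transient": { "logger.org.elasticsearch.indices.recovery": "TRACE" } }

The ongoing recovery, including the translog operations replayed, can also be inspected with GET /test_index/_recovery?human.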

@markharwood markharwood added the :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. label Aug 28, 2018
@elasticmachine

Pinging @elastic/es-distributed

@jasontedor

We replay the translog operations so that the relocated shard has a history of operations in its translog too. This history is important for operations-based recoveries (when a shard temporarily goes offline and only needs to replay some operations to catch up). By default, we now retain 512 MB or twelve hours of translog files for this purpose. We are making some improvements here as we work on relying less on the translog for history; see for example #33190. It will be a while, though, until the behavior that PR builds on is the default. For now, you can adjust your translog retention policy, but the risk is that a shard will not have enough history in its translog for an operations-based recovery and you will have to fall back to file-based recoveries.

cc: @dnhatn
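
For reference, the retention policy mentioned above is controlled per index in 6.x by index.translog.retention.size (default 512mb) and index.translog.retention.age (default 12h). A minimal sketch of tightening it on the test index from the reproduction above, at the cost of keeping less history available for operations-based recoveries:

    PUT /test_index/_settings
    { "index.translog.retention.size": "64mb", "index.translog.retention.age": "1h" }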
