Add a timeout to local mapping change check #9575

bleskes · 2015-02-05T08:25:26Z

After phase1 of recovery is completed, we check that all pending mapping changes have been sent to the master and processed by the other nodes. This is needed in order to make sure that the target node has the latest mapping (we just copied over the corresponding Lucene files). To make sure we do not miss updates, we do so under a local cluster state update task. At the moment we don't have a timeout when waiting on the task to be completed. If the local node update thread is very busy, this may stall the recovery for too long. This commit adds a timeout (equal to indices.recovery.internal_action_timeout) and upgrades the task urgency to IMMEDIATE. If the check times out, we fail the recovery.

Note that this PR is against 1.4.
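To illustrate the idea, here is a minimal sketch of the pattern, not the actual change in this PR. It assumes a single-threaded `ExecutorService` as a stand-in for the local cluster state update thread. `MappingCheckSketch`, `runCheckOrFail` and `RecoveryFailedException` are hypothetical names, and the timeout is passed in as a plain argument rather than read from `indices.recovery.internal_action_timeout`. Task priority (`IMMEDIATE`) is not modeled.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

class MappingCheckSketch {

    /** Hypothetical stand-in for whatever exception fails the recovery. */
    static class RecoveryFailedException extends RuntimeException {
        RecoveryFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /**
     * Submits the check to the update thread and waits at most timeoutMillis.
     * If the check does not finish in time (or throws), the caller gets an
     * exception instead of a normal return, so the recovery is failed rather
     * than continuing as if the check had passed.
     */
    static void runCheckOrFail(ExecutorService updateThread, Runnable check, long timeoutMillis) {
        CountDownLatch done = new CountDownLatch(1);
        AtomicReference<Throwable> failure = new AtomicReference<>();

        updateThread.execute(() -> {
            try {
                check.run();
            } catch (Throwable t) {
                failure.set(t);
            } finally {
                done.countDown();
            }
        });

        boolean completed;
        try {
            // Bounded wait: a busy update thread can no longer stall us forever.
            completed = done.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RecoveryFailedException("interrupted while waiting for mapping check", e);
        }

        if (!completed) {
            throw new RecoveryFailedException(
                "mapping check timed out after " + timeoutMillis + "ms", null);
        }
        if (failure.get() != null) {
            throw new RecoveryFailedException("mapping check failed", failure.get());
        }
    }
}
```

With an idle update thread the call returns normally. If the update thread is blocked for longer than the timeout, the call throws and the recovery can be retried later.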

After phase1 of recovery is completed, we check that all pending mapping changes have been sent to the master and processed by the other nodes. This is needed in order to make sure that the target node has the latest mapping (we just copied over the corresponding Lucene files). To make sure we do not miss updates, we do so under a local cluster state update task. At the moment we don't have a timeout when waiting on the task to be completed. If the local node update thread is very busy, this may stall the recovery for too long. This commit adds a timeout (equal to `indices.recovery.internal_action_timeout`) and upgrades the task urgency to `IMMEDIATE`.

s1monw · 2015-02-05T14:02:12Z

I think the timeout is ok to add but returning as if everything is ok is wrong. IMO we need to fail this recovery altogether until we can run this check successfully.

bleskes · 2015-02-05T15:13:35Z

@s1monw @kimchy pushed an update

s1monw · 2015-02-06T00:06:32Z

LGTM thanks boaz

After phase1 of recovery is completed, we check that all pending mapping changes have been sent to the master and processed by the other nodes. This is needed in order to make sure that the target node has the latest mapping (we just copied over the corresponding Lucene files). To make sure we do not miss updates, we do so under a local cluster state update task. At the moment we don't have a timeout when waiting on the task to be completed. If the local node update thread is very busy, this may stall the recovery for too long. This commit adds a timeout (equal to `indices.recovery.internal_action_timeout`) and upgrades the task urgency to `IMMEDIATE`. If we fail to perform the check, we fail the recovery. Closes #9575


bleskes added the v1.4.3, v1.5.0, v2.0.0-beta1, >regression, resiliency, :Distributed Indexing/Recovery and review labels and removed the >regression label Feb 5, 2015

move to fail recovery upon mapping check timeout

084ae66

bleskes closed this in 2302222 Feb 6, 2015

bleskes deleted the recovery_mapping_check branch February 6, 2015 09:09

bleskes mentioned this pull request Feb 9, 2015

Shard stuck in relocating state with recovery stage=translog #9226

Closed

s1monw added v1.3.8 and removed review labels Feb 10, 2015

clintongormley added the >enhancement label Feb 10, 2015

bleskes mentioned this pull request Apr 11, 2015

updateMappingOnMaster never times out leaving replicas stuck in INITIALIZING #9066

Closed

clintongormley changed the title ~~Recovery: add a timeout to local mapping change check~~ Add a timeout to local mapping change check Jun 7, 2015
