
Rolling restarts and allocation disabling #19739

Closed
markwalkom opened this issue Aug 2, 2016 · 8 comments
Labels
:Distributed Coordination/Allocation, >docs, help wanted, adoptme

Comments

@markwalkom
Contributor

markwalkom commented Aug 2, 2016

Here we tell users to set "cluster.routing.allocation.enable" : "none" before restarting nodes.

With the inclusion of index.unassigned.node_left.delayed_timeout in later versions, would it make sense to update our recommended practices to temporarily increase this setting instead of disabling allocation entirely?

This was raised on this forum issue.
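
For reference, a minimal sketch of the two approaches being compared, using the cluster and index settings APIs (the "5m" timeout is illustrative, not a recommendation):

```
# Current docs: disable all shard allocation before restarting a node
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}

# Proposed alternative: temporarily raise the delayed-allocation
# timeout on all indices (an index-level setting; "5m" is illustrative)
PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}
```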

@markwalkom added the >docs and discuss labels Aug 2, 2016
@cheruvian

It would also be useful to list (in the same area) the exception or response returned when a request attempts to index into a primary shard on that host.

@clintongormley
Contributor

Yes, I think we should probably change this advice. In fact, it's not even a requirement to increase this timeout: if the shard starts reallocating and a node then joins with the shard intact, the reallocation should be cancelled (I'm not sure what happens with unsynced shards).

@ywelsch what do you think?

@ywelsch
Contributor

ywelsch commented Aug 11, 2016

@clintongormley The cancellation only works for synced shards. If there has been write activity on the primary while the node was restarted (or in the 5 minutes before the restart with no explicit synced flush), the existing replica allocation is not cancelled. There is certainly room for future improvement in this area.

If writes are expected on the indices while the node is restarted, and the node is likely to miss the default delayed timeout, then increasing the timeout is a solution. The risk with temporarily increasing node_left.delayed_timeout, though, is that it is an index-level setting, so you have to make sure it is properly reset on all indices once the node has come back. An alternative is to temporarily allow allocation of primaries only (cluster.routing.allocation.enable = primaries), which is a cluster-level setting.

Also note that disabling shard rebalancing (cluster.routing.rebalance.enable) might still be useful; this is a different setting from the one recommended by the current documentation (cluster.routing.allocation.enable). By only disabling rebalancing, new indices can still be created (and even go green), but existing active shards will not be moved around (except where a node goes above the high disk watermark, or where filtered allocation rules change in a hot/warm setup).
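
A sketch of that cluster-level alternative (the reset step assumes the settings were applied as transient):

```
# Before maintenance: allocate primaries only, so replicas on the
# restarting node are not rebuilt elsewhere (cluster-level setting)
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

# Optionally also stop rebalancing of existing active shards
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "none"
  }
}

# After the node has rejoined: reset both settings to their defaults
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": null,
    "cluster.routing.rebalance.enable": null
  }
}
```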

@clintongormley added the help wanted and adoptme labels and removed the discuss label Aug 12, 2016
@jmcarp

jmcarp commented Feb 17, 2017

Am I understanding correctly that the current best practice for rolling restarts is to fiddle with delayed allocation instead of setting cluster.routing.allocation.enable?

cc @cnelson @LinuxBozo

@tomsommer

The documentation still hasn't been changed on this, and no agreement has been reached.

To me it sounds like, as a general best practice, index.unassigned.node_left.delayed_timeout should be raised to a value long enough to restart a server.

Is the suggestion per #19739 (comment) to set cluster.routing.allocation.enable = primaries and raise index.unassigned.node_left.delayed_timeout when doing rolling updates?
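
The combination being asked about would look roughly like this (a sketch; the "10m" value is an assumption, sized to cover a server restart):

```
# Cluster-level: do not allocate new replicas while the node is down
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

# Index-level: give the node time to rejoin before its shards are
# reallocated (remember to reset this on all indices afterwards)
PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "10m"
  }
}
```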

@colings86 added the :Distributed Indexing/Distributed label Apr 24, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner added the :Distributed Coordination/Allocation label and removed the :Distributed Indexing/Distributed label Apr 27, 2018
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 30, 2018
Clarify that the “one minute” in the instructions to disable shard allocation when doing maintenance is configurable.

Add a note about making sure that no rebalancing occurs until the maintenance
is complete.

Relates elastic#19739.
@DaveCTurner
Contributor

I think the docs are right: it seems appropriate to use "cluster.routing.allocation.enable" : "none" while a node is down for maintenance. I do not think it's a good idea to do maintenance while racing against the node_left.delayed_timeout as a matter of course. A rolling restart implies the active involvement of a cluster administrator, who is expected to re-enable allocation manually as appropriate if the maintenance takes longer than anticipated.

Note that there are some alterations to this area of the docs in flight (#29670, #29671) which continue to recommend using cluster.routing.allocation.enable.

It could be problematic if rebalancing kicks in while the node is coming back into the cluster. By default, rebalancing only has an effect once the cluster is green (cluster.routing.allocation.allow_rebalance defaults to indices_all_active), but the effects of changing this setting are not very clear. I opened #30248 to fix this.
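
A sketch of the documented procedure defended here (the synced flush step is optional and only speeds recovery of shards with no recent writes):

```
# 1. Disable shard allocation before taking the node down
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}

# 2. Optionally perform a synced flush so idle shards can be
#    recovered quickly when the node rejoins
POST _flush/synced

# 3. Restart the node and wait for it to rejoin the cluster

# 4. Re-enable allocation and wait for the cluster to return to green
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": null
  }
}
```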

@DaveCTurner
Contributor

I think there's no more action to take on this issue, closing.
