
Rolling restarts and allocation disabling #19739

Closed
markwalkom opened this issue Aug 2, 2016 · 8 comments
Labels
:Distributed Coordination/Allocation, >docs, help wanted, adoptme

Comments

@markwalkom
Contributor

markwalkom commented Aug 2, 2016

Here we tell users to set "cluster.routing.allocation.enable" : "none" before restarting nodes.

With the inclusion of index.unassigned.node_left.delayed_timeout in later versions, would it make sense to update our recommended practices to temporarily increase this setting instead of disabling allocation entirely?

This was raised on this forum issue.
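
For reference, a minimal sketch of the two approaches being compared, using the cluster and index settings APIs (the "5m" timeout is illustrative, not a recommendation):

```
# Current docs: disable all shard allocation before restarting a node
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}

# Proposed alternative: temporarily raise the delayed-allocation
# timeout on all indices (an index-level setting; "5m" is illustrative)
PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}
```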

@markwalkom added the >docs and discuss labels Aug 2, 2016
@cheruvian

It would also be useful to list (in the same area) the exception or response returned when a request attempts to index into a primary shard on that host.

@clintongormley
Contributor

Yes, I think we should probably change this advice. In fact, it's not even a requirement to increase this timeout: if the shard starts reallocating and a node then joins with the shard intact, the reallocation should be cancelled (I'm not sure what happens with unsynced shards).

@ywelsch what do you think?

@ywelsch
Contributor

ywelsch commented Aug 11, 2016

@clintongormley The cancellation only works for synced shards. If there has been write activity on the primary while the node was restarted (or in the 5 minutes before the restart with no explicit synced flush), the existing replica allocation is not cancelled. There is certainly room for future improvement in this area.

If writes are expected on the indices while the node is restarted, and the node is likely to miss the default delayed timeout, then increasing the timeout is a solution. The risk with temporarily increasing node_left.delayed_timeout, though, is that it is an index-level setting, so you have to make sure it is properly reset on all indices once the node has come back. An alternative is to temporarily allow allocation of primaries only (cluster.routing.allocation.enable = primaries), which is a cluster-level setting.

Also note that disabling shard rebalancing (cluster.routing.rebalance.enable) might still be useful; this is a different setting from the one recommended by the current documentation (cluster.routing.allocation.enable). By only disabling rebalancing, new indices can still be created (and even go green), but existing active shards will not be moved around (except where a node goes above the high disk watermark, or where filtered allocation rules change in a hot/warm setup).
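
A sketch of that cluster-level alternative (the reset step assumes the settings were applied as transient):

```
# Before maintenance: allocate primaries only, so replicas on the
# restarting node are not rebuilt elsewhere (cluster-level setting)
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

# Optionally also stop rebalancing of existing active shards
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "none"
  }
}

# After the node has rejoined: reset both settings to their defaults
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": null,
    "cluster.routing.rebalance.enable": null
  }
}
```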

@clintongormley added the help wanted and adoptme labels and removed the discuss label Aug 12, 2016
@jmcarp

jmcarp commented Feb 17, 2017

Am I understanding correctly that the current best practice for rolling restarts is to fiddle with delayed allocation instead of setting cluster.routing.allocation.enable?

cc @cnelson @LinuxBozo

@tomsommer

The documentation still hasn't been changed on this, and no agreement has been reached.

To me it sounds like, as a general best practice, index.unassigned.node_left.delayed_timeout should be raised to a value long enough to restart a server.

Is the suggestion per #19739 (comment) to set cluster.routing.allocation.enable = primaries and raise index.unassigned.node_left.delayed_timeout when doing rolling updates?
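
The combination being asked about would look roughly like this (a sketch; the "10m" value is an assumption, sized to cover a server restart):

```
# Cluster-level: do not allocate new replicas while the node is down
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "primaries"
  }
}

# Index-level: give the node time to rejoin before its shards are
# reallocated (remember to reset this on all indices afterwards)
PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "10m"
  }
}
```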

@colings86 added the :Distributed Indexing/Distributed label Apr 24, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner added the :Distributed Coordination/Allocation label and removed the :Distributed Indexing/Distributed label Apr 27, 2018
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue Apr 30, 2018
Clarify that the “one minute” in the instructions to disable shard allocation when doing maintenance is configurable.

Add a note about making sure that no rebalancing occurs until the maintenance
is complete.

Relates elastic#19739.
@DaveCTurner
Contributor

I think the docs are right: it seems appropriate to use "cluster.routing.allocation.enable" : "none" while a node is down for maintenance. I do not think it's a good idea to do maintenance while racing against the node_left.delayed_timeout as a matter of course. A rolling restart implies the active involvement of a cluster administrator, who is expected to re-enable allocation manually as appropriate if the maintenance takes longer than anticipated.

Note that there are some alterations to this area of the docs in flight (#29670, #29671) which continue to recommend using cluster.routing.allocation.enable.

It could be problematic if rebalancing kicks in while the node is coming back into the cluster. By default, rebalancing only has an effect once the cluster is green (cluster.routing.allocation.allow_rebalance defaults to indices_all_active), but the effects of changing this setting are not very clear. I opened #30248 to fix this.
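
A sketch of the documented procedure defended here (the synced flush step is optional and only speeds recovery of shards with no recent writes):

```
# 1. Disable shard allocation before taking the node down
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}

# 2. Optionally perform a synced flush so idle shards can be
#    recovered quickly when the node rejoins
POST _flush/synced

# 3. Restart the node and wait for it to rejoin the cluster

# 4. Re-enable allocation and wait for the cluster to return to green
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": null
  }
}
```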

@DaveCTurner
Contributor

I think there's no more action to take on this issue, closing.
