Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delay allocation #39

Merged
merged 5 commits into from
Feb 28, 2017
Merged

Conversation

jmcarp
Copy link
Member

@jmcarp jmcarp commented Feb 17, 2017

As @cnelson described in cloud-gov/product#673 (comment), restarting the cluster can lead to outages for a few reasons:

  • Since the monit health check on elastic verifies that the elastic process is running but not that the node has joined the cluster, bosh can start deploying the next node before the previous node has successfully joined the cluster.
  • Since elastic defaults to a one-minute node timeout before reallocating shards, and restarting a node might take more than one minute, the cluster can frantically start reallocating shards for no reason.

This patch addresses both issues. We added a post-start script that blocks until the node is listening on 9200--and, for data nodes, until the cluster is healthy. We also optionally increase the node timeout on drain, then restore it on post-deploy, to avoid "shard shuffle". This also means we don't need to rely on shard routing settings to keep the cluster healthy during restarts: elastic/elasticsearch#19739.

@axelaris axelaris self-assigned this Feb 18, 2017
@axelaris axelaris requested a review from Infra-Red February 20, 2017 10:02
@axelaris
Copy link
Member

Hi @jmcarp,
I believe this is a good idea.
However, I can see at least one potential issue here. Upon upgrading a cluster with this patch, each node will be drained using old version of script (disabling routing allocation). Since new post-start script does not enabling it back, cluster upgrade will fail :-(

@jmcarp
Copy link
Member Author

jmcarp commented Feb 20, 2017

Good point--how about leaving the routing allocation change in post-start for now, with a TODO to drop it in a future release?

@cnelson
Copy link
Contributor

cnelson commented Feb 24, 2017

@axelaris @Infra-Red Any more feedback on this (and this #38) PR?

@@ -5,6 +5,11 @@ set -e
out=$(mktemp health-XXXXXX)
remaining=<%= p("elasticsearch.health.timeout") %>

# Ensure shard allocation is enabled for updates from previous release
# TODO: Deprecate on next release
curl -X PUT -s <%= p('elasticsearch.master_hosts').first %>:9200/_cluster/settings
Copy link
Member

@axelaris axelaris Feb 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that you missed \ at the line end :-(

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

@axelaris
Copy link
Member

Hi @cnelson,
just checked it and found a little issue in the code. Please check my comment.

@axelaris
Copy link
Member

Now all is seems to be working ok.
Could you please squash your commits for better tracking?

@jmcarp
Copy link
Member Author

jmcarp commented Feb 27, 2017

If you want this to be squashed into a single commit, the github "squash and merge" button should do it.

@axelaris axelaris merged commit e990bb3 into cloudfoundry-community:develop Feb 28, 2017
@axelaris
Copy link
Member

Ok, it seems thats working.
Thank you @jmcarp !

hannayurkevich pushed a commit to hannayurkevich/logsearch-boshrelease that referenced this pull request Mar 7, 2017
* Optionally delay allocation during restart.

* Wait for nodes to rejoin post-start.

* Move delay allocation restore to post-start.

h/t @cnelson

* Check local state if not data-only.

* Ensure shard allocation so that upgrades succeed.

h/t @axelaris
@geofffranks
Copy link
Contributor

Can we get a new final release built with these improvements?

@axelaris
Copy link
Member

Hi @geofffranks,
we're working on v5 release, where this PR will be included

axelaris added a commit that referenced this pull request May 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants