
[WIP] Add optional http health check for elasticsearch. #35

Closed

Conversation


@jmcarp jmcarp commented Feb 14, 2017

I'm labelling this as WIP because I want to do more testing before I call it ready, but feel free to review as-is.

As @cnelson noticed, the fact that the elasticsearch process is running
does not mean that elasticsearch is healthy. For example, if nodes are
out of memory or can't join the cluster, the process will be active, but
the node is not healthy. This patch adds the option to monitor
elasticsearch health via an http check instead of the default
process-based check.

The monit check added by the patch:
check host elasticsearch with address 127.0.0.1
start program "/var/vcap/jobs/elasticsearch/bin/monit_debugger elasticsearch_ctl '/var/vcap/jobs/elasticsearch/bin/elasticsearch_ctl start'" with timeout 120 seconds
stop program "/var/vcap/jobs/elasticsearch/bin/monit_debugger elasticsearch_ctl '/var/vcap/jobs/elasticsearch/bin/elasticsearch_ctl stop'"
if failed url http://127.0.0.1:9200/_cluster/health
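
For reference, monit's url test only needs a successful HTTP response from the endpoint; a healthy cluster answers with a small JSON document along these lines (illustrative output only, field values vary by deployment and ES version):

$ curl -s http://127.0.0.1:9200/_cluster/health
{"cluster_name":"elasticsearch","status":"green","timed_out":false,"number_of_nodes":3,"number_of_data_nodes":2,"active_primary_shards":10,"active_shards":20,"unassigned_shards":0}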
@cnelson (Contributor) commented:

WDYT about adding an and content == '"status":"green"' clause here, so the check blocks until the cluster returns to a good state?

It'll slow down the deployment, but prevent the cluster from going red in this scenario:

  • Data node 1 goes down for an update.
  • This causes its primary shards to be marked unavailable (and the replica shards are promoted to primaries).
  • The node finishes updating, ES restarts and rejoins the cluster (now in a yellow state while new replica shards are still being allocated), and monit marks the node up.
  • Data node 2 goes down for the update.
  • This causes its primary shards to be marked unavailable (and the replica shards are promoted to primaries).
  • If any of node 2's primary shards are replicas that were promoted when node 1 went down, and their replacement replicas haven't finished allocating yet, the cluster is now red.
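
Concretely, the suggested check would turn the monit line into something like this (a sketch of the suggestion only; the revision that actually lands appears later in the thread):

if failed url http://127.0.0.1:9200/_cluster/health and content == '"status":"green"'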

@jmcarp (Member, Author) commented:

My only question about this is what happens when we deploy a cluster from scratch. If we have three masters and set minimum masters to two, will the 0th master ever report a green cluster health?
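
(For illustration: until a second master-eligible node joins and a master is elected, the 0th master cannot answer the health API at all; it typically fails with a 503 rather than reporting any status, so a green-content check could never pass there. Hypothetical session; exact behavior varies by ES version:)

$ curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:9200/_cluster/health
503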

@cnelson cnelson commented Feb 14, 2017

That's a good point, the master will block until it can form a quorum. WDYT about only enabling the content check if p("elasticsearch.node.allow_master") == false && p("elasticsearch.node.allow_data") == true ?

That will still cause clusters to go red for people running combo data and master nodes, but if they are doing that then they really don't care about running ES right anyway :trollface:

@jmcarp jmcarp force-pushed the monit-http-check branch 2 times, most recently from 53a6b8b to a17424d Compare February 14, 2017 22:05
check host elasticsearch with address 127.0.0.1
start program "/var/vcap/jobs/elasticsearch/bin/monit_debugger elasticsearch_ctl '/var/vcap/jobs/elasticsearch/bin/elasticsearch_ctl start'" with timeout 120 seconds
stop program "/var/vcap/jobs/elasticsearch/bin/monit_debugger elasticsearch_ctl '/var/vcap/jobs/elasticsearch/bin/elasticsearch_ctl stop'"
if failed url http://127.0.0.1:9200/_cluster/health
<% if not p("elasticsearch.node.allow_master") %>and content == '"status":"green"'<% end %>
A contributor commented:

I think this should be <% if not p("elasticsearch.node.allow_master") and p("elasticsearch.node.allow_data") %>; there's no sense in blocking other non-data nodes (like parsers) until the cluster is green. Parsers specifically can still function when the cluster is yellow.
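
Concretely, that condition would make the template line read (a sketch combining the two property checks as suggested; not what is on the branch at this point):

<% if not p("elasticsearch.node.allow_master") and p("elasticsearch.node.allow_data") %>and content == '"status":"green"'<% end %>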

@jmcarp jmcarp force-pushed the monit-http-check branch 3 times, most recently from 91b6660 to 0feea44 Compare February 14, 2017 22:50

jmcarp commented Feb 17, 2017

Replaced by #39. Monit doesn't seem to be up to the task, so we're going with a post-start script instead.
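
(For the record, the post-start approach boils down to polling the health endpoint until the cluster settles. A minimal sketch, assuming a poll-until-green loop; this is hypothetical and not the actual script from #39, and the endpoint, retry count, and sleep interval are illustrative:)

#!/bin/bash
# Hypothetical post-start sketch: block the deploy until the local node
# reports a green cluster, and fail the job if it never does.
set -u
for _ in $(seq 1 60); do
  body=$(curl -s http://127.0.0.1:9200/_cluster/health || true)
  case "$body" in
    *'"status":"green"'*) exit 0 ;;
  esac
  sleep 5
done
echo "elasticsearch did not report green within timeout" >&2
exit 1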

@jmcarp jmcarp closed this Feb 17, 2017