
[WIP] Add optional http health check for elasticsearch. #35

Closed

Conversation


@jmcarp jmcarp commented Feb 14, 2017

I'm labelling this as WIP because I want to do more testing before I call it ready, but feel free to review as-is.

As @cnelson noticed, the fact that the elasticsearch process is running
does not mean that elasticsearch is healthy. For example, if nodes are
out of memory or can't join the cluster, the process will be active, but
the node is not healthy. This patch adds the option to monitor
elasticsearch health via an http check instead of the default
process-based check.

The monit check added by the patch:
check host elasticsearch with address 127.0.0.1
start program "/var/vcap/jobs/elasticsearch/bin/monit_debugger elasticsearch_ctl '/var/vcap/jobs/elasticsearch/bin/elasticsearch_ctl start'" with timeout 120 seconds
stop program "/var/vcap/jobs/elasticsearch/bin/monit_debugger elasticsearch_ctl '/var/vcap/jobs/elasticsearch/bin/elasticsearch_ctl stop'"
if failed url http://127.0.0.1:9200/_cluster/health
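
For reference, monit's url test only needs a successful HTTP response from the endpoint; a healthy cluster answers with a small JSON document along these lines (illustrative output only, field values vary by deployment and ES version):

$ curl -s http://127.0.0.1:9200/_cluster/health
{"cluster_name":"elasticsearch","status":"green","timed_out":false,"number_of_nodes":3,"number_of_data_nodes":2,"active_primary_shards":10,"active_shards":20,"unassigned_shards":0}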
@cnelson (Contributor) commented:

WDYT about adding an and content == '"status":"green"' clause here, so the check blocks until the cluster returns to a good state?

It'll slow down the deployment, but prevent the cluster from going red in this scenario:

  • Data node 1 goes down for an update.
  • This causes its primary shards to be marked unavailable (and the replica shards are promoted to primaries).
  • The node finishes updating, ES restarts and rejoins the cluster (now in a yellow state while new replica shards are still being allocated), and monit marks the node up.
  • Data node 2 goes down for the update.
  • This causes its primary shards to be marked unavailable (and the replica shards are promoted to primaries).
  • If any of node 2's primary shards are replicas that were promoted when node 1 went down, and their replacement replicas haven't finished allocating yet, the cluster is now red.
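
Concretely, the suggested check would turn the monit line into something like this (a sketch of the suggestion only; the revision that actually lands appears later in the thread):

if failed url http://127.0.0.1:9200/_cluster/health and content == '"status":"green"'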

@jmcarp (Member, Author) commented:

My only question about this is what happens when we deploy a cluster from scratch. If we have three masters and set minimum masters to two, will the 0th master ever report a green cluster health?
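
(For illustration: until a second master-eligible node joins and a master is elected, the 0th master cannot answer the health API at all; it typically fails with a 503 rather than reporting any status, so a green-content check could never pass there. Hypothetical session; exact behavior varies by ES version:)

$ curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:9200/_cluster/health
503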

@cnelson cnelson commented Feb 14, 2017

That's a good point, the master will block until it can form a quorum. WDYT about only enabling the content check if p("elasticsearch.node.allow_master") == false && p("elasticsearch.node.allow_data") == true ?

That will still cause clusters to go red for people running combo data and master nodes, but if they are doing that then they really don't care about running ES right anyway :trollface:

@jmcarp jmcarp force-pushed the monit-http-check branch 2 times, most recently from 53a6b8b to a17424d Compare February 14, 2017 22:05
check host elasticsearch with address 127.0.0.1
start program "/var/vcap/jobs/elasticsearch/bin/monit_debugger elasticsearch_ctl '/var/vcap/jobs/elasticsearch/bin/elasticsearch_ctl start'" with timeout 120 seconds
stop program "/var/vcap/jobs/elasticsearch/bin/monit_debugger elasticsearch_ctl '/var/vcap/jobs/elasticsearch/bin/elasticsearch_ctl stop'"
if failed url http://127.0.0.1:9200/_cluster/health
<% if not p("elasticsearch.node.allow_master") %>and content == '"status":"green"'<% end %>
A contributor commented:

I think this should be <% if not p("elasticsearch.node.allow_master") and p("elasticsearch.node.allow_data") %>; there's no sense in blocking other non-data nodes (like parsers) until the cluster is green. Parsers specifically can still function when the cluster is yellow.
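
Concretely, that condition would make the template line read (a sketch combining the two property checks as suggested; not what is on the branch at this point):

<% if not p("elasticsearch.node.allow_master") and p("elasticsearch.node.allow_data") %>and content == '"status":"green"'<% end %>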

@jmcarp jmcarp force-pushed the monit-http-check branch 3 times, most recently from 91b6660 to 0feea44 Compare February 14, 2017 22:50

jmcarp commented Feb 17, 2017

Replaced by #39. Monit doesn't seem to be up to the task, so we're going with a post-start script instead.
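
(For the record, the post-start approach boils down to polling the health endpoint until the cluster settles. A minimal sketch, assuming a poll-until-green loop; this is hypothetical and not the actual script from #39, and the endpoint, retry count, and sleep interval are illustrative:)

#!/bin/bash
# Hypothetical post-start sketch: block the deploy until the local node
# reports a green cluster, and fail the job if it never does.
set -u
for _ in $(seq 1 60); do
  body=$(curl -s http://127.0.0.1:9200/_cluster/health || true)
  case "$body" in
    *'"status":"green"'*) exit 0 ;;
  esac
  sleep 5
done
echo "elasticsearch did not report green within timeout" >&2
exit 1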

@jmcarp jmcarp closed this Feb 17, 2017