[WIP] Add optional http health check for elasticsearch. #35
Conversation
As @cnelson noticed, the fact that the elasticsearch process is running does not mean that elasticsearch is healthy. For example, if nodes are out of memory or can't join the cluster, the process will be active, but the node is not healthy. This patch adds the option to monitor elasticsearch health via an http check instead of the default process-based check. I'm labelling this as WIP because I want to do more testing before I call it ready, but feel free to review as-is.
jobs/elasticsearch/monit
Outdated
check host elasticsearch with address 127.0.0.1
start program "/var/vcap/jobs/elasticsearch/bin/monit_debugger elasticsearch_ctl '/var/vcap/jobs/elasticsearch/bin/elasticsearch_ctl start'" with timeout 120 seconds
stop program "/var/vcap/jobs/elasticsearch/bin/monit_debugger elasticsearch_ctl '/var/vcap/jobs/elasticsearch/bin/elasticsearch_ctl stop'"
if failed url http://127.0.0.1:9200/_cluster/health
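For reference, this is the endpoint the new check hits; the sketch below is illustrative only, and the field values shown are made up rather than taken from a real deployment. With the default check, monit only needs the HTTP request to succeed; matching on the response body is what the comments below discuss.

    # Query the cluster health endpoint the monit check uses.
    curl -s http://127.0.0.1:9200/_cluster/health
    # => {"cluster_name":"example","status":"green","timed_out":false,"number_of_nodes":5,...}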
WDYT about adding an and content == '"status":"green"' check here, so this will block until the cluster returns to a good state?
It'll slow down the deployment, but it prevents the cluster from going red in this scenario:
- Data node 1 goes down for an update.
- This causes its primary shards to be marked unavailable (and the backup shards are activated).
- The node finishes updating, ES restarts and rejoins the cluster (which is now in a yellow state as new backup shards are still being allocated), and monit marks the node up.
- Data node 2 goes down for the update.
- This causes its primary shards to be marked unavailable (and the backup shards are activated).
- If any of node 2's primary shards are backup shards that were promoted when node 1 went down and haven't finished allocating new backup shards of their own yet, the cluster is now red.
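For illustration, a similar wait can also be expressed directly against the cluster health API, which accepts a wait_for_status parameter; this is a sketch of the idea rather than part of the proposed monit check, and the timeout value is arbitrary.

    # Block (up to 120s) until the cluster reports green, then return the health JSON.
    curl -s 'http://127.0.0.1:9200/_cluster/health?wait_for_status=green&timeout=120s'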
My only question about this is what happens when we deploy a cluster from scratch. If we have three masters and set minimum masters to two, will the 0th master ever report a green cluster health?
That's a good point, the master will block until it can form a quorum. WDYT about only enabling the content check if p("elasticsearch.node.allow_master") == false && p("elasticsearch.node.allow_data") == true?
That will still cause clusters to go red for people running combined data and master nodes, but if they are doing that then they really don't care about running ES right anyway.
Force-pushed from 53a6b8b to a17424d
jobs/elasticsearch/monit
Outdated
check host elasticsearch with address 127.0.0.1
start program "/var/vcap/jobs/elasticsearch/bin/monit_debugger elasticsearch_ctl '/var/vcap/jobs/elasticsearch/bin/elasticsearch_ctl start'" with timeout 120 seconds
stop program "/var/vcap/jobs/elasticsearch/bin/monit_debugger elasticsearch_ctl '/var/vcap/jobs/elasticsearch/bin/elasticsearch_ctl stop'"
if failed url http://127.0.0.1:9200/_cluster/health
<% if not p("elasticsearch.node.allow_master") %>and content = '"status":"green"'<% end %>
I think this should be <% if not p("elasticsearch.node.allow_master") and p("elasticsearch.node.allow_data") %>.
There's no sense in blocking other non-data nodes (like parsers) until the cluster is green. Parsers specifically can still function when the cluster is yellow.
Force-pushed from 91b6660 to 0feea44
Force-pushed from 0feea44 to d7b3996
Replaced by #39. Monit doesn't seem to be up to the task; we're going with post-start instead.
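For context, a post-start style health gate along the lines this comment describes could look roughly like the sketch below. This is an illustration under assumed defaults (the node listening on 127.0.0.1:9200, green as the target status, arbitrary retry counts); it is not the actual script introduced in #39.

    #!/bin/bash
    # Poll the cluster health endpoint until it reports green, or give up.
    # The retry count and sleep interval are arbitrary illustration values.
    for _ in $(seq 1 60); do
      status=$(curl -s http://127.0.0.1:9200/_cluster/health | grep -o '"status":"[a-z]*"')
      if [ "$status" = '"status":"green"' ]; then
        exit 0
      fi
      sleep 5
    done
    echo "elasticsearch did not report a green cluster in time" >&2
    exit 1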