Cluster health should distinguish RED from "no master" #34897
Comments
Pinging @elastic/es-core-infra
We discussed this and broadly agreed to do it. Note that we are not intending to change the meaning of `status: red`. We contemplated identifying the discovered master in the cluster health response, but I think I'd rather just have a boolean `discovered_master` field.
I am currently working on it.
…o-master (elastic#34897): Added boolean `"discovered_master": true` in the `GET _cluster/health` response to expose the presence of a master.
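For illustration, a health response carrying that change might look like the sketch below. Only `discovered_master` is the addition described in the commit; the surrounding fields are the usual `GET _cluster/health` output, and every value here is made up:

```json
{
  "cluster_name": "my-cluster",
  "status": "red",
  "timed_out": false,
  "discovered_master": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 2,
  "active_primary_shards": 0,
  "active_shards": 0,
  "unassigned_shards": 12
}
```

A health check that only cares about cluster membership could then key off `discovered_master` and ignore `status` entirely.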
We discussed this again recently and determined that …
Hi @DaveCTurner, thanks for responding on this. I tried running this and verified from the code as well that … This makes the API behavior inconsistent with the presence of … Introducing the …
Asking a node to compute the cluster health locally is always possible, so 200 seems like the right response there. 503 here doesn't mean simply "unhealthy", it means "so unhealthy that I can't even answer your question".
This deserves some further investigation. Health requests should be pretty cheap, and there are benefits to getting an answer from the master. Are you really seeing the master struggling with the load here? Can you provide more details?
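For concreteness, this is how a client can ask a node to compute cluster health from its own copy of the cluster state rather than delegating to the elected master; the `local` query parameter on the cluster health API does exactly this, though the host and port below are assumptions:

```sh
# Answered by the receiving node from its local cluster state;
# no round-trip to the elected master is involved.
curl -s "http://localhost:9200/_cluster/health?local=true&pretty"
```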
…nodes are available. The behaviour of the `/` endpoint changed[0] between 6.x and 7.x: previously it returned an HTTP `503` response when the cluster was blocked, whereas it now returns an HTTP `200` response even if no master is available. This change updates the behaviour of the `readinessProbe` command during normal running to verify that the local node is responding and that master nodes are available.[1] The desired behaviour here is that if the data nodes are unable to talk to their master nodes for whatever reason, then the data nodes become `Unready` and are therefore removed from the Service load-balancer until the master nodes are available again. Refs: [0] elastic/elasticsearch#29045 [1] elastic/elasticsearch#34897 (comment)
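A minimal sketch of the kind of readiness check that commit describes (not the actual script it ships; the endpoint, timeouts, and use of curl are assumptions) could rely on the fact that a non-local cluster-health call is answered by the master and fails when no master is discovered:

```sh
#!/usr/bin/env bash
# Hypothetical readinessProbe command: exit non-zero (Unready) unless the
# local node responds AND a master can be reached within a short timeout.
set -euo pipefail

# -f turns HTTP errors such as a 503 master_not_discovered response
# into a non-zero curl exit code.
curl -sf --max-time 5 \
  "http://localhost:9200/_cluster/health?master_timeout=3s" > /dev/null
```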
Hi @DaveCTurner, replying back and requesting that this thread be reopened to get some more thoughts. We have seen instances where the master node gets overwhelmed by too many (periodic) health requests being delegated to it from data nodes (without `local=true`). Even for a non-dedicated-master setup this can be aggravated even with small cluster sizes (fewer than 10 nodes). This typically happens when the cluster has a front-end service, a load balancer, or perhaps a monitoring script that periodically checks node health in order to send or restrict traffic to the node based on its ability to process requests. Without the presence of the `discovered_master` field, … As there could be multiple reasons that could lead a node to this situation, such as leader-check failures due to network partitioning, please let me know if it still makes sense to have a dedicated `discovered_master` field.
Repeating my previous message: health requests should be pretty cheap. Can you analyse this in more detail so we can understand better how and why the master is struggling?
A `RED` cluster health means that there is either no master or else at least one primary is unassigned. However, in 7.0 there seems to be no simple way for clients to distinguish these two cases, and it would be useful to do so. For example:

- a per-node health check should verify the node-level property of cluster membership, but may not be interested in the cluster-wide property of whether all the primaries are assigned.
- an orchestrator may wish to wait for a freshly-started node to join a cluster before performing some followup actions, again without caring about whether all the primaries are assigned. Today this often happens by waiting for the HTTP port to be open, but this is unreliable: if no master is discovered within `discovery.initial_state_timeout` (default 30s) then we open the HTTP port anyway.

In the 6.x series a node responds to `GET /` with `503 Service Unavailable` if it believes there to be no master (i.e. either `NO_MASTER_BLOCK_*` or `STATE_NOT_RECOVERED_BLOCK` is present), which allows these cases to be distinguished. However #29045 changes this in 7.0 so that we will always respond `200 OK` to `GET /` (as is right and proper), so another mechanism is needed.

I think we should expose the presence or absence of these blocks in the output of `GET _cluster/health` and add the ability to wait for their absence, e.g.: