Should GET / return 503 in case of discovery.zen.no_master_block: write ? #8902

Closed
Mpdreamz opened this issue Dec 11, 2014 · 9 comments
Labels
:Core/Infra/Core Core issues without another label

Comments

@Mpdreamz
Member

Given two nodes, one of which (A) is configured with:

node.master: false
discovery.zen.no_master_block: write

and (B) being a vanilla master node.

When we stop node (B), (A) is still allowed to serve read requests.
However, when calling

GET http://(A):9200/ HTTP/1.1

It currently returns:

HTTP/1.1 503 Service Unavailable

but is the service really unavailable in this case? Since we now explicitly allow you to configure this state, IMO it should return 200 OK, possibly with a boolean in the response signalling that it's in read-only mode.

A call to _search in this state also results in a 200 and not 503.

@martijnvg
Member

@Mpdreamz Agreed, the main REST endpoint should return 200 in case there is no elected master. It doesn't report on anything related to cluster state, just the configured cluster_name and a couple of node-related stats.

@javanna
Member

javanna commented Jan 16, 2015

Agreed, the main rest endpoint should return 200 in case there is no elected master.

Should it always return 200, or only if we block writes? If we block all operations, maybe we should still return 503?

@martijnvg
Member

True, we should return 503 when all operations are blocked, but when we only block writes we should return 200, because the node is still partially operational.
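
The rule agreed on in the comments above can be sketched as a small decision function. This is only an illustration of the proposal, not Elasticsearch's actual implementation; `root_status` and its parameters are hypothetical names, and the only setting values assumed are `write` and `all`.

```python
def root_status(master_elected: bool, no_master_block: str = "write") -> int:
    """Proposed HTTP status for GET / (hypothetical helper, not ES code)."""
    if master_elected:
        return 200  # normal operation
    if no_master_block == "write":
        return 200  # reads are still served, so the node is partially operational
    if no_master_block == "all":
        return 503  # every operation is blocked
    raise ValueError(f"unknown no_master_block value: {no_master_block!r}")
```

With no elected master, `root_status(False, "write")` gives 200 and `root_status(False, "all")` gives 503.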

@clintongormley clintongormley added >bug :Core/Infra/Core Core issues without another label v2.0.0-beta1 labels Jan 16, 2015
@bleskes
Contributor

bleskes commented Jan 16, 2015

+1. On master loss we go into a new master election, which takes 3s (by default). During those 3s the node has a configured block; if it allows reads we should indeed return 200. This is likely a transient state which will be resolved before we start rejecting indexing requests (remember they wait up to 1m for the situation to be resolved).

@clintongormley
Contributor

I'm not sure that this is the right thing to do. Imagine you're using sniffing. You try to perform a write and get back a 503, so you sniff and get back a 200, then you try the write again, get back a 503, and so on.

That said, the above would work for reads. I know the Python client aborts after 3 attempts, while the Perl client keeps going until it gets back a 503 on sniffing. Perhaps we should only sniff once before giving up.
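
The "sniff once before giving up" idea could look roughly like the sketch below. It is not taken from any of the official clients; `write_with_bounded_sniff`, `send_write`, and `sniff` are hypothetical names, with `send_write` standing in for the HTTP write call and `sniff` for node rediscovery.

```python
from typing import Callable, List, Optional


def write_with_bounded_sniff(
    nodes: List[str],
    send_write: Callable[[str], int],
    sniff: Callable[[], List[str]],
    max_sniffs: int = 1,
) -> Optional[int]:
    """Try the write on each node, re-sniffing at most max_sniffs times.

    Returns the first non-503 status, or None if every attempt returned 503.
    """
    sniffs_done = 0
    while True:
        for node in nodes:
            status = send_write(node)
            if status != 503:
                return status
        if sniffs_done >= max_sniffs:
            return None  # give up instead of looping write -> sniff -> write
        nodes = sniff()  # refresh the node list, then try once more
        sniffs_done += 1
```

Bounding the sniff count keeps a client from cycling forever when sniffing succeeds with a 200 but writes keep failing with a 503.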

@clintongormley clintongormley added discuss and removed help wanted adoptme labels Nov 26, 2016
@bleskes
Contributor

bleskes commented Nov 26, 2016

@clintongormley how are other transient errors that are not reflected by / handled? For example, if the queues are full we return a 429 code. Does that have special handling? Another example: the circuit breaker throws a 503 too. That one is not reflected by /. Should it be?

Another aspect to consider here: on master loss (since 1.4), all data nodes will have a master block for 3s. If you hit / at that moment, no node will be happy, and that's what I believe this ticket is about. I'm not saying that this is what we should do, but I think this should be taken into account in the solution.

@clintongormley
Contributor

We have said that a 503 response code should mean "retry on another node".

if the queues are full we return a 429 code. Does that have special handling?

Not in the Perl client, but I'm not sure about the others. I think a 429 should probably not trigger a retry but a back-off instead.

the circuit breaker throws a 503 too

That means retry on another node. This one is debatable: if the request you sent is what triggered the circuit breaker, retrying it could replicate that bad behaviour across all nodes in the cluster.
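
The status-code conventions described in this comment can be summarised as a small policy function. This is a sketch of the convention as stated here, not the behaviour of any particular client; `next_action` and the action names are made up for illustration, and it deliberately does not try to tell a circuit-breaker 503 apart from any other 503.

```python
def next_action(status: int) -> str:
    """Map an HTTP status to a client action (hypothetical policy sketch)."""
    if 200 <= status < 300:
        return "ok"
    if status == 429:
        return "back_off"  # queues are full: wait rather than retry right away
    if status == 503:
        return "retry_other_node"  # the convention stated in this comment
    return "fail"
```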

@jasontedor
Member

A 503 is completely broken behavior here. A REST status is a response for the given request. A 503 means "I am overloaded right now, I can not handle your request." That is completely out of alignment with a discovery.zen.no_master_block block. If the server can respond to the / request, it is not overloaded.

@jasontedor
Member

I opened #29045.

jmlrt added a commit to jmlrt/helm-charts that referenced this issue Apr 17, 2020
This PR updates the readiness probe to check only the `/` endpoint instead of `/_cluster/health?timeout=0s` when Elasticsearch is already running.
This reverts to the initial config, which was changed in elastic#380, with the exception that the 503 HTTP code is accepted for 6.x (see elastic/elasticsearch#8902 for more details about why 503 is OK on Elasticsearch 6.x).
galina-tochilkin pushed a commit to mtp-devops/3d-party-helm that referenced this issue Dec 20, 2022
This PR updates the readiness probe to check only the `/` endpoint instead of `/_cluster/health?timeout=0s` when Elasticsearch is already running.
This reverts to the initial config, which was changed in elastic/helm-charts#380, with the exception that the 503 HTTP code is accepted for 6.x (see elastic/elasticsearch#8902 for more details about why 503 is OK on Elasticsearch 6.x).
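
The acceptance rule described in the commit message above (200 is always ready, 503 is tolerated only on 6.x) might be expressed as follows. This is an illustrative sketch, not the helm chart's actual probe script; `probe_ready` and `es_major_version` are hypothetical names.

```python
def probe_ready(status: int, es_major_version: int) -> bool:
    """Decide readiness from the status of GET / (hypothetical helper)."""
    if status == 200:
        return True
    # On 6.x, GET / can return 503 while a no-master block is in place
    # (the subject of this issue), so the probe tolerates it there.
    return status == 503 and es_major_version == 6
```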