Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broadcast responses always return 200 - OK #29169

Closed
javanna opened this issue Mar 20, 2018 · 8 comments
Closed

Broadcast responses always return 200 - OK #29169

javanna opened this issue Mar 20, 2018 · 8 comments
Labels
:Core/Infra/REST API REST infrastructure and utilities >enhancement help wanted adoptme Team:Core/Infra Meta label for core/infra team

Comments

@javanna
Copy link
Member

javanna commented Mar 20, 2018

Besides RefreshResponse, all our BroadcastResponses always return 200, despite they may contain shard failures. Should we change that to not ignore the status from the response? Here is a list of the API that are affected:

  • upgrade
  • upgrade status
  • indices segments
  • clear cache
  • flush
  • validate query
  • recovery
  • indices stats
  • force merge

Only the refresh response calls the following method, which already isn't great as it picks the code from the first shard failure, but at least it doesn't always return 200:

    /**
     * The REST status that should be used for the response
     */
    public RestStatus getStatus() {
        if (failedShards > 0) {
            return shardFailures[0].status();
        } else {
            return RestStatus.OK;
        }
    }
@javanna javanna added discuss :Core/Infra/REST API REST infrastructure and utilities labels Mar 20, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@jasontedor
Copy link
Member

Relates #23059, relates #28522

@jasontedor
Copy link
Member

Good issue to discuss @javanna.

I think that we are doing the right thing here; a 200 OK from the coordinating node means: I got your request, it checked out okay, and I didn't blow up dutifully forwarding the request to other node(s) in the cluster nor in parsing their response(s), and here is a response back. I think we should reserve a 400 reply from the coordinating node for a bad request that was sent to the coordinating node (e.g., an unrecognized request parameter) and a 500 reply from the coordinating node for the coordinating node blowing up coordinating the request (as opposed to executing one of the sub-requests if the coordinating node holds any relevant shards).

I do conceptually think this is the right thing, and I also think that there are problems with alternatives:

  • if there are multiple failures, there is no good choice of which status code to use
  • a 400 reply for a bad request to the coordinating node would be indistinguishable from a 400 reply due to a bad sub-request (e.g., a non-existent index on one of the sub-requests in a bulk indexing request); the client will still have to parse the errors and find out what happened
  • a 500 reply for the coordinating node blowing up while coordinating (a stupid NPE) would be indistinguishable from a 500 reply due one of the shards blowing up handling a sub-request; the client will still have to parse the errors and find out what happened

In fact, I think the refresh behavior is wrong here; this should be considered a bug and fixed.

@bleskes
Copy link
Contributor

bleskes commented Mar 21, 2018

We bump every once in a while into the fact that there is no status code for "partial results". As @jasontedor noted we had similar discussions for the bulk API. I'm +1 on Jason's suggestion - "bundled" requests' status should reflect the result of coordination. For individual sub-request people should look at the body.

@javanna
Copy link
Member Author

javanna commented Mar 21, 2018

See 39e7c30 for the commit that introduced returning error if there are shard failures for refresh API. Not sure what the implications could be of changing this back. Seems like the getStatus method should have been added in RefreshResponse rather than BroadcastResponse?

When it comes to API like refresh and flush and force merge that change the state of the index, I struggle seeing this as "partial results". I do get the point of looking at the response body which holds failures. Should the behaviour change as well then if we ever change the default behaviour for the search API?

@bleskes
Copy link
Contributor

bleskes commented Mar 23, 2018

We discussed this some more and came up with these guidelines:

  1. _bulk, _search and _msearch are special and have their own semantics.
  2. Broadcast requests that should do something on behalf of the user (like flush and refresh) should return a non 2xx code if that things was not successfully done on any of the targets.
  3. Other broadcast requests (like node stats) should use the body to communicate a failure of one of the targets.

@jaymode
Copy link
Member

jaymode commented Dec 14, 2020

Broadcast requests that should do something on behalf of the user (like flush and refresh) should return a non 2xx code if that things was not successfully done on any of the targets.

Other broadcast requests (like node stats) should use the body to communicate a failure of one of the targets.

@russcam based on your proposal in #60442 for using 207 as the response code when there are varying results, would you expect that to apply in these situations as well?

@jaymode jaymode removed the needs:triage Requires assignment of a team area label label Dec 14, 2020
@thecoop
Copy link
Member

thecoop commented Nov 1, 2024

This is covered by #60442, so this can be looked at there.

@thecoop thecoop closed this as not planned Won't fix, can't repro, duplicate, stale Nov 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/REST API REST infrastructure and utilities >enhancement help wanted adoptme Team:Core/Infra Meta label for core/infra team
Projects
None yet
Development

No branches or pull requests

7 participants