Elasticsearch does not indicate retryability when flood stage is exceeded #49393

Closed
jasontedor opened this issue Nov 20, 2019 · 7 comments
Labels: >bug, :Distributed Indexing/CRUD, help wanted adoptme, Team:Distributed

Comments

@jasontedor
Member

jasontedor commented Nov 20, 2019

Today, if a node exceeds the disk flood stage watermark, the disk threshold monitor applies a special read-only index block to every index that has a shard allocated to that node. This block carries a forbidden status code, so a client that attempts to index into such an index receives an HTTP 403 status code.

Clients assume that a 403 status code is not retryable and they drop data.

This situation is retryable though, as once the disk threshold monitor observes the free disk space go above the appropriate threshold, the index block is automatically removed.

Rather than expecting our clients to all account for this situation (by inspecting the specifics of the exception that led to the 403 status code), we should indicate retryability by using HTTP status code 429. While 429 is often translated as "too many requests", the HTTP specification is liberal about what this means:

Note that this specification does not define how the origin server identifies the user, nor how it counts requests. For example, an origin server that is limiting request rates can do so based upon counts of requests on a per-resource basis, across the entire server, or even among a set of servers.

By making this change, all of our clients can start retrying when faced with an index that was marked read-only due to a flood stage watermark exceeded event.
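To illustrate what "start retrying" could look like on the client side, here is a minimal sketch of a retry loop that treats 429 as retryable and everything else (including 403) as final. The `send` callable, the function name, and the backoff parameters are hypothetical and are not taken from any Elasticsearch client.

```python
import time
from typing import Callable, Tuple

# Status codes this sketch treats as "try again later".
RETRYABLE_STATUSES = {429}


def index_with_retry(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, int]:
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a hypothetical callable that performs one indexing request and
    returns the HTTP status code. Returns (last_status, attempts_made).
    """
    status = 0
    for attempt in range(max_attempts):
        status = send()
        if status not in RETRYABLE_STATUSES:
            # Success, or a failure such as 403 that is not worth retrying.
            return status, attempt + 1
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
    return status, max_attempts
```

With a loop like this, an index that is blocked while the disk is full keeps being retried until the block is lifted or the attempt budget is spent, instead of having its data dropped on the first response.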

Similarly, the status codes of other cluster blocks should be reexamined in this context.

@jasontedor jasontedor added >bug :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. labels Nov 20, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/CRUD)

@ywelsch ywelsch added the help wanted adoptme label Nov 26, 2019
@gaobinlong
Contributor

Hi @jasontedor, I'm interested in this issue. Should we also return a 429 status code if the cluster block is set manually rather than automatically when the flood stage is exceeded?

@jasontedor
Member Author

@gaobinlong I think it's fine to treat them the same. I wish we had an easy way to distinguish when it's set automatically from when it's not, but we don't really, so let's proceed to treat them as the same.

@gaobinlong
Contributor

@jasontedor ok, I got it.

@gaobinlong
Contributor

Hi @jasontedor, I have made a PR for this issue. Can you help review the code change?

henningandersen pushed a commit that referenced this issue Feb 22, 2020
We consider index level read_only_allow_delete blocks temporary since
the DiskThresholdMonitor can automatically release those when an index
is no longer allocated on nodes above high threshold.

The rest status has therefore been changed to 429 when encountering this
index block to signal retryability to clients.

Related to #49393
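A client that talks to servers both with and without this change might classify responses along the following lines. This is only a sketch: the function name is hypothetical, and the error type `cluster_block_exception` and the reason substring checked for 403 responses are assumptions about the error body that should be verified against real responses.

```python
from typing import Optional

# Assumption: substring expected in the error reason on servers that still
# answer 403 for the read_only_allow_delete block. Verify before relying on it.
ASSUMED_BLOCK_MARKER = "read-only / allow delete"


def is_retryable(
    status: int,
    error_type: Optional[str] = None,
    reason: Optional[str] = None,
) -> bool:
    """Decide whether a failed indexing response is worth retrying."""
    if status == 429:
        # Behaviour after the change: the status code alone signals retryability.
        return True
    if status == 403 and error_type == "cluster_block_exception":
        # Behaviour before the change: the client has to inspect the error details.
        return reason is not None and ASSUMED_BLOCK_MARKER in reason
    return False
```

As discussed above, manually set and automatically set blocks are treated the same, so a check like this would also retry against a block that an operator set on purpose.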
henningandersen pushed a commit to henningandersen/elasticsearch that referenced this issue Feb 22, 2020
…#50166)

@rjernst rjernst added the Team:Distributed Meta label for distributed team (obsolete) label May 4, 2020
@zez3

zez3 commented Mar 27, 2021

#50166

It has been brought to my attention that this PR applies from 7.7 onwards.

@DaveCTurner
Contributor

Closed by #50166.
