
RestClient: retry timeout starts when the request is submitted to the client, disabling retry when client is under heavy load #25951

Closed
yrodiere opened this issue Jul 28, 2017 · 3 comments
Labels
>bug  :Clients/Java Low Level REST Client (Minimal dependencies Java Client for Elasticsearch)

Comments

yrodiere (Contributor) commented Jul 28, 2017

Elasticsearch version: Any

Plugins installed: N/A

JVM version (java -version): Any

OS version (uname -a if on a Unix-like system): Any

Description of the problem including expected versus actual behavior:

The retry timeout implemented in RestClient starts counting when the request is submitted to the Apache HTTP client, not when the request is actually sent to the Elasticsearch cluster. Since the client has an unbounded request queue, the request may be sent long after it was submitted.
As a result, when the HTTP client is already busy handling a lot of requests, the RestClient will never retry failing requests, because failures will always occur more than 30s after the failing request was submitted (or whatever your timeout is; 30s is just the default).
The problem is even worse with more reasonable timeouts, such as 1 or 2 seconds: as soon as the client gets clogged with more than 1 or 2 seconds' worth of requests, retry is effectively disabled.

With synchronous requests this is not really a problem, because the submitting thread stops waiting after maxRetryTimeout anyway, so the failure happens long after the submitting thread has given up on the request.
With asynchronous requests, though, there's a good chance you chose asynchronous execution precisely because you knew the requests could take a long time to execute. In that case, the problem will occur every time a request fails.

One example of a use case where this behavior is undesirable is when indexing a huge amount of data, for example when initializing the indexes from an external data source. In this case, you want to send a lot of asynchronous requests to the client, so as to be sure you're using the cluster to the maximum of its capacity, and as a result the client's request queue will probably be very long. Yet you still want failing requests to be retried...

Steps to reproduce:

  1. Set the max retry timeout to 10s, and use a single Elasticsearch host
  2. Submit (asynchronously) 40 requests, each taking about 1s to execute, to the RestClient; this will keep the two transport threads busy for approximately 20s.
  3. Just after that, submit (still asynchronously) another request that will fail after, say, 1s.
  4. The request from step 3 will fail, but it will not be retried, even though it only had 1 second to execute.
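The queueing effect in the steps above can be reproduced with a plain `ExecutorService`, used here as a scaled-down stand-in for the HTTP client's two transport threads and its unbounded queue. `QueueDelayDemo`, the backlog size, and the task duration are all illustrative, not RestClient internals:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

public class QueueDelayDemo {

    /**
     * Submits {@code backlog} tasks to a 2-thread pool, then one more task,
     * and returns {elapsedFromSubmitMs, elapsedFromStartMs} for that last
     * task. The gap between the two clocks is the queueing delay that the
     * retry timeout wrongly counts against the request.
     */
    static long[] measure(int backlog, long taskMillis) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        Runnable work = () -> {
            try { Thread.sleep(taskMillis); } catch (InterruptedException ignored) { }
        };
        for (int i = 0; i < backlog; i++) {
            pool.submit(work);
        }
        long submitted = System.nanoTime();
        AtomicLong started = new AtomicLong();
        Future<?> last = pool.submit(() -> {
            started.set(System.nanoTime()); // the request actually starts executing here
            work.run();
        });
        last.get();
        long done = System.nanoTime();
        pool.shutdownNow();
        return new long[] {
            (done - submitted) / 1_000_000,    // what the retry timeout measures
            (done - started.get()) / 1_000_000 // what it arguably should measure
        };
    }

    public static void main(String[] args) throws Exception {
        long[] t = measure(40, 50); // 40 x 50 ms tasks keep both threads busy ~1 s
        System.out.println("from submission: ~" + t[0]
            + " ms, from execution start: ~" + t[1] + " ms");
    }
}
```

With the max retry timeout scaled down proportionally, the last task would already be "expired" on the submission clock while having barely started on the execution clock.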

Solution:

One solution would be to set the starting point for the retry timeout when the Apache HTTP client actually starts processing the request.
This would not affect the timeout for synchronous requests, but would still provide a significant improvement for asynchronous requests.
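A minimal sketch of that idea, anchoring the deadline when the http client dequeues the request and comparing failures against that point. The names `RetryDeadline`, `onProcessingStarted`, and `shouldRetry` are hypothetical, not actual RestClient internals:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the proposed fix: the retry deadline starts
// when the http client begins processing the request, not at submission.
final class RetryDeadline {

    private final long maxRetryTimeoutNanos;
    private volatile long startNanos = -1;

    RetryDeadline(long maxRetryTimeoutMillis) {
        this.maxRetryTimeoutNanos = TimeUnit.MILLISECONDS.toNanos(maxRetryTimeoutMillis);
    }

    /** Called when the http client dequeues the request and starts it. */
    void onProcessingStarted(long nowNanos) {
        if (startNanos < 0) {
            startNanos = nowNanos;
        }
    }

    /**
     * A failed attempt is retried only if the deadline, measured from
     * processing start rather than from submission, has not expired.
     */
    boolean shouldRetry(long nowNanos) {
        return startNanos >= 0 && nowNanos - startNanos < maxRetryTimeoutNanos;
    }
}
```

Time spent in the client's queue no longer counts against the deadline, so a request that fails 1 second after it starts executing is still retried regardless of the backlog.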

@javanna javanna added :Clients/Java Low Level REST Client Minimal dependencies Java Client for Elasticsearch >bug labels Jul 31, 2017
@javanna javanna self-assigned this Jul 31, 2017
haggy commented Mar 12, 2018

We are running into this as well. Are there any known workarounds yet? Is the fix scheduled?

bleskes (Contributor) commented Mar 12, 2018

/cc @elastic/es-core-infra

@javanna javanna removed their assignment Aug 10, 2018
javanna added a commit that referenced this issue Feb 6, 2019
We have had various reports of problems caused by the maxRetryTimeout
setting in the low-level REST client. The setting was initially added
in an attempt to not have requests go through retries if the request
had already taken longer than the provided timeout.

The implementation was problematic though, as the timeout would also
expire on the first request attempt (see #31834), would leave the
request executing after expiration, causing memory leaks (see #33342),
and would not take into account the http client's internal queuing (see #25951).

Given all these issues, it seems that this custom timeout mechanism
gives little benefit while causing a lot of harm. We should rather rely
on the connect and socket timeouts exposed by the underlying http client
and accept that a request can overall take longer than the configured
timeout, which is the case even with a single retry anyway.

This commit removes the `maxRetryTimeout` setting and all of its usages.
javanna (Member) commented Feb 6, 2019

Thanks for the detailed explanation. We've got other reports of issues caused by the max retry timeout, and the more I looked into it the more I realized that this timeout mechanism did not play well with the internals of the underlying http client that we rely on. I think it's better to rely on the different timeouts exposed by the http client directly and remove this additional mechanism that ends up causing more harm than good. We have deprecated max retry timeout in 6.x, as well as increased its default value to 90 seconds (#38425) and removed it in 7.0 (#38085).
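For reference, the replacement approach relies on the connect and socket timeouts of the underlying http client, configurable through the documented `RestClientBuilder` request-config callback. The host and timeout values below are just examples:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

RestClient restClient = RestClient.builder(new HttpHost("localhost", 9200, "http"))
    .setRequestConfigCallback(requestConfigBuilder -> requestConfigBuilder
        .setConnectTimeout(5_000)    // ms allowed to establish the connection
        .setSocketTimeout(60_000))   // ms of socket inactivity before the request fails
    .build();
```

Unlike the removed maxRetryTimeout, these clocks only run while the request is actually on the wire, so they are unaffected by the client's internal queue.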

@javanna javanna closed this as completed Feb 6, 2019
yrodiere added a commit to yrodiere/hibernate-search that referenced this issue Apr 26, 2019
…eout" anymore

It is no longer available in Elasticsearch 7, and we already have our
own timeout. It just won't retry anymore.
See elastic/elasticsearch#25951 (comment)

Note this clears the path for more useful timeouts that start when the
request is sent to the cluster rather than when the request is queued.
See https://hibernate.atlassian.net/browse/HSEARCH-2836.
gsmet pushed a commit to hibernate/hibernate-search that referenced this issue Apr 26, 2019