RestClient: retry timeout starts when the request is submitted to the client, disabling retry when client is under heavy load #25951
Comments
We are running into this as well. Are there any known workarounds yet? Is the fix scheduled?
/cc @elastic/es-core-infra
We have had various reports of problems caused by the maxRetryTimeout setting in the low-level REST client. This setting was originally added so that a request would not go through retries if it had already taken longer than the configured timeout. The implementation was problematic, though: the timeout could expire during the first request attempt (see #31834), the request would keep executing after expiration, causing memory leaks (see #33342), and the timeout did not take the http client's internal queuing into account (see #25951). Given all these issues, this custom timeout mechanism provides little benefit while causing a lot of harm. We should instead rely on the connect and socket timeouts exposed by the underlying http client and accept that a request can overall take longer than the configured timeout, which is already the case with even a single retry. This commit removes the `maxRetryTimeout` setting and all of its usages.
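For reference, here is a minimal sketch of configuring those connect and socket timeouts directly on the low-level client, assuming a client version where `maxRetryTimeout` has been removed; the host, port, and timeout values are placeholders.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class TimeoutConfigExample {
    public static RestClient build() {
        return RestClient.builder(new HttpHost("localhost", 9200, "http"))
                // These timeouts apply per connection/attempt inside the Apache HTTP
                // client; they do not bound the overall time a request spends queued
                // in the client before it is actually sent.
                .setRequestConfigCallback(requestConfig -> requestConfig
                        .setConnectTimeout(5_000)      // time to establish a connection
                        .setSocketTimeout(60_000))     // max inactivity while waiting for data
                .build();
    }
}
```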
Thanks for the detailed explanation. We've got other reports of issues caused by the max retry timeout, and the more I looked into it the more I realized that this timeout mechanism did not play well with the internals of the underlying http client that we rely on. I think it's better to rely on the different timeouts exposed by the http client directly and remove this additional mechanism that ends up causing more harm than good. We have deprecated max retry timeout in 6.x, as well as increased its default value to 90 seconds (#38425), and removed it in 7.0 (#38085).
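For anyone still on 6.x, the now-deprecated setting was configured on the builder roughly as in the sketch below (the 90-second value mirrors the raised default mentioned above; host and port are placeholders).

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class DeprecatedMaxRetryTimeoutExample {
    public static RestClient build() {
        return RestClient.builder(new HttpHost("localhost", 9200, "http"))
                // Deprecated in 6.x and removed in 7.0. Note that the clock for this
                // timeout starts when the request is submitted to the client, not
                // when it is actually sent to the cluster.
                .setMaxRetryTimeoutMillis(90_000)
                .build();
    }
}
```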
…eout" anymore It is no longer available in Elasticsearch 7, and we already have our own timeout. It just won't retry anymore. See elastic/elasticsearch#25951 (comment) Note this clears the path for more useful timeouts that start when the request is sent to the cluster rather than when the request is queued. See https://hibernate.atlassian.net/browse/HSEARCH-2836.
…eout" anymore It is no longer available in Elasticsearch 7, and we already have our own timeout. It just won't retry anymore. See elastic/elasticsearch#25951 (comment) Note this clears the path for more useful timeouts that start when the request is sent to the cluster rather than when the request is queued. See https://hibernate.atlassian.net/browse/HSEARCH-2836.
Elasticsearch version: Any
Plugins installed: N/A
JVM version (java -version): Any
OS version (uname -a if on a Unix-like system): Any
Description of the problem including expected versus actual behavior:
The starting point for the retry timeout implemented in RestClient is when the request is submitted to the Apache HTTP client, instead of when the request is actually sent to the Elasticsearch cluster. But the client has an unbounded request queue, so the request may be sent long after it was submitted.
As a result, when the HTTP client is already busy handling a lot of requests, the RestClient will never retry failing requests, because failures will always occur more than 30s after the failing request has been submitted (or whatever your timeout is, 30s is just the default).
The problem is even worse when you try to use more reasonable timeouts, such as 1 or 2 seconds. As soon as the client gets clogged up with more than 1 or 2 seconds worth of requests, retry is basically disabled.
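To make the timing issue concrete, here is a hypothetical, self-contained sketch (not the actual RestClient internals) of a retry budget anchored at submission time being consumed by client-side queueing; all delays are made-up numbers.

```java
public class RetryBudgetSketch {
    public static void main(String[] args) {
        long maxRetryTimeoutMillis = 2_000;   // a "reasonable" 2s retry budget
        long submittedAtMillis = 0;           // the clock starts at submission today
        long queueDelayMillis = 5_000;        // time spent in the client's unbounded queue
        long attemptMillis = 300;             // the actual (failed) attempt was fast

        long failedAtMillis = submittedAtMillis + queueDelayMillis + attemptMillis;
        boolean retryAllowed = (failedAtMillis - submittedAtMillis) < maxRetryTimeoutMillis;

        // Prints false: the budget was spent queueing, so the failure is never retried,
        // even though the attempt itself took only ~300 ms.
        System.out.println("retry allowed = " + retryAllowed);
    }
}
```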
With synchronous requests, this is not really a problem: the submitting thread will stop waiting after `maxRequestTimeout` anyway, so the failure will happen long after the submitting thread gave up on the request.
With asynchronous requests, though, there's a good chance you chose asynchronous execution precisely because you knew the request could take a long time to execute, in which case the problem will happen every time a request fails.
One example of a use case where this behavior is undesirable is when indexing a huge amount of data, for example when initializing the indexes from an external data source. In this case, you want to send a lot of asynchronous requests to the client, so as to be sure you're using the cluster to the maximum of its capacity, and as a result the client's request queue will probably be very long. Yet you still want failing requests to be retried...
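As an illustration of that use case, a bulk-indexing submission with the asynchronous API might look roughly like the sketch below (assuming the `Request`/`ResponseListener` API of recent clients; the endpoint and payload are placeholders).

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.ResponseListener;
import org.elasticsearch.client.RestClient;

public class AsyncBulkExample {
    static void submitBulk(RestClient restClient, String bulkBody) {
        Request request = new Request("POST", "/_bulk");
        request.setJsonEntity(bulkBody);
        // Many such calls can be submitted back-to-back; they queue inside the
        // Apache HTTP client, which is where the retry-timeout clock starts today.
        restClient.performRequestAsync(request, new ResponseListener() {
            @Override
            public void onSuccess(Response response) {
                // handle the bulk response
            }

            @Override
            public void onFailure(Exception exception) {
                // under heavy load this fires long after submission, i.e. past
                // maxRetryTimeout, so the client never retries on its own
            }
        });
    }
}
```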
Steps to reproduce:
Solution:
One solution would be to set the starting point for the retry timeout to when the Apache HTTP client actually starts processing the request.
This would not affect the timeout for synchronous requests, but would still provide a significant improvement for asynchronous requests.
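Until something like that lands, one possible application-side workaround (a sketch, not something proposed in this issue) is to retry from the failure listener with a bounded attempt count, so retries do not depend on the submission-anchored timeout; the method and parameter names are illustrative.

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.ResponseListener;
import org.elasticsearch.client.RestClient;

public class ManualRetryExample {
    static void sendWithRetry(RestClient restClient, Request request, int attemptsLeft) {
        restClient.performRequestAsync(request, new ResponseListener() {
            @Override
            public void onSuccess(Response response) {
                // done
            }

            @Override
            public void onFailure(Exception exception) {
                if (attemptsLeft > 1) {
                    // resubmit: the attempt budget is counted per failure and is
                    // independent of how long the request sat in the client's queue
                    sendWithRetry(restClient, request, attemptsLeft - 1);
                }
            }
        });
    }
}
```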