RestClient: retry timeout starts when the request is submitted to the client, disabling retry when client is under heavy load #25951
Comments
We are running into this as well. Are there any known workarounds yet? Is the fix scheduled?
/cc @elastic/es-core-infra
We have had various reports of problems caused by the maxRetryTimeout setting in the low-level REST client. This setting was originally added so that a request would not go through retries if it had already taken longer than the configured timeout. The implementation was problematic, though: the timeout could expire during the first request attempt (see #31834), the request would keep executing after expiration, causing memory leaks (see #33342), and the timeout did not take the http client's internal queuing into account (see #25951). Given all these issues, this custom timeout mechanism provides little benefit while causing a lot of harm. We should instead rely on the connect and socket timeouts exposed by the underlying http client and accept that a request can overall take longer than the configured timeout, which is already the case with even a single retry. This commit removes the `maxRetryTimeout` setting and all of its usages.
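For reference, here is a minimal sketch of configuring those connect and socket timeouts directly on the low-level client, assuming a client version where `maxRetryTimeout` has been removed; the host, port, and timeout values are placeholders.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class TimeoutConfigExample {
    public static RestClient build() {
        return RestClient.builder(new HttpHost("localhost", 9200, "http"))
                // These timeouts apply per connection/attempt inside the Apache HTTP
                // client; they do not bound the overall time a request spends queued
                // in the client before it is actually sent.
                .setRequestConfigCallback(requestConfig -> requestConfig
                        .setConnectTimeout(5_000)      // time to establish a connection
                        .setSocketTimeout(60_000))     // max inactivity while waiting for data
                .build();
    }
}
```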
Thanks for the detailed explanation. We've got other reports of issues caused by the max retry timeout, and the more I looked into it the more I realized that this timeout mechanism did not play well with the internals of the underlying http client that we rely on. I think it's better to rely on the different timeouts exposed by the http client directly and remove this additional mechanism that ends up causing more harm than good. We have deprecated max retry timeout in 6.x, as well as increased its default value to 90 seconds (#38425), and removed it in 7.0 (#38085).
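For anyone still on 6.x, the now-deprecated setting was configured on the builder roughly as in the sketch below (the 90-second value mirrors the raised default mentioned above; host and port are placeholders).

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

public class DeprecatedMaxRetryTimeoutExample {
    public static RestClient build() {
        return RestClient.builder(new HttpHost("localhost", 9200, "http"))
                // Deprecated in 6.x and removed in 7.0. Note that the clock for this
                // timeout starts when the request is submitted to the client, not
                // when it is actually sent to the cluster.
                .setMaxRetryTimeoutMillis(90_000)
                .build();
    }
}
```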
…eout" anymore It is no longer available in Elasticsearch 7, and we already have our own timeout. It just won't retry anymore. See elastic/elasticsearch#25951 (comment) Note this clears the path for more useful timeouts that start when the request is sent to the cluster rather than when the request is queued. See https://hibernate.atlassian.net/browse/HSEARCH-2836.
…eout" anymore It is no longer available in Elasticsearch 7, and we already have our own timeout. It just won't retry anymore. See elastic/elasticsearch#25951 (comment) Note this clears the path for more useful timeouts that start when the request is sent to the cluster rather than when the request is queued. See https://hibernate.atlassian.net/browse/HSEARCH-2836.
Elasticsearch version: Any
Plugins installed: N/A
JVM version (java -version): Any
OS version (uname -a if on a Unix-like system): Any
Description of the problem including expected versus actual behavior:
The starting point for the retry timeout implemented in RestClient is when the request is submitted to the Apache HTTP client, instead of when the request is actually sent to the Elasticsearch cluster. But the client has an unbounded request queue, so the request may be sent long after it was submitted.
As a result, when the HTTP client is already busy handling a lot of requests, the RestClient will never retry failing requests, because failures will always occur more than 30s after the failing request has been submitted (or whatever your timeout is, 30s is just the default).
The problem is even worse when you try to use more reasonable timeouts, such as 1 or 2 seconds. As soon as the client gets clogged up with more than 1 or 2 seconds worth of requests, retry is basically disabled.
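To make the timing issue concrete, here is a hypothetical, self-contained sketch (not the actual RestClient internals) of a retry budget anchored at submission time being consumed by client-side queueing; all delays are made-up numbers.

```java
public class RetryBudgetSketch {
    public static void main(String[] args) {
        long maxRetryTimeoutMillis = 2_000;   // a "reasonable" 2s retry budget
        long submittedAtMillis = 0;           // the clock starts at submission today
        long queueDelayMillis = 5_000;        // time spent in the client's unbounded queue
        long attemptMillis = 300;             // the actual (failed) attempt was fast

        long failedAtMillis = submittedAtMillis + queueDelayMillis + attemptMillis;
        boolean retryAllowed = (failedAtMillis - submittedAtMillis) < maxRetryTimeoutMillis;

        // Prints false: the budget was spent queueing, so the failure is never retried,
        // even though the attempt itself took only ~300 ms.
        System.out.println("retry allowed = " + retryAllowed);
    }
}
```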
With synchronous requests, this is not really a problem: the submitting thread will stop waiting after `maxRequestTimeout` anyway, so the failure will happen long after the submitting thread gave up on the request.
With asynchronous requests, though, there's a good chance you chose asynchronous execution precisely because you knew the request could take a long time to execute, in which case the problem will happen every time a request fails.
One example of a use case where this behavior is undesirable is when indexing a huge amount of data, for example when initializing the indexes from an external data source. In this case, you want to send a lot of asynchronous requests to the client, so as to be sure you're using the cluster to the maximum of its capacity, and as a result the client's request queue will probably be very long. Yet you still want failing requests to be retried...
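As an illustration of that use case, a bulk-indexing submission with the asynchronous API might look roughly like the sketch below (assuming the `Request`/`ResponseListener` API of recent clients; the endpoint and payload are placeholders).

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.ResponseListener;
import org.elasticsearch.client.RestClient;

public class AsyncBulkExample {
    static void submitBulk(RestClient restClient, String bulkBody) {
        Request request = new Request("POST", "/_bulk");
        request.setJsonEntity(bulkBody);
        // Many such calls can be submitted back-to-back; they queue inside the
        // Apache HTTP client, which is where the retry-timeout clock starts today.
        restClient.performRequestAsync(request, new ResponseListener() {
            @Override
            public void onSuccess(Response response) {
                // handle the bulk response
            }

            @Override
            public void onFailure(Exception exception) {
                // under heavy load this fires long after submission, i.e. past
                // maxRetryTimeout, so the client never retries on its own
            }
        });
    }
}
```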
Steps to reproduce:
Solution:
One solution would be to set the starting point for the retry timeout to when the Apache HTTP client actually starts processing the request.
This would not affect the timeout for synchronous requests, but would still provide a significant improvement for asynchronous requests.
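Until something like that lands, one possible application-side workaround (a sketch, not something proposed in this issue) is to retry from the failure listener with a bounded attempt count, so retries do not depend on the submission-anchored timeout; the method and parameter names are illustrative.

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.ResponseListener;
import org.elasticsearch.client.RestClient;

public class ManualRetryExample {
    static void sendWithRetry(RestClient restClient, Request request, int attemptsLeft) {
        restClient.performRequestAsync(request, new ResponseListener() {
            @Override
            public void onSuccess(Response response) {
                // done
            }

            @Override
            public void onFailure(Exception exception) {
                if (attemptsLeft > 1) {
                    // resubmit: the attempt budget is counted per failure and is
                    // independent of how long the request sat in the client's queue
                    sendWithRetry(restClient, request, attemptsLeft - 1);
                }
            }
        });
    }
}
```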