Cap max RetryableAction wait time/timeout. #74940

henningandersen · 2021-07-05T17:14:23Z

RetryableAction uses randomized exponential backoff. If unlucky,
the randomization could produce a series of very short waits, each of which
still doubles the bound, risking a subsequent very long wait. The delay is now
randomized within [bound/2, bound).
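
A minimal sketch of the two ways of drawing the delay, to illustrate the description above. It is not the actual RetryableAction code: the class name, method names, and the initial bound of 50 ms are hypothetical, and the old behavior is assumed here to be a draw from the whole range [1, bound].

```java
import java.util.Random;

class BackoffSketch {
    // Assumed old behavior: draw from the whole range [1, bound],
    // so a run of tiny delays is possible while the bound keeps doubling.
    static long delayFullRange(long bound, Random random) {
        return random.nextInt(Math.toIntExact(bound)) + 1L;
    }

    // Behavior described in this PR: draw from the upper half [bound/2, bound).
    static long delayUpperHalf(long bound, Random random) {
        long lower = bound / 2;
        return lower + random.nextInt(Math.toIntExact(bound - lower));
    }

    public static void main(String[] args) {
        Random random = new Random();
        long bound = 50; // hypothetical initial bound in milliseconds
        for (int attempt = 1; attempt <= 8; attempt++) {
            System.out.printf("attempt %d: bound=%d fullRange=%d upperHalf=%d%n",
                attempt, bound, delayFullRange(bound, random), delayUpperHalf(bound, random));
            bound *= 2; // the bound grows exponentially, independent of the drawn delay
        }
    }
}
```

With the upper-half draw, every wait is at least about half of its bound, so the elapsed time grows roughly in step with the bound and the bound cannot run far ahead of the timeout.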

Closes #70996

This fixes testAckedIndexing because it ran into a final very long wait time for one of the retries.

I am not necessarily fixed on this solution, though it seems like a step in the right direction. An
alternative could be to cap the final wait time to meet the timeout, but that would make the behavior more
deterministic and less random, and the solution here should be good enough.
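
For comparison, the alternative mentioned above (capping the final wait so that it meets the timeout) could look roughly like the following. This is only an illustration of the idea, not code from this PR; the class, method, and parameter names are hypothetical.

```java
class CappedDelaySketch {
    // Hypothetical helper: shorten a drawn delay so that the retry
    // does not fire after the overall timeout has passed.
    static long capToTimeout(long delayMillis, long elapsedMillis, long timeoutMillis) {
        long remainingMillis = Math.max(0L, timeoutMillis - elapsedMillis);
        return Math.min(delayMillis, remainingMillis);
    }

    public static void main(String[] args) {
        // 900 ms of a 1000 ms timeout have elapsed; a drawn delay of 500 ms is cut to 100 ms.
        System.out.println(capToTimeout(500, 900, 1000));
        // A short delay that already fits inside the timeout is left unchanged.
        System.out.println(capToTimeout(50, 900, 1000));
    }
}
```

The trade-off noted above shows up here: whenever the cap applies, the last retry lands exactly on the deadline rather than at a random point, which makes the behavior more deterministic.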

elasticmachine · 2021-07-05T17:14:26Z

Pinging @elastic/ml-core (Team:ML)

elasticmachine · 2021-07-05T17:14:26Z

Pinging @elastic/es-distributed (Team:Distributed)

davidkyle

ML changes LGTM

Consider renaming the calculateDelay(long) function to calculateMaxDelay(long) or calculateMaxDelayBound(long) as the logic to determine the actual delay passed to threadPool.schedule is in the onFailure method.

...n/ml/src/main/java/org/elasticsearch/xpack/ml/utils/persistence/ResultsPersisterService.java

astefan · 2021-07-15T12:20:27Z

Will this PR fix timeouts like this one? https://gradle-enterprise.elastic.co/s/qvkqnuh5ocffw

henningandersen · 2021-08-04T18:53:38Z

@astefan

> Will this PR fix timeouts like this one? https://gradle-enterprise.elastic.co/s/qvkqnuh5ocffw

I am afraid not; the symptoms of the failing testAckedIndexing in that Gradle scan look quite different. On the other hand, it is probably worth getting this PR in before investing time in looking into that failure.

Tim-Brooks

LGTM

elasticsearchmachine · 2021-08-05T09:04:49Z

💔 Backport failed

Status	Branch	Result
❌	7.14	Commit could not be cherrypicked due to conflicts
❌	7.x	Commit could not be cherrypicked due to conflicts

To backport manually run:
backport --pr 74940

henningandersen added the >bug, :Distributed Indexing/CRUD, :ml, v8.0.0, v7.14.1, and v7.15.0 labels Jul 5, 2021

elasticmachine added the Team:Distributed (Obsolete) and Team:ML labels Jul 5, 2021

henningandersen requested review from benwtrent and Tim-Brooks July 5, 2021 18:21

davidkyle approved these changes Jul 6, 2021

...n/ml/src/main/java/org/elasticsearch/xpack/ml/utils/persistence/ResultsPersisterService.java (outdated)

benwtrent approved these changes Jul 12, 2021

Tim-Brooks approved these changes Aug 5, 2021

henningandersen and others added 3 commits August 5, 2021 08:25

Update x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/util…

31af859

…s/persistence/ResultsPersisterService.java Co-authored-by: David Kyle <[email protected]>

Merge remote-tracking branch 'origin/master' into fix_retryable_actio…

9e3db25

…n_timeout

Rename calculateDelay

f69cc7b

henningandersen merged commit 0fd3f76 into elastic:master Aug 5, 2021

henningandersen added the auto-backport label Aug 5, 2021

henningandersen removed the auto-backport label Aug 5, 2021

henningandersen mentioned this pull request Aug 5, 2021

Cap max RetryableAction wait time/timeout. (#74940) #76152

Merged

stu-elastic mentioned this pull request Sep 8, 2021

[CI] RetryableActionTests testRetryableActionTimeout failing #76165

Closed

jakelandis added v8.0.0-alpha2 and removed v8.0.0 labels Sep 15, 2021
