Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cap max RetryableAction wait time/timeout. #74940

Merged

Conversation

henningandersen
Copy link
Contributor

RetryableAction uses randomized and exponential back off. If unlucky,
the randomization would cause a series of very short waits, which would
double the bound every time, risking a subsequent very long wait. Now
randomize between [bound/2, bound[.

Closes #70996

This fixes testAckedIndexing because it ran into a final very long wait time for one of the retries.

I am not necessarily fixed on this solution, though it seems like a step in the right direction. An
alternative could be to cap the final wait time to meet the timeout, but it does add more deterministic/less random
behavior and the solution here should be good enough.

RetryableAction uses randomized and exponential back off. If unlucky,
the randomization would cause a series of very short waits, which would
double the bound every time, risking a subsequent very long wait. Now
randomize between [bound/2, bound[.

Closes elastic#70996
@henningandersen henningandersen added >bug :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. :ml Machine learning v8.0.0 v7.14.1 v7.15.0 labels Jul 5, 2021
@elasticmachine elasticmachine added Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. Team:ML Meta label for the ML team labels Jul 5, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ML changes LGTM

Consider renaming the calculateDelay(long) function to calculateMaxDelay(long) or calculateMaxDelayBound(long) as the logic to determine the actual delay passed to threadPool.schedule is in the onFailure method.

@astefan
Copy link
Contributor

astefan commented Jul 15, 2021

Will this PR fix timeouts like this one? https://gradle-enterprise.elastic.co/s/qvkqnuh5ocffw

@henningandersen
Copy link
Contributor Author

@astefan

Will this PR fix timeouts like this one? https://gradle-enterprise.elastic.co/s/qvkqnuh5ocffw

I am afraid not, the symptoms of the failing testAckedIndexing in that gradle scan look quite different. On the other hand it is probably worth getting this PR in before investing in looking into this.

Copy link
Contributor

@Tim-Brooks Tim-Brooks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@henningandersen henningandersen merged commit 0fd3f76 into elastic:master Aug 5, 2021
@henningandersen henningandersen added the auto-backport Automatically create backport pull requests when merged label Aug 5, 2021
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
7.14 Commit could not be cherrypicked due to conflicts
7.x Commit could not be cherrypicked due to conflicts

To backport manually run:
backport --pr 74940

@henningandersen henningandersen removed the auto-backport Automatically create backport pull requests when merged label Aug 5, 2021
henningandersen added a commit to henningandersen/elasticsearch that referenced this pull request Aug 5, 2021
RetryableAction uses randomized and exponential back off. If unlucky,
the randomization would cause a series of very short waits, which would
double the bound every time, risking a subsequent very long wait. Now
randomize between [bound/2, bound[.

Closes elastic#70996
henningandersen added a commit that referenced this pull request Aug 5, 2021
RetryableAction uses randomized and exponential back off. If unlucky,
the randomization would cause a series of very short waits, which would
double the bound every time, risking a subsequent very long wait. Now
randomize between [bound/2, bound[.

Closes #70996
henningandersen added a commit that referenced this pull request Aug 5, 2021
RetryableAction uses randomized and exponential back off. If unlucky,
the randomization would cause a series of very short waits, which would
double the bound every time, risking a subsequent very long wait. Now
randomize between [bound/2, bound[.

Closes #70996
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Indexing/CRUD A catch all label for issues around indexing, updating and getting a doc by id. Not search. :ml Machine learning Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. Team:ML Meta label for the ML team v7.14.1 v7.15.0 v8.0.0-alpha2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ClusterDisruptionIT.testAckedIndexing test failure
8 participants