-
Notifications
You must be signed in to change notification settings - Fork 323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make HTTP tests more robust by adding retries to the tests #9652
Conversation
if i > max_iterations then Panic.throw caught_panic else | ||
if i % 10 == 0 then | ||
IO.println "Still failing after "+i.to_text+" retries. ("+loc.to_display_text+")" | ||
Thread.sleep (1000*sleep_time . floor) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The total time spent might be significantly longer if the action
itself takes non-negligible time; it might be better to check the current time against (start_time + total_sleep_delay)
rather than using a counter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point, but I feel like the current behaviour is what we want.
If the action takes 3s to complete due to bad network conditions and it fails on a timeout, then with a retry delay of 2s - it will not retry at all... But the whole point of this is to do some retries. I think it's better to do the same number of retries regardless of how long the underlying action is taking.
The total_sleep_delay
is just used to approximate the total wait time. But I guess I can rephrase this to just be max_retries
counter and remove the total_sleep_delay
altogether, if that will be clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a 2-linear backoff? E.g., first retry waits for 2 seconds, another for 4 seconds, another for 8 seconds, 16 secs, etc.... The way you coded it, it will wait for 100 seconds on the CI after every retry, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a 2-linear backoff? E.g., first retry waits for 2 seconds, another for 4 seconds, another for 8 seconds, 16 secs, etc.... The way you coded it, it will wait for 100 seconds on the CI after every retry, right?
It will wait for 100 milliseconds between every retry, not 100 seconds 😅
I feel like this is unnecessarily complicating stuff. I want the test to finish as soon as possible, so increasing the wait time does not seem to make that better. The strategy we have here was already successfully used for running cloud tests with propagation delays. I don't think there's value in complicating this strategy until we have a reason to do so. For now, I don't see any reasons - it works good enough and is simple.
Pull Request Description
with_retries
method to now beTest.with_retries
which can be used anywhere in our tests for the retry logic.Data.fetch
. We may add that later, but that needs some further design. In such case we may remove some retries from tests if they become unnecessary.Important Notes
Checklist
Please ensure that the following checklist has been satisfied before submitting the PR:
Scala,
Java,
and
Rust
style guides. In case you are using a language not listed above, follow the Rust style guide.
./run ide build
.