
Add some randomness to the exponential backoff of max_retries #1336

Closed
keymon opened this issue Nov 7, 2016 · 5 comments
Labels
feature-request A feature should be added or improved.

Comments

@keymon

keymon commented Nov 7, 2016

For rate-limit-related errors, the retry logic configured by max_retries does an exponential backoff following this function: 2.0 ** retries * 0.3

That is great, but there is a problem: the backoff time is exactly the same for every execution. This is problematic when you are doing lots of operations from multiple threads, for instance creating an environment with multiple VMs using 32 threads. All the threads execute the same API calls at the same time, and if several threads fail at the same time because of the AWS rate limit, they will all retry after the same number of seconds.
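
To make the collision concrete, this is the schedule that every thread computes from that formula (retries 1 to 10):

# The current backoff schedule in seconds -- identical for every thread and process
(1..10).map { |r| 2.0 ** r * 0.3 }
# => [0.6, 1.2, 2.4, 4.8, 9.6, 19.2, 38.4, 76.8, 153.6, 307.2]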

What I want to propose is to add a little randomness to that backoff. We could use this function:

def backoff(r)
  b = 2.0 ** r * 0.3                         # current exponential backoff
  v = 1.25 ** r * 0.3 * (Random.rand - 0.5)  # random offset whose width grows with r
  b + v
end

With this code we can test it a little bit:

(1..5).each { |run| # repeat the experiment
  puts "---- Execution: #{run} ----"
  (1..10).each { |r|
    b = backoff(r)
    puts "Retries #{r}: #{b}s == #{(b / 60).floor}m #{b % 60}s"
  }
}
---- Execution: 1 ----
Retries 1: 0.4723107748709313s == 0m 0.4723107748709313s
Retries 2: 1.4083360795375834s == 0m 1.4083360795375834s
Retries 3: 2.271092661113177s == 0m 2.271092661113177s
Retries 4: 4.916736747923815s == 0m 4.916736747923815s
Retries 5: 10.000144551307137s == 0m 10.000144551307137s
Retries 6: 18.839097183818104s == 0m 18.839097183818104s
Retries 7: 38.480158713168855s == 0m 38.480158713168855s
Retries 8: 77.45500289555731s == 1m 17.455002895557314s
Retries 9: 152.7204691297255s == 2m 32.7204691297255s
Retries 10: 307.6726209158677s == 5m 7.672620915867697s
---- Execution: 2 ----
Retries 1: 0.6050370631541719s == 0m 0.6050370631541719s
Retries 2: 1.345491076957416s == 0m 1.345491076957416s
Retries 3: 2.334873909044141s == 0m 2.334873909044141s
Retries 4: 5.13400991278694s == 0m 5.13400991278694s
Retries 5: 9.521875510187117s == 0m 9.521875510187117s
Retries 6: 19.662625638261183s == 0m 19.662625638261183s
Retries 7: 38.171137370653234s == 0m 38.171137370653234s
Retries 8: 77.06413258415549s == 1m 17.064132584155487s
Retries 9: 154.28351057847166s == 2m 34.28351057847166s
Retries 10: 306.916256081867s == 5m 6.9162560818670045s
---- Execution: 3 ----
Retries 1: 0.4471059140693189s == 0m 0.4471059140693189s
Retries 2: 1.0662461484531782s == 0m 1.0662461484531782s
Retries 3: 2.382296072057813s == 0m 2.382296072057813s
Retries 4: 4.736341404068338s == 0m 4.736341404068338s
Retries 5: 9.503966410550943s == 0m 9.503966410550943s
Retries 6: 19.3627308579865s == 0m 19.3627308579865s
Retries 7: 38.26510317848355s == 0m 38.26510317848355s
Retries 8: 76.87617016912421s == 1m 16.876170169124208s
Retries 9: 154.29058926529942s == 2m 34.29058926529942s
Retries 10: 306.0287511171526s == 5m 6.028751117152581s
---- Execution: 4 ----
Retries 1: 0.4755658624874498s == 0m 0.4755658624874498s
Retries 2: 1.4092526649834793s == 0m 1.4092526649834793s
Retries 3: 2.535036109629495s == 0m 2.535036109629495s
Retries 4: 4.989201659041932s == 0m 4.989201659041932s
Retries 5: 9.953319397492805s == 0m 9.953319397492805s
Retries 6: 18.81325306642286s == 0m 18.81325306642286s
Retries 7: 38.19264967633735s == 0m 38.19264967633735s
Retries 8: 76.23892541177054s == 1m 16.238925411770538s
Retries 9: 153.79126999144194s == 2m 33.791269991441936s
Retries 10: 307.49926799424094s == 5m 7.499267994240938s
---- Execution: 5 ----
Retries 1: 0.6095471614007916s == 0m 0.6095471614007916s
Retries 2: 1.0568851832398147s == 0m 1.0568851832398147s
Retries 3: 2.3237468552036296s == 0m 2.3237468552036296s
Retries 4: 4.682081485913115s == 0m 4.682081485913115s
Retries 5: 9.193068254055532s == 0m 9.193068254055532s
Retries 6: 19.31007315566773s == 0m 19.31007315566773s
Retries 7: 37.98337119203318s == 0m 37.98337119203318s
Retries 8: 76.26888668898539s == 1m 16.26888668898539s
Retries 9: 152.87777227393545s == 2m 32.87777227393545s
Retries 10: 307.56591645560525s == 5m 7.565916455605247s

What do you think?

If you agree, I can do a PR.

@pchaganti

👍

@trevorrowe
Member

I'd be in favor of adding jitter to the retry backoff. Feel free to send a PR.

@trevorrowe trevorrowe added feature-request A feature should be added or improved. Version 2 labels Nov 16, 2016
@keymon
Author

keymon commented Nov 27, 2016

And apart from adding jitter, what do you think about changing the shape of the function to a sigmoid one?

I have been playing with functions here: https://graphsketch.com/?eqn1_color=1&eqn1_eqn=(2%5Ex)*0.3&eqn2_color=3&eqn2_eqn=(2%5Ex)*0.3%20-%20(1.3%5Ex)*0.3*0.5&eqn3_color=2&eqn3_eqn=(2%5Ex)*0.3%20%2B%20(1.3%5Ex)*0.3*0.5&eqn4_color=3&eqn4_eqn=2%5Eatan(x-3)*6-2.5&eqn5_color=5&eqn5_eqn=(2%5Eatan(x-3)*6-2.5)%20*%200.95&eqn6_color=2&eqn6_eqn=(2%5Eatan(x-3)*6-2.5)%20*%201.05&x_min=-1&x_max=8&y_min=-1&y_max=25&x_tick=1&y_tick=1&x_label_freq=5&y_label_freq=5&do_grid=0&do_grid=1&bold_labeled_lines=0&bold_labeled_lines=1&line_width=2&image_w=850&image_h=525

[image: plot of the exponential and sigmoid backoff curves]

I propose a sigmoid because the exponential backoff becomes useless once we get to 8 or more retries: the backoff time grows out of all proportion. With a sigmoid function, the backoff time converges to a maximum.
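
As a rough illustration of that shape (using the 2^atan(x-3)*6-2.5 curve from the linked plot, so the exact constants are just the ones I was playing with), it could look like this:

# Sketch of a sigmoid-shaped backoff: the atan term flattens out,
# so the delay converges to roughly 15s instead of growing without bound.
def sigmoid_backoff(r)
  base = 2.0 ** Math.atan(r - 3) * 6 - 2.5
  base * (1.0 + (Random.rand - 0.5) * 0.1) # +/- 5% jitter band, as plotted
end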

Unfortunately this is a major change, especially for common retry counts, and it might break tools that use this library.

@awood45
Member

awood45 commented Nov 29, 2016

I think you're right that a fundamental change to the default algorithm might be an issue (though it might not); we might be able to expose it in other ways. You can, of course, already bring your own backoff block, as sketched below.
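
A minimal sketch of that bring-your-own-backoff route, assuming the :retry_backoff client option from aws-sdk-core v2 (whose default is the 2 ** retries * 0.3 sleep discussed above):

require 'aws-sdk'

# Override the default backoff with the jittered function proposed in this issue.
# The callable receives the request context, which exposes the retry count.
jittered_backoff = lambda do |context|
  r = context.retries
  Kernel.sleep(2.0 ** r * 0.3 + 1.25 ** r * 0.3 * (Random.rand - 0.5))
end

ec2 = Aws::EC2::Client.new(retry_backoff: jittered_backoff)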

awood45 added a commit that referenced this issue Nov 29, 2016
Related to GitHub issues:

* #1299
* #1323
* #1335
* #1336
@awood45
Member

awood45 commented Nov 29, 2016

Added to feature request backlog. Will definitely take as a PR as well.

@awood45 awood45 closed this as completed Nov 29, 2016
lwoggardner added a commit to lwoggardner/aws-sdk-ruby that referenced this issue Jan 26, 2017
lwoggardner added a commit to lwoggardner/aws-sdk-ruby that referenced this issue Feb 1, 2017
jitter functions, max delay cap, configurable base delay