
Add some randomness to the exponential backoff of max_retries #1336

Closed
keymon opened this issue Nov 7, 2016 · 5 comments
Labels
feature-request A feature should be added or improved.

Comments

@keymon

keymon commented Nov 7, 2016

For rate-limit-related errors, the retry logic configured by max_retries does an exponential backoff following this function: 2.0 ** retries * 0.3

That is great, but there is a problem: the backoff time is exactly the same for every execution. This is problematic when you are doing lots of operations from multiple threads, for instance creating an environment with multiple VMs using 32 threads. All the threads execute the same API calls at the same time, and if several threads fail at the same time because of the AWS rate limit, they will all retry after the same number of seconds.
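
To make the collision concrete, this is the schedule that every thread computes from that formula (retries 1 to 10):

# The current backoff schedule in seconds -- identical for every thread and process
(1..10).map { |r| 2.0 ** r * 0.3 }
# => [0.6, 1.2, 2.4, 4.8, 9.6, 19.2, 38.4, 76.8, 153.6, 307.2]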

What I want to propose is to add a little randomness to that backoff. We could use this function:

def backoff(r)
  b = 2.0 ** r * 0.3                         # current exponential backoff
  v = 1.25 ** r * 0.3 * (Random.rand - 0.5)  # random offset whose width grows with r
  b + v
end

With this code we can test it a little bit:

(1..5).each { |run| # repeat the experiment
  puts "---- Execution: #{run} ----"
  (1..10).each { |r|
    b = backoff(r)
    puts "Retries #{r}: #{b}s == #{(b / 60).floor}m #{b % 60}s"
  }
}
---- Execution: 1 ----
Retries 1: 0.4723107748709313s == 0m 0.4723107748709313s
Retries 2: 1.4083360795375834s == 0m 1.4083360795375834s
Retries 3: 2.271092661113177s == 0m 2.271092661113177s
Retries 4: 4.916736747923815s == 0m 4.916736747923815s
Retries 5: 10.000144551307137s == 0m 10.000144551307137s
Retries 6: 18.839097183818104s == 0m 18.839097183818104s
Retries 7: 38.480158713168855s == 0m 38.480158713168855s
Retries 8: 77.45500289555731s == 1m 17.455002895557314s
Retries 9: 152.7204691297255s == 2m 32.7204691297255s
Retries 10: 307.6726209158677s == 5m 7.672620915867697s
---- Execution: 2 ----
Retries 1: 0.6050370631541719s == 0m 0.6050370631541719s
Retries 2: 1.345491076957416s == 0m 1.345491076957416s
Retries 3: 2.334873909044141s == 0m 2.334873909044141s
Retries 4: 5.13400991278694s == 0m 5.13400991278694s
Retries 5: 9.521875510187117s == 0m 9.521875510187117s
Retries 6: 19.662625638261183s == 0m 19.662625638261183s
Retries 7: 38.171137370653234s == 0m 38.171137370653234s
Retries 8: 77.06413258415549s == 1m 17.064132584155487s
Retries 9: 154.28351057847166s == 2m 34.28351057847166s
Retries 10: 306.916256081867s == 5m 6.9162560818670045s
---- Execution: 3 ----
Retries 1: 0.4471059140693189s == 0m 0.4471059140693189s
Retries 2: 1.0662461484531782s == 0m 1.0662461484531782s
Retries 3: 2.382296072057813s == 0m 2.382296072057813s
Retries 4: 4.736341404068338s == 0m 4.736341404068338s
Retries 5: 9.503966410550943s == 0m 9.503966410550943s
Retries 6: 19.3627308579865s == 0m 19.3627308579865s
Retries 7: 38.26510317848355s == 0m 38.26510317848355s
Retries 8: 76.87617016912421s == 1m 16.876170169124208s
Retries 9: 154.29058926529942s == 2m 34.29058926529942s
Retries 10: 306.0287511171526s == 5m 6.028751117152581s
---- Execution: 4 ----
Retries 1: 0.4755658624874498s == 0m 0.4755658624874498s
Retries 2: 1.4092526649834793s == 0m 1.4092526649834793s
Retries 3: 2.535036109629495s == 0m 2.535036109629495s
Retries 4: 4.989201659041932s == 0m 4.989201659041932s
Retries 5: 9.953319397492805s == 0m 9.953319397492805s
Retries 6: 18.81325306642286s == 0m 18.81325306642286s
Retries 7: 38.19264967633735s == 0m 38.19264967633735s
Retries 8: 76.23892541177054s == 1m 16.238925411770538s
Retries 9: 153.79126999144194s == 2m 33.791269991441936s
Retries 10: 307.49926799424094s == 5m 7.499267994240938s
---- Execution: 5 ----
Retries 1: 0.6095471614007916s == 0m 0.6095471614007916s
Retries 2: 1.0568851832398147s == 0m 1.0568851832398147s
Retries 3: 2.3237468552036296s == 0m 2.3237468552036296s
Retries 4: 4.682081485913115s == 0m 4.682081485913115s
Retries 5: 9.193068254055532s == 0m 9.193068254055532s
Retries 6: 19.31007315566773s == 0m 19.31007315566773s
Retries 7: 37.98337119203318s == 0m 37.98337119203318s
Retries 8: 76.26888668898539s == 1m 16.26888668898539s
Retries 9: 152.87777227393545s == 2m 32.87777227393545s
Retries 10: 307.56591645560525s == 5m 7.565916455605247s

What do you think?

If you agree, I can do a PR.

@pchaganti

👍

@trevorrowe
Member

I'd be in favor of adding jitter to the retry backoff. Feel free to send a PR.

@trevorrowe trevorrowe added feature-request A feature should be added or improved. Version 2 labels Nov 16, 2016
@keymon
Author

keymon commented Nov 27, 2016

And apart from adding jitter, what do you think about changing the shape of the function to a sigmoid one?

I have been playing with functions here: https://graphsketch.com/?eqn1_color=1&eqn1_eqn=(2%5Ex)*0.3&eqn2_color=3&eqn2_eqn=(2%5Ex)*0.3%20-%20(1.3%5Ex)*0.3*0.5&eqn3_color=2&eqn3_eqn=(2%5Ex)*0.3%20%2B%20(1.3%5Ex)*0.3*0.5&eqn4_color=3&eqn4_eqn=2%5Eatan(x-3)*6-2.5&eqn5_color=5&eqn5_eqn=(2%5Eatan(x-3)*6-2.5)%20*%200.95&eqn6_color=2&eqn6_eqn=(2%5Eatan(x-3)*6-2.5)%20*%201.05&x_min=-1&x_max=8&y_min=-1&y_max=25&x_tick=1&y_tick=1&x_label_freq=5&y_label_freq=5&do_grid=0&do_grid=1&bold_labeled_lines=0&bold_labeled_lines=1&line_width=2&image_w=850&image_h=525

[image: plot of the exponential and sigmoid backoff curves]

I propose a sigmoid because the exponential backoff becomes useless once we get to 8 or more retries: the backoff time grows out of all proportion. With a sigmoid function, the backoff time converges to a maximum.
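
As a rough illustration of that shape (using the 2^atan(x-3)*6-2.5 curve from the linked plot, so the exact constants are just the ones I was playing with), it could look like this:

# Sketch of a sigmoid-shaped backoff: the atan term flattens out,
# so the delay converges to roughly 15s instead of growing without bound.
def sigmoid_backoff(r)
  base = 2.0 ** Math.atan(r - 3) * 6 - 2.5
  base * (1.0 + (Random.rand - 0.5) * 0.1) # +/- 5% jitter band, as plotted
end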

Unfortunately this is a major change, especially for common retry counts, and it might break tools that use this library.

@awood45
Member

awood45 commented Nov 29, 2016

I think you're right that a fundamental change to the default algorithm might be an issue (though it might not); we might be able to expose it in other ways. You can, of course, already bring your own backoff block, as sketched below.
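
A minimal sketch of that bring-your-own-backoff route, assuming the :retry_backoff client option from aws-sdk-core v2 (whose default is the 2 ** retries * 0.3 sleep discussed above):

require 'aws-sdk'

# Override the default backoff with the jittered function proposed in this issue.
# The callable receives the request context, which exposes the retry count.
jittered_backoff = lambda do |context|
  r = context.retries
  Kernel.sleep(2.0 ** r * 0.3 + 1.25 ** r * 0.3 * (Random.rand - 0.5))
end

ec2 = Aws::EC2::Client.new(retry_backoff: jittered_backoff)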

awood45 added a commit that referenced this issue Nov 29, 2016
Related to GitHub issues:

* #1299
* #1323
* #1335
* #1336
@awood45
Member

awood45 commented Nov 29, 2016

Added to feature request backlog. Will definitely take as a PR as well.

@awood45 awood45 closed this as completed Nov 29, 2016
lwoggardner added a commit to lwoggardner/aws-sdk-ruby that referenced this issue Jan 26, 2017
lwoggardner added a commit to lwoggardner/aws-sdk-ruby that referenced this issue Feb 1, 2017
jitter functions, max delay cap, configurable base delay