fix: duplicate layers of API request retries and integer overflow in backoff #20
What was the problem/requirement? (What/Why)
Problem 1: Integer Overflow
The `botocore.retries.standard.ExponentialBackoff` class used by the worker agent has a bug. Note that its backoff computation grows exponentially (by design) as the retry attempt number increases. For the worker agent, we'd like to retry indefinitely (but back off to a very low request frequency). If we use this retry algorithm indefinitely, we eventually encounter an overflow error.
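A minimal sketch of the failure mode, assuming a delay computation of the form `min(max_backoff, random() * base ** (attempt - 1))` (this mirrors the shape of botocore's standard backoff, not its exact code). The exponential term is an unbounded Python integer, and converting it to a float for the jitter multiplication eventually overflows:

```python
import random

BASE = 2
MAX_BACKOFF = 20  # seconds

def botocore_style_delay(attempt_number: int) -> float:
    # Jitter is applied to the exponential term *before* clamping to the
    # maximum backoff.
    return min(MAX_BACKOFF, BASE ** (attempt_number - 1) * random.random())

print(botocore_style_delay(5))     # fine for small attempt numbers
print(botocore_style_delay(2000))  # OverflowError: int too large to convert to float
```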
Problem 2: Jitter Reduction
In addition to the above integer overflow, the `botocore.retries.standard.ExponentialBackoff` class will reduce jitter as iterations increase.
Writing $i$ for the attempt number, $b$ for the exponential base, $m$ for the maximum backoff, and $r_i \sim \mathrm{Uniform}(0, 1)$ for the per-attempt jitter, the boto algorithm can be expressed as:

$$\mathrm{delay}_i = \min\left(m,\ r_i \cdot b^{\,i-1}\right)$$

The issue here is that the jitter is applied before clamping to the maximum backoff. If we take the limit as the number of iterations increases towards infinity, we get $m$: the delay converges to the constant $m$ and the jitter vanishes. I suspected this meant that, in the event of a prolonged service error, all workers would retry at a constant rate with jitter approaching zero.
Problem 3: Duplicate Retry Layers
The worker agent uses `boto3` for making AWS API requests. By default, `boto3` retries certain error responses (docs). In addition to this, the worker agent itself implements its own layer of retries in the `aws/deadline` directory. These two layers of retries compound the frequency of API requests. In the event of a service issue, the goal of exponential backoff algorithms is to reduce load on the service in order to expedite recovery and successful requests. Having layered retries, each with their own backoff, competes with this goal, as sketched below.
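An illustrative sketch (not the agent's actual code) of how two independent retry layers multiply the number of requests sent during an outage; the attempt counts and the `flaky_api_call` stand-in are made up, and backoff sleeps are omitted for brevity:

```python
calls = 0

def flaky_api_call() -> None:
    """Stand-in for a boto3 API call that always fails during a service issue."""
    global calls
    calls += 1
    raise ConnectionError("service unavailable")

def with_boto_retries(max_attempts: int = 3) -> None:
    # Inner layer: botocore's built-in retries (backoff omitted for brevity).
    for _ in range(max_attempts):
        try:
            return flaky_api_call()
        except ConnectionError:
            pass
    raise ConnectionError("exhausted boto-level attempts")

def agent_operation(agent_attempts: int = 5) -> None:
    # Outer layer: the worker agent's own retry loop (backoff omitted for brevity).
    for _ in range(agent_attempts):
        try:
            return with_boto_retries()
        except ConnectionError:
            pass
    raise ConnectionError("exhausted agent-level attempts")

try:
    agent_operation()
except ConnectionError:
    pass

print(calls)  # 15 requests (5 agent attempts x 3 boto attempts) for one logical operation
```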
What was the solution? (How)
Solution 1: Switch to uniform jitter after reaching maximum backoff
We use botocore's backoff algorithm until the attempt number reaches `2 * log(max_backoff, 2)` (the point at which the backoff has reached its maximum). After this point, we switch to a uniform jitter between 80-100% of the maximum backoff.
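A sketch of what this combined algorithm could look like (the function name and the `base` and `max_backoff` values are illustrative, not the agent's actual code):

```python
import math
import random

def worker_backoff(attempt_number: int, base: float = 2.0, max_backoff: float = 64.0) -> float:
    # Attempt number at which the exponential curve has saturated at max_backoff.
    switch_attempt = 2 * math.log(max_backoff, 2)
    if attempt_number <= switch_attempt:
        # Early attempts: exponential growth with full jitter, clamped to the maximum.
        return min(max_backoff, random.random() * base ** (attempt_number - 1))
    # Later attempts: uniform jitter between 80% and 100% of the maximum backoff,
    # so the exponential term (and its overflow) is never computed.
    return random.uniform(0.8 * max_backoff, max_backoff)

for attempt in (1, 5, 13, 1_000, 1_000_000):
    print(attempt, round(worker_backoff(attempt), 2))
```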
Solution 2: Simulate Backoff
In order to confirm my hypothesis about jitter reduction, I tested this with a simulation analysis.
The analysis is in `docs/research/retry_backoff_jitter.ipynb`, and `requirements-research.txt` can be used for setting up a Python virtual environment with the Python dependencies needed for the analysis.

The conclusion reached by the analysis is to use the algorithm specified in solution 1. While boto's algorithm reduces jitter as the attempt number increases, this is somewhat desirable: full jitter creates more load on the service since some retries become quicker.
While we can't use boto's algorithm for arbitrarily large retries because of the overflow error described in problem 1, we can take a compromise by fixing jitter so that requests with high attempt numbers are jittered between 80-100% of the maximum backoff.
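A condensed version of the kind of comparison the notebook makes (parameters are illustrative, and this is not the notebook's actual code): at a high attempt number, boto-style delays collapse onto the maximum backoff with almost no spread, while the proposed 80-100% jitter keeps the fleet's retries spread over a fifth of the maximum backoff.

```python
import random
import statistics

def boto_style_delay(attempt: int, base: int = 2, max_backoff: float = 64.0) -> float:
    # Jitter applied before clamping, as in problem 2.
    return min(max_backoff, random.random() * base ** (attempt - 1))

def fixed_jitter_delay(max_backoff: float = 64.0) -> float:
    # Proposed high-attempt regime: uniform jitter in [0.8 * m, m].
    return random.uniform(0.8 * max_backoff, max_backoff)

attempt = 30        # well past the point where the exponential term saturates
workers = 10_000    # simulated fleet size

boto_delays = [boto_style_delay(attempt) for _ in range(workers)]
fixed_delays = [fixed_jitter_delay() for _ in range(workers)]

# The boto delays are essentially all equal to the maximum backoff (jitter ~ 0),
# while the fixed-jitter delays stay spread across 20% of the maximum backoff.
print("boto stdev: ", statistics.stdev(boto_delays))
print("fixed stdev:", statistics.stdev(fixed_delays))
```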
Solution 3: Turn off boto Retries
We configure all `deadline` boto clients to make a single request per boto3 API call. The worker agent retains control of the backoff algorithm used.
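One way to express this with botocore's client configuration; treat it as a sketch rather than the agent's exact code. In standard retry mode, `max_attempts` counts the initial request, so a value of `1` means no retries:

```python
import boto3
from botocore.config import Config

# A single attempt per API call: botocore performs no retries of its own,
# leaving all backoff and retry decisions to the worker agent.
no_retry_config = Config(retries={"max_attempts": 1, "mode": "standard"})

# Region is illustrative.
deadline_client = boto3.client("deadline", config=no_retry_config, region_name="us-west-2")
```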
What is the impact of this change?
The worker agent backs off when encountering failed API requests to the service and applies jitter using a validated algorithm.
How was this change tested?
Exponential Backoff And Jitter
Was this change documented?
Yes, the code comments and the Jupyter notebook serve as documentation.
Is this a breaking change?
No