
fix: duplicate layers of API request retries and integer overflow in backoff #20

Merged — 1 commit merged from usiskin/fix_boto_dupe_retries_integer_overflow into mainline on Sep 7, 2023

Conversation

@jusiskin (Contributor) commented on Sep 7, 2023

What was the problem/requirement? (What/Why)

Problem 1: Integer Overflow

The botocore.retries.standard.ExponentialBackoff class used by the worker agent has a bug. Note that the computed delay grows exponentially (by design) as the retry attempt number increases.

For the worker agent, we'd like to retry indefinitely (but back off to a very low request frequency). If we use this retry algorithm indefinitely, we encounter an overflow error such as:

2023-08-09 06:34:07,647 INFO [bealine_worker_agent.aws.bealine] UpdateWorkerSchedule throttled. Retrying in 30 seconds...
2023-08-09 06:34:37,850 ERROR [bealine_worker_agent.scheduler] Exception in WorkerScheduler
Traceback (most recent call last):
  File "/home/agentuser/.venv/lib/python3.9/site-packages/bealine_worker_agent/aws/bealine/__init__.py", line 688, in update_worker_schedule
    response = bealine_client.update_worker_schedule(**request)
  File "/home/agentuser/.venv/lib/python3.9/site-packages/bealine_worker_agent/boto_mock.py", line 126, in update_worker_schedule
    raise _ClientError(
botocore.exceptions.ClientError: An error occurred (ThrottlingException) when calling the UpdateWorkerSchedule operation: Unknown

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/agentuser/.venv/lib/python3.9/site-packages/bealine_worker_agent/scheduler/scheduler.py", line 244, in run
    interval = self._sync(interruptable=True)
  File "/home/agentuser/.venv/lib/python3.9/site-packages/bealine_worker_agent/scheduler/scheduler.py", line 379, in _sync
    response = update_worker_schedule(**request)
  File "/home/agentuser/.venv/lib/python3.9/site-packages/bealine_worker_agent/aws/bealine/__init__.py", line 692, in update_worker_schedule
    delay = backoff.delay_amount(RetryContext(retry))
  File "/home/agentuser/.venv/lib/python3.9/site-packages/botocore/retries/standard.py", line 275, in delay_amount
    self._random() * (self._base ** (context.attempt_number - 1)),
OverflowError: int too large to convert to float
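
For illustration, a minimal sketch of the failure mode (the function name and the 30-second maximum are illustrative, not the worker agent's actual code); the delay computation mirrors the expression in the last traceback frame:

```python
import random

def boto_style_delay(attempt_number: int, max_backoff: float = 30.0) -> float:
    # Mirrors botocore's standard ExponentialBackoff: jitter is applied to
    # 2 ** (attempt_number - 1) *before* clamping to the maximum backoff.
    return min(max_backoff, random.random() * (2 ** (attempt_number - 1)))

print(boto_style_delay(5))     # fine: at most 2**4 = 16 seconds
print(boto_style_delay(2000))  # OverflowError: int too large to convert to float
```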

Problem 2: Jitter Reduction

In addition to the above integer overflow, the botocore.retries.standard.ExponentialBackoff class will reduce jitter as iterations increase.

  • Let $i$ denote the (0-based) retry number.
  • Let $m$ denote the maximum backoff.
  • Let $\textrm{random}()$ be a function that returns a random floating-point number in the range $[0, 1]$.

Then the boto algorithm can be expressed as:

$$\textrm{delay}(i, m) = \min\big(m,\ \textrm{random}() \cdot 2^i\big)$$

The issue here is that the jitter is applied before clamping to the maximum back-off.

$$\lim_{i \to \infty}\textrm{delay}(i, m) = m$$

If we take the limit as the retry number increases toward infinity, the delay converges to $m$: once $2^i \gg m$, the value $\textrm{random}() \cdot 2^i$ almost always exceeds $m$ and is clamped, so the jitter effectively disappears.

I suspected this meant that, in the event of a prolonged service issue, all workers would end up retrying at a constant rate with the jitter approaching zero.
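
A quick way to see this jitter collapse (a hypothetical simulation, with a 30-second maximum backoff chosen to match the log message above): once $2^i$ dwarfs $m$, nearly every sample is clamped to $m$ and the spread of delays shrinks toward zero.

```python
import random
import statistics

def boto_style_delay(i: int, m: float = 30.0) -> float:
    # Boto-style backoff: jitter is applied before clamping to the maximum m.
    return min(m, random.random() * 2**i)

# Spread of 10,000 simulated delays per retry number: the standard deviation
# shrinks toward zero as the retry number grows.
for i in (4, 6, 8, 12, 20):
    samples = [boto_style_delay(i) for _ in range(10_000)]
    print(f"retry {i:>2}: mean={statistics.mean(samples):5.2f}s  "
          f"stdev={statistics.pstdev(samples):5.2f}s")
```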

Problem 3: Duplicate Retry Layers

The worker agent uses boto3 for making AWS API requests. By default, boto3 uses retries (docs) for certain error responses.

In addition, the worker agent implements its own layer of retries in the aws/deadline directory. These two layers of retries compound and affect the frequency of API requests. During a service issue, the goal of an exponential backoff algorithm is to reduce load on the service so that it can recover and requests can succeed again. Layered retries, each with its own backoff, work against this goal.

What was the solution? (How)

Solution 1: Switch to uniform jitter after reaching maximum backoff

We use botocore's backoff algorithm until the attempt number reaches $2 \log_2(\textrm{max\_backoff})$, at which point the delay has effectively reached the maximum backoff. After this, we switch to a uniform jitter between 80% and 100% of the maximum backoff.

$$\textrm{delay}(i, m) = \begin{cases} \min\big(m,\ \textrm{random}() \cdot 2^i\big) & \text{if}\ i \le 2 \log_2 m \\ \textrm{random}() \cdot \dfrac{m}{5} + \dfrac{4m}{5} & \text{otherwise} \end{cases}$$
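
A sketch of that piecewise schedule in Python (names are illustrative and this is not necessarily the implementation that landed in the PR):

```python
import math
import random

def delay(i: int, max_backoff: float) -> float:
    """Backoff delay in seconds for 0-based retry number i, capped at max_backoff."""
    if i <= 2 * math.log2(max_backoff):
        # Boto-style exponential backoff: jitter applied before the clamp.
        return min(max_backoff, random.random() * 2**i)
    # Past the threshold: uniform jitter between 80% and 100% of the maximum,
    # which avoids the integer overflow and keeps retries de-synchronized.
    return random.uniform(0.8 * max_backoff, max_backoff)
```

With a 30-second maximum backoff, the switch happens after roughly 10 retries ($2 \log_2 30 \approx 9.8$), after which every delay falls between 24 and 30 seconds.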

Solution 2: Simulate Backoff

In order to confirm my hypothesis about jitter reduction, I tested this with a simulation analysis.

  • Added this analysis as a Jupyter notebook to the worker agent codebase under docs/research/retry_backoff_jitter.ipynb
  • This analysis should render in GitHub (you need to use the "view file" link from the PR). I've included instructions within the notebook on how to set this up.
  • Added a hatch environment and requirements-research.txt for setting up a Python virtual environment with the dependencies needed for the analysis.

The analysis concludes that we should use the algorithm specified in Solution 1. While boto's algorithm reduces jitter as the attempt number increases, some reduction in jitter is actually desirable: with full jitter, some retries happen much sooner, which creates more load on the service.

[Image: simulation results from the notebook comparing the backoff algorithms]

While we can't use boto's algorithm for arbitrarily large retry counts because of the overflow error described in Problem 1, we can compromise by fixing the jitter so that requests with high attempt numbers are delayed by between 80% and 100% of the maximum backoff.

Solution 3: Turn off boto Retries

We configure all deadline boto clients to make exactly one request per boto3 API call, so the worker agent retains full control of the backoff algorithm used.
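
For reference, disabling boto's automatic retries can be done per client through botocore's Config; a minimal sketch (the service/client name is a placeholder and the exact configuration used by the worker agent may differ):

```python
import boto3
from botocore.config import Config

# total_max_attempts counts the initial request, so 1 means boto makes a
# single HTTP attempt per API call and never retries on its own.
no_retry_config = Config(retries={"total_max_attempts": 1})

# Placeholder client construction; the worker agent builds its own client.
client = boto3.client("deadline", config=no_retry_config)
```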

What is the impact of this change?

The worker agent backs off when it encounters failed API requests to the service and applies jitter using a validated algorithm.

How was this change tested?

  • Added unit tests for the backoff algorithm
  • Provided a Jupyter notebook that simulates the backoff using a large number of workers and compares it against the benchmark algorithm (boto's). Also compared a few other backoff algorithms from the AWS Architecture Blog post Exponential Backoff And Jitter.

Was this change documented?

Yes, code comments and the Jupyter notebook serve as documentation.

Is this a breaking change?

No

@jusiskin jusiskin added the bug Something isn't working label Sep 7, 2023
@jusiskin jusiskin requested a review from a team as a code owner September 7, 2023 18:51
@ddneilson ddneilson (Contributor) left a comment

Great analysis!

@@ -0,0 +1,439 @@
{

FYI to anyone interested, you can see the rendered version of this by hitting the "..." on the top right of the file diff, and selecting "View file".

return half_temp + self._random_uniform(0, half_temp)


class FullJitterExponentialBackoff(BotocoreExponentialBackoff): # pragma: no cover

This file should only have the one you selected, right? Keeping the unused ones here as dead code isn't a good idea.

@jusiskin jusiskin force-pushed the usiskin/fix_boto_dupe_retries_integer_overflow branch from 5b32854 to 31d48e9 on September 7, 2023 19:17
@jusiskin jusiskin force-pushed the usiskin/fix_boto_dupe_retries_integer_overflow branch from 31d48e9 to 206ee89 on September 7, 2023 19:31
@mwiebe mwiebe merged commit 4905915 into mainline Sep 7, 2023
@mwiebe mwiebe deleted the usiskin/fix_boto_dupe_retries_integer_overflow branch September 7, 2023 19:41
Labels: bug (Something isn't working)