
fix: duplicate layers of API request retries and integer overflow in backoff #20

Merged — 1 commit merged from usiskin/fix_boto_dupe_retries_integer_overflow into mainline on Sep 7, 2023

Conversation

@jusiskin (Contributor) commented on Sep 7, 2023

What was the problem/requirement? (What/Why)

Problem 1: Integer Overflow

The botocore.retries.standard.ExponentialBackoff class used by the worker agent has a bug. Note that the computed delay grows exponentially (by design) as the retry attempt number increases.

For the worker agent, we'd like to retry indefinitely (but back off to a very low request frequency). If we use this retry algorithm indefinitely, we encounter an overflow error such as:

2023-08-09 06:34:07,647 INFO [bealine_worker_agent.aws.bealine] UpdateWorkerSchedule throttled. Retrying in 30 seconds...
2023-08-09 06:34:37,850 ERROR [bealine_worker_agent.scheduler] Exception in WorkerScheduler
Traceback (most recent call last):
  File "/home/agentuser/.venv/lib/python3.9/site-packages/bealine_worker_agent/aws/bealine/__init__.py", line 688, in update_worker_schedule
    response = bealine_client.update_worker_schedule(**request)
  File "/home/agentuser/.venv/lib/python3.9/site-packages/bealine_worker_agent/boto_mock.py", line 126, in update_worker_schedule
    raise _ClientError(
botocore.exceptions.ClientError: An error occurred (ThrottlingException) when calling the UpdateWorkerSchedule operation: Unknown

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/agentuser/.venv/lib/python3.9/site-packages/bealine_worker_agent/scheduler/scheduler.py", line 244, in run
    interval = self._sync(interruptable=True)
  File "/home/agentuser/.venv/lib/python3.9/site-packages/bealine_worker_agent/scheduler/scheduler.py", line 379, in _sync
    response = update_worker_schedule(**request)
  File "/home/agentuser/.venv/lib/python3.9/site-packages/bealine_worker_agent/aws/bealine/__init__.py", line 692, in update_worker_schedule
    delay = backoff.delay_amount(RetryContext(retry))
  File "/home/agentuser/.venv/lib/python3.9/site-packages/botocore/retries/standard.py", line 275, in delay_amount
    self._random() * (self._base ** (context.attempt_number - 1)),
OverflowError: int too large to convert to float
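
For illustration, a minimal sketch of the failure mode (the function name and the 30-second maximum are illustrative, not the worker agent's actual code); the delay computation mirrors the expression in the last traceback frame:

```python
import random

def boto_style_delay(attempt_number: int, max_backoff: float = 30.0) -> float:
    # Mirrors botocore's standard ExponentialBackoff: jitter is applied to
    # 2 ** (attempt_number - 1) *before* clamping to the maximum backoff.
    return min(max_backoff, random.random() * (2 ** (attempt_number - 1)))

print(boto_style_delay(5))     # fine: at most 2**4 = 16 seconds
print(boto_style_delay(2000))  # OverflowError: int too large to convert to float
```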

Problem 2: Jitter Reduction

In addition to the above integer overflow, the botocore.retries.standard.ExponentialBackoff class will reduce jitter as iterations increase.

  • Let $i$ denote the (0-based) retry number.
  • Let $m$ denote the maximum backoff.
  • Let $\textrm{random}()$ be a function that returns a random floating-point number in the range $[0, 1]$.

Then the boto algorithm can be expressed as:

$$\textrm{delay}(i, m) = \min\big(m,\ \textrm{random}() \cdot 2^i\big)$$

The issue here is that the jitter is applied before clamping to the maximum back-off.

$$\lim_{i \to \infty}\textrm{delay}(i, m) = m$$

If we take the limit as the retry number increases toward infinity, the delay converges to $m$: once $2^i \gg m$, the value $\textrm{random}() \cdot 2^i$ almost always exceeds $m$ and is clamped, so the jitter effectively disappears.

I suspected this meant that, in the event of a prolonged service issue, all workers would end up retrying at a constant rate with the jitter approaching zero.
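
A quick way to see this jitter collapse (a hypothetical simulation, with a 30-second maximum backoff chosen to match the log message above): once $2^i$ dwarfs $m$, nearly every sample is clamped to $m$ and the spread of delays shrinks toward zero.

```python
import random
import statistics

def boto_style_delay(i: int, m: float = 30.0) -> float:
    # Boto-style backoff: jitter is applied before clamping to the maximum m.
    return min(m, random.random() * 2**i)

# Spread of 10,000 simulated delays per retry number: the standard deviation
# shrinks toward zero as the retry number grows.
for i in (4, 6, 8, 12, 20):
    samples = [boto_style_delay(i) for _ in range(10_000)]
    print(f"retry {i:>2}: mean={statistics.mean(samples):5.2f}s  "
          f"stdev={statistics.pstdev(samples):5.2f}s")
```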

Problem 3: Duplicate Retry Layers

The worker agent uses boto3 for making AWS API requests. By default, boto3 uses retries (docs) for certain error responses.

In addition, the worker agent implements its own layer of retries in the aws/deadline directory. These two layers of retries compound and affect the frequency of API requests. During a service issue, the goal of an exponential backoff algorithm is to reduce load on the service so that it can recover and requests can succeed again. Layered retries, each with its own backoff, work against this goal.

What was the solution? (How)

Solution 1: Switch to uniform jitter after reaching maximum backoff

We use botocore's backoff algorithm until the attempt number reaches $2 \log_2(\textrm{max\_backoff})$, at which point the delay has effectively reached the maximum backoff. After this, we switch to a uniform jitter between 80% and 100% of the maximum backoff.

$$\textrm{delay}(i, m) = \begin{cases} \min\big(m,\ \textrm{random}() \cdot 2^i\big) & \text{if}\ i \le 2 \log_2 m \\ \textrm{random}() \cdot \dfrac{m}{5} + \dfrac{4m}{5} & \text{otherwise} \end{cases}$$
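
A sketch of that piecewise schedule in Python (names are illustrative and this is not necessarily the implementation that landed in the PR):

```python
import math
import random

def delay(i: int, max_backoff: float) -> float:
    """Backoff delay in seconds for 0-based retry number i, capped at max_backoff."""
    if i <= 2 * math.log2(max_backoff):
        # Boto-style exponential backoff: jitter applied before the clamp.
        return min(max_backoff, random.random() * 2**i)
    # Past the threshold: uniform jitter between 80% and 100% of the maximum,
    # which avoids the integer overflow and keeps retries de-synchronized.
    return random.uniform(0.8 * max_backoff, max_backoff)
```

With a 30-second maximum backoff, the switch happens after roughly 10 retries ($2 \log_2 30 \approx 9.8$), after which every delay falls between 24 and 30 seconds.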

Solution 2: Simulate Backoff

In order to confirm my hypothesis about jitter reduction, I tested this with a simulation analysis.

  • Added this analysis as a Jupyter notebook to the worker agent codebase under docs/research/retry_backoff_jitter.ipynb
  • This analysis should render in GitHub (you need to use the "view file" link from the PR). I've included instructions within the notebook on how to set this up.
  • Added a hatch environment and requirements-research.txt for setting up a Python virtual environment with the dependencies needed for the analysis.

The analysis concludes that we should use the algorithm specified in Solution 1. While boto's algorithm reduces jitter as the attempt number increases, some reduction in jitter is actually desirable: with full jitter, some retries happen much sooner, which creates more load on the service.

[Image: simulation results from the notebook comparing the backoff algorithms]

While we can't use boto's algorithm for arbitrarily large retry counts because of the overflow error described in Problem 1, we can compromise by fixing the jitter so that requests with high attempt numbers are delayed by between 80% and 100% of the maximum backoff.

Solution 3: Turn off boto Retries

We configure all deadline boto clients to make exactly one request per boto3 API call, so the worker agent retains full control of the backoff algorithm used.
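
For reference, disabling boto's automatic retries can be done per client through botocore's Config; a minimal sketch (the service/client name is a placeholder and the exact configuration used by the worker agent may differ):

```python
import boto3
from botocore.config import Config

# total_max_attempts counts the initial request, so 1 means boto makes a
# single HTTP attempt per API call and never retries on its own.
no_retry_config = Config(retries={"total_max_attempts": 1})

# Placeholder client construction; the worker agent builds its own client.
client = boto3.client("deadline", config=no_retry_config)
```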

What is the impact of this change?

The worker agent backs off when it encounters failed API requests to the service and applies jitter using a validated algorithm.

How was this change tested?

  • Added unit tests for the backoff algorithm
  • Provided a Jupyter notebook that simulates the backoff using a large number of workers and compares it against the benchmark algorithm (boto's). Also compared a few other backoff algorithms from the AWS Architecture Blog post Exponential Backoff And Jitter.

Was this change documented?

Yes, code comments and the Jupyter notebook serve as documentation.

Is this a breaking change?

No

@jusiskin jusiskin added the bug Something isn't working label Sep 7, 2023
@jusiskin jusiskin requested a review from a team as a code owner September 7, 2023 18:51
@ddneilson ddneilson (Contributor) left a comment

Great analysis!

@@ -0,0 +1,439 @@
{

FYI to anyone interested, you can see the rendered version of this by hitting the "..." on the top right of the file diff, and selecting "View file".

return half_temp + self._random_uniform(0, half_temp)


class FullJitterExponentialBackoff(BotocoreExponentialBackoff): # pragma: no cover

This file should only have the one you selected, right? Keeping the unused ones here as dead code isn't a good idea.

@jusiskin jusiskin force-pushed the usiskin/fix_boto_dupe_retries_integer_overflow branch from 5b32854 to 31d48e9 on September 7, 2023 19:17
@jusiskin jusiskin force-pushed the usiskin/fix_boto_dupe_retries_integer_overflow branch from 31d48e9 to 206ee89 on September 7, 2023 19:31
@mwiebe mwiebe merged commit 4905915 into mainline Sep 7, 2023
@mwiebe mwiebe deleted the usiskin/fix_boto_dupe_retries_integer_overflow branch September 7, 2023 19:41
Labels: bug (Something isn't working)