Fix #887: Enable automatic retry in a handy way #1084
Conversation
retry_response: Optional[RetryHttpResponse] = None
response_body = ""

if self.logger.level <= logging.DEBUG:
Moved this debug logging so that it prints every time this client performs a retry.
@@ -49,6 +55,7 @@ def __init__(
     user_agent_prefix: Optional[str] = None,
     user_agent_suffix: Optional[str] = None,
     logger: Optional[logging.Logger] = None,
+    retry_handlers: List[RetryHandler] = async_default_handlers,
The default one consists of only the (Async)ConnectionErrorRetryHandler instance with its defaults.
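For reference, a minimal sketch of what that default roughly amounts to; the variable names and module paths here are assumptions based on the http_retry package layout and may differ slightly from this diff:

```python
# Minimal sketch: the default retry_handlers list only covers low-level connection
# failures, with one retry each. Module paths are assumptions, not the final API.
from slack_sdk.http_retry.builtin_handlers import ConnectionErrorRetryHandler
from slack_sdk.http_retry.builtin_async_handlers import AsyncConnectionErrorRetryHandler

default_handlers = [ConnectionErrorRetryHandler()]              # sync clients
async_default_handlers = [AsyncConnectionErrorRetryHandler()]   # aiohttp/asyncio clients
```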
default_interval_calculator = BackoffRetryIntervalCalculator()


class RetryHandler:
This class is the main interface introduced in this pull request.
from slack_sdk.http_retry.handler import RetryHandler, default_interval_calculator


class MyRetryHandler(RetryHandler):
Custom retry handler for testing
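Roughly, a test-only handler along these lines (a sketch; the call_count attribute and the _can_retry hook name are assumptions based on the released RetryHandler base class):

```python
# Sketch of a test-only retry handler: it counts how often it is consulted and
# requests a retry whenever the request failed with a low-level error.
from slack_sdk.http_retry.handler import RetryHandler


class MyRetryHandler(RetryHandler):
    def __init__(self, max_retry_count: int = 1):
        super().__init__(max_retry_count=max_retry_count)
        self.call_count = 0  # lets tests assert how many retries were attempted

    def _can_retry(self, *, state, request, response=None, error=None) -> bool:
        self.call_count += 1
        # Retry only when the request failed with an exception (e.g. connection reset).
        return error is not None
```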
@@ -111,7 +111,7 @@ def _handle(self):
         return
     if pattern == "rate_limited":
         self.send_response(429)
-        self.send_header("Retry-After", 30)
+        self.send_header("Retry-After", 1)
Changed the value for faster test execution (we really don't want to wait for 30 seconds).
Codecov Report
@@           Coverage Diff            @@
##             main    #1084      +/-   ##
==========================================
+ Coverage   85.62%   86.09%   +0.46%
==========================================
  Files          99      110      +11
  Lines        9324     9847     +523
==========================================
+ Hits         7984     8478     +494
- Misses       1340     1369      +29

Continue to review the full report at Codecov.
Applied the following changes:
I plan on reviewing today - it is a big PR so I did not end up having time yesterday.
@filmaj Thanks! No rush at all. I know this includes a lot of changes. I am thinking this pull request could add more unit tests covering rate-limited error patterns for safety.
Added more comments for reviewers
@@ -190,7 +199,7 @@ def api_call(
     return self._perform_http_request(
         http_verb=http_verb,
         url=url,
-        body_params=body_params,
+        body=body_params,
As _perform_http_request is an internal method, we can safely rename this argument.
counter_for_safety = 0
while counter_for_safety < 100:
We may want to remove this counter for simplicity. A plain while True here should be safe enough, as retry_state.next_attempt_requested is usually False.
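For illustration, the loop shape being discussed is roughly the following; send, handlers, and retry_state are hypothetical stand-ins, not the actual client internals:

```python
# Hedged sketch of a "while True" retry loop driven by retry_state.next_attempt_requested.
def perform_with_retries(send, handlers, retry_state):
    while True:  # replaces the counter_for_safety guard
        response = send()
        retry_state.next_attempt_requested = False
        for handler in handlers:
            if handler.can_retry(state=retry_state, response=response):
                handler.prepare_for_next_retry(state=retry_state, response=response)
                retry_state.next_attempt_requested = True
                break
        if not retry_state.next_attempt_requested:
            # No handler asked for another attempt, so return the last response.
            return response
```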
error_types: List[Exception] = [
    ServerConnectionError,
    ServerDisconnectedError,
    # ClientOSError: [Errno 104] Connection reset by peer
    ClientOSError,
],
These are aiohttp-specific exceptions.
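For reference, all three are exported at the aiohttp package level:

```python
# aiohttp's client-side connection error types referenced above.
from aiohttp import ClientOSError, ServerConnectionError, ServerDisconnectedError
```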
    return False


class ServerErrorRetryHandler(RetryHandler):
I've added this one as a reference implementation but it's unused. We may want to remove this for now.
I would suggest removing it. I am not sure it is a good practice to blindly retry if a request yields an HTTP 500 response; I think it could lead to undesirable network saturation in certain cases like a legitimate outage on Slack's side.
Yes, this is fair enough 👍
> I would suggest removing it. I am not sure it is a good practice to blindly retry if a request yields an HTTP 500 response; I think it could lead to undesirable network saturation in certain cases like a legitimate outage on Slack's side.
Sorry to necro, but I think it's worth reconsidering this decision. While it may be undesirable from Slack's perspective to back off exponentially when their API is returning 5xxs, I think this is what SDK consumers will want. And I think it makes sense to do what the consumer wants here, because if we don't, the consumer is just going to implement their own exponential backoff logic that includes 5xxs. This is my plan anyway.
Thanks for your work here!
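For anyone landing here with the same need, an application-side handler along these lines is a small amount of code. This is a sketch only, written against the shipped http_retry module as an assumption (the _can_retry hook, response.status_code, and the calculator class may differ from this PR's draft):

```python
# Hedged sketch of a consumer-side handler that retries 5xx responses with
# exponential backoff; adjust to the hooks your slack_sdk version actually exposes.
from slack_sdk.http_retry.handler import RetryHandler
from slack_sdk.http_retry.builtin_interval_calculators import BackoffRetryIntervalCalculator


class ServerErrorRetryHandler(RetryHandler):
    def _can_retry(self, *, state, request, response=None, error=None) -> bool:
        # Retry only when we actually got a response and it is a server error.
        return response is not None and response.status_code >= 500


handler = ServerErrorRetryHandler(
    max_retry_count=2,
    interval_calculator=BackoffRetryIntervalCalculator(),
)
```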
Fair!
This feels like a good time to mention the Circuit Breaker pattern.
    duration += random.random()
else:
    duration = (
        int(response.headers.get(retry_after_header_name)[0]) + random.random()
The random.random() adds random jitter, but it might not be necessary here. This is not backoff.
from .interval_calculator import RetryIntervalCalculator


class FixedValueRetryIntervalCalculator(RetryIntervalCalculator):
Just as a reference; unused with the default settings. I should add some tests for this.
If it's unused, can probably not worry about the tests. Unless you want to keep the coverage scores high 😆
Haha, yeah, I always like better coverage!
Wow, lots of work and you did a great job! Thanks for involving me in the review.
I left a few comments, mostly for my own learning and education.
duration = (
    int(response.headers.get(retry_after_header_name)[0]) + random.random()
)
time.sleep(duration)
Theoretically, using the synchronous client, if the API responds with a relatively large value in the Retry-After header (e.g. the docs for this header show an example value of 30), would this freeze the entire process?
> would this freeze the entire process?

When it comes to the same thread, yes. As for the behavior of the whole app, it depends on how the app is implemented. In the case of Bolt for Python, all the code except ack() is executed in a background thread, so it does not result in a 3-second timeout.

By default, we don't enable retries for rate limited errors. Developers should turn them on with a good understanding of the potential long pause.
from slack_sdk.http_retry.handler import RetryHandler, default_interval_calculator


class AsyncConnectionErrorRetryHandler(RetryHandler):
Since this async implementation relies on the same base class that is shared with the sync implementation, and the base RetryHandler class' prepare_for_next_request uses Python's built-in sleep method, could this lead to a situation where we block the process even when using an async handler?
I am not very familiar with aiohttp, but it seems to be based on the asyncio library, which has its own async-friendly sleep implementation (or, at least, this aiohttp document page implies that such an async sleep exists; search for asyncio on that page for the relevant section).
I am posing this question from a place of ignorance and a desire to learn, so it is likely I am completely off. But asking dumb questions is helpful for me to learn 🤪
@filmaj Ah, this is a great point! Yes, we should use asyncio.sleep here instead, and I was aware of it, but somehow I forgot to override the method. We can have a base RetryHandler class for the async clients that uses asyncio's sleep method; all the methods in it will be async methods. I will update this part shortly.
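A minimal sketch of that direction; the class and method names here are illustrative, and the final async base class may look different:

```python
# Hedged sketch: an async-aware handler base whose waiting step is a coroutine
# using asyncio.sleep, so it never blocks the event loop the way time.sleep would.
import asyncio

from slack_sdk.http_retry.builtin_interval_calculators import BackoffRetryIntervalCalculator


class AsyncRetryHandler:
    def __init__(self, max_retry_count: int = 1, interval_calculator=None):
        self.max_retry_count = max_retry_count
        self.interval_calculator = interval_calculator or BackoffRetryIntervalCalculator()

    async def can_retry_async(self, *, state, request, response=None, error=None) -> bool:
        raise NotImplementedError()

    async def prepare_for_next_attempt_async(self, *, state, request, response=None, error=None) -> None:
        # Async-friendly wait before the next attempt.
        await asyncio.sleep(self.interval_calculator.calculate_sleep_duration(state.current_attempt))
```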
header = self.headers["Authorization"]
if header is not None and "xoxp-" in header:
    pattern = str(header).split("xoxp-", 1)[1]
    if "remote_disconnected" in pattern:
Very nice pattern, I like this a lot!
@filmaj Thanks for your review! I will update some parts before merging this.
Fixed all the issues in the latest revision. Let me merge this PR now. I will release an RC version to get feedback from the community.
Summary
This pull request fixes #887 by adding the new RetryHandler feature to all the API clients (except the legacy ones under the slack package).

With the default settings, the API clients do a single retry, and only for connectivity issues such as the "Connection reset by peer" error. For the retry intervals, the built-in retry handlers follow the exponential backoff and jitter pattern.
To customize the behavior, you can pass your own retry_handlers argument to the API client constructors, as in the sketch below. If an API client with retry handlers encounters an error, it runs each handler's def can_retry(args) -> bool method. If any of these calls returns True, the client runs the handler's def prepare_for_next_retry(args) -> None method to wait for the right timing. Then the same API request is performed again until the client hits the handler's max_retry_count.
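A minimal sketch of that customization; the handler class and the retry_handlers keyword follow this summary, but exact names and module paths may differ in the released version:

```python
# Hedged example: opt in to retrying HTTP 429 (rate limited) responses on the
# sync WebClient. RateLimitErrorRetryHandler is assumed to ship with the
# slack_sdk.http_retry module introduced by this PR.
from slack_sdk import WebClient
from slack_sdk.http_retry.builtin_handlers import RateLimitErrorRetryHandler

rate_limit_handler = RateLimitErrorRetryHandler(max_retry_count=1)

client = WebClient(
    token="xoxb-your-token",
    retry_handlers=[rate_limit_handler],  # handlers consulted when a request fails
)
client.chat_postMessage(channel="#random", text="Hello!")
```

Note that rate-limited retries are intentionally not part of the defaults (see the Retry-After discussion above), so developers opt in explicitly.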
In this pull request, I've updated the following API clients:
- slack_sdk.web.WebClient
- slack_sdk.webhook.WebhookClient
- slack_sdk.audit_logs.AuditLogsClient
- slack_sdk.scim.SCIMClient
- slack_sdk.web.async_client.AsyncWebClient (aiohttp/asyncio compatible)
- slack_sdk.webhook.async_client.AsyncWebhookClient (aiohttp/asyncio compatible)
- slack_sdk.audit_logs.async_client.AsyncAuditLogsClient (aiohttp/asyncio compatible)
- slack_sdk.scim.async_client.AsyncSCIMClient (aiohttp/asyncio compatible)

You can reuse retry handlers across the above API clients, as shown in the sketch below:
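For instance, one handler instance can be shared by several clients (a sketch; the token and webhook URL are placeholders, and the constructor keyword is assumed as in the example above):

```python
# Hedged sketch: reuse a single connection-error handler across a WebClient and
# a WebhookClient instance.
from slack_sdk import WebClient
from slack_sdk.webhook import WebhookClient
from slack_sdk.http_retry.builtin_handlers import ConnectionErrorRetryHandler

connection_error_handler = ConnectionErrorRetryHandler(max_retry_count=3)

web_client = WebClient(token="xoxb-your-token", retry_handlers=[connection_error_handler])
webhook_client = WebhookClient(
    url="https://hooks.slack.com/services/T000/B000/XXXX",
    retry_handlers=[connection_error_handler],
)
```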
TODOs
Category (place an x in each of the [ ])
- /docs-src (Documents, have you run ./docs.sh?)
- /docs-src-v2 (Documents, have you run ./docs-v2.sh?)
- /tutorial (PythOnBoardingBot tutorial)
- tests/integration_tests (Automated tests for this library)

Requirements (place an x in each [ ])
- python3 -m venv .venv && source .venv/bin/activate && ./scripts/run_validation.sh after making the changes.