Please consider retrying ExpiredToken #471
Comments
Tested to fix the problem with an overloaded S3 server.
We added retries for
We observed below that retrying on
[2024-12-06 01:17:26,325][root][INFO] Invoking command: <redacted> # start of job
[ERROR] 2024-12-06 01:23:41.999 S3MetaRequest [140592887621184] id=0x7fddcc63a400 Request failed from error 2058 (The connection has closed or is closing.). (request=0x7fde41cb1c80, response status=0). Try to setup a retry.
[ERROR] 2024-12-06 01:23:50.358 S3Endpoint [140592898111040] id=0x7fde59b5c150: Could not acquire connection due to error code 2058 (The connection has closed or is closing.)
[ERROR] 2024-12-06 01:30:08.934 S3MetaRequest [140592659035712] id=0x7fddca63a800 Request failed from error 2058 (The connection has closed or is closing.). (request=0x7fddc8610500, response status=0). Try to setup a retry.
[WARN] 2024-12-06 01:36:26.418 S3Client [140592843585088] id=0x7fde59b8a380 Client upload part timeout rate is larger than expected, current timeout is 1127, bump it up. Request original timeout is: 941
#
# Many similar such errors - retries, closed connections, and several 500 "Internal Server Error" cases during the next hour ...
#
[ERROR] 2024-12-06 02:17:31.980 S3MetaRequest [140592707257920] id=0x7f602f0a0000 Request failed from error 14370 (Token expired (needs a refresh).). (request=0x7fde41c76c80, response status=400). Try to setup a retry.
[ERROR] 2024-12-06 02:17:32.064 S3MetaRequest [140592614995520] id=0x7f602f0a0000 Request failed from error 14370 (Token expired (needs a refresh).). (request=0x7fde41c70f00, response status=400). Try to setup a retry.
Job then completed successfully.
Similar to the case of `RequestTimeout`: when progress stalls because the server keeps closing connections for a prolonged period, or because similar conditions prevent STS tokens from being refreshed, retrying `ExpiredToken` gives a stalling job a chance to continue rather than fail. Documented success case in awslabs#471. Resolves awslabs#471.
Do you know what's causing the stalling? I'm not quite sure I follow what causes the creds to become expired.
This is a condition that several SDKs suffer from. I had put a list of these into #464 before realizing that #457 solved a related issue: retry on `RequestTimeout`. I had initially blamed the outdated state of our support libraries, but the problem persisted after upgrading. The situation is so bad that the job runs for 1 hour and it seems a proper retry does not get through. My hunch is that a retry task is never properly prepared, since the connection it tries to use is closed.
Signing is NOT the last thing. Currently, the stages of each HTTP request are:
So if it's a slow machine, the requests can sit in the Queued stage for a long time, already signed, waiting for a slot in the HTTP connection pool. Garret's right, this probably IS a problem in any SDK where signing happens before waiting for a slot in a finite-sized HTTP connection pool. If we delayed signing until the last possible moment that it needs to be submitted to the HTTP connection, we could reduce these timeout issues. But it's a non-trivial change, and might mess with our throughput on fast machines...
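To make the effect of signing before queueing concrete, here is a toy model in Python. It is not aws-c-s3 code: `simulate`, the one-connection pool, the 3600-second token lifetime, and the per-request service time are all assumptions chosen for illustration. It counts how many requests would reach the server with an expired token when signing happens at enqueue time versus when a connection slot is acquired.

```python
TOKEN_LIFETIME = 3600  # seconds; assumed STS token lifetime


def simulate(num_requests, service_time, sign_late):
    """Count requests rejected as expired in a single-connection toy model.

    All requests are enqueued at t=0 and served one at a time. A token
    issued at time t is treated as valid while now < t + TOKEN_LIFETIME.
    """
    expired = 0
    now = 0
    token_issued_at = 0  # token obtained when the job starts
    for _ in range(num_requests):
        if sign_late:
            # Sign when the connection slot is acquired, refreshing the
            # token first if it has already expired.
            if now >= token_issued_at + TOKEN_LIFETIME:
                token_issued_at = now
            signed_with = token_issued_at
        else:
            # Signed at enqueue time (t=0) with the initial token.
            signed_with = 0
        if now >= signed_with + TOKEN_LIFETIME:
            expired += 1  # the server would answer with ExpiredToken
        now += service_time
    return expired


if __name__ == "__main__":
    print("sign early:", simulate(10, 600, sign_late=False))
    print("sign late: ", simulate(10, 600, sign_late=True))
```

With ten requests taking 600 seconds each, the early-signing variant sends the last four requests with a stale signature, while the late-signing variant sends none. The model ignores throughput, which is the trade-off raised above.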
Thank you for the PR. It has been released in https://github.com/awslabs/aws-c-s3/releases/tag/v0.7.7. We have also backlogged the task to move signing closer to the actual HTTP request to mitigate this issue further.
Describe the feature
Add `ExpiredToken` to the list of retryable errors.
Use Case
When processing is slow, it is possible to overrun the time budget of an STS token (usually 1 hour).
The STS tokens are refreshed only when a new request is made, or a request is retried.
This is by virtue of calling the delegated STS token provider, which does a `RefreshIfExpired`.
Under healthy conditions, a request that failed with an expired but refreshable token is likely to succeed on retry.
In the worst case there are a number of useless retries, which seems worth the expense.
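The use case can be sketched as a small Python model. Everything here is hypothetical and only mirrors the behavior described above: `RefreshingProvider` stands in for a provider that refreshes lazily, `fake_server` stands in for S3, and times are plain integers in seconds. The point it illustrates is that a retry re-signs the request, which asks the provider for credentials and so triggers the refresh.

```python
class RefreshingProvider:
    """Hypothetical stand-in for a credentials provider that refreshes lazily."""

    def __init__(self, lifetime):
        self.lifetime = lifetime
        self.issued_at = None
        self.refresh_count = 0

    def get_credentials(self, now):
        # Refresh only if there is no token yet or the current one expired.
        if self.issued_at is None or now >= self.issued_at + self.lifetime:
            self.issued_at = now
            self.refresh_count += 1
        return self.issued_at  # the issue time stands in for the token


RETRYABLE = frozenset({"RequestTimeout", "ExpiredToken"})


def fake_server(token_issued_at, now, lifetime):
    """Accept the request only if its token is still valid at arrival."""
    return "OK" if now < token_issued_at + lifetime else "ExpiredToken"


def send_with_retry(provider, signed_at, sent_at, retryable=RETRYABLE, max_retries=3):
    """Sign at `signed_at`, deliver at `sent_at`, re-sign on retryable errors."""
    token = provider.get_credentials(signed_at)
    now = sent_at
    attempts = 0
    while True:
        attempts += 1
        result = fake_server(token, now, provider.lifetime)
        if result == "OK" or result not in retryable or attempts > max_retries:
            return result, attempts
        # Retry path: preparing the request again re-signs it, which asks
        # the provider for credentials and triggers the refresh.
        now += 1
        token = provider.get_credentials(now)


if __name__ == "__main__":
    print(send_with_retry(RefreshingProvider(3600), signed_at=0, sent_at=4000))
    print(send_with_retry(RefreshingProvider(3600), 0, 4000,
                          retryable=frozenset({"RequestTimeout"})))
```

In this model a request signed at t=0 and delivered at t=4000 fails once and then succeeds on the first retry when `ExpiredToken` is retryable. When it is not, the same request fails immediately without a second attempt.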
Proposed Solution
We have seen `ExpiredToken` under conditions similar to `RequestTimeout` (#457) and so propose adding `ExpiredToken` to the list of retryable exceptions.
Why this would work
The STS tokens are refreshed each time `.sign_request = aws_s3_meta_request_sign_request_default` is called, by virtue of `aws_sign_request_aws`, which calls `aws_credentials_provider_get_credentials` on the (STS) credential provider delegated from the C++ SDK to the C libraries.
`s_s3_client_retry_ready` is called (but only if `AWS_S3_CONNECTION_FINISH_CODE_RETRY` is set).
`aws_s3_meta_request_prepare_request`, which calls `s_s3_meta_request_schedule_prepare_request_default`,
`s_s3_meta_request_prepare_request_task`, which calls `s_s3_meta_request_on_request_prepared`,
`s_s3_meta_request_sign_request`, which calls `aws_s3_meta_request_sign_request_default`,
`aws_sign_request_aws`,
`GetCredentials()` is called.
Acknowledgements