Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Please consider retrying ExpiredToken #471

Closed
1 of 2 tasks
grrtrr opened this issue Dec 3, 2024 · 5 comments · Fixed by #472
Closed
1 of 2 tasks

Please consider retrying ExpiredToken #471

grrtrr opened this issue Dec 3, 2024 · 5 comments · Fixed by #472
Labels
feature-request A feature should be added or improved. p2 This is a standard priority issue

Comments

@grrtrr
Copy link
Contributor

grrtrr commented Dec 3, 2024

Describe the feature

Add ExpiredToken to the list of retryable errors.

Use Case

When processing is slow, it is possible to overrun the time budget of an STS token (usually 1 hour).

The STS tokens are refreshed only when a new request is made, or a request is retried.
This is by virtue of calling the delegated STS token provider, which does a RefreshIfExpired.

Under healthy conditions, retrying a request with an expired refreshable token has a likelihood to succeed on retry.
In the worst case there are a number of useless retries, which seems worth the expense.

Proposed Solution

We have seen ExpiredToken under conditions similar to RequestTimeout (#457) and so propose adding ExpiredToken to the list of retryable exceptions.

Why this would work

The STS tokens are refreshed each time .sign_request = aws_s3_meta_request_sign_request_default is called, by virtue of aws_sign_request_aws, which calls aws_credentials_provider_get_credentials on the (STS) credential provider delegated from the C++ SDK to the C libraries.

  • On retry, s_s3_client_retry_ready is called (but only if AWS_S3_CONNECTION_FINISH_CODE_RETRY is set).
  • This calls aws_s3_meta_request_prepare_request, which calls s_s3_meta_request_schedule_prepare_request_default,
  • which schedules s_s3_meta_request_prepare_request_task, which calls s_s3_meta_request_on_request_prepared,
  • which calls s_s3_meta_request_sign_request, which calls aws_s3_meta_request_sign_request_default,
  • which calls aws_sign_request_aws,
  • which then refreshes the credentials (part of what the STS token provider does when GetCredentials() is called).

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change
@grrtrr grrtrr added feature-request A feature should be added or improved. needs-triage This issue or PR still needs to be triaged. labels Dec 3, 2024
@grrtrr
Copy link
Contributor Author

grrtrr commented Dec 6, 2024

Tested to fix the problem with an overloaded S3 server

We added retries for ExpiredToken as suggested above and tested with a bucket where the endpoints regularly close connections, return 500, causing stalling, which produces the problem.
The job started at 1:17 and stalled for over 1 hour. Hence at 2:17, the STS token expired.

We observed below that retrying on ExpiredToken allowed the job to recover.

[2024-12-06 01:17:26,325][root][INFO] Invoking command:  <redacted> # start of job
[ERROR] 2024-12-06 01:23:41.999 S3MetaRequest [140592887621184] id=0x7fddcc63a400 Request failed from error 2058 (The connection has closed or is closing.). (request=0x7fde41cb1c80, response status=0). Try to setup a retry.
[ERROR] 2024-12-06 01:23:50.358 S3Endpoint [140592898111040] id=0x7fde59b5c150: Could not acquire connection due to error code 2058 (The connection has closed or is closing.)
[ERROR] 2024-12-06 01:30:08.934 S3MetaRequest [140592659035712] id=0x7fddca63a800 Request failed from error 2058 (The connection has closed or is closing.). (request=0x7fddc8610500, response status=0). Try to setup a retry.
[WARN] 2024-12-06 01:36:26.418 S3Client [140592843585088] id=0x7fde59b8a380 Client upload part timeout rate is larger than expected, current timeout is 1127, bump it up. Request original timeout is: 941
#
# Many similar such errors - retries, closed connections, and several 500 "Internal Server Error" cases during the next hour ...
#
[ERROR] 2024-12-06 02:17:31.980 S3MetaRequest [140592707257920] id=0x7f602f0a0000 Request failed from error 14370 (Token expired (needs a refresh).). (request=0x7fde41c76c80, response status=400). Try to setup a retry.
[ERROR] 2024-12-06 02:17:32.064 S3MetaRequest [140592614995520] id=0x7f602f0a0000 Request failed from error 14370 (Token expired (needs a refresh).). (request=0x7fde41c70f00, response status=400). Try to setup a retry.

Job then completed successfully.

grrtrr added a commit to grrtrr/aws-c-s3 that referenced this issue Dec 6, 2024
Similar to the case of `RequestTimeout`, when progress stalls due to a prolonged time of
the server closing connections and similar conditions that prevent refreshing STS tokens,
retrying `ExpiredToken` will give a stalling job a chance to continue, rather than failing.

Documented success case in awslabs#471.

Resolves awslabs#471.
@DmitriyMusatkin
Copy link
Contributor

Do you know whats causing the stalling? Im not quite sure I follow what causes the creds to become expired.
aws-c-s3 should signing the request last thing before it puts the request on the wire. every time request is signed, it first checks how long the creds are valid for and if its less than 5 min, it will block signing until it refreshes creds.
We did recently (as of 4 months ago) bump that expiration bound check from 10s to 5min, since 10s was too small. Do you have the version number of crt on which you are seeing the problem? I think overall its fine to retry here, but want to make sure retry is not masking some bigger problem.

@grrtrr
Copy link
Contributor Author

grrtrr commented Dec 7, 2024

Do you know whats causing the stalling?

This is a condition that several SDKs suffer from. I had put a list of these into #464 before realizing that #457 solved a related issue - retry on RequestTimeout.
RequestTimeout happens frequently under such conditions, due to the S3 stalling.
The stalling has been a long-term problem in 2 of our buckets. Sometimes tasks that finish in under 20 minutes drag along over 3 hours.

I had initially blamed the outdated state of our support libraries, but the problem persisted after upgrading aws-crt-cpp to v0.29.3, using all the related dependencies that this version uses (aws-c-s3 at v0.7.0).

The situation is so bad that the job runs for 1 hour, and it seems a proper retry does not get through, my hunch is that a retry task is never properly prepared since the connection it tries to use is closed.

@graebm
Copy link
Contributor

graebm commented Dec 9, 2024

...aws-c-s3 should signing the request last thing before it puts the request on the wire...

Signing is NOT the last thing. Currently, the stages of each HTTP request are:

  • Preparing (signing happens at the end of this stage)
  • Queued (waiting for slot in the HTTP connection pool)
  • Networking (Actually using HTTP connection)
  • Streaming-waiting (Done, but out of order so can't be delivered to user yet)
  • Streaming-response (Done, in order, being delivered to user)

So if it's a slow machine, the requests can sit in the Queued stage for a long time, already signed, waiting for a slot in the HTTP connection pool.

Garret's right, this probably IS a problem in any SDK where signing happens before waiting for a slot in a finite-sized HTTP connection pool.

If we delayed signing until the last possible moment that it needs to be submitted to the HTTP connection, we could reduce these timeout issues. But it's a non-trivial change, and might mess with our throughput on fast machines...

@jmklix jmklix added p2 This is a standard priority issue and removed needs-triage This issue or PR still needs to be triaged. labels Dec 11, 2024
@waahm7
Copy link
Contributor

waahm7 commented Dec 12, 2024

Thank you for the PR. It has been released in https://github.com/awslabs/aws-c-s3/releases/tag/v0.7.7. We have also backlogged the task to move signing closer to the actual HTTP request to mitigate this issue further.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature-request A feature should be added or improved. p2 This is a standard priority issue
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants