transfer upload batch of files always fails #2992
Hi @pdumoulin, This does seem related to the issue you linked, and also #29 and others. In all of the other issues with the same symptoms, the problem went away after a few retries. Are you able to configure the retryer to retry the upload 5-10 times and see if it makes the problem go away? I have already engaged the S3 team internally to try and get this diagnosed and fixed, but so far haven't heard back. Let me know if the retry strategy works for you.
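For reference, a minimal sketch of how the retry count could be raised on the v3 client, assuming the standard `retries` client option (region and other values here are placeholders):

```php
<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;

// Raise max_attempts well above the default so transient failures
// get several more chances before the upload is abandoned.
$client = new S3Client([
    'region'  => 'us-east-1',   // placeholder
    'version' => 'latest',
    'retries' => [
        'mode'         => 'standard',
        'max_attempts' => 10,
    ],
]);
```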
Hi @RanVaknin, I increased the retry count as suggested. Because the average runtime did not increase when the retry count was increased, and the aws-cli in the exact same environment is not encountering these issues, I think it's possible that the aws-php-sdk retry mechanism is not being activated at all in these cases. Please let me know if there is anything else I can do to help gather more information about the issue, and if there are any updates from the S3 team internally!
Hi @pdumoulin, Apologies for the late response. I got your internal ticket; thank you for that.
You can verify whether the retries are taking effect or not by enabling the request logs:

```php
$connection = S3Client::factory([
    'debug' => true,
]);
```

I will update your support person with next steps in order to engage S3. Thanks,
Hi @RanVaknin, I set the debug level in the client factory as you suggested and was able to gather a lot more log information. I ran the same test and see the retry header always reported as 0/0. Please let me know if there are any other data points I can extract from the logs to help.
Hi @pdumoulin, I'm not sure why the header is always 0/0. If I had to hypothesize, it's likely because of the challenge in updating this value when code is being run concurrently. Instead of looking at the header value, when inspecting the request logs do you see the same request being sent multiple times (i.e. the same UploadPart call repeated)? Can you share some of those logs with us? (Please redact the signature and principal ID from the logs.) Thanks!
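Another way to check whether the SDK layer itself is retrying (as opposed to cURL) is the client-level `stats` option, which attaches transfer statistics to each result. A sketch assuming that option; bucket, key, and file paths are placeholders, and the exact shape of the dumped metadata may vary by SDK version:

```php
<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$client = new S3Client([
    'region'  => 'us-east-1',          // placeholder
    'version' => 'latest',
    'stats'   => ['retries' => true],  // collect per-operation retry counts
]);

$result = $client->putObject([
    'Bucket'     => 'example-bucket',  // placeholder
    'Key'        => 'example.flac',
    'SourceFile' => '/tmp/example.flac',
]);

// Transfer statistics, including retry information, land in the result metadata.
var_dump($result['@metadata']['transferStats'] ?? null);
```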
Hi @RanVaknin, I've attached the logs to the internal AWS ticket that I filed. It does appear that one of the parts which ultimately failed was in fact retried 5 times, but my PHP client is configured to retry 3 times. I believe this may indicate that cURL is retrying but the php-sdk is not. Please let me know if you find out anything and if you would like any more information. Thank you!
Hi @pdumoulin, Thanks for providing these logs. From reviewing the error messages, I can see the part you are asking about:
The retry count shown here comes from the underlying HTTP client (Guzzle) retrying the connection, not the SDK. So even if you set the SDK to retry 3 times, the Guzzle HTTP client's default might be 5, which would explain this discrepancy.

Based on other reports of that same issue, this difficulty establishing a connection is not specific to the PHP SDK (not sure about the CLI; perhaps the multipart upload was not done concurrently?) but is reported across many SDKs, and likely points to an issue with S3 itself. The workaround discussed was to configure a higher retry count; after 10-20 retry attempts the HTTP client was able to establish that connection. One way we can test this workaround is to create a Guzzle middleware to increase the retry count. Check out this documentation.

Another thing I can see from your user-agent is the retry mode that is set. @stobrien89, any thoughts? Thanks,
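A rough sketch of the kind of Guzzle retry middleware being suggested, assuming the SDK client is built with a custom `http_handler`; the retry count, delay, and exception check are illustrative values, not ones taken from this thread:

```php
<?php
require 'vendor/autoload.php';

use Aws\Handler\GuzzleV6\GuzzleHandler;
use Aws\S3\S3Client;
use GuzzleHttp\Client;
use GuzzleHttp\Exception\TransferException;
use GuzzleHttp\HandlerStack;
use GuzzleHttp\Middleware;

$stack = HandlerStack::create();

// Retry any request that fails with a Guzzle transport exception
// (which covers connection-level errors) up to 10 times, with a
// simple linear backoff.
$stack->push(Middleware::retry(
    function ($retries, $request, $response = null, $reason = null) {
        return $retries < 10 && $reason instanceof TransferException;
    },
    function ($retries) {
        return 1000 * $retries; // delay in milliseconds
    }
));

$guzzle = new Client(['handler' => $stack]);

$s3 = new S3Client([
    'region'       => 'us-east-1',   // placeholder
    'version'      => 'latest',
    'http_handler' => new GuzzleHandler($guzzle),
]);
```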
My understanding of the expected retry hierarchy is that the aws-php-sdk should retry whenever the underlying Guzzle client exhausts its retries. In this case, I would expect 1 retry from the aws-php-sdk for every 5 Guzzle retries, for a total of 15 Guzzle retries in my setup. Is that assumption wrong? I believe this still points to the aws-php-sdk (and other SDKs) retry mechanism not taking into account an issue with the S3 service itself.
I can experiment with this if I have some spare time, but as per my above comment, I would expect this to be the responsibility of the aws-php-sdk and not need to be implemented by a system using it.
I already ran the same tests with all retry modes, and the uploads failed in the same fashion.
Hi @pdumoulin, A few more clarifications after speaking to @stobrien89:
```php
$options = array(
    'debug' => true,
    'http' => [
        'curl' => [
            CURLOPT_VERBOSE => true,
        ],
    ],
);
```

This should expose some networking-level logs that might give us some more info. As far as the solution/workaround, my previous statement still stands: we need to raise the retry limit on the Guzzle client to try and force the server to re-establish a connection, so that suggestion remains the same. Thanks again,
Hi, I missed your latest comment so responding separately here.
I don't think this is the case. A connection error is a networking-level event that is handled by the underlying HTTP client and not by the application layer. The SDK is meant to retry any retryable error that is tied to an HTTP status code, like a 5xx error or a 429, or any other error that is defined as retryable. Admittedly, I'm basing my knowledge on the behavior of other SDKs I support. I'm more familiar with the Go and JS SDKs, which have some support for retrying networking-level errors. From the stack trace, it seems like there is an issue with establishing a connection rather than the typical I/O-bound retry strategy. I will review this again with the team to make sure I'm not misleading you here.
I agree with the sentiment here. Customers should not implement patches for the underlying http client. But there are two caveats here:
Thanks,
Hi @RanVaknin, Thank you for your thoughtful and detailed responses. I appreciate the time and attention to detail. Firstly, I ran and planned some more tests as per your suggestions; here is how they went...
Secondly, I also looked into how the aws-php-sdk handles networking-level errors and found that (if I'm reading the code correctly) it is set up to retry cURL errors only for a specific subset of error codes.
Hi @pdumoulin, Sorry for the issues and thanks for your patience thus far. I suspect this issue is coming from the S3 side; we've had reports of slowdowns and general connection issues since they've enabled TLS 1.3. I'd be curious to see your results after forcing TLS 1.2 in the S3 client by adding the following to your top-level configuration:

```php
'http' => [
    'curl' => [
        CURLOPT_SSLVERSION => CURL_SSLVERSION_MAX_TLSv1_2,
    ],
],
```
@stobrien89 - I gave your suggestion a try and received the same error.
I was able to reproduce the issue, and it appears to be a network bottleneck caused by a combination of the number of concurrent requests and the number of multipart workflows (for instance, a 50 MB file will require 12 separate requests). I don't think retrying cURL send errors will alleviate this. I'll need to dig deeper on the specific limiting factors, but in the meantime, you could try tweaking the transfer settings.
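The tail of that suggestion is cut off in this extract; presumably it refers to the `Aws\S3\Transfer` options that control request volume. A sketch with illustrative values (not the specific numbers from the original suggestion):

```php
<?php
require 'vendor/autoload.php';

use Aws\S3\S3Client;
use Aws\S3\Transfer;

$client = new S3Client([
    'region'  => 'us-east-1',   // placeholder
    'version' => 'latest',
]);

$transfer = new Transfer($client, '/path/to/source/dir', 's3://example-bucket/prefix', [
    'concurrency'   => 2,                // fewer requests in flight at once
    'mup_threshold' => 64 * 1024 * 1024, // only switch to multipart uploads above 64 MB
    'debug'         => true,
]);

$transfer->transfer();
```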
Firstly, I tried tweaking settings to reduce network requests, all the way down to...
During every test, I encountered the same error as before. Secondly, I am testing with 73 files, only 3 of which are larger than 50 MB. I tried adjusting those settings as well.
We ran a test on real data with...
All other settings were left at their defaults. The data that was giving us consistent problems in the past went through without an issue. It's a small data set, but encouraging. I don't think we can consider this completely resolved, but I hope changing these settings will at least stabilize the behaviour for our use case. Thank you for the help, and please let me know if there are any further changes on the client or server side regarding this issue.
@stobrien89 - Although our test case is stable, our production system is still experiencing these errors. We will probably try tweaking the settings some more, but I wanted to check in on a couple of things...
Hi @pdumoulin, Very sorry for the late reply. Glad you were able to find a workaround for the time being. I haven't had a chance to do a more thorough investigation, but will let you know as soon as I know more.
I'm not certain it wouldn't help, but since this seems like a resource/bandwidth issue I'm not optimistic. Although it seems like you've had success with the CLI/Python SDK, I thought it might be helpful to include this snippet from their transfer config documentation: https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#max-concurrent-requests. At the very least, I think we'll need to update our documentation with a similar warning.
It's worth noting that we experienced failures with the aws-php-sdk's default concurrency setting of "3" and success with the aws-cli's default concurrency setting of "10" during tests which uploaded a single directory. However, in our actual production workloads, there are a variable number of clients making requests, so setting the concurrency per client only goes so far. I keep asking about retries because I am assuming that any resource/bandwidth issue would be ephemeral, and a retry, especially with a long backoff, would alleviate the issue. For our use case, delayed success is much better than quick failure. If it's possible to create a release candidate or beta version of the aws-php-sdk with retries for this specific cURL error, we would be happy to test it out.
Hi @pdumoulin, I gave reproducing this issue a fair shake. I created the same number of .flac files with various file sizes corresponding to the info you've provided. However, after 10 executions in a row, I'm not able to see any cURL issues manifesting.

The reason I brought up retries in the first place was to see if it could point to a service-side issue manifesting across multiple SDKs which, based on other similar issues, went away after a retry. What @stobrien89 said about bandwidth makes sense. As far as retrying this specific cURL error blindly for all SDK customers, I'm not sure what the implications would be. That's why I suggested you implement your own middleware to begin with; that way you can modify the retry behavior for your use case to see if it solves your issue, so that we can approach S3 with some concrete examples in order for them to fix this, and unblock you for the time being. I would have written my own middleware and given you the actual code, but I'm not able to synthetically raise the cURL error in order to test a middleware against it; maybe you can. Can you refer to this comment for an example of how you'd implement a simple retry middleware hook?

Are there any other reproduction details you can provide so I can continue trying to reproduce this and raise this cURL error myself? Thanks again,
@RanVaknin @stobrien89 - I set up a slate of tests in different environments using Docker that can be used to re-create the issue; you can pull it down and run the tests yourself from pdumoulin/php-aws-bulk-upload-sample. Running the experiments in each case either fails 100% of the time (noted by ❌) or succeeds 100% of the time (noted by ✅). The setup "Debian Bookworm" most closely matches our production workloads. I am going to continue to investigate this on my end to try and figure out what might be causing the different behaviour between environments. Please give this process a try and let me know if you can reproduce the same results, and if you have any other ideas to test or metrics to record.
@RanVaknin @stobrien89 - I ran a few more tests, tweaking cURL options, and here's what I have observed...
This is what I think might be happening...
I'm still perplexed as to why this would occur in a Debian container and not an Alpine one. I diffed all the sysctl network settings between the two containers and don't see many differences or anything that would explain this. I think we are going to try that setting next.
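The specific setting mentioned above is not preserved in this extract. Purely as an illustration of how low-level cURL options can be passed to the client (the same `'http' => ['curl' => ...]` mechanism used earlier in the thread), with example options that are assumptions rather than the ones actually tried:

```php
<?php
// Illustrative only: these particular cURL options are assumptions,
// not the setting referenced in the surrounding comments.
$options = [
    'region'  => 'us-east-1',   // placeholder
    'version' => 'latest',
    'http'    => [
        'curl' => [
            CURLOPT_TCP_KEEPALIVE => 1,     // send TCP keep-alive probes
            CURLOPT_TCP_KEEPIDLE  => 30,    // seconds idle before the first probe
            CURLOPT_FORBID_REUSE  => true,  // force a fresh connection per request
        ],
    ],
];
```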
@RanVaknin @stobrien89 - We implemented the change described above. In my opinion, this points to the issue not being about bandwidth, and to retries being an acceptable mitigation. Please see my reproduction steps above for further investigation and to help with potential changes.
Describe the bug
Using the aws-php-sdk transfer function to upload a directory of files always fails at some point during the process.
Expected Behavior
I would expect these operations to be retried by the SDK, or for these sorts of errors to be very rare.
Current Behavior
Uploading a batch of 73 files, ranging in size from 58 MB down to 2 MB, consistently fails sometime in the middle of the batch with one of the following errors...
```
AWS HTTP error: cURL error 55: Connection died, tried 5 times before giving up
AWS HTTP error: cURL error 52: Empty reply from server
RequestTimeout (client): Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
```
Reproduction Steps
The following PHP script runs in a Debian 12 Docker container using PHP 8.2.
Possible Solution
No response
Additional Information/Context
FLAC audio files
SDK version used
3.321.0
Environment details (Version of PHP (php -v)? OS name and version, etc.)
PHP 8.2.22; OpenSSL 3.0.13; curl 7.88.1; Debian 12 (Docker container)