RequestTimeTooSkewed when uploading large files? #1065

Closed
evleeann opened this issue Dec 18, 2014 · 13 comments
Labels
feature-request A feature should be added or improved.

Comments

@evleeann

I've been using the awscli to push files from our local backup linux server to the S3 storage cloud, and for the most part it works. A few times a week though, my backup will fail, with

"A client error (RequestTimeTooSkewed) occurred when calling the UploadPart operation: The difference between the request time and the current time is too large."

The first time it happened, I read up on the error, and set up ntp to run nightly to resync my local server time to the time server.

Today it happened while I was watching - so I immediately ran a check on the machine to see if the local time was off -

me@bk1 ~$ curl http://s3.amazonaws.com -v
* About to connect() to s3.amazonaws.com port 80 (#0)
*   Trying 54.231.32.32... connected
* Connected to s3.amazonaws.com (54.231.32.32) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.19.7 (i386-redhat-linux-gnu) libcurl/7.19.7 NSS/3.14.0.0 zlib/1.2.3 libidn/1.18 libssh2/1.4.2
> Host: s3.amazonaws.com
> Accept: */*
< HTTP/1.1 307 Temporary Redirect
< x-amz-id-2: mNLP6aO0f332BxDWUuOhd5zokYd2ZuPrl22tvP2juZ4ln152DEW64pj5CQqqhx1y
< x-amz-request-id: 4388A0673C619CF8
< Date: Thu, 18 Dec 2014 21:56:15 GMT
< Location: http://aws.amazon.com/s3/
< Content-Length: 0
< Server: AmazonS3
<
* Connection #0 to host s3.amazonaws.com left intact
* Closing connection #0
me@bk1 ~$ date -u
Thu Dec 18 21:56:07 UTC 2014

So I am within a few seconds of the official Amazon time - so clock skew doesn't really seem to be the actual problem - but I can't find any information on what else might be causing it.

Any ideas on what might be causing this error, when it's not actually a 15 minute time difference?
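
Side note for anyone reproducing this check: a quick way to compare the local clock directly against the Date header S3 returns, using the same tools as above, is:

    date -u; curl -sI http://s3.amazonaws.com | grep -i '^date:'

The two timestamps should be within a few seconds of each other if the clock is in sync.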

me@bk1 ~$ aws --version
aws-cli/1.6.2 Python/2.6.6 Linux/2.6.32-431.el6.i686

I'm connecting to S3 in Oregon if that makes any difference!
I'm using the command

/usr/local/bin/aws s3 cp /backups/$i s3://mybucketname
and looping through each file to be uploaded.
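
A minimal sketch of the kind of loop described, assuming the backup files sit directly under /backups (the actual script wasn't posted):

    for f in /backups/*; do
        /usr/local/bin/aws s3 cp "$f" s3://mybucketname
    done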

The file in question is 7.6 GB, and the CLI was breaking it up into parts. It got to part 56 before it hit the error:

Completed 56 of 524 part(s) with -1 file(s) remaining

@jamesls (Member) commented Dec 18, 2014

Would you mind trying with the latest version of the AWS CLI? We've made several bug fixes related to S3 retries and signing that I believe fix this issue.
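
For a pip-based install, upgrading and confirming the new version looks roughly like this:

    pip install --upgrade awscli
    aws --version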

@evleeann (Author) commented

I tried with the latest version and it seemed to get a bit farther, but then I got a different error:

[me@bk1 scripts]$ sh push_prod02_full.sh
upload failed: ../../backups/mybigfile to s3://prod02-extraview/mybigfile
HTTPSConnectionPool(host='mybucketname.s3.amazonaws.com', port=443): Max retries exceeded with url: /mybigfile?partNumber=312&uploadId=m3_RJNmpdY.oUNALPdFgB5uV1abSFATuDuLbUNm0EbMjzt.kJ7Wq4cbFIrG7NSCz5Dh2R3tuAPGzcWXC6MgL10DWsOtVRl.HkaAQ2OttHJdW4k5C4i_gmvAoPPxxxVSi (Caused by <class 'socket.error'>: [Errno 104] Connection reset by peer)

Looking at #823, this seems similar now.
I verified my config is set to "us-west-2" for the region

@jamesls (Member) commented Dec 19, 2014

Interesting. Those error cases are retried, so if you're seeing that error it means that we exceeded all our retry attempts and it still did not succeed.

I'm going to attempt to repro this issue. Is there anything you can share about your environment that may help? You're uploading a large (say 10GB) file to an S3 bucket in us-west-2. Is this on an EC2 instance also in us-west-2? If so, what instance type is it?

@evleeann (Author) commented

I am uploading from a CentOS machine in my colocated server environment (not in AWS) directly to S3 in us-west-2. The server is in California.
I have also been using the AWS CLI for the past 48 hours to do an initial sync of 100,000+ smaller files from the same server to a different bucket, also in us-west-2, without any issues - so there haven't been any network problems between my server and us-west-2 severe enough to derail that process; it is still running.
It just seems to be the large monolithic files that are having a problem, smaller files seem okay.
Please let me know if there's any other information I can provide!

@jamesls (Member) commented Dec 19, 2014

I've tried this several times and I am not seeing this issue. I've uploaded a 10 GB file to a us-west-2 bucket, both from within us-west-2 and from other regions, all without issue. What I'm guessing is happening is that in specific scenarios the number of retries is not sufficient. We are retrying connection errors, but we only retry up to 5 times. If you run this with --debug, can you confirm that you're seeing the request being retried up to 5 times before failing? If that's the case, then I think we'll need to expose some configuration options that can help with this scenario:

  • Possible access to retry configuration.
  • Possible access to concurrency levels. We've seen reports that lowering concurrency settings can sometimes help with reducing the number of connection errors overall.

I'll update this when we've figured out what our best options are.
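
For reference, one way to capture that output for a single large upload (using the poster's file and bucket names as placeholders; the CLI writes --debug logging to stderr):

    /usr/local/bin/aws s3 cp /backups/mybigfile s3://mybucketname --debug 2> debug.log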

@jamesls added the investigating label Dec 19, 2014
@evleeann (Author) commented

Hi James -

I re-ran my giant file upload with --debug and got many lines of debug output before it failed again.

Working up from the bottom of the output file, I see many lines indicating that the task has been cancelled, and then above those, errors like:

<Error><Code>RequestTimeout</Code><Message>Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.</Message><RequestId>75315EA98FC4D419</RequestId><HostId>/pRFbu/vMFn3UDUsbj+VQZskqA3mweFD2scmQImt4gbnWEbPfmkEAp3+JGsgpx7g</HostId></Error>
2014-12-22 11:33:28,257 - Thread-8 - botocore.hooks - DEBUG - Event needs-retry.s3.UploadPart: calling handler <botocore.retryhandler.RetryHandler object at 0x8de95ac>
2014-12-22 11:33:28,258 - Thread-8 - botocore.retryhandler - DEBUG - retry needed: matching HTTP status and error code seen: 400, RequestTimeout
2014-12-22 11:33:28,258 - Thread-8 - botocore.retryhandler - DEBUG - Reached the maximum number of retry attempts: 5
2014-12-22 11:33:28,259 - Thread-8 - botocore.retryhandler - DEBUG - No retry needed.
2014-12-22 11:33:28,259 - Thread-8 - botocore.hooks - DEBUG - Event after-call.s3.UploadPart: calling handler <function enhance_error_msg at 0x8be4224>
2014-12-22 11:33:28,260 - Thread-8 - botocore.hooks - DEBUG - Event after-call.s3.UploadPart: calling handler <awscli.errorhandler.ErrorHandler object at 0x8c2db2c>
2014-12-22 11:33:28,260 - Thread-8 - awscli.errorhandler - DEBUG - HTTP Response Code: 400
2014-12-22 11:33:28,261 - Thread-8 - awscli.customizations.s3.tasks - DEBUG - Error during part upload: A client error (RequestTimeout) occurred when calling the UploadPart operation: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/awscli/customizations/s3/tasks.py", line 239, in __call__
    self._filename.service, 'UploadPart', params)
  File "/usr/lib/python2.6/site-packages/awscli/customizations/s3/utils.py", line 221, in operate
    http_response, response_data = operation.call(**kwargs)
  File "/usr/lib/python2.6/site-packages/botocore/operation.py", line 98, in call
    parsed=response[1])
  File "/usr/lib/python2.6/site-packages/botocore/session.py", line 735, in emit
    return self._events.emit(event_name, **kwargs)
  File "/usr/lib/python2.6/site-packages/botocore/hooks.py", line 182, in emit
    response = handler(**kwargs)
  File "/usr/lib/python2.6/site-packages/awscli/errorhandler.py", line 70, in __call__
    http_status_code=http_response.status_code)
ClientError: A client error (RequestTimeout) occurred when calling the UploadPart operation: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.
2014-12-22 11:33:28,315 - Thread-8 - awscli.customizations.s3.executor - DEBUG - Worker thread invoking task: <awscli.customizations.s3.tasks.UploadPartTask object at 0xb502b1ac>
2014-12-22 11:33:28,316 - Thread-8 - awscli.customizations.s3.tasks - DEBUG - Uploading part 232 for filename: /backups/prod02/mygiantfile.gz
2014-12-22 11:33:28,316 - Thread-1 - awscli.customizations.s3.executor - DEBUG - Received print task: PrintTask(message=u'upload failed: ../../backups/prod02/mygiantfile.gz to s3://mybucket/mygiantfile.gz\nA client error (RequestTimeout) occurred when calling the UploadPart operation: Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed.', error=True, total_parts=None, warning=None)
upload failed: ../../backups/prod02/mygiantfile.gz to s3://mybucket/mygiantfile.gz

It's not obvious to me how to see 5 tries for part 232 - things don't seem to be happening in strict numerical part order - but if you tell me a string to look for, I'm happy to search the log for lines showing the 5 retries.

@jamesls (Member) commented Dec 22, 2014

Probably the quickest way to check for this is to look for "Retry needed" in the logs:

$ grep 'Retry needed' debug.2.log
2014-12-19 19:58:42,381 - Thread-6 - botocore.retryhandler - DEBUG - Retry needed, action of: 0.79597851249
2014-12-19 19:58:46,877 - Thread-13 - botocore.retryhandler - DEBUG - Retry needed, action of: 0.566578898223
2014-12-19 19:59:03,374 - Thread-7 - botocore.retryhandler - DEBUG - Retry needed, action of: 0.879621257998
2014-12-19 19:59:12,012 - Thread-7 - botocore.retryhandler - DEBUG - Retry needed, action of: 0.745445290666
2014-12-19 19:59:45,328 - Thread-10 - botocore.retryhandler - DEBUG - Retry needed, action of: 0.0227251462876
2014-12-19 20:00:23,009 - Thread-8 - botocore.retryhandler - DEBUG - Retry needed, action of: 0.632486181046

If you see increasing times on the same thread, this will show a request being continually retried (retries happen on the same thread as the original request):

2014-12-22 23:23:33,061 - MainThread - botocore.retryhandler - DEBUG - Retry needed, action of: 0.21749102538
2014-12-22 23:23:33,281 - MainThread - botocore.retryhandler - DEBUG - Retry needed, action of: 0.977166976238
2014-12-22 23:23:34,263 - MainThread - botocore.retryhandler - DEBUG - Retry needed, action of: 1.20192933873
2014-12-22 23:23:35,469 - MainThread - botocore.retryhandler - DEBUG - Retry needed, action of: 6.02506593184
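
A quick way to count retries per worker thread, assuming the log format shown above (the thread name is the fourth whitespace-separated field):

    grep 'Retry needed' debug.log | awk '{print $4}' | sort | uniq -c | sort -rn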

@evleeann (Author) commented

Okay so there is one thread that has 4 lines chronologically, where the "action of" time is increasing:

2014-12-22 11:17:31,006 - Thread-9 - botocore.retryhandler - DEBUG - Retry needed, action of: 0.598787798528
2014-12-22 11:20:44,345 - Thread-9 - botocore.retryhandler - DEBUG - Retry needed, action of: 0.801475020663
2014-12-22 11:25:08,490 - Thread-9 - botocore.retryhandler - DEBUG - Retry needed, action of: 1.70129612243
2014-12-22 11:32:17,285 - Thread-9 - botocore.retryhandler - DEBUG - Retry needed, action of: 2.95315374334

I didn't find any with 5 retries though, where the "action of" value kept increasing.

@jamesls (Member) commented Dec 23, 2014

I should have said "we try (not retry) up to 5 times" which would mean the initial attempt + 4 retries for a total of 5 attempts. So it looks like you are exhausting your retry attempts.

@evleeann (Author) commented

Is it common for the multipart upload to have failures where it needs multiple retries on parts? Based on the debug file that was generated, my file is looking at nearly 800 parts - if each part can be expected to take up to 5 tries before it succeeds, my uploads are not going to be very efficient.

I'm pretty new to the AWS world and the CLI, so I am a bit surprised that the connections are flaky enough that you've had to incorporate that many retries - and that they get used this much!

Do you have any recommendations for best practices when trying to push files into the aws world?

thanks very much!

@jamesls (Member) commented Dec 23, 2014

It will depend on your connection. For example, from an EC2 instance to an S3 bucket in the same region, I was not able to see any issues with retries. However, I can certainly see how ConnectionErrors can become more frequent if you are going across a less reliable WAN link. In the past the two things that have helped are:

  1. Increase the number of retries.
  2. Decrease the concurrency level. We currently use a thread pool of 10. Sometimes lowering that value helps reduce the number of ConnectionErrors encountered.

The gap here is that we don't expose any configuration options to alter these two values, but it appears that we probably should, specifically for scenarios such as this.
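
For anyone landing on this thread later: newer CLI releases do expose both of these knobs through ~/.aws/config. A sketch, with example values only (exact option names and availability depend on the CLI/botocore version installed):

    [default]
    # more attempts per request (supported by newer CLI/botocore releases)
    retry_mode = standard
    max_attempts = 10
    # lower S3 transfer concurrency (the CLI's default thread pool is 10)
    s3 =
        max_concurrent_requests = 4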

@jefft commented Dec 28, 2014

I had perhaps the same problem as OP. In my case, trying to upload a 37MB file over a slow residential link, I persistently got failures:

/usr/local/bin/aws s3 sync /srv/www/content.coastec.net.au s3://content.coastec.net.au --exclude .dropbox
upload failed: ../srv/www/content.coastec.net.au/podcasts/2014.12.21.mp3 to s3://content.coastec.net.au/podcasts/2014.12.21.mp3
A client error (RequestTimeTooSkewed) occurred when calling the UploadPart operation: The difference between the request time and the current time is too large.
upload failed: ../srv/www/content.coastec.net.au/podcasts/2014.12.21.mp3 to s3://content.coastec.net.au/podcasts/2014.12.21.mp3
A client error (RequestTimeTooSkewed) occurred when calling the UploadPart operation: The difference between the request time and the current time is too large.
upload failed: ../srv/www/content.coastec.net.au/podcasts/2014.12.14.mp3 to s3://content.coastec.net.au/podcasts/2014.12.14.mp3
A client error (RequestTimeTooSkewed) occurred when calling the UploadPart operation: The difference between the request time and the current time is too large.

^CCleaning up. Please wait...                       
^CTraceback (most recent call last):
  File "/usr/local/bin/aws", line 23, in <module>
    sys.exit(main())
  File "/usr/local/bin/aws", line 19, in main
    return awscli.clidriver.main()
  File "/usr/local/lib/python2.7/dist-packages/awscli/clidriver.py", line 49, in main
    return driver.main()
  File "/usr/local/lib/python2.7/dist-packages/awscli/clidriver.py", line 195, in main
    return command_table[parsed_args.command](remaining, parsed_args)
  File "/usr/local/lib/python2.7/dist-packages/awscli/customizations/commands.py", line 182, in __call__
    parsed_globals)

(this with local clock synced to *.amazon.pool.ntp.org)
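
A quick way to confirm the offset against that same pool without stepping the clock, assuming ntpdate is installed:

    ntpdate -q 0.amazon.pool.ntp.org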

After upgrading awscli to 1.6.10 ('pip install --upgrade awscli'), I no longer get the RequestTimeTooSkewed error. I do get socket errors causing "Max retries exceeded", which is a bit surprising since I don't see any packet loss to the S3 endpoint, but at least after repeated runs the file uploads:

/usr/local/bin/aws s3 sync /srv/www/content.coastec.net.au s3://content.coastec.net.au --exclude .dropbox

upload failed: ../../../srv/www/content.coastec.net.au/podcasts/2014.12.14.mp3 to s3://content.coastec.net.au/podcasts/2014.12.14.mp3
HTTPSConnectionPool(host='s3-ap-southeast-2.amazonaws.com', port=443): Max retries exceeded with url: /content.coastec.net.au/podcasts/2014.12.14.mp3?partNumber=5&uploadId=TjG_5qgsBHGueXZnd7ZfX7CTV4vR.t9_QYZYhq0BLU62g.QYUzkRLwuiIk1u7cCeLdf34yPZheQRkm1xNWYhFzfAWUnd0c40rmn6s2G0uPg- (Caused by <class 'socket.error'>: [Errno 32] Broken pipe)
upload: ../../../srv/www/content.coastec.net.au/podcasts/2014.12.24.mp3 to s3://content.coastec.net.au/podcasts/2014.12.24.mp3
Completed 12 of 13 part(s) with 1 file(s) remaining
upload: ../../../srv/www/content.coastec.net.au/podcasts/2014.12.21.mp3 to s3://content.coastec.net.au/podcasts/2014.12.21.mp3

@jamesls (Member) commented Jan 13, 2015

I'm trying to aggregate all of these related issues into the main feature requests for S3 command enhancements. There are two features I think would help here:

  1. #1092: Add ability for S3 commands to increase retry count
  2. #907: Limit number of parallel S3 transfers

Closing this ticket in favor of these two issues.

@jamesls closed this as completed Jan 13, 2015
@jamesls added the bug, feature-request, and accepted labels and removed the bug and investigating labels Jan 13, 2015