random cURL errors on HTTPS requests to SWF and DynamoDB #924

Closed
ppaulis opened this issue Mar 3, 2016 · 67 comments
Labels
investigating This issue is being investigated and/or work is in progress to resolve the issue.

Comments

@ppaulis

ppaulis commented Mar 3, 2016

Hello!

We recently switched our EC2 instances from v2 to v3 (the most recent version) of the PHP SDK. We use SWF, DynamoDB and S3. Since switching to v3, we have been seeing cURL errors that seem to appear completely at random:

The application has thrown an exception!
Aws\Swf\Exception\SwfException
Error executing "PollForActivityTask" on "https://swf.us-east-1.amazonaws.com"; AWS HTTP error: cURL error 56: SSL read: error:00000000:lib(0):func(0):reason(0), errno 104 (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)

Sometimes the exception appears twice per hour, sometimes once in three hours... there's no pattern.

We already switched back to an older AMI and updated cURL from 7.35.0 (which has a problem with chunked uploads) to 7.36.0, but nothing helps. We are polling SWF for activity tasks with long polling enabled, so there shouldn't be too many requests.

I googled this of course before opening an issue, but the only topics I found either date from 2012-2013 or were related to a broken load balancer on AWS. And that would probably be too much of a coincidence in our case...

Does anyone know about this problem..? I can't even say if it's a problem in the SDK or something else. So I'm grateful for every hint you can give me!

Thanks a lot!
Pascal

@jeskew
Contributor

jeskew commented Mar 3, 2016

It looks like the connection is being reset while OpenSSL is attempting to decrypt data, which can be caused by the client and server being unable to agree on a TLS protocol. Which version of OpenSSL are you using?

OpenSSL emits error 104 (ECONNRESET) when the connection is reset by a peer host, and this error is happening while cURL is attempting to read data. cURL is therefore surfacing this as a read error (cURL error 56 - CURLE_RECV_ERROR).
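
For anyone who wants to tell this particular failure apart from other cURL errors in their logs, here is a minimal sketch. It assumes the exception message has the same shape as the one quoted in this issue; `parseCurlFailure` is a hypothetical helper, not part of the SDK or Guzzle, and it needs PHP 7 or later for the type declarations.

```php
<?php
// Hypothetical helper (not an SDK or Guzzle API): pull the numeric codes
// out of an exception message shaped like the one reported in this issue.
function parseCurlFailure(string $message): array
{
    $curlCode = null;
    $errno = null;
    if (preg_match('/cURL error (\d+):/', $message, $m)) {
        $curlCode = (int) $m[1];
    }
    if (preg_match('/errno (\d+)/', $message, $m)) {
        $errno = (int) $m[1];
    }
    return ['curl' => $curlCode, 'errno' => $errno];
}

$message = 'AWS HTTP error: cURL error 56: SSL read: '
    . 'error:00000000:lib(0):func(0):reason(0), errno 104';
$parsed = parseCurlFailure($message);

// 56 is CURLE_RECV_ERROR; 104 is ECONNRESET on Linux.
$isPeerReset = $parsed['curl'] === 56 && $parsed['errno'] === 104;
var_dump($isPeerReset);
```

Matching on the message text is brittle; where the Guzzle exception object is available, reading the error number from it directly is likely the sturdier choice.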

I'll look into why the SDK's behavior in response to this event would differ from v2 to v3.

@jeskew jeskew added the v3 label Mar 3, 2016
@ppaulis
Author

ppaulis commented Mar 4, 2016

Hey @jeskew ,

thanks for the quick reply! What I have tried so far:

  • using the standard setting without specifying a SSL/TLS version
  • I tried with TLS 1.2:
return
    [
        'aws' =>
        [
            'region' => 'us-east-1',
            'curl.options' => [
                'CURLOPT_SSLVERSION' => 'CURL_SSLVERSION_TLSv1_2'
            ]
        ]
    ]

or by directly using the value of the constant:

'CURLOPT_SSLVERSION' => 6

It's a ZF2 application, so this code is in an aws.local.php file in my application configuration. I'll try to put it directly in the creation of the SwfClient.

I tried, so far, OpenSSL 1.0.1f and 1.0.2g. Both with the same result.

Thanks for your help!
Pascal

@ppaulis
Author

ppaulis commented Mar 4, 2016

I also tried using the CA cert bundle from the curl website by specifying it in the php.ini with openssl.cafile, no luck :-/

@ppaulis
Author

ppaulis commented Mar 4, 2016

A few additional details:

ubuntu@ip-10-0-11-226:/tmp$ curl -V
curl 7.36.0 (x86_64-pc-linux-gnu) libcurl/7.36.0 OpenSSL/1.0.1f zlib/1.2.8 libidn/1.28
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtsp smtp smtps telnet tftp 
Features: AsynchDNS GSS-Negotiate IDN IPv6 Largefile NTLM NTLM_WB SSL libz TLS-SRP

and here's the output of a connection attempt to the SWF endpoint:

ubuntu@ip-10-0-11-226:/tmp$ openssl s_client -connect "swf.us-west-2.amazonaws.com:443"
CONNECTED(00000003)
depth=1 C = US, O = Symantec Corporation, OU = Symantec Trust Network, CN = Symantec Class 3 Secure Server CA - G4
verify error:num=20:unable to get local issuer certificate
verify return:0

Certificate chain
 0 s:/C=US/ST=Washington/L=Seattle/O=Amazon.com, Inc./CN=swf.us-west-2.amazonaws.com
   i:/C=US/O=Symantec Corporation/OU=Symantec Trust Network/CN=Symantec Class 3 Secure Server CA - G4
 1 s:/C=US/O=Symantec Corporation/OU=Symantec Trust Network/CN=Symantec Class 3 Secure Server CA - G4
   i:/C=US/O=VeriSign, Inc./OU=VeriSign Trust Network/OU=(c) 2006 VeriSign, Inc. - For authorized use only/CN=VeriSign Class 3 Public Primary Certification Authority - G5

Server certificate
-----BEGIN CERTIFICATE-----
(removed for the sake of readability)
-----END CERTIFICATE-----
subject=/C=US/ST=Washington/L=Seattle/O=Amazon.com, Inc./CN=swf.us-west-2.amazonaws.com
issuer=/C=US/O=Symantec Corporation/OU=Symantec Trust Network/CN=Symantec Class 3 Secure Server CA - G4
---
No client certificate CA names sent
---
SSL handshake has read 2762 bytes and written 621 bytes
---
New, TLSv1/SSLv3, Cipher is AES128-SHA
Server public key is 2048 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1
    Cipher    : AES128-SHA
    Session-ID: DDAB0212D8D9B49556328037D194D03EB7F22E50B0548E447B015F024874963C
    Session-ID-ctx: 
    Master-Key: D1968248414FFB3CB5AE009DB37E5836837F8257F1540170936EDC45BDDC3CFC60ECE93168D9D15D8B89765D917B1782
    Key-Arg   : None
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    Start Time: 1457086840
    Timeout   : 300 (sec)
    Verify return code: 20 (unable to get local issuer certificate)
---

When I run instead:

openssl s_client -connect "swf.us-west-2.amazonaws.com:443" -CAfile /tmp/cacert.pem

I get a return code 0:

ubuntu@ip-10-0-11-226:/tmp$ openssl s_client -connect "swf.us-west-2.amazonaws.com:443" -CAfile /tmp/cacert.pem 
CONNECTED(00000003)
depth=2 C = US, O = "VeriSign, Inc.", OU = VeriSign Trust Network, OU = "(c) 2006 VeriSign, Inc. - For authorized use only", CN = VeriSign Class 3 Public Primary Certification Authority - G5
verify return:1
depth=1 C = US, O = Symantec Corporation, OU = Symantec Trust Network, CN = Symantec Class 3 Secure Server CA - G4
verify return:1
depth=0 C = US, ST = Washington, L = Seattle, O = "Amazon.com, Inc.", CN = swf.us-west-2.amazonaws.com
verify return:1
---
Certificate chain
 0 s:/C=US/ST=Washington/L=Seattle/O=Amazon.com, Inc./CN=swf.us-west-2.amazonaws.com
   i:/C=US/O=Symantec Corporation/OU=Symantec Trust Network/CN=Symantec Class 3 Secure Server CA - G4
 1 s:/C=US/O=Symantec Corporation/OU=Symantec Trust Network/CN=Symantec Class 3 Secure Server CA - G4
   i:/C=US/O=VeriSign, Inc./OU=VeriSign Trust Network/OU=(c) 2006 VeriSign, Inc. - For authorized use only/CN=VeriSign Class 3 Public Primary Certification Authority - G5
---
Server certificate
-----BEGIN CERTIFICATE-----
...
-----END CERTIFICATE-----
subject=/C=US/ST=Washington/L=Seattle/O=Amazon.com, Inc./CN=swf.us-west-2.amazonaws.com
issuer=/C=US/O=Symantec Corporation/OU=Symantec Trust Network/CN=Symantec Class 3 Secure Server CA - G4
---
No client certificate CA names sent
---
SSL handshake has read 2762 bytes and written 621 bytes
---
New, TLSv1/SSLv3, Cipher is AES128-SHA
Server public key is 2048 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1
    Cipher    : AES128-SHA
    Session-ID: D908CDB1F5DD27E4281543407894B92FA33A181172068DA98FCAD4CF77E2963C
    Session-ID-ctx: 
    Master-Key: 658CED64C142D37CCCAC7F2163F707357D071894874288AEF5B364DD81A533A88B111BB68059F9C8073A47F86EA212BC
    Key-Arg   : None
    PSK identity: None
    PSK identity hint: None
    SRP username: None
    Start Time: 1457087123
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)
---

@ppaulis
Author

ppaulis commented Mar 4, 2016

btw, the problem appears both on us-east-1 and us-west-2

@jeskew
Contributor

jeskew commented Mar 4, 2016

Is the CA bundle at /tmp/cacert.pem the bundle provided on the curl website?

@ppaulis
Author

ppaulis commented Mar 4, 2016

@jeskew yes, it is.

@ppaulis
Author

ppaulis commented Mar 4, 2016

I currently have a few machines running an older version of our application that uses v2 of the SDK, to double-check that the exception doesn't appear on that version. I'll keep you updated!

@ppaulis
Author

ppaulis commented Mar 4, 2016

@jeskew The result is as expected. The same machines with the old application version (sdk v2.8.18, guzzle 3.7.1) don't produce the exceptions.

@jeskew
Contributor

jeskew commented Mar 5, 2016

v2 of the SDK uses Guzzle 3, which includes its own certificate authority bundle. v3 uses Guzzle 5/6, which rely on an external bundle.

Can you set the openssl.cafile ini setting and then check what is output by running openssl_get_cert_locations()? (This function is only defined in PHP >= 5.6.) Since running curl on the command line fails unless you specify a CAFile option, this will need to be set in PHP.

@ppaulis
Author

ppaulis commented Mar 5, 2016

@jeskew without specifying openssl.cafile I get:

array(8) {
  ["default_cert_file"]=>
  string(21) "/usr/lib/ssl/cert.pem"
  ["default_cert_file_env"]=>
  string(13) "SSL_CERT_FILE"
  ["default_cert_dir"]=>
  string(18) "/usr/lib/ssl/certs"
  ["default_cert_dir_env"]=>
  string(12) "SSL_CERT_DIR"
  ["default_private_dir"]=>
  string(20) "/usr/lib/ssl/private"
  ["default_default_cert_area"]=>
  string(12) "/usr/lib/ssl"
  ["ini_cafile"]=>
  string(0) ""
  ["ini_capath"]=>
  string(0) ""
}

and after specifying openssl.cafile in the php.ini, I get:

array(8) {
  ["default_cert_file"]=>
  string(21) "/usr/lib/ssl/cert.pem"
  ["default_cert_file_env"]=>
  string(13) "SSL_CERT_FILE"
  ["default_cert_dir"]=>
  string(18) "/usr/lib/ssl/certs"
  ["default_cert_dir_env"]=>
  string(12) "SSL_CERT_DIR"
  ["default_private_dir"]=>
  string(20) "/usr/lib/ssl/private"
  ["default_default_cert_area"]=>
  string(12) "/usr/lib/ssl"
  ["ini_cafile"]=>
  string(15) "/tmp/cacert.pem"
  ["ini_capath"]=>
  string(0) ""
}

I'm currently using PHP 5.6.18.

I already tested this configuration, with the CA bundle specified in the php.ini. The exceptions keep appearing :-/

@jeskew
Contributor

jeskew commented Mar 5, 2016

Is it possible that the processes that terminate following an OpenSSL error are either setting the SSL_CERT_FILE environment variable or unsetting the openssl.cafile ini variable? Clearly OpenSSL/cURL is loading this ini value (the equivalent of the command-line CAfile option) normally, but as you mentioned, the error you're seeing is occurring in a random minority of cases.

v2 of the SDK uses Guzzle 3, which will use a vendored CA bundle by default, whereas v3 (via Guzzle 5/6) will use PHP, OS, and OpenSSL configuration as its default. You can override this on a client by specifying the verify HTTP option. For example:

$dynamoClient = new \Aws\DynamoDb\DynamoDbClient([
    'version' => 'latest',
    'region' => 'us-west-2',
    'http' => [
        'verify' => '/tmp/cacert.pem',
    ],
]);

@ppaulis
Author

ppaulis commented Mar 5, 2016

We never handled the SSL_CERT_FILE or openssl.cafile manually so far, because guzzle v3 handled this perfectly itself.

I'll give it a try with the http option and keep you updated!

Thank you, Jonathan, for your time!

@ppaulis
Author

ppaulis commented Mar 5, 2016

I have now specified the CA bundle manually in the PHP code:

$aws->createSwf([
            'version' => '2012-01-25',
            'http' => [
                'verify' => '/tmp/cacert.pem',
            ]
        ]);

the region, domain, etc. is loaded via the ZF2 config files.

@jeskew
Contributor

jeskew commented Mar 5, 2016

One last question: are you connecting to AWS through any kind of network proxy? A similar error message was reported on the Docker repository: moby/moby#2011

They mention that this can happen to traffic from a Docker container, and one comment mentions that they saw a similar error when using a VPN.

@ppaulis
Author

ppaulis commented Mar 5, 2016

Hmm, we are using simple EC2 instances with Ubuntu in a VPC on AWS. The outgoing traffic passes through a NAT (http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_NAT_Instance.html). But I'm not sure whether connections to AWS services like SWF remain in the internal network... I'll check that with a colleague and get back to you.

Thanks!

@ppaulis
Author

ppaulis commented Mar 6, 2016

@jeskew I confirm that connections to SWF go through the NAT.

@ppaulis
Author

ppaulis commented Mar 7, 2016

@jeskew The 'verify' => '/tmp/cacert.pem' unfortunately doesn't help either :-/

@jeskew
Contributor

jeskew commented Mar 7, 2016

Since neither specifying an openssl.cafile ini variable nor a verify http option worked, the error is not related to the CA bundle. One other change from Guzzle 3 to Guzzle 6 is that Guzzle now reuses cURL handles between requests. There's an open pull request on OpenSSL that would remedy a failure case where you connect to a pool of hosts (the fleet addressable as swf.us-west-2.amazonaws.com, for example) and negotiate an SSL session, save that session, and then use the session to connect to a different host in the same pool. If the second host only supports TLS 1.0 and the first host supports TLS 1.2, the session will not downgrade but will instead error out. There was a lot of discussion on the JavaScript SDK about how this issue was affecting DynamoDB customers.

Could you try forcing TLS 1.0? In v3, you would do so like this:

$ddb = new \Aws\DynamoDb\DynamoDbClient([
    ...
    'http' => [
        'curl' => [
            CURLOPT_SSLVERSION => CURL_SSLVERSION_TLSv1_0,
        ]
    ]
]);

OpenSSL is working on a fix but it has not yet been accepted and released.

@ppaulis
Author

ppaulis commented Mar 7, 2016

it's in place:

protected function getClient()
    {
        /** @var \Aws\Sdk $aws */
        $aws = $this->getServiceLocator()->get(Sdk::class);

        return $aws->createSwf([
            'version' => '2012-01-25',
            'http' => [
                'curl' => [
                    CURLOPT_SSLVERSION => CURL_SSLVERSION_TLSv1_0,
                ]
            ]
        ]);
    }

I'll keep you updated! Thanks!

@ppaulis
Author

ppaulis commented Mar 7, 2016

This sounds like a very serious lead.. It would explain the fact that the error is completely random.

@ppaulis
Author

ppaulis commented Mar 8, 2016

Got an exception again, but I'm now trying with:

protected function getClient()
    {
        /** @var \Aws\Sdk $aws */
        $aws = $this->getServiceLocator()->get(Sdk::class);

        return $aws->createSwf([
            'version' => '2012-01-25',
            'curl.options' => [
                CURLOPT_SSLVERSION => CURL_SSLVERSION_TLSv1_0,
            ]
        ]);
    }

I think I've seen that somewhere in the docs...

@jeskew
Contributor

jeskew commented Mar 8, 2016

curl.options will only be used by a v2 client. Did you get the same cURL/OpenSSL error codes (cURL error 56: SSL read: error:00000000:lib(0):func(0):reason(0), errno 104)?

@ppaulis
Author

ppaulis commented Mar 8, 2016

Yes I still get them :-/


@ppaulis
Author

ppaulis commented Mar 9, 2016

@jeskew just scanned multiple times the SWF endpoint in us-east-1:

https://www.ssllabs.com/ssltest/analyze.html?d=swf.us-east-1.amazonaws.com

On different IPs, the result seems always the same. TLS 1.2 not supported, but TLS 1.0 is.

@ppaulis
Author

ppaulis commented Mar 9, 2016

I just tried to force TLS 1.2, and indeed, I get the following:

GuzzleHttp\Exception\ConnectException
 cURL error 35: error:14077102:SSL routines:SSL23_GET_SERVER_HELLO:unsupported protocol (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)

@jeskew
Contributor

jeskew commented Mar 9, 2016

The root cause might be something else that's supported by one server and not another. I'm still unable to reproduce the issue, so if you can capture any more context about a failing request that would be very helpful.

In the meantime, could you verify that this is related to sharing cURL handles? You can disable handle sharing by creating a custom Guzzle client like so:

use Aws\Handler\GuzzleV6\GuzzleHandler;
use Aws\Swf\SwfClient;
use GuzzleHttp\Client;
use GuzzleHttp\Handler\CurlFactory;
use GuzzleHttp\Handler\CurlHandler;
use GuzzleHttp\Handler\CurlMultiHandler;
use GuzzleHttp\Handler\Proxy;
use GuzzleHttp\HandlerStack;

// Create a Guzzle client that will not share cURL handles
$guzzleClient = new Client([
    'handler' => HandlerStack::create(Proxy::wrapSync(
        new CurlMultiHandler(['handle_factory' => new CurlFactory(0)]),
        new CurlHandler(['handle_factory' => new CurlFactory(0)])
    ))
]);

// Use the no-shared handle client to create an AWS client
$swfClient = new SwfClient([
    'region' => 'us-east-1',
    'version' => 'latest',
    'http_handler' => new GuzzleHandler($guzzleClient),
]);

@ppaulis
Author

ppaulis commented Mar 9, 2016

I'm still getting the exceptions, but I'll check if I can get more verbose output from curl about this.

Thanks!
Pascal

@ppaulis
Author

ppaulis commented Mar 9, 2016

I just added:

sudo ssldump -a -A -H -i eth0 > /tmp/ssldump.txt &

I'll get back to you in a few hours!

Thanks!

@srinivasudadi9000

Hi ppaulis,

I would like to send push notifications for Android.
For that I wrote code that initializes a request with curl_init(),
but on the AWS server the connection shows as closed, as below:
[image]

@srinivasudadi9000

In the web service it shows the following error:
Fatal error: Call to undefined function curl_init() in /home/ubuntu/projects/furreka/trunk/server/furreka_rest_api/api.php on line 267

@ppaulis
Author

ppaulis commented Apr 21, 2017

@srinivasudadi9000 it seems to me that there is a package missing on your system? Did you install the php-curl package?

@srinivasudadi9000

Hi ppaulis, please guide me: which package is that?

$fields = array
(
'registration_ids' => $registrationIds,
'data' => $msg
);

$headers = array
(
'Authorization: key=' . API_ACCESS_KEY,
'Content-Type: application/json'
);

$ch = curl_init();
curl_setopt( $ch,CURLOPT_URL, 'https://gcm-http.googleapis.com/gcm/send' );
curl_setopt( $ch,CURLOPT_POST, true );
curl_setopt( $ch,CURLOPT_HTTPHEADER, $headers );
curl_setopt( $ch,CURLOPT_RETURNTRANSFER, true );
curl_setopt( $ch,CURLOPT_SSL_VERIFYPEER, false );
curl_setopt( $ch,CURLOPT_POSTFIELDS, json_encode( $fields ) );
$result = curl_exec($ch );
// print_r($headers );
curl_close( $ch );

$userdetails = array
(
    // 'failure' must be a quoted string; the bare word is an undefined constant
    'Userdetails' => 'failure',
    'status' => $result,
    'users' => $users
);

print_r(json_encode($userdetails));

@srinivasudadi9000

I wrote code for sending push notifications for Android. On my local server (192...) it works fine, but on the AWS EC2 server it shows a fatal error. How can I resolve it? If possible, can you send me the package related to curl?

@srinivasudadi9000

ppaulis, how do I install the php-curl package on AWS? Can you send me a reference?

@ppaulis
Author

ppaulis commented Apr 21, 2017

@srinivasudadi9000 Well, it depends on your machine... Sometimes it's php-curl, or php5-curl, etc. The command to install it depends on your operating system.

Because your questions are clearly not related to the original topic of this issue, I suggest that you ask it rather on a site like stackoverflow.

Greetings,
Pascal

@imshashank
Contributor

@ppaulis Is there any update on the issue? Are you all getting random curl errors?

@oberman

oberman commented May 19, 2017

Yes and no. Internal (to the SDK) curl errors with retries are unavoidable. But if you adjust the SDK settings, you can reduce the number of times you run out of retries and an error bubbles up.

The settings are here:
https://docs.aws.amazon.com/aws-sdk-php/v3/guide/guide/configuration.html

My goal was to maximize API call success while minimizing time spent waiting on a bad remote server (or servers) so I globally changed:
retries, 3 -> 40
connect_timeout, infinity to 1 second
I left the request timeout (timeout) at its default of infinity, but I override it per service or per service method call (usually to 1 second) when I know that the service or method is idempotent (i.e. doing the same request twice has no problematic side effects; as a trivial example, x++ is NOT idempotent but setting x to 3 is). It's important to know that the SDK retries internally on connect_timeout or timeout, hence the bump in retries from 3 to 40 (to give AWS time to reroute traffic from bad to good servers, to allow a server to recover, etc.).
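
As a rough sketch of what those settings might look like in code: the arrays below use the SDK v3 `retries` and `http` client options and the per-command `@http` parameter. The region, table name and key are placeholders invented for the illustration, not values from my setup.

```php
<?php
// Client-wide settings, as they might be passed to `new Aws\Sdk($sdkConfig)`
// or to an individual client constructor. The region is a placeholder.
$sdkConfig = [
    'region'  => 'us-east-1',
    'version' => 'latest',
    'retries' => 40,            // up from 3
    'http'    => [
        'connect_timeout' => 1, // seconds
        // 'timeout' is left unset so the request timeout keeps its default
    ],
];

// Per-call override for an operation known to be idempotent, for example
// $dynamoDb->getItem($getItemParams). '@http' applies only to this call.
$getItemParams = [
    'TableName' => 'example-table',          // hypothetical table
    'Key'       => ['id' => ['S' => '123']], // hypothetical key
    '@http'     => ['timeout' => 1],         // seconds
];
```

Keeping the short request timeout on individual calls rather than on the client means non-idempotent operations are never cut off and retried by accident.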

I also added instrumentation to my code to peek into the SDK internal retries. See the graph of my last week.
[Image: screen shot 2017-05-19 at 1 10 33 pm]
You can see that retries happen constantly (avg of 1 every 2 minutes). But, there are spikes. In this case ganglia is dividing the truth, and "max 15.7" was really a period where I saw ~300 retries in 1 minute. It's important to know that I'm making on the order of 10k AWS API calls per minute so even 300 retries was barely noticed.

@kstich
Contributor

kstich commented Jun 8, 2017

@oberman @ppaulis @senorbacon

We are currently unable to reproduce the issue and are opening new discussions to explore any possible leads. As part of trying to get to the bottom of this issue, we would like to reaffirm what your broken states are.

  • Are you all still experiencing these cURL errors?
  • Are the errors inconsistent with regards to time, capacity, and network traffic?
  • Are the errors arbitrary in that they are different cURL messages, or are they the same?

Additionally, if you're still experiencing errors, can you update us with your current state on the following information:

  • PHP Version
  • cURL Version
  • OpenSSL Version
  • Guzzle Version
  • What AWS service the issue is occurring with
  • If you have tried contacting the service/support team with regards to the issue
  • What steps in this thread, via comment #, you have tried to resolve the issue
  • Exception text
  • If possible, the outputs of curl_errno or (if Guzzle 6+) GuzzleHttp\Exception\RequestException::getHandlerContext for each of the arbitrary errors

We know this is a lot of information to ask for, but it's all in hopes of driving this to a resolution.

Thank you!

@kstich kstich added the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jun 9, 2017
@oberman

oberman commented Jun 12, 2017

This is a long thread now and I've contributed in two ways:

1.) cURL error 56: Searching my logs, I see this exception is exceedingly rare, but still happens. By "exceedingly rare" I mean 4 times in 2017 out of what I guess are billions of calls to AWS services. My info:
php55-5.5.38-2.119.amzn1.x86_64
curl-7.47.1-9.66.amzn1.x86_64
openssl-1.0.1k-15.96.amzn1.x86_64
AWS SDK 3.24.4
Guzzle 6.2.3

The most recent failure was on 2017-06-07 10:27:04 GMT-4. The stack trace:
exception 'GuzzleHttp\Exception\RequestException' with message 'cURL error 56: TCP connection reset by peer (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)' in REDACTED/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php:187
Stack trace:
#0 REDACTED/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php(150): GuzzleHttp\Handler\CurlFactory::createRejection(Object(GuzzleHttp\Handler\EasyHandle), Array)
#1 REDACTED/vendor/guzzlehttp/guzzle/src/Handler/CurlFactory.php(103): GuzzleHttp\Handler\CurlFactory::finishError(Object(GuzzleHttp\Handler\CurlMultiHandler), Object(GuzzleHttp\Handler\EasyHandle), Object(GuzzleHttp\Handler\CurlFactory))
#2 REDACTED/vendor/guzzlehttp/guzzle/src/Handler/CurlMultiHandler.php(180): GuzzleHttp\Handler\CurlFactory::finish(Object(GuzzleHttp\Handler\CurlMultiHandler), Object(GuzzleHttp\Handler\EasyHandle), Object(GuzzleHttp\Handler\CurlFactory))
#3 REDACTED/vendor/guzzlehttp/guzzle/src/Handler/CurlMultiHandler.php(108): GuzzleHttp\Handler\CurlMultiHandler->processMessages()
#4 REDACTED/vendor/guzzlehttp/guzzle/src/Handler/CurlMultiHandler.php(123): GuzzleHttp\Handler\CurlMultiHandler->tick()
#5 REDACTED/vendor/guzzlehttp/promises/src/Promise.php(246): GuzzleHttp\Handler\CurlMultiHandler->execute(true)
#6 REDACTED/vendor/guzzlehttp/promises/src/Promise.php(223): GuzzleHttp\Promise\Promise->invokeWaitFn()
#7 REDACTED/vendor/guzzlehttp/promises/src/Promise.php(267): GuzzleHttp\Promise\Promise->waitIfPending()
#8 REDACTED/vendor/guzzlehttp/promises/src/Promise.php(225): GuzzleHttp\Promise\Promise->invokeWaitList()
#9 REDACTED/vendor/guzzlehttp/promises/src/Promise.php(267): GuzzleHttp\Promise\Promise->waitIfPending()
#10 REDACTED/vendor/guzzlehttp/promises/src/Promise.php(225): GuzzleHttp\Promise\Promise->invokeWaitList()
#11 REDACTED/vendor/guzzlehttp/promises/src/Promise.php(62): GuzzleHttp\Promise\Promise->waitIfPending()
#12 REDACTED/vendor/aws/aws-sdk-php/src/AwsClientTrait.php(59): GuzzleHttp\Promise\Promise->wait()
#13 REDACTED/vendor/aws/aws-sdk-php/src/AwsClientTrait.php(78): Aws\AwsClient->execute(Object(Aws\Command))

This error is a bummer because the SDK doesn't try to retry it. I think it might be the only error like that?

2.) The general issue of failures at the curl level. I'm able to control the time a call takes and the success rate of the call by configuring timeouts and number of retries (see my previous post). In my log file processing I break internal retries into the following categories:

capacity = DynamoDB specific and is generally zero.

error = {5xx, PR_CONNECT_RESET_ERROR, PR_NOT_CONNECTED_ERROR}. Generally has been decreasing over time. A year ago I could have measured this category at the hour level (e.g. at least 1 event/hour), but recently I would have to expand to day (and even then I have days without an error).

timeout = {timed out before SSL handshake, Connection timed out after, Resolving timed out after, Operation timed out}. This is steady and I see it happen at the "10 minute level" (e.g. at least 1 event every 10 minutes). Again, as previously noted, I have extremely aggressive timeout settings and a generous number of retries.

@kstich
Contributor

kstich commented Jun 19, 2017

@oberman Thanks for all the updated information. cURL error 56 is retried, so you must be running over the specified number of retries.

You mentioned that there are fewer of these errors now than before. What, if anything, has changed in that time?

Is there any correlation between this cURL error surfacing and your outbound traffic? It's possible there is an issue with how many TLS connections are running simultaneously from your application.

@oberman

oberman commented Jun 19, 2017

curl error 56:
Is this a change? I was told curl errno 56 is not retried, except for AWS API calls that are guaranteed to be idempotent:
#983

fewer errors:
My error tracking doesn't distinguish between {5xx, PR_CONNECT_RESET_ERROR, PR_NOT_CONNECTED_ERROR}. I peeked at today's logging and the 2 errors I had in the last four hours were 500 errors against DynamoDB (that retried and succeeded). The last time I had consistent networking problems was before August 2016. AWS released a server-side fix for DynamoDB around then (its details are hidden from me) that eliminated the class of problem I was having.

@kstich
Contributor

kstich commented Jun 19, 2017

Re: cURL 56 - Apologies, I was looking at outdated information for that portion of the response.

Are the cURL 56 errors you are receiving specific to an AWS service, or is it across multiple services? You've mentioned both SQS and DynamoDB in this thread, but with a decrease in DynamoDB issues as well.

@oberman

oberman commented Jun 19, 2017

I think we're mixing things up by talking about different issues at the same time:

curl 56 = I see no upward or downward trend. It's extremely rare (a small # per month). I see it in SQS and DDB, but that makes sense since those are my two highest volume APIs. I dug up the latest time it happened and will paste it at the bottom.

AWS API errors = I've seen two plateaus. One before August 2016 and one after. It was higher before and lower now. I believe before was networking errors + 5xx errors. After is just 5xx errors.

The last curl 56 error I saw:

* Rebuilt URL to: https://dynamodb.us-east-1.amazonaws.com/
* Hostname dynamodb.us-east-1.amazonaws.com was found in DNS cache
*   Trying 52.119.226.86...
* Connected to dynamodb.us-east-1.amazonaws.com (52.119.226.86) port 443 (#0)
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* ALPN/NPN, server did not agree to a protocol
* SSL connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate:
*       subject: CN=dynamodb.us-east-1.amazonaws.com,O="Amazon.com, Inc.",L=Seattle,ST=Washington,C=US
*       start date: Sep 30 00:00:00 2016 GMT
*       expire date: Dec 02 23:59:59 2017 GMT
*       common name: dynamodb.us-east-1.amazonaws.com
*       issuer: CN=Symantec Class 3 Secure Server CA - G4,OU=Symantec Trust Network,O=Symantec Corporation,C=US
> POST / HTTP/1.1
Host: dynamodb.us-east-1.amazonaws.com
X-Amz-Target: DynamoDB_20120810.Query
Content-Type: application/x-amz-json-1.0
aws-sdk-invocation-id: 4190b86981be52048a57ea4ac7a102e4
aws-sdk-retry: 0/0
X-Amz-Date: 20170607T142703Z
Authorization: AWS4-HMAC-SHA256 Credential=[KEY]/20170607/us-east-1/dynamodb/aws4_request, SignedHeaders=aws-sdk-invocation-id;aws-sdk-retry;host;x-amz-date;x-amz-target, Signature=[SIGNATURE]
User-Agent: aws-sdk-php/3.24.4 GuzzleHttp/6.2.1 curl/7.47.1 PHP/5.5.38
Content-Length: 295

* upload completely sent off: 295 out of 295 bytes
* SSL read: errno -5961 (PR_CONNECT_RESET_ERROR)
* TCP connection reset by peer
* Closing connection 0

@ppaulis
Author

ppaulis commented Jun 21, 2017

@kstich sorry for the late reply! Here are a few of the details you asked for:

  • PHP : PHP 7.0.15-0ubuntu0.16.04.4
  • cURL : curl 7.47.0 (x86_64-pc-linux-gnu) libcurl/7.47.0 GnuTLS/3.4.10 zlib/1.2.8 libidn/1.32 librtmp/2.3
    Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp smb smbs smtp smtps telnet tftp
    Features: AsynchDNS IDN IPv6 Largefile GSS-API Kerberos SPNEGO NTLM NTLM_WB SSL libz TLS-SRP UnixSockets
  • OpenSSL : 1.0.2g
  • Guzzle : 6.2.1
  • AWS Services : Mainly SWF, but I have also seen it on Put operations for S3, if I remember correctly

I will try to provide you with the detailed error output asap.

Thanks!

@kstich
Contributor

kstich commented Jun 27, 2017

@ppaulis @oberman Thank you both for the updates on your current state, we're continuing to try to diagnose the issue. Please keep us updated if you have any new or interesting information, we will do the same.

@gregholland

gregholland commented Nov 9, 2017

I finally upgraded from v2 to v3 of the SDK a couple of weeks ago and have been seeing lots of these errors. For me they only occur on query requests to DynamoDB using Aws\ResultPaginator.

Error executing "Query" on "https://dynamodb.us-east-1.amazonaws.com"; AWS HTTP error: cURL error 56: SSL read: error:00000000:lib(0):func(0):reason(0), errno 104 (see http://curl.haxx.se/libcurl/c/libcurl-errors.html) Unable to parse error information from response - Error parsing JSON: Control character error, possibly incorrectly encoded

I've read this entire thread but still have no idea how to go about solving it. Has anyone had any luck getting to the bottom of this? If so, care to share?

PHP: 7.1.10
OpenSSL: 1.0.1t
Curl: 7.38.0
Guzzle: 6.3.0
AWS Services: DynamoDB (Specifically Query using the ResultPaginator)

Thanks!

@gregholland

Also occurring with the following setup:
PHP: 7.2.0
OpenSSL: 1.1.0f
Curl: 7.52.1
Guzzle: 6.3.0
AWS Service: DynamoDB only using the ResultPaginator.

@kstich
Contributor

kstich commented Jan 5, 2018

Root Cause

Guzzle v5 and Guzzle v6 do not trigger an automatic retry of CURLE_RECV_ERROR, as Guzzle v3 did. The AWS SDK for PHP v2 uses Guzzle v3, whereas the AWS SDK for PHP v3 requires Guzzle v5 or v6.

But why?

At this point in the request/response lifecycle, there is no guarantee that retrying on that error code would be free of side-effects. This same thought from Guzzle applies to the SDK itself, so the error code is not retried in an SDK specific manner.

Now what?

If you are receiving this error, and believe the operation is safe to retry in your situation (no destructive or unintended side-effects, etc.), you may wish to retry the error in your own code.

Middlewares Solution

Create an additional 'retry-curl56' middleware with a custom decider for just this case and add it to the client's handler list via appendSign. You can read more about working with the SDK's middlewares here.
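
As a plainer alternative to a middleware, the same idea can be sketched by wrapping the call itself. The helper below is hypothetical and is not an SDK API: it retries a callable only when the exception message contains "cURL error 56", which assumes the message format quoted earlier in this thread. It should only wrap operations that are safe to repeat.

```php
<?php
// Hypothetical helper, not an SDK API: retry a callable while it fails with
// cURL error 56, up to $maxAttempts total attempts.
function retryOnRecvError(callable $operation, int $maxAttempts = 3, int $baseDelayMs = 100)
{
    for ($attempt = 1; ; $attempt++) {
        try {
            return $operation();
        } catch (Exception $e) {
            $isRecvError = strpos($e->getMessage(), 'cURL error 56') !== false;
            if (!$isRecvError || $attempt >= $maxAttempts) {
                throw $e; // a different failure, or out of attempts
            }
            // Exponential backoff: base, 2x base, 4x base, ...
            usleep($baseDelayMs * 1000 * (2 ** ($attempt - 1)));
        }
    }
}

// Usage with a stand-in operation that fails twice, then succeeds.
$calls = 0;
$result = retryOnRecvError(function () use (&$calls) {
    $calls++;
    if ($calls < 3) {
        throw new RuntimeException('cURL error 56: TCP connection reset by peer');
    }
    return 'ok';
}, 5, 1);
```

In real code the closure would wrap an SDK call instead of the stand-in, and a check against the cURL error number from Guzzle's handler context would likely be sturdier than matching on the message text.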

@kstich kstich removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 7 days. label Jan 5, 2018
@kstich
Contributor

kstich commented Jan 9, 2018

After some further discussion about the operations involved, we believe this error can be retried at the SDK level, even though it is not safe to do so at the Guzzle level. Support is being added for automatic retry of CURLE_RECV_ERROR via #1463.

@kstich kstich added the investigating This issue is being investigated and/or work is in progress to resolve the issue. label Jan 11, 2018
@ppaulis
Author

ppaulis commented Jan 22, 2018

Thanks guys! That was not an easy one to find :-)

@kstich
Contributor

kstich commented Jan 22, 2018

This was included in the 3.52.0 release.

@neoacevedo

OK, with aws.phar 3.54.3 I'm having this performance using it on Amazon S3.
