Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent crash/hang with TLS on Windows under poor network conditions #191

Closed
enolan opened this issue Apr 21, 2016 · 21 comments
Closed

Comments

@enolan
Copy link

enolan commented Apr 21, 2016

I can get a simple http-conduit program to crash or hang fairly often when the connection has occasional data corruption. I've also observed the hang when only simulating packet loss. Here's a repo with my test case. The program is tiny:

module Main where
import Network.HTTP.Conduit
import qualified Data.ByteString.Lazy.Char8 as LC8

main :: IO ()
main = do
    man <- newManager tlsManagerSettings
    req <- parseUrl "https://s3.amazonaws.com/hackage.fpcomplete.com/package/lens-4.12.3.tar.gz"
    response <- httpLbs req man
    LC8.putStrLn $ responseBody response

Using Clumsy in tamper mode set to 20%, it'll crash or hang maybe half the time.

When I instrument API calls with API Monitor I can see that the crashes happen immediately after a call to recv() fails. recv() returns SOCKET_ERROR which is a constant equal to -1. (When it's successful, recv() returns the number of bytes received.) Then memcpy() is called with the length set to -1. This causes the Windows equivalent of a segfault. Somewhere, the return value of recv() isn't getting checked.

The same thing with a cleartext HTTP connection is no problem. I also set up a simple cleartext HTTP server on a Linux machine and injected RSTs with tcpkill. This doesn't make it crash either.

I found this in the course of debugging commercialhaskell/stack#1689 (in the course of getting Idris building on AppVeyor). Stack hasn't ever crashed on me, but it seems likely the hangs and crashes in the simple http-conduit program are related to the hangs in Stack.

To reproduce:

  1. Clone the testcase repo
  2. stack build
  3. Download Clumsy
  4. Set Clumsy parameters to something like this and hit start.
  5. stack exec bad-memcpy-crash

You may have to run it several times. When it crashes you should see the Windows crash dialog and if you have Visual Studio installed, an offer to open a debugger.

Let me know if there's more I can do to figure this out.

@snoyberg
Copy link
Owner

I recently heard about haskell-tls/hs-tls#124
causing trouble. Can you try upgrading to the latest tls package (by using
a nightly snapshot) and see if that fixes things?

On Thu, Apr 21, 2016, 7:11 AM Echo Nolan [email protected] wrote:

I can get a simple http-conduit program to crash or hang fairly often when
the connection has occasional data corruption. I've also observed the hang
when only simulating packet loss. Here's a repo with my test case
https://github.com/enolan/http-conduit-crash-hang. The program is tiny:

module Main whereimport Network.HTTP.Conduitimport qualified Data.ByteString.Lazy.Char8 as LC8
main :: IO ()
main = do
man <- newManager tlsManagerSettings
req <- parseUrl "https://s3.amazonaws.com/hackage.fpcomplete.com/package/lens-4.12.3.tar.gz"
response <- httpLbs req man
LC8.putStrLn $ responseBody response

Using Clumsy https://jagt.github.io/clumsy/ in tamper mode set to 20%,
it'll crash or hang maybe half the time.

When I instrument API calls with API Monitor
http://www.rohitab.com/apimonitor I can see that the crashes happen
immediately after a call to recv() fails http://i.imgur.com/rgIhYCH.png.
recv() returns SOCKET_ERROR which is a constant equal to -1. (When it's
successful, recv() returns the number of bytes received.) Then memcpy()
is called with the length set to -1. This causes the Windows equivalent of
a segfault. Somewhere, the return value of recv() isn't getting checked.

The same thing with a cleartext HTTP connection is no problem. I also set
up a simple cleartext HTTP server on a Linux machine and injected RSTs with
tcpkill. This doesn't make it crash either.

I found this in the course of debugging commercialhaskell/stack#1689
commercialhaskell/stack#1689 (in the course
of getting Idris building on AppVeyor). Stack hasn't ever crashed on me,
but it seems likely the hangs and crashes in the simple http-conduit
program are related to the hangs in Stack.

To reproduce:

  1. Clone the testcase repo
    https://github.com/enolan/http-conduit-crash-hang
  2. stack build
  3. Download Clumsy https://jagt.github.io/clumsy/
  4. Set Clumsy parameters to something like this
    http://i.imgur.com/whWZ956.png and hit start.
  5. stack exec bad-memcpy-crash

You may have to run it several times. When it crashes you should see the
Windows crash dialog and if you have Visual Studio installed, an offer to
open a debugger.

Let me know if there's more I can do to figure this out.


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#191

@enolan
Copy link
Author

enolan commented Apr 21, 2016

Still happens on nightly-2016-04-20 with tls-1.3.5 :(

@snoyberg
Copy link
Owner

I don't have ready access to a Windows machine to try this on. Can you test if this kind of problem occurs at all when using the connection package directly?

Pinging @vincenthz for any other thoughts on narrowing down the problem.

@vincenthz
Copy link
Contributor

it does smells like a network bug, more than anything linked to tls. I haven't actually really look at it, but to the best of my knowledge, I'm never setting any network buffer anywhere; they are all the result of some network call.

@snoyberg
Copy link
Owner

@enolan Can you test if you can reproduce this at all with an insecure connection?

@enolan
Copy link
Author

enolan commented Apr 23, 2016

I tried pretty hard and couldn't. Debug printfs indicate the crash happens in hGetBuf. I'm trying to narrow it down further.

@enolan
Copy link
Author

enolan commented May 3, 2016

GHC ticket

@snoyberg
Copy link
Owner

snoyberg commented May 4, 2016

Wow, good catch. Is there any potential workaround outside of GHC?

@enolan
Copy link
Author

enolan commented May 4, 2016

It only happens when sockets are converted to handles, so if tls used
network's API rather than hGet / hPut there wouldn't be a problem.

On Tue, May 3, 2016, 8:20 PM Michael Snoyman [email protected]
wrote:

Wow, good catch. Is there any potential workaround outside of GHC?


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#191 (comment)

@snoyberg
Copy link
Owner

snoyberg commented May 4, 2016

I don't know the details of how this works internally to tls/connection. @vincenthz is there any reasonable way to move away from Handles here? Would you consider a PR from someone to make such a change?

@vincenthz
Copy link
Contributor

Good catch @enolan !

I've pushed support to connection and socks (tls was already able to handle both), I would appreciate some testing before I push a new version if possible

@snoyberg
Copy link
Owner

snoyberg commented May 5, 2016

Are there any changes that need to be made on the http-client-tls side to take advantage of this support, or just recompiling against master?

@vincenthz
Copy link
Contributor

it should just work as is

@snoyberg
Copy link
Owner

snoyberg commented May 5, 2016

Awesome, thank you :)

@enolan
Copy link
Author

enolan commented May 8, 2016

@vincenthz Sorry! I read network's code wrong. It imports send and recv itself, with the right types for Windows, but it uses those imports on everything but Windows. Switching to using network's API directly won't actually help here. My bad.

@vincenthz
Copy link
Contributor

:(

could network detect the 4294967295 (wrapped -1) value on windows 64 bits as meaning error like -1 ?

@enolan
Copy link
Author

enolan commented May 8, 2016

The crash happens before control returns from base unfortunately.

On Sun, May 8, 2016, 11:55 AM Vincent Hanquez [email protected]
wrote:

:(

could network detect the 4294967295 (wrapped -1) value on windows 64 bits
as meaning error like -1 ?


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#191 (comment)

@vincenthz
Copy link
Contributor

vincenthz commented May 8, 2016

darn :\ . Could probably have network call a different foreign import maybe, one that returns the right thing, because fixing base is not going to fix any previous version, and it could take a really long time ..

ghc-mirror pushed a commit to ghc/ghc that referenced this issue May 19, 2016
Summary:
They return signed 32 bit ints on Windows, even on a 64 bit OS, rather than
Linux's 64 bit ssize_t. This means when recv() returned -1 to signal an error we
thought it was 4294967295. It was converted to an int, -1 and the buffer was
memcpy'd which caused a segfault. Other bad stuff happened with send()s.

See also note CSsize in System.Posix.Internals.

Add a test for #12010

Test Plan:
- GHC testsuite (T12010)
- http-conduit test (snoyberg/http-client#191)

Reviewers: austin, hvr, bgamari, Phyx

Reviewed By: Phyx

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2170

GHC Trac Issues: #12010
@enolan
Copy link
Author

enolan commented Jun 2, 2016

This is the rabbit hole with no bottom. It's actually fixable in network, and I did, but now I get a different crash. haskell/network#192 is the PR. If either of you could look at it and see what I'm missing that'd be great.

@vincenthz Your change to tls to use Network.Socket over Handles actually wasn't in vain. The network workaround only works if you're using network.

@enolan
Copy link
Author

enolan commented Aug 4, 2016

This is fixed upstream in network-2.6.3.1 and asn1-encoding-0.9.4.

@enolan enolan closed this as completed Aug 4, 2016
@snoyberg
Copy link
Owner

snoyberg commented Aug 4, 2016

Awesome, thank you!

On Thu, Aug 4, 2016, 6:51 AM Echo Nolan [email protected] wrote:

This is fixed upstream in network-2.6.3.1 and asn1-encoding-0.9.4.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#191 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AADBB-oKQP-8e-CHy3_KDK-sVm3GNjimks5qcWGjgaJpZM4IMU3S
.

bgamari pushed a commit to bgamari/ghc that referenced this issue Sep 5, 2016
Summary:
They return signed 32 bit ints on Windows, even on a 64 bit OS, rather than
Linux's 64 bit ssize_t. This means when recv() returned -1 to signal an error we
thought it was 4294967295. It was converted to an int, -1 and the buffer was
memcpy'd which caused a segfault. Other bad stuff happened with send()s.

See also note CSsize in System.Posix.Internals.

Add a test for #12010

Test Plan:
- GHC testsuite (T12010)
- http-conduit test (snoyberg/http-client#191)

Reviewers: austin, hvr, bgamari, Phyx

Reviewed By: Phyx

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2170

GHC Trac Issues: #12010

(cherry picked from commit 1ee47c1)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants