
Incomplete socket send with fast writer and slow reader #610

Closed
aseering opened this issue Jul 1, 2016 · 30 comments

@aseering
Contributor

aseering commented Jul 1, 2016

This might be related to #493 ; I'm not sure. It certainly feels very similar. Reproduced under build 14376.

Open up two WSL terminals. In the first, run the following two commands:

dd if=/dev/zero of=big.dat bs=3000000 count=1
python -m SimpleHTTPServer

Then, in the second, run the following:

curl http://localhost:8000/big.dat

Expected:

  • curl will successfully transfer the file

Actual:

  • curl emits the following error:
curl: (18) transfer closed with 509705 bytes remaining to read

The exact number of bytes remaining varies, though for me it's consistently in the vicinity of 500 KB.
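For anyone who wants a single-file version of the repro, the fast-writer/slow-reader scenario above can be distilled into one self-contained script (my own sketch, not part of the original report; the payload size mirrors the 3 MB big.dat). On a well-behaved TCP stack, every byte still arrives even though the server closes immediately after sending; on the affected builds the tail of the transfer was lost.

```python
import socket
import threading
import time

PAYLOAD = b"\x00" * 3_000_000  # mirrors the 3 MB big.dat from the repro


def server(listener):
    conn, _ = listener.accept()
    conn.sendall(PAYLOAD)
    conn.close()  # close while data is likely still in kernel buffers


listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
threading.Thread(target=server, args=(listener,), daemon=True).start()

client = socket.create_connection(listener.getsockname())
received = 0
while True:
    chunk = client.recv(16384)
    if not chunk:
        break
    received += len(chunk)
    time.sleep(0.0001)  # deliberately slow reader
client.close()
print(received)  # → 3000000
```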

@aseering
Contributor Author

aseering commented Jul 1, 2016

This issue does not reproduce for me if I run curl.exe in Windows.

It also doesn't reproduce for me if I run python -m SimpleHTTPServer in Windows, and curl in WSL.

It also doesn't reproduce for me if I redirect curl's stdout; for example, curl http://localhost:8000/big.dat | wc -c.

@aseering
Contributor Author

aseering commented Jul 1, 2016

strace on the Python server:

...
read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
sendto(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 0, NULL, 0) = 8192
sendto(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 0, NULL, 0) = 8192
read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384
sendto(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 0, NULL, 0) = 8192
sendto(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 0, NULL, 0) = 8192
read(5, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 1728
read(5, "", 14336)                      = 0
sendto(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1728, 0, NULL, 0) = 1728
read(5, "", 16384)                      = 0
close(5)                                = 0
munmap(0x7fd0e7c30000, 4096)            = 0
shutdown(4, SHUT_WR)                    = 0
close(4)                                = 0
select(4, [3], [], [], {0, 500000})     = 0 (Timeout)
select(4, [3], [], [], {0, 500000})     = 0 (Timeout)
select(4, [3], [], [], {0, 500000})     = 0 (Timeout)
select(4, [3], [], [], {0, 500000})     = 0 (Timeout)
...

strace on the curl client:

clock_gettime(CLOCK_MONOTONIC, {94087, 668388000}) = 0
poll([{fd=3, events=POLLIN}], 1, 1000)  = 1 ([{fd=3, revents=POLLIN}])
clock_gettime(CLOCK_MONOTONIC, {94087, 668724000}) = 0
poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 0, NULL, NULL) = 16384
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 12288) = 12288
clock_gettime(CLOCK_MONOTONIC, {94087, 699670000}) = 0
clock_gettime(CLOCK_MONOTONIC, {94087, 700584000}) = 0
clock_gettime(CLOCK_MONOTONIC, {94087, 701067000}) = 0
clock_gettime(CLOCK_MONOTONIC, {94087, 701318000}) = 0
clock_gettime(CLOCK_MONOTONIC, {94087, 701434000}) = 0
poll([{fd=3, events=POLLIN}], 1, 1000)  = 1 ([{fd=3, revents=POLLIN}])
clock_gettime(CLOCK_MONOTONIC, {94087, 701848000}) = 0
poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384, 0, NULL, NULL) = 16384
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 12288) = 12288
clock_gettime(CLOCK_MONOTONIC, {94087, 732999000}) = 0
clock_gettime(CLOCK_MONOTONIC, {94087, 733974000}) = 0
clock_gettime(CLOCK_MONOTONIC, {94087, 734506000}) = 0
clock_gettime(CLOCK_MONOTONIC, {94087, 734785000}) = 0
clock_gettime(CLOCK_MONOTONIC, {94087, 734913000}) = 0
poll([{fd=3, events=POLLIN}], 1, 1000)  = 1 ([{fd=3, revents=POLLIN}])
clock_gettime(CLOCK_MONOTONIC, {94087, 735264000}) = 0
poll([{fd=3, events=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND}], 1, 0) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "", 16384, 0, NULL, NULL)   = 0
clock_gettime(CLOCK_MONOTONIC, {94087, 735804000}) = 0
clock_gettime(CLOCK_MONOTONIC, {94087, 735991000}) = 0
clock_gettime(CLOCK_MONOTONIC, {94087, 736374000}) = 0
time([1467350132])                      = 1467350132
close(3)                                = 0
rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_RESTART, 0x7fc270560330}, NULL, 8) = 0
write(2, "c", 1)                        = 1
write(2, "u", 1)                        = 1
write(2, "r", 1)                        = 1
write(2, "l", 1)                        = 1
write(2, ":", 1)                        = 1

I'm mildly perturbed that curl is attempting to push its error message through the kernel one byte at a time. But that's presumably after the actual failure has occurred.

@sunilmut
Member

sunilmut commented Jul 8, 2016

@aseering - Thanks for reporting the issue and the repro steps. I was able to reproduce this issue and I have an understanding of the bug.

@xBytez

xBytez commented Oct 19, 2016

I'm not sure if the issue I'm experiencing is related, but I think it is.

I get constant net::ERR_CONTENT_LENGTH_MISMATCH errors from Chrome when I visit my uWSGI HTTP server. When using Flask as the web server, I get frequent net::ERR_CONNECTION_RESET errors (the Flask stack trace shows some Broken Pipe errors, which I'll attach soon; I don't always get them). The same app works fine in a virtual machine with both uWSGI and Flask.

If you need any more information, please do ask.

Edit: Client Chrome is not running from WSL.

Edit 2: I'm being silly. Let me include some information.
Windows build: Version 1607 (OS Build 14946)
Ubuntu version: 16.04 Xenial (apt-get dist-upgrade from 14.04)
uWSGI version: 2.0.3
uWSGI cmdline args: "--http 127.0.0.1:5000"
Flask version: 0.10.1

Edit 3: (Running the Flask webserver here) When trying to access my app from Firefox running on WSL itself, the file (JS in my case) just doesn't fully transfer, causing the script to fail. So it doesn't matter whether the client is on Windows or WSL.

When trying to fetch the file with curl (CURL v7.47.0 on WSL) I get

(18) transfer closed with 524925 bytes remaining to read
xbytez@i5-2500K:~$

When trying to fetch the same on Windows (CURL v7.49.1 on Windows) I get

curl: (56) Recv failure: Connection was reset
C:\Users\xBytez>

Edit 4:
The promised stacktrace

Exception happened during processing of request from ('127.0.0.1', 19431)
Traceback (most recent call last):
  File "/usr/lib/python2.7/SocketServer.py", line 290, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 318, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 331, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 654, in __init__
    self.finish()
  File "/usr/lib/python2.7/SocketServer.py", line 713, in finish
    self.wfile.close()
  File "/usr/lib/python2.7/socket.py", line 283, in close
    self.flush()
  File "/usr/lib/python2.7/socket.py", line 307, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
error: [Errno 32] Broken pipe

@aseering
Contributor Author

@sunilmut -- you mentioned a little while back that you have a repro for this; just wondering where it is on your priority list?

There's this ticket and there's #616; if you keep tracing, there are about a dozen tickets that refer to those two. It would be great to hear from y'all about your plans for this collection of issues.

@aseering
Contributor Author

FYI -- this still reproduces on build 14955 and Ubuntu 16.04.

Interestingly, during this download, CPU usage on the machine goes up a lot. But that usage doesn't show up in top running in WSL; it's not assigned to either Python or cURL. If I open up Task Manager, I see Console Window Host pegging a CPU core. Neither application is writing anything to stdout when this increased CPU utilization occurs.

@sunilmut
Member

@aseering - Thanks for reviving this thread. I was out for a couple of weeks, hence the delay in responding. I will take this up immediately, since I agree that this could be manifesting in many different forms. I will post back once I have something.

@xBytez

xBytez commented Oct 31, 2016

Thank you so much for the update. Once this is fixed I can actually move my dev environment fully to WSL! Thanks to you too @aseering for your continuous support for WSL; you're everywhere ;).

@sunilmut
Member

sunilmut commented Nov 1, 2016

@aseering - So, I have root-caused the problem. The way the repro is set up, the server (python) side calls shutdown and closes the socket right after sending the last byte. Watching the data transfer under the debugger, whenever python closes the socket there is invariably still pending data to be sent. When there is pending data in the socket (with linger disabled), WSL closes the socket in the background: it waits for the last byte to be sent or for a timeout, whichever comes first, and then force-closes the socket. In this case we never hit the timeout, and WSL force-closes the socket immediately after sending the last byte. Force-closing causes an abortive disconnect, which discards any data that has not yet been received by the peer (i.e., curl in this case).
Now, when I look at the socket man page, the line that grabs my attention is "When the socket is closed as part of exit(2), it always lingers in the background." And it makes me wonder: how long does the socket linger in the background (perhaps trying to do a graceful disconnect) after sending the last byte, before forcefully closing the connection?
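For illustration of the abortive disconnect described above (my own sketch, not sunilmut's code): an application can force the same behavior deliberately by closing with SO_LINGER set to {onoff=1, linger=0}, which makes close() send an RST instead of a FIN and discards anything the peer has not yet read.

```python
import socket
import struct
import threading
import time

PAYLOAD = b"x" * 65536


def server(listener):
    conn, _ = listener.accept()
    conn.sendall(PAYLOAD)
    # SO_LINGER {onoff=1, linger=0}: close() aborts the connection with an
    # RST rather than performing the graceful FIN handshake.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    conn.close()


listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
t = threading.Thread(target=server, args=(listener,))
t.start()

client = socket.create_connection(listener.getsockname())
time.sleep(0.2)  # let the server close before we start reading (slow reader)
received = 0
reset = False
try:
    while True:
        chunk = client.recv(4096)
        if not chunk:
            break
        received += len(chunk)
except ConnectionResetError:
    reset = True  # the RST typically surfaces as a reset, data discarded
t.join()
client.close()
listener.close()
```

On Linux the slow reader usually sees ConnectionResetError with none (or only part) of the payload delivered, which is exactly the "data loss on close" symptom in this issue.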

@sunilmut sunilmut self-assigned this Nov 1, 2016
@aseering
Contributor Author

aseering commented Nov 2, 2016

@sunilmut -- hm, good question... As a programmer, my instinctive expectation is that there would be no simple timeout. I can't immediately find anything online that contradicts that expectation.

I would expect applications that care about such things to manage them explicitly using SO_LINGER in order to have more-granular control over the socket's exact behavior. In that case, the kernel is given an explicit timeout and should follow it.

If the client dies or otherwise goes unresponsive while the server is trying to send data, then the connection will naturally hit TCP's built-in retry-count limit and get torn down. That's true regardless of whether the server is still trying to send, but, given that the kernel buffer is of bounded size, it does guarantee that the socket will close in bounded (though very large in the worst case) time.

Basically, TCP itself guarantees that the send must still be making progress. If it's making progress, why kill it? If the answer is "because I'm worried about running out of kernel memory or ephemeral ports", then your app is dealing with high load and should be using features like SO_LINGER to carefully manage its resource usage. If it's not -- c'mon, that's like two lines of code; just go fix it :-) (Ah, the advantages of a mostly-open-source ecosystem on the design of the underlying platform...)
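A minimal sketch of the explicit SO_LINGER control described above (the helper name close_with_deadline is my own): with {onoff=1, linger=N}, close() blocks for up to N seconds while the kernel tries to deliver any unsent data, giving the application a bounded close instead of an open-ended background linger.

```python
import socket
import struct


def close_with_deadline(sock: socket.socket, seconds: int) -> None:
    """Close with SO_LINGER {onoff=1, linger=seconds}: close() waits up to
    `seconds` for pending data to be delivered before giving up."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, seconds))
    sock.close()


# Read the option back to show the kernel accepted it (struct linger is
# two C ints: l_onoff, l_linger).
s = socket.socket()
s.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 5))
onoff, linger = struct.unpack(
    "ii", s.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, 8))
s.close()
print(onoff, linger)  # → 1 5
```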

@sunilmut
Member

sunilmut commented Nov 2, 2016

@aseering

Basically, TCP itself guarantees that the send must still be making progress. If it's making progress, why kill it?

The primary motivation for our design decisions is compatibility.

As a programmer, my instinctive expectation is that there would be no simple timeout.

Do you have anything (a link) that supports this? My searches are coming up empty-handed. Maybe that is what you also mean by "I can't immediately find anything online that contradicts that expectation." :)

@aseering
Contributor Author

aseering commented Nov 2, 2016

Yeah, I can't find any documentation regarding a timeout either (unless SO_LINGER is used). I'm basically just trying to justify an educated guess.

@therealkenc
Collaborator

@sunilmut This blog post has some related information regarding SO_LINGER behavior. You're probably already all over this, but I thought it worth the heads up anyway. Cross platform linger has been a bane for ages.

@aseering
Contributor Author

Regularly-scheduled ping: This still reproduces for me on build 14971. Though, interestingly, curl hangs for several seconds before reporting that the transfer has closed -- I thought I remembered it erroring out immediately before.

Also, a few more people appear to have hit this issue, and have filed new tickets. (Relevant tickets are linked above. Mentioning them in case anyone is subscribed to just this ticket, as I think github won't then send you a notification e-mail.)

@davidmaxwaterman

Today I have been investigating the possibility of switching from Ubuntu to Windows (since I pretty much only use the command line anyway, apart from gvim). Unfortunately, I think I'm seeing this too:

^C^C----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 57104)
Traceback (most recent call last):
  File "/usr/lib/python2.7/SocketServer.py", line 295, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 321, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python2.7/SocketServer.py", line 651, in __init__
    self.finish()
  File "/usr/lib/python2.7/SocketServer.py", line 705, in finish
    self.wfile.flush()
  File "/usr/lib/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
KeyboardInterrupt
----------------------------------------

Python 2.7.6. I'm not sure how to find the versions of the Microsoft s/w, but I installed it today.

@therealkenc
Collaborator

So what was the answer to your pretty astute question: "how long does the socket linger in the background (maybe by trying to do a graceful disconnect) after sending the last byte, before forcefully closing the connection?"

@sunilmut
Member

sunilmut commented Dec 4, 2016

@therealkenc - For maximum reliability guarantees from the TCP stack, a TCP socket should not be closed until both ends have fully disconnected (i.e., until the full termination handshake has completed). To prevent server-side sockets from hanging forever (due to misbehaving clients that never close their end), some systems build in a timeout. When the user-mode socket is closed, a disconnect request (a FIN, in TCP terms) is issued. When the FIN is ACK'ed, the TCP socket enters the FIN_WAIT_2 state. After waiting tcp_fin_timeout (http://man7.org/linux/man-pages/man7/tcp.7.html) seconds in FIN_WAIT_2, the socket is forcibly closed. In WSL, we were not completely following the TCP guidelines, which led to some data loss. The Windows networking team helped us figure it out.
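The "don't close until both ends have disconnected" guideline can be sketched in application code as follows (my own example, assuming the application protocol lets the peer close its end once it has read everything): send our FIN with shutdown(SHUT_WR), drain until the peer's FIN arrives as EOF, and only then close.

```python
import socket
import threading


def graceful_close(sock: socket.socket) -> None:
    sock.shutdown(socket.SHUT_WR)  # our FIN: "no more data from me"
    while sock.recv(4096):         # drain until the peer's FIN (EOF)
        pass
    sock.close()                   # handshake complete; safe to close


# Demo over a socketpair: the peer reads to EOF and then closes its end.
a, b = socket.socketpair()


def peer():
    while b.recv(4096):
        pass
    b.close()


t = threading.Thread(target=peer)
t.start()
a.sendall(b"last few bytes")
graceful_close(a)  # returns only after the peer has read everything
t.join()
```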

@xBytez

xBytez commented Dec 8, 2016

Is this fix included in Insider build 14986? Just wondering so I know whether I can upgrade and test it.

@sunilmut
Member

sunilmut commented Dec 8, 2016

@xBytez Unfortunately, no. The fix got checked into the dev branch 4-5 days ago, and the release branch runs behind the dev branch by 2-3 weeks on average (not a hard and fast rule, just from my experience) - just to give you an idea of the timeline. I am also eager to get this validated.

@xBytez

xBytez commented Dec 8, 2016

@sunilmut Thanks for the information. Was just curious. I also found the WSL Release notes on the MSDN blog so I'll just have a look at that ;). Thanks again!

@krzysztofantczak

krzysztofantczak commented Dec 30, 2016

It seems that 14986 has made this problem occur more often. I'm using gulp's HTTP server, so the issue is most visible in the browser when loading assets (larger CSS or JS files). Before 14986 I could refresh the browser a few times until the asset downloads stopped being interrupted, but now I can't even do that; the issue is more persistent. Any idea of the new build release date (+/-)?

@sunilmut
Member

Any ideas on new build release date (+/-)?

The fix has made it to the release branch. Whether it ships next depends on which build they snap for the Insider release (which is not in the WSL team's control). If it's not in the upcoming Insider build, it should definitely be in the one after that. I don't have any more insight into the timelines/dates. Thanks for your patience.

@sunilmut
Member

sunilmut commented Jan 9, 2017

This should be fixed in build 15002. Please validate and let us know if the issue persists. Thanks for your patience, @ALL, and your persistence, @aseering.

@krzysztofantczak

As of now (after an hour of playing with things following this morning's upgrade), it works perfectly ;-) Thank you!

@xBytez

xBytez commented Jan 11, 2017

Same here! I was able to run my Flask server just fine. Thank you!

@sunilmut
Member

@krzysztofantczak & @xBytez - Really glad to hear that things are working so far. Will keep the issue open for some time before closing it out.

@aseering
Contributor Author

This is also working great for me. Thanks!

@sunilmut
Member

Glad to know. Thanks @aseering
