Retry on zero bytes written #149

WyriHaximus · 2020-04-28T21:30:29Z

Spent a few hours helping @mpociot trace down a bug with his expose project. Now the bug he ran into boils down to a fwrite reporting 0 bytes written despite reporting itself as writable. Normally we would consider this stream closed at this point, but in some rare situations fwrite will return 0 because the stream isn't writable but is still open. Locally I couldn't reproduce it so I was kinda pulling my hair out. So to test some ideas I prepared a file with several possible fixes that could be the issue, included some debug output and send it to @mpociot.

The red encircled output is the retry kicking in after it tried to write but it didn't.

kelunik · 2020-04-29T04:52:33Z

See also https://github.com/amphp/byte-stream/blob/b867505edb79dda8f253ca3c3a2bbadae4b16592/lib/ResourceOutputStream.php#L102

We've had that a long time already, and 3 seems to be a good value. I don't think it needs to be customizable.

WyriHaximus · 2020-04-29T10:04:54Z

@kelunik where do you think I had the inspiration from 😉

clue · 2020-04-29T10:51:16Z

This is an interesting one, thanks for filing! When exactly does this problem show up, i.e. how can we reproduce this and be sure the suggested fix works? This hasn't shown up before, is this platform-specific behavior or does this only happen for certain streams (SSL/TLS wrappers, process pipes etc.)?

I understand the underlying write call may return a temporary EAGAIN/EWOULDBLOCK error and retrying sounds reasonable, but why limit this to a specific number of retries? I understand PHP doesn't seem to always expose the underlying error code, but I'd rather not introduce a possible race condition here. Perhaps we can try to use the error message which includes errno=x in this case?

mpociot · 2020-04-29T10:55:48Z

Reproducing this issue seems to be quite hard (for others at least).
The error only occurred to me when I was trying to connect to a WebSocket connection via HTTPS (proxied through nginx).
When connected directly to the proxied port, it worked fine. So I suppose that the SSL connection had something to do with that.

Then, sending data through that connection made it close at some point. That was no longer the issue with this fix in place.

WyriHaximus · 2020-04-29T11:14:13Z

This is an interesting one, thanks for filing! When exactly does this problem show up, i.e. how can we reproduce this and be sure the suggested fix works? This hasn't shown up before, is this platform-specific behavior or does this only happen for certain streams (SSL/TLS wrappers, process pipes etc.)?

I couldn't reproduce this. Literally had to give @mpociot a file to test and confirm it would fix this specific problem.

why limit this to a specific number of retries?

Because if we don't, there will be endless retries. While testing @mpociot would restart the server giving me a broken connection, so it hits the 0 written 13 times and then gives up.

clue · 2020-04-29T11:38:49Z

@mpociot Interesting, seems to confirm my suspicion this could be related to TLS streams. I assume this only happens when you print "a lot" of data, i.e. buffers start to become full(er)? Can you post some details about your platform? PHP version + extensions, OS? How do you reproduce this on your platform, does this happen 100% of the time?

Can you add some logging for the $errstr (a simple echo "[$errstr]\n" is fine) a few lines above this changeset and see what's being reported?

mpociot · 2020-04-29T11:54:44Z

I can reproduce this 100% of the time on my machine, yes.

My setup:
macOS 10.15.3
PHP 7.4.5 (cli) (built: Apr 23 2020 02:25:56) ( NTS )

Installed modules:

[PHP Modules]
bcmath
blackfire
bz2
calendar
Core
ctype
curl
date
dba
dom
exif
FFI
fileinfo
filter
ftp
gd
gettext
gmp
hash
iconv
imagick
intl
json
ldap
libxml
mbstring
mysqli
mysqlnd
odbc
openssl
pcntl
pcre
PDO
pdo_dblib
pdo_mysql
PDO_ODBC
pdo_pgsql
pdo_sqlite
pgsql
Phar
phpdbg_webhelper
posix
pspell
readline
Reflection
session
shmop
SimpleXML
soap
sockets
sodium
SPL
sqlite3
standard
sysvmsg
sysvsem
sysvshm
tidy
tokenizer
xml
xmlreader
xmlrpc
xmlwriter
xsl
Zend OPcache
zip
zlib

[Zend Modules]
Zend OPcache
blackfire

To reproduce the issue for me, it's enough to run this:

<?php
// composer require ratchet/pawl

require __DIR__ . '/vendor/autoload.php';

\Ratchet\Client\connect('wss://expose.dev/expose/control', [], [
    'X-Expose-Control' => 'enabled'
])->then(function($conn) {
    $conn->send(str_repeat('a', 1024*150));
});

This (without the fix) closes the connection 100% of the time.

mpociot · 2020-04-29T11:59:05Z

@clue the error handler does not get called, so nothing gets echoed for me

WyriHaximus · 2020-04-29T12:02:36Z

@mpociot can you disable blackfire and try without the fix?

mpociot · 2020-04-29T12:04:37Z

@WyriHaximus that fails as well

WyriHaximus · 2020-04-29T12:06:33Z

@mpociot ok good so it's not interfering 👍

kelunik · 2020-04-29T17:08:40Z

I think you need fragmented TLS records to reproduce it.

WyriHaximus · 2020-04-29T17:18:29Z

@kelunik that would make sense, since the record isn't fully in we can't read from the stream until the rest is in correct?

kelunik · 2020-04-29T18:28:34Z

Yes, except that writes fail, not reads.

But with TLS any IO operations might require reads / writes, see WANT_WRITE and WANT_READ. We wanted to expose the last openssl error, but haven't done so, yet.

clue · 2020-04-30T11:07:30Z

To reproduce the issue for me, it's enough to run this:
<?php
// composer require ratchet/pawl

require __DIR__ . '/vendor/autoload.php';

\Ratchet\Client\connect('wss://expose.dev/expose/control', [], [
    'X-Expose-Control' => 'enabled'
])->then(function($conn) {
    $conn->send(str_repeat('a', 1024*150));
});
This (without the fix) closes the connection 100% of the time.

Thanks for providing this reproduction script, this helped identify the underlying cause and I've just filed #150 as a more permanent fix without having to rely on retry behavior.

Again thanks to @WyriHaximus and @mpociot for reporting and filing an initial work-around in here. I've had a chat earlier today with @mpociot to verify my assumption and we've been able to reproduce and fix this some minimal changes to the existing test suite and code base. I have intentionally pushed this as multiple commits to highlight this exact issue in #150.

Retry on zero bytes written

2bd3b36

WyriHaximus added the bug label Apr 28, 2020

WyriHaximus added this to the v1.1.1 milestone Apr 28, 2020

WyriHaximus requested review from jsor and clue April 28, 2020 21:30

clue mentioned this pull request Apr 30, 2020

Fix faulty write buffer behavior when sending large data chunks over TLS (Mac OS X only) #150

Merged

clue closed this in #150 May 2, 2020

clue removed this from the v1.1.1 milestone May 2, 2020

clue mentioned this pull request Apr 14, 2022

Ratchet on 100% CPU load ratchetphp/Ratchet#939

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry on zero bytes written #149

Retry on zero bytes written #149

WyriHaximus commented Apr 28, 2020

kelunik commented Apr 29, 2020

WyriHaximus commented Apr 29, 2020

clue commented Apr 29, 2020

mpociot commented Apr 29, 2020

WyriHaximus commented Apr 29, 2020

clue commented Apr 29, 2020

mpociot commented Apr 29, 2020

mpociot commented Apr 29, 2020

WyriHaximus commented Apr 29, 2020

mpociot commented Apr 29, 2020

WyriHaximus commented Apr 29, 2020

kelunik commented Apr 29, 2020

WyriHaximus commented Apr 29, 2020

kelunik commented Apr 29, 2020

clue commented Apr 30, 2020

Retry on zero bytes written #149

Retry on zero bytes written #149

Conversation

WyriHaximus commented Apr 28, 2020

kelunik commented Apr 29, 2020

WyriHaximus commented Apr 29, 2020

clue commented Apr 29, 2020

mpociot commented Apr 29, 2020

WyriHaximus commented Apr 29, 2020

clue commented Apr 29, 2020

mpociot commented Apr 29, 2020

mpociot commented Apr 29, 2020

WyriHaximus commented Apr 29, 2020

mpociot commented Apr 29, 2020

WyriHaximus commented Apr 29, 2020

kelunik commented Apr 29, 2020

WyriHaximus commented Apr 29, 2020

kelunik commented Apr 29, 2020

clue commented Apr 30, 2020