Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry on zero bytes written #149

Closed

Conversation

WyriHaximus
Copy link
Member

Spent a few hours helping @mpociot trace down a bug with his expose project. Now the bug he ran into boils down to a fwrite reporting 0 bytes written despite reporting itself as writable. Normally we would consider this stream closed at this point, but in some rare situations fwrite will return 0 because the stream isn't writable but is still open. Locally I couldn't reproduce it so I was kinda pulling my hair out. So to test some ideas I prepared a file with several possible fixes that could be the issue, included some debug output and send it to @mpociot.
image

The red encircled output is the retry kicking in after it tried to write but it didn't.

@WyriHaximus WyriHaximus added this to the v1.1.1 milestone Apr 28, 2020
@WyriHaximus WyriHaximus requested review from jsor and clue April 28, 2020 21:30
@kelunik
Copy link

kelunik commented Apr 29, 2020

See also https://github.com/amphp/byte-stream/blob/b867505edb79dda8f253ca3c3a2bbadae4b16592/lib/ResourceOutputStream.php#L102

We've had that a long time already, and 3 seems to be a good value. I don't think it needs to be customizable.

@WyriHaximus
Copy link
Member Author

@kelunik where do you think I had the inspiration from 😉

@clue
Copy link
Member

clue commented Apr 29, 2020

This is an interesting one, thanks for filing! When exactly does this problem show up, i.e. how can we reproduce this and be sure the suggested fix works? This hasn't shown up before, is this platform-specific behavior or does this only happen for certain streams (SSL/TLS wrappers, process pipes etc.)?

I understand the underlying write call may return a temporary EAGAIN/EWOULDBLOCK error and retrying sounds reasonable, but why limit this to a specific number of retries? I understand PHP doesn't seem to always expose the underlying error code, but I'd rather not introduce a possible race condition here. Perhaps we can try to use the error message which includes errno=x in this case?

@mpociot
Copy link

mpociot commented Apr 29, 2020

Reproducing this issue seems to be quite hard (for others at least).
The error only occurred to me when I was trying to connect to a WebSocket connection via HTTPS (proxied through nginx).
When connected directly to the proxied port, it worked fine. So I suppose that the SSL connection had something to do with that.

Then, sending data through that connection made it close at some point. That was no longer the issue with this fix in place.

@WyriHaximus
Copy link
Member Author

This is an interesting one, thanks for filing! When exactly does this problem show up, i.e. how can we reproduce this and be sure the suggested fix works? This hasn't shown up before, is this platform-specific behavior or does this only happen for certain streams (SSL/TLS wrappers, process pipes etc.)?

I couldn't reproduce this. Literally had to give @mpociot a file to test and confirm it would fix this specific problem.

why limit this to a specific number of retries?

Because if we don't, there will be endless retries. While testing @mpociot would restart the server giving me a broken connection, so it hits the 0 written 13 times and then gives up.

@clue
Copy link
Member

clue commented Apr 29, 2020

@mpociot Interesting, seems to confirm my suspicion this could be related to TLS streams. I assume this only happens when you print "a lot" of data, i.e. buffers start to become full(er)? Can you post some details about your platform? PHP version + extensions, OS? How do you reproduce this on your platform, does this happen 100% of the time?

Can you add some logging for the $errstr (a simple echo "[$errstr]\n" is fine) a few lines above this changeset and see what's being reported?

@mpociot
Copy link

mpociot commented Apr 29, 2020

I can reproduce this 100% of the time on my machine, yes.

My setup:
macOS 10.15.3
PHP 7.4.5 (cli) (built: Apr 23 2020 02:25:56) ( NTS )

Installed modules:

[PHP Modules]
bcmath
blackfire
bz2
calendar
Core
ctype
curl
date
dba
dom
exif
FFI
fileinfo
filter
ftp
gd
gettext
gmp
hash
iconv
imagick
intl
json
ldap
libxml
mbstring
mysqli
mysqlnd
odbc
openssl
pcntl
pcre
PDO
pdo_dblib
pdo_mysql
PDO_ODBC
pdo_pgsql
pdo_sqlite
pgsql
Phar
phpdbg_webhelper
posix
pspell
readline
Reflection
session
shmop
SimpleXML
soap
sockets
sodium
SPL
sqlite3
standard
sysvmsg
sysvsem
sysvshm
tidy
tokenizer
xml
xmlreader
xmlrpc
xmlwriter
xsl
Zend OPcache
zip
zlib

[Zend Modules]
Zend OPcache
blackfire

To reproduce the issue for me, it's enough to run this:

<?php
// composer require ratchet/pawl

require __DIR__ . '/vendor/autoload.php';

\Ratchet\Client\connect('wss://expose.dev/expose/control', [], [
    'X-Expose-Control' => 'enabled'
])->then(function($conn) {
    $conn->send(str_repeat('a', 1024*150));
});

This (without the fix) closes the connection 100% of the time.

@mpociot
Copy link

mpociot commented Apr 29, 2020

@clue the error handler does not get called, so nothing gets echoed for me

@WyriHaximus
Copy link
Member Author

@mpociot can you disable blackfire and try without the fix?

@mpociot
Copy link

mpociot commented Apr 29, 2020

@WyriHaximus that fails as well

@WyriHaximus
Copy link
Member Author

@mpociot ok good so it's not interfering 👍

@kelunik
Copy link

kelunik commented Apr 29, 2020

I think you need fragmented TLS records to reproduce it.

@WyriHaximus
Copy link
Member Author

@kelunik that would make sense, since the record isn't fully in we can't read from the stream until the rest is in correct?

@kelunik
Copy link

kelunik commented Apr 29, 2020

Yes, except that writes fail, not reads.

But with TLS any IO operations might require reads / writes, see WANT_WRITE and WANT_READ. We wanted to expose the last openssl error, but haven't done so, yet.

@clue
Copy link
Member

clue commented Apr 30, 2020

To reproduce the issue for me, it's enough to run this:

<?php
// composer require ratchet/pawl

require __DIR__ . '/vendor/autoload.php';

\Ratchet\Client\connect('wss://expose.dev/expose/control', [], [
    'X-Expose-Control' => 'enabled'
])->then(function($conn) {
    $conn->send(str_repeat('a', 1024*150));
});

This (without the fix) closes the connection 100% of the time.

Thanks for providing this reproduction script, this helped identify the underlying cause and I've just filed #150 as a more permanent fix without having to rely on retry behavior.

Again thanks to @WyriHaximus and @mpociot for reporting and filing an initial work-around in here. I've had a chat earlier today with @mpociot to verify my assumption and we've been able to reproduce and fix this some minimal changes to the existing test suite and code base. I have intentionally pushed this as multiple commits to highlight this exact issue in #150.

@clue clue closed this in #150 May 2, 2020
@clue clue removed this from the v1.1.1 milestone May 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants