Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wurlitzer hangs process when pipe/buffer is full #15

Closed
gidim opened this issue May 9, 2018 · 5 comments · Fixed by #16
Closed

Wurlitzer hangs process when pipe/buffer is full #15

gidim opened this issue May 9, 2018 · 5 comments · Fixed by #16

Comments

@gidim
Copy link

gidim commented May 9, 2018

Hi,

We're using Wurlitzer at Comet.ml and we found that when the C library prints a large string
the entire process hangs. We manage to create a test that reproduces this. Would it be possible to increase the buffer size or find a workaround?

def test_buffer_full():
    with pipes(stdout=PIPE, stderr=PIPE) as (stdout, stderr):
        long_string = "x" * 1000000 # create a very long string
        printf_err(long_string)

    # Test never reaches here as the process hangs.
    assert stderr.read() == long_string+"\n"
@minrk
Copy link
Owner

minrk commented May 9, 2018

Thanks for the test!

I think the OS controls the buffer size for pipes, so we don't have much of an option there. My first guess for the test to pass was to switch stderr=PIPE to stderr=io.StringIO() which should be able to contain arbitrary size messages.

I ran the test again and it still hung. It turns out it's hanging in the call to fflush. I reworked how fflush is called in #16 and your test now passes. Does that work for you?

@gidim
Copy link
Author

gidim commented May 9, 2018

Thanks for the quick reply!
Unfortunately it's still stuck on my machine. I managed to identify that it starts when the string is
greater or equals to 65535 bytes long. This is very likely to be the default pipe buffer on OSX 10.13.3

When I removed the fflush calls completely it hanged on os.read(pipe, 1024). Not sure why reading from a full buffer should be blocking. Possibly a race condition ?

@gidim
Copy link
Author

gidim commented May 10, 2018

EDIT:
Just want to be clear that this test is failing (hangs) in OS X with python 3.6.3 and 2.7.14

Manage to dig a little bit deeper. When I dump the stack trace of all threads it's clear that the process is blocked on select():

  File "test.py", line 142, in <module>
    test_buffer_full()
  File "test.py", line 122, in test_buffer_full
    printf_err(long_string)
  File "test.py", line 25, in printf_err
    libc.fprintf(c_stderr_p, (msg + '\n').encode('utf8'))
Thread 0x0000700005250000 (most recent call first):
  File "/Users/Gideon/Documents/dev/wurlitzer/wurlitzer.py", line 161 in forwarder
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 864 in run
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/threading.py", line 884 in _bootstrap

Current thread 0x00007fff9efa9340 (most recent call first):
  File "test.py", line 135 in sigshow
  File "test.py", line 25 in printf_err
  File "test.py", line 122 in test_buffer_full
  File "test.py", line 142 in <module>

From my understanding select() should never block when provide with a timeout value. It's possible (although unlikely) that passing 0 as the timeout is interpreted as NULL in the C side. When I set the timeout to 0.001 the test passes:

r, w, x = select.select(pipes, [], [], 0.01)

@minrk
Copy link
Owner

minrk commented May 14, 2018

Interesting. The test passed for me, but when I run it more times, I can see that I didn't.

The hang does seem to always be in fflush for me, though, not select. I think I can see that there's a race condition there now - if the too-large send happens in between select and fflush, the fflush will block forever.

In general, now that I know fflush blocks when the buffer is full, I think that calling fflush from the same thread that's reading from the pipe is a fundamentally unsafe pattern. My first idea for this is spawning yet another thread that does nothing but call flush. That seems a little wasteful, but I don't see another way to solve the dueling problems of:

  1. sometimes select won't wake unless fflush has been called
  2. fflush will block if the buffer is full

@gidim
Copy link
Author

gidim commented May 15, 2018

I haven't seen problem 1 happening before so I don't have much to contribute to the discussion. Have you considered using a file instead of a pipe?

@minrk minrk closed this as completed in #16 May 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants