
Busy WSGITask that does no I/O blocks main thread from flushing outbuf #349

Closed

idanzolantz opened this issue Jun 22, 2021 · 5 comments

@idanzolantz

Hey guys,

I'm using waitress==1.4.4 + flask==1.0.2 on Python 3.6.10, on both macOS 11.4 (Big Sur) and Alpine Linux 3.11.6 (kernel 4.19.112).

I created this very simple app:

import threading
from flask import Response, Flask
from waitress import serve

app = Flask(__name__)

@app.route('/stream_example', methods=['GET', 'POST'])
def stream_example():

    def generate():
        for i in range(1000000):
            yield f"This is iter {i} \n"

    r = Response(generate())
    r.headers['Content-Type'] = 'application/octet-stream'
    r.headers.set('Content-Disposition', 'attachment', filename="results.txt")
    return r

if __name__ == '__main__':
    serve(app, host='0.0.0.0', port=8000)

When running:
curl --raw --request POST 'http://localhost:8000/stream_example' --header 'Content-Type: application/json' --data-raw '' (or the equivalent with Python's requests, same result), I see strange behavior: it takes 30-40 seconds for the data to start streaming back to the client, whereas with chunked encoding I would expect the stream to start immediately. Memory gets inflated, as if the entire response were being buffered in memory before the first byte is sent.

I did some research and found that as long as flask/WSGITask keeps generating output with no I/O between yields, the main thread struggles to acquire the outbuf lock: _flush_some_if_lockable in channel.py always fails to acquire the lock and returns immediately, so nothing gets flushed. Examining the output buffer at each iteration, it's clear that data keeps being appended and nothing is flushed. (I'm not sure, but I think I also noticed the initial BytesIOBasedBuffer object overflowed and was rotated to a TempfileBasedBuffer at some point.)

Out of curiosity, I tried changing:

def _flush_some_if_lockable(self):
    ...
    if self.outbuf_lock.acquire(False):
        ...

to:

def _flush_some_if_lockable(self):
    ...
    if self.outbuf_lock.acquire(True, 0.00000001):
        ...

and voilà! The response starts flowing immediately.
It seems like something between Python's GIL and the implementation of Condition.acquire() causes self.outbuf_lock.acquire(False) to almost always merely peek at the lock while it is held by the application thread and fail, whereas if the main thread actually requests the lock (by committing to wait for even a tiny timeout), it does get it.
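
To illustrate the pattern outside of waitress, here is a standalone sketch (it uses a plain threading.Lock instead of waitress's Condition and arbitrary timings, so the exact numbers will differ from the real server):

import threading
import time

lock = threading.Lock()
stop = False

def writer():
    # simulates the application thread: grabs the lock, "appends a chunk",
    # releases, and immediately loops again with no I/O in between
    while not stop:
        with lock:
            pass  # append chunk to outbuf

def try_flush(non_blocking, duration=2.0):
    # simulates the main thread's flush attempts
    hits = misses = 0
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        got = lock.acquire(False) if non_blocking else lock.acquire(True, 1e-8)
        if got:
            hits += 1
            lock.release()
        else:
            misses += 1
    return hits, misses

threading.Thread(target=writer, daemon=True).start()
print("acquire(False):      hits/misses =", try_flush(non_blocking=True))
print("acquire(True, 1e-8): hits/misses =", try_flush(non_blocking=False))
stop = True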

I know blocking the main thread is a (very) bad idea. Alternatively, forcing the app/task thread to context-switch after every chunk also works (with worse performance for a single request, but probably better performance for production use cases, where multiple requests are handled in parallel).
Add a sleep(0) to execute() in the WSGITask class (task.py), right after writing each chunk:

class WSGITask(Task):
    """A WSGI task produces a response from a WSGI application.
    """
    ...
    def execute(self):
        ...
            # transmit headers only after first iteration of the iterable
            # that returns a non-empty bytestring (PEP 3333)
            if chunk:
                self.write(chunk)
                time.sleep(0)  # force a context switch right after writing each chunk

        cl = self.content_length
        if cl is not None:
            ...

On Unix systems, time.sleep(0) is translated to a select syscall with timeout=0, which is known to cause a context switch.
After that, the main thread has no problem acquiring the lock.

In conclusion, I think this locking issue needs to be addressed.
You can either have the framework code that pumps chunks from the application iterator to the socket force a context switch (like I showed above),
or let users know (via the official documentation / code examples) that applications which "yield chunks too fast" need to inject some sort of I/O or context-switching instruction between yields, as sketched below.
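
For reference, this is roughly what that application-side workaround could look like with the Flask app from the top of this issue (the every-100-chunks interval is arbitrary and only for illustration):

import time
from flask import Response, Flask
from waitress import serve

app = Flask(__name__)

@app.route('/stream_example', methods=['GET', 'POST'])
def stream_example():

    def generate():
        for i in range(1000000):
            yield f"This is iter {i} \n"
            if i % 100 == 0:
                # give up the GIL so waitress's main thread can grab the
                # outbuf lock and flush what has been written so far
                time.sleep(0)

    r = Response(generate())
    r.headers['Content-Type'] = 'application/octet-stream'
    r.headers.set('Content-Disposition', 'attachment', filename="results.txt")
    return r

if __name__ == '__main__':
    serve(app, host='0.0.0.0', port=8000)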

BTW, this issue reproduces on waitress==2.0.0 as well.
As said, I checked on macOS and Linux; I did not check on Windows.

Thanks,
Idan.

@digitalresistor
Member

digitalresistor commented Jun 27, 2021

I'm going to need to play with this a bit, because after a certain amount of content is buffered Waitress should block the yielding thread from continuing to yield so that it can flush output:

https://github.com/Pylons/waitress/blob/master/src/waitress/channel.py#L342

That function won't return until we are back under the outbuf high-water mark... basically it applies back pressure, because in the past Waitress would allow an application to write an unbounded number of bytes to disk in our output buffers.
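
Roughly, the idea is the following (an illustrative sketch of the back-pressure pattern, not waitress's actual code; the class and method names here are made up):

import threading

class OutbufBackpressure:
    """Sketch of the back-pressure idea: the writing (application) thread blocks
    until the amount of buffered data drops back under the high-water mark."""

    def __init__(self, high_watermark=16 * 1024 * 1024):
        self.high_watermark = high_watermark
        self.pending = 0
        self.cond = threading.Condition()

    def write(self, data):
        # called from the application thread for each chunk
        with self.cond:
            self.pending += len(data)
            while self.pending > self.high_watermark:
                self.cond.wait()  # back pressure: stop yielding until drained

    def flushed(self, nbytes):
        # called from the main/IO thread after it manages to flush some bytes
        with self.cond:
            self.pending -= nbytes
            self.cond.notify_all()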

@idanzolantz
Author

@bertjwregeer thanks for your response!
Assuming _flush_outbufs_below_high_watermark does block the yielding thread as it should, thereby allowing the main thread to acquire the lock and start flushing once a certain amount of data has been buffered, that is still not good enough for some use cases.
For example, some proxies between clients and servers might give up and break the connection if no bytes arrive after xx seconds of waiting. So waiting for a lot of data to accumulate is not only a matter of bad user experience (it might take 1-2 minutes before the file starts downloading); it's also a functional issue.
It's probably best to lower the high-watermark size drastically, or find another solution that lets the main thread start flushing sooner...

@digitalresistor
Member

It shouldn't take 40 seconds to start flushing, though... especially if you are yielding data really quickly, since it should quickly reach the high-water mark (or exhaust the iterator).

The high-water mark is set to 16 MiB (16777216 bytes).

I will also admit that this is very much a worst-case scenario; most applications have plenty of other requests/processes happening at the same time that do I/O. Adding a time.sleep(0) is, in my opinion, the wrong way to go about this. I'd want to explicitly give up the lock and notify the main thread; time.sleep(0) may also behave differently across Python implementations and may not work the same on PyPy, for instance.

You could set outbuf_high_watermark to a lower value in your waitress.serve call to get it to pause the thread sooner. We overflow from the in-memory buffer to a file-based buffer after 1 MiB, which is controlled with outbuf_overflow. See the documentation for the variables you may tweak here: https://docs.pylonsproject.org/projects/waitress/en/stable/arguments.html#arguments
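
For example, something along these lines (app is the WSGI/Flask application from earlier in this issue; the specific values are only illustrative, not a recommendation):

from waitress import serve

serve(
    app,
    host='0.0.0.0',
    port=8000,
    outbuf_overflow=1048576,        # 1 MiB: when to spill from memory to a temp file (the default)
    outbuf_high_watermark=1048576,  # pause the app thread much sooner than the 16 MiB default
)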

@digitalresistor
Member

@idanzolantz if you want to give this change a shot: #364 — you can check whether it alleviates your issue. We now attempt to flush each time write_soon is called from the application thread.

@digitalresistor
Member

See this comment: #364 (comment)

This fixes the issue; reproducer used:

import logging

log = logging.getLogger(__name__)


def application(environ, start_response):
    start_response(
        "200 OK",
        [
            ("Content-Type", "application/octet-stream"),
        ],
    )
    for i in range(1000000):
        #log.info(f'yield {i}')
        yield f'this is iter {i}\n'.encode('utf8')


if __name__ == "__main__":
    import waitress

    logging.basicConfig(
        format="%(asctime)-15s %(levelname)-8s %(name)s %(message)s",
        level=logging.DEBUG,
    )

    waitress.serve(application)

(Thanks @mmerickel)
