werkzeug.formparser is really slow with large binary uploads #875

sekrause · 2016-03-03T14:14:11Z

When I perform a multipart/form-data upload of any large binary file in Flask, those uploads are very easily CPU bound (with Python consuming 100% CPU) instead of I/O bound on any reasonably fast network connection.

A little bit of CPU profiling reveals that almost all CPU time during these uploads is spent in werkzeug.formparser.MultiPartParser.parse_parts(). The reason this that the method parse_lines() yields a lot of very small chunks, sometimes even just single bytes:

# we have something in the buffer from the last iteration.
# this is usually a newline delimiter.
if buf:
    yield _cont, buf
    buf = b''

So parse_parts() goes through a lot of small iterations (more than 2 million for a 100 MB file) processing single "lines", always writing just very short chunks or even single bytes into the output stream. This adds a lot of overhead slowing down those whole process and making it CPU bound very quickly.

A quick test shows that a speed-up is very easily possible by first collecting the data in a bytearray in parse_lines() and only yielding that data back into parse_parts() when self.buffer_size is exceeded. Something like this:

buf = b''
collect = bytearray()
for line in iterator:
    if not line:
        self.fail('unexpected end of stream')

    if line[:2] == b'--':
        terminator = line.rstrip()
        if terminator in (next_part, last_part):
            # yield remaining collected data
            if collect:
                yield _cont, collect
            break

    if transfer_encoding is not None:
        if transfer_encoding == 'base64':
            transfer_encoding = 'base64_codec'
        try:
            line = codecs.decode(line, transfer_encoding)
        except Exception:
            self.fail('could not decode transfer encoded chunk')

    # we have something in the buffer from the last iteration.
    # this is usually a newline delimiter.
    if buf:
        collect += buf
        buf = b''

    # If the line ends with windows CRLF we write everything except
    # the last two bytes.  In all other cases however we write
    # everything except the last byte.  If it was a newline, that's
    # fine, otherwise it does not matter because we will write it
    # the next iteration.  this ensures we do not write the
    # final newline into the stream.  That way we do not have to
    # truncate the stream.  However we do have to make sure that
    # if something else than a newline is in there we write it
    # out.
    if line[-2:] == b'\r\n':
        buf = b'\r\n'
        cutoff = -2
    else:
        buf = line[-1:]
        cutoff = -1

    collect += line[:cutoff]

    if len(collect) >= self.buffer_size:
        yield _cont, collect
        collect.clear()

This change alone reduces the upload time for my 34 MB test file from 4200 ms to around 1100 ms over localhost on my machine, that's almost a 4X increase in performance. All tests are done on Windows (64-bit Python 3.4), I'm not sure if it's as much of a problem on Linux.

It's still mostly CPU bound, so I'm sure there is even more potential for optimization. I think I'll look into it when I find a bit more time.

The text was updated successfully, but these errors were encountered:

languanghao · 2016-03-18T01:51:17Z

I also have same problem, when I upload an iso file(200m), the first call to request.form will take 7s

RonnyPfannschmidt · 2016-03-18T10:59:43Z

2 things seem interesting for further optimization - experimenting with cython, and experimenting with interpreting the content-site headers for smarter mime message parsing

(no need to scan for lines if you know the content-length of a sub-message)

lnielsen · 2016-08-22T20:06:29Z

Just a quick note, that if you stream the file directly in the request body (i.e. no application/multipart-formdata), you completely bypass the form parser and read the file directly from request.stream.

carbn · 2016-10-18T07:04:35Z

I have the same issue with slow upload speeds with multipart uploads when using jQuery-File-Upload's chunked upload method. When using small chunks (~10MB), the transfer speed jumps between 0 and 12MB/s while the network and server are fully capable of speeds over 50MB/s. The slowdown is caused by the cpu bound multipart parsing which takes about the same time as the actual upload. Sadly, using streaming uploads to bypass the multipart parsing is not really an option as I must support iOS devices that can't do streaming in the background.

The patch provided by @sekrause looks nice but doesn't work in python 2.7.

cuibonobo · 2016-10-18T18:50:36Z

@carbn: I was able to get the patch to work in Python 2.7 by changing the last line to collect = bytearray(). This just creates a new bytearray instead of clearing the existing one.

carbn · 2016-10-18T19:50:42Z

@cuibonobo: That's the first thing I changed but still had another error. I can't check the working patch at the moment, but IIRC the yields had to be changed from yield _cont, collect to yield _cont, str(collect). This allowed the code to be tested and the patch yielded about 30% increase in the multipart processing speed. It's a nice speedup, but the performance is still pretty bad.

sekrause · 2016-10-18T20:36:48Z

A little further investigation shows that werkzeug.wsgi.make_line_iter is already too much of a bottleneck to really be able to optimize parse_lines(). Look at this Python 3 test script:

import io
import time
from werkzeug.wsgi import make_line_iter

filename = 'test.bin' # Large binary file
lines = 0

# load a large binary file into memory
with open(filename, 'rb') as f:
    data = f.read()
    stream = io.BytesIO(data)
    filesize = len(data) / 2**20 # MB

start = time.perf_counter()
for _ in make_line_iter(stream):
    lines += 1
stop = time.perf_counter()
delta = stop - start

print('File size: %.2f MB' % filesize)
print('Time: %.1f seconds' % delta)
print('Read speed: %.2f MB/s' % (filesize / delta))
print('Number of lines yielded by make_line_iter: %d' % lines)

For a 923 MB video file with Python 3.5 the output look something like this on my laptop:

File size: 926.89 MB
Time: 20.6 seconds
Read speed: 44.97 MB/s
Number of lines yielded by make_line_iter: 7562905

So even if you apply my optimization above and optimize it further until perfection you'll still be limited to ~45 MB/s for large binary uploads simply because make_line_iter can't give you the data fast enough and you'll be doing 7.5 million iterations for 923 MB of data in your loop that checks for the boundary.

I guess the only great optimization will be to completely replace parse_lines() with something else. A possible solution that comes to mind is to read a reasonably large chunk of the stream into memory then use string.find() (or bytes.find() in Python 3) to check if the boundary is in the chunk. In Python find() is a highly optimized string search algorithm written in C, so that should give you some performance. You would just have to take care of the case where the boundary might be right between two chunks.

sdizazzo · 2017-06-02T14:58:15Z

I wanted to mention doing the parsing on the stream in chunks as it is received. @siddhantgoel wrote this great little parser for us. It's working great for me. https://github.com/siddhantgoel/streaming-form-data

lambdaq · 2017-06-20T11:57:31Z

I guess the only great optimization will be to completely replace parse_lines()

+1 for this.

I am writing a bridge to stream user's upload directly to S3 without any intermediate temp files, possibly with backpressure, and I find werkzeug and flask situation frustrating. You can't move data directly between two pipes.

davidism · 2017-06-20T13:28:09Z

@lambdaq I agree it's a problem that needs to be fixed. If this is important to you, I'd be happy to review a patch changing the behavior.

lnielsen · 2017-06-20T14:41:30Z

@lambdaq Note that if you just stream data directly in the request body and use application/octet-stream then the form parser doesn't kick in at all and you can use request.stream (i.e. no temp files etc).

The only problem we had is the werkzeug form parser is eagerly checking content length against the allowed max content length before knowing if it should actually parse the request body.

This prevents you from setting max content length on normal form data, but also allow very large file uploads.

We fixed it by reordering the check the function a bit. Not sure if it makes sense to provide this upstream as some apps might rely on the existing behaviour.

lambdaq · 2017-06-21T02:30:53Z

Note that if you just stream data directly in the request body and use application/octet-stream then the form parser doesn't kick in at all and you can use request.stream (i.e. no temp files etc).

Unfortunately not. It's just normal form uploads with multipart.

I'd be happy to review a patch changing the behavior.

I tried to hack werkzeug.wsgi.make_line_iter or parse_lines() using generators's send(), so we can signal _iter_basic_lines() to emit whole chunks instead of lines. It turns out not so easy.

Basically, the rabbit whole starts with 'itertools.chain' object has no attribute 'send'.... 😂

ThiefMaster · 2017-06-21T05:36:54Z

I wonder how much this code could be sped up using native speedups written in C (or Cython etc.). I think handling semi-large (a few 100 MB, but not huge as in many GB) files more efficiently is important without having to change how the app uses them (ie streaming them directly instead of buffering) - for many applications this would be overkill and is not absolutely necessary (actually, even the current somewhat slow performance is probably OK for them) but making things faster is always nice!

lambdaq · 2017-09-30T09:18:04Z

Another possible solution is offload the multipart parsing job to nginx

https://www.nginx.com/resources/wiki/modules/upload/

lambdaq · 2017-12-14T01:28:26Z

@ThiefMaster

See

https://github.com/hydrogen18/multipart-python

https://github.com/defnull/multipart

ThiefMaster · 2017-12-14T07:08:20Z

Both repos look dead.

mdemin914 · 2018-02-01T07:49:25Z

so is there no known solution to this?

lnielsen · 2018-02-01T13:34:40Z

There's a workaround👆

sdizazzo · 2018-02-02T17:26:46Z

Under uwsgi, we use it's built in chunked_read() function and parse the stream on our own as it comes in. It works 99% of the time, but it has a bug that I have yet to track down. See my earlier comment for an out-of-the box streaming form parser. Under python2 it was slow, so we rolled our own and it is fast. :)

davidism · 2018-02-02T17:54:16Z

Quoting from above:

I agree it's a problem that needs to be fixed. If this is important to you, I'd be happy to review a patch changing the behavior.

I don't really have time to work on this right now. If this is something that you are spending time on, please consider contributing a patch. Contributions are very welcome.

siddhantgoel · 2018-02-03T09:32:15Z

@sdizazzo

but it has a bug that I have yet to track down

are you talking about streaming-form-data? if so, I'd love to know what the bug is.

kneufeld · 2018-07-21T18:42:54Z

Our problem was that the slow form processing prevented concurrent request handling which caused nomad to think the process was hung and killed it.

My fix was to add a sleep(0) in werkzeug/formparser.py:MutlipartParser.parse_lines():

            for i, line in enumerate(iterator):
                if not line:
                    self.fail('unexpected end of stream')

                # give other greenlets a chance to run every 100 lines
                if i % 100 == 0:
                    time.sleep(0)

search for unexpected end of stream if you want to apply this patch.

patrislav1 · 2018-10-12T10:58:23Z

I wanted to mention doing the parsing on the stream in chunks as it is received. @siddhantgoel wrote this great little parser for us. It's working great for me. https://github.com/siddhantgoel/streaming-form-data

seconded.
this speeds up file uploads to my Flask app by more than factor 10

ghost · 2019-09-10T16:39:51Z

@siddhantgoel
Thanks a lot for your fix with streaming-form-data. I can finally upload gigabyte sized files at good speed and without memory filling up!

davidism · 2020-04-12T17:43:08Z

See #1788 which discusses rewriting the parser to be sans-io. Based on the feedback here, I think that would address this issue too.

sekrause · 2021-01-26T20:57:16Z

@davidism I don't think this issue should be closed because the speed-up is negligible.

Below is a little test script to benchmark the multipart parser and to compare Werkzeug with streaming-form-data. Run it with:

bench.py werkzeug myfile
bench.py streaming-form-data myfile (only works the the stable Werkzeug version.)

These are my results with a 425 MB zip file on my laptop:

Old Werkzeug parser: 8200 ms.
New Werkzeug parser: 6250 ms.
streaming-form-data: 260 ms.

So the new parser is only about 25% faster than the old parser, but still more than an order of magnitude slower than a fast parser.

import argparse
import io
import time
from os.path import basename

from flask import Flask, request
from streaming_form_data import StreamingFormDataParser
from streaming_form_data.targets import BaseTarget
from werkzeug.test import EnvironBuilder, run_wsgi_app

app = Flask(__name__)

class LengthTarget(BaseTarget):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.total = 0

    def on_data_received(self, chunk: bytes):
        self.total += len(chunk)

@app.route("/streaming-form-data", methods=['POST'])
def streaming_form_data_upload():
    target = LengthTarget()

    parser = StreamingFormDataParser(headers=request.headers)
    parser.register('file', target)

    while True:
        chunk = request.stream.read(131072)
        if not chunk:
            break
        parser.data_received(chunk)

    print(target.total)
    return 'done'

@app.route("/werkzeug", methods=['POST'])
def werkzeug_upload():
    file = request.files['file']
    stream = file.stream
    stream.seek(0, io.SEEK_END)
    print(stream.tell())
    return 'done'

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('parser', choices=['streaming-form-data', 'werkzeug'])
    parser.add_argument('file')
    args = parser.parse_args()
    with open(args.file, 'rb') as f:
        data = f.read()

    # Prepare the whole environment in advance so that this doesn't slow down the benchmark.
    e = EnvironBuilder(method='POST', path=f'/{args.parser}')
    e.files.add_file('file', io.BytesIO(data), basename(args.file))
    environ = e.get_environ()

    start = time.perf_counter()
    run_wsgi_app(app, environ)
    stop = time.perf_counter()
    delta = (stop - start) * 1000
    print(f'{delta:.1f} ms')

if __name__ == "__main__":
    main()

davidism · 2021-01-26T21:06:56Z

@sekrause hey, I really appreciate the detail you're providing. However, in the five years since you opened this issue, neither you or anyone else invested in seeing the issue fixed has actually submitted a fix.

I personally will not have the time to learn that library's implementation and identify how it can be applied to ours. Note that the library you're comparing to is implemented in C, so it's unlikely we'll every achieve the same speed. It's also already possible to use that library with Werkzeug when that speed is required. Perhaps someone could turn that into an extension library so it's more integrated as a Request subclass.

I'm happy to consider a PR that adds further improvements to the parser, but leaving this issue open so far doesn't seem to have resulted in that.

siddhantgoel · 2021-01-27T20:17:11Z

Author of the other library here. I'm more than happy to review proposals/patches in case someone wants to provide an extension so it can work better with Werkzeug.

sekrause · 2021-01-30T10:53:59Z

@davidism So I looked into your current implementation to check where it's slow and I think it turns out that from here we can get another 10x speedup by adding less than 10 lines of code.

When uploading a large binary file most of the time is spent in the elif self.state == State.DATA clause of MultipartDecoder.next_event, about half in list(LINE_BREAK_RE.finditer(self.buffer)) and half in the remaining lines.

But we don't really need to look at all lines break. The trick is to offload as much work as possible to bytes.find() which is really fast.

When we execute self.buffer.find(boundary_end) and it returns that nothing has been found, we can be sure that the ending boundary is not in self.buffer[:-len(boundary_end)] and just return this data without looking at it any further. We need to keep the last len(boundary_end) bytes of the buffer for the next iteration in case the ending boundary is on the border between two chunks.

When uploading a large file almost all iterations of the loop can return immediately after self.buffer.find(boundary_end). Only when it actually seems like we have and ending bounary we fall back to the code which checks for the line breaks and with the regular expressions.

If you want to test it yourself add this to MultipartDecoder.__init__():

self.boundary_end = b'--' + boundary + b'--'

And then change the elif self.state == State.DATA clause of MultipartDecoder.next_event into this:

elif self.state == State.DATA:
    if len(self.buffer) <= len(self.boundary_end):
        event = NEED_DATA
    elif self.buffer.find(self.boundary_end) == -1:
        data = bytes(self.buffer[:-len(self.boundary_end)])
        del self.buffer[:-len(self.boundary_end)]
        event = Data(data=data, more_data=True)
    else:
        # Return up to the last line break as data, anything past
        # that line break could be a boundary - more data may be
        # required to know for sure.
        lines = list(LINE_BREAK_RE.finditer(self.buffer))
        if len(lines):
            data_length = del_index = lines[-1].start()
            match = self.boundary_re.search(self.buffer)
            if match is not None:
                if match.group(1).startswith(b"--"):
                    self.state = State.EPILOGUE
                else:
                    self.state = State.PART
                data_length = match.start()
                del_index = match.end()

            data = bytes(self.buffer[:data_length])
            del self.buffer[:del_index]
            more_data = match is None
            if data or not more_data:
                event = Data(data=data, more_data=more_data)

Everything after the else is the old code unchanged and the elif self.buffer.find(self.boundary_end) == -1 is the trick I described above. This change alone reduces the upload time of my 430 MB test from 7000 to 700 ms, a 10x speed-up!

The if len(self.buffer) <= len(self.boundary_end) was needed so that we don't get an infinite loop, not sure if it's correct.

What do you think?

davidism · 2021-01-30T12:50:52Z

Sounds interesting, can you make a pr?

pgjones · 2021-01-30T13:13:02Z

Thanks @sekrause. I don't think exactly as you've written can be used (it ignores the complexity around CR and NL), however I think with some tweaks it works as #2022. Could you try that PR against your benchmarks?

sekrause · 2021-01-30T14:02:29Z

I don't think exactly as you've written can be used (it ignores the complexity around CR and NL)

I think we can make it work. The regular expression from MultipartDecoder.__init__() which checks for the boundary with all the complexity around CR and NL looks like this:

self.boundary_re = re.compile(
    br"%s--%s(--[^\S\n\r]*%s?|[^\S\n\r]*%s)"
    % (LINE_BREAK, boundary, LINE_BREAK, LINE_BREAK),
    re.MULTILINE,
)

So if buffer.find(b'--' + boundary) doesn't find a result the regular expression is also guaranteed to have no match because find() looks for a strict subset of the regular expression.

If find() does find something, we check again with the regular expression which handles all the complexity around CR and NL.

This additional precheck with find() seems to be worth it because when uploading a large file more than 99% of the iterations in the loop will end after find() and never have to recheck with the slower regular expression.

however I think with some tweaks it works as #2022. Could you try that PR against your benchmarks?

Your change is a ~2x speed-up, from 7000 ms to 3700 ms on my computer with my 430 MB test file. I've posted my benchmark program in #875 (comment) if you want to compare yourself.

sekrause · 2021-01-31T22:52:56Z

Final summary now that our changes have landed in Github master. Small benchmark uploading a file of 64 MB random data 10 times in a row and measuring the average request time on an Intel Core i7-8550U:

Werkzeug 1.0.1: 10 requests. Avg request time: 1390.9 ms. 46.0 MiB/s
Werkzeug Master: 10 requests. Avg request time: 91.3 ms. 701.0 MiB/s

With reasonable large files that's a 15x improvement (the difference is a little lower with small files because of the request overhead) and on a somewhat fast server CPU Werkzeug's multipart parser should now be able to saturate a gigabit ethernet link!

I'm happy with the result. :)

Moschn mentioned this issue Jul 2, 2016

File upload is really slow Moschn/neuronal-activity-analyzer#25

Open

lnielsen mentioned this issue Nov 21, 2016

large file uploads eating up memory pallets/flask#2086

Closed

davidism added the bug label Jun 2, 2017

pallets deleted a comment from huhuanming Jun 20, 2017

davidism mentioned this issue Feb 1, 2018

Issue with sending large-ish CSV files to Flask pallets/flask#2617

Closed

patrislav1 mentioned this issue Oct 12, 2018

added Flask example siddhantgoel/streaming-form-data#22

Merged

davidism added the save-for-sprint label May 30, 2019

davidism removed the save-for-sprint label Jun 2, 2019

pallets deleted a comment from VaishaliWadhwa3 Oct 1, 2019

siddhantgoel mentioned this issue Nov 5, 2019

How to stream upload a file using multipart/form-data hugapi/hug#474

Open

davidism mentioned this issue Mar 27, 2020

Request | request.files access is too slow. pallets/flask#3546

Closed

davidism mentioned this issue Jan 24, 2021

Refactor the Multipart parsing into a Sans-IO layer #2017

Merged

6 tasks

davidism added this to the 2.0.0 milestone Jan 26, 2021

davidism closed this as completed in #2017 Jan 26, 2021

davidism reopened this Jan 30, 2021

davidism mentioned this issue Jan 31, 2021

Improve the performance of the multipart parser #2022

Merged

6 tasks

davidism closed this as completed in #2022 Jan 31, 2021

github-actions bot locked as resolved and limited conversation to collaborators Feb 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

werkzeug.formparser is really slow with large binary uploads #875

werkzeug.formparser is really slow with large binary uploads #875

sekrause commented Mar 3, 2016

languanghao commented Mar 18, 2016

RonnyPfannschmidt commented Mar 18, 2016

lnielsen commented Aug 22, 2016

carbn commented Oct 18, 2016 •

edited

Loading

cuibonobo commented Oct 18, 2016

carbn commented Oct 18, 2016

sekrause commented Oct 18, 2016

sdizazzo commented Jun 2, 2017

lambdaq commented Jun 20, 2017 •

edited by davidism

Loading

davidism commented Jun 20, 2017

lnielsen commented Jun 20, 2017 •

edited

Loading

lambdaq commented Jun 21, 2017 •

edited

Loading

ThiefMaster commented Jun 21, 2017

lambdaq commented Sep 30, 2017

lambdaq commented Dec 14, 2017

ThiefMaster commented Dec 14, 2017

mdemin914 commented Feb 1, 2018

lnielsen commented Feb 1, 2018

sdizazzo commented Feb 2, 2018 •

edited

Loading

davidism commented Feb 2, 2018 •

edited

Loading

siddhantgoel commented Feb 3, 2018

kneufeld commented Jul 21, 2018

patrislav1 commented Oct 12, 2018

ghost commented Sep 10, 2019

davidism commented Apr 12, 2020

sekrause commented Jan 26, 2021

davidism commented Jan 26, 2021 •

edited

Loading

siddhantgoel commented Jan 27, 2021

sekrause commented Jan 30, 2021

davidism commented Jan 30, 2021

pgjones commented Jan 30, 2021

sekrause commented Jan 30, 2021

sekrause commented Jan 31, 2021

werkzeug.formparser is really slow with large binary uploads #875

werkzeug.formparser is really slow with large binary uploads #875

Comments

sekrause commented Mar 3, 2016

languanghao commented Mar 18, 2016

RonnyPfannschmidt commented Mar 18, 2016

lnielsen commented Aug 22, 2016

carbn commented Oct 18, 2016 • edited Loading

cuibonobo commented Oct 18, 2016

carbn commented Oct 18, 2016

sekrause commented Oct 18, 2016

sdizazzo commented Jun 2, 2017

lambdaq commented Jun 20, 2017 • edited by davidism Loading

davidism commented Jun 20, 2017

lnielsen commented Jun 20, 2017 • edited Loading

lambdaq commented Jun 21, 2017 • edited Loading

ThiefMaster commented Jun 21, 2017

lambdaq commented Sep 30, 2017

lambdaq commented Dec 14, 2017

ThiefMaster commented Dec 14, 2017

mdemin914 commented Feb 1, 2018

lnielsen commented Feb 1, 2018

sdizazzo commented Feb 2, 2018 • edited Loading

davidism commented Feb 2, 2018 • edited Loading

siddhantgoel commented Feb 3, 2018

kneufeld commented Jul 21, 2018

patrislav1 commented Oct 12, 2018

ghost commented Sep 10, 2019

davidism commented Apr 12, 2020

sekrause commented Jan 26, 2021

davidism commented Jan 26, 2021 • edited Loading

siddhantgoel commented Jan 27, 2021

sekrause commented Jan 30, 2021

davidism commented Jan 30, 2021

pgjones commented Jan 30, 2021

sekrause commented Jan 30, 2021

sekrause commented Jan 31, 2021

carbn commented Oct 18, 2016 •

edited

Loading

lambdaq commented Jun 20, 2017 •

edited by davidism

Loading

lnielsen commented Jun 20, 2017 •

edited

Loading

lambdaq commented Jun 21, 2017 •

edited

Loading

sdizazzo commented Feb 2, 2018 •

edited

Loading

davidism commented Feb 2, 2018 •

edited

Loading

davidism commented Jan 26, 2021 •

edited

Loading