Profile big tables #70
We have a similar issue when printing large documents (hundreds of pages): Weasy consumes several GB of memory. We don't have a single large table spanning multiple pages, but pretty much every page has tables on it. I'm guessing that GC is not taking place? Or perhaps it needs to take place where it currently isn't?
WeasyPrint creates multiple objects for every element of the document, and keeps it all in memory until the end. So very high memory usage for big documents is kind of "expected". It’s not the GC’s fault if nothing is freed. Things that we could do include:
I have uploaded a sample document to http://filebin.ca/nGTQ12Aivfp. This is a fairly common document layout approach that we use. I printed the above file in Weasy 0.19:
This seems to be a fairly common occurrence: the same problem shows up on many of our servers with large documents.
I made a first attempt at profiling with memory_profiler. Here's the data:
This is how I generated the data:
Values marked as small are generated by this script; the ones marked as huge are generated with 10 times as many table lines. If I create 100 times as many tds, WeasyPrint gets killed by the OOM killer on my 3GB machine.
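The original script is not reproduced in this thread. As a rough sketch of the kind of test described above, assuming a plain table of `rows` x `cols` cells: the names `build_table_html` and `measure_peak` are hypothetical, and the standard library's `tracemalloc` stands in for memory_profiler, so only Python-level allocations are counted.

```python
import tracemalloc


def build_table_html(rows, cols=5):
    """Return an HTML document containing one table with rows x cols cells."""
    lines = ["<html><body><table>"]
    for r in range(rows):
        cells = "".join(f"<td>row {r} cell {c}</td>" for c in range(cols))
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table></body></html>")
    return "\n".join(lines)


def measure_peak(func, *args):
    """Run func(*args) and return (result, peak traced bytes)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


small = build_table_html(100)
huge = build_table_html(1000)  # ten times as many table lines

# With WeasyPrint installed, the render itself could be measured like this:
#   from weasyprint import HTML
#   _, peak = measure_peak(lambda: HTML(string=huge).write_pdf())
```

Memory allocated by native libraries is not visible to `tracemalloc`, so process-level figures such as the ones quoted in this thread will be higher.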
A couple of interesting findings:
I can't reproduce this issue anymore. I tried rendering a 10000-line table: it took 108 seconds on my computer with about 3GB of RAM used (between 10MB and 15MB per page). I just used 10000 instead of 100 in the script above. The output is a 239-page, 164kB PDF. Of course, we could do better: Firefox takes about 2s and 100MB to render the same page.
The example provided by @si13b still takes a lot of memory. Stats on my i7-6500U @ 2.5GHz:
I'll try to use a generator instead of a list when we render the pages. 4MB per page looks like a bad but not awful score to me. Firefox takes 10s and 600MB of RAM to render the page. Opening and closing the web inspector makes it crash.
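To illustrate why a generator can help here, a minimal sketch, not WeasyPrint's code: `layout_page` is a hypothetical stand-in that returns a block of bytes in place of a laid-out page, and the comparison assumes that nothing needs an earlier page once it has been drawn.

```python
import tracemalloc


def layout_page(number, size=100_000):
    """Hypothetical stand-in for a laid-out page: just a block of bytes."""
    return bytearray(size)


def render_all_as_list(page_count):
    """Build every page first, then 'draw' them: all pages are alive at once."""
    pages = [layout_page(n) for n in range(page_count)]
    return sum(len(page) for page in pages)


def render_all_as_generator(page_count):
    """'Draw' each page as it is produced: earlier pages can be freed."""
    pages = (layout_page(n) for n in range(page_count))
    return sum(len(page) for page in pages)


def peak_bytes(func, *args):
    """Return the peak number of bytes traced while running func(*args)."""
    tracemalloc.start()
    try:
        func(*args)
        return tracemalloc.get_traced_memory()[1]
    finally:
        tracemalloc.stop()
```

Calling `peak_bytes(render_all_as_list, 50)` and `peak_bytes(render_all_as_generator, 50)` should show a peak that grows with the page count in the first case and stays near the size of a page or two in the second. If later steps still need every page, the saving disappears.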
Style is not copied anymore when boxes are duplicated.
Style dicts are not modified anymore during the layout, as they were before for some properties:
- margins, borders and paddings when the box was split between two pages (useless as these computed values are stored directly in the box),
- top borders were changed in tables (useless for the same reason),
- bookmark labels and string sets are now stored in the box.
This commit can introduce very subtle bugs that are hard to debug. In the future, we should try to freeze the style dicts before the layout.
Related to #70.
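The idea of sharing one style mapping between duplicated boxes, and freezing it so that the layout cannot modify it by accident, can be sketched as follows. This is an illustration only: `Box` and `freeze` are hypothetical and much simpler than WeasyPrint's real classes, and `MappingProxyType` is just one possible way to get a read-only mapping.

```python
from types import MappingProxyType


class Box:
    """Hypothetical, much simplified box; not WeasyPrint's actual class."""

    def __init__(self, style, margin_top=0):
        self.style = style            # shared, read-only mapping
        self.margin_top = margin_top  # computed value stored on the box

    def copy(self):
        # Shallow duplicate: the style mapping is shared, not copied.
        return Box(self.style, self.margin_top)


def freeze(style):
    """Return a read-only view over a private copy of the style dict."""
    return MappingProxyType(dict(style))


style = freeze({"margin_top": 10, "color": "black"})
box = Box(style, margin_top=10)
second_half = box.copy()    # e.g. the part of the box on the next page
second_half.margin_top = 0  # per-box value changes, the style does not
```

Any attempt to write `second_half.style["margin_top"] = 0` raises `TypeError`, which turns the kind of subtle bug mentioned above into an immediate error.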
Python 3.6 is a huge improvement thanks to compact dicts. 344cb08 prevents style dicts from being copied each time a box is duplicated. I have to check that it doesn't break anything with the W3C suite; it may have introduced subtle bugs, but I'm pretty confident. Oh, and I think that the "40 minutes" from the last comment were not true 😉.
@liZe amazing, thanks!
The results are very good for other documents too. I've tested the examples of #384: they're both significantly faster and use less memory thanks to Python 3.6 and this commit. There's room for more improvement though (before I close this issue). Inline boxes need much more memory than block boxes, and I don't really know why. That's why you get memory problems with tables: there are lots of text lines in tables (at least one per cell).
I've done my best to both add optimizations and clean the code, and I'm really happy with the result. I've done some benchmarks with Python 3.6 on Linux, for 4 different versions:
I had also tested 0.31 with Python 3.5 in #384. I'm closing this issue as the easy part has been done. If anyone is interested in even better performance, you only have to:
This work can be done here for sure, but also in CSSSelect2 (if possible, it would be better). Good luck! (Spoiler alert: named pages coming soon may hurt this speed a little bit.)
Large document
It's the large document given as example here, with 3000 pages of paragraphs and tables.
0.39:
Alice's Adventures in Wonderland
https://www.gutenberg.org/files/11/11-h/11-h.htm
Long document with repetitive justified text.
0.31:
0.39:
HTML5 Specification
Long document with a lot of lists and underlined links.
0.31:
0.39:
Online Wikipedia
https://en.wikipedia.org/w/index.php?title=HTML5&printable=yes
Printable version of a Wikipedia page, not downloaded before, long left-aligned paragraphs with floats.
0.31:
0.39:
This fixes blocks split between pages and inline boxes split by block boxes. Increases memory usage, related to #70.
http://pastebin.com/HLVrTmxt