WeasyPrint consuming a lot of memory when rendering tables with 5000 rows #1104

Closed
gaurav1999 opened this issue Apr 22, 2020 · 9 comments
Labels
performance Too slow renderings

Comments

@gaurav1999

Hi, WeasyPrint has made my life a lot easier, but I recently noticed that it consumes a lot of memory. On top of that, the memory from previous calls is not released: usage grows with every print call. I am running the latest version of WeasyPrint, 51.

Python: 3.7
Distro: Fedora Workstation 31

        df = self.get_df(limit=True)
        no_of_col = len(df.columns)
        include_index = not isinstance(df.index, pd.RangeIndex)
        pd.set_option('display.width', 1000)
        pd.set_option('colheader_justify', 'center')

        html_string = """<html>
                        <head></head>
                        <style>
                                .tablestyle {
                                    font-size: 10pt;
                                    font-family: Arial;
                                    border-collapse: collapse;
                                    border: 1px solid silver;
                                }

                                .tablestyle td, th {
                                    padding: 5px;
                                }

                                .tablestyle tr:nth-child(even) {
                                    background: #E0E0E0;
                                }

                                .tablestyle tr:hover {
                                    background: silver;
                                    cursor: pointer;
                                }
                        </style>
                        <body>
                        """+ df.to_html(classes='tablestyle')+"""</body></html>"""

        # size_css = weasyprint.CSS(string=("@page {size: A3; margin: 0in 0.44in 0.2in 0.44in;}"))
        # pdf = weasyprint.HTML(string=html_string).write_pdf(stylesheets=[size_css])
        # gc.collect()
        # return pdf

Note: I commented out the last lines of the code for testing purposes.

I am trying to print a pandas DataFrame. Everything works well except for memory usage and CSS.

Version 1.
I simply printed the page without mentioning the size, and the styling on tables worked.

Version 2.
I introduced size_css because my content was large and I needed A3 paper. After that, the styling on tables stopped working, and I am not sure why.

I also noticed performance issues when I ran this on 1000+ rows: it eats up a lot of memory, and I am not sure why. I read issue #220 about this and tried the @font-face approach, but it did not help.

I ran this once and it ate up 1.4 GB of RAM; then a second time, right after the previous one, it added up and ate 2.1 GB of memory.

I thought I might need to call gc.collect() manually, but it has no effect, hence it is commented out in the code.

I also thought the HTML string might be getting too big, so I tested without rendering any PDF, but it turns out to be less than 10 MB.

When I limit the dataset to something small, such as 50-100 rows, it behaves quite well, and the memory does not add up across subsequent prints the way it does with large datasets.

I will attach the table's CSV for your testing, along with the rendered PDF, where you can see the table styling difference I mentioned.

Thanks!

@gaurav1999
Author

gaurav1999 commented Apr 22, 2020

Data_and_pdf.zip

In this archive, the data is from the large table, and you will see that:

  1. My rows printed are just 4999
  2. My table formatting is not like mentioned in styling

Note: I queried 10,000 rows out of this data of 50,000 rows.

There is another file, Boys0604_2212.pdf, which shows that the table CSS was rendered in the PDF before I added the @page size CSS.

Thanks.

@liZe
Member

liZe commented Apr 22, 2020

#70 is probably interesting to read and could give expected levels of memory needed to render long tables.

I’ll check your example as soon as possible.

@gaurav1999
Author

Yes, I checked that issue out; you mentioned StyleDict and the deduplication of some rules. I am not familiar with those, but they are probably not in my code.

Can you point out what I might need to improve in my code in light of that issue?

Also, an update:

I think the CSS is applied to some extent. If you compare Boys.pdf and Test.pdf, the borders are there, but the cell highlighting seems to be what is missing.
So I should reframe my issue as follows:

Once I applied @page to resize my page to A3,
these CSS rules in the style block:

.tablestyle tr:nth-child(even) {
    background: #E0E0E0;
}
.tablestyle tr:hover {
    background: silver;
    cursor: pointer;
}

have no effect.

Thanks for looking into this, really appreciate it.

@gaurav1999
Author

gaurav1999 commented Apr 30, 2020

Any updates on this issue?

@liZe
Member

liZe commented Jun 10, 2020

I’m back, sorry for the delay…

> 1. My rows printed are just 4999

They’re 5000, the first one is 0 😉.

> 2. My table formatting is not like mentioned in styling

There’s no reason why it shouldn’t work. Maybe there’s a problem in the CSS you generate? Could you please provide the generated HTML file?
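One way to produce the generated HTML file requested here is to write the exact string passed to `weasyprint.HTML(string=...)` to disk before rendering. The following is only a minimal sketch; the helper name `dump_html` and the default file name are hypothetical and not part of WeasyPrint or the reporter's code.

```python
from pathlib import Path


def dump_html(html_string, path="generated.html"):
    """Write the HTML that would be handed to the renderer to a file.

    Hypothetical helper: the saved file can be opened in a browser or
    attached to the issue to check the generated CSS and markup.
    """
    target = Path(path)
    target.write_text(html_string, encoding="utf-8")
    return target
```

Calling `dump_html(html_string)` just before the `write_pdf()` call would capture the same markup and CSS that WeasyPrint receives.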

> I ran this once it ate up 1.4 Gig of Ram, then on second time just after the previous one it added up and ate 2.1 Gig of memory.

It’s not normal to have such a difference. If the variable holding the first document is deleted (by using del or going out of scope), most of the memory should be freed (it is for me). Could you please provide a simple Python script with this problem?
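A simple script for this kind of check could look like the sketch below. It is an illustration under assumptions, not the maintainers' test procedure: `measure_runs` is a hypothetical helper, the renderer is passed in as a callable, and `tracemalloc` only counts memory allocated through Python, so it will not match the process size shown by a system monitor and does not see allocations made inside native libraries.

```python
import gc
import tracemalloc


def measure_runs(render, html, runs=3):
    """Call render(html) several times and record retained Python memory.

    Returns one value per run: the number of traced bytes still allocated
    after the result has been deleted and the garbage collector has run.
    Steadily growing values suggest that something keeps references alive.
    """
    retained = []
    tracemalloc.start()
    try:
        for _ in range(runs):
            pdf = render(html)
            del pdf        # drop the only reference to the rendered document
            gc.collect()   # also collect reference cycles
            current, _peak = tracemalloc.get_traced_memory()
            retained.append(current)
    finally:
        tracemalloc.stop()
    return retained


# Possible usage with WeasyPrint (not executed here):
# import weasyprint
# print(measure_runs(lambda h: weasyprint.HTML(string=h).write_pdf(), html_string))
```

If the values stay roughly flat between runs, the memory is being released at the Python level, and any remaining growth seen by the operating system would have to be investigated with other tools.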

@liZe liZe added the performance Too slow renderings label Jun 10, 2020
@gaurav1999
Author

Hey liZe, thanks for getting back to me. It seems the row count issue was my own mistake, I get it now 👍, and the same goes for the styling. I will get back to you about performance.

I am really embarrassed about the typo that messed up my styling.

When I revisited the code after a long time to prepare samples for you, I realised my mistake, thanks to you :)
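For readers who hit the same row-count confusion: a default pandas index (a `RangeIndex`) numbers rows from 0, so the last label in a 5000-row table is 4999. The arithmetic can be checked with plain Python, using `range` as a stand-in for those labels:

```python
# Labels 0..4999, the same numbering a default pandas RangeIndex gives 5000 rows.
row_labels = range(5000)

first, last, count = row_labels[0], row_labels[-1], len(row_labels)
print(first, last, count)  # 0 4999 5000
```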

@gaurav1999
Author

gaurav1999 commented Jun 10, 2020

The code I am using is:

import weasyprint


def get_pdf(self):
    # Fetch the chart data as a pandas DataFrame.
    df = self.get_df()
    no_of_col = len(df.columns)

    # Wrap the DataFrame's HTML table in a page with the table styling.
    html_string = """<html>
        <head></head>
        <style>
            .tablestyle {
                font-size: 11pt;
                font-family: Arial;
                border-collapse: collapse;
                border: 1px solid silver;
            }
            .tablestyle td, th {
                padding: 5px;
            }
            .tablestyle tr:nth-child(even) {
                background: #E0E0E0;
            }
            .tablestyle tr:hover {
                background: silver;
                cursor: pointer;
            }
        </style>
        <body>
        """ + df.to_html(classes='tablestyle') + """</body></html>"""

    # Use A3 for wide tables, A4 otherwise (config is the application's own config object).
    if no_of_col > config.get("SOME_CONFIG_FLAG"):
        size_css = weasyprint.CSS(string="@page {size: A3; margin: 0in 0.44in 0.2in 0.44in;}")
    else:
        size_css = weasyprint.CSS(string="@page {size: A4;}")

    pdf = weasyprint.HTML(string=html_string).write_pdf(stylesheets=[size_css])
    del html_string
    return pdf
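A common general mitigation for memory that accumulates in a long-running web process, not something proposed in this thread, is to render in a short-lived child process, whose memory goes back to the operating system when it exits. The sketch below assumes the `weasyprint` command-line tool is installed and accepts `-` for standard input and standard output; `render_in_subprocess` is a hypothetical helper, and the `@page` rule would have to be placed in the document's own `<style>` block because no separate stylesheet is passed.

```python
import subprocess


def render_in_subprocess(html, command=("weasyprint", "-", "-"), timeout=300):
    """Render HTML to PDF bytes in a child process.

    The HTML is sent on the child's stdin and the PDF is read from its
    stdout. The default command is an assumption about the installed CLI.
    """
    completed = subprocess.run(
        list(command),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout
```

The trade-off is the start-up cost of a new interpreter per request, which is usually small next to rendering thousands of rows.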

I am adding this code for the export features in the apache/incubator-superset project, in the file viz.py.

When I downloaded a chart with 6000 rows as a PDF, I got the response, but it initially consumed 1.6 GB of RAM. When I launched a second request after the first one had finished, the number jumped to 2.3 GB. Later I launched two simultaneous requests and the number jumped further to 3.9 GB. I am not sure why this is happening, and it is of course not good when multiple people are using the web app and printing charts.

I am attaching the CSV data and the PDF that gets printed.

So it seems the styling is working and I am getting all the rows; in the end, performance is the huge bottleneck.

Thanks for taking a look. I will be happy to provide a modified Superset branch if you want to test this yourself on apache/superset.

Screenshot from 2020-06-10 17-30-58
Screenshot from 2020-06-10 17-27-29
Screenshot from 2020-06-10 17-26-08
Screenshot from 2020-06-10 17-02-51

Test_Flight_Data1006_173.zip

@liZe liZe changed the title from “Weasyprint consuming a lot of memory when rendering tables of size 5000*8 , not freeing memory and limits itself to 4999 rows, Not implementing table CSS as well when I use the Page CSS.” to “WeasyPrint consuming a lot of memory when rendering tables of size 5000 rows” Jan 16, 2021
@liZe liZe changed the title from “WeasyPrint consuming a lot of memory when rendering tables of size 5000 rows” to “WeasyPrint consuming a lot of memory when rendering tables with 5000 rows” Jan 16, 2021
@liZe
Member

liZe commented Aug 31, 2023

Related to #1950 and #1923.

@liZe
Member

liZe commented Aug 3, 2024

With recent versions of WeasyPrint, there’s a difference of less than 20% between rendering long tables or the same amount of divs. WeasyPrint still uses too much memory, but tables are now not that much worse than other boxes.

Rendering times have been improved with 50456df too.
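To try this kind of table-versus-divs comparison on your own setup, the two inputs can be generated from the same data. This is only a sketch under assumptions: the function names are hypothetical, one div per cell is a guess at what "the same amount of divs" means, and the 5000 by 8 default comes from the original issue title.

```python
from html import escape


def sample_rows(n_rows=5000, n_cols=8):
    """Build placeholder cell values such as 'r0c0', 'r0c1', ..."""
    return [[f"r{r}c{c}" for c in range(n_cols)] for r in range(n_rows)]


def build_table(rows):
    """Return an HTML document with one table row per data row."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<html><body><table>{body}</table></body></html>"


def build_divs(rows):
    """Return an HTML document with one div per cell (an assumption)."""
    body = "".join(f"<div>{escape(str(cell))}</div>" for row in rows for cell in row)
    return f"<html><body>{body}</body></html>"
```

Both strings can then be rendered with `weasyprint.HTML(string=...).write_pdf()` while recording time and memory with whatever tool you prefer.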

@liZe liZe closed this as completed Aug 3, 2024

2 participants