Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Applying a lot of styles causes spreadsheet to need repair #32

Closed
rickharrison opened this issue Apr 1, 2017 · 19 comments
Closed

Applying a lot of styles causes spreadsheet to need repair #32

rickharrison opened this issue Apr 1, 2017 · 19 comments

Comments

@rickharrison
Copy link

When applying a ton of styles to a large amount of cells in a workbook, I then get the message "Excel could not open filename because some content is unreadable. Do you want to open and repair this workbook?"

As an example, I am using this code:

lineItemSheet.range(1, 2, lineItemRowOffset + 1, 2).style('numberFormat', 'mm/dd/yy');
lineItemSheet.range(1, 3, lineItemRowOffset + 1, 3).style('numberFormat', 'h:mm AM/PM');
lineItemSheet.range(1, 13, lineItemRowOffset + 1, 21).style('numberFormat', '$#,##0.00');

lineItemRowOffset in this case is around 15,000 so styles are being applied to tens of thousands of cells.

Is there a better way to apply styles or is this a bug?

@dtjohnson
Copy link
Owner

Hmmm... it seems that Excel has an undocumented limit of about 65k styles. xlsx-populate currently creates a new style for each cell, which isn't the most efficient. There are 3 style items on the roadmap that each should help with your problem (in order of priority):

  1. Because a style is created per cell, there are many duplicate styles in place. Instead we should be reusing nodes across styles if they are identical. That's a little tricky to do, but it's the right approach.
  2. We currently create font, border, etc. nodes per style even if they aren't used. We'd save some space in the XML output by not emitting those nodes if they are empty.
  3. Setting row/column default styles. Then you could do something like sheet.column('A').style('numberFormat', 'mm/dd/yy') to apply the style to all of the cells in the column.

Unfortunately, none of these are in place right now. The only suggestion I have for the moment is to style a workbook in Excel, save it in your repo, and then populate the template with xlsx-populate.

@rickharrison
Copy link
Author

rickharrison commented Apr 1, 2017

Ahhh got it thanks. # 3 would be the best approach for us. I'll see if the template idea works, thanks so much for this library!

@rickharrison
Copy link
Author

@dtjohnson Are you aware of any memory issues when working with very large datasets (I.E. inserting basic text values into tens of thousands of rows across dozens of columns)?

My company would be willing to sponsor your time to look into this issue we are having. Please let me know if you would be open to that.

@dtjohnson
Copy link
Owner

@rickharrison, I have a very demanding day job already so I'm not interested in compensation, but I'd love it if you'd support the project by submitting code, fielding issues, writing docs, etc. :)

There was an earlier issue with regard to memory usage. I did some enhancements and was able to populate about 2 million cells in V8 (less in IE). That was about the upper limit though. There is some more optimization possible to maybe get it to 3 or 4 million, but I'd prefer to get more of the XLSX spec covered before optimizing too much.

If you are adding styles that will chew up a lot more memory. I did just push this branch which has initial support for row/column styles.

I'm happy to look at your workbook (or a representative example) to see if anything strange pops out.

@rickharrison
Copy link
Author

Hmm do you have any suggestions on how to get around all the object instantiation? All we actually need to do is insert basic text columns. Should we try to edit the sheet1.xml file directly?

@dtjohnson
Copy link
Owner

From my tests it seems the major culprit is that for each cell node in the XML we instantiate two child arrays to represent the XML structure. Those arrays take up a lot of memory. It's certainly possible to eliminate them, but it may only get us to 3 million cells...

You might also consider trying two other libraries: xlsx and ExcelJS. I can't vouch for them personally. In order to preserve existing workbook features, xlsx-populate uses an inefficient data model. These other libraries might perform better.

@rickharrison
Copy link
Author

Thanks for the suggestion. I am using your library to insert super basic text data, but the caveat is I'm inserting it into a pretty complicated spreadsheet. That is why your library has worked so well thus far (and thanks so much for creating it!)

Are those arrays created when reading a workbook with data already in it? I could close and re-open the workbook in pages to drop memory

@dtjohnson
Copy link
Owner

They are, I'm afraid. Is all of your data in one sheet? Or is it split in multiple?

@rickharrison
Copy link
Author

There are 7 sheets in the workbook and I am inserting data into the first 2 sheets both of which are blank initially (I could just use the first sheet if that would solve something)

@dtjohnson
Copy link
Owner

There's not really a way to get around having an entire sheet in memory. Currently, all sheets are loaded into memory when you parse a workbook. If the data is split into multiple sheets, we could limit the memory hit by only loading sheets that are needed. Otherwise, the only solution is to optimize the data structure for the single sheet.

@rickharrison
Copy link
Author

@dtjohnson This library has been super awesome to us, but I think I am going to try manually editing the workbook. If I am just inserting very basic data to one sheet is it as simple as adding <row>'s to xl/worksheets/sheet1.xml ?

@dtjohnson
Copy link
Owner

That's the basic idea, but it depends on what exactly you are doing. There are some annoying issues that come up though.

On the bright side, I did think of a solution to massively decrease memory usage while still retaining the original workbook features. In my tests I was able to raise the max number of populated cells from 2 million to almost 10 million. It looks like it could go higher, but I ran into the V8 max string length limit when generating the XML.

There's more work to do on the code to get it out the door, but stay tuned.

@rickharrison
Copy link
Author

Awesome, I'm looking forward to your solution :)

@dtjohnson
Copy link
Owner

I just pushed the improved performance to https://github.com/dtjohnson/xlsx-populate/tree/performance

It's not fully tested yet but you can give it a whirl as-is. I was able to produce a workbook with 9.5 million numbers. You'll get less with other data types and with styles, etc.

@rickharrison
Copy link
Author

Great, thanks for working on this! I'll try it out

@dtjohnson
Copy link
Owner

I just published v1.6.0, which included fully-tested support for this higher performance feature.

@rickharrison
Copy link
Author

Thanks again for all your hard work!

@davidconvista
Copy link

Hello,

I am facing this issue in an excel export for which I'm applying individual cell styles. @dtjohnson I would like to know how to take the number 1 approach you mentioned:

Because a style is created per cell, there are many duplicate styles in place. Instead we should be reusing nodes across styles if they are identical. That's a little tricky to do, but it's the right approach.

I have a few styles that I would like to reuse across cells.

Could you please help me with that?

Thank you in advance

@Sleeper9
Copy link

For anyone interested: if you need per-column custom styles, there's an easy way to prevent style overflow. I left a comment in this issue #188

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants