An xlsx file generated by Java POI5 cannot be opened by MS Excel after being modified by OpenXLSX #275
Comments
Hi & thank you for the report. This sounds like an odd problem that I can only explain happening from a mix of data from different zip formats. Considering that Excel has no problem opening archives created by OpenXLSX, I would suggest the best way forward for me to address this would be to make sure that all "metadata" generated by opczip is wiped when saving in OpenXLSX. Would you see any issues with that approach? |
just a quick test: LibreOffice (v24.2.5.2) has no problems opening your example file after opening and saving it from OpenXLSX - is it possible that:
|
It is also possible to erase all the "metadata" generated by opczip when saving, but for larger xlsx files, saving and reading will be more time-consuming due to the need to parse the XML. |
Yes, there is no problem opening the xlsx modified by OpenXLSX through other tools (I guess it is because other tools will perform automatic repairs, and MS Excel may perform additional zip verification). I think it is the first reason. I am not very familiar with xml, but I manually modified the problematic binary xlsx file :
After my manual modification, it can be opened by MS Excel, so I think it is very likely a problem with the zip implementation. MS Excel's zip implementation is not compatible with zippy. The two have different implementations of some details of zip64. Of course, this is also because the zip specification is very vague about zip64. |
Well - there is zero metadata in the actual XML as far as I could tell from your example. To be honest, without having looked too closely into zippy functionality, I am surprised that it is not re-creating the zip archive from scratch upon saving anyways - and that consequently, you get different behavior on xlsx-files created with OpenXLSX, and those modified with it. Just to repeat - is it correct that you can open the files created with OpenXLSX in MS Excel on your system without problems? E.g. the Demo0?.xlsx files? From your last comment, it appears that maybe it is enough to "downgrade" the version number if zippy finds it "too high"? |
Yes, xlsx files created with OpenXLSX can be opened by MS Excel. The "correct value" of the data descriptor is the size of each XML file. As you know, if the XML file in the zip file has a Data Descriptor, the CRC of the XML file before compression, the size before compression, and the size after compression are saved in the Data Descriptor. |
If the Data Descriptor is in zip64 standard, the size before and after compression will be represented by 8 bytes respectively, otherwise it will be represented by 4 bytes. In the zippy implementation, if an XML file is not modified, the Data Descriptor in the source file will be copied to the new file. During the copying process, since the source file is recognized by zippy as zip32 format, it will be copied incorrectly, as shown below:
|
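To make the 4-byte versus 8-byte layout described above concrete, here is a minimal sketch of reading a data descriptor both ways. All names (`DataDescriptor`, `readLE`, `parseDataDescriptor`) are hypothetical and are not part of zippy or miniz. The field order (optional signature 0x08074b50, CRC-32, compressed size, uncompressed size) follows the ZIP specification.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only; not zippy's or miniz's API.
struct DataDescriptor {
    std::uint32_t crc32;
    std::uint64_t compressedSize;
    std::uint64_t uncompressedSize;
    std::size_t   byteLength;  // bytes the descriptor occupies in the archive
};

constexpr std::uint32_t kDataDescriptorSignature = 0x08074b50;

// Read a little-endian unsigned integer of `width` bytes starting at `pos`.
inline std::uint64_t readLE(const std::vector<std::uint8_t>& buf,
                            std::size_t pos, std::size_t width)
{
    assert(width <= 8 && pos + width <= buf.size());
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < width; ++i)
        value |= static_cast<std::uint64_t>(buf[pos + i]) << (8 * i);
    return value;
}

// Parse a data descriptor at `pos`. `isZip64` selects 8-byte size fields,
// otherwise 4-byte size fields are assumed. Returns false if the buffer is
// too short for the assumed layout.
inline bool parseDataDescriptor(const std::vector<std::uint8_t>& buf,
                                std::size_t pos, bool isZip64,
                                DataDescriptor& out)
{
    const std::size_t sizeWidth = isZip64 ? 8 : 4;
    std::size_t cursor = pos;

    // The signature is optional, so skip it only when it is present.
    if (cursor + 4 <= buf.size() &&
        readLE(buf, cursor, 4) == kDataDescriptorSignature)
        cursor += 4;

    if (cursor + 4 + 2 * sizeWidth > buf.size()) return false;

    out.crc32 = static_cast<std::uint32_t>(readLE(buf, cursor, 4));
    cursor += 4;
    out.compressedSize = readLE(buf, cursor, sizeWidth);
    cursor += sizeWidth;
    out.uncompressedSize = readLE(buf, cursor, sizeWidth);
    cursor += sizeWidth;
    out.byteLength = cursor - pos;
    return true;
}
```

If a descriptor written with 8-byte sizes is parsed with `isZip64 = false`, the upper half of the compressed size is read as the uncompressed size and the descriptor appears 8 bytes shorter than it really is. That is the kind of mismatch an incorrect copy would produce.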
I don't know, actually - I just started contributing to OpenXLSX and didn't touch zippy beyond superficial patches so far :) I put some debugging output in the function
Unfortunately, that code - opening your example file and saving it again - never gets invoked. What I am doing is:
any idea why the code section where you proposed a workaround is not even being invoked here? |
And thank you very much for the hex output of the problematic part - that helps me a lot to understand the compatibility issue. |
Are you referring to the OpenXLSX/OpenXLSX/external/zippy/zippy.hpp Lines 10774 to 10792 in 607360e
|
Yes - in my development-aral branch, the "IsModified()" is always true. And I am trying to find out why.. But also, it appears that the bug you encounter is triggered by the files considered "unmodified" in the source archive. |
Yes, sorry for forgetting to tell you: I modified the code here OpenXLSX/OpenXLSX/sources/XLDocument.cpp Lines 463 to 476 in 607360e
to improve the speed by reducing the number of XML files opened and parsed. Thank you very much for your reply. When I have time, I will study OpenXLSX and zippy's support for zip64. See how to call |
Wait, so you added the code change yourself that broke things, and then you created an issue and forgot to mention that? :P That would have been nice to know sooner. So what did you change? Did you exclude some worksheets from being opened at all & that is what triggered the bug? It would still be good to know because the optimization you implemented might be interesting to support. |
Okay, so with that knowledge, if I skip a worksheet from being parsed, I can get the Zippy save routine to return false from "IsModified()" for that file. But that creates a whole lot of different problems - worksheets are not supposed to be "skipped" when loading. I'll mark this as invalid and close it for now - when you have a good working suggestion for a loading optimization (at the user's risk) please open a new issue with a "feature request" :) |
Yes, to implement a special feature, I did not read all the XML files; I skipped some of them, so those XML files were not modified. Thanks again for your reply. |
One more thing - if you are playing around with the code, please consider working with the development-aral branch that fixes some other issues with "non vanilla" workbooks. However, the issue you triggered with your modifications could also happen for a workbook that contains files which are currently not supported by OpenXLSX (and therefore ignored, just like in the change that I believe you added). |
Background: The Apache POI library is designed for processing MS Office files. The fifth version of POI (POI5) relies on the zip library Apache Commons Compress. The zip64 implementation of this library is opczip, which is implemented specifically for MS Excel.
Problem: After an xlsx file generated by POI5 is opened, modified and saved through OpenXLSX, MS Excel can no longer open the modified file. I looked at the source code of POI and OpenXLSX and found the problem here. The root cause is that there are slight differences between the zip64 implementation in zippy (which looks like miniz), which OpenXLSX relies on, and the one in opczip.
Notes:
In mz_zip_writer_add_from_zip_reader, check whether the zip version is 0x002D, like this:
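As a rough sketch of what such a version check could look like (an illustration only, not the code from this report and not zippy's actual logic), the hypothetical helper below reads the "version needed to extract" field from a raw local file header and compares it to 0x002D. That value is 45, i.e. version 4.5, which the ZIP specification assigns to ZIP64 extensions. The names `LocalHeaderInfo`, `readU16LE` and `inspectLocalHeader` are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only; not zippy's or miniz's API.
constexpr std::uint32_t kLocalHeaderSignature = 0x04034b50;
constexpr std::uint16_t kZip64Version         = 0x002D;  // 45 = version 4.5
constexpr std::uint16_t kDataDescriptorFlag   = 0x0008;  // general purpose bit 3

struct LocalHeaderInfo {
    bool          valid;              // signature matched, buffer long enough
    std::uint16_t versionNeeded;      // "version needed to extract" field
    bool          hasDataDescriptor;  // sizes and CRC follow the file data
    bool          declaresZip64;      // versionNeeded >= 0x002D
};

// Read a 16-bit little-endian value; the caller guarantees the bounds.
inline std::uint16_t readU16LE(const std::vector<std::uint8_t>& buf,
                               std::size_t pos)
{
    return static_cast<std::uint16_t>(buf[pos] | (buf[pos + 1] << 8));
}

// Inspect the first 8 bytes of a local file header located at `pos`:
// signature (4 bytes), version needed (2 bytes), general purpose flags (2 bytes).
inline LocalHeaderInfo inspectLocalHeader(const std::vector<std::uint8_t>& buf,
                                          std::size_t pos)
{
    LocalHeaderInfo info = {false, 0, false, false};
    if (pos + 8 > buf.size()) return info;

    const std::uint32_t signature =
        static_cast<std::uint32_t>(readU16LE(buf, pos)) |
        (static_cast<std::uint32_t>(readU16LE(buf, pos + 2)) << 16);
    if (signature != kLocalHeaderSignature) return info;

    info.valid             = true;
    info.versionNeeded     = readU16LE(buf, pos + 4);
    info.hasDataDescriptor = (readU16LE(buf, pos + 6) & kDataDescriptorFlag) != 0;
    info.declaresZip64     = info.versionNeeded >= kZip64Version;
    return info;
}
```

An entry for which both `declaresZip64` and `hasDataDescriptor` are true is one where a copy routine would have to assume 8-byte size fields in the data descriptor rather than 4-byte ones.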