Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

For xlsx generated by java POI5, it cannot be opened by MS Excel after modification by OpenXLSX #275

Closed
abo-H opened this issue Sep 22, 2024 · 16 comments
Labels
invalid This doesn't seem right

Comments

@abo-H
Copy link

abo-H commented Sep 22, 2024

Background: Apache POI library is specially used to process MS office files. The fifth version of POI(POI5), relies on the zip library Apache Commons Comprese. The zip64 implementation of this library is opczip, which is specially implemented for MS Excel.

Problem: After opening, modifying and saving the xlsx file generated by POI5 through OpenXLSX, MS Excel will not be able to open the modified file. I tried to look at the source code of POI and OpenXLSX and found the problem here. The root cause is that there are slight differences between the zip64 standard implemented by zippy (which looks like miniz) and opczip that OpenXLSX relies on for zip64.

Notes:

  1. Because opczip is a special implementation of MS Excel (I have tested that the xlsx generated by opczip can be opened by MS Excel), I tried to modify the zippy code. There is a bad way to make it compatible with opczip (that is, MS Excel). When dealing with the optional data descriptor in mz_zip_writer_add_from_zip_reader, check whether the zip version is 0X002D, like this:
    ...
    mz_bool is_src_zip64 = MZ_FALSE; // a bad way for compatible with POI5
    if (MZ_READ_LE16(pLocal_header + MZ_ZIP_LDH_VERSION_NEEDED_OFS) >= 0X002D) {
        unsigned long len = (sizeof(mz_uint32) * 6) + 2;
        mz_uint8 tmp[len];
        if (pSource_zip->m_pRead(pSource_zip->m_pIO_opaque, cur_src_file_ofs, tmp, len) == len) {
            n = sizeof(mz_uint32) * ((MZ_READ_LE32(tmp) == MZ_ZIP_DATA_DESCRIPTOR_ID) ? 6 : 5);
            if (MZ_READ_LE16((mz_uint8*)tmp + n) == 0X4B50) {
                is_src_zip64 = MZ_TRUE;
            }
        }
    }
    if ((pSource_zip->m_pState->m_zip64) || (found_zip64_ext_data_in_ldir) || is_src_zip64) {
    ...
    
  2. Maybe this isn't Zippy's fault, but I found that zippy has many problems with its implementation of zip64. Have you considered updating OpenXLSX's dependency on zippy (it seems that zippy does not fully support zip64)?
  3. This is the file in question.POI5.2.3.xlsx
@aral-matrix
Copy link
Collaborator

Hi & thank you for the report. This sounds like an odd problem that I can only explain happening from a mix of data from different zip formats.

Considering that Excel has no problem opening archives created by OpenXLSX, I would suggest the best way forward for me to address this would be to make sure that all "metadata" generated by opczip is wiped when saving in OpenXLSX. Would you see any issues with that approach?

@aral-matrix aral-matrix self-assigned this Sep 23, 2024
@aral-matrix
Copy link
Collaborator

aral-matrix commented Sep 23, 2024

just a quick test: LibreOffice (v24.2.5.2) has no problems opening your example file after opening and saving it from OpenXLSX - is it possible that:

  1. the problem is with Excel having a shitty implementation of zip-files that chokes on something that deviates from "how Microsoft does it"? I'd obviously still try to find a solution but it'd be yet another point to add to the list why to use non-MS products ;)

  2. I saw that in the docProps/app.xml, the namespace "vt" is not defined, which OpenXLSX assumes as "the namespace" for the <HeadingPairs><vt:vector> and its children and for <TitlesOfParts><vt:vector> and its children. Is it possible that what Excel actually complains about is that the app.xml is using this namespace without declaring it?

@abo-H
Copy link
Author

abo-H commented Sep 24, 2024

Hi & thank you for the report. This sounds like an odd problem that I can only explain happening from a mix of data from different zip formats.

Considering that Excel has no problem opening archives created by OpenXLSX, I would suggest the best way forward for me to address this would be to make sure that all "metadata" generated by opczip is wiped when saving in OpenXLSX. Would you see any issues with that approach?

It is also possible to erase all the "metadata" generated by opczip when saving, but for larger xlsx files, saving and reading will be more time-consuming due to the need to parse the XML.

@abo-H
Copy link
Author

abo-H commented Sep 24, 2024

just a quick test: LibreOffice (v24.2.5.2) has no problems opening your example file after opening and saving it from OpenXLSX - is it possible that:

  1. the problem is with Excel having a shitty implementation of zip-files that chokes on something that deviates from "how Microsoft does it"? I'd obviously still try to find a solution but it'd be yet another point to add to the list why to use non-MS products ;)
  2. I saw that in the docProps/app.xml, the namespace "vt" is not defined, which OpenXLSX assumes as "the namespace" for the <HeadingPairs><vt:vector> and its children and for <TitlesOfParts><vt:vector> and its children. Is it possible that what Excel actually complains about is that the app.xml is using this namespace without declaring it?

Yes, there is no problem opening the xlsx modified by OpenXLSX through other tools (I guess it is because other tools will perform automatic repairs, and MS Excel may perform additional zip verification).

I think it is the first reason. I am not very familiar with xml, but I manually modified the problematic binary xlsx file :

  1. filled its Data Descriptor extension with the correct value.
  2. changed its compressed version number to 0X14 instead of the original 0X2D). 0X14 represents zip2.0, 0X2D represents zip4.5 (zip64 extension was introduced in zip4.5)

After my manual modification, it can be opened by MS Excel, so I think it is very likely a problem with the zip implementation. MS Excel's zip implementation is not compatible with zippy. The two have different implementations of some details of zip64. Of course, this is also because the zip specification is very vague about zip64.

@aral-matrix
Copy link
Collaborator

Hi & thank you for the report. This sounds like an odd problem that I can only explain happening from a mix of data from different zip formats.
Considering that Excel has no problem opening archives created by OpenXLSX, I would suggest the best way forward for me to address this would be to make sure that all "metadata" generated by opczip is wiped when saving in OpenXLSX. Would you see any issues with that approach?

It is also possible to erase all the "metadata" generated by opczip when saving, but for larger xlsx files, saving and reading will be more time-consuming due to the need to parse the XML.

Well - there is zero metadata in the actual XML as far as I could tell from your example. To be honest, without having looked too closely into zippy functionality, I am surprised that it is not re-creating the zip archive from scratch upon saving anyways - and that consequently, you get different behavior on xlsx-files created with OpenXLSX, and those modified with it.

Just to repeat - is it correct that you can open the files created with OpenXLSX in MS Excel on your system without problems?

E.g. the Demo0?.xlsx files?

From your last comment, it appears that maybe it is enough to "downgrade" the version number if zippy finds it "too high"?
May I ask what the "correct value" is for the Data Descriptor?

@abo-H
Copy link
Author

abo-H commented Sep 24, 2024

Hi & thank you for the report. This sounds like an odd problem that I can only explain happening from a mix of data from different zip formats.
Considering that Excel has no problem opening archives created by OpenXLSX, I would suggest the best way forward for me to address this would be to make sure that all "metadata" generated by opczip is wiped when saving in OpenXLSX. Would you see any issues with that approach?

It is also possible to erase all the "metadata" generated by opczip when saving, but for larger xlsx files, saving and reading will be more time-consuming due to the need to parse the XML.

Well - there is zero metadata in the actual XML as far as I could tell from your example. To be honest, without having looked too closely into zippy functionality, I am surprised that it is not re-creating the zip archive from scratch upon saving anyways - and that consequently, you get different behavior on xlsx-files created with OpenXLSX, and those modified with it.

Just to repeat - is it correct that you can open the files created with OpenXLSX in MS Excel on your system without problems?

E.g. the Demo0?.xlsx files?

From your last comment, it appears that maybe it is enough to "downgrade" the version number if zippy finds it "too high"? May I ask what the "correct value" is for the Data Descriptor?

Yes, xlsx files created with OpenXLSX can be opened by MS Excel.

The "correct value" of the data descriptor is the size of each XML file.

As you know, if the XML file in the zip file has a Data Descriptor, the CRC of the XML file before compression, the size before compression, and the size after compression are saved in the Data Descriptor.

@abo-H
Copy link
Author

abo-H commented Sep 24, 2024

Hi & thank you for the report. This sounds like an odd problem that I can only explain happening from a mix of data from different zip formats.
Considering that Excel has no problem opening archives created by OpenXLSX, I would suggest the best way forward for me to address this would be to make sure that all "metadata" generated by opczip is wiped when saving in OpenXLSX. Would you see any issues with that approach?

It is also possible to erase all the "metadata" generated by opczip when saving, but for larger xlsx files, saving and reading will be more time-consuming due to the need to parse the XML.

Well - there is zero metadata in the actual XML as far as I could tell from your example. To be honest, without having looked too closely into zippy functionality, I am surprised that it is not re-creating the zip archive from scratch upon saving anyways - and that consequently, you get different behavior on xlsx-files created with OpenXLSX, and those modified with it.
Just to repeat - is it correct that you can open the files created with OpenXLSX in MS Excel on your system without problems?
E.g. the Demo0?.xlsx files?
From your last comment, it appears that maybe it is enough to "downgrade" the version number if zippy finds it "too high"? May I ask what the "correct value" is for the Data Descriptor?

Yes, xlsx files created with OpenXLSX can be opened by MS Excel.

The "correct value" of the data descriptor is the size of each XML file.

As you know, if the XML file in the zip file has a Data Descriptor, the CRC of the XML file before compression, the size before compression, and the size after compression are saved in the Data Descriptor.

If the Data Descriptor is in zip64 standard, the size before and after compression will be represented by 8 bytes respectively, otherwise it will be represented by 4 bytes.

In the zippy implementation, if an XML file is not modified, the Data Descriptor in the source file will be copied to the new file. During the copying process, since the source file is recognized by zippy as zip32 format, it will be copied incorrectly, as shown below:

// origin xml's Data Descriptor
 50 4B 07 08 73 B9 D9 10 06 66 30 21 00 00 00 00 9A 90 DC 15 00 00 00 00
|-----------|-----------|-----------------------|-----------------------|
 'PK\07\08'    CRC-32    compressed size          uncompressed size


// new xml's Data Descriptor
 50 4B 07 08 73 B9 D9 10 06 66 30 21 00 00 00 00
|-----------|-----------|-----------|-----------|
 'PK\07\08'    CRC-32    compressed 	uncompressed
 				        size                   size

// The correct one should be as follows (That is the bytes I modified manually)
 50 4B 07 08 73 B9 D9 10 06 66 30 21 9A 90 DC 15
|-----------|-----------|-----------|-----------|
 'PK\07\08'    CRC-32    compressed 	uncompressed
 				        size                   size

@aral-matrix
Copy link
Collaborator

I don't know, actually - I just started contributing to OpenXLSX and didn't touch zippy beyond superficial patches so far :)

I put some debugging output in the function mz_zip_writer_add_from_zip_reader you mentioned, right before this line:

            /* Now deal with the optional data descriptor */

Unfortunately, that code - opening your example file and saving it again - never gets invoked. What I am doing is:

	XLDocument doc("./POI5.2.3.xlsx"s);
	doc.setSavingDeclaration( XLXmlSavingDeclaration(XLXmlDefaultVersion, "utf-8", XLXmlNotStandalone) );

	std::cout << "doc.name() is " << doc.name() << std::endl;

        doc.workbook().addWorksheet( "test new worksheet" );
	wks.setActive();

	doc.saveAs("./out.xlsx", XLForceOverwrite);
	doc.close();

any idea why the code section where you proposed a workaround is not even being invoked here?

@aral-matrix
Copy link
Collaborator

And thank you very much for the hex output of the problematic part - that helps me a lot to understand the compatibility issue.

@abo-H
Copy link
Author

abo-H commented Sep 24, 2024

Are you referring to the mz_zip_writer_add_from_zip_reader method not being called?

// ===== Iterate through the ZipEntries and add entries to the temporary file
for (auto& file : m_ZipEntries) {
if (file.IsDirectory()) continue; // TODO: Ensure this is the right thing to do (Excel issue)
if (!file.IsModified()) {
if (!mz_zip_writer_add_from_zip_reader(&tempArchive, &m_Archive, file.Index())) {
throw ZipRuntimeError(mz_zip_get_error_string(m_Archive.m_last_error));
}
}
else {
if (!mz_zip_writer_add_mem(&tempArchive,
file.GetName().c_str(),
file.m_EntryData.data(),
file.m_EntryData.size(),
MZ_DEFAULT_COMPRESSION)) {
throw ZipRuntimeError(mz_zip_get_error_string(m_Archive.m_last_error));
}
}
}

@aral-matrix
Copy link
Collaborator

aral-matrix commented Sep 24, 2024

Yes - in my development-aral branch, the "IsModified()" is always true. And I am trying to find out why..

But also, it appears that the bug you encounter is triggered by the files considered "unmodified" in the source archive.

@abo-H
Copy link
Author

abo-H commented Sep 24, 2024

Yes, sorry for forgetting to tell you, I modify the code here

// ===== Add remaining spreadsheet elements to the vector of XLXmlData objects.
for (auto& item : m_contentTypes.getContentItems()) {
if (item.path().substr(0, 4) == "/xl/" && item.path() != "/xl/workbook.xml")
m_data.emplace_back(/* parentDoc */ this,
/* xmlPath */ item.path().substr(1),
/* xmlID */ m_wbkRelationships.relationshipByTarget(item.path().substr(4)).id(),
/* xmlType */ item.type());
else
m_data.emplace_back(/* parentDoc */ this,
/* xmlPath */ item.path().substr(1),
/* xmlID */ m_docRelationships.relationshipByTarget(item.path().substr(1)).id(),
/* xmlType */ item.type());
}

to improve the speed by reducing the number of XML files opened and parsed.

Thank you very much for your reply. When I have time, I will study OpenXLSX and zippy's support for zip64. See how to call mz_zip_writer_add_from_zip_reader

@aral-matrix
Copy link
Collaborator

Wait, so you added the code change yourself that broke things, and then you created an issue and forgot to mention that? :P

That would have been nice to know sooner. So what did you change? Did you exclude some worksheets from being opened at all & that is what triggered the bug?

It would still be good to know because the optimization you implemented might be interesting to support.

@aral-matrix
Copy link
Collaborator

Okay, so with that knowledge, if I skip a worksheet from being parsed, I can get the Zippy save routine to return false from "IsModified()" for that file. But that creates a whole lot of different problems - worksheets are not supposed to be "skipped" when loading.
And your example does not work for me then, because without a single worksheet in the workbook, saving it will throw an exception because the workbook validation fails.

I'll mark this as invalid and close it for now - when you have a good working suggestion for a loading optimization (at the user's risk) please open a new issue with a "feature request" :)

@aral-matrix aral-matrix added the invalid This doesn't seem right label Sep 24, 2024
@aral-matrix aral-matrix removed their assignment Sep 24, 2024
@abo-H
Copy link
Author

abo-H commented Sep 24, 2024

Yes, in order to achieve a special function, I did not read out all the XML files, skipped some XML, so some XML files were not modified.

Thanks again for your reply.

@aral-matrix
Copy link
Collaborator

One more thing - if you are playing around with the code, please consider working with the development-aral branch that fixes some other issues with "non vanilla" workbooks.

However, the issue you triggered with your modifications could also happen for a workbook that contains files which are currently not supported by OpenXLSX (and therefore ignored, just like in the change that I believe you added).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
invalid This doesn't seem right
Projects
None yet
Development

No branches or pull requests

2 participants