Xlsx Reader Better Namespace Handling Phase 1 Try2 #2173

oleibman · 2021-06-19T03:47:33Z

This is a replacement for #2088, which has run into merge conflicts. I will close that PR in the near future; however, the comments in that PR may prove useful for this one. While that PR has been in draft status all along, I am marking this one as ready. I will gladly add any additional tests (and, of course, make any code changes) that anyone suggests, but, with my most recent test files, which I will describe in a separate comment, I have no further ideas for useful additions.

As mentioned in the earlier ticket, this is a risky change. But, as has been demonstrated, delaying it comes with its own set of risks. It would be helpful to have a temporary moratorium on changes to Reader/Xlsx until this change is merged.

The original commit message follows.

There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel. I have studied them but, until now, have not had a good idea of how to act on them. A recent comment by @IMSoP in issue #860 has triggered an idea about how to proceed.

Gnumeric Reader was recently changed to handle namespaces better. Using that as a model, this PR begins the process of doing the same for Xlsx. Xlsx is much larger and more complicated than Gnumeric, hence the need to tackle it in multiple phases. I believe that this PR handles all of:

- listWorkSheetNames
- listWorkSheetInfo. Note that there was a bug in this function which would cause it to count only used columns rather than all columns. That bug is corrected.
- active sheet
- selected cell and top left cell
- cell content (formulas, numbers, text)
- hyperlinks
- comments (partial - see below)

This PR does not address:

- styles
- images and charts
- VBA and ribbons
- many other items, I'm sure

The issue with non-standard namespacing until now has been the use of unexpected prefixes. While I was working on this change, @Lambik opened issue #2067 and PR #2068, which raised a completely different problem: the use of unexpected URLs. That PR and its associated issue were quite well documented, including a test file and tests for it. I asked if I could take a look to see whether it could be integrated with my change, and the answer seems to be yes, so those changes are also part of this PR.
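To make the two problems concrete, here is a small illustration in Python (not the PhpSpreadsheet code, which is PHP). It assumes the usual main spreadsheet namespace URI and, for the "unexpected URL" case, the purl.oclc.org form; both constants and the `sheet_names` helper are hypothetical names for this sketch. The point is that matching on the namespace URI makes the prefix irrelevant, while an unexpected URI has to be handled explicitly.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for the main workbook part: the usual one and the
# purl.oclc.org variant. Verify both against real files before relying on them.
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
PURL_NS = "http://purl.oclc.org/ooxml/spreadsheetml/main"


def sheet_names(workbook_xml):
    """Return sheet names, matching on namespace URI rather than on prefix."""
    root = ET.fromstring(workbook_xml)
    for ns in (MAIN_NS, PURL_NS):
        sheets = root.findall(f"{{{ns}}}sheets/{{{ns}}}sheet")
        if sheets:
            return [sheet.get("name") for sheet in sheets]
    return []


# The same content spelled three ways: default namespace, an unexpected
# prefix, and an unexpected URI.
DEFAULT_PREFIX = (
    f'<workbook xmlns="{MAIN_NS}"><sheets><sheet name="Data"/></sheets></workbook>'
)
ODD_PREFIX = (
    f'<x:workbook xmlns:x="{MAIN_NS}">'
    '<x:sheets><x:sheet name="Data"/></x:sheets></x:workbook>'
)
ODD_URI = (
    f'<workbook xmlns="{PURL_NS}"><sheets><sheet name="Data"/></sheets></workbook>'
)
```

All three documents yield the same sheet list here, because the lookup never mentions a prefix and tries both URIs.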

While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add support for them, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult.

Although there are a number of Phpstan baseline changes as part of this PR, I did not attempt to resolve all Phpstan reports for Reader/Xlsx. Nor did I do anything to increase coverage. This change is already large and complex enough without those efforts.

I will add more detail as comments after I push this change.

This is:

- [x] a bugfix
- [ ] a new feature

Checklist:

Changes are covered by unit tests
Code style is respected
Commit message explains why the change is made (see https://github.com/erlang/otp/wiki/Writing-good-commit-messages)
CHANGELOG.md contains a short summary of the change
Documentation is updated as necessary

Why this change is needed?


MarkBaker · 2021-06-19T08:19:02Z

Let me know when you're happy with it: it might be one of the riskiest changes we've ever made to the codebase, but it's long overdue, and you've done great work in introducing it.
I know I have some refactoring work to do for the benefit of Pivot Table reading, but I have more than enough other things I want to look at (array functions, the new spilled range and implicit intersection operators, empty operands in IF() with lazy evaluation, LET() and LAMBDA(), some TLC for charts, etc.) that will keep me occupied and away from the Xlsx Reader until then.

oleibman · 2021-06-21T13:25:38Z

Forgot to add "more detail" when I pushed this change. Here it is.

I think this change is ready, but it requires more scrutiny than most changes. It is unusually likely that I didn't dot an "i" or cross a "t" somewhere that isn't caught in unit testing. Part of the reason is that test coverage of Reader/Xlsx isn't all that high in the first place, and a fair proportion of the coverage comes from samples. Here's an explanation for some of what I've done, since my motives won't always be obvious.

Added a Reader/Xlsx/Namespaces class to eliminate the hard-coded URLs found in dozens of places throughout Reader/Xlsx and its subclasses.
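The idea of collecting the URLs in one place can be sketched as follows. This is a Python analogy, not the contents of the actual PHP class: the constant names and the `workbook_target` helper are hypothetical, and only three well-known URIs are shown.

```python
import xml.etree.ElementTree as ET


class Namespaces:
    """Single home for namespace and relationship-type URIs (names are illustrative)."""

    MAIN = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
    RELATIONSHIPS = "http://schemas.openxmlformats.org/package/2006/relationships"
    OFFICE_DOCUMENT = (
        "http://schemas.openxmlformats.org/officeDocument/2006/"
        "relationships/officeDocument"
    )


def workbook_target(rels_xml):
    """Find the workbook part named in a package-level relationships document."""
    root = ET.fromstring(rels_xml)
    for rel in root.findall(f"{{{Namespaces.RELATIONSHIPS}}}Relationship"):
        if rel.get("Type") == Namespaces.OFFICE_DOCUMENT:
            return rel.get("Target")
    return None


# A minimal relationships document used only for this example.
RELS = (
    f'<Relationships xmlns="{Namespaces.RELATIONSHIPS}">'
    f'<Relationship Id="rId1" Type="{Namespaces.OFFICE_DOCUMENT}" '
    'Target="xl/workbook.xml"/>'
    "</Relationships>"
)
```

With the constants centralized, a later change such as accepting an alternative URI touches one class instead of dozens of string literals.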

Added 2 routines, loadZip and loadZipNoNamespace, to handle the reading of the zip files. Both can specify a namespace, but the latter ignores it. Any calls to loadZipNoNamespace should act exactly the same as before, but they indicate that I either haven't analyzed the call yet, or haven't been able to determine whether specifying a namespace would be a more correct action. In a couple of cases, I call both because namespaced seems to work with part of the code that follows, but not with other parts.
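As a rough analogy only, the contrast between the two loaders might look like this in Python. The function names mirror the description above, but the bodies are assumptions made for illustration (the real routines are PHP and work with SimpleXML): one loader scopes its result to the requested namespace, the other accepts the same argument and ignores it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def load_zip(archive, part, ns):
    """Parse a part and return only the root's children in namespace ns."""
    root = ET.fromstring(archive.read(part))
    return [child for child in root if child.tag.startswith(f"{{{ns}}}")]


def load_zip_no_namespace(archive, part, ns=""):
    """Same signature, but ns is deliberately ignored: every child is returned."""
    root = ET.fromstring(archive.read(part))
    return list(root)


# Build a tiny in-memory archive with one part that mixes two namespaces.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "part.xml",
        '<root xmlns:a="urn:a" xmlns:b="urn:b"><a:x/><b:y/><a:z/></root>',
    )
archive = zipfile.ZipFile(buffer)

scoped = load_zip(archive, "part.xml", "urn:a")
unscoped = load_zip_no_namespace(archive, "part.xml", "urn:a")
```

Keeping the ignored parameter in the second signature means a call site can be switched from one loader to the other later without changing its arguments.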

Introduced 2 functions, testSimpleXml and xpathNoFalse, to attempt to eliminate the plethora of SimpleXMLElement|false and array|false possibilities that Phpstan (and, doubtless, Scrutinizer) dislike. Similarly, I have tried to cast SimpleXMLElement to string once in cases where it is done several times within a few statements.

I did not do any major refactoring with this PR. Reader/Xlsx is desperately in need of it; I just couldn't figure out how to break it up with this PR. I plan to come back to it. I do, however, have some ideas for how to refactor Styles, which is what I plan to do next if this PR is merged.

The new test file namespacestd.xlsx and the tests associated with it are based on the Sylk test file, because I know that covers a lot of cases. The new test file namespacenonstd.xlsx is crafted by hand from namespacestd.xlsx so that the two can run identical tests (and also because, despite the number of namespace issues, I couldn't find much in the way of useful example files). Some of the nonstd tests are, of course, marked incomplete for now.

The test file namespacepurl.xlsx and its tests come from @Lambik PR #2068.

The new test file issue2109b.xlsx is based on a file uploaded for issue #2109. I have manually confirmed that I can read and make an unstyled copy of the uploaded file, but it is much too large to add to the unit test suite. I used sed and vim to delete all but the first few rows of the spreadsheet. This brings it to a manageable size for the test suite. Although it required some manual manipulation on my part, I think we can classify it as "naturally" created (as opposed to namespacenonstd).

The new test file openpyxl35.xlsx was created with ancient version 2.3.5 of openpyxl, before its namespacing was changed to be PhpSpreadsheet-compatible. Not all the xml files within it use unexpected namespacing, but xl/themes/theme1.xml, xl/workbook.xml, docprops/app.xml, and docprops/core.xml all do, and I think that the last 3 are definitely involved in the unit tests. The technique used to create this file might prove helpful in targeting future tests.

So, we now have files generated in 4 different ways that would not have been readable by PhpSpreadsheet before this change. I am open to adding new methods, but I think this is a sufficient number.

oleibman · 2021-06-21T13:27:57Z

I am going to continue to study the change for a few more days to see if any tweaks or new test files are needed. I expect to give it a "go ahead" signal before Friday.

MarkBaker · 2021-06-24T10:44:33Z

JFI. I'd just merged a PR that created a merge conflict with this PR, so I've resolved it. All checks still passing.

oleibman · 2021-06-24T19:53:56Z

Well, I'm nervous, but I guess I'll always be so for this PR. I have not thought of any meaningful additional tests that I can add. So, whenever you're ready, cross your fingers and let 'er rip.

I do not plan to even start on phase 2 for at least a month (longer if you deem advisable). This will, I hope, give time for errors to be noticed and bugs filed. It will also give time for other modifications to be made to Reader/Xlsx.

MarkBaker

It's too big a change for a proper review, but I know what you were doing with this, and it's passing all tests, including the new tests to verify the namespacing changes.

oleibman added 3 commits June 18, 2021 20:45

Scrutinizer

2723167

Add 2 casts to eliminate minor Scrutinizer problem.

One More Attempt to Satisfy Scrutinizer

a488574

We shall see.

oleibman mentioned this pull request Jun 22, 2021

Xlsx Reader Better Namespace Handling Phase 1 #2088

Closed

5 tasks

Merge branch 'master' into namexlsx5

773db9e

Merge branch 'master' into namexlsx5

2f67e81

MarkBaker merged commit cd84020 into PHPOffice:master Jun 25, 2021

MarkBaker reviewed Jun 25, 2021

This was referenced Oct 30, 2021

Fixes issue 2067: an xlsx file with a different xml namespace than the usual #2068

Closed

Issue with reading this xlsx file #2067

Closed

oleibman deleted the namexlsx5 branch February 13, 2022 09:02

oleibman mentioned this pull request Sep 30, 2024

Excel (.xlsx) parsing fails with an NPE #1172

Closed
