Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xlsx Reader Better Namespace Handling Phase 1 Try2 #2173

Merged
merged 5 commits into from
Jun 25, 2021

Conversation

oleibman
Copy link
Collaborator

This is a replacement for #2088, which has run into merge conflicts. I will close that PR in the near future, however the comments in that PR may prove useful for this one. While that PR has been in draft status all along, I am marking this one as ready. I will gladly add additional tests (and, of course, make code changes) that anyone has to suggest, but, with my most recent test files which I will describe in a separate comment, I have no further ideas on useful additions.

As mentioned in the earlier ticket, this is a risky change. But, as has been demonstrated, delaying it comes with its own set of risks. It would be helpful to have a temporary moratorium on changes to Reader/Xlsx until this change is merged.

The original commit message follows.

There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment #860 (comment) in issue #860 by @IMSoP has triggered an idea about how to proceed.

Gnumeric Reader was recently changed to handle namespaces better. Using that as a model, this PR begins the process of doing the same for Xlsx. Xlsx is much larger and more complicated than Gnumeric, hence the need to tackle it in multiple phases. I believe that this PR handles all of:

  • listWorkSheetNames
  • listWorkSheetInfo. Note that there was a bug in this function which would cause it to count only used columns rather than all columns. That bug is corrected.
  • active sheet
  • selected cell and top left cell
  • cell content (formulas, numbers, text)
  • hyperlinks
  • comments (partial - see below)

This PR does not address:

  • styles
  • images and charts
  • VBA and ribbons
  • many other items, I'm sure

The issue for non-standard namespacing till now has been the use of unexpected prefixes. While I was working on this change, @Lambik introduced issue #2067 PR #2068 which introduced a completely different problem - the use of unexpected URLs. That PR and the issue associated with it were quite well documented, including the supplying of a test file and tests for it. I asked if I could take a look to see if it could be integrated with my change, and the result seems to be yes, so those changes are also part of this PR.

While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult.

Although there are a number of Phpstan baseline changes as part of this PR, I did not attempt to resolve all Phpstan reports for Reader/Xlsx. Nor did I do anything to increase coverage. This change is already large and complex enough without those efforts.

I will add more detail as comments after I push this change.

This is:

- [x] a bugfix
- [ ] a new feature

Checklist:

Why this change is needed?

oleibman added 3 commits June 18, 2021 20:45
This is a replacement for PHPOffice#2088, which has run into merge conflicts. I will close that PR in the near future, however the comments in that PR may prove useful for this one. While that PR has been in draft status all along, I am marking this one as ready. I will gladly add additional tests (and, of course, make code changes) that anyone has to suggest, but, with my most recent test files which I will describe in a separate comment, I have no further ideas on useful additions.

As mentioned in the earlier ticket, this is a risky change. But, as has been demonstrated, delaying it comes with its own set of risks. It would be helpful to have a temporary moratorium on changes to Reader/Xlsx until this change is merged.

The original commit message follows.

There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment PHPOffice#860 (comment) in issue PHPOffice#860 by @IMSoP has triggered an idea about how to proceed.

Gnumeric Reader was recently changed to handle namespaces better. Using that as a model, this PR begins the process of doing the same for Xlsx. Xlsx is much larger and more complicated than Gnumeric, hence the need to tackle it in multiple phases. I believe that this PR handles all of:
- listWorkSheetNames
- listWorkSheetInfo. Note that there was a bug in this function which would cause it to count only used columns rather than all columns. That bug is corrected.
- active sheet
- selected cell and top left cell
- cell content (formulas, numbers, text)
- hyperlinks
- comments (partial - see below)

This PR does not address:
- styles
- images and charts
- VBA and ribbons
- many other items, I'm sure

The issue for non-standard namespacing till now has been the use of unexpected prefixes. While I was working on this change, @Lambik introduced issue PHPOffice#2067 PR PHPOffice#2068 which introduced a completely different problem - the use of unexpected URLs. That PR and the issue associated with it were quite well documented, including the supplying of a test file and tests for it. I asked if I could take a look to see if it could be integrated with my change, and the result seems to be yes, so those changes are also part of this PR.

While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult.

Although there are a number of Phpstan baseline changes as part of this PR, I did not attempt to resolve all Phpstan reports for Reader/Xlsx. Nor did I do anything to increase coverage. This change is already large and complex enough without those efforts.

I will add more detail as comments after I push this change.
Add 2 casts to eliminate minor Scrutinizer problem.
@MarkBaker
Copy link
Member

Let me know when you're happy with it: it might be one of the riskiest changes we've ever made to the codebase, but it's long overdue, and you've done great work in introducing it.
I know I do some refactoring work to benefit the Pivot Table reading; but I've more than enough other things I want to look at (Array functions, the new spilled range and implicit intersection operators, empty operands in IF() with lazy evaluation, LET() and LAMBDA(), some TLC for charts, etc) that will keep me occupied and away from the Xlsx Reader until then.

@oleibman
Copy link
Collaborator Author

Forgot to add "more detail" when I pushed this change. Here it is.

I think this change is ready, but it requires more scrutiny than most changes. It is unusually likely that I didn't dot an "i" or cross a "t" somewhere that isn't caught in unit testing. Part of the reason is because test coverage of Reader/Xlsx isn't all that high in the first place, and a fair proportion of the coverage comes from samples. Here's an explanation for some of what I've done, since my motives won't always be obvious.

Added a Reader/Xlsx/Namespaces class to eliminate the hard-coded urls found in dozens of places throughout Reader/Xlsx and its subclasses.

Added 2 routines, loadZip and loadZipNoNamespace, to handle the reading of the zip files. Both can specify a namespace, but the latter ignores it. Any calls to loadZipNoNamespace should act exactly the same as before, but they indicate that I either haven't analyzed the call yet, or haven't been able to determine whether specifying a namespace would be a more correct action. In a couple of cases, I call both because namespaced seems to work with part of the code that follows, but not with other parts.

Introduced 2 functions, testSimpleXml and xpathNoFalse, to attempt to eliminate the plethora of SimpleXMLElement|false and array|false possibilities that Phpstan (and, doubtless, Scrutinizer) dislike. Similarly, I have tried to cast SimpleXMLElement to string once in cases where it is done several times within a few statements.

I did not do any major refactoring with this PR. Reader/Xlsx is desparately in need of it; I just couldn't figure out how to break it up with this PR. I plan to come back to it. I do, however, have some ideas for how to refactor Styles, which is what I plan to do next if this PR is merged.

The new test file namespacestd.xlsx and the tests associated with are based on the Sylk test file, because I know that covers a lot of cases. The new test file namespacenonstd.xlsx is crafted by hand from this so that they can run identical tests (and also because, despite the number of namespace issues, I couldn't find much in the way of useful example files). Some of the nonstd tests are, of course, marked incomplete for now.

The test file namespacepurl.xlsx and its tests come from @Lambik PR #2068.

The new test file issue2109b.xlsx is based on a file uploaded for issue #2109. I have manually confirmed that I can read and make an unstyled copy of the uploaded file, but it is much too large to add to the unit test suite. I used sed and vim to delete all but the first few rows of the spreadsheet. This brings it to a manageable size for the test suite. Although it required some manual manipulation on my part, I think we can classify it as "naturally" created (as opposed to namespacenonstd).

The new test file openpyxl35.xlsx was created with ancient version 2.3.5 of openpyxl, before its namespacing was changed to be PhpSpreadsheet-compatible. Not all the xml files within it use unexpected namespacing, but xl/themes/theme1.xml, xl/workbook.xml, docprops/app.xml, and docprops/core.xml all do, and I think that the last 3 are definitely involved in the unit tests. The technique used to create this file might prove helpful in targeting future tests.

So, we now have files generated in 4 different ways that would not have been readable by PhpSpreadsheet before this change. I am open to adding new methods, but I think this is a sufficient number.

@oleibman
Copy link
Collaborator Author

I am going to continue to study the change for a few more days to see if any tweaks or new test files are needed. I expect to give it a "go ahead" signal before Friday.

@MarkBaker
Copy link
Member

JFI. I'd just merged a PR that created a merge conflict with this PR, so I've resolved it. All checks still passing.

@oleibman
Copy link
Collaborator Author

Well, I'm nervous, but I guess I'll always be so for this PR. I have not thought of any meaningful additional tests that I can add. So, whenever you're ready, cross your fingers and let 'er rip.

I do not plan to even start on phase 2 for at least a month (longer if you deem advisable). This will, I hope, give time for errors to be noticed and bugs filed. It will also give time for other modifications to be made to Reader/Xlsx.

@MarkBaker MarkBaker merged commit cd84020 into PHPOffice:master Jun 25, 2021
Copy link
Member

@MarkBaker MarkBaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's too big a change for a proper review; but I know what you were doing with this, and it's passing all tests including the new tests to verify the namespacing changes

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants