Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Xlsx Reader Better Namespace Handling Phase 1 #2088

Closed
wants to merge 18 commits into from

Conversation

oleibman
Copy link
Collaborator

There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment #860 (comment) in issue #860 by @IMSoP has triggered an idea about how to proceed.

Gnumeric Reader was recently changed to handle namespaces better. Using that as a model, this PR begins the process of doing the same for Xlsx. Xlsx is much larger and more complicated than Gnumeric, hence the need to tackle it in multiple phases. I believe that this PR handles all of:

  • listWorkSheetNames
  • listWorkSheetInfo. Note that there was a bug in this function which would cause it to count only used columns rather than all columns. That bug is corrected.
  • active sheet
  • selected cell and top left cell
  • cell content (formulas, numbers, text)
  • hyperlinks
  • comments (partial - see below)

This PR does not address:

  • styles
  • images and charts
  • VBA and ribbons
  • many other items, I'm sure

The issue for non-standard namespacing till now has been the use of unexpected prefixes. While I was working on this change, @Lambik introduced issue #2067 PR #2068 which introduced a completely different problem - the use of unexpected URLs. That PR and the issue associated with it were quite well documented, including the supplying of a test file and tests for it. I asked if I could take a look to see if it could be integrated with my change, and the result seems to be yes, so those changes are also part of this PR.

While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult.

Although there are a number of Phpstan baseline changes as part of this PR, I did not attempt to resolve all Phpstan reports for Reader/Xlsx. Nor did I do anything to increase coverage. This change is already large and complex enough without those efforts.

I will add more detail as comments after I push this change.

This is:

- [x] a bugfix
- [ ] a new feature

Checklist:

Why this change is needed?

oleibman added 5 commits May 10, 2021 21:58
There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment PHPOffice#860 (comment) in issue PHPOffice#860 by @IMSoP has triggered an idea about how to proceed.

Gnumeric Reader was recently changed to handle namespaces better. Using that as a model, this PR begins the process of doing the same for Xlsx. Xlsx is much larger and more complicated than Gnumeric, hence the need to tackle it in multiple phases. I believe that this PR handles all of:
- listWorkSheetNames
- listWorkSheetInfo. Note that there was a bug in this function which would cause it to count only used columns rather than all columns. That bug is corrected.
- active sheet
- selected cell and top left cell
- cell content (formulas, numbers, text)
- hyperlinks
- comments (partial - see below)

This PR does not address:
- styles
- images and charts
- VBA and ribbons
- many other items, I'm sure

The issue for non-standard namespacing till now has been the use of unexpected prefixes. While I was working on this change, @Lambik introduced issue PHPOffice#2067 PR PHPOffice#2068 which introduced a completely different problem - the use of unexpected URLs. That PR and the issue associated with it were quite well documented, including the supplying of a test file and tests for it. I asked if I could take a look to see if it could be integrated with my change, and the result seems to be yes, so those changes are also part of this PR.

While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult.

Although there are a number of Phpstan baseline changes as part of this PR, I did not attempt to resolve all Phpstan reports for Reader/Xlsx. Nor did I do anything to increase coverage. This change is already large and complex enough without those efforts.

I will add more detail as comments after I push this change.
Case-sensitive file name.
Case-sensitive file name.
Case-sensitive file name.
@oleibman oleibman marked this pull request as draft May 11, 2021 05:45
oleibman added 2 commits May 10, 2021 23:42
Temporarily disable failing test. Will not merge until test is restored.
@oleibman
Copy link
Collaborator Author

@MarkBaker, @PowerKiKi - it would be lying to say this is anything other than a risky change. I think it will be worthwhile in the long run, but there may be some short term pain. For that reason, I am initially putting it in draft status. I think it's ready, but it requires more scrutiny than most changes. It is unusually likely that I didn't dot an "i" or cross a "t" somewhere that isn't caught in unit testing. Part of the reason is because test coverage of Reader/Xlsx isn't all that high in the first place, and a fair proportion of the coverage comes from samples, not test assertions, so I hope that you and @IMSoP will give it a thorough going over. Here's an explanation for some of what I've done, since my motives won't always be obvious.

Added a Reader/Xlsx/Namespaces class to eliminate the hard-coded urls found in dozens of places throughout Reader/Xlsx and its subclasses. I was truly unsure whether it was better to define each item as a single string, or to attempt to show their relationship by using concatenations.

Added 2 routines, loadZip and loadZipNoNamespace, to handle the reading of the zip files. Both can specify a namespace, but the latter ignores it. Any calls to loadZipNoNamespace should act exactly the same as before, but they indicate that I either haven't analyzed the call yet, or haven't been able to determine whether specifying a namespace would be a more correct action. In a couple of cases, I call both because namespaced seems to work with part of the code that follows, but not with other parts.

Introduced 2 functions, testSimpleXml and xpathNoFalse, to attempt to eliminate the plethora of SimpleXMLElement|false and array|false possibilities that Phpstan (and, doubtless, Scrutinizer) dislike. Similarly, I have tried to cast SimpleXMLElement to string once in cases where it is done several times within a few statements.

I did not do any major refactoring with this PR. Reader/Xlsx is desparately in need of it; I just couldn't figure out how to break it up with this PR. I plan to come back to it. I do, however, have some ideas for how to refactor Styles, which is what I plan to do next if this PR is merged.

The new test file namespacestd.xlsx and the tests associated with are based on the Sylk test file, because I know that covers a lot of cases. The new test file namespacenonstd.xlsx is crafted by hand from this so that they can run identical tests (and also because, despite the number of namespace issues, I couldn't find much in the way of useful example files). Some of the nonstd tests are, of course, marked incomplete for now. The test file namespacepurl.xlsx and its tests come from @Lambik PR #2068.

Reader/Xlsx seems to be a very busy module - 8 changes between April 18 and 24, and Mark just opened a new one today, and there are 2 complex differences showing as merge conflicts (and a test failure to debug after I resolved those) - so, although there is risk in installing this PR, there is also risk in letting it become stale by waiting on it too long. So, I plan to mark this change ready when discussion on it appears to have petered out. I have not yet attempted to merge any of my own changes (although there are a couple of small ones I might be willing to do so for); this will certainly not be an exception.

@MarkBaker
Copy link
Member

MarkBaker commented May 11, 2021

While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult.

I'd seen this, but wasn't treating it as a priority for the moment. It's useful to support, but there's a lot more that needs doing before we can take a look at it, especially as only a few users seem interested in expanding comments support, and nobody has raised the topic of threaded comments yet... the worst thing about it seems to be a renaming between Comments (threaded) and Notes (which used to be called Comments)

it would be lying to say this is anything other than a risky change. I think it will be worthwhile in the long run, but there may be some short term pain.

It is high risk: one of the main reasons why I'd put off addressing it for so long... but I'm seeing more and more requests to handle it as more languages have their own Excel Writers using different namespacing, and users then want to be able to import those spreadsheets as well as Excel-created workbooks.

Added a Reader/Xlsx/Namespaces class to eliminate the hard-coded urls found in dozens of places throughout Reader/Xlsx and its subclasses. I was truly unsure whether it was better to define each item as a single string, or to attempt to show their relationship by using concatenations.

My feeling would be single string for each item; relationships should be an aspect of documentation, not executable code.

I did not do any major refactoring with this PR. Reader/Xlsx is desparately in need of it

Desperately in need; but that is a separate exercise, and shouldn't be conflated with this PR.

Reader/Xlsx seems to be a very busy module - 8 changes between April 18 and 24, and Mark just opened a new one today,

Ignore my new one for the moment: it's another experiment... I made the mistake of started looking at Pivot Tables again, and realised that I really needed to extract all the Zip handling to a common class so that it would be available across a number of Reader classes for Xlsx that would be created by any significant refactoring, DI'ed into their constructor. So that was purely an experiment of mine just to get a feel for how it would work... although it does look like it's an approach that could have a lot of benefits for a major breakdown of that monolithic Reader

There's a few other things I have in my pipeline (and I may be quieter here for a week or so): I want to start providing specific tests for every individual feature that we support for the different file formats, which would then allow me to properly update the feature grid in documentation... and explore parallel running of tests in the pipeline to try and speed that up; not to mention that another PR has reminded me how out of date the language translation tables are for Excel Functions, with many functions missing in the lists... they need some TLC, and I've recently installed the Function translator plug-in for Excel, which gives me that information, and for a whole range of new languages as well.... so I won't be making any changes that affect this PR.

oleibman added 7 commits May 11, 2021 06:47
Scrutinizer.
Scrutinizer.
Mark Baker expressed preference for single strings rather than concatenations here.
Final Scrutinizer changes for now, I hope.
@oleibman
Copy link
Collaborator Author

Satisfied Scrutinizer at last. I have, per Mark Baker's preference, changed the Namespaces member to use single strings. I have one failing test which is currently commented out; I hope to be able to debug it tonight.

@oleibman
Copy link
Collaborator Author

The code responsible for the failing test is now fixed and the test is restored.

Scrutinizer is such fun.
@MarkBaker
Copy link
Member

Given the significance of this change, I'm not going to fully review/merge it until after the next release

@oleibman
Copy link
Collaborator Author

Seems right. I will leave it in draft status at least till then.

@oleibman
Copy link
Collaborator Author

If I can take this opportunity to ask a quick question unrelated to this change. Looking at the commits on May 11, there are two labeled with the original commit title, and described as "so-and-so authored and MarkBaker committed ...". On the same day, there is one labeled "merge pull request ... from so-and-so", and described as "oleibman committed ...". What should I be doing to get the label and description to look like yours?

@MarkBaker MarkBaker mentioned this pull request May 21, 2021
@MarkBaker
Copy link
Member

Good to see how well it works with the first Spreadsheet that I've tested it with

@oleibman
Copy link
Collaborator Author

Sigh of relief! Much better to have a "natural" example than the one I hand-crafted.

@MarkBaker
Copy link
Member

re "so-and-so authored and MarkBaker committed ..." I had to rebase and merge, and that creates a new commit with my name tagged as a co-author.... sometimes I forget to remove that tag from the comments before pressing the button

MarkBaker referenced this pull request Jun 3, 2021
* Make DefinedNames Samples Consistent With Other Samples (#1707)

All other Samples write to temporary directory. DefinedNames samples
write to main directory, which (a) means they aren't stored with others,
and (b) they aren't ignored by git so look like changed files.
The tests are also simplified by requiring Header rather than Bootstrap,
making use of Helper.

* Resolve XSS Vulnerability in the HTML Writer (#1719)

Resolve XSS Vulnerability in the HTML Writer

* Drop Travis

* Automatic GitHub releases from git tags

* Improve Coverage in src/PhpSpreadsheet

There are no changes to code. Additional tests are added,
so that the following 6 items now have 100% test coverage:
- Comment
- DefinedName
- DocumentGenerator
- IOFactory
- NamedFormula
- NamedRange

* Changes for Scrutinizer

Two changes to fix minor problems reported by Scrutinizer.

* Spelling: Tou -> You

* Fix for 1735 (Incorrect activeSheetIndex after RemoveSheetByIndex) (#1743)

This is a fix for issue #1735.
It adds tests for this situation, and similar situations involving
adding new sheets and accessing existing ones.
Coverage for Spreadsheet.php increases from 69% to 75% as a result.

* Update change log

* Fix for 3 Issues Involving ReadXlsx and NamedRange (#1742)

* Fix for 3 Issues Involving ReadXlsx and NamedRange

Issues #1686 and #1723, which provide sample spreadsheets, are probably
solved by this ticket. Issue #1730 is also probably solved, but I have
no way to verify.

There are two problems with how PhpSpreadsheet is handling things now.
Although the first problem is much less severe, and isn't really a factor
in the issues named above, it is helpful to get it out of the way first.
If you define a named range in Excel, and then delete the sheet where
the range exists, Excel saves the range as #REF!. If there is a cell which
references the range, it will similarly have the value #REF! when you open
the Excel file.
Currently, PhpSpreadsheet discards the #REF! definition, so a cell which
references the range will appear as #NAME? rather than #REF!.
This PR changes the behavior so that PhpSpreadsheet retains the #REF!
definition, and cells which reference it will appear as #REF!.

The second problem is the more severe, and is, I believe, responsible
for the 3 issues identified above.
If you define a named range and the sheet on which the range is defined
does not exist at the time, Excel will save the range as something like:

'[1]Unknown Sheet'!$A$1

If a cell references such a range, Excel will again display #REF!.
PhpSpreadsheet currently throws an Exception when it encounters
such a definition while reading the file. This PR changes
the behavior so that PhpSpreadsheet saves the definition as #REF!,
and cells which reference it will behave similarly.

For the record, I will note that Excel does not magically recalculate when a
missing sheet is subsequently added, despite the fact that the reference
might now become resolvable. PhpSpreadsheet behaves likewise.

* Remove Dead Code in Test

Identified it after push but before merge.

* Update change log

* Apply Column and Row Styles to Existing Cells (#1721)

* Apply Column and Row Styles to Existing Cells

This is a fix for issue #1712.
When a style is applied to an entire row or column, it is currently
only effective for cells which don't already contain a value.
The code needs to iterate through existing cells in the row/column
in order to apply the style to them.
This could be considered a breaking change, however, I believe that
the change makes things operate as users would expect, and that the
existing implementation is incomplete.

The change also removes protected element conditionalStyles from
the Style class. That element is an unused remnant, and can no longer be
set or retrieved - methods getConditionalStyles and setConditionalStyles
actually act on an element in the Worksheet class.

Finally, additional tests are added so that Style, and in fact the
entire Style directory, now has 100% test coverage.

* Scrutinizer Changes

Scrutinizer flagged 6 statements. 5 can be easily corrected.
One is absolutely wrong (it thinks iterating through cells in column
can return null). Let's see if we can satisfy it.

* Remove Exception For CellIterator on Empty Row/Column

For my first attempt at this change, which corrects a bug by updating styles
for non-empty cells when a style is set on a row or column, I wished to make things
more efficient by using setIterateOnlyExistingCells, something which the
existing documentation recommends. This caused an exception to be generated
when the row or column is empty. So I removed that part of the change while I
researched what was going on.

I have completed that research. The existing code does throw an exception
when the row/column is empty and iterateOnlyExistingCells is true. However,
that does not seem like a reasonable action. This situation is analagous to
iterating over an empty array, and that action is legal and does not throw.
The same should apply here. There were no tests for this situation,
and now there are.

I have added additional tests, and coverage for all of RowCellIterator,
ColumnCellIterator, and CellIterator are all now 100%. Some of my new tests
were added in new members, because the existing tests all relied on mocking,
which was not the best choice for the new tests. One of the existing tests
for RowCellIteratorTest (testSeekOutOfRange) was wrong; it issued the expected
exception, but for the wrong reason. I have added an additional test to
ensure that it fails "correctly".

The existing documentation says that the default value for
IterateOnlyExistingCells is true. In fact, the default value is false.
I have corrected the documentation.

* More Scrutinizer

I believe its analysis is incorrect, but this should silence it.

* DocBlock Correction

ColumnCellIterator DocBlock for current indicated it could return null
or Cell, but it can really return only Cell. This had caused Scrutinizer
to complain earlier.

* PHP8 Environment Appears to be Fixed

Cosmetic change to Doc member. I suspect there is a way to rerun all
the tests without another push, but I have been unable to figure out how.

* Update change log

* TextData Coverage and Minor Bug Fixes (#1744)

This had been intended to get 100% coverage for TextData functions, and it does that.
However, some minor bugs requiring source changes arose during testing.
- the Excel CHAR function restricts its argument to 1-255. PhpSpreadsheet CHARACTER
  had been allowing 0+. Also, there is no need to test if iconv exists,
  since it is part of Composer requirements.
- The DOLLAR function had been returning NUM for invalid arguments. Excel returns VALUE.
  Also, negative amounts were not being handled correctly.
- The FIXEDFORMAT function had been returning NUM for invalid arguments. Excel FIXED returns VALUE.

* Replace anti-xss with html purifier (#1751)

* Replace voku/anti-xss with ezyang/htmlpurifier. Despite anti-xss being a smaller footprint dependency, an a better license fit with our MIT license, there are issues with it's automatic it sanitisation of global variables causing side effects
* Additional unit tests for xss in html writer cell comments

* Fix bug #1626 where values of 0 were "rounded" up/down as if they were not 0 (#1627)

* Fix bug where values of 0 were "rounded" up/down as if they were not 0

* Update change log

* Fix for #1612 - SLK Long File Name (#1706)

Issue has been marked stale, but ...
Sylk read sets worksheet title to filename (minus .slk).
If that is >31 characters, PhpSpreadsheet throws Exception.
This change truncates sheet title, as Excel does, to 31 characters.

* Update change log

* worksheet: fix if cellValue does not exist (#1727)

The condition is FALSE if the cell does not exist in the flipped table,
but anyway, it is sent in to a method requiring 'string' type, causing
it to fail.

* fixes #1655 issue (#1656)

Resolve problem with incorrectly defined hyperlinks

* Add 'ps' suffix to printer settings resources IDs (#1690)

* Add 'ps' suffix to printer settings resources IDs

* Update change log

* Fix pixelsToPoints conversion (for HTML col width) (#1733)

* DocBlock Change in Styles/Conditional (#1697)

Scrutinizer reported a minor error in a test involving a module
which I was not changing.
Styles/Conditional function setConditions can take a scalar or an
array as a parameter, but DocBlock says it only expects array.
I did not wish to add the extra module to my PR, but made a note to
self to fix that after PR was installed.
That has now happened, and it makes for a good case for me
to see all the PHP8/Composer2/etc. changes that have happened recently.

* Merge pull request #1698

* Merge pull request #4 from PHPOffice/master

* Restore Omitted Read XML Test

* Fix for bug #1592 (UPDATED) (#1623)

* Fix for Xls when BIFF8 SST (FCh) has bad Shared string length

* Update change log

* Add nightly PHP 8.1 dev to github actions (#1763)

* Fix compatibility with ext-gd on php 8 (#1762)

* CSV - Guess Encoding, Handle Null-string Escape (#1717)

* CSV - Guess Encoding, Handle Null-string Escape

This is in response to issue #1647 (detect CSV character encoding).
First, my tests with mb_detect_encoding indicate that it doesn't work
well enough; regardless, users can always do that on their own
if they deem it useful.
Rolling my own is also troublesome, but I can at least:
a. Check for BOM (UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE).
b. Do some heuristic tests for each of the above encodings.
c. Fallback to a user-specified encoding (default CP1252)
  if a and b don't yield result.
I think this is probably useful enough to include, and relatively
easy to expand if other potential encodings should be considered.

Starting with PHP7.4, fgetcsv allows specification of null string as
escape character in fgetcsv. This is a much better choice than the PHP
(and PhpSpreadsheet) default of backslash in that it handles the file
in the same manner as Excel does. There is one statement in Reader/CSV
which would be adversely affected if the caller so specified (building
a regular expression under the assumption that escape character is
a single character). Fix that statement appropriately and add tests.

* Update changelog

* Update Units of Measure supported by the CONVERT() function (#1768)

Now supports all current UoM in all categories, with both 1- and 2-character multiplier prefixes, and binary multiplier prefixes, including the new Temperature scales

* Changelog for 1.16.0 release

* Fix date tests withut specified year for current year 2021 (#1774)

* Mrand of zero to any multiple should return 0 (#1773)

* WIP Inherited Scrutinizer Recommendations - 1 of 3

I tried to sync my fork with the main project, as I have done several times before.
However, the GitHub interface to do this had changed, and it appears that I
did not make the optimal selection when I had a choice. Consequently, all
the merges that happened to base between the last time I synchronized and this time
appear to be part of any PR that I push. "Files changed" remains correct for
my new PRs, but there appear to be many more commits involved than is actually the
case. I will, at some point, delete and re-create my fork, and pay much closer
attention in future when I want to sync my fork with the main project.

Because of this set-up, Scrutinizer reports flaws in code that I haven't
actually changed in my PRs #1799 and #1800.
It still passes them, but, as long as I'm aware of the problems,
I may as well attempt to correct them. The following are not part of those PRs:
- 5 problems spread over 4 different members
- 12 problems in Calculation/Engineering
- 15 problems in Reader/XML

I shall attempt to resolve these via 3 separate PRs,
of which this is the first.

* Try Hyperlink Again

Scrutinizer still didn't like it as fixed.

* Another Crack at Hyperlink

This should work.

Co-authored-by: Mark Baker <[email protected]>
Co-authored-by: Adrien Crivelli <[email protected]>
Co-authored-by: Ryan McAllen <[email protected]>
Co-authored-by: Flinsch <[email protected]>
Co-authored-by: Jan Sverre Riksfjord <[email protected]>
Co-authored-by: Max Kalyabin <[email protected]>
Co-authored-by: Sébastien Despont <[email protected]>
Co-authored-by: Guilliam Xavier <[email protected]>
Co-authored-by: Gianluca Giovinazzo <[email protected]>
Co-authored-by: Alexander M. Turek <[email protected]>
Co-authored-by: Martins Sipenko <[email protected]>
@MarkBaker
Copy link
Member

Now that 1.18.0 is released, I think we can continue work on this with a view to including it in the next release; I think it needs more extensive testing, particularly with real-world spreadsheet files, but from the few tests I've done, it looks pretty good so far

A file created with a very old release of Openpyxl (2.3.5), which used unexpected namespacing.

A small subset of a very large file submitted with issue PHPOffice#2109.
@oleibman
Copy link
Collaborator Author

@MarkBaker @PowerKiKi - I require some help. I added two new spreadsheet and two new unit test members in response to Mark's latest comments. The unit tests did not proceed ("Waiting for status to be reported"). I was not able to resolve the differences in Github ("merge conflict between head and base"). The Github "Resolve conflicts" button is grayed out. I clicked on "Use the command line", which gives me the following information:

Conflicting files:

  • src/PhpSpreadsheet/Reader/Xlsx.php
  • src/PhpSpreadsheet/Reader/Xlsx/Properties.php
  • tests/PhpSpreadsheetTests/Reader/SheetsXlsxChartTest.php
  • tests/PhpSpreadsheetTests/Reader/XlsxTest.php

I believe the "tests" members are there because I earlier moved them out of the Reader directory into the Reader/Xlsx directory. I am satisfied with the contents of Xlsx.php and Xlsx/Properties.php.

Step 1: From your project repository, check out a new branch and test the changes.

git checkout -b oleibman-namexlsx master
git pull https://github.com/oleibman/PhpSpreadsheet.git namexlsx

Step 2: Merge the changes and update on GitHub.

git checkout master
git merge --no-ff oleibman-namexlsx
git push origin master

I have done the checkout and pull from Step 1, confirmed that the modules are as expected, and confirmed that the test suite runs without error. My repository says I have a branch named oleibman-namexlsx with no changes. So, I guess I'm ready to go on to the parts of "Step 2" above, but, having done nothing like that before, I need some reassurance that it's the right thing to do. I have captured all the output from my terminal session if you want me to upload it.

When I do all of that, is that just like a new push to namexlsx, or is it creating a new branch oleibman-namexlsx? Will the tests now run, or is there something else that will need to be done? Will the diff report for all the changes be meaningful? What else should I be looking out for?

@MarkBaker MarkBaker marked this pull request as ready for review June 14, 2021 06:45
@MarkBaker
Copy link
Member

I'll take a look at lunchtime (CEST) but most of the merge conflicts seem to resolve around reading the date properties from the document, and a big conflict in XlsxTest.php

@MarkBaker
Copy link
Member

Don't go ahead! Those merge conflicts need resolution first; I've tried manually resolving them once already, and ended up with a slew of errors. I'll have another go later, and try to identify what the actual problems are with the merge

@oleibman
Copy link
Collaborator Author

Yipe! I'm glad my call for help wasn't just because I was missing something blindingly obvious, but am sorry that this has become such a problem. If need be, I think I ought to be able to rebase and recreate this as a different PR - not my first choice, obviously, but possibly safer than trying to untangle everything.

@oleibman
Copy link
Collaborator Author

To help unravel things, I would like to open a separate PR to move SheetsXlsxChartTest and XlsxTest from Reader to Reader/Xlsx, with no other changes. It is really helpful to have all the Xlsx tests in the same directory. Do you think that would be okay, or would it mess things up even more?

@MarkBaker
Copy link
Member

It seems sensible to move the Reader tests to the specific Reader/xxxx folders, and to maintain Readers just for generic IOFactory Reader testing (identify, etc)... and moving forward, I'm wanting to extract some of those reader tests for specific features (DataValidations, Conditional Formatting, etc) into their own test files as well.... so moving those two tests is a sensible idea

oleibman added a commit to oleibman/PhpSpreadsheet that referenced this pull request Jun 17, 2021
PR PHPOffice#2088 is having major merge problems. This is partly because it moves some tests from Reader to Reader/Xlsx. Making this move beforehand may help. Or it may make things worse, but they are already bad enough that I am contemplating redoing the PR. If I do that, having this done beforehand will make things easier.

This PR does nothing but move some tests. This will make it easier to test changes to Xlsx Reader without having to run each test individually, and without having to run tests for all the other readers at the same time.
@oleibman
Copy link
Collaborator Author

I opened #2170, which is a pretty simple change, and it is also in "must resolve conflicts by hand" state. So I've converted it to a draft as well. I think I just need to start this over.

@oleibman
Copy link
Collaborator Author

Aha! Found the problem in 2170. It lacks the changes from #2149. I'll try to apply those and see what happens to it.

@oleibman
Copy link
Collaborator Author

Added the needed changes to 2170, but Github still thinks I need to do things by hand. Sigh.

oleibman added a commit to oleibman/PhpSpreadsheet that referenced this pull request Jun 17, 2021
PR PHPOffice#2088 is having major merge problems. This is partly because it moves some tests from Reader to Reader/Xlsx. Making this move beforehand may help. Or it may make things worse, but they are already bad enough that I am contemplating redoing the PR. If I do that, having this done beforehand will make things easier.

This PR does nothing but move some tests. This will make it easier to test changes to Xlsx Reader without having to run each test individually, or without having to run tests for all the other readers at the same time.
@oleibman
Copy link
Collaborator Author

Trying test moves again as PR #2171. This seems to be more successful, and, if that holds, that will give me a new base to start from if need be.

@oleibman
Copy link
Collaborator Author

PR 2171 is ready to merge. If we haven't figured a way out of the mess in this PR by the start of next week, I will re-do it using 2171 as my starting point.

@oleibman
Copy link
Collaborator Author

Good news! I have been able to create a new branch combining 2171 and the changes in this PR, and it passes all the old and new unit tests, including the new spreadsheets added this week. Need to do a bit more testing, but expect a new PR early next week.

oleibman added a commit to oleibman/PhpSpreadsheet that referenced this pull request Jun 19, 2021
This is a replacement for PHPOffice#2088, which has run into merge conflicts. I will close that PR in the near future, however the comments in that PR may prove useful for this one. While that PR has been in draft status all along, I am marking this one as ready. I will gladly add additional tests (and, of course, make code changes) that anyone has to suggest, but, with my most recent test files which I will describe in a separate comment, I have no further ideas on useful additions.

As mentioned in the earlier ticket, this is a risky change. But, as has been demonstrated, delaying it comes with its own set of risks. It would be helpful to have a temporary moratorium on changes to Reader/Xlsx until this change is merged.

The original commit message follows.

There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment PHPOffice#860 (comment) in issue PHPOffice#860 by @IMSoP has triggered an idea about how to proceed.

Gnumeric Reader was recently changed to handle namespaces better. Using that as a model, this PR begins the process of doing the same for Xlsx. Xlsx is much larger and more complicated than Gnumeric, hence the need to tackle it in multiple phases. I believe that this PR handles all of:
- listWorkSheetNames
- listWorkSheetInfo. Note that there was a bug in this function which would cause it to count only used columns rather than all columns. That bug is corrected.
- active sheet
- selected cell and top left cell
- cell content (formulas, numbers, text)
- hyperlinks
- comments (partial - see below)

This PR does not address:
- styles
- images and charts
- VBA and ribbons
- many other items, I'm sure

The issue for non-standard namespacing till now has been the use of unexpected prefixes. While I was working on this change, @Lambik introduced issue PHPOffice#2067 PR PHPOffice#2068 which introduced a completely different problem - the use of unexpected URLs. That PR and the issue associated with it were quite well documented, including the supplying of a test file and tests for it. I asked if I could take a look to see if it could be integrated with my change, and the result seems to be yes, so those changes are also part of this PR.

While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult.

Although there are a number of Phpstan baseline changes as part of this PR, I did not attempt to resolve all Phpstan reports for Reader/Xlsx. Nor did I do anything to increase coverage. This change is already large and complex enough without those efforts.

I will add more detail as comments after I push this change.
@oleibman
Copy link
Collaborator Author

Closing in favor of #2173.

@oleibman oleibman closed this Jun 22, 2021
MarkBaker pushed a commit that referenced this pull request Jun 25, 2021
* Xlsx Reader Better Namespace Handling Phase 1 Try2

This is a replacement for #2088, which has run into merge conflicts. I will close that PR in the near future, however the comments in that PR may prove useful for this one. While that PR has been in draft status all along, I am marking this one as ready. I will gladly add additional tests (and, of course, make code changes) that anyone has to suggest, but, with my most recent test files which I will describe in a separate comment, I have no further ideas on useful additions.

As mentioned in the earlier ticket, this is a risky change. But, as has been demonstrated, delaying it comes with its own set of risks. It would be helpful to have a temporary moratorium on changes to Reader/Xlsx until this change is merged.

The original commit message follows.

There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment #860 (comment) in issue #860 by @IMSoP has triggered an idea about how to proceed.

Gnumeric Reader was recently changed to handle namespaces better. Using that as a model, this PR begins the process of doing the same for Xlsx. Xlsx is much larger and more complicated than Gnumeric, hence the need to tackle it in multiple phases. I believe that this PR handles all of:
- listWorkSheetNames
- listWorkSheetInfo. Note that there was a bug in this function which would cause it to count only used columns rather than all columns. That bug is corrected.
- active sheet
- selected cell and top left cell
- cell content (formulas, numbers, text)
- hyperlinks
- comments (partial - see below)

This PR does not address:
- styles
- images and charts
- VBA and ribbons
- many other items, I'm sure

The issue for non-standard namespacing till now has been the use of unexpected prefixes. While I was working on this change, @Lambik introduced issue #2067 PR #2068 which introduced a completely different problem - the use of unexpected URLs. That PR and the issue associated with it were quite well documented, including the supplying of a test file and tests for it. I asked if I could take a look to see if it could be integrated with my change, and the result seems to be yes, so those changes are also part of this PR.

While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult.

Although there are a number of Phpstan baseline changes as part of this PR, I did not attempt to resolve all Phpstan reports for Reader/Xlsx. Nor did I do anything to increase coverage. This change is already large and complex enough without those efforts.
@oleibman oleibman deleted the namexlsx branch February 13, 2022 09:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants