Xlsx Reader Better Namespace Handling Phase 1 #2088

oleibman · 2021-05-11T05:00:21Z

There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment #860 (comment) in issue #860 by @IMSoP has triggered an idea about how to proceed.

Gnumeric Reader was recently changed to handle namespaces better. Using that as a model, this PR begins the process of doing the same for Xlsx. Xlsx is much larger and more complicated than Gnumeric, hence the need to tackle it in multiple phases. I believe that this PR handles all of:

listWorkSheetNames
listWorkSheetInfo. Note that there was a bug in this function which would cause it to count only used columns rather than all columns. That bug is corrected.
active sheet
selected cell and top left cell
cell content (formulas, numbers, text)
hyperlinks
comments (partial - see below)

This PR does not address:

styles
images and charts
VBA and ribbons
many other items, I'm sure

The issue for non-standard namespacing till now has been the use of unexpected prefixes. While I was working on this change, @Lambik introduced issue #2067 PR #2068 which introduced a completely different problem - the use of unexpected URLs. That PR and the issue associated with it were quite well documented, including the supplying of a test file and tests for it. I asked if I could take a look to see if it could be integrated with my change, and the result seems to be yes, so those changes are also part of this PR.

While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult.

Although there are a number of Phpstan baseline changes as part of this PR, I did not attempt to resolve all Phpstan reports for Reader/Xlsx. Nor did I do anything to increase coverage. This change is already large and complex enough without those efforts.

I will add more detail as comments after I push this change.

This is:

- [x] a bugfix
- [ ] a new feature

Checklist:

Changes are covered by unit tests
Code style is respected
Commit message explains why the change is made (see https://github.com/erlang/otp/wiki/Writing-good-commit-messages)
CHANGELOG.md contains a short summary of the change
Documentation is updated as necessary

Why this change is needed?

@IMSoP

There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment PHPOffice#860 (comment) in issue PHPOffice#860 by @IMSoP has triggered an idea about how to proceed. Gnumeric Reader was recently changed to handle namespaces better. Using that as a model, this PR begins the process of doing the same for Xlsx. Xlsx is much larger and more complicated than Gnumeric, hence the need to tackle it in multiple phases. I believe that this PR handles all of: - listWorkSheetNames - listWorkSheetInfo. Note that there was a bug in this function which would cause it to count only used columns rather than all columns. That bug is corrected. - active sheet - selected cell and top left cell - cell content (formulas, numbers, text) - hyperlinks - comments (partial - see below) This PR does not address: - styles - images and charts - VBA and ribbons - many other items, I'm sure The issue for non-standard namespacing till now has been the use of unexpected prefixes. While I was working on this change, @Lambik introduced issue PHPOffice#2067 PR PHPOffice#2068 which introduced a completely different problem - the use of unexpected URLs. That PR and the issue associated with it were quite well documented, including the supplying of a test file and tests for it. I asked if I could take a look to see if it could be integrated with my change, and the result seems to be yes, so those changes are also part of this PR. While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult. Although there are a number of Phpstan baseline changes as part of this PR, I did not attempt to resolve all Phpstan reports for Reader/Xlsx. Nor did I do anything to increase coverage. This change is already large and complex enough without those efforts. I will add more detail as comments after I push this change.

Case-sensitive file name.

Temporarily disable failing test. Will not merge until test is restored.

oleibman · 2021-05-11T07:07:52Z

@MarkBaker, @PowerKiKi - it would be lying to say this is anything other than a risky change. I think it will be worthwhile in the long run, but there may be some short term pain. For that reason, I am initially putting it in draft status. I think it's ready, but it requires more scrutiny than most changes. It is unusually likely that I didn't dot an "i" or cross a "t" somewhere that isn't caught in unit testing. Part of the reason is because test coverage of Reader/Xlsx isn't all that high in the first place, and a fair proportion of the coverage comes from samples, not test assertions, so I hope that you and @IMSoP will give it a thorough going over. Here's an explanation for some of what I've done, since my motives won't always be obvious.

Added a Reader/Xlsx/Namespaces class to eliminate the hard-coded urls found in dozens of places throughout Reader/Xlsx and its subclasses. I was truly unsure whether it was better to define each item as a single string, or to attempt to show their relationship by using concatenations.

Added 2 routines, loadZip and loadZipNoNamespace, to handle the reading of the zip files. Both can specify a namespace, but the latter ignores it. Any calls to loadZipNoNamespace should act exactly the same as before, but they indicate that I either haven't analyzed the call yet, or haven't been able to determine whether specifying a namespace would be a more correct action. In a couple of cases, I call both because namespaced seems to work with part of the code that follows, but not with other parts.

Introduced 2 functions, testSimpleXml and xpathNoFalse, to attempt to eliminate the plethora of SimpleXMLElement|false and array|false possibilities that Phpstan (and, doubtless, Scrutinizer) dislike. Similarly, I have tried to cast SimpleXMLElement to string once in cases where it is done several times within a few statements.

I did not do any major refactoring with this PR. Reader/Xlsx is desparately in need of it; I just couldn't figure out how to break it up with this PR. I plan to come back to it. I do, however, have some ideas for how to refactor Styles, which is what I plan to do next if this PR is merged.

The new test file namespacestd.xlsx and the tests associated with are based on the Sylk test file, because I know that covers a lot of cases. The new test file namespacenonstd.xlsx is crafted by hand from this so that they can run identical tests (and also because, despite the number of namespace issues, I couldn't find much in the way of useful example files). Some of the nonstd tests are, of course, marked incomplete for now. The test file namespacepurl.xlsx and its tests come from @Lambik PR #2068.

Reader/Xlsx seems to be a very busy module - 8 changes between April 18 and 24, and Mark just opened a new one today, and there are 2 complex differences showing as merge conflicts (and a test failure to debug after I resolved those) - so, although there is risk in installing this PR, there is also risk in letting it become stale by waiting on it too long. So, I plan to mark this change ready when discussion on it appears to have petered out. I have not yet attempted to merge any of my own changes (although there are a couple of small ones I might be willing to do so for); this will certainly not be an exception.

MarkBaker · 2021-05-11T11:43:08Z

While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult.

I'd seen this, but wasn't treating it as a priority for the moment. It's useful to support, but there's a lot more that needs doing before we can take a look at it, especially as only a few users seem interested in expanding comments support, and nobody has raised the topic of threaded comments yet... the worst thing about it seems to be a renaming between Comments (threaded) and Notes (which used to be called Comments)

it would be lying to say this is anything other than a risky change. I think it will be worthwhile in the long run, but there may be some short term pain.

It is high risk: one of the main reasons why I'd put off addressing it for so long... but I'm seeing more and more requests to handle it as more languages have their own Excel Writers using different namespacing, and users then want to be able to import those spreadsheets as well as Excel-created workbooks.

Added a Reader/Xlsx/Namespaces class to eliminate the hard-coded urls found in dozens of places throughout Reader/Xlsx and its subclasses. I was truly unsure whether it was better to define each item as a single string, or to attempt to show their relationship by using concatenations.

My feeling would be single string for each item; relationships should be an aspect of documentation, not executable code.

I did not do any major refactoring with this PR. Reader/Xlsx is desparately in need of it

Desperately in need; but that is a separate exercise, and shouldn't be conflated with this PR.

Reader/Xlsx seems to be a very busy module - 8 changes between April 18 and 24, and Mark just opened a new one today,

Ignore my new one for the moment: it's another experiment... I made the mistake of started looking at Pivot Tables again, and realised that I really needed to extract all the Zip handling to a common class so that it would be available across a number of Reader classes for Xlsx that would be created by any significant refactoring, DI'ed into their constructor. So that was purely an experiment of mine just to get a feel for how it would work... although it does look like it's an approach that could have a lot of benefits for a major breakdown of that monolithic Reader

There's a few other things I have in my pipeline (and I may be quieter here for a week or so): I want to start providing specific tests for every individual feature that we support for the different file formats, which would then allow me to properly update the feature grid in documentation... and explore parallel running of tests in the pipeline to try and speed that up; not to mention that another PR has reminded me how out of date the language translation tables are for Excel Functions, with many functions missing in the lists... they need some TLC, and I've recently installed the Function translator plug-in for Excel, which gives me that information, and for a whole range of new languages as well.... so I won't be making any changes that affect this PR.

Scrutinizer.

Mark Baker expressed preference for single strings rather than concatenations here.

Final Scrutinizer changes for now, I hope.

oleibman · 2021-05-11T15:16:22Z

Satisfied Scrutinizer at last. I have, per Mark Baker's preference, changed the Namespaces member to use single strings. I have one failing test which is currently commented out; I hope to be able to debug it tonight.

oleibman · 2021-05-12T02:04:58Z

The code responsible for the failing test is now fixed and the test is restored.

Scrutinizer is such fun.

MarkBaker · 2021-05-14T17:48:12Z

Given the significance of this change, I'm not going to fully review/merge it until after the next release

oleibman · 2021-05-14T19:29:26Z

Seems right. I will leave it in draft status at least till then.

oleibman · 2021-05-14T19:38:08Z

If I can take this opportunity to ask a quick question unrelated to this change. Looking at the commits on May 11, there are two labeled with the original commit title, and described as "so-and-so authored and MarkBaker committed ...". On the same day, there is one labeled "merge pull request ... from so-and-so", and described as "oleibman committed ...". What should I be doing to get the label and description to look like yours?

MarkBaker · 2021-05-21T19:22:54Z

Good to see how well it works with the first Spreadsheet that I've tested it with

oleibman · 2021-05-21T19:26:11Z

Sigh of relief! Much better to have a "natural" example than the one I hand-crafted.

MarkBaker · 2021-05-21T20:56:47Z

re "so-and-so authored and MarkBaker committed ..." I had to rebase and merge, and that creates a new commit with my name tagged as a co-author.... sometimes I forget to remove that tag from the comments before pressing the button

* Make DefinedNames Samples Consistent With Other Samples (#1707) All other Samples write to temporary directory. DefinedNames samples write to main directory, which (a) means they aren't stored with others, and (b) they aren't ignored by git so look like changed files. The tests are also simplified by requiring Header rather than Bootstrap, making use of Helper. * Resolve XSS Vulnerability in the HTML Writer (#1719) Resolve XSS Vulnerability in the HTML Writer * Drop Travis * Automatic GitHub releases from git tags * Improve Coverage in src/PhpSpreadsheet There are no changes to code. Additional tests are added, so that the following 6 items now have 100% test coverage: - Comment - DefinedName - DocumentGenerator - IOFactory - NamedFormula - NamedRange * Changes for Scrutinizer Two changes to fix minor problems reported by Scrutinizer. * Spelling: Tou -> You * Fix for 1735 (Incorrect activeSheetIndex after RemoveSheetByIndex) (#1743) This is a fix for issue #1735. It adds tests for this situation, and similar situations involving adding new sheets and accessing existing ones. Coverage for Spreadsheet.php increases from 69% to 75% as a result. * Update change log * Fix for 3 Issues Involving ReadXlsx and NamedRange (#1742) * Fix for 3 Issues Involving ReadXlsx and NamedRange Issues #1686 and #1723, which provide sample spreadsheets, are probably solved by this ticket. Issue #1730 is also probably solved, but I have no way to verify. There are two problems with how PhpSpreadsheet is handling things now. Although the first problem is much less severe, and isn't really a factor in the issues named above, it is helpful to get it out of the way first. If you define a named range in Excel, and then delete the sheet where the range exists, Excel saves the range as #REF!. If there is a cell which references the range, it will similarly have the value #REF! when you open the Excel file. Currently, PhpSpreadsheet discards the #REF! definition, so a cell which references the range will appear as #NAME? rather than #REF!. This PR changes the behavior so that PhpSpreadsheet retains the #REF! definition, and cells which reference it will appear as #REF!. The second problem is the more severe, and is, I believe, responsible for the 3 issues identified above. If you define a named range and the sheet on which the range is defined does not exist at the time, Excel will save the range as something like: '[1]Unknown Sheet'!$A$1 If a cell references such a range, Excel will again display #REF!. PhpSpreadsheet currently throws an Exception when it encounters such a definition while reading the file. This PR changes the behavior so that PhpSpreadsheet saves the definition as #REF!, and cells which reference it will behave similarly. For the record, I will note that Excel does not magically recalculate when a missing sheet is subsequently added, despite the fact that the reference might now become resolvable. PhpSpreadsheet behaves likewise. * Remove Dead Code in Test Identified it after push but before merge. * Update change log * Apply Column and Row Styles to Existing Cells (#1721) * Apply Column and Row Styles to Existing Cells This is a fix for issue #1712. When a style is applied to an entire row or column, it is currently only effective for cells which don't already contain a value. The code needs to iterate through existing cells in the row/column in order to apply the style to them. This could be considered a breaking change, however, I believe that the change makes things operate as users would expect, and that the existing implementation is incomplete. The change also removes protected element conditionalStyles from the Style class. That element is an unused remnant, and can no longer be set or retrieved - methods getConditionalStyles and setConditionalStyles actually act on an element in the Worksheet class. Finally, additional tests are added so that Style, and in fact the entire Style directory, now has 100% test coverage. * Scrutinizer Changes Scrutinizer flagged 6 statements. 5 can be easily corrected. One is absolutely wrong (it thinks iterating through cells in column can return null). Let's see if we can satisfy it. * Remove Exception For CellIterator on Empty Row/Column For my first attempt at this change, which corrects a bug by updating styles for non-empty cells when a style is set on a row or column, I wished to make things more efficient by using setIterateOnlyExistingCells, something which the existing documentation recommends. This caused an exception to be generated when the row or column is empty. So I removed that part of the change while I researched what was going on. I have completed that research. The existing code does throw an exception when the row/column is empty and iterateOnlyExistingCells is true. However, that does not seem like a reasonable action. This situation is analagous to iterating over an empty array, and that action is legal and does not throw. The same should apply here. There were no tests for this situation, and now there are. I have added additional tests, and coverage for all of RowCellIterator, ColumnCellIterator, and CellIterator are all now 100%. Some of my new tests were added in new members, because the existing tests all relied on mocking, which was not the best choice for the new tests. One of the existing tests for RowCellIteratorTest (testSeekOutOfRange) was wrong; it issued the expected exception, but for the wrong reason. I have added an additional test to ensure that it fails "correctly". The existing documentation says that the default value for IterateOnlyExistingCells is true. In fact, the default value is false. I have corrected the documentation. * More Scrutinizer I believe its analysis is incorrect, but this should silence it. * DocBlock Correction ColumnCellIterator DocBlock for current indicated it could return null or Cell, but it can really return only Cell. This had caused Scrutinizer to complain earlier. * PHP8 Environment Appears to be Fixed Cosmetic change to Doc member. I suspect there is a way to rerun all the tests without another push, but I have been unable to figure out how. * Update change log * TextData Coverage and Minor Bug Fixes (#1744) This had been intended to get 100% coverage for TextData functions, and it does that. However, some minor bugs requiring source changes arose during testing. - the Excel CHAR function restricts its argument to 1-255. PhpSpreadsheet CHARACTER had been allowing 0+. Also, there is no need to test if iconv exists, since it is part of Composer requirements. - The DOLLAR function had been returning NUM for invalid arguments. Excel returns VALUE. Also, negative amounts were not being handled correctly. - The FIXEDFORMAT function had been returning NUM for invalid arguments. Excel FIXED returns VALUE. * Replace anti-xss with html purifier (#1751) * Replace voku/anti-xss with ezyang/htmlpurifier. Despite anti-xss being a smaller footprint dependency, an a better license fit with our MIT license, there are issues with it's automatic it sanitisation of global variables causing side effects * Additional unit tests for xss in html writer cell comments * Fix bug #1626 where values of 0 were "rounded" up/down as if they were not 0 (#1627) * Fix bug where values of 0 were "rounded" up/down as if they were not 0 * Update change log * Fix for #1612 - SLK Long File Name (#1706) Issue has been marked stale, but ... Sylk read sets worksheet title to filename (minus .slk). If that is >31 characters, PhpSpreadsheet throws Exception. This change truncates sheet title, as Excel does, to 31 characters. * Update change log * worksheet: fix if cellValue does not exist (#1727) The condition is FALSE if the cell does not exist in the flipped table, but anyway, it is sent in to a method requiring 'string' type, causing it to fail. * fixes #1655 issue (#1656) Resolve problem with incorrectly defined hyperlinks * Add 'ps' suffix to printer settings resources IDs (#1690) * Add 'ps' suffix to printer settings resources IDs * Update change log * Fix pixelsToPoints conversion (for HTML col width) (#1733) * DocBlock Change in Styles/Conditional (#1697) Scrutinizer reported a minor error in a test involving a module which I was not changing. Styles/Conditional function setConditions can take a scalar or an array as a parameter, but DocBlock says it only expects array. I did not wish to add the extra module to my PR, but made a note to self to fix that after PR was installed. That has now happened, and it makes for a good case for me to see all the PHP8/Composer2/etc. changes that have happened recently. * Merge pull request #1698 * Merge pull request #4 from PHPOffice/master * Restore Omitted Read XML Test * Fix for bug #1592 (UPDATED) (#1623) * Fix for Xls when BIFF8 SST (FCh) has bad Shared string length * Update change log * Add nightly PHP 8.1 dev to github actions (#1763) * Fix compatibility with ext-gd on php 8 (#1762) * CSV - Guess Encoding, Handle Null-string Escape (#1717) * CSV - Guess Encoding, Handle Null-string Escape This is in response to issue #1647 (detect CSV character encoding). First, my tests with mb_detect_encoding indicate that it doesn't work well enough; regardless, users can always do that on their own if they deem it useful. Rolling my own is also troublesome, but I can at least: a. Check for BOM (UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE). b. Do some heuristic tests for each of the above encodings. c. Fallback to a user-specified encoding (default CP1252) if a and b don't yield result. I think this is probably useful enough to include, and relatively easy to expand if other potential encodings should be considered. Starting with PHP7.4, fgetcsv allows specification of null string as escape character in fgetcsv. This is a much better choice than the PHP (and PhpSpreadsheet) default of backslash in that it handles the file in the same manner as Excel does. There is one statement in Reader/CSV which would be adversely affected if the caller so specified (building a regular expression under the assumption that escape character is a single character). Fix that statement appropriately and add tests. * Update changelog * Update Units of Measure supported by the CONVERT() function (#1768) Now supports all current UoM in all categories, with both 1- and 2-character multiplier prefixes, and binary multiplier prefixes, including the new Temperature scales * Changelog for 1.16.0 release * Fix date tests withut specified year for current year 2021 (#1774) * Mrand of zero to any multiple should return 0 (#1773) * WIP Inherited Scrutinizer Recommendations - 1 of 3 I tried to sync my fork with the main project, as I have done several times before. However, the GitHub interface to do this had changed, and it appears that I did not make the optimal selection when I had a choice. Consequently, all the merges that happened to base between the last time I synchronized and this time appear to be part of any PR that I push. "Files changed" remains correct for my new PRs, but there appear to be many more commits involved than is actually the case. I will, at some point, delete and re-create my fork, and pay much closer attention in future when I want to sync my fork with the main project. Because of this set-up, Scrutinizer reports flaws in code that I haven't actually changed in my PRs #1799 and #1800. It still passes them, but, as long as I'm aware of the problems, I may as well attempt to correct them. The following are not part of those PRs: - 5 problems spread over 4 different members - 12 problems in Calculation/Engineering - 15 problems in Reader/XML I shall attempt to resolve these via 3 separate PRs, of which this is the first. * Try Hyperlink Again Scrutinizer still didn't like it as fixed. * Another Crack at Hyperlink This should work. Co-authored-by: Mark Baker <[email protected]> Co-authored-by: Adrien Crivelli <[email protected]> Co-authored-by: Ryan McAllen <[email protected]> Co-authored-by: Flinsch <[email protected]> Co-authored-by: Jan Sverre Riksfjord <[email protected]> Co-authored-by: Max Kalyabin <[email protected]> Co-authored-by: Sébastien Despont <[email protected]> Co-authored-by: Guilliam Xavier <[email protected]> Co-authored-by: Gianluca Giovinazzo <[email protected]> Co-authored-by: Alexander M. Turek <[email protected]> Co-authored-by: Martins Sipenko <[email protected]>

MarkBaker · 2021-06-04T06:58:38Z

Now that 1.18.0 is released, I think we can continue work on this with a view to including it in the next release; I think it needs more extensive testing, particularly with real-world spreadsheet files, but from the few tests I've done, it looks pretty good so far

A file created with a very old release of Openpyxl (2.3.5), which used unexpected namespacing. A small subset of a very large file submitted with issue PHPOffice#2109.

oleibman · 2021-06-13T21:53:24Z

@MarkBaker @PowerKiKi - I require some help. I added two new spreadsheet and two new unit test members in response to Mark's latest comments. The unit tests did not proceed ("Waiting for status to be reported"). I was not able to resolve the differences in Github ("merge conflict between head and base"). The Github "Resolve conflicts" button is grayed out. I clicked on "Use the command line", which gives me the following information:

Conflicting files:

src/PhpSpreadsheet/Reader/Xlsx.php
src/PhpSpreadsheet/Reader/Xlsx/Properties.php
tests/PhpSpreadsheetTests/Reader/SheetsXlsxChartTest.php
tests/PhpSpreadsheetTests/Reader/XlsxTest.php

I believe the "tests" members are there because I earlier moved them out of the Reader directory into the Reader/Xlsx directory. I am satisfied with the contents of Xlsx.php and Xlsx/Properties.php.

Step 1: From your project repository, check out a new branch and test the changes.

git checkout -b oleibman-namexlsx master
git pull https://github.com/oleibman/PhpSpreadsheet.git namexlsx

Step 2: Merge the changes and update on GitHub.

git checkout master
git merge --no-ff oleibman-namexlsx
git push origin master

I have done the checkout and pull from Step 1, confirmed that the modules are as expected, and confirmed that the test suite runs without error. My repository says I have a branch named oleibman-namexlsx with no changes. So, I guess I'm ready to go on to the parts of "Step 2" above, but, having done nothing like that before, I need some reassurance that it's the right thing to do. I have captured all the output from my terminal session if you want me to upload it.

When I do all of that, is that just like a new push to namexlsx, or is it creating a new branch oleibman-namexlsx? Will the tests now run, or is there something else that will need to be done? Will the diff report for all the changes be meaningful? What else should I be looking out for?

MarkBaker · 2021-06-14T07:04:42Z

I'll take a look at lunchtime (CEST) but most of the merge conflicts seem to resolve around reading the date properties from the document, and a big conflict in XlsxTest.php

MarkBaker · 2021-06-15T13:57:43Z

Don't go ahead! Those merge conflicts need resolution first; I've tried manually resolving them once already, and ended up with a slew of errors. I'll have another go later, and try to identify what the actual problems are with the merge

oleibman · 2021-06-15T14:10:18Z

Yipe! I'm glad my call for help wasn't just because I was missing something blindingly obvious, but am sorry that this has become such a problem. If need be, I think I ought to be able to rebase and recreate this as a different PR - not my first choice, obviously, but possibly safer than trying to untangle everything.

oleibman · 2021-06-16T14:53:32Z

To help unravel things, I would like to open a separate PR to move SheetsXlsxChartTest and XlsxTest from Reader to Reader/Xlsx, with no other changes. It is really helpful to have all the Xlsx tests in the same directory. Do you think that would be okay, or would it mess things up even more?

MarkBaker · 2021-06-17T07:37:00Z

It seems sensible to move the Reader tests to the specific Reader/xxxx folders, and to maintain Readers just for generic IOFactory Reader testing (identify, etc)... and moving forward, I'm wanting to extract some of those reader tests for specific features (DataValidations, Conditional Formatting, etc) into their own test files as well.... so moving those two tests is a sensible idea

PR PHPOffice#2088 is having major merge problems. This is partly because it moves some tests from Reader to Reader/Xlsx. Making this move beforehand may help. Or it may make things worse, but they are already bad enough that I am contemplating redoing the PR. If I do that, having this done beforehand will make things easier. This PR does nothing but move some tests. This will make it easier to test changes to Xlsx Reader without having to run each test individually, and without having to run tests for all the other readers at the same time.

oleibman · 2021-06-17T14:24:32Z

I opened #2170, which is a pretty simple change, and it is also in "must resolve conflicts by hand" state. So I've converted it to a draft as well. I think I just need to start this over.

oleibman · 2021-06-17T14:33:39Z

Aha! Found the problem in 2170. It lacks the changes from #2149. I'll try to apply those and see what happens to it.

oleibman · 2021-06-17T14:44:34Z

Added the needed changes to 2170, but Github still thinks I need to do things by hand. Sigh.

PR PHPOffice#2088 is having major merge problems. This is partly because it moves some tests from Reader to Reader/Xlsx. Making this move beforehand may help. Or it may make things worse, but they are already bad enough that I am contemplating redoing the PR. If I do that, having this done beforehand will make things easier. This PR does nothing but move some tests. This will make it easier to test changes to Xlsx Reader without having to run each test individually, or without having to run tests for all the other readers at the same time.

oleibman · 2021-06-17T16:49:26Z

Trying test moves again as PR #2171. This seems to be more successful, and, if that holds, that will give me a new base to start from if need be.

oleibman · 2021-06-17T17:09:11Z

PR 2171 is ready to merge. If we haven't figured a way out of the mess in this PR by the start of next week, I will re-do it using 2171 as my starting point.

oleibman · 2021-06-18T00:38:43Z

Good news! I have been able to create a new branch combining 2171 and the changes in this PR, and it passes all the old and new unit tests, including the new spreadsheets added this week. Need to do a bit more testing, but expect a new PR early next week.

@IMSoP

This is a replacement for PHPOffice#2088, which has run into merge conflicts. I will close that PR in the near future, however the comments in that PR may prove useful for this one. While that PR has been in draft status all along, I am marking this one as ready. I will gladly add additional tests (and, of course, make code changes) that anyone has to suggest, but, with my most recent test files which I will describe in a separate comment, I have no further ideas on useful additions. As mentioned in the earlier ticket, this is a risky change. But, as has been demonstrated, delaying it comes with its own set of risks. It would be helpful to have a temporary moratorium on changes to Reader/Xlsx until this change is merged. The original commit message follows. There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment PHPOffice#860 (comment) in issue PHPOffice#860 by @IMSoP has triggered an idea about how to proceed. Gnumeric Reader was recently changed to handle namespaces better. Using that as a model, this PR begins the process of doing the same for Xlsx. Xlsx is much larger and more complicated than Gnumeric, hence the need to tackle it in multiple phases. I believe that this PR handles all of: - listWorkSheetNames - listWorkSheetInfo. Note that there was a bug in this function which would cause it to count only used columns rather than all columns. That bug is corrected. - active sheet - selected cell and top left cell - cell content (formulas, numbers, text) - hyperlinks - comments (partial - see below) This PR does not address: - styles - images and charts - VBA and ribbons - many other items, I'm sure The issue for non-standard namespacing till now has been the use of unexpected prefixes. While I was working on this change, @Lambik introduced issue PHPOffice#2067 PR PHPOffice#2068 which introduced a completely different problem - the use of unexpected URLs. That PR and the issue associated with it were quite well documented, including the supplying of a test file and tests for it. I asked if I could take a look to see if it could be integrated with my change, and the result seems to be yes, so those changes are also part of this PR. While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult. Although there are a number of Phpstan baseline changes as part of this PR, I did not attempt to resolve all Phpstan reports for Reader/Xlsx. Nor did I do anything to increase coverage. This change is already large and complex enough without those efforts. I will add more detail as comments after I push this change.

oleibman · 2021-06-22T16:06:42Z

Closing in favor of #2173.

@IMSoP

* Xlsx Reader Better Namespace Handling Phase 1 Try2 This is a replacement for #2088, which has run into merge conflicts. I will close that PR in the near future, however the comments in that PR may prove useful for this one. While that PR has been in draft status all along, I am marking this one as ready. I will gladly add additional tests (and, of course, make code changes) that anyone has to suggest, but, with my most recent test files which I will describe in a separate comment, I have no further ideas on useful additions. As mentioned in the earlier ticket, this is a risky change. But, as has been demonstrated, delaying it comes with its own set of risks. It would be helpful to have a temporary moratorium on changes to Reader/Xlsx until this change is merged. The original commit message follows. There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment #860 (comment) in issue #860 by @IMSoP has triggered an idea about how to proceed. Gnumeric Reader was recently changed to handle namespaces better. Using that as a model, this PR begins the process of doing the same for Xlsx. Xlsx is much larger and more complicated than Gnumeric, hence the need to tackle it in multiple phases. I believe that this PR handles all of: - listWorkSheetNames - listWorkSheetInfo. Note that there was a bug in this function which would cause it to count only used columns rather than all columns. That bug is corrected. - active sheet - selected cell and top left cell - cell content (formulas, numbers, text) - hyperlinks - comments (partial - see below) This PR does not address: - styles - images and charts - VBA and ribbons - many other items, I'm sure The issue for non-standard namespacing till now has been the use of unexpected prefixes. While I was working on this change, @Lambik introduced issue #2067 PR #2068 which introduced a completely different problem - the use of unexpected URLs. That PR and the issue associated with it were quite well documented, including the supplying of a test file and tests for it. I asked if I could take a look to see if it could be integrated with my change, and the result seems to be yes, so those changes are also part of this PR. While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult. Although there are a number of Phpstan baseline changes as part of this PR, I did not attempt to resolve all Phpstan reports for Reader/Xlsx. Nor did I do anything to increase coverage. This change is already large and complex enough without those efforts.

oleibman added 5 commits May 10, 2021 21:58

Merge branch 'master' into namexlsx

d7e042f

Update NamespaceNonStdTest.php

1cdb40c

Case-sensitive file name.

Update NamespacePurlTest.php

80d8e59

Case-sensitive file name.

Update NamespaceStdTest.php

784c878

Case-sensitive file name.

oleibman marked this pull request as draft May 11, 2021 05:45

oleibman added 2 commits May 10, 2021 23:42

Temporary Change to NamespaceStdTest.php

f4a483b

Temporarily disable failing test. Will not merge until test is restored.

Update phpstan-baseline.neon

c1d41c0

oleibman added 7 commits May 11, 2021 06:47

Update Xlsx.php

3f94893

Scrutinizer.

Update Styles.php

22fce2b

Scrutinizer.

Update Hyperlinks.php

c341e19

Scrutinizer.

Update Properties.php

a8e3cc8

Scrutinizer.

Update phpstan-baseline.neon

307321d

Update Namespaces.php

191122a

Mark Baker expressed preference for single strings rather than concatenations here.

Update Xlsx.php

83c6dc1

Final Scrutinizer changes for now, I hope.

oleibman added 2 commits May 11, 2021 18:56

Fix Incorrect Reading of Style Protection

b97190b

Reenable Commented Out Test

fdc35d1

Update Xlsx.php

754d732

Scrutinizer is such fun.

MarkBaker mentioned this pull request May 21, 2021

No sheets detected #2109

Closed

MarkBaker mentioned this pull request Jun 3, 2021

Incorrect loading of sheets in PhpOffice\PhpSpreadsheet\Reader\Xlsx::listWorksheetInfo() #2142

Closed

More Tests

2785fc9

A file created with a very old release of Openpyxl (2.3.5), which used unexpected namespacing. A small subset of a very large file submitted with issue PHPOffice#2109.

MarkBaker marked this pull request as ready for review June 14, 2021 06:45

oleibman mentioned this pull request Jun 17, 2021

Move Reader Xlsx Tests from Reader to Reader/Xlsx #2170

Closed

5 tasks

oleibman mentioned this pull request Jun 17, 2021

Move Reader Xlsx Tests from Reader to Reader/Xlsx Try 2 #2171

Merged

5 tasks

oleibman mentioned this pull request Jun 19, 2021

Xlsx Reader Better Namespace Handling Phase 1 Try2 #2173

Merged

5 tasks

oleibman closed this Jun 22, 2021

oleibman deleted the namexlsx branch February 13, 2022 09:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Xlsx Reader Better Namespace Handling Phase 1 #2088

Xlsx Reader Better Namespace Handling Phase 1 #2088

oleibman commented May 11, 2021

oleibman commented May 11, 2021

MarkBaker commented May 11, 2021 •

edited

Loading

oleibman commented May 11, 2021

oleibman commented May 12, 2021

MarkBaker commented May 14, 2021

oleibman commented May 14, 2021

oleibman commented May 14, 2021

MarkBaker commented May 21, 2021

oleibman commented May 21, 2021

MarkBaker commented May 21, 2021

MarkBaker commented Jun 4, 2021

oleibman commented Jun 13, 2021

MarkBaker commented Jun 14, 2021

MarkBaker commented Jun 15, 2021

oleibman commented Jun 15, 2021

oleibman commented Jun 16, 2021

MarkBaker commented Jun 17, 2021

oleibman commented Jun 17, 2021

oleibman commented Jun 17, 2021

oleibman commented Jun 17, 2021

oleibman commented Jun 17, 2021

oleibman commented Jun 17, 2021

oleibman commented Jun 18, 2021

oleibman commented Jun 22, 2021

Xlsx Reader Better Namespace Handling Phase 1 #2088

Xlsx Reader Better Namespace Handling Phase 1 #2088

Conversation

oleibman commented May 11, 2021

Why this change is needed?

oleibman commented May 11, 2021

MarkBaker commented May 11, 2021 • edited Loading

oleibman commented May 11, 2021

oleibman commented May 12, 2021

MarkBaker commented May 14, 2021

oleibman commented May 14, 2021

oleibman commented May 14, 2021

MarkBaker commented May 21, 2021

oleibman commented May 21, 2021

MarkBaker commented May 21, 2021

MarkBaker commented Jun 4, 2021

oleibman commented Jun 13, 2021

MarkBaker commented Jun 14, 2021

MarkBaker commented Jun 15, 2021

oleibman commented Jun 15, 2021

oleibman commented Jun 16, 2021

MarkBaker commented Jun 17, 2021

oleibman commented Jun 17, 2021

oleibman commented Jun 17, 2021

oleibman commented Jun 17, 2021

oleibman commented Jun 17, 2021

oleibman commented Jun 17, 2021

oleibman commented Jun 18, 2021

oleibman commented Jun 22, 2021

MarkBaker commented May 11, 2021 •

edited

Loading