-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read XLSX with namespaces in their XML #860
Comments
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
Just up this issue else it'll be closed. |
Hello, I have exactly the same problem that is related in this subject: "Can´t read data from Excel file generated by openpyxl #842" PHPOffice/PHPExcel#842. Thank you in advance. |
@LouiseGs I think I documented this somewhere else; however, I can't find it so I'll post it here. I was having an issue reading an .xlsx file that appears to be generated from SpreadsheetLight. When digging into the code it's due to this issue which is having namespaces in the file. I used the following solution: <?php
# src/Company/Bridge/PhpOffice/PhpSpreadsheet/Reader/Excel2007_NonStandardNamespacesWorkaround.php
namespace Company\Bridge\PhpOffice\PhpSpreadsheet\Reader;
use PhpOffice\PhpSpreadsheet\Reader\Xlsx;
/**
* @see https://stackoverflow.com/questions/29514200/phpexcel-parsing-xlsx-files-with-x-namespace
* @see https://github.com/PHPOffice/PHPExcel/issues/1187#issuecomment-295032648
* @see https://github.com/PHPOffice/PhpSpreadsheet/issues/500
*/
class Excel2007_NonStandardNamespacesWorkaround extends Xlsx
{
public function __construct()
{
parent::__construct();
$this->securityScanner->setAdditionalCallback([self::class, 'securityScan']);
}
public static function securityScan($xml)
{
return str_replace(
[
'<x:',
'</x:',
/*':x=',*/
'<d:',
'</d:',
/*, ':d='*/
],
[
'<',
'</',
/*'=',*/
'<',
'</',
/*, '='*/
],
$xml
);
}
} <?php
# src/Company/Bundle/Device/WidgetBundle/Tests/Component/Import/Importer/WidgetImporterTest.php
namespace Company\Bundle\Device\WidgetBundle\Tests\Component\Import\Importer;
use Ramsey\Uuid\Uuid;
use Company\Bridge\PhpOffice\PhpSpreadsheet\Reader\Excel2007_NonStandardNamespacesWorkaround;
class WidgetImporterTest extends \vi_Test_Case_Unit
{
/**
* @dataProvider fileProvider
*/
public function testPhpSpreadSheet($inputFileName)
{
$inputFileName = __DIR__ . '/./../../../../resources/data/' . $inputFileName;
$reader = new Excel2007_NonStandardNamespacesWorkaround();
$spreadsheet = $reader->load($inputFileName);
$sheets = $spreadsheet->getSheetNames();
$this->assertNotEmpty($sheets);
$this->assertEquals('Widget Test Results', $sheets[0]);
} |
Thank you for your help @isleshocky77. I think I have already tried it but I will try again. |
Hello, I just read the PHPOffice/PHPExcel#842 and it seems that your library is using the namespace "nso". In my solution (and the one from @isleshocky77), we are only removing namespace "x" and "d". In my solution, you can try to add "nso" (or "ns0" I don't really know which one is used) in the array |
I had already added “nso” but not “ns0” and I’m not sure that I have called the function via the securityScanner callback. I will try this as soon as possible and tell you the result. |
Thank you isleshocky77 and JulienChavee for your help ! |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
This is still an issue. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
This is still an issue. |
This is still an issue, please help |
Simpler solutionInside the XmlScanner.php, add the following line: public function scan($xml)
{
//fails if we replace <v:
//$xml = preg_replace('/(<\/?)(\w+:)/', "$1", $xml);
//$xml = preg_replace('/(<\/?)((ap|vt|x|v)+:)/', "$1", $xml);
//$xml = preg_replace('/(<\/?)((ap|vt|x)+:)/', "$1", $xml);
//doesn't fail with <ap: and <vt:
$xml = preg_replace('/(<\/?)(x+:)/', "$1", $xml);
...
...
} Or place that code inside Xlsx.php: private function getFromZipArchive(ZipArchive $archive, $fileName = '')
{
...
...
if ($contents === false) {
$contents = $archive->getFromName(substr($fileName, 1), 0, ZipArchive::FL_NOCASE);
}
$contents = preg_replace('/(<\/?)(x+:)/', "$1", $contents);
return $contents;
} I unzipped xlsx file and open all xml file, to investigate the differences between xlsx files (one ok and the other returning an empty sheet). I get into this because of the following link: PhpWord - Reader - Word2007.php // Remove namespaces from elements and attributes name
if ($removeNamespace) {
$contents = preg_replace('~(</?|\s)w:~is', '$1', $contents);
} But It doesn't work, don't know why, so I did my own version. Hope it helps, and comment about that link I found, it could help. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
Hello, this is still an issue |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
Still an issue. |
Still an issue |
I am experiencing the same issue - any plans to fix this in future releases? |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
This is still an issue |
Is this still an issue? |
Yes it is |
We also ran into this bug and the 'hack' mentioned by esteban-code does result in data (sheets) becoming available (while there were no sheets loaded before) A permanent solution for this bug would be nice. |
I believe we're facing a similar issue. An XLSX file from an external source gives the error:
Converting the file to CSV beforehand fixes it, or opening in Excel and re-saving it. Not ideal for large numbers of files though. |
Since XML namespaces are frequently misunderstood, I'd like to point out a few things: Firstly, the local prefix of a namespace does not affect its meaning, it is just for use in that particular document (or fragment). The identifier of a namespace is a URI (not a URI that points anywhere, just a unique URI for ownership purposes). The default namespace is basically the same, just think of it as having an empty prefix. The following three XML documents should all be treated as completely equivalent:
Secondly, PHP's SimpleXML library has full support for namespaces, but the programmer needs to account for them by "switching" namespaces using the So, while mangling the XML to forcefully change the namespaces will work around this bug, it is not the correct fix. There is nothing wrong, or particularly "exotic" about the XML. The actual bug is that somewhere in the PhpSpreadsheet library, it is not selecting the correct namespace. It just happens that most XLSX files (those created by Excel itself, I'm guessing) have the required namespace as the default (no-prefix) namespace, and that gets selected by SimpleXML by default. |
Actually, looking at the source code, I think the trouble goes deeper - I see lots of use of this anti-pattern: $namespaces = $xml->getNamespaces(true);
$xml_ss = $xml->children($namespaces['ss']); This looks at the local aliases in the XML document, and picks whichever happens to have the prefix 'ss'. Every instance of that pattern needs to be replaced with a reference to the actual namespace URI, e.g. $xml_ss = $xml->children('urn:schemas-microsoft-com:office:spreadsheet'); Probably the list of namespaces in this comment could be put into constants somewhere in the library. Not listed there is what namespace is expected to be the default, which should be treated exactly the same as the others. |
There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment PHPOffice#860 (comment) in issue PHPOffice#860 by @IMSoP has triggered an idea about how to proceed. Although the issues exclusively concern Xlsx format, I am starting out by dealing with Gnumeric. It is simpler and smaller than Xlsx, and, more important, already has a test for an unexpected prefix, since, at some point, it changed its generic prefix from gmr to gnm. I added support and a test for that some time ago, but almost certainly not in the best possible manner. The code as changed for this PR seems simpler and less kludgey, both for that exceptional case as well as for normal handling. My hope is that this change can be a template for similar Reader changes for Xml, Ods, and, especially, Xlsx. All grandfathered Phpstan issues with Gnumeric are fixed and eliminated from baseline as part of this change.
* Gnumeric Better Namespace Handling There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment #860 (comment) in issue #860 by @IMSoP has triggered an idea about how to proceed. Although the issues exclusively concern Xlsx format, I am starting out by dealing with Gnumeric. It is simpler and smaller than Xlsx, and, more important, already has a test for an unexpected prefix, since, at some point, it changed its generic prefix from gmr to gnm. I added support and a test for that some time ago, but almost certainly not in the best possible manner. The code as changed for this PR seems simpler and less kludgey, both for that exceptional case as well as for normal handling. My hope is that this change can be a template for similar Reader changes for Xml, Ods, and, especially, Xlsx. All grandfathered Phpstan issues with Gnumeric are fixed and eliminated from baseline as part of this change. * Namespace Handling using XMLReader Adopt a suggestion from @IMSoP affecting listWorkSheetInfo, which uses XMLReader rather than SimpleXML for its processing. * Update GnumericLoadTest.php PR #2024 was pushed last night, causing a Phpstan problem with this member. * Update Gnumeric.php Suggestions from Mark Baker - strict equality test, more descriptive variable names.
There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment PHPOffice#860 (comment) in issue PHPOffice#860 by @IMSoP has triggered an idea about how to proceed. Gnumeric Reader was recently changed to handle namespaces better. Using that as a model, this PR begins the process of doing the same for Xlsx. Xlsx is much larger and more complicated than Gnumeric, hence the need to tackle it in multiple phases. I believe that this PR handles all of: - listWorkSheetNames - listWorkSheetInfo. Note that there was a bug in this function which would cause it to count only used columns rather than all columns. That bug is corrected. - active sheet - selected cell and top left cell - cell content (formulas, numbers, text) - hyperlinks - comments (partial - see below) This PR does not address: - styles - images and charts - VBA and ribbons - many other items, I'm sure The issue for non-standard namespacing till now has been the use of unexpected prefixes. While I was working on this change, @Lambik introduced issue PHPOffice#2067 PR PHPOffice#2068 which introduced a completely different problem - the use of unexpected URLs. That PR and the issue associated with it were quite well documented, including the supplying of a test file and tests for it. I asked if I could take a look to see if it could be integrated with my change, and the result seems to be yes, so those changes are also part of this PR. While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult. Although there are a number of Phpstan baseline changes as part of this PR, I did not attempt to resolve all Phpstan reports for Reader/Xlsx. Nor did I do anything to increase coverage. This change is already large and complex enough without those efforts. I will add more detail as comments after I push this change.
This is a replacement for PHPOffice#2088, which has run into merge conflicts. I will close that PR in the near future, however the comments in that PR may prove useful for this one. While that PR has been in draft status all along, I am marking this one as ready. I will gladly add additional tests (and, of course, make code changes) that anyone has to suggest, but, with my most recent test files which I will describe in a separate comment, I have no further ideas on useful additions. As mentioned in the earlier ticket, this is a risky change. But, as has been demonstrated, delaying it comes with its own set of risks. It would be helpful to have a temporary moratorium on changes to Reader/Xlsx until this change is merged. The original commit message follows. There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment PHPOffice#860 (comment) in issue PHPOffice#860 by @IMSoP has triggered an idea about how to proceed. Gnumeric Reader was recently changed to handle namespaces better. Using that as a model, this PR begins the process of doing the same for Xlsx. Xlsx is much larger and more complicated than Gnumeric, hence the need to tackle it in multiple phases. I believe that this PR handles all of: - listWorkSheetNames - listWorkSheetInfo. Note that there was a bug in this function which would cause it to count only used columns rather than all columns. That bug is corrected. - active sheet - selected cell and top left cell - cell content (formulas, numbers, text) - hyperlinks - comments (partial - see below) This PR does not address: - styles - images and charts - VBA and ribbons - many other items, I'm sure The issue for non-standard namespacing till now has been the use of unexpected prefixes. While I was working on this change, @Lambik introduced issue PHPOffice#2067 PR PHPOffice#2068 which introduced a completely different problem - the use of unexpected URLs. That PR and the issue associated with it were quite well documented, including the supplying of a test file and tests for it. I asked if I could take a look to see if it could be integrated with my change, and the result seems to be yes, so those changes are also part of this PR. While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult. Although there are a number of Phpstan baseline changes as part of this PR, I did not attempt to resolve all Phpstan reports for Reader/Xlsx. Nor did I do anything to increase coverage. This change is already large and complex enough without those efforts. I will add more detail as comments after I push this change.
* Xlsx Reader Better Namespace Handling Phase 1 Try2 This is a replacement for #2088, which has run into merge conflicts. I will close that PR in the near future, however the comments in that PR may prove useful for this one. While that PR has been in draft status all along, I am marking this one as ready. I will gladly add additional tests (and, of course, make code changes) that anyone has to suggest, but, with my most recent test files which I will describe in a separate comment, I have no further ideas on useful additions. As mentioned in the earlier ticket, this is a risky change. But, as has been demonstrated, delaying it comes with its own set of risks. It would be helpful to have a temporary moratorium on changes to Reader/Xlsx until this change is merged. The original commit message follows. There have been a number of issues concerning the handling of legitimate but unexpected namespace prefixes in Xlsx spreadsheets created by software other than Excel and PhpSpreadsheet/PhpExcel.I have studied them, but, till now, have not had a good idea on how to act on them. A recent comment #860 (comment) in issue #860 by @IMSoP has triggered an idea about how to proceed. Gnumeric Reader was recently changed to handle namespaces better. Using that as a model, this PR begins the process of doing the same for Xlsx. Xlsx is much larger and more complicated than Gnumeric, hence the need to tackle it in multiple phases. I believe that this PR handles all of: - listWorkSheetNames - listWorkSheetInfo. Note that there was a bug in this function which would cause it to count only used columns rather than all columns. That bug is corrected. - active sheet - selected cell and top left cell - cell content (formulas, numbers, text) - hyperlinks - comments (partial - see below) This PR does not address: - styles - images and charts - VBA and ribbons - many other items, I'm sure The issue for non-standard namespacing till now has been the use of unexpected prefixes. While I was working on this change, @Lambik introduced issue #2067 PR #2068 which introduced a completely different problem - the use of unexpected URLs. That PR and the issue associated with it were quite well documented, including the supplying of a test file and tests for it. I asked if I could take a look to see if it could be integrated with my change, and the result seems to be yes, so those changes are also part of this PR. While adding a comment to my test file, I discovered that Microsoft had added "threaded comments" as a new feature. I believe these are not yet supported by PhpSpreadsheet, and I am not going to add it, at least not now. I believe that, among other things, this will make identifying the author of a comment more difficult. Although there are a number of Phpstan baseline changes as part of this PR, I did not attempt to resolve all Phpstan reports for Reader/Xlsx. Nor did I do anything to increase coverage. This change is already large and complex enough without those efforts.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
Is this still an issue? |
@esteban-code The fix for this is a complex and far-reaching change to the library, so is always going to be part of a new release, not something that can trivially be applied as a patch to an old one. The version of the library you are on is two and a half years' old, and PHP 7.1 itself stopped receiving official security updates more than 2 years ago, so you're unlikely to find many people volunteering to test their libraries on it. I know updating projects to work on newer versions can be a lot of effort, but that's just the price you have to pay for progress in this business, I'm afraid. |
@IMSoP Ok, I will update my environment. So the issue is finally fixed? And since which version of PhpSpreadSheet is finally working as expected? |
1.20 was the first phase, will mostly handle the data, but not styles. |
PR #2471, which handles merging of styles, has just been merged. Closing this issue. |
This is:
What is the expected behavior?
Xlsx that contains namespaces in their XML are correctly loaded
What is the current behavior?
The array of data after reading the xlsx is empty
What are the steps to reproduce?
I'm using the following code to read this file: example.xlsx
Which versions of PhpSpreadsheet and PHP are affected?
Solution
As requested in #206, I tried to find a better solution instead of just
str_replace
.As I'm not an expert in XLSX and XML, I don't know if it works correctly for all cases.
All tests I've done are OK with this solution, so can you take a look this and give me your review? If it's seems good, I can submit a pull request with this.
The text was updated successfully, but these errors were encountered: