Bug in function XmlScanner::toUtf8 -> Suspicious Double-encoded XML #1681

JulienChavee · 2020-10-14T16:15:16Z

This is:

- [ x ] a bug report
- [ ] a feature request
- [ ] **not** a usage question (ask them on https://stackoverflow.com/questions/tagged/phpspreadsheet or https://gitter.im/PHPOffice/PhpSpreadsheet)

What is the expected behavior?

A file encoded with something different than UTF-8 can be imported

What is the current behavior?

When we have an XML file that has the "encoding=..." the function in XmlScanner::toUtf8 will always throw an exception.

The call the mb_convert_encoding() do not modify the XML to change the "encoding=...".
As of this, when you check again if the charset is now UTF-8 the second time, it's still false as it's based on the preg_match.

Which versions of PhpSpreadsheet and PHP are affected?

>= 1.12

The text was updated successfully, but these errors were encountered:

stale · 2020-12-25T12:29:26Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
If this is still an issue for you, please try to help by debugging it further and sharing your results.
Thank you for your contributions.

JulienChavee · 2020-12-26T02:25:42Z

This is still an issue.

oleibman · 2022-09-16T17:53:31Z

Adding in the necessary support would involve a lot of work, without a lot of gain. Your suggestion might work for certain character sets, but would require extra work for others (e.g. UTF-16). Even so, there would be other things we need to check for beyond the XML declaration (e.g. charset or content-type declaration) which would also have to be changed to make the document usable, and we would need to change them only when they are attributes rather than content, and even then only when they are attributes for the right tag (e.g. we want to find charset in a meta tag but not in a script tag). In addition, I don't think the Php DOM parser handles the HTML5-style meta charset (it does handle the old meta content-type). I don't think the UTF-8 restriction is an onerous requirement, so I cannot justify the effort that would be required to make this change; however, I will be glad to review a PR if you wish to submit one.

This restriction should be documented. I will open a PR that updates the documentation appropriately.

Fix PHPOffice#1681, although probably not to the originator's satisfaction. Html and Xml readers will handle documents only with a charset of UTF-8. This PR documents that restriction. No change to source code; see the original issue for explanation.

Fix #1681, although probably not to the originator's satisfaction. Html and Xml readers will handle documents only with a charset of UTF-8. This PR documents that restriction. No change to source code; see the original issue for explanation.

Fix PHPOffice#3995. Fix PHPOffice#866. Fix PHPOffice#1681. Php DOM loadhtml defaults to character set ISO-8859-1, but our data is UTF-8. So Html Reader alters its html so that loadhtml will not misinterpret characters outside the ASCII range. This works for UTF-8, but breaks other charsets. However, loadhtml uses the correct non-default charset when charset is specified in a meta tag, or when the html starts with a BOM. So, it is sufficient for us to alter the non-ASCII characters only when (a) the data does not start with a BOM, and (b) there is no charset tag. This will allow us to use: - UTF-8 files or snippets without BOM, with or without charset - UTF-8 files with BOM (charset should not be specified and will be ignored if it is) - UTF-16 files with BOM (charset should not be specified and will be ignored if it is) - all charsets which are ASCII-compatible for 0x00-0x7f when the charset is declared. This applies to ASCII itself, many Windows and Mac charsets, all of ISO-8859, and most CJK and other-language-specific charsets. We cannot use: - UTF-16BE or UTF-16LE declared in a meta tag - UTF-32, with or without a BOM (browser recommendation is to not support UTF-32, and most browsers do not support it) - unknown (to loadhtml) or non-ASCII-compatible charsets (EBCDIC?) I will note that the way I detect the `charset` attribute is imperfect (e.g. might find it in text rather than a meta tag). I think we'd need to write a browser to get it perfect. Anyhow, it is about the same as XmlScanner's attempt to find the `encoding` attribute, and, if it's good enough there, it ought to be good enough here.

stale bot added the stale label Dec 25, 2020

stale bot removed the stale label Dec 26, 2020

oleibman mentioned this issue Sep 16, 2022

Document Charset Restriction for Html/Xml Reader #3068

Merged

7 tasks

oleibman closed this as completed in #3068 Sep 19, 2022

oleibman mentioned this issue Apr 22, 2024

Checking and converting loaded content in HTML Reader based on charset #3995

Closed

oleibman mentioned this issue May 6, 2024

Html Reader Non-UTF8 Charsets #4019

Merged

11 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug in function XmlScanner::toUtf8 -> Suspicious Double-encoded XML #1681

Bug in function XmlScanner::toUtf8 -> Suspicious Double-encoded XML #1681

JulienChavee commented Oct 14, 2020 •

edited

Loading

stale bot commented Dec 25, 2020

JulienChavee commented Dec 26, 2020

oleibman commented Sep 16, 2022

Bug in function XmlScanner::toUtf8 -> Suspicious Double-encoded XML #1681

Bug in function XmlScanner::toUtf8 -> Suspicious Double-encoded XML #1681

Comments

JulienChavee commented Oct 14, 2020 • edited Loading

What is the expected behavior?

What is the current behavior?

Which versions of PhpSpreadsheet and PHP are affected?

stale bot commented Dec 25, 2020

JulienChavee commented Dec 26, 2020

oleibman commented Sep 16, 2022

JulienChavee commented Oct 14, 2020 •

edited

Loading