Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in function XmlScanner::toUtf8 -> Suspicious Double-encoded XML #1681

Closed
JulienChavee opened this issue Oct 14, 2020 · 3 comments · Fixed by #3068 or #4019
Closed

Bug in function XmlScanner::toUtf8 -> Suspicious Double-encoded XML #1681

JulienChavee opened this issue Oct 14, 2020 · 3 comments · Fixed by #3068 or #4019

Comments

@JulienChavee
Copy link

JulienChavee commented Oct 14, 2020

This is:

- [ x ] a bug report
- [ ] a feature request
- [ ] **not** a usage question (ask them on https://stackoverflow.com/questions/tagged/phpspreadsheet or https://gitter.im/PHPOffice/PhpSpreadsheet)

What is the expected behavior?

A file encoded with something different than UTF-8 can be imported

What is the current behavior?

When we have an XML file that has the "encoding=..." the function in XmlScanner::toUtf8 will always throw an exception.

The call the mb_convert_encoding() do not modify the XML to change the "encoding=...".
As of this, when you check again if the charset is now UTF-8 the second time, it's still false as it's based on the preg_match.

Which versions of PhpSpreadsheet and PHP are affected?

>= 1.12

@stale
Copy link

stale bot commented Dec 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
If this is still an issue for you, please try to help by debugging it further and sharing your results.
Thank you for your contributions.

@stale stale bot added the stale label Dec 25, 2020
@JulienChavee
Copy link
Author

This is still an issue.

@stale stale bot removed the stale label Dec 26, 2020
@oleibman
Copy link
Collaborator

Adding in the necessary support would involve a lot of work, without a lot of gain. Your suggestion might work for certain character sets, but would require extra work for others (e.g. UTF-16). Even so, there would be other things we need to check for beyond the XML declaration (e.g. charset or content-type declaration) which would also have to be changed to make the document usable, and we would need to change them only when they are attributes rather than content, and even then only when they are attributes for the right tag (e.g. we want to find charset in a meta tag but not in a script tag). In addition, I don't think the Php DOM parser handles the HTML5-style meta charset (it does handle the old meta content-type). I don't think the UTF-8 restriction is an onerous requirement, so I cannot justify the effort that would be required to make this change; however, I will be glad to review a PR if you wish to submit one.

This restriction should be documented. I will open a PR that updates the documentation appropriately.

oleibman added a commit to oleibman/PhpSpreadsheet that referenced this issue Sep 16, 2022
Fix PHPOffice#1681, although probably not to the originator's satisfaction. Html and Xml readers will handle documents only with a charset of UTF-8. This PR documents that restriction. No change to source code; see the original issue for explanation.
oleibman added a commit that referenced this issue Sep 19, 2022
Fix #1681, although probably not to the originator's satisfaction. Html and Xml readers will handle documents only with a charset of UTF-8. This PR documents that restriction. No change to source code; see the original issue for explanation.
oleibman added a commit to oleibman/PhpSpreadsheet that referenced this issue May 6, 2024
Fix PHPOffice#3995. Fix PHPOffice#866. Fix PHPOffice#1681. Php DOM loadhtml defaults to character set ISO-8859-1, but our data is UTF-8. So Html Reader alters its html so that loadhtml will not misinterpret characters outside the ASCII range. This works for UTF-8, but breaks other charsets. However, loadhtml uses the correct non-default charset when charset is specified in a meta tag, or when the html starts with a BOM. So, it is sufficient for us to alter the non-ASCII characters only when (a) the data does not start with a BOM, and (b) there is no charset tag.

This will allow us to use:
- UTF-8 files or snippets without BOM, with or without charset
- UTF-8 files with BOM (charset should not be specified and will be ignored if it is)
- UTF-16 files with BOM (charset should not be specified and will be ignored if it is)
- all charsets which are ASCII-compatible for 0x00-0x7f when the charset is declared. This applies to ASCII itself, many Windows and Mac charsets, all of ISO-8859, and most CJK and other-language-specific charsets.

We cannot use:
- UTF-16BE or UTF-16LE declared in a meta tag
- UTF-32, with or without a BOM (browser recommendation is to not support UTF-32, and most browsers do not support it)
- unknown (to loadhtml) or non-ASCII-compatible charsets (EBCDIC?)

I will note that the way I detect the `charset` attribute is imperfect (e.g. might find it in text rather than a meta tag). I think we'd need to write a browser to get it perfect. Anyhow, it is about the same as XmlScanner's attempt to find the `encoding` attribute, and, if it's good enough there, it ought to be good enough here.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
2 participants