-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug in function XmlScanner::toUtf8 -> Suspicious Double-encoded XML #1681
Comments
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
This is still an issue. |
Adding in the necessary support would involve a lot of work, without a lot of gain. Your suggestion might work for certain character sets, but would require extra work for others (e.g. UTF-16). Even so, there would be other things we need to check for beyond the XML declaration (e.g. charset or content-type declaration) which would also have to be changed to make the document usable, and we would need to change them only when they are attributes rather than content, and even then only when they are attributes for the right tag (e.g. we want to find charset in a meta tag but not in a script tag). In addition, I don't think the Php DOM parser handles the HTML5-style meta charset (it does handle the old meta content-type). I don't think the UTF-8 restriction is an onerous requirement, so I cannot justify the effort that would be required to make this change; however, I will be glad to review a PR if you wish to submit one. This restriction should be documented. I will open a PR that updates the documentation appropriately. |
Fix PHPOffice#1681, although probably not to the originator's satisfaction. Html and Xml readers will handle documents only with a charset of UTF-8. This PR documents that restriction. No change to source code; see the original issue for explanation.
Fix #1681, although probably not to the originator's satisfaction. Html and Xml readers will handle documents only with a charset of UTF-8. This PR documents that restriction. No change to source code; see the original issue for explanation.
Fix PHPOffice#3995. Fix PHPOffice#866. Fix PHPOffice#1681. Php DOM loadhtml defaults to character set ISO-8859-1, but our data is UTF-8. So Html Reader alters its html so that loadhtml will not misinterpret characters outside the ASCII range. This works for UTF-8, but breaks other charsets. However, loadhtml uses the correct non-default charset when charset is specified in a meta tag, or when the html starts with a BOM. So, it is sufficient for us to alter the non-ASCII characters only when (a) the data does not start with a BOM, and (b) there is no charset tag. This will allow us to use: - UTF-8 files or snippets without BOM, with or without charset - UTF-8 files with BOM (charset should not be specified and will be ignored if it is) - UTF-16 files with BOM (charset should not be specified and will be ignored if it is) - all charsets which are ASCII-compatible for 0x00-0x7f when the charset is declared. This applies to ASCII itself, many Windows and Mac charsets, all of ISO-8859, and most CJK and other-language-specific charsets. We cannot use: - UTF-16BE or UTF-16LE declared in a meta tag - UTF-32, with or without a BOM (browser recommendation is to not support UTF-32, and most browsers do not support it) - unknown (to loadhtml) or non-ASCII-compatible charsets (EBCDIC?) I will note that the way I detect the `charset` attribute is imperfect (e.g. might find it in text rather than a meta tag). I think we'd need to write a browser to get it perfect. Anyhow, it is about the same as XmlScanner's attempt to find the `encoding` attribute, and, if it's good enough there, it ought to be good enough here.
This is:
What is the expected behavior?
A file encoded with something different than UTF-8 can be imported
What is the current behavior?
When we have an XML file that has the "encoding=..." the function in XmlScanner::toUtf8 will always throw an exception.
The call the mb_convert_encoding() do not modify the XML to change the "encoding=...".
As of this, when you check again if the charset is now UTF-8 the second time, it's still false as it's based on the preg_match.
Which versions of PhpSpreadsheet and PHP are affected?
>= 1.12
The text was updated successfully, but these errors were encountered: