Loading UTF8 content is not possible with HTML reader class #866
Comments
HTML markup shouldn't be identified as UTF-8 by a BOM, but by [...]. There is actually a block of code in the HTML Reader that is intended for exactly that:
[...]
though as you can see it's flagged as [...]. However, while this will be implemented in due course, testing for a BOM will not be, as that isn't valid for HTML markup.
I would respectfully like to disagree.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Fix PHPOffice#3995. Fix PHPOffice#866. Fix PHPOffice#1681.

Php DOM loadhtml defaults to character set ISO-8859-1, but our data is UTF-8. So Html Reader alters its html so that loadhtml will not misinterpret characters outside the ASCII range. This works for UTF-8, but breaks other charsets. However, loadhtml uses the correct non-default charset when charset is specified in a meta tag, or when the html starts with a BOM. So, it is sufficient for us to alter the non-ASCII characters only when (a) the data does not start with a BOM, and (b) there is no charset tag.

This will allow us to use:
- UTF-8 files or snippets without BOM, with or without charset
- UTF-8 files with BOM (charset should not be specified and will be ignored if it is)
- UTF-16 files with BOM (charset should not be specified and will be ignored if it is)
- all charsets which are ASCII-compatible for 0x00-0x7f when the charset is declared. This applies to ASCII itself, many Windows and Mac charsets, all of ISO-8859, and most CJK and other-language-specific charsets.

We cannot use:
- UTF-16BE or UTF-16LE declared in a meta tag
- UTF-32, with or without a BOM (browser recommendation is to not support UTF-32, and most browsers do not support it)
- unknown (to loadhtml) or non-ASCII-compatible charsets (EBCDIC?)

I will note that the way I detect the `charset` attribute is imperfect (e.g. might find it in text rather than a meta tag). I think we'd need to write a browser to get it perfect. Anyhow, it is about the same as XmlScanner's attempt to find the `encoding` attribute, and, if it's good enough there, it ought to be good enough here.
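The decision rule in the commit message above (rewrite non-ASCII characters only when there is no BOM and no charset declaration) can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function names `startsWithBom`, `declaresCharset`, and `needsNonAsciiRewrite` are hypothetical, and the charset check is deliberately as loose as the one the commit message describes.

```php
<?php

/**
 * Hypothetical helper (not PhpSpreadsheet's actual code):
 * true when the data begins with a UTF-8 or UTF-16 byte order mark.
 */
function startsWithBom(string $html): bool
{
    // UTF-8, UTF-16BE and UTF-16LE BOMs, in that order.
    // A UTF-32LE BOM also begins with FF FE; UTF-32 is out of scope here.
    foreach (["\xEF\xBB\xBF", "\xFE\xFF", "\xFF\xFE"] as $bom) {
        if (strncmp($html, $bom, strlen($bom)) === 0) {
            return true;
        }
    }

    return false;
}

/**
 * Hypothetical helper: true when something that looks like a charset
 * declaration appears anywhere in the data. Imperfect by design: it
 * also matches "charset=" in ordinary text, not only in a meta tag.
 */
function declaresCharset(string $html): bool
{
    return preg_match('/\bcharset\s*=/i', $html) === 1;
}

/**
 * Conditions (a) and (b): rewrite non-ASCII characters only when the
 * data has no BOM and no charset declaration.
 */
function needsNonAsciiRewrite(string $html): bool
{
    return !startsWithBom($html) && !declaresCharset($html);
}
```

Under these assumptions, a bare UTF-8 snippet would be rewritten before parsing, while a file that starts with a BOM or declares its charset would be passed through untouched so the parser can apply the encoding itself.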
Fixed by PR #4019.
This is:
What is the expected behavior?
UTF8 HTML content loaded into the spreadsheet.
What is the current behavior?
Failure to load because the validation of tags cannot be completed.
startsWithTag needs to check for BOM when looking for first character in document.
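As an illustration of the idea in the previous sentence, a BOM-tolerant check might look something like the sketch below. The function name `startsWithTagAfterBom` is hypothetical; this is not the reader's real `startsWithTag` method and not the snippet originally proposed in this issue. It assumes a UTF-8 BOM only.

```php
<?php

/**
 * Hypothetical sketch: report whether the first non-whitespace
 * character is '<', ignoring a leading UTF-8 byte order mark.
 */
function startsWithTagAfterBom(string $data): bool
{
    $utf8Bom = "\xEF\xBB\xBF";

    // Drop the three BOM bytes if the data begins with them.
    if (strncmp($data, $utf8Bom, strlen($utf8Bom)) === 0) {
        $data = (string) substr($data, strlen($utf8Bom));
    }

    // Skip leading whitespace, then look at the first character.
    return substr(ltrim($data), 0, 1) === '<';
}
```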
What are the steps to reproduce?
Use Simple example 46. Open template file and add BOM. Retry example.
Which versions of PhpSpreadsheet and PHP are affected?
"phpoffice/phpspreadsheet": "1.6.0"
"php": ">=7.1.20"
Suggested simple solution in startsWithTag:
Make function not static and add