-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Html Reader Non-UTF8 Charsets #4019
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Fix PHPOffice#3995. Fix PHPOffice#866. Fix PHPOffice#1681. Php DOM loadhtml defaults to character set ISO-8859-1, but our data is UTF-8. So Html Reader alters its html so that loadhtml will not misinterpret characters outside the ASCII range. This works for UTF-8, but breaks other charsets. However, loadhtml uses the correct non-default charset when charset is specified in a meta tag, or when the html starts with a BOM. So, it is sufficient for us to alter the non-ASCII characters only when (a) the data does not start with a BOM, and (b) there is no charset tag. This will allow us to use: - UTF-8 files or snippets without BOM, with or without charset - UTF-8 files with BOM (charset should not be specified and will be ignored if it is) - UTF-16 files with BOM (charset should not be specified and will be ignored if it is) - all charsets which are ASCII-compatible for 0x00-0x7f when the charset is declared. This applies to ASCII itself, many Windows and Mac charsets, all of ISO-8859, and most CJK and other-language-specific charsets. We cannot use: - UTF-16BE or UTF-16LE declared in a meta tag - UTF-32, with or without a BOM (browser recommendation is to not support UTF-32, and most browsers do not support it) - unknown (to loadhtml) or non-ASCII-compatible charsets (EBCDIC?) I will note that the way I detect the `charset` attribute is imperfect (e.g. might find it in text rather than a meta tag). I think we'd need to write a browser to get it perfect. Anyhow, it is about the same as XmlScanner's attempt to find the `encoding` attribute, and, if it's good enough there, it ought to be good enough here.
11 tasks
Scrutinizer new issues are both false positives, and are now suppressed. |
oleibman
added a commit
to oleibman/PhpSpreadsheet
that referenced
this pull request
May 13, 2024
Continuing work started with PR PHPOffice#4019. Improve documentation within program by making explicit what types of values are allowed for variables described as "mixed". In order to avoid broken functionality, this is done mainly through doc-blocks. This will get us closer to Phpstan Level 9, but many changes will be needed before we can consider that. This change has more executable code changes than its predecessor. I will wait longer than normal before merging it to allow for additional testing.
12 tasks
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fix #3995. Fix #866. Fix #1681. Php DOM loadhtml defaults to character set ISO-8859-1, but our data is UTF-8. So Html Reader alters its html so that loadhtml will not misinterpret characters outside the ASCII range. This works for UTF-8, but breaks other charsets. However, loadhtml uses the correct non-default charset when charset is specified in a meta tag, or when the html starts with a BOM. So, it is sufficient for us to alter the non-ASCII characters only when (a) the data does not start with a BOM, and (b) there is no charset tag.
This will allow us to use:
We cannot use:
I will note that the way I detect the
charset
attribute is imperfect (e.g. might find it in text rather than a meta tag). I think we'd need to write a browser to get it perfect. Anyhow, it is about the same as XmlScanner's attempt to find theencoding
attribute, and, if it's good enough there, it ought to be good enough here.This is:
Checklist:
Why this change is needed?
Provide an explanation of why this change is needed, with links to any Issues (if appropriate).
If this is a bugfix or a new feature, and there are no existing Issues, then please also create an issue that will make it easier to track progress with this PR.