Html Reader Non-UTF8 Charsets #4019

oleibman · 2024-05-06T23:45:09Z

Fix #3995. Fix #866. Fix #1681. PHP's DOM `loadHTML` defaults to character set ISO-8859-1, but our data is UTF-8. So Html Reader alters its HTML so that `loadHTML` will not misinterpret characters outside the ASCII range. This works for UTF-8, but breaks other charsets. However, `loadHTML` uses the correct non-default charset when a charset is specified in a meta tag, or when the HTML starts with a BOM. So it is sufficient for us to alter the non-ASCII characters only when (a) the data does not start with a BOM, and (b) there is no charset tag.

This will allow us to use:

- UTF-8 files or snippets without BOM, with or without charset
- UTF-8 files with BOM (charset should not be specified and will be ignored if it is)
- UTF-16 files with BOM (charset should not be specified and will be ignored if it is)
- all charsets which are ASCII-compatible for 0x00-0x7f when the charset is declared. This applies to ASCII itself, many Windows and Mac charsets, all of ISO-8859, and most CJK and other-language-specific charsets.

We cannot use:

- UTF-16BE or UTF-16LE declared in a meta tag
- UTF-32, with or without a BOM (browser recommendation is to not support UTF-32, and most browsers do not support it)
- unknown (to `loadHTML`) or non-ASCII-compatible charsets (EBCDIC?)
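The rule from the opening paragraph (escape non-ASCII characters only when there is neither a BOM nor a charset declaration) can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function names and both regular expressions are assumptions, and it assumes the `mbstring` extension is available for `mb_ord`.

```php
<?php

// Hypothetical helpers illustrating the rule described in this PR.
// Names and patterns are assumptions, not the reader's actual code.

/** True when the data starts with a UTF-8 or UTF-16 (BE/LE) byte-order mark. */
function startsWithBom(string $html): bool
{
    return preg_match('/^(?:\xEF\xBB\xBF|\xFE\xFF|\xFF\xFE)/', $html) === 1;
}

/** Loose check for a charset declaration anywhere in the input. */
function declaresCharset(string $html): bool
{
    return preg_match('/\bcharset\s*=/i', $html) === 1;
}

/**
 * Replace non-ASCII characters with numeric entities, but only when the
 * input has no BOM and no charset declaration (it is then assumed to be UTF-8).
 */
function escapeNonAsciiIfNeeded(string $html): string
{
    if (startsWithBom($html) || declaresCharset($html)) {
        // The parser can determine the charset on its own; leave the bytes alone.
        return $html;
    }

    $result = preg_replace_callback(
        '/[\x{80}-\x{10FFFF}]/u',
        static fn (array $m): string => '&#' . mb_ord($m[0], 'UTF-8') . ';',
        $html
    );

    // preg_replace_callback returns null if the input is not valid UTF-8.
    return $result ?? $html;
}

echo escapeNonAsciiIfNeeded("<p>caf\u{E9}</p>"), PHP_EOL; // <p>caf&#233;</p>
```

Input that starts with a BOM, or that contains something like `<meta charset="ISO-8859-1">`, is returned unchanged by this sketch, so bytes in a declared non-UTF-8 charset are not rewritten as if they were UTF-8.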

I will note that the way I detect the `charset` attribute is imperfect (e.g. might find it in text rather than a meta tag). I think we'd need to write a browser to get it perfect. Anyhow, it is about the same as XmlScanner's attempt to find the `encoding` attribute, and, if it's good enough there, it ought to be good enough here.
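To make the imperfection concrete, here is a small sketch of a loose text search for a charset declaration. The pattern and function name are assumptions for illustration only; the pattern actually used by the reader may differ.

```php
<?php

// Hypothetical pattern: looks for "charset=" anywhere in the input,
// without checking that it sits inside a meta tag.
const LOOSE_CHARSET_PATTERN = '/\bcharset\s*=/i';

function looksLikeCharsetDeclared(string $html): bool
{
    return preg_match(LOOSE_CHARSET_PATTERN, $html) === 1;
}

$metaCharset   = '<html><head><meta charset="windows-1252"></head><body>x</body></html>';
$httpEquiv     = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">';
$falsePositive = '<p>Remember to add charset=utf-8 to the header.</p>';
$noDeclaration = '<p>plain snippet</p>';

var_dump(looksLikeCharsetDeclared($metaCharset));   // bool(true)
var_dump(looksLikeCharsetDeclared($httpEquiv));     // bool(true)
var_dump(looksLikeCharsetDeclared($falsePositive)); // bool(true), although no meta tag is present
var_dump(looksLikeCharsetDeclared($noDeclaration)); // bool(false)
```

In the third case the snippet would be treated as declaring a charset, so under the rule described above its non-ASCII characters would be left unaltered. A check of this kind trades accuracy for simplicity, which is the trade-off accepted here.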

This is:

- a bugfix
- a new feature
- refactoring
- additional unit tests

Checklist:

- Changes are covered by unit tests
  - Changes are covered by existing unit tests
  - New unit tests have been added
- Code style is respected
- Commit message explains why the change is made (see https://github.com/erlang/otp/wiki/Writing-good-commit-messages)
- CHANGELOG.md contains a short summary of the change and a link to the pull request if applicable
- Documentation is updated as necessary


oleibman · 2024-05-07T00:09:31Z

Scrutinizer new issues are both false positives, and are now suppressed.

Continuing work started with PR PHPOffice#4019. Improve documentation within program by making explicit what types of values are allowed for variables described as "mixed". In order to avoid broken functionality, this is done mainly through doc-blocks. This will get us closer to Phpstan Level 9, but many changes will be needed before we can consider that. This change has more executable code changes than its predecessor. I will wait longer than normal before merging it to allow for additional testing.

oleibman added 2 commits May 6, 2024 16:43

Correct Wrong-Case Directory Name

b2befb4

oleibman mentioned this pull request May 6, 2024

#3995 check charset in HTML reader #4015

Closed

11 tasks

oleibman added 4 commits May 8, 2024 23:37

Make New Function Static

148165c

Merge branch 'master' into issue3995

4577f61

Update CHANGELOG.md

f8e31b3

Update reading-and-writing-to-file.md

bdcbc04

oleibman added this pull request to the merge queue May 10, 2024

Merged via the queue into PHPOffice:master with commit 791c8cf May 10, 2024
13 of 14 checks passed

oleibman deleted the issue3995 branch May 10, 2024 06:34

oleibman mentioned this pull request May 13, 2024

Better Definitions for Mixed Parameters and Values Part 2 of Many #4026

Merged

12 tasks

oleibman mentioned this pull request Jul 3, 2024

Loading UTF8 content is not possible with HTML reader class #866

Closed
