Loading UTF8 content is not possible with HTML reader class #866

T00by1 · 2019-01-30T15:36:50Z

This is:

- [x] a bug report
- [ ] a feature request
- [ ] **not** a usage question (ask them on https://stackoverflow.com/questions/tagged/phpspreadsheet or https://gitter.im/PHPOffice/PhpSpreadsheet)

What is the expected behavior?

UTF8 HTML content loaded into the spreadsheet.

What is the current behavior?

Failure to load because the validation of tags cannot be completed.
startsWithTag needs to check for BOM when looking for first character in document.

What are the steps to reproduce?

Use Simple example 46. Open template file and add BOM. Retry example.

Which versions of PhpSpreadsheet and PHP are affected?

"phpoffice/phpspreadsheet": "1.6.0"
"php": ">=7.1.20"

Suggested simple solution in startsWithTag :
Make function not static and add

if($this->inputEncoding == 'UTF-8') {
    if(substr($data, 0, 3) === chr(0xEF) . chr(0xBB) . chr(0xBF)) {
      $data = substr($data, 3);
     }
}

The text was updated successfully, but these errors were encountered:

MarkBaker · 2019-02-08T21:04:54Z

HTML markup shouldn't be identified as UTF-8 by a BOM, but by <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> in the header block of the markup.

There is actually a block of code in the HTML Reader that is intended for exactly that:

foreach ($attributeArray as $attributeName => $attributeValue) {
    switch ($attributeName) {
        case 'content':
            //    TODO
            //    Extract character set, so we can convert to UTF-8 if required
            break;
    }
}

though as you can see it's flagged as TODO

However, while this will be implemented in due course, testing for a BOM will not be, as that isn't valid for html markup

T00by1 · 2019-02-09T10:26:29Z

I would respectfully like to disagree.
https://www.w3.org/TR/html5/syntax.html states that HTML documents can have an optional BOM at the start.

stale · 2019-04-12T15:56:50Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
If this is still an issue for you, please try to help by debugging it further and sharing your results.
Thank you for your contributions.

Fix PHPOffice#3995. Fix PHPOffice#866. Fix PHPOffice#1681. Php DOM loadhtml defaults to character set ISO-8859-1, but our data is UTF-8. So Html Reader alters its html so that loadhtml will not misinterpret characters outside the ASCII range. This works for UTF-8, but breaks other charsets. However, loadhtml uses the correct non-default charset when charset is specified in a meta tag, or when the html starts with a BOM. So, it is sufficient for us to alter the non-ASCII characters only when (a) the data does not start with a BOM, and (b) there is no charset tag. This will allow us to use: - UTF-8 files or snippets without BOM, with or without charset - UTF-8 files with BOM (charset should not be specified and will be ignored if it is) - UTF-16 files with BOM (charset should not be specified and will be ignored if it is) - all charsets which are ASCII-compatible for 0x00-0x7f when the charset is declared. This applies to ASCII itself, many Windows and Mac charsets, all of ISO-8859, and most CJK and other-language-specific charsets. We cannot use: - UTF-16BE or UTF-16LE declared in a meta tag - UTF-32, with or without a BOM (browser recommendation is to not support UTF-32, and most browsers do not support it) - unknown (to loadhtml) or non-ASCII-compatible charsets (EBCDIC?) I will note that the way I detect the `charset` attribute is imperfect (e.g. might find it in text rather than a meta tag). I think we'd need to write a browser to get it perfect. Anyhow, it is about the same as XmlScanner's attempt to find the `encoding` attribute, and, if it's good enough there, it ought to be good enough here.

oleibman · 2024-07-03T23:55:52Z

Fixed by PR #4019.

stale bot added the stale label Apr 12, 2019

stale bot closed this as completed Apr 19, 2019

oleibman mentioned this issue May 6, 2024

Html Reader Non-UTF8 Charsets #4019

Merged

11 tasks

oleibman removed the stale label Jul 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Loading UTF8 content is not possible with HTML reader class #866

Loading UTF8 content is not possible with HTML reader class #866

T00by1 commented Jan 30, 2019

MarkBaker commented Feb 8, 2019 •

edited

Loading

T00by1 commented Feb 9, 2019 •

edited

Loading

stale bot commented Apr 12, 2019

oleibman commented Jul 3, 2024

Loading UTF8 content is not possible with HTML reader class #866

Loading UTF8 content is not possible with HTML reader class #866

Comments

T00by1 commented Jan 30, 2019

What is the expected behavior?

What is the current behavior?

What are the steps to reproduce?

Which versions of PhpSpreadsheet and PHP are affected?

MarkBaker commented Feb 8, 2019 • edited Loading

T00by1 commented Feb 9, 2019 • edited Loading

stale bot commented Apr 12, 2019

oleibman commented Jul 3, 2024

MarkBaker commented Feb 8, 2019 •

edited

Loading

T00by1 commented Feb 9, 2019 •

edited

Loading