Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loading UTF8 content is not possible with HTML reader class #866

Closed
T00by1 opened this issue Jan 30, 2019 · 4 comments · Fixed by #4019
Closed

Loading UTF8 content is not possible with HTML reader class #866

T00by1 opened this issue Jan 30, 2019 · 4 comments · Fixed by #4019

Comments

@T00by1
Copy link

T00by1 commented Jan 30, 2019

This is:

- [x] a bug report
- [ ] a feature request
- [ ] **not** a usage question (ask them on https://stackoverflow.com/questions/tagged/phpspreadsheet or https://gitter.im/PHPOffice/PhpSpreadsheet)

What is the expected behavior?

UTF8 HTML content loaded into the spreadsheet.

What is the current behavior?

Failure to load because the validation of tags cannot be completed.
startsWithTag needs to check for BOM when looking for first character in document.

What are the steps to reproduce?

Use Simple example 46. Open template file and add BOM. Retry example.

Which versions of PhpSpreadsheet and PHP are affected?

"phpoffice/phpspreadsheet": "1.6.0"
"php": ">=7.1.20"

Suggested simple solution in startsWithTag :
Make function not static and add

if($this->inputEncoding == 'UTF-8') {
    if(substr($data, 0, 3) === chr(0xEF) . chr(0xBB) . chr(0xBF)) {
      $data = substr($data, 3);
     }
}
@MarkBaker
Copy link
Member

MarkBaker commented Feb 8, 2019

HTML markup shouldn't be identified as UTF-8 by a BOM, but by <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> in the header block of the markup.

There is actually a block of code in the HTML Reader that is intended for exactly that:

foreach ($attributeArray as $attributeName => $attributeValue) {
    switch ($attributeName) {
        case 'content':
            //    TODO
            //    Extract character set, so we can convert to UTF-8 if required
            break;
    }
}

though as you can see it's flagged as TODO

However, while this will be implemented in due course, testing for a BOM will not be, as that isn't valid for html markup

@T00by1
Copy link
Author

T00by1 commented Feb 9, 2019

I would respectfully like to disagree.
https://www.w3.org/TR/html5/syntax.html states that HTML documents can have an optional BOM at the start.

@stale
Copy link

stale bot commented Apr 12, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
If this is still an issue for you, please try to help by debugging it further and sharing your results.
Thank you for your contributions.

@stale stale bot added the stale label Apr 12, 2019
@stale stale bot closed this as completed Apr 19, 2019
oleibman added a commit to oleibman/PhpSpreadsheet that referenced this issue May 6, 2024
Fix PHPOffice#3995. Fix PHPOffice#866. Fix PHPOffice#1681. Php DOM loadhtml defaults to character set ISO-8859-1, but our data is UTF-8. So Html Reader alters its html so that loadhtml will not misinterpret characters outside the ASCII range. This works for UTF-8, but breaks other charsets. However, loadhtml uses the correct non-default charset when charset is specified in a meta tag, or when the html starts with a BOM. So, it is sufficient for us to alter the non-ASCII characters only when (a) the data does not start with a BOM, and (b) there is no charset tag.

This will allow us to use:
- UTF-8 files or snippets without BOM, with or without charset
- UTF-8 files with BOM (charset should not be specified and will be ignored if it is)
- UTF-16 files with BOM (charset should not be specified and will be ignored if it is)
- all charsets which are ASCII-compatible for 0x00-0x7f when the charset is declared. This applies to ASCII itself, many Windows and Mac charsets, all of ISO-8859, and most CJK and other-language-specific charsets.

We cannot use:
- UTF-16BE or UTF-16LE declared in a meta tag
- UTF-32, with or without a BOM (browser recommendation is to not support UTF-32, and most browsers do not support it)
- unknown (to loadhtml) or non-ASCII-compatible charsets (EBCDIC?)

I will note that the way I detect the `charset` attribute is imperfect (e.g. might find it in text rather than a meta tag). I think we'd need to write a browser to get it perfect. Anyhow, it is about the same as XmlScanner's attempt to find the `encoding` attribute, and, if it's good enough there, it ought to be good enough here.
@oleibman
Copy link
Collaborator

oleibman commented Jul 3, 2024

Fixed by PR #4019.

@oleibman oleibman removed the stale label Jul 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

3 participants