Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recent change in #234 munges utf8mb4 bytes (like 🧮) #237

Closed
jstanden opened this issue Oct 5, 2022 · 7 comments · Fixed by #238
Closed

Recent change in #234 munges utf8mb4 bytes (like 🧮) #237

jstanden opened this issue Oct 5, 2022 · 7 comments · Fixed by #238

Comments

@jstanden
Copy link

jstanden commented Oct 5, 2022

This commit in #234 changing from mb_convert_encoding() to mb_encode_numericentity() introduces an issue that munges utf8mb4 bytes in a string.

This appears to affect any 4-byte character, like 🧮, but not fewer bytes like ✈️.

The output from $cssToInlineStyles->convert() ends up looking like ð§.

We had to pin back to 2.2.4 in Composer to resolve an issue with outgoing mail that runs through this function.

@jstanden
Copy link
Author

jstanden commented Oct 5, 2022

@Ugoku This test case should probably be added to the unit tests.

@mhujer
Copy link
Contributor

mhujer commented Oct 12, 2022

Here is a comparison of the functions' behaviour: https://3v4l.org/bD68C

@stof
Copy link
Collaborator

stof commented Oct 12, 2022

@alexdowad looks like you were the one deprecating the HTML-ENTITIES encoding in mbstring in php/php-src#7594. Could you provide the equivalent code for the old code (mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'))) ? I haven't found any migration documentation related to that documentation, and it looks like the existing replacement attempt is not equivalent.

@alexdowad
Copy link

alexdowad commented Oct 12, 2022

Hi, @stof. I've never heard of this project before and don't have any context about what you are doing here, but from the sample code which @mhujer provided, it looks like you want to convert codepoints above U+FFFF to HTML numeric entities. If that is the case, then you need to pass mb_encode_numericentity a convmap argument which tells it to do that. Example: https://3v4l.org/sRM3F

@alexdowad
Copy link

@stof Sorry, one more comment, just in case... in the above sample code, I told mb_encode_numericentity to encode all codepoints from U+0080 up to U+1FFFF, but actually, the highest possible legal value for a Unicode codepoint (although it's not currently allocated) is U+10FFFF. Therefore, you probably want to use a convmap of [0x80, 0x10FFFF, 0, 0x1FFFFF]. Please do not set the 4th "bitmask" value in the convmap to 0x10FFFF; it needs to be 0x1FFFFF.

You would do well to include test cases in your test suite for U+007E (ASCII tilda), U+007F (ASCII "delete character"), U+0080 (lowest codepoint which will be converted to HTML entity), U+10FFFF (highest codepoint which will be converted), and U+110000 (illegal). You would also do well to include test cases for various types of invalid UTF-8 strings: strings with over-long code units, strings with continuation bytes appearing outside of a multi-byte character, strings which are truncated so a multi-byte character doesn't have the required number of continuation bytes, etc...

All the best with your project!

@Ugoku
Copy link
Contributor

Ugoku commented Dec 20, 2022

@jstanden it seems #234 is indeed to blame. This line was how Symfony "fixed" it, but it breaks certain characters as you say. The fix from @alexdowad works for me, I will submit a PR with this fix and add some tests.

@alexdowad
Copy link

BTW, I am thinking of submitting a PHP RFC so we add another built-in function like mb_encode_numericentity, but with a better API. The convmap argument for mb_encode_numericentity is very confusing and I doubt that more than a handful of users actually need anything other than [0, 0x10FFFF, 0, 0x1FFFFF].

I have half a mind to reach out to Moriyoshi-san and try to find out what that convmap argument was originally intended to be used for.

Ugoku added a commit to Recras/CssToInlineStyles that referenced this issue Dec 20, 2022
Ugoku added a commit to Recras/CssToInlineStyles that referenced this issue Dec 21, 2022
@stof stof closed this as completed in #238 Jan 3, 2023
bytestream pushed a commit to bytestream/CssToInlineStyles that referenced this issue Oct 3, 2023
nicolas-grekas added a commit to symfony/symfony that referenced this issue Jan 22, 2024
This PR was merged into the 5.4 branch.

Discussion
----------

[Translation] fix multi-byte code area to convert

| Q             | A
| ------------- | ---
| Branch?       | 5.4
| Bug fix?      | yes
| New feature?  | no
| Deprecations? | no
| Issues        |
| License       | MIT

While debugging #47783 I stumbled upon tijsverkoyen/CssToInlineStyles#237 which made me realise that we suffer from the same issue in `PseudoLocalizationTranslator`. So I decided to apply the patch from tijsverkoyen/CssToInlineStyles#238 here.

Commits
-------

2eadd39 fix multi-byte code area to convert
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants