Recent change in #234 munges utf8mb4 bytes (like 🧮) #237

jstanden · 2022-10-05T00:26:26Z

The commit in #234 that changed mb_convert_encoding() to mb_encode_numericentity() introduced a bug that munges utf8mb4 (4-byte UTF-8) characters in a string.

This appears to affect any 4-byte character, like 🧮, but not characters encoded in fewer bytes, like ✈️.

The output from $cssToInlineStyles->convert() ends up looking like ð§.

We had to pin back to 2.2.4 in Composer to resolve an issue with outgoing mail that runs through this function.


jstanden · 2022-10-05T00:28:06Z

@Ugoku This test case should probably be added to the unit tests.

mhujer · 2022-10-12T13:01:46Z

Here is a comparison of the functions' behaviour: https://3v4l.org/bD68C

stof · 2022-10-12T13:24:53Z

@alexdowad looks like you were the one who deprecated the HTML-ENTITIES encoding in mbstring in php/php-src#7594. Could you provide the equivalent of the old code (mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'))? I haven't found any migration documentation related to that deprecation, and it looks like the existing replacement attempt is not equivalent.

alexdowad · 2022-10-12T14:21:52Z

Hi, @stof. I've never heard of this project before and don't have any context about what you are doing here, but from the sample code which @mhujer provided, it looks like you want to convert codepoints above U+FFFF to HTML numeric entities. If that is the case, then you need to pass mb_encode_numericentity a convmap argument which tells it to do that. Example: https://3v4l.org/sRM3F

alexdowad · 2022-10-13T00:00:15Z

@stof Sorry, one more comment, just in case... in the above sample code, I told mb_encode_numericentity to encode all codepoints from U+0080 up to U+1FFFF, but actually, the highest possible legal value for a Unicode codepoint (although it's not currently allocated) is U+10FFFF. Therefore, you probably want to use a convmap of [0x80, 0x10FFFF, 0, 0x1FFFFF]. Please do not set the 4th "bitmask" value in the convmap to 0x10FFFF; it needs to be 0x1FFFFF.
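For readers following along, here is a minimal sketch of the convmap suggested above. The helper name `encodeNonAscii` is hypothetical, and the narrower map used for comparison is only an assumption about how a range capped at U+FFFF behaves; it is not necessarily the exact map used in #234.

```php
<?php
// Hypothetical helper: converts every codepoint from U+0080 to U+10FFFF
// into a decimal HTML numeric entity, leaving ASCII untouched.
function encodeNonAscii(string $html): string
{
    // convmap layout: [first codepoint, last codepoint, offset, bitmask]
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

    return mb_encode_numericentity($html, $convmap, 'UTF-8');
}

$abacus = "\u{1F9EE}"; // 🧮, a 4-byte character in UTF-8

// Full range: the character becomes a numeric entity.
echo encodeNonAscii($abacus), "\n"; // &#129518;

// Assumed narrower range (stops at U+FFFF): the 4-byte character falls
// outside the map, so its raw bytes pass through unchanged.
$narrow = mb_encode_numericentity($abacus, [0x80, 0xFFFF, 0, 0xFFFF], 'UTF-8');
echo $narrow === $abacus ? "left as raw bytes\n" : "encoded\n";
```

If raw bytes like these later reach a consumer that does not treat the input as UTF-8, output similar to the mojibake reported at the top of this issue would be expected.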

You would do well to include test cases in your test suite for U+007E (ASCII tilde), U+007F (ASCII "delete character"), U+0080 (lowest codepoint which will be converted to HTML entity), U+10FFFF (highest codepoint which will be converted), and U+110000 (illegal). You would also do well to include test cases for various types of invalid UTF-8 strings: strings with over-long code units, strings with continuation bytes appearing outside of a multi-byte character, strings which are truncated so a multi-byte character doesn't have the required number of continuation bytes, etc.

All the best with your project!
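The boundary cases listed above could be sketched roughly as follows. This is a plain script rather than the project's actual test suite, and the function name `encodeNonAscii` is hypothetical. Because the output of mb_encode_numericentity for malformed input may differ between PHP versions, the sketch only checks that the invalid samples are detected by mb_check_encoding and makes no claim about how they are converted.

```php
<?php
// Hypothetical helper using the convmap recommended in this thread.
function encodeNonAscii(string $html): string
{
    return mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

// Valid boundary codepoints and the output expected for each.
$boundaries = [
    'U+007E tilde'    => ["\x7E", "\x7E"],               // ASCII, unchanged
    'U+007F delete'   => ["\x7F", "\x7F"],               // ASCII, unchanged
    'U+0080 lowest'   => ["\u{80}", '&#128;'],           // first converted codepoint
    'U+10FFFF highest' => ["\u{10FFFF}", '&#1114111;'],  // last converted codepoint
];

foreach ($boundaries as $label => [$input, $expected]) {
    $actual = encodeNonAscii($input);
    echo $label, ': ', $actual === $expected ? 'ok' : 'FAIL', "\n";
}

// Malformed UTF-8 samples; a caller may want to reject these up front.
$invalid = [
    'overlong encoding of "/"'  => "\xC0\xAF",
    'stray continuation byte'   => "\x80",
    'truncated 4-byte sequence' => "\xF0\x9F\xA7",
    'U+110000 (beyond Unicode)' => "\xF4\x90\x80\x80",
];

foreach ($invalid as $label => $bytes) {
    $isValid = mb_check_encoding($bytes, 'UTF-8');
    echo $label, ': ', $isValid ? 'unexpectedly valid' : 'rejected', "\n";
}
```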

Ugoku · 2022-12-20T09:44:11Z

@jstanden it seems #234 is indeed to blame. This line was how Symfony "fixed" it, but it breaks certain characters as you say. The fix from @alexdowad works for me, I will submit a PR with this fix and add some tests.

alexdowad · 2022-12-20T09:51:19Z

BTW, I am thinking of submitting a PHP RFC so we add another built-in function like mb_encode_numericentity, but with a better API. The convmap argument for mb_encode_numericentity is very confusing and I doubt that more than a handful of users actually need anything other than [0, 0x10FFFF, 0, 0x1FFFFF].

I have half a mind to reach out to Moriyoshi-san and try to find out what that convmap argument was originally intended to be used for.


This PR was merged into the 5.4 branch.

Discussion
----------

[Translation] fix multi-byte code area to convert

| Q             | A
| ------------- | ---
| Branch?       | 5.4
| Bug fix?      | yes
| New feature?  | no
| Deprecations? | no
| Issues        |
| License       | MIT

While debugging #47783 I stumbled upon tijsverkoyen/CssToInlineStyles#237 which made me realise that we suffer from the same issue in `PseudoLocalizationTranslator`. So I decided to apply the patch from tijsverkoyen/CssToInlineStyles#238 here.

Commits
-------

2eadd39 fix multi-byte code area to convert

Ugoku added a commit to Recras/CssToInlineStyles that referenced this issue Dec 20, 2022

Fix multibyte string conversion

f25b68b

Refs tijsverkoyen#237

Ugoku mentioned this issue Dec 20, 2022

Fix multibyte string conversion #238

Merged

Ugoku added a commit to Recras/CssToInlineStyles that referenced this issue Dec 21, 2022

Fix multibyte string conversion

2af5077

Refs tijsverkoyen#237

Ugoku mentioned this issue Jan 2, 2023

Еmoji icons break after processing by the convert() function #239

Closed

stof closed this as completed in #238 Jan 3, 2023

bytestream pushed a commit to bytestream/CssToInlineStyles that referenced this issue Oct 3, 2023

Fix multibyte string conversion

45378c5

Refs tijsverkoyen#237 (cherry picked from commit 2af5077)

stof mentioned this issue Jan 19, 2024

[Translation] fix multi-byte code area to convert symfony/symfony#53588

Merged
