Recent change in #234 munges utf8mb4 bytes (like 🧮) #237
Comments
@Ugoku This test case should probably be added to the unit tests.
Here is a comparison of the functions' behaviour: https://3v4l.org/bD68C
@alexdowad looks like you were the one deprecating the
Hi, @stof. I've never heard of this project before and don't have any context about what you are doing here, but from the sample code which @mhujer provided, it looks like you want to convert codepoints above U+FFFF to HTML numeric entities. If that is the case, then you need to pass
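The exact arguments suggested in the comment above are not preserved here. As a rough sketch only, assuming a conversion map of `[0x80, 0x10FFFF, 0, 0x1FFFFF]` (start, end, offset, mask) and a hypothetical helper named `encodeNonAscii()`, the difference between a map that stops at U+FFFF and one that covers the full range could look like this. It needs the mbstring extension.

```php
<?php
// Hypothetical helper, not part of CssToInlineStyles.
// The conversion map values are an assumption, not quoted from the thread.
function encodeNonAscii(string $html, int $lastCodepoint = 0x10FFFF): string
{
    // Conversion map layout: [start, end, offset, mask].
    // Codepoints in start..end are emitted as &#N; entities,
    // everything else is copied through unchanged.
    $convmap = [0x80, $lastCodepoint, 0, 0x1FFFFF];

    return mb_encode_numericentity($html, $convmap, 'UTF-8');
}

$abacus = "\u{1F9EE}"; // 🧮, encoded as 4 bytes in UTF-8

// Map covering U+0080..U+10FFFF: both characters become entities.
echo encodeNonAscii("caf\u{E9} " . $abacus), "\n";         // caf&#233; &#129518;

// Map stopping at U+FFFF: the 4-byte character is left as raw bytes.
echo encodeNonAscii("caf\u{E9} " . $abacus, 0xFFFF), "\n"; // caf&#233; 🧮
```

With the narrower map, the 4-byte character passes through as raw UTF-8 instead of becoming an entity, which is consistent with the behaviour reported in this issue.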
@stof Sorry, one more comment, just in case... in the above sample code, I told You would do well to include test cases in your test suite for U+007E (ASCII tilde), U+007F (ASCII "delete character"), U+0080 (lowest codepoint which will be converted to HTML entity), U+10FFFF (highest codepoint which will be converted), and U+110000 (illegal). You would also do well to include test cases for various types of invalid UTF-8 strings: strings with over-long code units, strings with continuation bytes appearing outside of a multi-byte character, strings which are truncated so a multi-byte character doesn't have the required number of continuation bytes, etc... All the best with your project!
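The boundary cases listed above could be collected roughly as follows. This is only a sketch: the conversion map `[0x80, 0x10FFFF, 0, 0x1FFFFF]` and the helper name `entityFor()` are assumptions, and a real test suite would put these values into its own PHPUnit data provider. The invalid strings are only checked with `mb_check_encoding()` here, since how the library should treat them is for the project to decide.

```php
<?php
// Hypothetical helper: encode a single codepoint with an assumed conversion map.
function entityFor(int $codepoint): string
{
    $convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF]; // assumption, see lead-in
    return mb_encode_numericentity(mb_chr($codepoint, 'UTF-8'), $convmap, 'UTF-8');
}

// Boundary codepoints named in the comment, with the expected result.
$boundaryCases = [
    0x7E     => '~',          // ASCII tilde, below the map, unchanged
    0x7F     => "\x7F",       // ASCII delete character, unchanged
    0x80     => '&#128;',     // lowest codepoint converted
    0x10FFFF => '&#1114111;', // highest codepoint converted
];

foreach ($boundaryCases as $codepoint => $expected) {
    $ok = entityFor($codepoint) === $expected;
    printf("U+%04X %s\n", $codepoint, $ok ? 'ok' : 'MISMATCH');
}

// U+110000 is outside Unicode, so mb_chr() cannot build it and returns false.
var_dump(mb_chr(0x110000, 'UTF-8') === false);

// Samples of invalid UTF-8 that a test suite could feed to the converter.
$invalidSamples = [
    'over-long encoding'      => "\xC0\xAF",
    'stray continuation byte' => "\x80",
    'truncated 4-byte char'   => "\xF0\x9F\xA7",
];

foreach ($invalidSamples as $label => $bytes) {
    // Each sample is rejected as UTF-8 by mb_check_encoding().
    printf("%s valid: %s\n", $label, var_export(mb_check_encoding($bytes, 'UTF-8'), true));
}
```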
@jstanden it seems #234 is indeed to blame. This line was how Symfony "fixed" it, but it breaks certain characters as you say. The fix from @alexdowad works for me, I will submit a PR with this fix and add some tests.
BTW, I am thinking of submitting a PHP RFC so we add another built-in function like I have half a mind to reach out to Moriyoshi-san and try to find out what that
Refs tijsverkoyen#237 (cherry picked from commit 2af5077)
This PR was merged into the 5.4 branch.

Discussion
----------

[Translation] fix multi-byte code area to convert

| Q             | A
| ------------- | ---
| Branch?       | 5.4
| Bug fix?      | yes
| New feature?  | no
| Deprecations? | no
| Issues        |
| License       | MIT

While debugging #47783 I stumbled upon tijsverkoyen/CssToInlineStyles#237 which made me realise that we suffer from the same issue in `PseudoLocalizationTranslator`. So I decided to apply the patch from tijsverkoyen/CssToInlineStyles#238 here.

Commits
-------

2eadd39 fix multi-byte code area to convert
This commit in #234 changing from `mb_convert_encoding()` to `mb_encode_numericentity()` introduces an issue that munges utf8mb4 bytes in a string.

This appears to affect any 4-byte character, like 🧮, but not characters with fewer bytes, like ✈️.

The output from `$cssToInlineStyles->convert()` ends up looking like `ð§`.

We had to pin back to 2.2.4 in Composer to resolve an issue with outgoing mail that runs through this function.
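One possible explanation for output like `ð§` (an assumption, the thread does not confirm it) is that the unconverted UTF-8 bytes are later read as a single-byte encoding such as ISO-8859-1. The sketch below only illustrates that effect in isolation. The helper name `misreadAsLatin1()` is hypothetical, and the code needs the mbstring extension.

```php
<?php
// Hypothetical helper: treat each raw byte as one ISO-8859-1 character
// and re-encode the result as UTF-8, which is how mojibake typically arises.
function misreadAsLatin1(string $bytes): string
{
    return mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
}

$abacus   = "\u{1F9EE}";        // 🧮, one codepoint above U+FFFF
$airplane = "\u{2708}\u{FE0F}"; // ✈️, two codepoints, both below U+FFFF

echo strlen($abacus), "\n";   // 4 (one 4-byte sequence)
echo strlen($airplane), "\n"; // 6 (two 3-byte sequences)

// The four bytes F0 9F A7 AE become four separate characters:
// ð (U+00F0), a control character (U+009F), § (U+00A7) and ® (U+00AE).
$garbled = misreadAsLatin1($abacus);
echo mb_strlen($garbled, 'UTF-8'), "\n"; // 4
echo bin2hex($garbled), "\n";            // c3b0c29fc2a7c2ae
```

The visible `ð` and `§` match the garbled output quoted in the report, which would fit 4-byte characters being skipped by the entity conversion while 3-byte ones are converted.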