Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[5.5] Add multibyte functions where needed in Support/Str #21207

Merged
merged 2 commits into from
Sep 16, 2017
Merged

[5.5] Add multibyte functions where needed in Support/Str #21207

merged 2 commits into from
Sep 16, 2017

Conversation

jakebathman
Copy link
Contributor

@jakebathman jakebathman commented Sep 15, 2017

This PR updates string support functions to use multibyte-safe versions of substr and strlen.

Additional test assertions are also added to validate multibyte string operations.

The four methods revised are:

  • Str::endsWith
  • Str::replaceFirst
  • Str::replaceLast
  • Str::startsWith

Let me know if there are any questions. Thanks!

@jakebathman jakebathman changed the title Add multibyte functions and new test assertions Add multibyte functions where needed in Support/Str Sep 15, 2017
@jakebathman jakebathman changed the title Add multibyte functions where needed in Support/Str [5.5] Add multibyte functions where needed in Support/Str Sep 15, 2017
@taylorotwell taylorotwell merged commit 7acc2e7 into laravel:5.5 Sep 16, 2017
@sisve
Copy link
Contributor

sisve commented Sep 16, 2017

This is a breaking change. https://3v4l.org/3fMUV

@jakebathman
Copy link
Contributor Author

Multibyte strings weren’t being correctly handled by these methods before this change, so I look at it like a bug fix rather than breaking change.

@vlakoff
Copy link
Contributor

vlakoff commented Sep 16, 2017

Thank you, especially for the tests that demonstrate the usefulness of the multibyte functions.

@jakebathman jakebathman deleted the add-string-funciton-multibyte-support branch September 17, 2017 01:50
@sisve
Copy link
Contributor

sisve commented Sep 17, 2017

These new tests worked with the old implementation too. They do not provide anything in the ways of regression testing or avoid anyone going back to the old strlen/substr calls.

@vlakoff
Copy link
Contributor

vlakoff commented Sep 17, 2017

Ah, worth having a deeper look at it, then.

Iff it's not possible to make the single-byte functions to fail, the use of the multi-byte functions is not justified nor advisable.

@jakebathman
Copy link
Contributor Author

jakebathman commented Sep 17, 2017

Hmm, yeah. I had written the test case strings to prove negative against the underlying strlen, substr, strpos, and strrpos methods, but it seems like the existing single-byte implementations work (by accident, since they're using byte counts instead of character counts).

If no one can make this fail with the current implementation (I can't find a failing example string), then we can hold this off until 5.6 if @taylorotwell wants.

Thanks for the awesome help @sisve, et al., you all continue making the framework amazing!

@vlakoff
Copy link
Contributor

vlakoff commented Sep 17, 2017

If no one can figure out a way to make the single-byte functions to fail (with proper strings, not with crafted bytes as in sisve's example), then this PR should be reverted, as both code changes and added tests are technically useless, and may give a false impression of added robustness.

I'm not an expert at UTF-8, but apparently it just cannot overlap:

@jakebathman
Copy link
Contributor Author

Agreed.

I’ll start a thread in internals, this was only my first class to audit for multibyte support on these kinds of string functions, so I’m going to be looking more closely at how they’re used and when they might break across the framework to help shore up any weaknesses with respect to multibyte strings.

Thanks all for the help.

@sisve
Copy link
Contributor

sisve commented Sep 17, 2017

Fun fact (I have to make knowing these things sound fun since I spend way to much time on it ...); the native functions (strlen) count bytes and the multibyte functions (mb_strlen) counts code points, assuming a unicode encoding, while people usually describe what the grapheme functions (grapheme_strlen) does. Things like "How many characters is in $input?"

Well, still assuming unicode; there are many ways to write the same character. A simple answer is that it could be "the character Ö", or it could be "the character O followed with a character two-dots-on-top-of-the-previous-character". These things are easiest handled by a normalization middleware that transforms all your incoming strings into the way you expect them to be.

The character "Ö", in normalization mode NFD, becomes the unicode characters 0x4F and 0x308. That is "LATIN CAPITAL LETTER O" followed by "COMBINING DIAERESIS". The utf-8 representation of this becomes \x4F\xCC\x88...

$input = "\x4F\xCC\x88";

var_dump($input);
var_dump(strlen($input));
var_dump(mb_strlen($input));
var_dump(grapheme_strlen($input));

Output:

string(3) "Ö"
int(3)
int(2)
int(1)

Thus shown that the mb_strlen counts code points, not characters. And grapheme_strlen is counting graphemes, which is closer to counting characters than counting code points, but not really the same thing, and sometime around here the Unicode Monster comes for you.

Also, that COMBINING DIAERESIS can be copy/pasted. Perfect if you're b̈ö̈r̈̈̈ë̈̈̈d̈̈̈̈.

@vlakoff
Copy link
Contributor

vlakoff commented Oct 17, 2017

Any news on this? Does everyone agree this PR can and should be reverted? It would be great to not slow down these Str methods purposelessly.

@tillkruss
Copy link
Contributor

Was this not revererted in another PR?

@jakebathman
Copy link
Contributor Author

I don’t think it was ever reverted, but #21722 is doing that now.

taylorotwell pushed a commit that referenced this pull request Oct 18, 2017
This reverts #21207.
Parsing with the multibyte functions is actually useless.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants