[5.5] Add multibyte functions where needed in Support/Str #21207

jakebathman · 2017-09-15T20:52:00Z

This PR updates string support functions to use multibyte-safe versions of substr and strlen.

Additional test assertions are also added to validate multibyte string operations.

The four methods revised are:

Str::endsWith
Str::replaceFirst
Str::replaceLast
Str::startsWith

Let me know if there are any questions. Thanks!

sisve · 2017-09-16T19:50:29Z

This is a breaking change. https://3v4l.org/3fMUV

jakebathman · 2017-09-16T23:17:39Z

Multibyte strings weren’t being correctly handled by these methods before this change, so I look at it like a bug fix rather than breaking change.

vlakoff · 2017-09-16T23:38:26Z

Thank you, especially for the tests that demonstrate the usefulness of the multibyte functions.

sisve · 2017-09-17T05:34:53Z

These new tests worked with the old implementation too. They do not provide anything in the ways of regression testing or avoid anyone going back to the old strlen/substr calls.

vlakoff · 2017-09-17T06:13:00Z

Ah, worth having a deeper look at it, then.

Iff it's not possible to make the single-byte functions to fail, the use of the multi-byte functions is not justified nor advisable.

jakebathman · 2017-09-17T12:43:52Z

Hmm, yeah. I had written the test case strings to prove negative against the underlying strlen, substr, strpos, and strrpos methods, but it seems like the existing single-byte implementations work (by accident, since they're using byte counts instead of character counts).

If no one can make this fail with the current implementation (I can't find a failing example string), then we can hold this off until 5.6 if @taylorotwell wants.

Thanks for the awesome help @sisve, et al., you all continue making the framework amazing!

vlakoff · 2017-09-17T15:42:45Z

If no one can figure out a way to make the single-byte functions to fail (with proper strings, not with crafted bytes as in sisve's example), then this PR should be reverted, as both code changes and added tests are technically useless, and may give a false impression of added robustness.

I'm not an expert at UTF-8, but apparently it just cannot overlap:

jakebathman · 2017-09-17T16:11:04Z

Agreed.

I’ll start a thread in internals, this was only my first class to audit for multibyte support on these kinds of string functions, so I’m going to be looking more closely at how they’re used and when they might break across the framework to help shore up any weaknesses with respect to multibyte strings.

Thanks all for the help.

sisve · 2017-09-17T18:27:11Z

Fun fact (I have to make knowing these things sound fun since I spend way to much time on it ...); the native functions (strlen) count bytes and the multibyte functions (mb_strlen) counts code points, assuming a unicode encoding, while people usually describe what the grapheme functions (grapheme_strlen) does. Things like "How many characters is in $input?"

Well, still assuming unicode; there are many ways to write the same character. A simple answer is that it could be "the character Ö", or it could be "the character O followed with a character two-dots-on-top-of-the-previous-character". These things are easiest handled by a normalization middleware that transforms all your incoming strings into the way you expect them to be.

The character "Ö", in normalization mode NFD, becomes the unicode characters 0x4F and 0x308. That is "LATIN CAPITAL LETTER O" followed by "COMBINING DIAERESIS". The utf-8 representation of this becomes \x4F\xCC\x88...

$input = "\x4F\xCC\x88";

var_dump($input);
var_dump(strlen($input));
var_dump(mb_strlen($input));
var_dump(grapheme_strlen($input));

Output:

string(3) "Ö"
int(3)
int(2)
int(1)

Thus shown that the mb_strlen counts code points, not characters. And grapheme_strlen is counting graphemes, which is closer to counting characters than counting code points, but not really the same thing, and sometime around here the Unicode Monster comes for you.

Also, that COMBINING DIAERESIS can be copy/pasted. Perfect if you're b̈ö̈r̈̈̈ë̈̈̈d̈̈̈̈.

vlakoff · 2017-10-17T22:22:17Z

Any news on this? Does everyone agree this PR can and should be reverted? It would be great to not slow down these Str methods purposelessly.

tillkruss · 2017-10-17T22:40:01Z

Was this not revererted in another PR?

jakebathman · 2017-10-18T00:11:00Z

I don’t think it was ever reverted, but #21722 is doing that now.

This reverts #21207. Parsing with the multibyte functions is actually useless.

jakebathman added 2 commits September 15, 2017 15:49

Add multibyte functions and new test assertions

eb6b6f4

StyleCI fixes

7ae34fc

jakebathman changed the title ~~Add multibyte functions and new test assertions~~ Add multibyte functions where needed in Support/Str Sep 15, 2017

jakebathman changed the title ~~Add multibyte functions where needed in Support/Str~~ [5.5] Add multibyte functions where needed in Support/Str Sep 15, 2017

taylorotwell merged commit 7acc2e7 into laravel:5.5 Sep 16, 2017

jakebathman deleted the add-string-funciton-multibyte-support branch September 17, 2017 01:50

vlakoff mentioned this pull request Oct 18, 2017

[5.5] Revert "Add multibyte functions where needed in Support/Str" #21722

Merged

taylorotwell pushed a commit that referenced this pull request Oct 18, 2017

Revert "Add multibyte functions where needed in Support/Str" (#21722)

01837fd

This reverts #21207. Parsing with the multibyte functions is actually useless.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[5.5] Add multibyte functions where needed in Support/Str #21207

[5.5] Add multibyte functions where needed in Support/Str #21207

jakebathman commented Sep 15, 2017 •

edited

Loading

sisve commented Sep 16, 2017

jakebathman commented Sep 16, 2017

vlakoff commented Sep 16, 2017

sisve commented Sep 17, 2017 •

edited

Loading

vlakoff commented Sep 17, 2017

jakebathman commented Sep 17, 2017 •

edited

Loading

vlakoff commented Sep 17, 2017

jakebathman commented Sep 17, 2017

sisve commented Sep 17, 2017

vlakoff commented Oct 17, 2017 •

edited

Loading

tillkruss commented Oct 17, 2017

jakebathman commented Oct 18, 2017

[5.5] Add multibyte functions where needed in Support/Str #21207

[5.5] Add multibyte functions where needed in Support/Str #21207

Conversation

jakebathman commented Sep 15, 2017 • edited Loading

sisve commented Sep 16, 2017

jakebathman commented Sep 16, 2017

vlakoff commented Sep 16, 2017

sisve commented Sep 17, 2017 • edited Loading

vlakoff commented Sep 17, 2017

jakebathman commented Sep 17, 2017 • edited Loading

vlakoff commented Sep 17, 2017

jakebathman commented Sep 17, 2017

sisve commented Sep 17, 2017

vlakoff commented Oct 17, 2017 • edited Loading

tillkruss commented Oct 17, 2017

jakebathman commented Oct 18, 2017

jakebathman commented Sep 15, 2017 •

edited

Loading

sisve commented Sep 17, 2017 •

edited

Loading

jakebathman commented Sep 17, 2017 •

edited

Loading

vlakoff commented Oct 17, 2017 •

edited

Loading