Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

UTF-8 combining characters and normalization in reverse() #6165

Closed
andrioni opened this issue Mar 14, 2014 · 42 comments
Closed

UTF-8 combining characters and normalization in reverse() #6165

andrioni opened this issue Mar 14, 2014 · 42 comments
Labels
breaking This change will break code needs decision A decision on this change is needed unicode Related to unicode characters and encodings

Comments

@andrioni
Copy link
Member

julia> "noe\u0308l"
"noël"

julia> reverse("noe\u0308l")
"l̈eon"

julia> length("noe\u0308l")
5

julia> uppercase("baffle")
"BAfflE"

(it appears that Github also doesn't like them, examples taken from here)

Backspace support is also broken on the REPL: if you press backspace twice from "noël", it shows you "no", but actually represents "noe".

What should be our approach in this case? NFKC does solve both issues above, but I'm not sure how good it is to normalize every string before operating on them:

julia> length(normalize_string("noe\u0308l", :NFKC))
4

julia> reverse(normalize_string("noe\u0308l", :NFKC))
"lëon"

julia> uppercase(normalize_string("baffle", :NFKC))
"BAFFLE"

(related to #5434, #5903)

@jiahao jiahao added the unicode label Mar 14, 2014
@jiahao
Copy link
Member

jiahao commented Mar 14, 2014

I think we decided not to normalize strings by default. Strings are user data.

@andrioni
Copy link
Member Author

Yeah, but the problem is how should be string operations like length and reverse be defined: on codepoints (as AFAIK it currently is) or on characters. If the latter, they probably should normalize their input.

@JeffBezanson
Copy link
Member

I know reversing codepoints is not what you always want, but at least it's a primitive. Having only a NFKC-normalize-and-reverse function can't be right.

@StefanKarpinski
Copy link
Member

I suspect that doing NFC normalization during reversal is the right default. There can be a normalize::Bool=true option so if you really want to just reverse the codepoints, you can do reverse(str, normalize=false). You want this as an option to the reverse function since otherwise you end up needing normalize and then reverse, creating an unnecessary intermediate copy of the string. Normalizing to NFKC rather than NFC, strikes me as bad idea even though it happens to work for the ligature since it will also change things like 𝕞 to m, which no one would expect the reverse function to do. The ligature example does show that there are cases where you might want to do this, of course. Perhaps the normalize keyword should take a symbol that can be one of :none, :nfc (default), or :nfkc.

@nalimilan
Copy link
Member

+1

@StefanKarpinski
Copy link
Member

We might want UnicodeNormalization objects with NFC <: UnicodeNormalization, etc. That way we could pass the normalization as an argument and dispatch on it, which could be quite handy.

@stevengj
Copy link
Member

What is the use case for reversing strings? The intended application should determine functionality here.

If you want to reverse graphemes, utf8proc can identify the graphemes for you. (Normalization won't necessarily eliminate all combining characters.)

Note that we already have Unicode case folding.

@StefanKarpinski
Copy link
Member

That's the problem – we don't know, since we're writing a standard library, not an application. You're right that grapheme reversal is probably the better way to do this. It would be good to come up with a reverse function that generally does what you want. It should generally operate at the level of graphemes, but cases like ligatures would, ideally, be split before reversing. Ligature splitting would imply that reverse is not its own inverse, which sucks, but I feel like that's unavoidable with Unicode.

@pao
Copy link
Member

pao commented Mar 15, 2014

Silly question: if the xs == reverse(reverse(xs)) invariant is broken, are we dealing with a function that is semantically different than reverse()? Should we be giving this operation a different name?

@StefanKarpinski
Copy link
Member

Very possible. Maybe reverse should just throw an error if it can't reverse a string in such a way that that identity holds, suggesting that you use a different function.

@stevengj
Copy link
Member

@StefanKarpinski, even library writers should have some kind of practical application in mind. Otherwise, why include a reverse(::String) function at all?

@StefanKarpinski
Copy link
Member

Largely because every language standard library seems to have one. I'm not entirely sure what it's useful for except for cute examples with sorting dictionary words. If someone has real world examples of using the reverse function on strings, I'm all ears.

@JeffBezanson
Copy link
Member

I agree it's not so important. One reason to have it is that reversing UTF-8 code points efficiently is difficult, so on the off chance you need this it's good to have built in. I prefer not to guess what people want, and instead do something primitive so that you can understand what it does and build what you want on top.

@stevengj
Copy link
Member

See e.g. this discussion on supposed uses for string reversal. It is telling that all of the answers seem to be bogus (there always seems to be a much better way to solve the problem that string reversal supposedly solves). My favorite answer is:

Reversing a string (in place or not) is a very common interview question for basic programming knowledge. A language lacking these built in functions would be difficult to interview for. The candidate would actually have to know something.

I vote that we just remove reverse(::String) function from Base unless a real use can be found. That's the normal criterion for including a function, after all, and it is impossible to sensibly design this function unless we know what it is for.

@nalimilan
Copy link
Member

This function is probably most useful as a showcase of how smart Julia is when it comes to handling Unicode strings. Not sure that justifies its existence... ;-)

@stevengj
Copy link
Member

I'm not sure whether including a useless function shows off how smart we are...

@JeffBezanson
Copy link
Member

It's possible that RevString is more useful than reverse; it's used in a few places to operate on the ends of strings, since otherwise iteration only moves forwards. Perhaps we should keep that and remove reverse(::String).

@StefanKarpinski
Copy link
Member

We could also have a reverse iteration protocol, which strings already kind of have with prevind.

@JeffBezanson
Copy link
Member

Adding that (maybe endof which we already have, prev, and isstart) and getting rid of both RevString and reverse(String) would be fine by me.

@JeffBezanson
Copy link
Member

So shall we deprecate reverse(::String)?

@carlobaldassi
Copy link
Member

I actually just used it (related: #6276).

@StefanKarpinski
Copy link
Member

Yeah, I really don't think that removing the reverse function is reasonable.

@nalimilan
Copy link
Member

@carlobaldassi Wouldn't it be better to use reverse search rather than reverting the whole string just to match a few characters?

@StefanKarpinski
Copy link
Member

It would but reverse string search has just as many difficulties as reversing a string does if you think about it. And PCRE doesn't support it.

On Apr 4, 2014, at 5:01 PM, Milan Bouchet-Valat [email protected] wrote:

@carlobaldassi Wouldn't it be better to use reverse search rather than reverting the whole string just to match a few characters?


Reply to this email directly or view it on GitHub.

@nalimilan
Copy link
Member

Well, if I want to find the last occurrence of noël in a string, I don't need to reverse both the string and noël; this is just like standard regexp matching, but starting from the end. It looks like the code @carlobaldassi linked to just does that.

@carlobaldassi
Copy link
Member

It could almost be reworked fairly easily to use rsearch with strings, but more verbosely. I needed to match either of two strings, so I would need 2 searches, and then compare the ranges (with if/else code to account for all cases). Plus, replicating the word-boundary token \b without regexes would be non-trivial.
(That said, that code is not super-critical.)

In any case, I was so annoyed by the experience that I started writing a ReverseRegex type. I have a reasonably complete working prototype now (warning: ugly code, and likely to have horrible performance).

@carlobaldassi
Copy link
Member

this is just like standard regexp matching, but starting from the end

@nalimilan see also #6276

@StefanKarpinski
Copy link
Member

Well, if I want to find the last occurrence of noël in a string, I don't need to reverse both the string and noël; this is just like standard regexp matching, but starting from the end.

Are you looking for \u0308 then e or the reverse? If you reverse both the regex and the string, sure it's pretty straightforward, but if you reverse only one, then you still have to answer all the questions.

@nalimilan
Copy link
Member

@carlobaldassi Yeah, but your solution isn't perfect either (is gnisu a word? :-), and it's reliable only because you know you deal with ASCII characters. If you were looking for a user-provided string, then the Unicode normalization issue would make it quite fragile. Another solution, since you're dealing with short strings anyway, is to get all the matches and take the last one (which was suggested in #6276). Anyway, my point is just that reverse is not super useful for pattern matching, or at least not really necessary (and potentially dangerous).

@StefanKarpinski The point is, you don't need to reverse any of them to find the last occurrence of noël.

@carlobaldassi
Copy link
Member

If you were looking for a user-provided string, then the Unicode normalization issue would make it quite fragile

I think that @StefanKarpinski's proposal of an argument to reverse would mostly solve this issue.

Another solution, since you're dealing with short strings anyway, is to get all the matches and take the last one

Indeed in this case it would, it'd just be more verbose (not a particularly compelling argument for keeping reverse in Base, I'll concede). But in general, besides performance considerations, reverse-searching with a regex is not exactly equivalent to doing this if it involves quantifiers (e.g. *, *? etc.). In order to express those, you really need to start matching from the end of the string, and match regex atoms from the end of the regex as you go. In other words, reverse both.

@stevengj
Copy link
Member

stevengj commented Dec 7, 2014

With #9261, join(reverse(collect(graphemes("noe\u0308l")))) == "lëon" will reverse the graphemes, doing the right thing with all combining characters regardless of normalization, if that is what you want.

(The collect step could be eliminated by implementing a reverse iterator, but I don't know if anyone actually cares about this functionality enough to make it worthwhile.)

@StefanKarpinski
Copy link
Member

Perhaps grapheme reversal is the correct default behavior.

@stevengj
Copy link
Member

stevengj commented Dec 7, 2014

If the application of reverse(s::String) is mostly for reversed regex searching (as seems to be the case), it would be better to leave as-is.

@ivarne
Copy link
Member

ivarne commented Dec 7, 2014

Since we have two possible behaviors, we should document which we have picked.

@nalimilan
Copy link
Member

@stevengj You scared me! join(reverse(collect(graphemes("noe\u0308l")))) == "lëon", not "leön". :-)

This indeed sounds like a very cool feature, and I agree with @StefanKarpinski that it should be the default. I think it would be more intuitive, as it is based on what the standard reference you quoted (http://www.unicode.org/reports/tr29/) calls "user-perceived characters". OTC the current behavior only makes sense when using regexps and it requires a deep understanding of Unicode.

I'd actually argue that iterating over strings should also go over graphemes, and that calling reverse on a string should continue to be equivalent to calling reverse on the array obtained via a string iterator. The current situation is quite confusing since it varies depending on how characters are formed:

# ë as a single codepoint, OK
julia> [c for c in "noël"]
4-element Array{Char,1}:
 'n'
 'o'
 'ë'
 'l'

julia> join(reverse([c for c in "noël"]))
"lëon"

julia> reverse("noël")
"lëon"

# ë as two codepoints, weird
julia> [c for c in "noe\u0308l"]
5-element Array{Char,1}:
 'n'
 'o'
 'e'
 '̈'
 'l'

# These really don't make any sense to 80% of users IMHO
julia> join(reverse([c for c in "noe\u0308l"]))
"l̈eon"

julia> reverse("noe\u0308l")
"l̈eon"

If the application of reverse(s::String) is mostly for reversed regex searching (as seems to be the case), it would be better to leave as-is.

I'd rather offer a function to to reverse regex matching (as I suggested at #9249 (comment)), and make reverse act on graphemes.

@stevengj
Copy link
Member

stevengj commented Dec 7, 2014

@nalimilan, making iteration go over graphemes by default would be a huge change because graphemes are substrings, not codepoints (Chars). I wouldn't be in favor of this.

I continue to think that the behavior of any function, reverse included, should be dictated by how it is used. So far, the only practical use for reverse has been for reverse regex searches (and possibly reversed string processing with other external C libraries). And if you are going to regex search for a unicode substring with combining characters, even if you are searching forward, you have to understand combining characters and normalizations.

A reverse-regex function should be discussed in #6276.

My preference would be to simply document this in reverse: say that it reverses codepoints, and is mainly useful for reverse string processing (e.g. regex), but mention the grapheme reversal as an alternative.

@nalimilan
Copy link
Member

@nalimilan, making iteration go over graphemes by default would be a huge change because graphemes are substrings, not characters. I wouldn't be in favor of this.

Ah, right. Unicode is so complex. Maybe we just need some amount of normalization when iterating and reversing, so that they give the same result for both "noe\u0308l" and "noël" . Keeping iteration and reversal in sync looks like a good idea to me, an currently they fail the "noël" test.

I don't really like the idea that reverse is only useful for regex. If that's the case, then let's just provide a function for that, but let's not have a reverse function which looks like it's generally useful while it isn't.

@stevengj
Copy link
Member

stevengj commented Dec 7, 2014

Problems with just exposing a function for rsearch with regex:

  • The UI would be confusing because the caller must manually reverse the regex — we can't automate this without implementing our own regex processing.
  • If we ever do implement pure-Julia regex processing, it will be able to implement rsearch without a reversed regex and we will be forced to make a breaking change.
  • There are many regex functions in Julia, not just search: you would need reversed versions of ismatch, eachmatch, etcetera.
  • There might be other external C string-processing libraries where you want to process in reverse, and hence you would still want reverse.

Normalizing on iteration (and reversal) would (a) greatly slow down those functions, (b) involve more memory allocation, and (c) wouldn't fix the problem anyway because NFC eliminates some but not all combining characters.

@PallHaraldsson
Copy link
Contributor

@stevengj: "I vote that we just remove reverse(::String) function from Base". I may not have thought this through, but doesn't reverse imply indexing must be well defined (and it isn't - see #9297 (comment))

If/when indexing is gone and you can reintroduce it explicitly, say you want codepoint-indexing or whatever (byte, grapheme), does reverse get trivial then?

@PallHaraldsson
Copy link
Contributor

Add breaking label? Appropriate if changing from codepoint-meaning (or just dropping function..). Seems I just do not have the right.

@ivarne ivarne added the breaking This change will break code label Jul 23, 2015
@stevengj
Copy link
Member

I think this issue can be closed.

Per the above discussion, it seems to have become clear that we need a reverse function for only one reason: efficient reversed-order string processing with external libraries, especially PCRE. And for this purpose, you want to reverse codepoints, not graphemes.

No one has come up with any practical use for reversing graphemes. This is possible in Julia as noted above, via the graphemes iterator, but with no real application it doesn't make sense to make it the default or even to provide a special function in Base.

@ScottPJones
Copy link
Contributor

👏 Good to see this closed!

stevengj added a commit that referenced this issue Sep 20, 2017
In looking back at #6165 (due to #23612), I realized that we never clearly documented the behavior or rationale.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking This change will break code needs decision A decision on this change is needed unicode Related to unicode characters and encodings
Projects
None yet
Development

No branches or pull requests