Unicode grapheme support #7043

huonw · 2013-06-10T08:45:58Z

Currently Rust doesn't support unicode properly, e.g. there is no way to iterate over a string by grapheme (there is .iter() for codepoints, and .bytes_iter() for bytes).

Possibly useful: http://useless-factor.blogspot.de/2007/08/unicode-implementers-guide-part-4.html


mleise · 2013-06-10T09:09:25Z

To force people new to Unicode to understand what they iterate over, the names could be chosen so that none of them looks more common than the others. Grapheme clusters are slow to decode, but they are what you typically want when you need to limit text to the first n characters. Even when storing text in a fixed-size database field, one should be aware that truncating by code points may strip the accents off the last letter. Maybe a scheme like this would ensure the right decision is made in user code:

bytes_iter();
cp_iter();
graph_iter();
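
To illustrate the truncation hazard described above, here is a minimal sketch using the `chars()` method of today's standard library (not the iterator names proposed in this thread). The helper name `truncate_code_points` is hypothetical and only serves the example.

```rust
/// Keep only the first `n` code points of `s`.
/// Note: this counts code points, NOT grapheme clusters, so it can
/// cut a base letter off from the combining marks that follow it.
/// (`truncate_code_points` is a hypothetical helper for illustration.)
fn truncate_code_points(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

/// "a", then "y" followed by U+0303 COMBINING TILDE, then "e".
const SAMPLE: &str = "ay\u{303}e";
```

Calling `truncate_code_points(SAMPLE, 2)` yields `"ay"`: the tilde that belonged to the last kept letter is lost, which is the accent-stripping effect mentioned above.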

Kimundi · 2013-06-10T09:16:10Z

Yeah, I'm of the same opinion.
If iterators exist for all three, none of them should be the shorter default; people need to think about which one they need.

The docs of each of those three functions should also contain a short example along these lines:

/// Returns an Iterator over the graphemes of a string.
///
/// Which string iterator do I need?
/// - "aỹe".iter_graph() => iterates "a", "ỹ", "e"
/// - "aỹe".iter_cp()    => iterates 'a', 'y', '\u0303', 'e'
/// - "aỹe".iter_bytes() => iterates 0x61, 0x79, 0xcc, 0x83, 0x65
fn iter_graph() ...
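
The byte and code point rows of that doc example can be checked with the standard library as it exists today, where the methods are called `bytes()` and `chars()` (the `iter_*` names above were only proposals). The helper names `bytes_of` and `code_points_of` are hypothetical; this is a sketch, not the documentation that was eventually written.

```rust
/// The example string, written with an explicit combining tilde (U+0303)
/// so the decomposed form is unambiguous in source code.
const SAMPLE: &str = "ay\u{303}e";

/// Collect the UTF-8 bytes of `s` (hypothetical helper).
fn bytes_of(s: &str) -> Vec<u8> {
    s.bytes().collect()
}

/// Collect the Unicode code points of `s` (hypothetical helper).
fn code_points_of(s: &str) -> Vec<char> {
    s.chars().collect()
}
```

For `SAMPLE`, `bytes_of` gives five bytes and `code_points_of` gives four code points, while a reader sees three characters. Grapheme iteration is the missing third view that this issue asks for.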

msullivan · 2013-07-29T21:58:52Z

Nominating for backwards compatibility milestone, I suppose? String handling is a pretty fundamental part of the libraries.

bluss · 2013-08-02T15:17:03Z

Normalization forms will matter too. Kimundi's 4-codepoint string "aỹe" is 3 codepoints in NFC normalization.

We have a 1-to-1 encoding of utf-8 now, so at least the bytewise equality is the same as the charwise equality. Should string equality hold across unicode normalizations too?
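
A small sketch of the equality question, using only the standard library: the same visible text in decomposed and precomposed form compares unequal with `==`, because string equality is bytewise. The constant and function names are hypothetical.

```rust
/// "aỹe" with the tilde as a separate combining code point (U+0303).
const DECOMPOSED: &str = "ay\u{303}e";

/// "aỹe" with the precomposed letter U+1EF9, as NFC would produce.
const PRECOMPOSED: &str = "a\u{1ef9}e";

/// Number of code points in `s` (hypothetical helper).
fn code_point_count(s: &str) -> usize {
    s.chars().count()
}

/// Plain `==` on `&str` compares bytes and does no normalization.
fn bytewise_equal(a: &str, b: &str) -> bool {
    a == b
}
```

Here `code_point_count` returns 4 for `DECOMPOSED` and 3 for `PRECOMPOSED`, and `bytewise_equal(DECOMPOSED, PRECOMPOSED)` is `false` even though both render as the same text.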

Kimundi · 2013-08-02T19:15:31Z

Rust strings are defined as UTF-8, but NFC normalization is an additional property on top of that.

I think we should provide functions for explicitly normalizing a str, and maybe add Iterator adapters that normalize lazily, but I don't think it is something that should happen automatically (it would generally mean more allocations).

However, if we ever get user definable unsized types, nothing would speak against having a nfc_str, where the invariant 'nfc normalized utf8' holds.

bluss · 2013-08-18T12:34:10Z

Also a replacement for .word_iter() that takes unicode properties into account.

http://www.unicode.org/reports/tr29

catamorphism · 2013-09-05T16:44:58Z

Not backwards-incompatible; accepted for feature-complete

emberian · 2014-02-17T22:44:03Z

Visiting for triage. This is still as important as ever. To my knowledge, no progress has been made.

pzol · 2014-02-26T17:23:49Z

In order to be consistent with current naming, my suggestions would be:

bytes()
chars() // codepoints
graphemes()

I'd like to work on the graphemes() iterator.

Kimundi · 2014-02-26T17:30:41Z

@pzol: Yeah, something like that would work. For the name, seeing how "grapheme cluster" is the correct name, an alternative to graphemes could also be clusters.

pnkfelix · 2014-03-20T20:30:49Z

We can add support for graphemes backwards compatibly. Therefore not a backwards-compatibility issue. Not tagging as a 1.0 blocker.

Assigning P-low, not 1.0.

Meyermagic · 2014-03-24T03:17:08Z

I'm working on a patch for grapheme cluster iteration here: https://github.com/Meyermagic/rust/compare/graphemecluster

Still need to clean it up, optimize, write tests, etc. There are probably some code style issues, too.

kwantam · 2014-07-11T22:24:28Z

Folks: I added a Graphemes iterator to the UnicodeStrSlice trait: #15619

Comments appreciated.

- `width()` computes the displayed width of a string, ignoring the width of control characters.
  - arguably we might do *something* else for control characters, but the question is, what?
  - users who want to do something else can iterate over chars()
- `graphemes()` returns a `Graphemes` struct, which implements an iterator over the grapheme clusters of a &str.
  - fully compliant with [UAX#29](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
  - passes all [Unicode-supplied tests](http://www.unicode.org/reports/tr41/tr41-15.html#Tests29)
- added code to generate additional categories in `unicode.py`
  - `Cn` aka `Not_Assigned`
  - categories necessary for grapheme cluster breaking
- tidied up the exports from libunicode
  - all exports are exposed through a module rather than directly at crate root.
  - std::prelude imports UnicodeChar and UnicodeStrSlice from std::char and std::str rather than directly from libunicode

closes #7043
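
To make "iterating over grapheme clusters" concrete, here is a deliberately simplified sketch. It is not the UAX#29 algorithm and not the code from #15619: it only attaches code points from the Combining Diacritical Marks block (U+0300 to U+036F) to the preceding code point, and ignores every other boundary rule. The function names are hypothetical.

```rust
/// True for code points in the Combining Diacritical Marks block.
/// ASSUMPTION: real grapheme breaking uses far more categories than this.
fn is_combining_mark(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Split `s` into approximate clusters: each cluster is one code point
/// plus any combining marks that directly follow it.
/// (Hypothetical helper; a simplification of UAX#29, not an implementation.)
fn approx_clusters(s: &str) -> Vec<&str> {
    let mut clusters = Vec::new();
    let mut start = 0;
    for (idx, c) in s.char_indices() {
        // A new cluster starts at every non-combining code point
        // except the very first one in the string.
        if idx != 0 && !is_combining_mark(c) {
            clusters.push(&s[start..idx]);
            start = idx;
        }
    }
    // Flush the final cluster, if the string was not empty.
    if !s.is_empty() {
        clusters.push(&s[start..]);
    }
    clusters
}
```

On the thread's example string written as `"ay\u{303}e"`, `approx_clusters` returns three slices: `"a"`, `"y\u{303}"`, and `"e"`. Anything beyond that block of combining marks (Hangul syllables, emoji sequences, CR LF, and so on) needs the full rules and data tables from UAX#29.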


pnkfelix mentioned this issue Aug 23, 2013

getopts should use grapheme clusters for text alignment #5516

Closed

huonw mentioned this issue Aug 24, 2013

Bad span computations with unicode characters, should be handling them as graphemes #8706

Closed

Kimundi mentioned this issue Sep 24, 2013

Consider removing the _iter suffixes on specialized Iterator constructors #9440

Closed

pzol self-assigned this Feb 26, 2014

pnkfelix added P-low and removed P-high-untriaged labels Mar 20, 2014

huonw referenced this issue in kmcallister/rust Jun 12, 2014

Replace enum LintId with an extensible alternative

4c50f8b

huonw unassigned pzol Jun 16, 2014

kwantam mentioned this issue Jul 11, 2014

UnicodeStrSlice: add width() and graphemes() methods #15619

Merged

bors closed this as completed in #15619 Jul 16, 2014

ftxqxd mentioned this issue Jan 24, 2015

Diagnostics' ^~~~ is not aligned properly when error contains 日本語 characters #21492

Closed

flip1995 pushed a commit to flip1995/rust that referenced this issue Apr 8, 2021

Auto merge of rust-lang#7043 - camsteffen:dead-utils, r=flip1995

bbe1567

Remove some dead utils changelog: none
