Unicode grapheme support #7043

Closed
huonw opened this issue Jun 10, 2013 · 13 comments · Fixed by #15619
Labels
A-Unicode Area: Unicode C-enhancement Category: An issue proposing an enhancement or a PR with one. P-low Low priority

Comments

@huonw
Member

huonw commented Jun 10, 2013

Currently Rust doesn't support unicode properly, e.g. there is no way to iterate over a string by grapheme (there is .iter() for codepoints, and .bytes_iter() for bytes).

Possibly useful: http://useless-factor.blogspot.de/2007/08/unicode-implementers-guide-part-4.html

@mleise

mleise commented Jun 10, 2013

To force people new to Unicode to understand what they are iterating over, the names could be chosen so that none of them looks more like the default than the others. Grapheme clusters are slow to decode, but they are what you typically want when you need to limit text to the first n characters. Even when storing text in a fixed-size database field, one should be aware that truncating by code points may strip the accents off the last letter. Maybe a scheme like this would ensure the right decision is made in user code:

bytes_iter();
cp_iter();
graph_iter();
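
To illustrate the truncation concern, here is a minimal sketch using `chars()`, the name current Rust gives to code point iteration. The helper `truncate_code_points` is hypothetical and exists only for this example.

```rust
/// Keeps the first `n` code points of `s`.
/// Hypothetical helper for illustration; not part of any library.
fn truncate_code_points(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

fn main() {
    // 'a', then 'y' followed by U+0303 COMBINING TILDE: two visible letters.
    let text = "ay\u{303}";
    // Cutting after two code points keeps the 'y' but drops its tilde.
    let cut = truncate_code_points(text, 2);
    println!("{:?} -> {:?}", text, cut);
}
```

Truncating by grapheme cluster would keep the tilde attached to its base letter.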

@Kimundi
Member

Kimundi commented Jun 10, 2013

Yeah, I'm of the same opinion.
If iterators exist for all three, none of them should be the shorter default; people need to think about which one they need.

The docs for each of those three functions should also contain a short example along these lines:

/// Returns an iterator over the graphemes of a string.
///
/// Which string iterator do I need?
/// - "aỹe".iter_graph() => iterates "a", "ỹ", "e"
/// - "aỹe".iter_cp()    => iterates 'a', 'y', '\u0303', 'e'
/// - "aỹe".iter_bytes() => iterates 0x61, 0x79, 0xcc, 0x83, 0x65
fn iter_graph() ...
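
The code point and byte rows of that doc example can be checked with current Rust, where the two iterators are spelled `chars()` and `bytes()` rather than the names proposed here. This is a sketch only; the wrapper functions `code_points` and `utf8_bytes` are made up for the example, and the string is written with an explicit combining tilde so the decomposed form is unambiguous.

```rust
/// Collects the code points of `s` (illustrative wrapper around `chars()`).
fn code_points(s: &str) -> Vec<char> {
    s.chars().collect()
}

/// Collects the UTF-8 bytes of `s` (illustrative wrapper around `bytes()`).
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.bytes().collect()
}

fn main() {
    // 'a', 'y' + U+0303 COMBINING TILDE, 'e'
    let s = "ay\u{303}e";
    println!("{:?}", code_points(s));
    println!("{:x?}", utf8_bytes(s));
}
```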

@msullivan
Contributor

Nominating for backwards compatibility milestone, I suppose? String handling is a pretty fundamental part of the libraries.

@bluss
Member

bluss commented Aug 2, 2013

Normalization forms will matter too. Kimundi's 4-codepoint string "aỹe" is 3 codepoints in NFC normalization.

We have a 1-to-1 encoding of utf-8 now, so at least the bytewise equality is the same as the charwise equality. Should string equality hold across unicode normalizations too?
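
A small sketch of the situation in question, using only the standard library: the decomposed and precomposed spellings render the same but compare unequal, because `str` equality is bytewise. The helper `code_point_count` is a name invented for this example.

```rust
/// Counts the code points in `s` (hypothetical helper for illustration).
fn code_point_count(s: &str) -> usize {
    s.chars().count()
}

fn main() {
    let decomposed = "ay\u{303}e"; // 'y' + U+0303 COMBINING TILDE
    let precomposed = "a\u{1EF9}e"; // U+1EF9 LATIN SMALL LETTER Y WITH TILDE (NFC form)

    println!("{} code points", code_point_count(decomposed));
    println!("{} code points", code_point_count(precomposed));
    // Equality on str compares bytes, so no normalization is applied.
    println!("equal: {}", decomposed == precomposed);
}
```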

@Kimundi
Member

Kimundi commented Aug 2, 2013

Rust strings are defined as UTF-8, but NFC normalization is an additional property on top of that.

I think we should provide functions for explicitly normalizing a str, and maybe add Iterator adapters that normalize lazily, but I don't think it is something that should happen automatically (it would generally mean more allocations).

However, if we ever get user definable unsized types, nothing would speak against having a nfc_str, where the invariant 'nfc normalized utf8' holds.

@bluss
Member

bluss commented Aug 18, 2013

  • Also a replacement for .word_iter() that takes unicode properties into account.

http://www.unicode.org/reports/tr29
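
As a rough illustration of why Unicode properties matter for word splitting, the sketch below compares two methods in today's standard library: `split_whitespace`, which uses the Unicode `White_Space` property, and `split_ascii_whitespace`, which does not. Neither implements the word boundary rules of UAX #29; the wrapper names are invented for this example.

```rust
/// Splits on Unicode White_Space (illustrative wrapper).
fn unicode_words(s: &str) -> Vec<&str> {
    s.split_whitespace().collect()
}

/// Splits on ASCII whitespace only (illustrative wrapper).
fn ascii_words(s: &str) -> Vec<&str> {
    s.split_ascii_whitespace().collect()
}

fn main() {
    // U+3000 IDEOGRAPHIC SPACE is whitespace in Unicode but not in ASCII.
    let s = "hello\u{3000}world";
    println!("{:?}", unicode_words(s));
    println!("{:?}", ascii_words(s));
}
```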

@catamorphism
Contributor

Not backwards-incompatible; accepted for feature-complete

@emberian
Member

Visiting for triage. This is still as important as ever. To my knowledge, no progress has been made.

@pzol
Contributor

pzol commented Feb 26, 2014

In order to be consistent with current naming, my suggestions would be:

bytes()
chars() // codepoints
graphemes()

I'd like to work on the graphemes() iterator.
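
To make the difference between `chars()` and a grapheme iterator concrete, here is a deliberately simplified sketch. It is not UAX #29 segmentation: it only attaches code points from the Combining Diacritical Marks block (U+0300 to U+036F) to the preceding character, and the function name `toy_clusters` is hypothetical.

```rust
/// Toy approximation of grapheme clustering: each cluster is one char plus
/// any directly following chars in U+0300..=U+036F. This is an assumption
/// made for illustration and does NOT follow the full UAX #29 rules.
fn toy_clusters(s: &str) -> Vec<&str> {
    let mut clusters = Vec::new();
    let mut start = 0;
    for (idx, ch) in s.char_indices() {
        let is_mark = ('\u{300}'..='\u{36F}').contains(&ch);
        // A non-mark char after the first position begins a new cluster.
        if idx > start && !is_mark {
            clusters.push(&s[start..idx]);
            start = idx;
        }
    }
    // Flush the final cluster, if the string was not empty.
    if start < s.len() {
        clusters.push(&s[start..]);
    }
    clusters
}

fn main() {
    // 'a', 'y' + U+0303 COMBINING TILDE, 'e'
    println!("{:?}", toy_clusters("ay\u{303}e"));
}
```

A real implementation needs the grapheme break property tables from the Unicode data files, which this sketch leaves out entirely.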

@Kimundi
Member

Kimundi commented Feb 26, 2014

@pzol: Yeah, something like that would work. For the name, seeing how "grapheme cluster" is the correct name, an alternative to graphemes could also be clusters.

@pzol pzol self-assigned this Feb 26, 2014
@pnkfelix
Member

We can add support for graphemes backwards compatibly. Therefore not a backwards-compatibility issue. Not tagging as a 1.0 blocker.

Assigning P-low, not 1.0.

@Meyermagic
Contributor

I'm working on a patch for grapheme cluster iteration here: https://github.com/Meyermagic/rust/compare/graphemecluster

Still need to clean it up, optimize, write tests, etc. There are probably some code style issues, too.

@kwantam
Contributor

kwantam commented Jul 11, 2014

Folks: I added a Graphemes iterator to the UnicodeStrSlice trait: #15619

Comments appreciated.

bors added a commit that referenced this issue Jul 15, 2014
- `width()` computes the displayed width of a string, ignoring the width of control characters.
    - arguably we might do *something* else for control characters, but the question is, what?
    - users who want to do something else can iterate over chars()

- `graphemes()` returns a `Graphemes` struct, which implements an iterator over the grapheme clusters of a &str.
    - fully compliant with [UAX#29](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
    - passes all [Unicode-supplied tests](http://www.unicode.org/reports/tr41/tr41-15.html#Tests29)

- added code to generate additional categories in `unicode.py`
    - `Cn` aka `Not_Assigned`
    - categories necessary for grapheme cluster breaking

- tidied up the exports from libunicode
  - all exports are exposed through a module rather than directly at crate root.
  - std::prelude imports UnicodeChar and UnicodeStrSlice from std::char and std::str rather than directly from libunicode

closes #7043
flip1995 pushed a commit to flip1995/rust that referenced this issue Apr 8, 2021
Remove some dead utils

changelog: none