Unicode grapheme support #7043
To force people new to Unicode to understand what they iterate over, the names could be chosen so that neither looks more common than the other. Grapheme clusters are slow to decode, but they are what you typically want when you need to limit text to the first n characters. Even when storing text in a fixed-size database field, one should be aware that truncating at a code point boundary may strip the accents off the last letter. Maybe a scheme like this would ensure the right decision is made in user code:
|
Yeah, I'm of the same opinion. The docs of each of those three functions should also contain a short example along those lines:
|
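The truncation hazard described above can be reproduced with nothing but the standard library. This is a minimal sketch, not the scheme or doc example proposed in the comments (those were not preserved); `truncate_code_points` and `demo_truncation` are hypothetical names chosen for illustration.

```rust
/// Hypothetical helper: keep at most `n` code points of `s`.
/// This is what limiting by `chars()` amounts to.
fn truncate_code_points(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

/// Shows how cutting at a code point boundary can drop an accent.
fn demo_truncation() {
    // "café" spelled as 'e' followed by U+0301 (combining acute accent).
    let word = "cafe\u{301}";
    assert_eq!(word.len(), 6); // six UTF-8 bytes
    assert_eq!(word.chars().count(), 5); // five code points

    // A reader sees four characters, but limiting to four code points
    // cuts between the 'e' and its accent.
    let cut = truncate_code_points(word, 4);
    assert_eq!(cut, "cafe");
}
```

Limiting by grapheme cluster instead would keep the `e` and its accent together, at the cost of the slower decoding mentioned above.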
Nominating for backwards compatibility milestone, I suppose? String handling is a pretty fundamental part of the libraries. |
Normalization forms will matter too. Kimundi's 4-codepoint string We have a 1-to-1 encoding of UTF-8 now, so at least bytewise equality is the same as charwise equality. Should string equality hold across Unicode normalizations too? |
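To make the equality question concrete, here is a small sketch using only the standard library, which compares strings bytewise and performs no normalization. The helper names `same_code_points` and `demo_normalization` are hypothetical.

```rust
/// Hypothetical helper: true when both strings contain the same
/// sequence of code points.
fn same_code_points(a: &str, b: &str) -> bool {
    a.chars().eq(b.chars())
}

/// Two spellings of "é" that render identically but compare unequal.
fn demo_normalization() {
    let composed = "\u{e9}"; // one precomposed code point (the NFC spelling)
    let decomposed = "e\u{301}"; // 'e' + combining acute accent (the NFD spelling)

    assert_eq!(composed.len(), 2); // UTF-8 bytes
    assert_eq!(decomposed.len(), 3);

    // `==` on str is bytewise, so the two spellings differ.
    assert_ne!(composed, decomposed);

    // Bytewise and charwise equality agree, as noted in the comment above.
    assert_eq!(
        composed == decomposed,
        same_code_points(composed, decomposed)
    );
}
```

Making these two values compare equal would require normalizing at least one side first, which is the allocation cost raised in the reply that follows.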
Rust strings are defined as UTF-8, but NFC normalization is an additional property on top of that. I think we should provide functions for explicitly normalizing a str, and maybe add Iterator adapters that normalize lazily, but I don't think it is something that should happen automatically (it would generally mean more allocations). However, if we ever get user-definable unsized types, nothing would speak against having a |
|
Not backwards-incompatible; accepted for feature-complete |
Visiting for triage. This is still as important as ever. To my knowledge, no progress has been made. |
In order to be consistent with current naming, my suggestions would be:
bytes()
chars() // codepoints
graphemes()
I'd like to work on the graphemes() iterator. |
@pzol: Yeah, something like that would work. For the name, seeing how "grapheme cluster" is the correct name, an alternative to |
We can add support for graphemes backwards compatibly. Therefore not a backwards-compatibility issue. Not tagging as a 1.0 blocker. Assigning P-low, not 1.0. |
I'm working on a patch for grapheme cluster iteration here: https://github.com/Meyermagic/rust/compare/graphemecluster Still need to clean it up, optimize, write tests, etc. There are probably some code style issues, too. |
Folks: I added a Graphemes iterator to the UnicodeStrSlice trait: #15619 Comments appreciated. |
- `width()` computes the displayed width of a string, ignoring the width of control characters.
    - arguably we might do *something* else for control characters, but the question is, what?
    - users who want to do something else can iterate over chars()
- `graphemes()` returns a `Graphemes` struct, which implements an iterator over the grapheme clusters of a &str.
    - fully compliant with [UAX#29](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)
    - passes all [Unicode-supplied tests](http://www.unicode.org/reports/tr41/tr41-15.html#Tests29)
- added code to generate additional categories in `unicode.py`
    - `Cn` aka `Not_Assigned`
    - categories necessary for grapheme cluster breaking
- tidied up the exports from libunicode
    - all exports are exposed through a module rather than directly at crate root.
    - std::prelude imports UnicodeChar and UnicodeStrSlice from std::char and std::str rather than directly from libunicode

closes #7043
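For readers who want a feel for what a grapheme iterator has to do, the following is a deliberately naive sketch. It is not the `Graphemes` implementation from the patch and it is not UAX#29 compliant: it only attaches code points from the Combining Diacritical Marks block (U+0300 to U+036F) to the preceding character, and ignores every other rule (CR LF, Hangul syllables, other extending characters, and so on). The names `is_combining_mark` and `naive_clusters` are hypothetical.

```rust
/// Assumption for this sketch only: treat just U+0300..=U+036F as
/// combining marks. The real rules use Unicode property tables.
fn is_combining_mark(c: char) -> bool {
    ('\u{300}'..='\u{36f}').contains(&c)
}

/// Naive clustering: each slice is one character followed by any
/// combining marks that come directly after it.
fn naive_clusters(s: &str) -> Vec<&str> {
    let mut clusters = Vec::new();
    let mut start = 0;
    for (i, c) in s.char_indices() {
        // A non-combining character begins a new cluster, so close
        // the one collected so far.
        if i > start && !is_combining_mark(c) {
            clusters.push(&s[start..i]);
            start = i;
        }
    }
    if start < s.len() {
        clusters.push(&s[start..]);
    }
    clusters
}
```

On `"cafe\u{301}!"` this yields five slices, with the `e` and its accent kept together in one of them, whereas iterating by code point yields six items and by byte yields seven.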
Remove some dead utils changelog: none
Currently Rust doesn't support Unicode properly, e.g. there is no way to iterate over a string by grapheme (there is `.iter()` for codepoints, and `.bytes_iter()` for bytes).
Possibly useful: http://useless-factor.blogspot.de/2007/08/unicode-implementers-guide-part-4.html