Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Char::to_{lower,upper}case should return Option<&'static str> instead of char #20333

Closed
SimonSapin opened this issue Dec 30, 2014 · 1 comment · Fixed by #23126
Closed

Char::to_{lower,upper}case should return Option<&'static str> instead of char #20333

SimonSapin opened this issue Dec 30, 2014 · 1 comment · Fixed by #23126
Labels
A-collections Area: `std::collection`

Comments

@SimonSapin
Copy link
Contributor

These two methods should not be stabilized as-is. They should be changed to return a variable number of code points (between one and three), per Unicode’s SpecialCasing.txt.

Such results could be represented as &'static str or &'static [char] slices of a static table in libunicode. The former avoid re-encoding to UTF-8 when accumulating results in a String. To avoid having an entry in that table for every one of the 1114111 code points, the return type could be an Option, where None means that the code point is unchanged by the mapping. (This is by large the common case.) Or it could be a new special-purpose type like enum CaseMappingResult { Unchanged, MappedTo(&'static str) }.

Since the Char methods become less convenient to use, there should be str::to_{lower,upper}case() -> String wrappers.

SpecialCasing.txt also defines some language-sensitive mappings for Turkish and Lithuanian, but I suggest not including them, for a few reasons:

  • Using the system’s locale is a very bad idea. Programs behaving differently on different systems is a source of countless bugs, and the system’s locale may not even be that of the end users (e.g for server-side software.)

  • Forcing users to specify a language is counter-productive since it might often end up being hard-coded to English or something. There should be a default.

  • Users who do care about language-specific tailoring may want to do more anyway. SpecialCasing.txt says:

    Note that the preferred mechanism for defining tailored casing operations is the Unicode Common Locale Data Repository (CLDR).

Finally, there are conditional mappings that depend on the context of surrounding code points, but not on the language. They could be special cases in the str methods, but I don’t know if it’s worth the bother since there is currently only one such special case. (Greek capital sigma at the end of a word.)

More background on Unicode case mappings:

http://unicode.org/faq/casemap_charprop.html
http://www.unicode.org/reports/tr44/tr44-14.html#Casemapping

CC @huonw, @aturon

@SimonSapin
Copy link
Contributor Author

Another possibility is to make the Char methods private (or not re-exported by libstd), implementation details of the str methods.

bors added a commit that referenced this issue Jan 5, 2015
cc #19260 

The casing transformations are left unstable (it is highly likely to be better to adopt the proper non-1-to-1 case mappings, per #20333) as are `is_xid_*`.

I've got a little todo list in the last commit of things I thought about/was told about that I haven't yet handled (I'd also like some feedback).
@kmcallister kmcallister added A-libs A-collections Area: `std::collection` labels Jan 16, 2015
bors added a commit that referenced this issue Mar 10, 2015
This commit performs another pass over the `std::char` module for stabilization.
Some minor cleanup is performed such as migrating documentation from libcore to
libunicode (where the `std`-facing trait resides) as well as a slight
reorganiation in libunicode itself. Otherwise, the stability modifications made
are:

* `char::from_digit` is now stable
* `CharExt::is_digit` is now stable
* `CharExt::to_digit` is now stable
* `CharExt::to_{lower,upper}case` are now stable after being modified to return
  an iterator over characters. While the implementation today has not changed
  this should allow us to implement the full set of case conversions in unicode
  where some characters can map to multiple when doing an upper or lower case
  mapping.
* `StrExt::to_{lower,upper}case` was added as unstable for a convenience of not
  having to worry about characters expanding to more characters when you just
  want the whole string to get into upper or lower case.

This is a breaking change due to the change in the signatures of the
`CharExt::to_{upper,lower}case` methods. Code can be updated to use functions
like `flat_map` or `collect` to handle the difference.

[breaking-change]

Closes #20333
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-collections Area: `std::collection`
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants