Tracking issue for char encoding methods #27784
Comments
How about returning enums like |
Certainly possible, but there's also the question of ergonomics here in terms of what to do with that after you've got the information. |
I think that one form or another of this functionality that doesn’t require allocation should be exposed. Returning an iterator is nicer than taking |
I’ve suggested taking Taking anything other than a slice also makes the following use case less elegant than it is now:
let mut buffer = Vec::with_capacity(alot);
let mut idx = 0;
loop {
idx += some_char().encode_utf8(&mut buffer[idx..]).unwrap();
} |
* Rename `Utf16Items` to `Utf16Decoder`. "Items" is meaningless.
* Generalize it to any `u16` iterator, not just `[u16].iter()`
* Make it yield `Result` instead of a custom `Utf16Item` enum that was isomorphic to `Result`. This enables using the `FromIterator for Result` impl.
* Replace `Utf16Item::to_char_lossy` with a `Utf16Decoder::lossy` iterator adaptor.
This is a [breaking change], but only for users of the unstable `rustc_unicode` crate.
I’d like this functionality to be stabilized and re-exported in `std` eventually, as the "low-level equivalent" of `String::from_utf16` and `String::from_utf16_lossy` like #27784 is the low-level equivalent of #27714. CC @aturon, @alexcrichton
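For readers following along on stable Rust, here is a minimal sketch of the `Result`-yielding decoder idea. It uses `std::char::decode_utf16` rather than the unstable `Utf16Decoder` named in the commit message. The helper names `decode_strict` and `decode_lossy` are made up for illustration.

```rust
use std::char::{decode_utf16, DecodeUtf16Error, REPLACEMENT_CHARACTER};

/// Strict decoding: returns the first error if an unpaired surrogate is found.
/// (Hypothetical helper name, not part of std.)
fn decode_strict(units: &[u16]) -> Result<String, DecodeUtf16Error> {
    // Each item is a `Result<char, DecodeUtf16Error>`, so `collect` can go
    // through the `FromIterator` impl for `Result`.
    decode_utf16(units.iter().cloned()).collect()
}

/// Lossy decoding: each unpaired surrogate becomes U+FFFD.
/// (Hypothetical helper name, not part of std.)
fn decode_lossy(units: &[u16]) -> String {
    decode_utf16(units.iter().cloned())
        .map(|r| r.unwrap_or(REPLACEMENT_CHARACTER))
        .collect()
}
```

Because the items are plain `Result`s, the lossy variant is just a `map` over the strict iterator, which is the kind of composition the commit message argues for.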
How about returning something that both is an iterator and dereferences to a slice?
use std::ops::Deref;
struct Utf8Char {
    bytes: [u8; 4],
    position: usize,
}
impl Deref for Utf8Char {
    type Target = [u8];
    // The not-yet-consumed bytes are always the tail of the array.
    fn deref(&self) -> &[u8] { &self.bytes[self.position..] }
}
impl Iterator for Utf8Char {
    type Item = u8;
    fn next(&mut self) -> Option<u8> {
        if self.position < self.bytes.len() {
            let byte = self.bytes[self.position];
            self.position += 1;
            Some(byte)
        } else {
            None
        }
    }
}
(“Short” code points have zeros as padding at the start of the array.) … and similarly for UTF-16, but with |
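To make the padding remark concrete, here is a rough, self-contained sketch of how such a type could be constructed. The `new` constructor is hypothetical. It relies on the slice-based `char::encode_utf8(self, &mut [u8]) -> &mut str` from current stable Rust and then right-aligns the bytes, so it is not the implementation discussed in the thread.

```rust
use std::ops::Deref;

struct Utf8Char {
    bytes: [u8; 4],
    position: usize,
}

impl Utf8Char {
    /// Hypothetical constructor: right-aligns the encoded bytes so that
    /// short code points have zero padding at the start of the array.
    fn new(c: char) -> Utf8Char {
        let mut tmp = [0u8; 4];
        let len = c.encode_utf8(&mut tmp).len();
        let start = 4 - len;
        let mut bytes = [0u8; 4];
        bytes[start..].copy_from_slice(&tmp[..len]);
        Utf8Char { bytes, position: start }
    }
}

impl Deref for Utf8Char {
    type Target = [u8];
    // The remaining bytes are always the tail of the array.
    fn deref(&self) -> &[u8] {
        &self.bytes[self.position..]
    }
}

impl Iterator for Utf8Char {
    type Item = u8;
    fn next(&mut self) -> Option<u8> {
        if self.position < self.bytes.len() {
            let byte = self.bytes[self.position];
            self.position += 1;
            Some(byte)
        } else {
            None
        }
    }
}
```

Right-aligning means the iterator only ever needs one index: the slice view and the iteration state are the same field.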
@SimonSapin That looks really sweet to me! |
@SimonSapin In your |
@SimonSapin What do you think about also exposing |
No, We already have For UTF-8 I do want to expose a decoder that’s more low-level than what we currently have, but I’m not sure what it should look like. I have some experiments at https://github.com/SimonSapin/rust-utf8 |
Ah! That was what I missed. Thanks for the clarification.
Interesting. That is much more complex than I had thought it would be! (I hadn't considered returning additional info about incomplete sequences.) |
Most of the complexity comes from self-imposed constraints:
I don’t know how much of that should be in the standard library. But when the standard library gets performance improvements like #30740 (and perhaps more in the future with SIMD or something?), ideally they’d be in a low-level algorithm that everything else builds on top of. |
🔔 This issue is now entering its cycle-long final comment period for stabilization 🔔
The API proposed by @SimonSapin seems reasonable, perhaps in the form of:
impl char {
    fn encode_utf8(&self) -> EncodeUtf8;
}
struct EncodeUtf8 {
    // ...
}
impl Iterator for EncodeUtf8 {
    type Item = u8;
    // ...
}
impl EncodeUtf8 {
    #[unstable(...)]
    pub fn as_slice(&self) -> &[u8] { /* ... */ }
}
Currently these have non-traditional APIs which take a buffer and report how much was filled in, but they're not necessarily ergonomic to use. Returning an iterator which *also* exposes an underlying slice shouldn't result in any performance loss as it's just a lazy version of the same implementation, and it's also much more ergonomic! cc rust-lang#27784
…turon std: Change `encode_utf{8,16}` to return iterators Currently these have non-traditional APIs which take a buffer and report how much was filled in, but they're not necessarily ergonomic to use. Returning an iterator which *also* exposes an underlying slice shouldn't result in any performance loss as it's just a lazy version of the same implementation, and it's also much more ergonomic! cc #27784
Why add an |
@alexcrichton not sure if this is in scope, but I have the use case where I want to write a "char" into a u8 array at a given offset. Right now I've copied the main part from stdlib and do it like so:
#[inline]
fn write_char_into_array(&self, offset: usize, ch: &char, array: &mut [u8]) -> bool {
    if (ch.len_utf8() + offset) > array.len() {
        return false;
    }
    let code = *ch as u32;
    if code < MAX_ONE_B {
        array[offset] = code as u8;
    } else if code < MAX_TWO_B {
        array[offset] = (code >> 6 & 0x1F) as u8 | TAG_TWO_B;
        array[offset + 1] = (code & 0x3F) as u8 | TAG_CONT;
    } else if code < MAX_THREE_B {
        array[offset] = (code >> 12 & 0x0F) as u8 | TAG_THREE_B;
        array[offset + 1] = (code >> 6 & 0x3F) as u8 | TAG_CONT;
        array[offset + 2] = (code & 0x3F) as u8 | TAG_CONT;
    } else {
        array[offset] = (code >> 18 & 0x07) as u8 | TAG_FOUR_B;
        array[offset + 1] = (code >> 12 & 0x3F) as u8 | TAG_CONT;
        array[offset + 2] = (code >> 6 & 0x3F) as u8 | TAG_CONT;
        array[offset + 3] = (code & 0x3F) as u8 | TAG_CONT;
    }
    true
}
Again not sure if this is in scope, but I wanted to bring it up in case such a use case makes sense to integrate. |
@daschl you can deal with the offset by slicing:
fn write_char_into_array(&self, offset: usize, ch: &char, array: &mut [u8]) -> bool {
    let slice = &mut array[offset..];
    if ch.len_utf8() > slice.len() {
        return false
    }
    ch.encode_utf8(slice);
    true
}
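As a standalone illustration of the slicing approach, here is a small sketch written as a free function against the slice-based `char::encode_utf8` in current stable Rust. The name `write_char_at` and the `Option<usize>` return type are choices made for this example only. It uses `get_mut` instead of indexing, so an `offset` past the end of the array is reported as `None` rather than causing a panic.

```rust
/// Writes `ch` as UTF-8 into `array` starting at `offset`.
/// Returns the number of bytes written, or `None` if it does not fit.
/// (Hypothetical helper for illustration, not a std API.)
fn write_char_at(array: &mut [u8], offset: usize, ch: char) -> Option<usize> {
    // `get_mut` returns `None` when `offset` is beyond the end of the array.
    let slice = array.get_mut(offset..)?;
    if ch.len_utf8() > slice.len() {
        return None;
    }
    // `encode_utf8` returns the written part as `&mut str`; its length is
    // the number of bytes used.
    Some(ch.encode_utf8(slice).len())
}
```

Checking `len_utf8()` first matters because the slice-based `encode_utf8` panics when the buffer is too small.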
@alexcrichton All of your suggestions are good. I don't know if this data point is interesting, but ArrayString ended up copying this code too. When I shaped it after how it's used it ends up looking like
|
@SimonSapin ah, good to know, thanks! Let's hope it gets stable soon then |
I would vote for no panic. That also simplifies the bounds checking (the impl linked above has all bounds checks elided, since |
@rfcbot fcp merge
These methods have been around for a while, I propose we merge! |
er, stabilize* |
Team member @alexcrichton has proposed to merge this. The next step is review by the rest of the tagged teams: Concerns:
Once these reviewers reach consensus, this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up! See this document for info about what commands tagged team members can give me. |
It's not obvious to me that the current API is the one we want since there has been ongoing discussion. @bluss recently suggested it should not panic, and the current implementation does panic. |
@rfcbot concern panic-vs-not-panic |
I'm personally convinced by @SimonSapin's points above, which is that there is a statically known size (4 and 2) to encode all utf8 and utf16 characters. Callers basically always need to pass in that size buffer, so it seems more like a programmer error if you pass in a small buffer than a runtime error that should be handled. |
To reiterate: you can use 2 or 4 to keep things simple or to use a statically-sized array, but you can also use |
I don't want to cause a gridlock, it was just what my perspective was in that moment. The fixed capacity use case is more uncommon, and the motivation for panic is following the usual conventions, so it certainly makes sense. |
OK, it sounds like there's no major opposition to the api as is. |
@rfcbot resolved panic-vs-not-panic |
Apart from the panicking. I'm a bit confused right now about what the actual API/signature is going to be. The one that returns |
Yeah, `fn encode_utf8(self, dst: &mut [u8]) -> &mut str` and `fn encode_utf16(self, dst: &mut [u16]) -> &mut [u16]`
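A short sketch of how the two signatures above are typically called, using fixed buffers of the maximum sizes mentioned earlier in the thread (4 bytes for UTF-8, 2 units for UTF-16). The helper names `utf8_bytes` and `utf16_units` are invented for this example, and copying into a `Vec` is only there so the result can be returned from the sketch.

```rust
/// Encodes `c` as UTF-8 using a stack buffer of the maximum length.
/// (Hypothetical helper for illustration.)
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4]; // 4 bytes is enough for any `char`
    // The returned `&mut str` covers only the bytes actually written.
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Encodes `c` as UTF-16 using a stack buffer of the maximum length.
/// (Hypothetical helper for illustration.)
fn utf16_units(c: char) -> Vec<u16> {
    let mut buf = [0u16; 2]; // 2 units is enough for any `char`
    // The returned `&mut [u16]` covers only the units actually written.
    c.encode_utf16(&mut buf).to_vec()
}
```

Returning a subslice of the caller's buffer is what removes the need for the separate "how many bytes were written" result that the older API reported.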
Alright! |
🔔 This is now entering its final comment period, as per the review above. 🔔 psst @alexcrichton, I wasn't able to add the |
The final comment period is now complete. |
Library stabilizations/deprecations for 1.15 release
Stabilized:
- `std::iter::Iterator::{min_by, max_by}`
- `std::os::*::fs::FileExt`
- `std::sync::atomic::Atomic*::{get_mut, into_inner}`
- `std::vec::IntoIter::{as_slice, as_mut_slice}`
- `std::sync::mpsc::Receiver::try_iter`
- `std::os::unix::process::CommandExt::before_exec`
- `std::rc::Rc::{strong_count, weak_count}`
- `std::sync::Arc::{strong_count, weak_count}`
- `std::char::{encode_utf8, encode_utf16}`
- `std::cell::Ref::clone`
- `std::io::Take::into_inner`
Deprecated:
- `std::rc::Rc::{would_unwrap, is_unique}`
- `std::cell::RefCell::borrow_state`
Closes #23755 Closes #27733 Closes #27746 Closes #27784 Closes #28356 Closes #31398 Closes #34931 Closes #35601 Closes #35603 Closes #35918 Closes #36105
This is a tracking issue for the unstable `unicode` feature and the `char::encode_utf{8,16}` methods. The interfaces here are a little wonky but are done for performance. It's not clear whether these need to be exported or not or if there's a better method to do so through iterators.