-
Notifications
You must be signed in to change notification settings - Fork 12.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change encode_utf{8,16}() to write to a buffer and panic if it's too small #36377
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information. |
dst[0] = (code >> 18 & 0x07) as u8 | TAG_FOUR_B; | ||
dst[1] = (code >> 12 & 0x3F) as u8 | TAG_CONT; | ||
dst[2] = (code >> 6 & 0x3F) as u8 | TAG_CONT; | ||
dst[3] = (code & 0x3F) as u8 | TAG_CONT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code unnecessarily does 4 bounds checks, despite you already knowing that destination slice is big enough. You should be using get_unchecked_mut
instead of indexing here.
Also in order to duplicate less code between this and char::utf8_len()
, you could do this:
let bytes = char.utf8_len();
if dst.len() < bytes { /* panic */ }
match bytes {
1 => { /* encode 1 char */ }
2 => { /* encode 2 chars */ }
...
_ => {}
}
bytes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's that bad at eliding bound checks‽ I'll do that then.
I considered rewriting it to use len_utf8()
, but was worried about performance of calculating length, then checking if the length is <constant>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be able to figure out the checks are not necessary in most cases, but LLVM is known to trip up occasionally (also will never try to optimise in debug mode).
I considered rewriting it to use len_utf8(), but was worried about performance of calculating length, then checking if the length is .
You’re doing exactly the same computation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm no expert on this and maybe it's optimized out, but this is what I think happens:
if the codepoint needs one byte:
now:
is codepoint < MAX_ONE_B?
is the buffer length not zero?
with len_utf8()
:
is codepoint < MAX_ONE_B?
set len to 1
is len >= the buffer length?
is len 1?
If the codepoint needs two bytes:
now:
is codepoint < MAX_ONE_B?
is codepoint < MAX_TWO_B?
is the buffer length >= 2?
with len_utf8()
:
is codepoint < MAX_ONE_B?
is codepoint < MAX_TWO_B?
set len to 2
is len >= the buffer length?
is len 1?
is len 2?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was more concerned about it being not DRY and expected both versions be pretty much the same performance wise. Did some benchmarking which prove that the version I proposed performs worse, so its fine to keep the current version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it's ugly,
Thanks for creating benchmarks. I need to learn that, they beat any assumption :)
Doc examples are tests, the compiler runs them as part of the test suite :) |
I know that, but I have seen few examples that showcase panicking, so I was wondering if it should be removed. The reason to panic is after all that it would rarely happen that the buffer is to short. So is there any point to have an example for something that rarely happens? I forgot to mention that I removed the clause that the buffer will be left unmodified if it's not large enough, to not prevent any potential optimization. Since it will panic I doubt anybody who notices it's changed will care, |
reason = "this function should not be exposed publicly", | ||
issue = "0")] | ||
#[doc(hidden)] | ||
pub fn encode_utf8_raw(code: u32, dst: &mut [u8]) -> usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this function stay private and not exported from this module?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's called from src/libstd/sys/common/wtf8.rs, but that can be changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah it's ok I think to leave as-is today to leverage the same implementation.
Thanks for the PR @tormol! The implementation looks pretty good to me, although I've had an idea while reading this which I think would allow us to improve ergonomics greatly as well in lots of the various call sites. Can you also be sure to tag these commits as |
Discussed at libs triage today, this seemed good but the conclusion was that we may wish to instead stabilize: fn encode_utf8<'a>(&self, buf: &'a mut [u8]) -> &'a mut str;
fn encode_utf16<'a>(&self, buf: &'a mut [u16]) -> &'a mut [u16]; @tormol would you be up for implementing that change? |
Yes, but I might not have time to do it this week. Returning a str or slice looks like a good idea. |
Done |
Apparently |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great, thanks @tormol!
ch.encode_utf8(slice); | ||
self.vec.set_len(cur_len + ch_len); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be the same as it was before, basically just falling back to extend_from_slice
@@ -1131,10 +1147,12 @@ impl String { | |||
let len = self.len(); | |||
assert!(idx <= len); | |||
assert!(self.is_char_boundary(idx)); | |||
let bits = ch.encode_utf8(); | |||
self.vec.reserve(4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can avoid the reserve
and intermediate buffers here w/ code similar to before and temporary buffers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that the .reserve()
is unnecessary, but what is the difference between an intermediate buffer and a temporary buffer?
let bytes = c.encode_utf8(); | ||
let s = ::std::str::from_utf8(bytes.as_slice()).unwrap(); | ||
let b = c.encode_utf8(&mut bytes).as_bytes(); | ||
let s = ::std::str::from_utf8(b).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can avoid the from_utf8
here as encode_utf8
returns a string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I kept the from_utf8()
because it also tests encode_utf8()
, but I'll move it to the right place.
let bytes = c.encode_utf8(); | ||
let s = ::std::str::from_utf8(bytes.as_slice()).unwrap(); | ||
let b = c.encode_utf8(&mut bytes).as_bytes(); | ||
let s = ::std::str::from_utf8(b).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(same as above about avoiding from_utf8
)
reason = "this function should not be exposed publicly", | ||
issue = "0")] | ||
#[doc(hidden)] | ||
pub fn encode_utf8_raw(code: u32, dst: &mut [u8]) -> &mut str { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could these functions be left private/removed like they were before?
let bytes = unsafe { | ||
char::from_u32_unchecked(code_point.value).encode_utf8() | ||
}; | ||
self.bytes.extend_from_slice(bytes.as_slice()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can stay roughly the same as before, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a diff comment two weeks ago you said it was OK, but yes.
…nd return a slice They panic if the buffer is too small.
Done |
Change encode_utf{8,16}() to write to a buffer and panic if it's too small cc #27784 Should the "A buffer that's too small" examples be removed and replaced by tests?
cc #27784
Should the "A buffer that's too small" examples be removed and replaced by tests?