Change encode_utf{8,16}() to write to a buffer and panic if it's too small #36377

tormol · 2016-09-10T13:43:23Z

Should the "A buffer that's too small" examples be removed and replaced by tests?

rust-highfive · 2016-09-10T13:43:42Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

nagisa · 2016-09-10T15:49:51Z

src/libcore/char.rs

+        dst[0] = (code >> 18 & 0x07) as u8 | TAG_FOUR_B;
+        dst[1] = (code >> 12 & 0x3F) as u8 | TAG_CONT;
+        dst[2] = (code >>  6 & 0x3F) as u8 | TAG_CONT;
+        dst[3] = (code & 0x3F) as u8 | TAG_CONT;


This code unnecessarily does 4 bounds checks, despite you already knowing that destination slice is big enough. You should be using get_unchecked_mut instead of indexing here.

Also in order to duplicate less code between this and char::utf8_len(), you could do this:

let bytes = char.utf8_len(); if dst.len() < bytes { /* panic */ } match bytes { 1 => { /* encode 1 char */ } 2 => { /* encode 2 chars */ } ... _ => {} } bytes

It's that bad at eliding bound checks‽ I'll do that then.

I considered rewriting it to use len_utf8(), but was worried about performance of calculating length, then checking if the length is <constant>.

It may be able to figure out the checks are not necessary in most cases, but LLVM is known to trip up occasionally (also will never try to optimise in debug mode).

I considered rewriting it to use len_utf8(), but was worried about performance of calculating length, then checking if the length is .

You’re doing exactly the same computation here.

I'm no expert on this and maybe it's optimized out, but this is what I think happens:

if the codepoint needs one byte:

now:

is codepoint < MAX_ONE_B? is the buffer length not zero?

with len_utf8():

is codepoint < MAX_ONE_B? set len to 1 is len >= the buffer length? is len 1?

If the codepoint needs two bytes:

now:

is codepoint < MAX_ONE_B? is codepoint < MAX_TWO_B? is the buffer length >= 2?

with len_utf8():

is codepoint < MAX_ONE_B? is codepoint < MAX_TWO_B? set len to 2 is len >= the buffer length? is len 1? is len 2?

I was more concerned about it being not DRY and expected both versions be pretty much the same performance wise. Did some benchmarking which prove that the version I proposed performs worse, so its fine to keep the current version.

I agree that it's ugly,

Thanks for creating benchmarks. I need to learn that, they beat any assumption :)

steveklabnik · 2016-09-10T17:03:00Z

Should the "A buffer that's too small" examples be removed and replaced by tests?

Doc examples are tests, the compiler runs them as part of the test suite :)

tormol · 2016-09-10T19:16:19Z

I know that, but I have seen few examples that showcase panicking, so I was wondering if it should be removed. The reason to panic is after all that it would rarely happen that the buffer is to short. So is there any point to have an example for something that rarely happens?

I forgot to mention that I removed the clause that the buffer will be left unmodified if it's not large enough, to not prevent any potential optimization. Since it will panic I doubt anybody who notices it's changed will care,

alexcrichton · 2016-09-12T17:11:19Z

src/libcore/char.rs

+           reason = "this function should not be exposed publicly",
+           issue = "0")]
+#[doc(hidden)]
+pub fn encode_utf8_raw(code: u32, dst: &mut [u8]) -> usize {


Can this function stay private and not exported from this module?

It's called from src/libstd/sys/common/wtf8.rs, but that can be changed.

Ah yeah it's ok I think to leave as-is today to leverage the same implementation.

alexcrichton · 2016-09-12T17:18:20Z

Thanks for the PR @tormol! The implementation looks pretty good to me, although I've had an idea while reading this which I think would allow us to improve ergonomics greatly as well in lots of the various call sites.

Can you also be sure to tag these commits as [breaking-change] as it'll be a breaking change for nightly users? (I don't see a way to avoid this). Finally, it's ok to squash the commits down into one if you'd like as well!

alexcrichton · 2016-09-12T22:29:16Z

Discussed at libs triage today, this seemed good but the conclusion was that we may wish to instead stabilize:

fn encode_utf8<'a>(&self, buf: &'a mut [u8]) -> &'a mut str;
fn encode_utf16<'a>(&self, buf: &'a mut [u16]) -> &'a mut [u16];

@tormol would you be up for implementing that change?

tormol · 2016-09-13T05:32:01Z

Yes, but I might not have time to do it this week.

Returning a str or slice looks like a good idea.

tormol · 2016-09-22T17:29:08Z

Done

tormol · 2016-09-25T21:07:15Z

Apparently check-stage1-std doesn't test core.

alexcrichton

Looking great, thanks @tormol!

alexcrichton · 2016-09-26T23:44:59Z

src/libcollections/string.rs

+                    ch.encode_utf8(slice);
+                    self.vec.set_len(cur_len + ch_len);
+                }
+            }


I think this can be the same as it was before, basically just falling back to extend_from_slice

alexcrichton · 2016-09-26T23:45:30Z

src/libcollections/string.rs

@@ -1131,10 +1147,12 @@ impl String {
        let len = self.len();
        assert!(idx <= len);
        assert!(self.is_char_boundary(idx));
-        let bits = ch.encode_utf8();
+        self.vec.reserve(4);


I think you can avoid the reserve and intermediate buffers here w/ code similar to before and temporary buffers

I see that the .reserve() is unnecessary, but what is the difference between an intermediate buffer and a temporary buffer?

alexcrichton · 2016-09-26T23:45:46Z

src/libcollectionstest/str.rs

-        let bytes = c.encode_utf8();
-        let s = ::std::str::from_utf8(bytes.as_slice()).unwrap();
+        let b = c.encode_utf8(&mut bytes).as_bytes();
+        let s = ::std::str::from_utf8(b).unwrap();


You can avoid the from_utf8 here as encode_utf8 returns a string

I think I kept the from_utf8() because it also tests encode_utf8(), but I'll move it to the right place.

alexcrichton · 2016-09-26T23:45:56Z

src/libcollectionstest/str.rs

-        let bytes = c.encode_utf8();
-        let s = ::std::str::from_utf8(bytes.as_slice()).unwrap();
+        let b = c.encode_utf8(&mut bytes).as_bytes();
+        let s = ::std::str::from_utf8(b).unwrap();


(same as above about avoiding from_utf8)

alexcrichton · 2016-09-26T23:46:16Z

src/libcore/char.rs

+           reason = "this function should not be exposed publicly",
+           issue = "0")]
+#[doc(hidden)]
+pub fn encode_utf8_raw(code: u32, dst: &mut [u8]) -> &mut str {


Could these functions be left private/removed like they were before?

alexcrichton · 2016-09-26T23:46:45Z

src/libstd/sys/common/wtf8.rs

-        let bytes = unsafe {
-            char::from_u32_unchecked(code_point.value).encode_utf8()
-        };
-        self.bytes.extend_from_slice(bytes.as_slice());


This can stay roughly the same as before, right?

In a diff comment two weeks ago you said it was OK, but yes.

…nd return a slice They panic if the buffer is too small.

tormol · 2016-09-29T07:47:02Z

Done

alexcrichton · 2016-09-29T16:17:05Z

@bors: r+ 13a2dd9

Thanks!

bors · 2016-09-29T18:20:03Z

⌛ Testing commit 13a2dd9 with merge 289f3a4...

Change encode_utf{8,16}() to write to a buffer and panic if it's too small cc #27784 Should the "A buffer that's too small" examples be removed and replaced by tests?

bors · 2016-09-29T21:32:04Z

rust-highfive assigned alexcrichton Sep 10, 2016

nagisa reviewed Sep 10, 2016
View reviewed changes

alexcrichton reviewed Sep 12, 2016
View reviewed changes

alexcrichton mentioned this pull request Sep 12, 2016

Tracking issue for char encoding methods #27784

Closed

alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Sep 12, 2016

tormol force-pushed the encode_utf branch from ff4872c to defc258 Compare September 22, 2016 17:27

tormol force-pushed the encode_utf branch from defc258 to cf1bf00 Compare September 25, 2016 12:54

alexcrichton reviewed Sep 26, 2016

View reviewed changes

[breaking-change] std: change encode_utf{8,16}() to take a buffer a…

13a2dd9

…nd return a slice They panic if the buffer is too small.

tormol force-pushed the encode_utf branch from cf1bf00 to 13a2dd9 Compare September 29, 2016 05:22

bors merged commit 13a2dd9 into rust-lang:master Sep 29, 2016

bors mentioned this pull request Sep 29, 2016

Rename rustc_unicode to std_unicode #36806

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Change encode_utf{8,16}() to write to a buffer and panic if it's too small #36377

Change encode_utf{8,16}() to write to a buffer and panic if it's too small #36377

tormol commented Sep 10, 2016

rust-highfive commented Sep 10, 2016

nagisa Sep 10, 2016

tormol Sep 10, 2016

nagisa Sep 10, 2016 •

edited

Loading

tormol Sep 10, 2016

nagisa Sep 10, 2016

tormol Sep 13, 2016

steveklabnik commented Sep 10, 2016

tormol commented Sep 10, 2016 •

edited

Loading

alexcrichton Sep 12, 2016

tormol Sep 13, 2016

alexcrichton Sep 13, 2016

alexcrichton commented Sep 12, 2016

alexcrichton commented Sep 12, 2016

tormol commented Sep 13, 2016

tormol commented Sep 22, 2016

tormol commented Sep 25, 2016

alexcrichton left a comment

alexcrichton Sep 26, 2016

alexcrichton Sep 26, 2016

tormol Sep 27, 2016

alexcrichton Sep 26, 2016

tormol Sep 27, 2016

alexcrichton Sep 26, 2016

alexcrichton Sep 26, 2016

alexcrichton Sep 26, 2016

tormol Sep 27, 2016

tormol commented Sep 29, 2016

alexcrichton commented Sep 29, 2016

bors commented Sep 29, 2016

bors commented Sep 29, 2016

Change encode_utf{8,16}() to write to a buffer and panic if it's too small #36377

Change encode_utf{8,16}() to write to a buffer and panic if it's too small #36377

Conversation

tormol commented Sep 10, 2016

rust-highfive commented Sep 10, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nagisa Sep 10, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

steveklabnik commented Sep 10, 2016

tormol commented Sep 10, 2016 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexcrichton commented Sep 12, 2016

alexcrichton commented Sep 12, 2016

tormol commented Sep 13, 2016

tormol commented Sep 22, 2016

tormol commented Sep 25, 2016

alexcrichton left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tormol commented Sep 29, 2016

alexcrichton commented Sep 29, 2016

bors commented Sep 29, 2016

bors commented Sep 29, 2016

nagisa Sep 10, 2016 •

edited

Loading

tormol commented Sep 10, 2016 •

edited

Loading