Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change encode_utf{8,16}() to write to a buffer and panic if it's too small #36377

Merged
merged 1 commit into from
Sep 29, 2016

Conversation

tormol
Copy link
Contributor

@tormol tormol commented Sep 10, 2016

cc #27784

Should the "A buffer that's too small" examples be removed and replaced by tests?

@rust-highfive
Copy link
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @alexcrichton (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

dst[0] = (code >> 18 & 0x07) as u8 | TAG_FOUR_B;
dst[1] = (code >> 12 & 0x3F) as u8 | TAG_CONT;
dst[2] = (code >> 6 & 0x3F) as u8 | TAG_CONT;
dst[3] = (code & 0x3F) as u8 | TAG_CONT;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code unnecessarily does 4 bounds checks, despite you already knowing that destination slice is big enough. You should be using get_unchecked_mut instead of indexing here.

Also in order to duplicate less code between this and char::utf8_len(), you could do this:

let bytes = char.utf8_len();
if dst.len() < bytes { /* panic */ }
match bytes {
    1 => { /* encode 1 char */ }
    2 => { /* encode 2 chars */ }
    ...
    _ => {}
}
bytes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's that bad at eliding bound checks‽ I'll do that then.

I considered rewriting it to use len_utf8(), but was worried about performance of calculating length, then checking if the length is <constant>.

Copy link
Member

@nagisa nagisa Sep 10, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be able to figure out the checks are not necessary in most cases, but LLVM is known to trip up occasionally (also will never try to optimise in debug mode).

I considered rewriting it to use len_utf8(), but was worried about performance of calculating length, then checking if the length is .

You’re doing exactly the same computation here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm no expert on this and maybe it's optimized out, but this is what I think happens:

if the codepoint needs one byte:

now:

is codepoint < MAX_ONE_B?
is the buffer length not zero?

with len_utf8():

is codepoint < MAX_ONE_B?
set len to 1
is len >= the buffer length?
is len 1?

If the codepoint needs two bytes:

now:

is codepoint < MAX_ONE_B?
is codepoint < MAX_TWO_B?
is the buffer length >= 2?

with len_utf8():

is codepoint < MAX_ONE_B?
is codepoint < MAX_TWO_B?
set len to 2
is len >= the buffer length?
is len 1?
is len 2?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was more concerned about it being not DRY and expected both versions be pretty much the same performance wise. Did some benchmarking which prove that the version I proposed performs worse, so its fine to keep the current version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's ugly,

Thanks for creating benchmarks. I need to learn that, they beat any assumption :)

@steveklabnik
Copy link
Member

Should the "A buffer that's too small" examples be removed and replaced by tests?

Doc examples are tests, the compiler runs them as part of the test suite :)

@tormol
Copy link
Contributor Author

tormol commented Sep 10, 2016

I know that, but I have seen few examples that showcase panicking, so I was wondering if it should be removed. The reason to panic is after all that it would rarely happen that the buffer is to short. So is there any point to have an example for something that rarely happens?

I forgot to mention that I removed the clause that the buffer will be left unmodified if it's not large enough, to not prevent any potential optimization. Since it will panic I doubt anybody who notices it's changed will care,

reason = "this function should not be exposed publicly",
issue = "0")]
#[doc(hidden)]
pub fn encode_utf8_raw(code: u32, dst: &mut [u8]) -> usize {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this function stay private and not exported from this module?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's called from src/libstd/sys/common/wtf8.rs, but that can be changed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah it's ok I think to leave as-is today to leverage the same implementation.

@alexcrichton
Copy link
Member

Thanks for the PR @tormol! The implementation looks pretty good to me, although I've had an idea while reading this which I think would allow us to improve ergonomics greatly as well in lots of the various call sites.

Can you also be sure to tag these commits as [breaking-change] as it'll be a breaking change for nightly users? (I don't see a way to avoid this). Finally, it's ok to squash the commits down into one if you'd like as well!

@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Sep 12, 2016
@alexcrichton
Copy link
Member

Discussed at libs triage today, this seemed good but the conclusion was that we may wish to instead stabilize:

fn encode_utf8<'a>(&self, buf: &'a mut [u8]) -> &'a mut str;
fn encode_utf16<'a>(&self, buf: &'a mut [u16]) -> &'a mut [u16];

@tormol would you be up for implementing that change?

@tormol
Copy link
Contributor Author

tormol commented Sep 13, 2016

Yes, but I might not have time to do it this week.

Returning a str or slice looks like a good idea.

@tormol
Copy link
Contributor Author

tormol commented Sep 22, 2016

Done

@tormol
Copy link
Contributor Author

tormol commented Sep 25, 2016

Apparently check-stage1-std doesn't test core.

Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great, thanks @tormol!

ch.encode_utf8(slice);
self.vec.set_len(cur_len + ch_len);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be the same as it was before, basically just falling back to extend_from_slice

@@ -1131,10 +1147,12 @@ impl String {
let len = self.len();
assert!(idx <= len);
assert!(self.is_char_boundary(idx));
let bits = ch.encode_utf8();
self.vec.reserve(4);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can avoid the reserve and intermediate buffers here w/ code similar to before and temporary buffers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that the .reserve() is unnecessary, but what is the difference between an intermediate buffer and a temporary buffer?

let bytes = c.encode_utf8();
let s = ::std::str::from_utf8(bytes.as_slice()).unwrap();
let b = c.encode_utf8(&mut bytes).as_bytes();
let s = ::std::str::from_utf8(b).unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can avoid the from_utf8 here as encode_utf8 returns a string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I kept the from_utf8() because it also tests encode_utf8(), but I'll move it to the right place.

let bytes = c.encode_utf8();
let s = ::std::str::from_utf8(bytes.as_slice()).unwrap();
let b = c.encode_utf8(&mut bytes).as_bytes();
let s = ::std::str::from_utf8(b).unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(same as above about avoiding from_utf8)

reason = "this function should not be exposed publicly",
issue = "0")]
#[doc(hidden)]
pub fn encode_utf8_raw(code: u32, dst: &mut [u8]) -> &mut str {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could these functions be left private/removed like they were before?

let bytes = unsafe {
char::from_u32_unchecked(code_point.value).encode_utf8()
};
self.bytes.extend_from_slice(bytes.as_slice());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can stay roughly the same as before, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a diff comment two weeks ago you said it was OK, but yes.

…nd return a slice

They panic if the buffer is too small.
@tormol
Copy link
Contributor Author

tormol commented Sep 29, 2016

Done

@alexcrichton
Copy link
Member

@bors: r+ 13a2dd9

Thanks!

@bors
Copy link
Contributor

bors commented Sep 29, 2016

⌛ Testing commit 13a2dd9 with merge 289f3a4...

bors added a commit that referenced this pull request Sep 29, 2016
Change encode_utf{8,16}() to write to a buffer and panic if it's too small

cc #27784

Should the "A buffer that's too small" examples be removed and replaced by tests?
@bors bors merged commit 13a2dd9 into rust-lang:master Sep 29, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants