std: Change `encode_utf{8,16}` to return iterators #32204

alexcrichton · 2016-03-11T22:41:33Z

Currently these have non-traditional APIs which take a buffer and report how
much was filled in, but they're not necessarily ergonomic to use. Returning an
iterator which also exposes an underlying slice shouldn't result in any
performance loss as it's just a lazy version of the same implementation, and
it's also much more ergonomic!

cc #27784
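To illustrate the shape of the API described above, here is a minimal sketch of an iterator that owns a small buffer and also exposes the encoded bytes as a slice. The name `EncodeUtf8Sketch` and its methods are hypothetical stand-ins, not the actual libcore type from this PR. The right-aligned buffer layout is an assumption modeled loosely on the `buf`/`pos` variables visible in the diff.

```rust
/// Hypothetical stand-in for an `EncodeUtf8`-style iterator (not the libcore type).
#[derive(Debug, Clone)]
pub struct EncodeUtf8Sketch {
    buf: [u8; 4],
    /// Index of the next byte to yield; the remaining bytes are `buf[pos..]`.
    pos: usize,
}

impl EncodeUtf8Sketch {
    /// Encodes `c` eagerly into a right-aligned 4-byte buffer.
    pub fn new(c: char) -> Self {
        let code = c as u32;
        let mut buf = [0u8; 4];
        let pos = if code < 0x80 {
            buf[3] = code as u8;
            3
        } else if code < 0x800 {
            buf[2] = 0xC0 | (code >> 6) as u8;
            buf[3] = 0x80 | (code & 0x3F) as u8;
            2
        } else if code < 0x1_0000 {
            buf[1] = 0xE0 | (code >> 12) as u8;
            buf[2] = 0x80 | ((code >> 6) & 0x3F) as u8;
            buf[3] = 0x80 | (code & 0x3F) as u8;
            1
        } else {
            buf[0] = 0xF0 | (code >> 18) as u8;
            buf[1] = 0x80 | ((code >> 12) & 0x3F) as u8;
            buf[2] = 0x80 | ((code >> 6) & 0x3F) as u8;
            buf[3] = 0x80 | (code & 0x3F) as u8;
            0
        };
        EncodeUtf8Sketch { buf, pos }
    }

    /// The bytes not yet yielded by the iterator, available without copying.
    pub fn as_slice(&self) -> &[u8] {
        &self.buf[self.pos..]
    }
}

impl Iterator for EncodeUtf8Sketch {
    type Item = u8;

    fn next(&mut self) -> Option<u8> {
        let byte = self.as_slice().first().copied()?;
        self.pos += 1;
        Some(byte)
    }

    fn size_hint(&self) -> (usize, Option<usize>) {
        let remaining = self.as_slice().len();
        (remaining, Some(remaining))
    }
}
```

A caller that wants bytes one at a time can write `for b in EncodeUtf8Sketch::new('é') { ... }`, while a caller that wants the whole encoding at once can pass `as_slice()` to something like `Vec::extend_from_slice`, with no destination buffer to size up front.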

alexcrichton · 2016-03-11T22:41:55Z

r? @aturon

cc @SimonSapin

ranma42 · 2016-03-11T22:50:27Z

src/libcore/char.rs

+    fn encode_utf8(self) -> EncodeUtf8 {
+        let code = self as u32;
+        let mut buf = [0; 4];
+        let pos = if code < MAX_ONE_B && !buf.is_empty() {


What is the purpose of !buf.is_empty()? Isn't it always true?

Oops! Just a holdover from the copy/pasted code below.

SimonSapin · 2016-03-12T09:35:52Z

src/libcore/char.rs

-    fn encode_utf8(self, dst: &mut [u8]) -> Option<usize>;
-    #[stable(feature = "core", since = "1.6.0")]
-    fn encode_utf16(self, dst: &mut [u16]) -> Option<usize>;
+    #[unstable(feature = "unicode", issue = "27784")]


Changing #[stable] to #[unstable]? Is that acceptable because the trait is #[unstable]?

Oh right yes thanks for the reminder! I meant to discuss this more at length on the PR comments but forgot.

So it looks like char::encode_utf8 is unstable (as expected), but we accidentally stabilized CharExt::encode_utf8 in libcore. This appears to be accidental (from what I can tell), so it seems that we're within our rights to switch back to unstable here (with libstd being the "source of truth")

That being said I wouldn't mind doing a crater run to see if it affects any crates.

> This appears to be accidental (from what I can tell), so it seems that we're within our rights to switch back to unstable here (with libstd being the "source of truth")

That reasoning seems problematic. If a stable channel compiler lets me use some feature/item because it is marked #[stable], I have no way to know whether it is "accidentally" stable. Going back to #[unstable] effectively removes an item as far as the stable channel is concerned, which RFC 1105 says is a major change.

I thought this specific case was OK since the CharExt trait is #[unstable]. If I can’t write use core::char::CharExt; I can’t use CharExt’s methods. But I don’t need that use, it’s in the core prelude.

Maybe these methods have been stable a short enough time that little enough code uses them that the breakage is acceptable. I’m in favor of this change, but “oh but we didn’t mean that stability promise we made” should not be sufficient.

@SimonSapin this was a mistake when stabilizing libcore. If it causes breakage we can back out and rethink our strategy, but the hypothesis is that this won't cause breakage as the standard library's copy has been unstable so anyone using stable has been avoiding it anyway.

Right, in general it's obviously not OK to revert a stabilization just because it was an accident. But the fact that the stable item was isolated to libcore and not stable in libstd meant that the impact of correction should be very low.

Hah, I see @alexcrichton beat me to basically the same point...

Looks like a separate issue was opened for this: #32460

aturon · 2016-03-21T17:49:00Z

Oof, sorry for the delay, this got buried in my inbox.

Nice improvements here. I don't think we need a crater run for the libcore stability changes -- I'm fine to land as-is.

@bors: r+

bors · 2016-03-21T17:49:02Z

📌 Commit e3581c8 has been approved by aturon

bors · 2016-03-22T04:04:13Z

⌛ Testing commit e3581c8 with merge 93ee81b...

bors · 2016-03-22T04:07:26Z

💔 Test failed - auto-linux-64-x-armhf


alexcrichton · 2016-03-22T17:39:33Z

@bors: r=aturon 48d5fe9

bors · 2016-03-22T20:15:23Z

⌛ Testing commit 48d5fe9 with merge e1b4dd0...


bors · 2016-03-22T21:44:33Z

💔 Test failed - auto-win-gnu-32-nopt-t

alexcrichton · 2016-03-22T21:49:26Z

@bors: retry


bors · 2016-03-22T22:55:33Z

alexcrichton · 2016-03-22T23:05:37Z

@bors: retry force clean

bors · 2016-03-22T23:05:39Z

⌛ Testing commit 48d5fe9 with merge 0dcc413...


bors · 2016-03-23T02:55:18Z

SimonSapin · 2016-03-24T09:21:36Z

src/libstd/sys/common/wtf8.rs

-            self.bytes.set_len(cur_len + used);
-        }
+        let bytes = unsafe {
+            char::from_u32_unchecked(code_point.value).encode_utf8()


Even if only "for a short time" this can create an invalid char (which can be a surrogate). Is that really OK? Doesn’t char provide range assertions to LLVM? Previously we had encode_utf8_raw taking u32 to avoid this.

Ah right. I'd rather not expose those methods though from libcore again and we can just inline them here for now if we need to.

Ok. But I’m asking, because I really don’t know: do we need to? Is this one of those things that appear to work when you try it but is technically undefined behavior and can cause bad things in future compiler versions?

I still have nightly CI for https://github.com/SimonSapin/rust-wtf8 which has similar code, failing now that encode_utf8_raw is gone, and I’m unsure what to do about it. And I imagine other libraries may be tempted to do something similar: I recall the regex crate doing some contortions internally to avoid creating a surrogate char when "iterating" over character ranges like [a-z].

I'm not actually sure either, it kinda depends what we tell LLVM, how we codegen this, and what operations happen.
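One way to sidestep the question entirely is to encode from the raw `u32` code point and never construct a `char` for a surrogate. The sketch below shows what such an inlined helper might look like. The names `encode_utf8_raw_sketch` and `push_code_point` are hypothetical and this is not the removed libcore function, only an illustration with a similar buffer-taking signature. It deliberately accepts surrogates (U+D800..=U+DFFF), which is what WTF-8 needs and what `char` forbids.

```rust
/// Hypothetical helper: writes the generalized UTF-8 encoding of `code` into
/// `dst` and returns the number of bytes written. Returns `None` if `code` is
/// above U+10FFFF or `dst` is too small. Surrogates are encoded like any
/// other three-byte code point, so no `char` value is ever created.
pub fn encode_utf8_raw_sketch(code: u32, dst: &mut [u8]) -> Option<usize> {
    let len = match code {
        0..=0x7F => 1,
        0x80..=0x7FF => 2,
        0x800..=0xFFFF => 3,
        0x1_0000..=0x10_FFFF => 4,
        _ => return None,
    };
    // Bail out early if the destination cannot hold the encoding.
    let out = dst.get_mut(..len)?;
    match len {
        1 => out[0] = code as u8,
        2 => {
            out[0] = 0xC0 | (code >> 6) as u8;
            out[1] = 0x80 | (code & 0x3F) as u8;
        }
        3 => {
            out[0] = 0xE0 | (code >> 12) as u8;
            out[1] = 0x80 | ((code >> 6) & 0x3F) as u8;
            out[2] = 0x80 | (code & 0x3F) as u8;
        }
        _ => {
            out[0] = 0xF0 | (code >> 18) as u8;
            out[1] = 0x80 | ((code >> 12) & 0x3F) as u8;
            out[2] = 0x80 | ((code >> 6) & 0x3F) as u8;
            out[3] = 0x80 | (code & 0x3F) as u8;
        }
    }
    Some(len)
}

/// Hypothetical usage: append a code point (possibly a surrogate) to a byte
/// vector, as a WTF-8 style buffer might. Returns `false` for invalid input.
pub fn push_code_point(bytes: &mut Vec<u8>, code: u32) -> bool {
    let mut buf = [0u8; 4];
    match encode_utf8_raw_sketch(code, &mut buf) {
        Some(n) => {
            bytes.extend_from_slice(&buf[..n]);
            true
        }
        None => false,
    }
}
```

The trade-off is a small amount of duplicated encoding logic outside libcore, in exchange for not depending on how an out-of-range `char` happens to be treated by the compiler.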

alexcrichton assigned aturon Mar 11, 2016

ranma42 reviewed Mar 11, 2016

alexcrichton force-pushed the redesign-char-encoding-types branch from 2b3c5ac to e3581c8 Compare March 11, 2016 23:01

SimonSapin reviewed Mar 12, 2016

alexcrichton force-pushed the redesign-char-encoding-types branch from e3581c8 to 48d5fe9 Compare March 22, 2016 17:25

bors mentioned this pull request Mar 23, 2016

move const_eval and check_match out of librustc into their own crate #32259

Merged

bors merged commit 48d5fe9 into rust-lang:master Mar 23, 2016

bors mentioned this pull request Mar 23, 2016

convert 99.9% of try!s to ?s #32390

Merged

thepowersgang mentioned this pull request Mar 24, 2016

Breaking Change - encode_utf8 destabilised #32460

Closed

SimonSapin reviewed Mar 24, 2016

alexcrichton deleted the redesign-char-encoding-types branch March 27, 2016 17:49

nagisa mentioned this pull request Jul 28, 2016

EncodeUtf8 iterator implementation results in bad code #35099

Closed

SimonSapin mentioned this pull request May 30, 2020

OsString::from_wide (in Windows OsStringExt) is unsound #72760

Closed
