Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: introduce split_at(mid: usize) on str #1123

Merged
merged 2 commits into from
Jun 9, 2015

Conversation

bluss
Copy link
Member

@bluss bluss commented May 17, 2015

Introduce the method split_at(&self, mid: usize) -> (&str, &str) on str,
to divide a slice into two, just like we can with [T].

Rendered version

Adding split_at is a measure to provide a method from [T] in a version that
makes sense for str.

Once used to [T], users might even expect that split_at is present on str.

/// Divide one string slice into two at an index.
///
/// The index `mid` is a byte offset from the start of the string
/// that must be on a character boundary.
///
/// Return slices `&self[..mid]` and `&self[mid..]`.
///
/// # Panics
///
/// Panics if `mid` is beyond the last character of the string,
/// or if it is not on a character boundary.
///
/// # Examples
/// ```
/// let s = "Löwe 老虎 Léopard";
/// let first_space = s.find(' ').unwrap_or(s.len());
/// let (a, b) = s.split_at(first_space);
///
/// assert_eq!(a, "Löwe");
/// assert_eq!(b, " 老虎 Léopard");
/// ```
fn split_at(&self, mid: usize) -> (&str, &str) {
    (&self[..mid], &self[mid..])
}

@bluss
Copy link
Member Author

bluss commented May 17, 2015

cc @Manishearth you requested something like this in rust-lang/rust#18063.

@Manishearth
Copy link
Member

👍 😄

[Edit: I'm wrong here]
The current implementation could be improved to only have a single pass through the string (this iterates twice because each indexing operation is one loop)

@hanna-kruppe
Copy link

@Manishearth It doesn't loop at all, it uses byte indices and therefore each slice operation is constant time (just check that the index is in bounds and the byte is the start of a UTF-8 code point, then adjust the start/length of the slice).

@Manishearth
Copy link
Member

[Edit: I'm wrong here]
Strings are utf8, byte indexing would make this unsafe.

String indexing indexes by chars IIRC, and for that you need to loop through because chars have variable length.

@hanna-kruppe
Copy link

String indexing is by bytes and checks that the index lies on a codepoint boundary (which is cheap, just check that the byte is less than ox80). This is consistent with almost all other methods, including char_at.

@Manishearth
Copy link
Member

Ah, okay.

@bluss
Copy link
Member Author

bluss commented May 17, 2015

The method has only a single index that needs to be bounds checked and checked for character boundary (mid), so if the redundant checks are not elided already, split_at can be optimized to fix that. The implementation in the RFC is of course only for illustration.

@alexcrichton
Copy link
Member

This sounds like a great idea to me, thanks for writing this up @bluss!

@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label May 18, 2015
@retep998
Copy link
Member

Since this doesn't return Option, that means this code would panic! if you tried to use an index in the middle of a utf-8 sequence. Make sure you add that to the drawbacks.

@Gankra
Copy link
Contributor

Gankra commented May 20, 2015

I am of the belief that now that the trains are running, this kind of change no longer merits requires any kind of RFC. Am I mistaken?

@seanmonstar
Copy link
Contributor

I would expect adding any new public API to require an RFC. Twiddling the internals is just implementation that's hidden, but something that will eventually be marked #[stable] needs an RFC.

@bluss
Copy link
Member Author

bluss commented May 20, 2015

@retep998 Done!

@seanmonstar This was what I thought was the rule, and I agree with you.

@pythonesque
Copy link
Contributor

Is there any chance split_at for strings could be modified to return an Option instead? It's actually pretty common for me to want a non-panicking version of this for slices and I definitely have a use case for it in parsers. I realize it would be inconsistent with the [T] version, but I've never really understood why that one doesn't return an Option.

@bluss
Copy link
Member Author

bluss commented May 20, 2015

The panicking logic is the same as for slicing. You should only ever use it with a byte offset you got from somewhere, so you know it's always correct (for example from find like in the example). So we panic because it's a programmer's error (contract violation) to specify an illegal offset.

I absolutely don't want to introduce an inconsistency. A part of the motivation here is that you expect .split_at to be available as it is on slices. If we go the option route, it needs a new name.

Contract violations @ Error conventions RFC

@pythonesque
Copy link
Contributor

You should only ever use it with a byte offset you got from somewhere, so you know it's always correct (for example from find like in the example).

I don't think it's necessarily a contract violation (because strings and vectors can grow and shrink) but yeah, maybe I just want an additional checked version (but I thought our new policy on this was that we shouldn't have both a panicking and non-panicking version of an API?).

@bluss
Copy link
Member Author

bluss commented May 20, 2015

@pythonesque

Since we don't have any optional-value slicing, I think we should add .get_slice<R: RangeArgument>(range: R) -> Option<&str> first.

Additionally or alternatively, we should stabilize is_char_boundary so that it is possible to check indices for slicing and split_at up front. I think that's in line with our guidelines. Maybe the function could be renamed to something like is_valid_index ?

bluss pushed a commit to bluss/rust that referenced this pull request May 27, 2015
Implement RFC rust-lang/rfcs#1123

Add str method str::split_at(mid: usize) -> (&str, &str).
@alexcrichton alexcrichton added the final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. label Jun 2, 2015
@alexcrichton
Copy link
Member

This RFC is now entering the final week-long final comment period.

The library subteam may not require an RFC for a change such as this in the future, but we have yet to concretely decided on the guidelines for what needs an RFC (for discussion see this thread). While we develop this policy, however, we're going to go ahead with this RFC.

@Gankra
Copy link
Contributor

Gankra commented Jun 2, 2015

This seems fine to add. No strong feelings, though.

@huonw
Copy link
Member

huonw commented Jun 3, 2015

My only comment is that I question the unicode hygiene of this function (i.e. naive calls to this probably won't be unicode-correct?). However, this is consistent with [T], no worse than things we already expose, definitely clearer than double-..-slicing and anyway we generally don't want to stop people doing things just because there's the risk of misuse. On balance, I'm happy to see this land.

@hanna-kruppe
Copy link

What do you mean by unicode-correct? Like all other slicing and indexing, it respects code point boundaries, i.e., it will rather panic than cutting a code point in half. It won't try to make you deal with grapheme clusters or anything, but none of the existing methods do that (except for graphemes and grapheme_indices, obviously).

@huonw
Copy link
Member

huonw commented Jun 3, 2015

I mean that it is very easy to split a string to give nonsense output, e.g. at the most basic level, splitting the two-codepoint version of ä between a and the combining character, and, it's not immediately obvious to me if performing this operation is useful/can be correct for arbitrary unicode strings. (Don't get me wrong, it can definitely be correct and useful for restricted subsets, e.g. ASCII-only strings, but an arbitrary chunk of text can be very arbitrary.)

@hanna-kruppe
Copy link

But all other string manipulation1 already has that problem. It's a long-standing policy to not try and prevent that sort of error. In addition, it's not like this API can't be used "properly" --- you just have to select the index at which you split properly. Which, again, is exactly the same situations as with the rest of the string API.

1With the aforementioned exception, which BTW moved out of libstd now.

@huonw
Copy link
Member

huonw commented Jun 3, 2015

Yes, I covered those aspects in my comment. It's why I'm not against this landing. :)

@Gankra Gankra self-assigned this Jun 5, 2015
@alexcrichton
Copy link
Member

This consensus of the libs subteam is that we should merge this RFC, so I'm going to merge it. Thanks again for the RFC @bluss!

@alexcrichton alexcrichton merged commit 8f8f7ea into rust-lang:master Jun 9, 2015
@bluss bluss deleted the str-split-at branch June 10, 2015 00:35
bluss pushed a commit to bluss/rust that referenced this pull request Jun 10, 2015
Implement RFC rust-lang/rfcs#1123

Add str method str::split_at(mid: usize) -> (&str, &str).
bors added a commit to rust-lang/rust that referenced this pull request Jun 11, 2015
Implement RFC rust-lang/rfcs#1123

Add str method str::split_at(mid: usize) -> (&str, &str).

Also a minor cleanup in the collections::str module. Remove redundant slicing of self.
@Centril Centril added the A-slice Slice related proposals & ideas label Nov 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-slice Slice related proposals & ideas final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants