impl next_back() for CharSplitIterator, and methods .rsplit_iter(), .rsplitn_iter() #8737

bluss · 2013-08-24T14:24:11Z

Make CharSplitIterator double-ended, which is simple because the operation is symmetric once the split-N feature is factored out into its own adaptor.

.rsplitn_iter() allows splitting N times from the back of a string, so it is a completely new feature. With the double-ended impl, .split_iter(), .line_iter(), .word_iter() all allow picking off elements from either end.
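As a rough illustration of picking elements off either end, here is a minimal sketch. It assumes the present-day standard library names `split`, `rsplitn` and `lines` as stand-ins for `.split_iter()`, `.rsplitn_iter()` and `.line_iter()`; the helper names `ends` and `split_last` and the sample strings are made up for this example.

```rust
/// Take one element from the front and one from the back of the same split.
fn ends(s: &str) -> (Option<&str>, Option<&str>) {
    let mut it = s.split(' ');
    let first = it.next();
    let last = it.next_back();
    (first, last)
}

/// Split at most once, starting from the back of the string.
fn split_last(path: &str) -> Vec<&str> {
    path.rsplitn(2, '/').collect()
}

fn main() {
    assert_eq!(ends("a b c d"), (Some("a"), Some("d")));
    assert_eq!(split_last("usr/local/lib/rust"), ["rust", "usr/local/lib"]);
    // Derived iterators are double-ended as well.
    assert_eq!("one\ntwo\nthree".lines().next_back(), Some("three"));
}
```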

split_options_iter is removed as part of factoring apart the split and split-N iterators; split_terminator_iter takes its place.
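The difference between a plain split and a terminator split can be sketched as follows. This assumes that `split_terminator_iter` behaves like today's `split_terminator`, which is my reading rather than something stated in the PR; `with_split` and `with_terminator` are illustrative names.

```rust
/// Plain split: a trailing separator yields a trailing empty piece.
fn with_split(s: &str) -> Vec<&str> {
    s.split('\n').collect()
}

/// Terminator split: the separator ends each piece, so no trailing empty piece.
fn with_terminator(s: &str) -> Vec<&str> {
    s.split_terminator('\n').collect()
}

fn main() {
    let text = "a\nb\n";
    assert_eq!(with_split(text), ["a", "b", ""]);
    assert_eq!(with_terminator(text), ["a", "b"]);
}
```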

Add benchmarks using #[bench] and tune CharSplitIterator a bit, following Huon Wilson's suggestions.

Benchmarks 1-5 do the same split using different implementations of CharEq, all splitting an ascii string on ascii space. Benchmarks 6-7 split a unicode string on an ascii char.
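For readers unfamiliar with the idea of several separator implementations doing the same split, here is a hedged sketch using the separator kinds that present-day `str::split` accepts (a `char`, a closure, a plain `fn`, and a char slice). It does not reproduce the benchmarks themselves, and the `only_ascii` distinction from `CharEq` has no direct equivalent here; `is_space` and `split_all_ways` are hypothetical names.

```rust
/// A plain function used as a separator predicate.
fn is_space(c: char) -> bool {
    c == ' '
}

/// Perform the same split with four different kinds of separator.
fn split_all_ways(s: &str) -> [Vec<&str>; 4] {
    let seps = [' '];
    [
        s.split(' ').collect(),                 // a char
        s.split(|c: char| c == ' ').collect(),  // a closure
        s.split(is_space).collect(),            // a fn item
        s.split(&seps[..]).collect(),           // a slice of chars
    ]
}

fn main() {
    for pieces in split_all_ways("the quick brown fox").iter() {
        assert_eq!(*pieces, ["the", "quick", "brown", "fox"]);
    }
}
```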

Before this PR
test str::bench::split_iter_ascii ... bench: 166 ns/iter (+/- 2)
test str::bench::split_iter_closure ... bench: 113 ns/iter (+/- 1)
test str::bench::split_iter_extern_fn ... bench: 286 ns/iter (+/- 7)
test str::bench::split_iter_not_ascii ... bench: 114 ns/iter (+/- 4)
test str::bench::split_iter_slice ... bench: 220 ns/iter (+/- 12)
test str::bench::split_iter_unicode_ascii ... bench: 217 ns/iter (+/- 3)
test str::bench::split_iter_unicode_not_ascii ... bench: 248 ns/iter (+/- 3)

PR, first commit
test str::bench::split_iter_ascii ... bench: 331 ns/iter (+/- 9)
test str::bench::split_iter_closure ... bench: 114 ns/iter (+/- 2)
test str::bench::split_iter_extern_fn ... bench: 314 ns/iter (+/- 6)
test str::bench::split_iter_not_ascii ... bench: 132 ns/iter (+/- 1)
test str::bench::split_iter_slice ... bench: 157 ns/iter (+/- 3)
test str::bench::split_iter_unicode_ascii ... bench: 502 ns/iter (+/- 64)
test str::bench::split_iter_unicode_not_ascii ... bench: 250 ns/iter (+/- 3)

PR, final version
test str::bench::split_iter_ascii ... bench: 106 ns/iter (+/- 4)
test str::bench::split_iter_closure ... bench: 107 ns/iter (+/- 1)
test str::bench::split_iter_extern_fn ... bench: 267 ns/iter (+/- 6)
test str::bench::split_iter_not_ascii ... bench: 108 ns/iter (+/- 1)
test str::bench::split_iter_slice ... bench: 170 ns/iter (+/- 8)
test str::bench::split_iter_unicode_ascii ... bench: 128 ns/iter (+/- 5)
test str::bench::split_iter_unicode_not_ascii ... bench: 252 ns/iter (+/- 3)

There are several ways to deal with CharEq::only_ascii. It is a performance optimization, so with that in mind, we allow passing bogus chars (outside ASCII) to the CharEq as long as they don't match. We use a byte value check to make sure we never split on these (doing so would split substrings in the middle of an encoded char). (A more principled way would be to pass only the ASCII codepoints to the CharEq when it indicates only_ascii, but that undoes some of the performance optimization.)
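The byte value check can be sketched like this. It is a simplified stand-in, not the PR's code: `split_bytes` is a hypothetical function, and a closure plays the role of a `CharEq` that claims `only_ascii` but also matches a non-ASCII char.

```rust
/// Split `s` by scanning raw bytes. `matches` stands in for a separator
/// that claims to be ASCII-only (hypothetical substitute for `CharEq`).
fn split_bytes<'a>(s: &'a str, matches: impl Fn(char) -> bool) -> Vec<&'a str> {
    let mut pieces = Vec::new();
    let mut start = 0;
    for (idx, byte) in s.bytes().enumerate() {
        // The separator is asked about every byte, including bytes >= 128
        // that are only fragments of a multi-byte char. The range check
        // discards any such bogus match, so we only cut at ASCII bytes,
        // which are always char boundaries.
        if matches(byte as char) && byte < 128 {
            pieces.push(&s[start..idx]);
            start = idx + 1;
        }
    }
    pieces.push(&s[start..]);
    pieces
}

fn main() {
    // 'é' is encoded as the bytes 0xC3 0xA9, and 0xC3 as a char is 'Ã'.
    // This separator wrongly matches 'Ã' as well as the space.
    let lying = |c: char| c == ' ' || c == 'Ã';
    assert_eq!(split_bytes("café au lait", lying), ["café", "au", "lait"]);
}
```

Without the `byte < 128` guard, the same input would attempt a cut between the two bytes of `é`, which is exactly the invalid substring the description warns about.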

bluss · 2013-08-25T03:59:35Z

I'm still not happy with how this new code is expressed; it looks messy even though I picked the best alternative out of many trials. I can't use the previous ByteOffset(Enumerate<vec::VecIterator<'self, u8>>) because Enumerate is not double-ended. Suggestions for improvement welcome.

Add new methods `.rsplit_iter()` and `.rsplitn_iter()` for &str. Separate out CharSplitIterator and CharSplitNIterator. CharSplitIterator (`split_iter` and `rsplit_iter`) is made double-ended, while `splitn_iter` and `rsplitn_iter` (limited to N splits) are not, since these don't have the same symmetry. With CharSplitIterator being double-ended, derived iterators like `line_iter` and `word_iter` are too.
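The missing symmetry of the limited split can be seen in a small sketch. It assumes today's `splitn`/`rsplitn`, where `n` is the maximum number of returned pieces (the 2013 methods are described as limiting the number of splits, so the counts may be off by one relative to this PR); `front` and `back` are illustrative names.

```rust
/// Limited split from the front.
fn front(s: &str, n: usize) -> Vec<&str> {
    s.splitn(n, ',').collect()
}

/// Limited split from the back.
fn back(s: &str, n: usize) -> Vec<&str> {
    s.rsplitn(n, ',').collect()
}

fn main() {
    // The unsplit remainder lands at opposite ends, so the two results
    // are not simply reverses of each other.
    assert_eq!(front("a,b,c", 2), ["a", "b,c"]);
    assert_eq!(back("a,b,c", 2), ["c", "a,b"]);

    // The unlimited split is symmetric: reversing rsplit gives split.
    let fwd: Vec<&str> = "a,b,c".split(',').collect();
    let mut rev: Vec<&str> = "a,b,c".rsplit(',').collect();
    rev.reverse();
    assert_eq!(fwd, rev);
}
```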

bluss · 2013-08-25T06:57:59Z

that's a bit better.

huonw · 2013-08-25T12:25:02Z

src/libstd/str.rs

-                            })
-                        }
+                Left(ref mut it) => match it.next() {
+                    Some((idx, byte)) if byte < 128u8 && self.sep.matches(byte as char) =>


When I wrote the original split-external-iterator a while ago, having the extra byte < 128 check made it noticeably slower: I'd prefer the ascii-only CharSeps to have to make this check themselves if they need to, since it's entirely unnecessary when splitting on just '\n' or ' ' (for example). Of course, this would be comprehensively answered with some benches.

Also, what's the motivation for moving the "which type of iterator should I use" decision into next? It seems like it could easily be a significant performance regression (I guess LLVM handles it pretty well though).

My motivation for having the byte < 128 check is that CharEq is a pub trait, and a user impl of CharEq that claims only_ascii will still be asked about non-ascii chars. Further, if it lied about only_ascii and did match one of those, it would allow creating an invalid str with the iterator. Not sure how to deal with that: user error or safety hole?

Just because it felt and looked the best, allowing me to factor the slice code into one place. I did a lot of trial implementations without being happy with how any of them read. Happy for suggestions. I didn't benchmark this particular version, but I will do it!

I'll bench the difference for the byte < 128 check too.

Maybe you could also try moving the byte < 128 check to be after the .matches check. (I agree we need some check to maintain the utf8 guarantee.)

Implement Huon Wilson's suggestions (since the benchmarks agree!). Use `self.sep.matches(byte as char) && byte < 128u8` to match in the only_ascii case so that mistaken matches outside the ascii range can't create invalid substrings. Put the conditional on only_ascii outside the loop.
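The shape of that change can be sketched as below. This is a simplified, forward-only illustration under my own assumptions, not the code from the PR: `Splitter` and `split_with` are hypothetical names, a closure replaces `CharEq`, and `only_ascii` is a plain field. The branch on `only_ascii` is taken once per call rather than once per byte, and the separator is asked first, with the `b < 128` range check applied afterwards.

```rust
/// Hypothetical stand-in for the split iterator discussed above.
struct Splitter<'a, F: Fn(char) -> bool> {
    rest: Option<&'a str>,
    sep: F,
    only_ascii: bool,
}

impl<'a, F: Fn(char) -> bool> Iterator for Splitter<'a, F> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        let s = self.rest?;
        // Choose the scanning strategy once, outside the scanning loops.
        // Each branch yields (end of piece, start of the remainder).
        let found = if self.only_ascii {
            // Byte scan: ask the separator first, then reject bytes >= 128.
            s.bytes()
                .position(|b| (self.sep)(b as char) && b < 128)
                .map(|i| (i, i + 1))
        } else {
            // Char scan: decode each char and skip its full encoded width.
            s.char_indices()
                .find(|&(_, c)| (self.sep)(c))
                .map(|(i, c)| (i, i + c.len_utf8()))
        };
        match found {
            Some((end, resume)) => {
                self.rest = Some(&s[resume..]);
                Some(&s[..end])
            }
            None => {
                self.rest = None;
                Some(s)
            }
        }
    }
}

/// Convenience wrapper that collects all pieces.
fn split_with<'a>(s: &'a str, sep: impl Fn(char) -> bool, only_ascii: bool) -> Vec<&'a str> {
    Splitter { rest: Some(s), sep, only_ascii }.collect()
}

fn main() {
    // Both strategies agree when the separator really is ASCII.
    assert_eq!(split_with("α β γ", |c| c == ' ', true), ["α", "β", "γ"]);
    assert_eq!(split_with("α β γ", |c| c == ' ', false), ["α", "β", "γ"]);
    // A non-ASCII separator needs the char-based path.
    assert_eq!(split_with("aλb", |c| c == 'λ', false), ["a", "b"]);
}
```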

bluss · 2013-08-26T11:31:47Z

Fixed the variable name for the char_offset_iter loop, thanks!


huonw reviewed Aug 25, 2013

blake2-ppc added 2 commits August 26, 2013 11:48

std::str: bench tests for .split_iter()

413f868

bors closed this Aug 26, 2013

huonw Aug 25, 2013

bluss Aug 25, 2013

bluss Aug 25, 2013

huonw Aug 26, 2013
