Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

impl next_back() for CharSplitIterator, and methods .rsplit_iter(), .rsplitn_iter() #8737

Closed
wants to merge 3 commits into from
Closed

Conversation

bluss
Copy link
Member

@bluss bluss commented Aug 24, 2013

Make CharSplitIterator double-ended which is simple given that the operation is symmetric, once the split-N feature is factored out into its own adaptor.

.rsplitn_iter() allows splitting N times from the back of a string, so it is a completely new feature. With the double-ended impl, .split_iter(), .line_iter(), .word_iter() all allow picking off elements from either end.

split_options_iter is removed with the factoring of the split- and split-N- iterators, instead there is split_terminator_iter.


Add benchmarks using #[bench] and tune CharSplitIterator a bit after Huon Wilson's suggestions

Benchmarks 1-5 do the same split using different implementations of CharEq, all splitting an ascii string on ascii space. Benchmarks 6-7 split a unicode string on an ascii char.

Before this PR
test str::bench::split_iter_ascii ... bench: 166 ns/iter (+/- 2)
test str::bench::split_iter_closure ... bench: 113 ns/iter (+/- 1)
test str::bench::split_iter_extern_fn ... bench: 286 ns/iter (+/- 7)
test str::bench::split_iter_not_ascii ... bench: 114 ns/iter (+/- 4)
test str::bench::split_iter_slice ... bench: 220 ns/iter (+/- 12)
test str::bench::split_iter_unicode_ascii ... bench: 217 ns/iter (+/- 3)
test str::bench::split_iter_unicode_not_ascii ... bench: 248 ns/iter (+/- 3)

PR, first commit
test str::bench::split_iter_ascii ... bench: 331 ns/iter (+/- 9)
test str::bench::split_iter_closure ... bench: 114 ns/iter (+/- 2)
test str::bench::split_iter_extern_fn ... bench: 314 ns/iter (+/- 6)
test str::bench::split_iter_not_ascii ... bench: 132 ns/iter (+/- 1)
test str::bench::split_iter_slice ... bench: 157 ns/iter (+/- 3)
test str::bench::split_iter_unicode_ascii ... bench: 502 ns/iter (+/- 64)
test str::bench::split_iter_unicode_not_ascii ... bench: 250 ns/iter (+/- 3)

PR, final version
test str::bench::split_iter_ascii ... bench: 106 ns/iter (+/- 4)
test str::bench::split_iter_closure ... bench: 107 ns/iter (+/- 1)
test str::bench::split_iter_extern_fn ... bench: 267 ns/iter (+/- 6)
test str::bench::split_iter_not_ascii ... bench: 108 ns/iter (+/- 1)
test str::bench::split_iter_slice ... bench: 170 ns/iter (+/- 8)
test str::bench::split_iter_unicode_ascii ... bench: 128 ns/iter (+/- 5)
test str::bench::split_iter_unicode_not_ascii ... bench: 252 ns/iter (+/- 3)


There are several ways to deal with CharEq::only_ascii. It is a performance optimization, so with that in mind, we allow passing bogus char (outside ascii) as long as they don't match. We use a byte value check to make sure we don't split on these (would split substrings in the middle of encoded char). (A more principled way would be to only pass the ascii codepoints to the CharEq when it indicates only_ascii, but that undoes some of the performance optimization.)

@bluss
Copy link
Member Author

bluss commented Aug 25, 2013

I'm still not happy with the expression on this new code, it looks messy even if I tried to find the best alternative out of many trials. I can't use the previous ByteOffset(Enumerate<vec::VecIterator<'self, u8>>) because Enumerate is not double-ended. Suggestions for improvement welcome.

Add new methods `.rsplit_iter()` and `.rsplitn_iter()` for &str.

Separate out CharSplitIterator and CharSplitNIterator,
CharSplitIterator (`split_iter` and `rsplit_iter`) is made double-ended
while `splitn_iter` and `rsplitn_iter` (limited to N splits) are not,
since these don't have the same symmetry.

With CharSplitIterator being double ended, derived iterators like
`line_iter` and `word_iter` are too.
@bluss
Copy link
Member Author

bluss commented Aug 25, 2013

that's a bit better.

})
}
Left(ref mut it) => match it.next() {
Some((idx, byte)) if byte < 128u8 && self.sep.matches(byte as char) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I wrote the original split-external-iterator a while ago, having the extra byte < 128 check made it noticably slower: I'd prefer the ascii-only CharSeps to have to make this check themselves if they need to, since it's entirely unnecessary when splitting on just '\n' or ' ' (for example). Of course, this would be comprehensively answered with some benches.

Also, what's the motivation for moving the "which type of iterator should I use" decision into next? It seems like it could easily be a significant performance regression (I guess LLVM handles it pretty well though).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. My motivation for having the byte < 128 check is that CharEq is a pub trait, and a user impl of CharEq that claims only_ascii will still be asked about non-ascii char. Further, if it lied about only_ascii and did match one of those, it would allow creating an invalid str with the iterator. Not sure how to deal with that, user error or safety hole?
  2. Just because it felt and looked the best, allowing to factor the slice code into one place. I did a lot trial implementations without being happy with the expression of it.. happy for suggestions. I didn't benchmark this particular version, but I will do it!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll bench the difference for the byte < 128 check too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you could also try moving the byte < 128 check to be after the .matches check. (I agree we need some check to maintain the utf8 guarantee.)

blake2-ppc added 2 commits August 26, 2013 11:48
Implement Huon Wilson's suggestions (since the benchmarks agree!).

Use `self.sep.matches(byte as char) && byte < 128u8` to match in the
only_ascii case so that mistaken matches outside the ascii range can't
create invalid substrings.

Put the conditional on only_ascii outside the loop.
@bluss
Copy link
Member Author

bluss commented Aug 26, 2013

fixed the variable name for the char_offset_iter loop thanks!

bors added a commit that referenced this pull request Aug 26, 2013
Make CharSplitIterator double-ended which is simple given that the operation is symmetric, once the split-N feature is factored out into its own adaptor.

`.rsplitn_iter()` allows splitting `N` times from the back of a string, so it is a completely new feature. With the double-ended impl, `.split_iter()`, `.line_iter()`, `.word_iter()` all allow picking off elements from either end.

`split_options_iter` is removed with the factoring of the split- and split-N- iterators, instead there is `split_terminator_iter`.

---

Add benchmarks using `#[bench]` and tune CharSplitIterator a bit after Huon Wilson's suggestions

Benchmarks 1-5 do the same split using different implementations of `CharEq`, all splitting an ascii string on ascii space. Benchmarks 6-7 split a unicode string on an ascii char.

Before this PR
test str::bench::split_iter_ascii ... bench: 166 ns/iter (+/- 2)
test str::bench::split_iter_closure ... bench: 113 ns/iter (+/- 1)
test str::bench::split_iter_extern_fn ... bench: 286 ns/iter (+/- 7)
test str::bench::split_iter_not_ascii ... bench: 114 ns/iter (+/- 4)
test str::bench::split_iter_slice ... bench: 220 ns/iter (+/- 12)
test str::bench::split_iter_unicode_ascii ... bench: 217 ns/iter (+/- 3)
test str::bench::split_iter_unicode_not_ascii ... bench: 248 ns/iter (+/- 3)

PR, first commit
test str::bench::split_iter_ascii ... bench: 331 ns/iter (+/- 9)
test str::bench::split_iter_closure ... bench: 114 ns/iter (+/- 2)
test str::bench::split_iter_extern_fn ... bench: 314 ns/iter (+/- 6)
test str::bench::split_iter_not_ascii ... bench: 132 ns/iter (+/- 1)
test str::bench::split_iter_slice ... bench: 157 ns/iter (+/- 3)
test str::bench::split_iter_unicode_ascii ... bench: 502 ns/iter (+/- 64)
test str::bench::split_iter_unicode_not_ascii ... bench: 250 ns/iter (+/- 3)

PR, final version
test str::bench::split_iter_ascii ... bench: 106 ns/iter (+/- 4)
test str::bench::split_iter_closure ... bench: 107 ns/iter (+/- 1)
test str::bench::split_iter_extern_fn ... bench: 267 ns/iter (+/- 6)
test str::bench::split_iter_not_ascii ... bench: 108 ns/iter (+/- 1)
test str::bench::split_iter_slice ... bench: 170 ns/iter (+/- 8)
test str::bench::split_iter_unicode_ascii ... bench: 128 ns/iter (+/- 5)
test str::bench::split_iter_unicode_not_ascii ... bench: 252 ns/iter (+/- 3)

---

There are several ways to deal with `CharEq::only_ascii`. It is a performance optimization, so with that in mind, we allow passing bogus char (outside ascii) as long as they don't match. We use a byte value check to make sure we don't split on these (would split substrings in the middle of encoded char).  (A more principled way would be to only pass the ascii codepoints to the CharEq when it indicates only_ascii, but that undoes some of the performance optimization.)
@bors bors closed this Aug 26, 2013
flip1995 pushed a commit to flip1995/rust that referenced this pull request May 5, 2022
…fate

Extend `extra_unused_lifetimes` to handle impl lifetimes

Fixes rust-lang#6437 (cc: `@carols10cents)`

changelog: fix rust-lang#6437
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants