-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
impl next_back() for CharSplitIterator, and methods .rsplit_iter(), .rsplitn_iter() #8737
Conversation
I'm still not happy with the expression on this new code, it looks messy even if I tried to find the best alternative out of many trials. I can't use the previous |
Add new methods `.rsplit_iter()` and `.rsplitn_iter()` for &str. Separate out CharSplitIterator and CharSplitNIterator, CharSplitIterator (`split_iter` and `rsplit_iter`) is made double-ended while `splitn_iter` and `rsplitn_iter` (limited to N splits) are not, since these don't have the same symmetry. With CharSplitIterator being double ended, derived iterators like `line_iter` and `word_iter` are too.
that's a bit better. |
}) | ||
} | ||
Left(ref mut it) => match it.next() { | ||
Some((idx, byte)) if byte < 128u8 && self.sep.matches(byte as char) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I wrote the original split-external-iterator a while ago, having the extra byte < 128
check made it noticably slower: I'd prefer the ascii-only CharSep
s to have to make this check themselves if they need to, since it's entirely unnecessary when splitting on just '\n'
or ' '
(for example). Of course, this would be comprehensively answered with some benches.
Also, what's the motivation for moving the "which type of iterator should I use" decision into next
? It seems like it could easily be a significant performance regression (I guess LLVM handles it pretty well though).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- My motivation for having the
byte < 128
check is thatCharEq
is a pub trait, and a user impl of CharEq that claimsonly_ascii
will still be asked about non-ascii char. Further, if it lied about only_ascii and did match one of those, it would allow creating an invalid str with the iterator. Not sure how to deal with that, user error or safety hole? - Just because it felt and looked the best, allowing to factor the slice code into one place. I did a lot trial implementations without being happy with the expression of it.. happy for suggestions. I didn't benchmark this particular version, but I will do it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll bench the difference for the byte < 128
check too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you could also try moving the byte < 128
check to be after the .matches
check. (I agree we need some check to maintain the utf8
guarantee.)
Implement Huon Wilson's suggestions (since the benchmarks agree!). Use `self.sep.matches(byte as char) && byte < 128u8` to match in the only_ascii case so that mistaken matches outside the ascii range can't create invalid substrings. Put the conditional on only_ascii outside the loop.
fixed the variable name for the char_offset_iter loop thanks! |
Make CharSplitIterator double-ended which is simple given that the operation is symmetric, once the split-N feature is factored out into its own adaptor. `.rsplitn_iter()` allows splitting `N` times from the back of a string, so it is a completely new feature. With the double-ended impl, `.split_iter()`, `.line_iter()`, `.word_iter()` all allow picking off elements from either end. `split_options_iter` is removed with the factoring of the split- and split-N- iterators, instead there is `split_terminator_iter`. --- Add benchmarks using `#[bench]` and tune CharSplitIterator a bit after Huon Wilson's suggestions Benchmarks 1-5 do the same split using different implementations of `CharEq`, all splitting an ascii string on ascii space. Benchmarks 6-7 split a unicode string on an ascii char. Before this PR test str::bench::split_iter_ascii ... bench: 166 ns/iter (+/- 2) test str::bench::split_iter_closure ... bench: 113 ns/iter (+/- 1) test str::bench::split_iter_extern_fn ... bench: 286 ns/iter (+/- 7) test str::bench::split_iter_not_ascii ... bench: 114 ns/iter (+/- 4) test str::bench::split_iter_slice ... bench: 220 ns/iter (+/- 12) test str::bench::split_iter_unicode_ascii ... bench: 217 ns/iter (+/- 3) test str::bench::split_iter_unicode_not_ascii ... bench: 248 ns/iter (+/- 3) PR, first commit test str::bench::split_iter_ascii ... bench: 331 ns/iter (+/- 9) test str::bench::split_iter_closure ... bench: 114 ns/iter (+/- 2) test str::bench::split_iter_extern_fn ... bench: 314 ns/iter (+/- 6) test str::bench::split_iter_not_ascii ... bench: 132 ns/iter (+/- 1) test str::bench::split_iter_slice ... bench: 157 ns/iter (+/- 3) test str::bench::split_iter_unicode_ascii ... bench: 502 ns/iter (+/- 64) test str::bench::split_iter_unicode_not_ascii ... bench: 250 ns/iter (+/- 3) PR, final version test str::bench::split_iter_ascii ... bench: 106 ns/iter (+/- 4) test str::bench::split_iter_closure ... bench: 107 ns/iter (+/- 1) test str::bench::split_iter_extern_fn ... bench: 267 ns/iter (+/- 6) test str::bench::split_iter_not_ascii ... bench: 108 ns/iter (+/- 1) test str::bench::split_iter_slice ... bench: 170 ns/iter (+/- 8) test str::bench::split_iter_unicode_ascii ... bench: 128 ns/iter (+/- 5) test str::bench::split_iter_unicode_not_ascii ... bench: 252 ns/iter (+/- 3) --- There are several ways to deal with `CharEq::only_ascii`. It is a performance optimization, so with that in mind, we allow passing bogus char (outside ascii) as long as they don't match. We use a byte value check to make sure we don't split on these (would split substrings in the middle of encoded char). (A more principled way would be to only pass the ascii codepoints to the CharEq when it indicates only_ascii, but that undoes some of the performance optimization.)
…fate Extend `extra_unused_lifetimes` to handle impl lifetimes Fixes rust-lang#6437 (cc: `@carols10cents)` changelog: fix rust-lang#6437
Make CharSplitIterator double-ended which is simple given that the operation is symmetric, once the split-N feature is factored out into its own adaptor.
.rsplitn_iter()
allows splittingN
times from the back of a string, so it is a completely new feature. With the double-ended impl,.split_iter()
,.line_iter()
,.word_iter()
all allow picking off elements from either end.split_options_iter
is removed with the factoring of the split- and split-N- iterators, instead there issplit_terminator_iter
.Add benchmarks using
#[bench]
and tune CharSplitIterator a bit after Huon Wilson's suggestionsBenchmarks 1-5 do the same split using different implementations of
CharEq
, all splitting an ascii string on ascii space. Benchmarks 6-7 split a unicode string on an ascii char.Before this PR
test str::bench::split_iter_ascii ... bench: 166 ns/iter (+/- 2)
test str::bench::split_iter_closure ... bench: 113 ns/iter (+/- 1)
test str::bench::split_iter_extern_fn ... bench: 286 ns/iter (+/- 7)
test str::bench::split_iter_not_ascii ... bench: 114 ns/iter (+/- 4)
test str::bench::split_iter_slice ... bench: 220 ns/iter (+/- 12)
test str::bench::split_iter_unicode_ascii ... bench: 217 ns/iter (+/- 3)
test str::bench::split_iter_unicode_not_ascii ... bench: 248 ns/iter (+/- 3)
PR, first commit
test str::bench::split_iter_ascii ... bench: 331 ns/iter (+/- 9)
test str::bench::split_iter_closure ... bench: 114 ns/iter (+/- 2)
test str::bench::split_iter_extern_fn ... bench: 314 ns/iter (+/- 6)
test str::bench::split_iter_not_ascii ... bench: 132 ns/iter (+/- 1)
test str::bench::split_iter_slice ... bench: 157 ns/iter (+/- 3)
test str::bench::split_iter_unicode_ascii ... bench: 502 ns/iter (+/- 64)
test str::bench::split_iter_unicode_not_ascii ... bench: 250 ns/iter (+/- 3)
PR, final version
test str::bench::split_iter_ascii ... bench: 106 ns/iter (+/- 4)
test str::bench::split_iter_closure ... bench: 107 ns/iter (+/- 1)
test str::bench::split_iter_extern_fn ... bench: 267 ns/iter (+/- 6)
test str::bench::split_iter_not_ascii ... bench: 108 ns/iter (+/- 1)
test str::bench::split_iter_slice ... bench: 170 ns/iter (+/- 8)
test str::bench::split_iter_unicode_ascii ... bench: 128 ns/iter (+/- 5)
test str::bench::split_iter_unicode_not_ascii ... bench: 252 ns/iter (+/- 3)
There are several ways to deal with
CharEq::only_ascii
. It is a performance optimization, so with that in mind, we allow passing bogus char (outside ascii) as long as they don't match. We use a byte value check to make sure we don't split on these (would split substrings in the middle of encoded char). (A more principled way would be to only pass the ascii codepoints to the CharEq when it indicates only_ascii, but that undoes some of the performance optimization.)