Simplify some `Iterator` methods. #64572
Conversation
PR rust-lang#64545 got a big speed-up by replacing a hot call to `all()` with explicit iteration. This is because the implementation of `all()` is excessively complex: it wraps the given predicate in a closure that returns a `LoopState`, passes that closure to `try_for_each()`, which wraps the first closure in a second closure and passes that second closure to `try_fold()`, which does the actual iteration using the second closure. A sufficiently smart compiler could optimize all this away; rustc is currently not sufficiently smart.

This commit does the following.

- Changes the implementations of `all()`, `any()`, `find()` and `find_map()` to use the simplest possible code, rather than using `try_for_each()`. (I am reminded of "The Evolution of a Haskell Programmer".) These are both shorter and faster than the current implementations, and will permit the undoing of the `all()` removal in rust-lang#64545.
- Changes `ResultShunt::next()` so it doesn't call `self.find()`, because that was causing infinite recursion with the new implementation of `find()`, which itself calls `self.next()`. (I honestly don't know how the old implementation of `ResultShunt::next()` didn't cause an infinite loop, given that it also called `self.next()`, albeit via `try_for_each()` and `try_fold()`.)
- Changes `nth()` to use `self.next()` in a while loop rather than `for x in self`, because using self-iteration within an iterator method seems dubious, and `self.next()` is used in all the other iterator methods.
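To make the layering concrete, here is a rough sketch (an editor's illustration, not the actual libcore source; the function names are made up and `Result<(), ()>` stands in for the internal `LoopState` type). `all_layered` mirrors the closure-in-a-closure shape described above, while `all_simple` is the direct loop this PR switches to:

```rust
fn all_layered<I, F>(iter: &mut I, mut f: F) -> bool
where
    I: Iterator,
    F: FnMut(I::Item) -> bool,
{
    // predicate -> closure returning a Try value -> try_for_each,
    // which wraps it again and hands it to try_fold
    iter.try_for_each(move |x| if f(x) { Ok(()) } else { Err(()) })
        .is_ok()
}

fn all_simple<I, F>(iter: &mut I, mut f: F) -> bool
where
    I: Iterator,
    F: FnMut(I::Item) -> bool,
{
    // direct external iteration, as in this PR
    while let Some(x) = iter.next() {
        if !f(x) {
            return false;
        }
    }
    true
}

fn main() {
    let v = vec![1, 2, 3];
    assert!(all_layered(&mut v.iter(), |&x| x < 10));
    assert!(all_simple(&mut v.iter(), |&x| x < 10));
}
```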
r? @cramertj (rust_highfive has picked a reviewer for you, use r? to override)
@simulacrum: time to test the new all-in-one perf CI command: @bors try @rust-timer queue
Awaiting bors try build completion
@nnethercote it's
r? @scottmcm
☀️ Try build successful - checks-azure
Queued 5be48aa with parent 5283791, future comparison URL.
```diff
-            else { LoopState::Break(()) }
+        while let Some(x) = self.next() {
+            if !f(x) {
+                return false;
```
TIL that `Iterator::all` is that complex.

Here is the result from godbolt: https://godbolt.org/z/QlHZ0m
Thanks! I'm inexperienced with godbolt, but the results seem to strongly support my simplification, i.e. the microbenchmark static instruction count drops from 29 to 9 and the cycle count drops from 735 to 260. Is that right? Is there anything else in there of note?
The MCA results are misleading here: the warning at the bottom of the output ("note: program counter updates are ignored") means it doesn't understand the loops. If you consider that the first one is 4x unrolled, the cycle difference is not unreasonable.
The code in that godbolt link isn't actually correct:

```rust
while let Some(&x) = a.iter().next() {
```

`a.iter()` is called on every loop iteration, so it's only looking at the first element in the slice.
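For illustration, a corrected version might look like the sketch below (editor-supplied; the `x < 10` predicate and the function name are placeholders, the point is only that the iterator is created once, outside the loop):

```rust
// Build the iterator once instead of calling `a.iter()` each time around
// the loop, which would restart from the first element.
fn all_less_than_ten(a: &[i32]) -> bool {
    let mut it = a.iter();
    while let Some(&x) = it.next() {
        if x >= 10 {
            return false;
        }
    }
    true
}

fn main() {
    assert!(all_less_than_ten(&[1, 2, 3]));
    assert!(!all_less_than_ten(&[1, 20, 3]));
}
```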
Indeed. I updated the link: https://godbolt.org/z/QlHZ0m
Finished benchmarking try commit 5be48aa, comparison URL.
Note that using internal iteration over
@rkruppe: I read through #45595 but I don't see an explanation of the
I'd be interested in concrete examples of where this change causes slowdowns, because the perf results for rustc itself are ridiculously good. A 48.8% instruction count win? A 50.5% wall-time win? That's crazy. I had only tested with
I just analyzed
In short, LLVM is doing way less stuff. My best guess is that
I did a quick grep for these methods occurring in
That's a lot of calls, so this seems plausible.
Hm, so I wonder if #62429 is somewhat related here. I'd be interested in seeing if we could get similar results with the old code but replacing e.g.
@nnethercote Oh, duh, that makes sense. "LLVM spends a lot less time optimizing iterator code if the iterator code is simpler" is a lot more plausible (and less scary) than rustc code magically getting twice as fast by fiddling with some iterator methods. But there are still instances of
I have never investigated this in enough detail to give a very detailed or confident explanation of why this is, but I can give you my best guess as a compiler engineer. A loop like
The fix to all those woes is to split the loop into two loops, one going just over
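An editor's sketch of the loop-splitting described above (the names and the `x < k` predicate are chosen to match the benchmark below, not taken from the thread):

```rust
// External iteration over a chain: every next() call has to branch on
// which half of the chain it is currently in.
fn count_small_external(n: u64, k: u64) -> u64 {
    let mut iter = (0..n).chain(0..n);
    let mut count = 0;
    while let Some(x) = iter.next() {
        if x < k {
            count += 1;
        }
    }
    count
}

// The split form that internal iteration (Chain's fold/try_fold)
// effectively produces: two plain loops, no per-item branching.
fn count_small_split(n: u64, k: u64) -> u64 {
    let mut count = 0;
    for x in 0..n {
        if x < k {
            count += 1;
        }
    }
    for x in 0..n {
        if x < k {
            count += 1;
        }
    }
    count
}

fn main() {
    assert_eq!(count_small_external(10, 5), count_small_split(10, 5));
}
```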
Here is a test program that demonstrates @rkruppe's point:

```rust
// bench.rs
// - Compile with `rustc --test -O bench.rs`
// - Run with `./bench --bench`
#![feature(test)]
extern crate test;

use test::Bencher;

#[bench]
fn long_all(b: &mut Bencher) {
    b.iter(|| {
        let n = test::black_box(100000);
        let k = test::black_box(1000000);
        (0..2*n).all(|x| x < k)
    })
}

#[bench]
fn long_all_chained(b: &mut Bencher) {
    b.iter(|| {
        let n = test::black_box(100000);
        let k = test::black_box(1000000);
        (0..n).chain(0..n).all(|x| x < k)
    })
}
```

Before:

After:

So we have a demonstrated slowdown for a
To summarize the effects of this PR.
IMO the possible pessimization of the relatively rare
Furthermore, I'm wondering if any other common library functions are overly complicated and could also be simplified in order to gain more compile-time wins. I'm trying to think of how to find any such functions.
Looking at the change in #64545 and at @lzutao's helpful godbolt link reminded me of something else that gives me an alternative hypothesis: Lines 3184 to 3201 in eceec57 (the manual unrolling in `slice::Iter::try_fold`).
That's there because it was ported from
I'll make a quick PR to remove the unrolling to give us some information on whether the problem is actually the
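For readers who haven't seen it, the unrolling in question has roughly the following shape (an editor's sketch over a plain slice; the real `slice::Iter::try_fold` uses unsafe pointer iteration, and the helper name here is made up):

```rust
// Fold a slice in blocks of four, then finish the tail one element at a
// time. The closure body is effectively duplicated four times, which is
// why large closures can blow up compile time and code size.
fn fold_unrolled_4<T, B>(slice: &[T], init: B, mut f: impl FnMut(B, &T) -> B) -> B {
    let mut acc = init;
    let mut chunks = slice.chunks_exact(4);
    for chunk in &mut chunks {
        acc = f(acc, &chunk[0]);
        acc = f(acc, &chunk[1]);
        acc = f(acc, &chunk[2]);
        acc = f(acc, &chunk[3]);
    }
    for x in chunks.remainder() {
        acc = f(acc, x);
    }
    acc
}

fn main() {
    let v: Vec<u32> = (1..=10).collect();
    assert_eq!(fold_unrolled_4(&v, 0, |acc, &x| acc + x), 55);
}
```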
[DO NOT MERGE] Experiment with removing unrolling from slice::Iter::try_fold

For context see #64572 (comment)

r? @scottmcm
Everything makes sense to revisit, but since internal iteration is a big algorithm shift (it's not just `chain`: it allows efficient iteration of all kinds of segmented data structures, including `VecDeque` and multi-dimensional arrays when we get there; implementing `try_fold` is still unstable), backing out the unrolling would be better.
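As a concrete (editor-supplied) illustration of that point, here is a minimal segmented iterator: its `next()` is a little state machine, while its `fold()` collapses into plain nested loops. `fold` is overridden here rather than `try_fold`, since `try_fold`'s signature needs the unstable `Try` trait:

```rust
// A segmented collection (in the spirit of VecDeque's two halves) where
// internal iteration avoids next()'s per-item branching.
struct Segmented<'a, T> {
    segments: &'a [Vec<T>],
    seg: usize,
    idx: usize,
}

impl<'a, T> Iterator for Segmented<'a, T> {
    type Item = &'a T;

    fn next(&mut self) -> Option<&'a T> {
        // External iteration: figure out which segment we are in on
        // every single call.
        loop {
            let segment = self.segments.get(self.seg)?;
            if let Some(item) = segment.get(self.idx) {
                self.idx += 1;
                return Some(item);
            }
            self.seg += 1;
            self.idx = 0;
        }
    }

    fn fold<B, F>(self, init: B, mut f: F) -> B
    where
        F: FnMut(B, Self::Item) -> B,
    {
        // Internal iteration: finish the current segment, then walk the
        // rest with plain nested loops.
        let mut acc = init;
        let mut seg = self.seg;
        if let Some(segment) = self.segments.get(seg) {
            for item in &segment[self.idx..] {
                acc = f(acc, item);
            }
            seg += 1;
        }
        for segment in &self.segments[seg..] {
            for item in segment {
                acc = f(acc, item);
            }
        }
        acc
    }
}

fn main() {
    let data = vec![vec![1, 2], vec![], vec![3, 4, 5]];
    let iter = Segmented { segments: &data, seg: 0, idx: 0 };
    assert_eq!(iter.fold(0, |acc, &x| acc + x), 15);
}
```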
It also overrides
It's correct (though sometimes suboptimal) to make an iterator by overloading
So the closure passed to these iterator methods could be duplicated four times? Huh. That's aggressive. Looking at the results in #64600, that accounts for about half of the speedups. Even half of this win is still a big win. I also wonder if it's worth using the simple implementations for debug builds, where losing some runtime speed in favour of faster compilation is a more acceptable trade-off.
One option: we could ostensibly
Er, no? libstd/core etc are shipped for end-users and for rustc itself as the same binary, so we can't really do things differently in that way. But I also don't think that would be entirely useful.
Out of curiosity, has anyone looked at what clap's doing that makes it an order-of-magnitude outlier here? 2x faster for the patched incremental just because of this change seems outlandishly good. (Oh, I spotted the post above about the call counts for these methods. Is it lots of calls with trivial bodies? Are the bodies really complicated? Is it typechecking or codegen or ...? For example, if a bunch of the cost is coming from the generic instead of bools, we can use a different type in Iterator...)
This blog post writes in detail about what internal iteration gives to Rust. The headline can be taken with a grain of salt: "Rust’s iterators are inefficient, and here’s what we can do about it." Just to give more background to why
As a longer-term solution, I wonder if closures themselves could be made cheaper to compile, so rustc could avoid building so much LLVM IR even when handed the internal iterator versions of these functions. Improvements in that direction could potentially also hugely benefit generator-based iterator implementations, if such a thing becomes common. For example, the Thorin research IR has some higher-order optimizations that can eliminate closures before they are lowered: http://compilers.cs.uni-saarland.de/papers/lkh15_cgo.pdf. Perhaps something like this could be adapted to run on MIR?
I haven't looked closely, but the 4x unrolling means that any large closure (especially if it's marked with
Remove manual unrolling from slice::Iter(Mut)::try_fold

While this definitely helps sometimes (particularly for trivial closures), it's also a pessimization sometimes, so it's better to leave this to (hypothetical) future LLVM improvements instead of forcing this on everyone. I think it's better for the advice to be that sometimes you need to unroll manually than that you sometimes need to not-unroll manually (like #64545).

---

For context see #64572 (comment)
Ping from triage |
```diff
@@ -319,7 +319,7 @@ pub trait Iterator {
     #[inline]
     #[stable(feature = "rust1", since = "1.0.0")]
     fn nth(&mut self, mut n: usize) -> Option<Self::Item> {
-        for x in self {
+        while let Some(x) = self.next() {
```
This is a good simplification
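For reference, the resulting behavior of `nth()` is sketched below as a free function (an editor's paraphrase, not the exact libcore body):

```rust
// Advance the iterator, counting n down, and return the element reached.
fn nth<I: Iterator>(iter: &mut I, mut n: usize) -> Option<I::Item> {
    while let Some(x) = iter.next() {
        if n == 0 {
            return Some(x);
        }
        n -= 1;
    }
    None
}

fn main() {
    let mut it = 0..10;
    assert_eq!(nth(&mut it, 3), Some(3));
    // the iterator is left just past the returned element
    assert_eq!(it.next(), Some(4));
    assert_eq!(nth(&mut it, 100), None);
}
```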
```diff
@@ -1855,18 +1855,15 @@ pub trait Iterator {
     /// ```
     #[inline]
     #[stable(feature = "rust1", since = "1.0.0")]
-    fn all<F>(&mut self, f: F) -> bool where
+    fn all<F>(&mut self, mut f: F) -> bool where
```
I'd like to keep these `Iterator::all` etc simplifications, there are examples of where removing them would be a big regression. We have already put in the work to make sure we thread internal iteration methods through various adaptors correctly, so to remove the use of `try_fold` here at the top level would be a step back.

This change only partly rolls back what we have worked on for internal iteration; other entry points to it still remain, like `sum`, `max`, `min`, `fold` and `try_fold`. And those remaining happen to be the most useful in stable Rust (because implementing `try_fold` is not possible in stable, so custom iterators can't fully take advantage of this particular `all` magic). It would also be inconsistent to partly remove it; I think we should keep it all.

Here's an example with a microbenchmark of the kind of improvement it can be: rust-itertools/itertools#348. In this example it was `fold`, but it will also be `all` when `try_fold` is stable.

I think we should go through `fold`/`try_fold` everywhere we can, for the structural improvements that it gives (it bypasses complicated state machines that composed iterators might have in `.next()`).
Of course the code inside Iterator::all is not beautiful to read as it is now, and if it could be simplified while keeping the same features, that would be fantastic.
This PR changes documented stable behavior; see https://doc.rust-lang.org/std/iter/trait.Iterator.html#note-to-implementors
@andjo403 Note that the
use try_fold instead of try_for_each to reduce compile time

As it was stated in rust-lang#64572 that the biggest gain was due to less code being generated, I tried to reduce the number of functions to inline by using `try_fold` directly instead of calling `try_for_each`, which in turn calls `try_fold`. As there are some gains from using `try_fold` directly, this is maybe a way forward. When I tried to compile the clap-rs benchmark, I got time gains only some % from those of rust-lang#64572. There are more functions, e.g. `fold`, that call `try_fold` and could also be changed, but the question is how much "duplication" is tolerated in std to give faster compile times.

Can someone start a perf run? cc @nnethercote @scottmcm @bluss

r? @ghost
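A hedged sketch of what that restructuring means, in free-function form (editor-supplied names; `Result<(), ()>` stands in for the internal `Try` type that libcore uses):

```rust
// all() expressed directly on try_fold, skipping the try_for_each layer,
// so one fewer generic closure/function pair gets instantiated per call.
fn all_via_try_fold<I, F>(iter: &mut I, mut f: F) -> bool
where
    I: Iterator,
    F: FnMut(I::Item) -> bool,
{
    iter.try_fold((), |(), x| if f(x) { Ok(()) } else { Err(()) })
        .is_ok()
}

fn main() {
    assert!(all_via_try_fold(&mut (1..10), |x| x < 100));
    assert!(!all_via_try_fold(&mut (1..10), |x| x < 5));
}
```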
Out of curiosity, I tried introducing the simplified versions of these four functions one at a time on
There were big drops in the
Which iterators are slowed down by the