
Ironing out StepBy<Range>'s performance issues #51557

Open
Emerentius opened this issue Jun 14, 2018 · 7 comments
Labels
C-enhancement: An issue proposing an enhancement or a PR with one.
I-slow: Problems and improvements with respect to performance of generated code.
T-libs-api: Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@Emerentius
Contributor

The behaviour of `<Range<_> as Iterator>::nth` has a slight mismatch with `StepBy` (or `Step`, depending on your viewpoint), as @scottmcm has found out, resulting in sub-optimal performance.
On every iteration, the range first has to step forward `n - 1` times to get to the next element and then advance once more by 1.
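
For reference, the current code path looks roughly like this (a simplified sketch, not the exact libcore source; `range_nth` is a stand-in for `Range`'s own `Iterator::nth`):

// Simplified sketch of the generic adapter: after the first call, every
// next() becomes an nth() call on the inner iterator.
impl<I: Iterator> Iterator for StepBy<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        if self.first_take {
            self.first_take = false;
            self.iter.next()
        } else {
            // `self.step` holds `step - 1`: skip `step - 1` elements, take the next.
            self.iter.nth(self.step)
        }
    }
}

// Stand-in for Range's Iterator::nth: it computes `start + n` to find the
// element, then advances `start` past it by one more, i.e. two additions
// per StepBy::next() call.
fn range_nth<A: Step>(range: &mut Range<A>, n: usize) -> Option<A> {
    if let Some(plus_n) = range.start.add_usize(n) {
        if plus_n < range.end {
            range.start = plus_n.add_one();
            return Some(plus_n);
        }
    }
    range.start = range.end.clone();
    None
}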

I'm hoping we can improve step_by into a 100% zero-cost abstraction.

It seems like the performance issue is specific to `StepBy<Range>`. I'm therefore thinking we could specialize `Iterator` for `StepBy<Range<I>>` so that it uses @scottmcm's suggested semantics. Like this:

impl<I> Iterator for StepBy<Range<I>>
where
    I: Step,
{
    type Item = I;

    fn next(&mut self) -> Option<I> {
        self.first_take = false;
        if self.iter.start < self.iter.end {
            // `self.step` stores `step - 1`, so add 1 back to get the real stride.
            if let Some(mut n) = self.iter.start.add_usize(self.step + 1) {
                // Yield the current element and advance the range in one go.
                std::mem::swap(&mut self.iter.start, &mut n);
                return Some(n);
            }
            // The stride overflowed: yield the final element and exhaust the range.
            let last = self.iter.start.clone();
            self.iter.start = self.iter.end.clone();
            return Some(last);
        }
        None
    }
}

That also avoids the `first_take` branch on a regular `next()`. I haven't looked at the other methods yet, but that boolean in `StepBy` could possibly become superfluous. During construction of the `StepBy` adapter, the size passed to `.step_by(size)` is decremented, and this specialization has to add 1 back every time, but that should be optimized away when inlined.
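
(For reference, the constructor looks roughly like this; `make_step_by` is a stand-in for `Iterator::step_by`, not the exact libcore source:)

// `step` is stored as `step - 1` because the generic path feeds it
// straight into `nth()`, which is zero-indexed.
fn make_step_by<I: Iterator>(iter: I, step: usize) -> StepBy<I> {
    assert!(step != 0);
    StepBy { iter, step: step - 1, first_take: true }
}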

If someone were to depend on side-effects in Step::add_usize (when the trait is stabilized), this pre-stepping would become weird. Same thing with a hypothetical next_and_skip_ahead().

@scottmcm what do you think of this?

@kennytm added the C-enhancement and T-libs-api labels on Jun 14, 2018
@scottmcm
Member

Overall yes; someone should make a PR 🙂 Probably best to use specialization for now to hide the detail. (I suppose a doc-hidden unstable method on Iterator could work too, but that feels wrong.)

A few minor things:

that we could specialize Iterator for StepBy<Range<I>>

I thought that normally public traits didn't get specialized, instead delegating to an internal one and specializing that, but apparently Iterator for Fuse is specialized, so I'm misremembering.

but that boolean in StepBy could possibly become superfluous

It would still be needed for non-Ranges, of course, so I don't think it can go away in the specialization approach, since Fuse still has its boolean even when the specialization never uses it.

Like this:

It's hard to guess at exactly how it should be written without looking at how LLVM handles it. Something like that seems plausible, but I could imagine seemingly-unimportant things like the branch orderings affecting the loop passes. Maybe it would be possible to have a codegen test to make sure?

@Emerentius
Contributor Author

I've copied the std implementations onto godbolt and added the specialization. As expected, it improves the generated code a lot.
You're right, @scottmcm: with the branch order switched it produces slightly different, seemingly better, code. With that change and `usize` ranges it compiles down to nothing; `u8` still produces extra code. That is with constant start, end and step size; I haven't checked the rest yet.

code on godbolt
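
(The functions compared on godbolt aren't reproduced here; they are of roughly this shape, shown with hypothetical constants: a loop over a range with constant start, end and step, with each element fed through a black box so the loop isn't removed entirely.)

// Hypothetical stand-in for the kind of function compared on godbolt.
pub fn step_by_loop() {
    for n in (0usize..1000).step_by(7) {
        std::hint::black_box(n);
    }
}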

Adding this specialization slightly changes what we expect from the `Step` trait, in that any implementor must be fine with one more `add_usize()` call than strictly necessary (side effects). The trait is unstable for now anyway, but in the future it may matter.
I'm going to send a PR for the specialization.

Is there any way we can get rid of the TryFrom baggage?
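
(For context: for the primitive integers, `Step::add_usize` first converts the `usize` to the element type and then does a checked add, roughly like this sketch of the macro-generated impl for `i32`, written here as a free function. That conversion is the `TryFrom` baggage in question.)

use std::convert::TryFrom;

// Rough shape of Step::add_usize for a signed integer type: the usize step
// has to be converted to the element type before the checked add.
fn add_usize_i32(x: i32, n: usize) -> Option<i32> {
    match i32::try_from(n) {
        Ok(n) => x.checked_add(n),
        Err(_) => None,
    }
}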

@scottmcm
Member

I haven't looked at it in detail yet, but a reminder that this isn't exactly what we want to reach:

// what we want to reach
pub fn manual_while() {
    let mut n = 0;
    while n < UPPER {
        test::black_box(n);
        n += STEP;
    }
}

Because that doesn't have overflow detection, and is thus an infinite loop for something like `(0..255u8).step_by(2)`: it will get to 254 and wrap back to 0, and both of those are still below 255.
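
(To actually match the iterator's semantics, the manual loop needs overflow detection, e.g. via a checked add; a sketch reusing the `UPPER` and `STEP` constants from above:)

// Sketch of the loop we actually need to reach: same shape, but the step
// is checked so the loop terminates instead of wrapping around.
pub fn manual_while_checked() {
    let mut n = 0;
    while n < UPPER {
        test::black_box(n);
        match n.checked_add(STEP) {
            Some(next) => n = next,
            // The next step would overflow: iteration is finished.
            None => break,
        }
    }
}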

@Emerentius
Contributor Author

Good catch. That's also what caused the difference between manual and specialized for u8.

@Emerentius
Contributor Author

Emerentius commented Jun 16, 2018

Looks like specialization is still too broken for this: it trips up type inference. That is fixed by going through a second, private trait; see the sketch below.
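
One shape that workaround can take (a sketch with hypothetical names, not the code from the eventual PR): the public `Iterator` impl stays fully generic and only forwards to a private helper trait, and only that helper is specialized, so type inference never has to look at the specialized impls. `u32` stands in for "any `Step` type" so the sketch compiles outside core.

#![feature(specialization)]

use std::ops::Range;

pub struct StepBy<I> {
    iter: I,
    step: usize,
    first_take: bool,
}

// The private helper trait; only this gets specialized.
trait StepByImpl {
    type Item;
    fn spec_next(&mut self) -> Option<Self::Item>;
}

// Generic fallback: same behaviour as the unspecialized adapter.
impl<I: Iterator> StepByImpl for StepBy<I> {
    type Item = I::Item;

    default fn spec_next(&mut self) -> Option<I::Item> {
        if self.first_take {
            self.first_take = false;
            self.iter.next()
        } else {
            self.iter.nth(self.step)
        }
    }
}

// Specialized fast path for ranges: yield the current element, then jump
// ahead by the full stride in one go.
impl StepByImpl for StepBy<Range<u32>> {
    fn spec_next(&mut self) -> Option<u32> {
        self.first_take = false;
        if self.iter.start >= self.iter.end {
            return None;
        }
        let current = self.iter.start;
        self.iter.start = match current.checked_add(self.step as u32 + 1) {
            Some(next) => next,
            None => self.iter.end, // the stride overflowed: exhaust the range
        };
        Some(current)
    }
}

// The public impl never changes; it only forwards to the helper trait.
impl<I: Iterator> Iterator for StepBy<I> {
    type Item = <Self as StepByImpl>::Item;

    fn next(&mut self) -> Option<Self::Item> {
        self.spec_next()
    }
}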

Another issue I've found: with `RangeFrom` I can't specialize generically in a way that inherits the debug/release overflow-check behaviour. Furthermore, the overflow, and therefore the panic in debug mode, would happen one iteration too early, i.e. during the call that yields the last representable element instead of on the call after it. We may still be able to specialize `RangeFrom` for each type separately while preserving semantics.

Step::add_usize for signed integers is currently impeding optimizations.

bors added a commit that referenced this issue Jun 21, 2018
Specialize StepBy<Range(Inclusive)>

Part of #51557, related to #43064, #31155

As discussed in the above issues, `step_by` optimizes very badly on ranges, which is related to
1. the special-casing of the first `StepBy::next()` call
2. the need to do two additions of `n - 1` and `1` inside the range's `next()`

This PR eliminates both by overriding `next()` to always produce the current element and also step ahead by `n` elements in one go. The generated code is much better, even identical to the manual loop in the case of a `Range` with constant `start` and `end` where `start + step` can't overflow. Without constant bounds it's a bit longer than the manual loop. `RangeInclusive` doesn't optimize as nicely but is still much better than the original asm.
Unsigned integers optimize better than signed ones for some reason.

See the following two links for a comparison.

[godbolt: specialization for ..](https://godbolt.org/g/haHLJr)
[godbolt: specialization for ..=](https://godbolt.org/g/ewyMu6)

`RangeFrom`, the only other range with an `Iterator` implementation, can't be specialized like this without changing behaviour, due to overflow; there is no way to save "finished-ness".

The approach cannot be used in general because it would trigger the underlying iterator's side effects too early.

May obsolete #51435, haven't checked.
@the8472 added the I-slow label on May 20, 2023
@the8472
Member

the8472 commented May 20, 2023

For the record, the issue is still open because the specialization PR (#51601) was reverted in #56049.

@the8472
Member

the8472 commented Sep 3, 2023

#111850 solved this for unsigned integers. Is that sufficient?
