Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance improvements for take #279

Open
Dandandan opened this issue May 10, 2021 · 6 comments
Open

Performance improvements for take #279

Dandandan opened this issue May 10, 2021 · 6 comments
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@Dandandan
Copy link
Contributor

Dandandan commented May 10, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Take is a critical function that is used in quite some code and is hot up in profiling. It seems like we should be able to speed it up.

Two improvements we can do:

  • Bound checks are about 25% of the work (in take benchmark)
  • SIMD could be utilized.

Describe the solution you'd like
Check whether we can eliminate / avoid bound checks.
Add SIMD versions of take

Describe alternatives you've considered

Additional context

@Dandandan Dandandan added the enhancement Any new improvement worthy of a entry in the changelog label May 10, 2021
@Dandandan Dandandan changed the title Performance improvements for take by specializing on 32 /64 bit Performance improvements for take by specializing on 32 /64 bit indices May 10, 2021
@Dandandan Dandandan changed the title Performance improvements for take by specializing on 32 /64 bit indices Performance improvements for take by specializing on 32 /64 bit integer indices May 10, 2021
@Dandandan Dandandan changed the title Performance improvements for take by specializing on 32 /64 bit integer indices Performance improvements for take by specializing on 32 / 64 bit integer indices May 10, 2021
@Dandandan
Copy link
Contributor Author

Some first experimenting didn't reveal a lot of options.
What did improve performance however, was removing the bound check (-25%) but it doesn't seem like we can get rid of it without checking bounds for all values manual (which seems like it should be slower - even after speeding it up by 2x here #281 )

@nevi-me
Copy link
Contributor

nevi-me commented May 11, 2021

Currently maybe_usize and try_from_trusted_len_iter are used for the take kernel. It seems we could do better for indices with type u32, i32, and u64/ i64 (for >64 bit architecture) as we can cast to usize directly and use the non-optional variants which is likely faster.

@Dandandan I once checked the assembly output of what the num crate does, as I was worried that it might be introducing overhead.
There appeared to be no overhead. For safe casts like going from u32 to usize, the compiler optimises out the Option check, and returns the result directly.

Here's some example assembly output: https://godbolt.org/z/1nsq4reT7

That could explain why you didn't see perf changes.

@Dandandan
Copy link
Contributor Author

Currently maybe_usize and try_from_trusted_len_iter are used for the take kernel. It seems we could do better for indices with type u32, i32, and u64/ i64 (for >64 bit architecture) as we can cast to usize directly and use the non-optional variants which is likely faster.

@Dandandan I once checked the assembly output of what the num crate does, as I was worried that it might be introducing overhead.
There appeared to be no overhead. For safe casts like going from u32 to usize, the compiler optimises out the Option check, and returns the result directly.

Here's some example assembly output: https://godbolt.org/z/1nsq4reT7

That could explain why you didn't see perf changes.

Thanks! That makes sense.

It seems further speedups might come only from somehow removing bound checks and SIMD gather.

@ritchie46
Copy link
Contributor

It seems further speedups might come only from somehow removing bound checks and SIMD gather.

SIMD gather is not being used atm right? I remember looking for this operation in packed_simd, but not yielding any results.

@Dandandan
Copy link
Contributor Author

Yeah I also did some googling - seems not supported in packed_simd so I guess at this point it has to be written by hand.

@nevi-me
Copy link
Contributor

nevi-me commented May 11, 2021

These are good opportunities to use stdsimd, as like all things, it'll eventually become stable

@Dandandan Dandandan changed the title Performance improvements for take by specializing on 32 / 64 bit integer indices Performance improvements for take May 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

3 participants