Performance improvements for take #279

Dandandan · 2021-05-10T18:00:26Z

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Take is a critical function that is used in quite some code and is hot up in profiling. It seems like we should be able to speed it up.

Two improvements we can do:

Bound checks are about 25% of the work (in take benchmark)
SIMD could be utilized.

Describe the solution you'd like
Check whether we can eliminate / avoid bound checks.
Add SIMD versions of take

Describe alternatives you've considered

Additional context

Dandandan · 2021-05-10T20:39:40Z

Some first experimenting didn't reveal a lot of options.
What did improve performance however, was removing the bound check (-25%) but it doesn't seem like we can get rid of it without checking bounds for all values manual (which seems like it should be slower - even after speeding it up by 2x here #281 )

nevi-me · 2021-05-11T03:01:17Z

Currently maybe_usize and try_from_trusted_len_iter are used for the take kernel. It seems we could do better for indices with type u32, i32, and u64/ i64 (for >64 bit architecture) as we can cast to usize directly and use the non-optional variants which is likely faster.

@Dandandan I once checked the assembly output of what the num crate does, as I was worried that it might be introducing overhead.
There appeared to be no overhead. For safe casts like going from u32 to usize, the compiler optimises out the Option check, and returns the result directly.

Here's some example assembly output: https://godbolt.org/z/1nsq4reT7

That could explain why you didn't see perf changes.

Dandandan · 2021-05-11T05:49:45Z

Currently maybe_usize and try_from_trusted_len_iter are used for the take kernel. It seems we could do better for indices with type u32, i32, and u64/ i64 (for >64 bit architecture) as we can cast to usize directly and use the non-optional variants which is likely faster.

@Dandandan I once checked the assembly output of what the num crate does, as I was worried that it might be introducing overhead.
There appeared to be no overhead. For safe casts like going from u32 to usize, the compiler optimises out the Option check, and returns the result directly.

Here's some example assembly output: https://godbolt.org/z/1nsq4reT7

That could explain why you didn't see perf changes.

Thanks! That makes sense.

It seems further speedups might come only from somehow removing bound checks and SIMD gather.

ritchie46 · 2021-05-11T07:55:04Z

It seems further speedups might come only from somehow removing bound checks and SIMD gather.

SIMD gather is not being used atm right? I remember looking for this operation in packed_simd, but not yielding any results.

Dandandan · 2021-05-11T07:58:30Z

Yeah I also did some googling - seems not supported in packed_simd so I guess at this point it has to be written by hand.

nevi-me · 2021-05-11T11:03:19Z

These are good opportunities to use stdsimd, as like all things, it'll eventually become stable

Dandandan added the enhancement Any new improvement worthy of a entry in the changelog label May 10, 2021

Dandandan changed the title ~~Performance improvements for take by specializing on 32 /64 bit~~ Performance improvements for take by specializing on 32 /64 bit indices May 10, 2021

Dandandan changed the title ~~Performance improvements for take by specializing on 32 /64 bit indices~~ Performance improvements for take by specializing on 32 /64 bit integer indices May 10, 2021

Dandandan changed the title ~~Performance improvements for take by specializing on 32 /64 bit integer indices~~ Performance improvements for take by specializing on 32 / 64 bit integer indices May 10, 2021

Dandandan changed the title ~~Performance improvements for take by specializing on 32 / 64 bit integer indices~~ Performance improvements for take May 13, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance improvements for take #279

Performance improvements for take #279

Dandandan commented May 10, 2021 •

edited

Loading

Dandandan commented May 10, 2021

nevi-me commented May 11, 2021

Dandandan commented May 11, 2021

ritchie46 commented May 11, 2021

Dandandan commented May 11, 2021

nevi-me commented May 11, 2021

Performance improvements for take #279

Performance improvements for take #279

Comments

Dandandan commented May 10, 2021 • edited Loading

Dandandan commented May 10, 2021

nevi-me commented May 11, 2021

Dandandan commented May 11, 2021

ritchie46 commented May 11, 2021

Dandandan commented May 11, 2021

nevi-me commented May 11, 2021

Dandandan commented May 10, 2021 •

edited

Loading