Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance optimizations #13

Open
piegamesde opened this issue Mar 12, 2021 · 8 comments
Open

Performance optimizations #13

piegamesde opened this issue Mar 12, 2021 · 8 comments

Comments

@piegamesde
Copy link

Randomly consistently shows up as a major bottleneck in my application. My basic problem is that I want to retrieve a lot of random elements of a Vec. I noticed a few things and would like to discuss them:

  • I did something like index = if len < 65536 { rng.u16(0..len) } else { rng.usize(0..len) } and it did seem to help with the performance.
  • I don't actually need the lower range bound, and in a lot of cases I need the full range of the value space anyways. Would adding explicit API for these use case increase performance? (I don't really know how RNGs work internally)
@SabrinaJewson
Copy link
Contributor

I did something like index = if len < 65536 { rng.u16(0..len) } else { rng.usize(0..len) } and it did seem to help with the performance.

This is probably because the algorithm only generates u32s, and generating a usize (which I assume is a u64 for you) requires generating two u32s. My PR #14 switches to Wyrand, which generates u64s so shouldn't have this problem. It is also faster in other ways.

I don't actually need the lower range bound, and in a lot of cases I need the full range of the value space anyways. Would adding explicit API for these use case increase performance?

I ran a benchmark and a hypothetical full-range API has the exact same performance as calling the function with a .. range, so you can just use that.

@piegamesde
Copy link
Author

If I understand you correctly, this means that generating u16 is faster than usize (maybe two times?), but as fast as generating u32. I could get twice as fast by splitting the u32 into two u16.

If this is true, this also means that using repeat_with(|| rng.u8(..)).take(10_000).collect(); to generate random byte vectors could be made 4 times faster by generating u32 and transmuting them to a byte slice.

@SabrinaJewson
Copy link
Contributor

Yes, I think that's all correct with the current implementation. It is two times faster to generate a u32 and then split it than to generate two u16s.

You don't need to transmute at all in your iterator. On beta and nightly (and stable in 12 days) you can write:

repeat_with(|| rng.u32(..))
    .flat_map(|v| array::IntoIter::new(v.to_ne_bytes()))
    .take(10_000)
    .collect()

or on stable until then:

repeat_with(|| rng.u32(..).to_ne_bytes())
    .flat_map(|bytes| (0..4).map(move |i| bytes[i]))
    .take(10_000)
    .collect()

@piegamesde
Copy link
Author

One would have to check if the iterator implementations compile down to a transmute or something similarly fast. Also, maybe it's worth adding a convenience method for bulk-filling random data array? (Not sure about how the API should look like though.)

Wasting half (or more) of the generated random values is something I think should easily be avoidable by storing the unused random bytes for future usage instead of throwing them away. Using a separate store for u8, u16 (and later on, u32) should make the branches easily predictable. Does this crate make any guarantees about the values that a certain seed returns across versions?

@SabrinaJewson
Copy link
Contributor

Wasting half (or more) of the generated random values is something I think should easily be avoidable by storing the unused random bytes for future usage instead of throwing them away.

I don't really think this is worth it. Especially if #14 gets merged the actual generation becomes really cheap, and caching values would require both adding branches and increasing the size of the RNG. Also, I don't know of any other RNG implementations that do this.

Does this crate make any guarantees about the values that a certain seed returns across versions?

It's not explicitly specified anywhere, but I would assume not, otherwise there would never be any possibility to change the RNG algorithm.

@piegamesde
Copy link
Author

The new Rng algorithm is indeed faster, but not by orders of magnitude (about 15% maybe?). Interestingly, generating usize is still slower than u8/u16 for some reason.

@ironhaven
Copy link
Contributor

I have a theory the performance discrepancy might be due to way the code is written not the library. There is a branch in the middle of what looks like a hot loop and this can slow things down surprisingly. If there was only one code path the CPU can work ahead on computing the random index. With the if statement the CPU has to predict which path it will take to stay at full speed. If it predicts wrong it will slow down. Only generating usizes would likely help with performance also considering WyRand generates u64s internally.

@Havunen
Copy link

Havunen commented Jul 21, 2024

It seems https://github.com/wangyi-fudan/wyhash wyhash has evolved into https://github.com/Nicoshev/rapidhash could that be used to improve perf of this library further

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

4 participants