Performance optimizations #13

piegamesde · 2021-03-12T22:45:29Z

Randomly consistently shows up as a major bottleneck in my application. My basic problem is that I want to retrieve a lot of random elements of a Vec. I noticed a few things and would like to discuss them:

I did something like index = if len < 65536 { rng.u16(0..len) } else { rng.usize(0..len) } and it did seem to help with the performance.
I don't actually need the lower range bound, and in a lot of cases I need the full range of the value space anyways. Would adding explicit API for these use case increase performance? (I don't really know how RNGs work internally)

The text was updated successfully, but these errors were encountered:

SabrinaJewson · 2021-03-13T13:05:09Z

I did something like index = if len < 65536 { rng.u16(0..len) } else { rng.usize(0..len) } and it did seem to help with the performance.

This is probably because the algorithm only generates u32s, and generating a usize (which I assume is a u64 for you) requires generating two u32s. My PR #14 switches to Wyrand, which generates u64s so shouldn't have this problem. It is also faster in other ways.

I don't actually need the lower range bound, and in a lot of cases I need the full range of the value space anyways. Would adding explicit API for these use case increase performance?

I ran a benchmark and a hypothetical full-range API has the exact same performance as calling the function with a .. range, so you can just use that.

piegamesde · 2021-03-13T14:45:49Z

If I understand you correctly, this means that generating u16 is faster than usize (maybe two times?), but as fast as generating u32. I could get twice as fast by splitting the u32 into two u16.

If this is true, this also means that using repeat_with(|| rng.u8(..)).take(10_000).collect(); to generate random byte vectors could be made 4 times faster by generating u32 and transmuting them to a byte slice.

SabrinaJewson · 2021-03-13T15:48:52Z

Yes, I think that's all correct with the current implementation. It is two times faster to generate a u32 and then split it than to generate two u16s.

You don't need to transmute at all in your iterator. On beta and nightly (and stable in 12 days) you can write:

repeat_with(|| rng.u32(..))
    .flat_map(|v| array::IntoIter::new(v.to_ne_bytes()))
    .take(10_000)
    .collect()

or on stable until then:

repeat_with(|| rng.u32(..).to_ne_bytes())
    .flat_map(|bytes| (0..4).map(move |i| bytes[i]))
    .take(10_000)
    .collect()

piegamesde · 2021-03-13T16:05:42Z

One would have to check if the iterator implementations compile down to a transmute or something similarly fast. Also, maybe it's worth adding a convenience method for bulk-filling random data array? (Not sure about how the API should look like though.)

Wasting half (or more) of the generated random values is something I think should easily be avoidable by storing the unused random bytes for future usage instead of throwing them away. Using a separate store for u8, u16 (and later on, u32) should make the branches easily predictable. Does this crate make any guarantees about the values that a certain seed returns across versions?

SabrinaJewson · 2021-03-13T16:38:35Z

Wasting half (or more) of the generated random values is something I think should easily be avoidable by storing the unused random bytes for future usage instead of throwing them away.

I don't really think this is worth it. Especially if #14 gets merged the actual generation becomes really cheap, and caching values would require both adding branches and increasing the size of the RNG. Also, I don't know of any other RNG implementations that do this.

Does this crate make any guarantees about the values that a certain seed returns across versions?

It's not explicitly specified anywhere, but I would assume not, otherwise there would never be any possibility to change the RNG algorithm.

piegamesde · 2021-03-13T20:44:20Z

The new Rng algorithm is indeed faster, but not by orders of magnitude (about 15% maybe?). Interestingly, generating usize is still slower than u8/u16 for some reason.

ironhaven · 2021-11-21T21:06:31Z

I have a theory the performance discrepancy might be due to way the code is written not the library. There is a branch in the middle of what looks like a hot loop and this can slow things down surprisingly. If there was only one code path the CPU can work ahead on computing the random index. With the if statement the CPU has to predict which path it will take to stay at full speed. If it predicts wrong it will slow down. Only generating usizes would likely help with performance also considering WyRand generates u64s internally.

Havunen · 2024-07-21T09:17:49Z

It seems https://github.com/wangyi-fudan/wyhash wyhash has evolved into https://github.com/Nicoshev/rapidhash could that be used to improve perf of this library further

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance optimizations #13

Performance optimizations #13

piegamesde commented Mar 12, 2021

SabrinaJewson commented Mar 13, 2021

piegamesde commented Mar 13, 2021

SabrinaJewson commented Mar 13, 2021

piegamesde commented Mar 13, 2021

SabrinaJewson commented Mar 13, 2021

piegamesde commented Mar 13, 2021

ironhaven commented Nov 21, 2021

Havunen commented Jul 21, 2024

Performance optimizations #13

Performance optimizations #13

Comments

piegamesde commented Mar 12, 2021

SabrinaJewson commented Mar 13, 2021

piegamesde commented Mar 13, 2021

SabrinaJewson commented Mar 13, 2021

piegamesde commented Mar 13, 2021

SabrinaJewson commented Mar 13, 2021

piegamesde commented Mar 13, 2021

ironhaven commented Nov 21, 2021

Havunen commented Jul 21, 2024