
Performance improvements for shuffle and partial_shuffle #1272

Merged · 11 commits · Jan 8, 2023

Conversation

@wainwrightmark (Contributor)

This is related to #1266 but orthogonal to #1268, which improves the performance of a different set of methods in a different way.

This improves the performance of SliceRandom::shuffle() and SliceRandom::partial_shuffle() by essentially batching the random number generation.

It seems to be about 50–100% faster for most slice lengths, with smaller gains for longer slices. Slices longer than 2^32 elements fall back to the old method.

This is a value-breaking change.

Benchmark results

Partial Shuffle

Partial shuffle of half of the slice:

| Number of elements | RNG | Old (ns/iter) | New (ns/iter) | Ratio (old:new) |
|---:|---|---:|---:|---:|
| 10 | CryptoRng | 44 | 27 | 1.63 |
| 10 | SmallRng | 30 | 21 | 1.43 |
| 100 | CryptoRng | 409 | 164 | 2.49 |
| 100 | SmallRng | 270 | 130 | 2.08 |
| 1000 | CryptoRng | 3,616 | 1,993 | 1.81 |
| 1000 | SmallRng | 2,474 | 1,364 | 1.81 |
| 10000 | CryptoRng | 38,607 | 26,286 | 1.47 |
| 10000 | SmallRng | 27,824 | 16,927 | 1.64 |

Shuffle

| Number of elements | RNG | Old (ns/iter) | New (ns/iter) | Ratio (old:new) |
|---:|---|---:|---:|---:|
| 1 | CryptoRng | 0 | 0 | N/A |
| 1 | SmallRng | 0 | 0 | N/A |
| 2 | CryptoRng | 11 | 8 | 1.38 |
| 2 | SmallRng | 8 | 5 | 1.60 |
| 3 | CryptoRng | 18 | 9 | 2.00 |
| 3 | SmallRng | 13 | 6 | 2.17 |
| 10 | CryptoRng | 88 | 25 | 3.52 |
| 10 | SmallRng | 61 | 16 | 3.81 |
| 100 | CryptoRng | 872 | 325 | 2.68 |
| 100 | SmallRng | 552 | 245 | 2.25 |
| 1000 | CryptoRng | 7,219 | 3,910 | 1.85 |
| 1000 | SmallRng | 5,057 | 2,650 | 1.91 |
| 10000 | CryptoRng | 76,061 | 50,041 | 1.52 |
| 10000 | SmallRng | 55,682 | 32,715 | 1.70 |

@TheIronBorn (Collaborator)

Huh, I've always thought shuffling was memory bound. Impressive work

@wainwrightmark (Contributor, Author)

If we did have a way of determining native output size (#1261), using 64-bit chunks would give significant performance improvements when shuffling longer sequences. Unfortunately, this would lead to different values on 32-bit and 64-bit machines.

@dhardy (Member) commented Jan 4, 2023

> Unfortunately, this would lead to different values on 32-bit and 64-bit machines.

Only where a different RNG is used on these machines (e.g. SmallRng, which is a platform-dependent type-def), in which case results would already differ.

There is a question of which chunk size we should use by default given that 64-bit CPUs are now the norm, though it would penalise results for short lists with e.g. ChaCha.

Actually, we should run benchmarks with both chunk sizes and several RNGs (e.g. Pcg32, Pcg64 and ChaCha12; we don't need two variants of ChaCha), as a hack that doesn't need to be committed. I'll do this for your other PR.

@wainwrightmark (Contributor, Author)

I ran the benchmarks comparing the u32 and u64 versions. Unsurprisingly, the main factor seems to be the number of elements. Regardless of the RNG, with 10 or fewer elements u64 is a lot slower, with 100 it's about the same, and with 1000 or 10000 you start to see the benefit.

Of course, this is me running on a 64-bit machine; on a 32-bit machine all those 64-bit div operations would probably be a lot slower.

| Method | Elements | RNG | Unit | u32 | u64 | Ratio (u64:u32) |
|---|---:|---|---|---:|---:|---:|
| shuffle | 1 | ChaCha12 | ps | 147.86 | 161 | 1.09 |
| shuffle | 2 | ChaCha12 | ns | 7.6755 | 17.923 | 2.34 |
| shuffle | 3 | ChaCha12 | ns | 9.0967 | 19.445 | 2.14 |
| shuffle | 10 | ChaCha12 | ns | 19.83 | 43.989 | 2.22 |
| partial_shuffle | 10 | ChaCha12 | ns | 22.278 | 37.443 | 1.68 |
| shuffle | 100 | ChaCha12 | ns | 296 | 310.14 | 1.05 |
| partial_shuffle | 100 | ChaCha12 | ns | 153.41 | 164.68 | 1.07 |
| shuffle | 1000 | ChaCha12 | µs | 3.3972 | 3.239 | 0.95 |
| partial_shuffle | 1000 | ChaCha12 | µs | 1.7698 | 1.5642 | 0.88 |
| shuffle | 10000 | ChaCha12 | µs | 42.689 | 35.48 | 0.83 |
| partial_shuffle | 10000 | ChaCha12 | µs | 22.199 | 17.56 | 0.79 |
| shuffle | 1 | Pcg32 | ps | 153.99 | 158.82 | 1.03 |
| shuffle | 2 | Pcg32 | ns | 6.2062 | 13.458 | 2.17 |
| shuffle | 3 | Pcg32 | ns | 7.2483 | 15.499 | 2.14 |
| shuffle | 10 | Pcg32 | ns | 16.559 | 38.7 | 2.34 |
| partial_shuffle | 10 | Pcg32 | ns | 29.098 | 33.407 | 1.15 |
| shuffle | 100 | Pcg32 | ns | 238.94 | 281.99 | 1.18 |
| partial_shuffle | 100 | Pcg32 | ns | 128.43 | 147.25 | 1.15 |
| shuffle | 1000 | Pcg32 | µs | 2.667 | 2.6314 | 0.99 |
| partial_shuffle | 1000 | Pcg32 | µs | 1.3933 | 1.2401 | 0.89 |
| shuffle | 10000 | Pcg32 | µs | 32.283 | 28.93 | 0.90 |
| partial_shuffle | 10000 | Pcg32 | µs | 16.62 | 14.075 | 0.85 |
| shuffle | 1 | Pcg64 | ps | 176.42 | 149.3 | 0.85 |
| shuffle | 2 | Pcg64 | ns | 7.4931 | 13.865 | 1.85 |
| shuffle | 3 | Pcg64 | ns | 8.5974 | 15.291 | 1.78 |
| shuffle | 10 | Pcg64 | ns | 19.851 | 37.843 | 1.91 |
| partial_shuffle | 10 | Pcg64 | ns | 21.434 | 30.914 | 1.44 |
| shuffle | 100 | Pcg64 | ns | 265.08 | 291.27 | 1.10 |
| partial_shuffle | 100 | Pcg64 | ns | 141.66 | 151.39 | 1.07 |
| shuffle | 1000 | Pcg64 | µs | 3.0694 | 2.7147 | 0.88 |
| partial_shuffle | 1000 | Pcg64 | µs | 1.6057 | 1.3093 | 0.82 |
| shuffle | 10000 | Pcg64 | µs | 39.199 | 29.133 | 0.74 |
| partial_shuffle | 10000 | Pcg64 | µs | 20.264 | 14.098 | 0.70 |

@dhardy (Member) commented Jan 5, 2023

Pcg64 being faster with 64-bit chunks at large sizes is not surprising, since the 32-bit version discards random bits, but the significant losses at ≤ 10 elements and only moderate wins at 10,000 elements mean it's still questionable whether 64-bit chunks are an improvement.

Meanwhile ChaCha and Pcg32 also see gains despite not discarding random bits in the same way.

We could use this to select a different shuffling implementation based on the slice length, regardless of RNG algorithm. But is there enough interest in shuffling large slices to justify the extra complexity? Probably better to stick with 32-bit only.

Thanks for the extra benchmarks @wainwrightmark.

@wainwrightmark (Contributor, Author)

> Pcg64 being faster with 64-bit chunks at large sizes is not surprising, since the 32-bit version discards random bits, but the significant losses at ≤ 10 elements and only moderate wins at 10,000 elements mean it's still questionable whether 64-bit chunks are an improvement.
>
> Meanwhile ChaCha and Pcg32 also see gains despite not discarding random bits in the same way.
>
> We could use this to select a different shuffling implementation based on the slice length, regardless of RNG algorithm. But is there enough interest in shuffling large slices to justify the extra complexity? Probably better to stick with 32-bit only.
>
> Thanks for the extra benchmarks @wainwrightmark.

I agree.

@dhardy (Member) left a comment


Please rebase and add a copyright header to increasing_uniform.rs (also coin_flipper.rs, missed in the last PR).

Please also run rustfmt on src/seq/increasing_uniform.rs and wrap long comments. Fewer comments may be better; detailed explanations like this risk becoming outdated (I mostly didn't read them).

```rust
        self.swap(i, index);
    }
} else {
    for i in m..self.len() {
```
@dhardy (Member):

You need to reverse the iterator (both loops). Your code can only "choose" the last element of the list with probability 1/len when it should be m/len.

@wainwrightmark (Contributor, Author):

Ermm, I'm pretty sure I've got this right. The last element gets swapped to a random place in the list, so it has an m/len probability of being in the first m elements. Earlier elements are more likely to be chosen initially but can get booted out by later ones. The test_shuffle test checks this, and I've also tried similar tests with longer lists and more runs.

The reason I don't reverse the iterator is that the increasing_uniform needs i to increase; a decreasing version would be more complicated. A sanity check of the resulting distribution is sketched below.

@dhardy (Member):

Okay. We previously reversed because that makes the proof by induction easier. But we can also prove this algorithm works.

First, let's not use m = len - amount, since in the last PR we used m = amount. I'll continue to use end = len - amount.

Let's say we have a list elts = [e0, e1, .., ei, ..] of length len. Elements are "chosen" if they appear in elts[end..len] after the algorithm; additionally, we need to show that this slice is fully shuffled.

The algorithm is:

```rust
for i in end..len {
    elts.swap(i, rng.sample_range(0..=i));
}
```

For any length, with amount = 0 or amount = 1, this is clearly correct. We'll prove by induction, assuming that the algorithm is already proven correct for amount-1 and len-1 (so that end does not change and the algorithm only has one last swap to perform).

Thus, we assume:

  • For any elt ei, we have P(ei in elts[0..end]) = end/(len-1) [here we say nothing about element order]
  • For any elt ei, for any k in end..(len-1), P(elts[k] = ei) = (amount-1)/(len-1) [fully shuffled]

We perform the last step of the algorithm: let x = rng.sample_range(0..=len-1); elts.swap(len-1, x);. Now:

  • Any element in elts[0..end] is moved to elts[len-1] with probability 1/len, thus for any elt ei except e_last, P(ei in elts[0..end]) = end/(len-1) * (len-1)/len = end/len
  • For any elt ei previously in elts[end..len-1], the chance it is not moved is (len-1)/len, thus, for these ei, for any k in end..(len-1), P(elts[k] = ei) = (amount-1)/(len-1) * (len-1)/len = (amount-1)/len
  • For any elt ei previously in elts[end..len-1], P(elts[len-1] = ei) = 1/len
  • The previous two points together imply that for any ei previously in elts[end..len-1], for any k in end..len, P(elts[k] = ei) = (amount-1+1)/len = amount/len
  • Element e_last may appear in any position with probability 1/len

Thus each element has chance amount/len to appear in elts[end..len], and this slice is fully shuffled.

Comment on src/seq/increasing_uniform.rs, lines +49 to +62:
```rust
let r = self.chunk % next_n;
self.chunk /= next_n;
r as usize
```
@dhardy (Member):

There's probably also room for further optimisation here: modulus is a slow operation (see https://www.pcg-random.org/posts/bounded-rands.html).

@wainwrightmark (Contributor, Author):

I did read that article, and it helped me find some of the optimizations I used for this. I also tried a method based on bitmasks, but it turned out about 50% slower than this. Obviously I could easily have missed something.

@dhardy (Member) left a comment


👍
