
Performance improvements for shuffle and partial_shuffle #1272

Merged · 11 commits · Jan 8, 2023

Conversation

@wainwrightmark (Contributor)

This is related to #1266 but orthogonal to #1268, which improves the performance of a different set of methods in a different way.

This improves the performance of SliceRandom::shuffle() and SliceRandom::partial_shuffle() by essentially batching the random number generation.

It seems to be about 50–100% faster for most slice lengths, with smaller gains for longer slices. Slices longer than 2^32 elements fall back to the old method.

This is a value-breaking change.

Benchmark results

Partial Shuffle

Partial shuffle of half of the slice:

| Number of elements | RNG | Old (ns/iter) | New (ns/iter) | Ratio (old:new) |
|---:|---|---:|---:|---:|
| 10 | CryptoRng | 44 | 27 | 1.63 |
| 10 | SmallRng | 30 | 21 | 1.43 |
| 100 | CryptoRng | 409 | 164 | 2.49 |
| 100 | SmallRng | 270 | 130 | 2.08 |
| 1000 | CryptoRng | 3,616 | 1,993 | 1.81 |
| 1000 | SmallRng | 2,474 | 1,364 | 1.81 |
| 10000 | CryptoRng | 38,607 | 26,286 | 1.47 |
| 10000 | SmallRng | 27,824 | 16,927 | 1.64 |

Shuffle

| Number of elements | RNG | Old (ns/iter) | New (ns/iter) | Ratio (old:new) |
|---:|---|---:|---:|---:|
| 1 | CryptoRng | 0 | 0 | N/A |
| 1 | SmallRng | 0 | 0 | N/A |
| 2 | CryptoRng | 11 | 8 | 1.38 |
| 2 | SmallRng | 8 | 5 | 1.60 |
| 3 | CryptoRng | 18 | 9 | 2.00 |
| 3 | SmallRng | 13 | 6 | 2.17 |
| 10 | CryptoRng | 88 | 25 | 3.52 |
| 10 | SmallRng | 61 | 16 | 3.81 |
| 100 | CryptoRng | 872 | 325 | 2.68 |
| 100 | SmallRng | 552 | 245 | 2.25 |
| 1000 | CryptoRng | 7,219 | 3,910 | 1.85 |
| 1000 | SmallRng | 5,057 | 2,650 | 1.91 |
| 10000 | CryptoRng | 76,061 | 50,041 | 1.52 |
| 10000 | SmallRng | 55,682 | 32,715 | 1.70 |

@TheIronBorn (Collaborator)

Huh, I've always thought shuffling was memory bound. Impressive work

@wainwrightmark (Contributor, Author)

If we did have a way of determining native output size (#1261), using 64-bit chunks would give significant performance improvements when shuffling longer sequences. Unfortunately, this would lead to different values on 32-bit and 64-bit machines.

@dhardy (Member) commented Jan 4, 2023

> Unfortunately, this would lead to different values on 32-bit and 64-bit machines.

Only where a different RNG is used on these machines (e.g. SmallRng, which is a platform-dependent type-def), in which case results would already differ.

There is a question of which chunk size we should use by default given that 64-bit CPUs are now the norm, though it would penalise results for short lists with e.g. ChaCha.

Actually, we should run benchmarks with both chunk sizes and several RNGs (e.g. Pcg32, Pcg64 and ChaCha12; we don't need two variants of ChaCha), as a hack that doesn't need to be committed. I'll do this for your other PR.

@wainwrightmark (Contributor, Author)

I ran the benchmarks comparing the u32 and u64 versions. Unsurprisingly, the main factor seems to be the number of elements. Regardless of the RNG, with 10 or fewer elements u64 is a lot slower, with 100 it's about the same, and with 1000 or 10000 you start to see the benefit.

Of course, this is me running on a 64-bit machine; on a 32-bit machine all those 64-bit div operations would probably be a lot slower.

| Method | Elements | RNG | Unit | u32 | u64 | Ratio (u64:u32) |
|---|---:|---|---|---:|---:|---:|
| shuffle | 1 | ChaCha12 | ps | 147.86 | 161 | 1.09 |
| shuffle | 2 | ChaCha12 | ns | 7.6755 | 17.923 | 2.34 |
| shuffle | 3 | ChaCha12 | ns | 9.0967 | 19.445 | 2.14 |
| shuffle | 10 | ChaCha12 | ns | 19.83 | 43.989 | 2.22 |
| partial_shuffle | 10 | ChaCha12 | ns | 22.278 | 37.443 | 1.68 |
| shuffle | 100 | ChaCha12 | ns | 296 | 310.14 | 1.05 |
| partial_shuffle | 100 | ChaCha12 | ns | 153.41 | 164.68 | 1.07 |
| shuffle | 1000 | ChaCha12 | µs | 3.3972 | 3.239 | 0.95 |
| partial_shuffle | 1000 | ChaCha12 | µs | 1.7698 | 1.5642 | 0.88 |
| shuffle | 10000 | ChaCha12 | µs | 42.689 | 35.48 | 0.83 |
| partial_shuffle | 10000 | ChaCha12 | µs | 22.199 | 17.56 | 0.79 |
| shuffle | 1 | Pcg32 | ps | 153.99 | 158.82 | 1.03 |
| shuffle | 2 | Pcg32 | ns | 6.2062 | 13.458 | 2.17 |
| shuffle | 3 | Pcg32 | ns | 7.2483 | 15.499 | 2.14 |
| shuffle | 10 | Pcg32 | ns | 16.559 | 38.7 | 2.34 |
| partial_shuffle | 10 | Pcg32 | ns | 29.098 | 33.407 | 1.15 |
| shuffle | 100 | Pcg32 | ns | 238.94 | 281.99 | 1.18 |
| partial_shuffle | 100 | Pcg32 | ns | 128.43 | 147.25 | 1.15 |
| shuffle | 1000 | Pcg32 | µs | 2.667 | 2.6314 | 0.99 |
| partial_shuffle | 1000 | Pcg32 | µs | 1.3933 | 1.2401 | 0.89 |
| shuffle | 10000 | Pcg32 | µs | 32.283 | 28.93 | 0.90 |
| partial_shuffle | 10000 | Pcg32 | µs | 16.62 | 14.075 | 0.85 |
| shuffle | 1 | Pcg64 | ps | 176.42 | 149.3 | 0.85 |
| shuffle | 2 | Pcg64 | ns | 7.4931 | 13.865 | 1.85 |
| shuffle | 3 | Pcg64 | ns | 8.5974 | 15.291 | 1.78 |
| shuffle | 10 | Pcg64 | ns | 19.851 | 37.843 | 1.91 |
| partial_shuffle | 10 | Pcg64 | ns | 21.434 | 30.914 | 1.44 |
| shuffle | 100 | Pcg64 | ns | 265.08 | 291.27 | 1.10 |
| partial_shuffle | 100 | Pcg64 | ns | 141.66 | 151.39 | 1.07 |
| shuffle | 1000 | Pcg64 | µs | 3.0694 | 2.7147 | 0.88 |
| partial_shuffle | 1000 | Pcg64 | µs | 1.6057 | 1.3093 | 0.82 |
| shuffle | 10000 | Pcg64 | µs | 39.199 | 29.133 | 0.74 |
| partial_shuffle | 10000 | Pcg64 | µs | 20.264 | 14.098 | 0.70 |

@dhardy (Member) commented Jan 5, 2023

Pcg64 being faster with 64-bit chunks at large sizes is not surprising, since the 32-bit version discards random bits, but the significant losses at ≤ 10 elements and only moderate wins at 10,000 elements mean it's still questionable whether 64-bit chunks are an improvement.

Meanwhile ChaCha and Pcg32 also see gains despite not discarding random bits in the same way.

We could use this to select a different shuffling implementation based on the slice length, regardless of RNG algorithm. But is there enough interest in shuffling large slices to justify the extra complexity? Probably better to stick with 32-bit only.

Thanks for the extra benchmarks @wainwrightmark.

@wainwrightmark (Contributor, Author)

> Pcg64 being faster with 64-bit chunks at large sizes is not surprising, since the 32-bit version discards random bits, but the significant losses at ≤ 10 elements and only moderate wins at 10,000 elements mean it's still questionable whether 64-bit chunks are an improvement.
>
> Meanwhile ChaCha and Pcg32 also see gains despite not discarding random bits in the same way.
>
> We could use this to select a different shuffling implementation based on the slice length, regardless of RNG algorithm. But is there enough interest in shuffling large slices to justify the extra complexity? Probably better to stick with 32-bit only.
>
> Thanks for the extra benchmarks @wainwrightmark.

I agree.

@dhardy (Member) left a comment


Please rebase and add a copyright header to increasing_uniform.rs (also coin_flipper.rs, missed in the last PR).

Please also run rustfmt on src/seq/increasing_uniform.rs and wrap long comments. Fewer comments may be better; detailed explanations like this risk becoming outdated (I mostly didn't read them).

```rust
        self.swap(i, index);
    }
} else {
    for i in m..self.len() {
```
@dhardy (Member):

You need to reverse the iterator (both loops). Your code can only "choose" the last element of the list with probability 1/len when it should be m/len.

@wainwrightmark (Contributor, Author):

Ermm, I'm pretty sure I've got this right. The last element gets swapped to a random place in the list, so it has an m/len probability of being in the first m elements. Earlier elements are more likely to be chosen initially but can get booted out by later ones. The test_shuffle test checks this, and I've also tried similar tests with longer lists and more runs.

The reason I don't reverse the iterator is that the increasing_uniform needs i to increase; a decreasing version would be more complicated. A sanity check of the resulting distribution is sketched below.

@dhardy (Member):

Okay. We previously reversed because that makes the proof by induction easier. But we can also prove this algorithm works.

First, let's not use m = len - amount, since in the last PR we used m = amount. I'll continue to use end = len - amount.

Let's say we have a list elts = [e0, e1, .., ei, ..] of length len. Elements are "chosen" if they appear in elts[end..len] after the algorithm; additionally, we need to show that this slice is fully shuffled.

The algorithm is:

```rust
for i in end..len {
    elts.swap(i, rng.sample_range(0..=i));
}
```

For any length, with amount = 0 or amount = 1, this is clearly correct. We'll prove by induction, assuming that the algorithm is already proven correct for amount-1 and len-1 (so that end does not change and the algorithm only has one last swap to perform).

Thus, we assume:

  • For any elt ei, we have P(ei in elts[0..end]) = end/(len-1) [here we say nothing about element order]
  • For any elt ei, for any k in end..(len-1), P(elts[k] = ei) = (amount-1)/(len-1) [fully shuffled]

We perform the last step of the algorithm: let x = rng.sample_range(0..=len-1); elts.swap(len-1, x);. Now:

  • Any element in elts[0..end] is moved to elts[len-1] with probability 1/len, thus for any elt ei except e_last, P(ei in elts[0..end]) = end/(len-1) * (len-1)/len = end/len
  • For any elt ei previously in elts[end..len-1], the chance it is not moved is (len-1)/len, thus, for these ei, for any k in end..(len-1), P(elts[k] = ei) = (amount-1)/(len-1) * (len-1)/len = (amount-1)/len
  • For any elt ei previously in elts[end..len-1], P(elts[len-1] = ei) = 1/len
  • The previous two points together imply that for any ei previously in elts[end..len-1], for any k in end..len, P(elts[k] = ei) = (amount-1+1)/len = amount/len
  • Element e_last may appear in any position with probability 1/len

Thus each element has chance amount/len to appear in elts[end..len], and this slice is fully shuffled.

Comment on src/seq/increasing_uniform.rs, lines +49 to +62:
```rust
let r = self.chunk % next_n;
self.chunk /= next_n;
r as usize
```
@dhardy (Member):

There's probably also room for further optimisation here: modulus is a slow operation (see https://www.pcg-random.org/posts/bounded-rands.html).

@wainwrightmark (Contributor, Author):

I did read that article, and it helped me find some of the optimizations I used for this. I also tried a method based on bitmasks, but it turned out about 50% slower than this. Obviously I could easily have missed something.

@dhardy (Member) left a comment


👍
