task-local xoshiro rebased #40546

JeffBezanson · 2021-04-20T21:02:33Z

This is a cleaned-up and rebased version of #34852.

@rfourquet Please make sure I addressed your comments correctly. There is also a stack overflow in the tests due (I think) to missing overloads for the new RNG types; please advise on what should be added.

@chriselrod I haven't touched the SIMD stuff; let me know if there are specific changes I should make, or feel free to add commits here if needed.

I included a separate commit to change the default RNG to the task-local one, to see what that would look like. The main issues are

We don't promise the same random streams between releases, so that part is ok, but there are also APIs for e.g. copying between the global state and MersenneTwister, which could no longer be supported. Is that ok?
The copy! and == methods for TaskLocal are a bit sketchy, since TaskLocal is a singleton but the operations actually affect the task-local state. Do we care?
Single generation (rand()) of all types, and bulk generation of integers seem to be faster, but bulk generation of floats seems to be slower. Can this be optimized further?

stdlib/Random/src/RNGs.jl

rfourquet · 2021-04-21T14:46:17Z

stdlib/Random/src/RNGs.jl

+
+function seed!(rng::Union{TaskLocal, Xoshiro}, seed::AbstractVector{UInt64})
+    if length(seed) > 4
+        throw(ArgumentError("seed should have no more than 256 bits"))


This might break code which does seed!(rand(UInt32, 100)). So we might have to run PkgEval, and I could also easily add a version which accepts arbitrary lenght vectors by cryptographically hashing it (I already did that in another unmerged PR, so will be very easy to adapt the code).

rfourquet · 2021-04-21T14:49:00Z

stdlib/Random/src/RNGs.jl

-end
-@noinline _rng_length_assert() =  @assert false "0 < tid <= length(THREAD_RNGs)"
+@inline default_rng() = TaskLocal()
+@inline default_rng(tid::Int) = TaskLocal()


Is this still used? otherwise:

Suggested change

@inline default_rng(tid::Int) = TaskLocal()

Not used here, and not exported. Just wondering if it's used in packages.

rfourquet · 2021-04-21T14:56:25Z

stdlib/Random/src/XoshiroSimd.jl

+
+module XoshiroSimd
+# Getting the xoroshiro RNG to reliably vectorize is somewhat of a hassle without Simd.jl.
+import ..Random: TaskLocal, rand, rand!, UnsafeView, Xoshiro, SamplerType, CloseOpen12, SamplerTrivial


Suggested change

import ..Random: TaskLocal, rand, rand!, UnsafeView, Xoshiro, SamplerType, CloseOpen12, SamplerTrivial

import ..Random: TaskLocal, rand, rand!, Xoshiro, SamplerType, CloseOpen01, SamplerTrivial

AFAICT:

UnsafeView is unused by Xoshiro, so methods for it are useless. (UnsafeView is used only internally for MersenneTwister, at no point a rand! method will be called on UnsafeView for another RNG type.)

Sorry to have mistakenly written CloseOpen12 in a comment from the original PR, here we generate in [0, 1) so should be CloseOpen01 (might be worth adding a version for [1, 2) but not urgent and I can take care of that). AFAICT this solves the efficiency issue for bulk generation (the methods below were not called because of that).

I can push a commit fixing those two issues in this file, or let you do it, as you like.

Done. That fixes it for rand; is there anything we can do for randn?

is there anything we can do for randn?

I could do the same trick as was done for MersenneTwister, IIRC it lead to a (very roughly) 2x speedup.

I implemented it in the branch rf/xoshiro/randn, you can incorporate the commit here, or I will open a PR once this one is merged. I actually had to add back the rand! method for UnsafeView ;-p

stdlib/Random/test/runtests.jl

rfourquet · 2021-04-21T15:17:22Z

stdlib/Random/test/runtests.jl

@@ -627,7 +627,7 @@ guardseed() do
    m = MersenneTwister(0)
    @test Random.seed!() === g
    @test Random.seed!(rand(UInt)) === g
-    @test Random.seed!(rand(UInt32, rand(1:10))) === g
+    @test Random.seed!(rand(UInt32, rand(1:8))) === g


Out of curiosity, why this change?

Because of the error for using too-long seeds. If we remove the error then this is ok.

Ah right of course, for some reason I thought 1:10 was the sampler from which values were sampled.

rfourquet · 2021-04-21T15:28:25Z

but there are also APIs for e.g. copying between the global state and MersenneTwister, which could no longer be supported. Is that ok?

copy etc. was supported, but if I'm not mistaken, neither Random.default_rng() nor Random.GLOBAL_RNG were documented or exported. Of course this was used in packages (I'm not sure to which extent), so I would say it's OK to not support that anymore (at least we are not tied by semver for this).

The copy! and == methods for TaskLocal are a bit sketchy, since TaskLocal is a singleton but the operations actually affect the task-local state. Do we care?

I'm not sure I understand, but if you refer to copy(::TaskLocal) returning a Xoshiro, this seems fine to me, and is useful. As Kristoffer noticed very recently (probably on slack's #gripes), there is precedent of that, with copy(ENV) returning a Dict, or copy on views.

Single generation (rand()) of all types, and bulk generation of integers seem to be faster, but bulk generation of floats seems to be slower. Can this be optimized further?

Should be trivially fixed, see inline comment.

rfourquet · 2021-04-21T15:29:40Z

I don't think I saw a discussion on Xoshiro256++ vs. Xoshiro256**, why was Xoshiro256++ preferred? (@chethega).

stdlib/Random/src/Random.jl

rfourquet · 2021-04-21T15:48:49Z

stdlib/Random/src/XoshiroSimd.jl

+using Base: BitInteger_types
+
+# Vector-width. Influences random stream. We may want to tune this before merging.
+xoshiroWidth() = Val(8)


Also from https://github.com/JuliaLang/julia/pull/34852/files#r469740269 :

The capitalization in this file seems a bit off (camelCase). Is that intentional?

chriselrod · 2021-04-21T16:50:28Z

stdlib/Random/src/RNGs.jl

+    s3::UInt64
+end
+
+Xoshiro(::Nothing) = Xoshiro()


Is this used somewhere without initializing seeds?
Junk is likely to be 0-heavy, and Xoshiro doesn't like 0-heavy inits.

It will end up using RandomDevice, but I also wonder why this exists. It's only here to copy other RNGs that also have such methods. Similarly, why do we have a seed!(rng, nothing) method?

Xoshiro() randomizes all the fields with RandomDevice(); having nothing equivalent to no explicitly passed seed can be useful to forward seeds, like in e.g. myalgo(...; seed=nothing). I don't think this is documented though...

chriselrod · 2021-04-21T17:12:26Z

stdlib/Random/src/XoshiroSimd.jl

+        ret <$N x i64> %res
+        """
+        @eval @inline _rotl45(x::NTuple{$N, VecElement{UInt64}}) = Core.Intrinsics.llvmcall($code,
+            NTuple{$N, VecElement{UInt64}},


https://llvm.org/docs/LangRef.html#llvm-fshl-intrinsic
LLVM now has the "funnel shift left/right" intrinsics for rotates.

For example

using Base: llvmcall, VecElement for N in [1,2,4,8,16,32] s = ntuple(_ -> Core.VecElement(45%UInt64), Val(N)) op = "llvm.fshl.v$(N)i64" @eval @inline function _rotl45(x::NTuple{$N,VecElement{UInt64}}) ccall($op, llvmcall, NTuple{$N,VecElement{UInt64}}, (NTuple{$N,VecElement{UInt64}},NTuple{$N,VecElement{UInt64}},NTuple{$N,VecElement{UInt64}}), x, x, $s) end end

A little shorter, and as LLVM's documentation specifically says that rotate can be implemented by passing the same vector as the first two arguments, we can be confident that this'll reliably produce rotate instructions. E.g.:

julia> vxu = ntuple(_ -> VecElement(rand(UInt64)), Val(8)); julia> @code_native debuginfo=:none syntax=:intel _rotl45(vxu) .text vprolq zmm0, zmm0, 45 ret nop dword ptr [rax + rax]

I didn't try the current definition, but I believe (last time I tried) that it was also optimized into a rotate.

Using too long a vector naturally does exactly what we'd want it to:

julia> vxu = ntuple(_ -> VecElement(rand(UInt64)), Val(16)); julia> @cn _rotl45(vxu) .text vprolq zmm0, zmm0, 45 vprolq zmm1, zmm1, 45 ret nop

So xoshiroWidth() = 8 should be fine. 512-bit SIMD architectures won't get as much out of order execution as others, but that's not the end of the world. That's what SMT is for...

I just confirmed that the current code in the PR produces the same assembly on my machine.

chriselrod · 2021-04-21T17:15:31Z

stdlib/Random/src/XoshiroSimd.jl

+        ret <$N x i64> %res
+        """
+        @eval @inline _lshr(x::NTuple{$N, VecElement{UInt64}}, y::Int64) = Core.Intrinsics.llvmcall($code,
+            NTuple{$N, VecElement{UInt64}},


I think we could just use ccall for all of these and interpolate the VecElements, like in my rotate example.

As an aside, I'd love it if Julia's codegen lowered NTuple{N,VecElement{T}}s with the same operations as Ts. I assume this was already discussed somewhere (and rejected?).

chriselrod · 2021-04-21T17:22:36Z

stdlib/Random/src/XoshiroSimd.jl

+        s0, s1, s2, s3 = task.rngState0, task.rngState1, task.rngState2, task.rngState3
+    else
+        rng::Xoshiro
+        s0, s1, s2, s3 = rng.s0, rng.s1, rng.s2, rng.s3


Could use the fancy new syntax:

(; s0, s1, s2, s3) = rng::Xoshiro

chriselrod · 2021-04-21T17:35:11Z

stdlib/Random/src/XoshiroSimd.jl

+
+function rand!(rng::Union{TaskLocal, Xoshiro}, dst::Union{Array{Float64}, UnsafeView{Float64}}, ::SamplerTrivial{CloseOpen12{Float64}})
+    GC.@preserve dst xoshiro_bulk(rng, convert(Ptr{UInt8}, pointer(dst)), length(dst)*8, Float64, xoshiroWidth())
+    @inbounds @simd for i = 1:length(dst)


Can we add a single pass method?
We could make xoshiro_bulk_simd except a function argument to apply to toWrite before storing it, which can do the masking to Float(64/32) and subtract 1.0.

For big arrays, passing over the data just once will likely be a big performance win.

test/threads_exec.jl

JeffBezanson · 2021-04-21T21:37:25Z

Plenty of random-stream-dependent tests to fix here ... 😅

fredrikekre · 2021-04-21T21:49:13Z

For doctests there is make -C doc html doctest=fix to automatically fix them.

JeffBezanson · 2021-04-23T00:16:30Z

Ok there is something wrong with the simd code causing memory corruption; that's the cause of the bitarray failures.

JeffBezanson · 2021-04-23T17:05:35Z

Made some fixes to the simd code. Would be good if somebody can review.

JeffBezanson · 2021-04-23T19:20:16Z

Another unanticipated change: MersenneTwister saves its original seed. The test system uses that, and I can fix it by saving the seed of the global default generator, but I don't know if there are other uses we should worry about.

rfourquet · 2021-04-23T19:27:47Z

MersenneTwister saves its original seed. The test system uses that, and I can fix it by saving the seed of the global default generator, but I don't know if there are other uses we should worry about.

I noticed that too, and I think that's OK without needing to save the seed of the global RNG. It's kinda breaking as this was documented, but I think that most of the benefit for the test system is kept by restoring at the end of a @testset the state of the global RNG as it was before the @testset started (this behavior is maintained in this PR). I don't think this was relied upon in packages, it's more of an interactive feature useful for debugging failing tests, and it's not too difficult to work around that (e.g. by avoiding calls to rand() outside of a @testset).

JeffBezanson · 2021-04-23T19:34:47Z

most of the benefit for the test system is kept by restoring at the end of a @testset the state of the global RNG as it was before the @testset started (this behavior is maintained in this PR)

I agree, but then there is this evil test:

    a = rand()
    @testset begin
        # global RNG must re-seeded at the beginning of @testset
        @test a == rand()

I can just keep it working for now.

rfourquet · 2021-04-23T19:38:09Z

I agree, but then there is this evil test:

Sure, this was testing the feature in question, so if we drop the feature we can drop the test...

I can just keep it working for now.

You mean by saving the seed for TaskLocalRNG ? I though this would be out of question, as this would double the size taken for storing the RNG in each task... or did I misunderstand something?

JeffBezanson · 2021-04-23T19:41:22Z

Right. It's only tested for the case of seeding the global default RNG (seed!(seed)), which only requires saving a single global seed.

rfourquet · 2021-04-23T19:49:03Z

Yes this is great if this is made to work. To be explicit, the feature was meant to be useful in this context: one of your @testset fails, and you can reproduce the failure by doing Random.seed!(GLOBAL_RNG.seed) followed by including only the failed testset. An example of this in for Base tests, where the seed is shown in CI and you can re-run a given failed testset by copy pasting this seed. I didn't think enough about the task-local RNG setting to understand to which extent this can be replicated by storing a single seed globally.

JeffBezanson · 2021-04-23T19:51:36Z

We could even make GLOBAL_RNG.seed work via getproperty overloading if we want.

JeffBezanson · 2021-04-24T00:38:29Z

Linear algebra tests (ノಠ益ಠ)ノ彡┻━┻

stdlib/Random/src/Xoshiro.jl

stdlib/Random/src/normal.jl

JeffBezanson · 2021-05-25T20:36:41Z

Ok so we have 153 package test failures due to this; not TOO terrible really. I think this is ready to merge. Any objections?

oscardssmith · 2021-05-25T20:48:59Z

I vote merge and see what happens.

vtjnash · 2021-05-25T23:02:22Z

That sounds sad, but I’m assuming it is bad tests (and not numeric issues, say)?

Co-Authored-By: Jeff Bezanson <[email protected]> Co-Authored-by: Rafael Fourquet <[email protected]> - try to use high bits instead of low when we take a subset of them - use shift and multiply instead of mask and subtract for generating floats

oscardssmith · 2021-06-01T21:30:02Z

Is this ready to merge? Feature freeze is today.

JeffBezanson · 2021-06-01T22:09:53Z

Yes, but I adjusted the simd thresholds (they were way too high) and so now there are new test failures to fix...

chriselrod · 2021-08-28T19:44:51Z

Note that the approach of generating Float64 numbers here is only SIMD with AVX512, so SIMD performance is bad without it.

I don't recall where this was discussed, but seems worth it's own issue.

Two possibilities:

use feature detection to (given absence of AVX512) decompose the vector of 64 bit integers into two vectors of 32 bit integers, convert, and recombine. This has full accuracy and is much faster than what LLVM does (decompose the vector of 64 bit integers into scalars and convert these one at a time, and then recombine).
change algorithms to cast to [1,2) instead. This has the added bonus of once again changing streams and encouraging unit tests to use StableRNG. ;)

Alternatively, someone who knows LLVM could make a PR to do "1)" at the LLVM level.

JeffBezanson added randomness Random number generation and the Random stdlib multithreading Base.Threads and related functionality needs news A NEWS entry is required for this change needs docs Documentation for this change is required labels Apr 20, 2021

JeffBezanson force-pushed the jb/chethega/tlrng branch from db62af9 to 6d15eb7 Compare April 20, 2021 22:06

vtjnash mentioned this pull request Apr 21, 2021

WIP: Tasklocal xoshiro #34852

Closed

rfourquet reviewed Apr 21, 2021

View reviewed changes

stdlib/Random/src/Random.jl Outdated Show resolved Hide resolved

rfourquet reviewed Apr 21, 2021

View reviewed changes

chriselrod reviewed Apr 21, 2021

View reviewed changes

JeffBezanson force-pushed the jb/chethega/tlrng branch from 6d15eb7 to 00cf679 Compare April 21, 2021 18:24

rfourquet reviewed Apr 21, 2021

View reviewed changes

test/threads_exec.jl Outdated Show resolved Hide resolved

JeffBezanson force-pushed the jb/chethega/tlrng branch from 00cf679 to 74a4d01 Compare April 21, 2021 19:00

JeffBezanson added the triage This should be discussed on a triage call label Apr 22, 2021

JeffBezanson force-pushed the jb/chethega/tlrng branch 2 times, most recently from f1a6382 to c2bcc8f Compare April 23, 2021 22:24

rfourquet reviewed May 24, 2021

View reviewed changes

stdlib/Random/src/Xoshiro.jl Outdated Show resolved Hide resolved

rfourquet reviewed May 24, 2021

View reviewed changes

stdlib/Random/src/normal.jl Outdated Show resolved Hide resolved

rfourquet reviewed May 24, 2021

View reviewed changes

stdlib/Random/src/normal.jl Outdated Show resolved Hide resolved

rfourquet reviewed May 24, 2021

View reviewed changes

stdlib/Random/src/normal.jl Outdated Show resolved Hide resolved

JeffBezanson force-pushed the jb/chethega/tlrng branch from 501493b to d638c88 Compare May 24, 2021 20:32

simeonschaub mentioned this pull request May 25, 2021

Print methods(f) with color like stack traces #40251

Closed

JeffBezanson removed the needs news A NEWS entry is required for this change label May 25, 2021

JeffBezanson added this to the 1.7 milestone May 27, 2021

JeffBezanson removed the needs pkgeval Tests for all registered packages should be run with this change label May 28, 2021

JeffBezanson force-pushed the jb/chethega/tlrng branch from d638c88 to 0e3ea6e Compare June 1, 2021 19:17

make TaskLocal the default RNG

cfd940e

JeffBezanson force-pushed the jb/chethega/tlrng branch from 0e3ea6e to cfd940e Compare June 1, 2021 23:48

JeffBezanson merged commit e5884e7 into master Jun 2, 2021

JeffBezanson deleted the jb/chethega/tlrng branch June 2, 2021 06:24

This was referenced Jun 2, 2021

Better default RNG in the future? #27614

Closed

make rand unbiased for floats #33222

Closed

This was referenced Jun 3, 2021

Performance of threadsafe rand() #36418

Closed

Regression in rand due to inlining behavior #37030

Closed

Threaded RNG should use same stream at different locations not different streams #34386

Closed

Call to randn not inlined in randn! #31607

Closed

Seelengrab mentioned this pull request Jun 24, 2021

Random doc still talks about MersenneTwister instead of xoroshiro #41346

Closed

Pangoraw mentioned this pull request Sep 20, 2022

Random.seed! not reproducible fonsp/Pluto.jl#2290

Open

giordano mentioned this pull request Oct 5, 2024

Vectorise random vectors of Float16 #55997

Merged

	import ..Random: TaskLocal, rand, rand!, UnsafeView, Xoshiro, SamplerType, CloseOpen12, SamplerTrivial
	import ..Random: TaskLocal, rand, rand!, Xoshiro, SamplerType, CloseOpen01, SamplerTrivial

task-local xoshiro rebased #40546

task-local xoshiro rebased #40546

Conversation

JeffBezanson commented Apr 20, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rfourquet Apr 21, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rfourquet commented Apr 21, 2021

rfourquet commented Apr 21, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

chriselrod Apr 21, 2021 • edited Loading

Choose a reason for hiding this comment

JeffBezanson commented Apr 21, 2021

fredrikekre commented Apr 21, 2021

JeffBezanson commented Apr 23, 2021

JeffBezanson commented Apr 23, 2021

JeffBezanson commented Apr 23, 2021

rfourquet commented Apr 23, 2021

JeffBezanson commented Apr 23, 2021

rfourquet commented Apr 23, 2021

JeffBezanson commented Apr 23, 2021

rfourquet commented Apr 23, 2021

JeffBezanson commented Apr 23, 2021

JeffBezanson commented Apr 24, 2021

JeffBezanson commented May 25, 2021

oscardssmith commented May 25, 2021

vtjnash commented May 25, 2021

oscardssmith commented Jun 1, 2021

JeffBezanson commented Jun 1, 2021

chriselrod commented Aug 28, 2021 • edited Loading

rfourquet Apr 21, 2021 •

edited

Loading

chriselrod Apr 21, 2021 •

edited

Loading

chriselrod commented Aug 28, 2021 •

edited

Loading