seed! is not thread safe #1158

Closed
bjarthur opened this issue Sep 21, 2021 · 3 comments

Comments

@bjarthur
Contributor

CPU code works fine:

julia> using Random

julia> rng = Random.default_rng()
TaskLocalRNG()

julia> Random.seed!(rng)
TaskLocalRNG()

julia> rngs = [deepcopy(rng) for _=1:Threads.nthreads()]
4-element Vector{TaskLocalRNG}:
 TaskLocalRNG()
 TaskLocalRNG()
 TaskLocalRNG()
 TaskLocalRNG()

julia> Threads.@threads for _=1:10
          Random.seed!(rngs[Threads.threadid()])
       end

corresponding GPU code segfaults:

julia> using CUDA

julia> rng = CURAND.default_rng()
CUDA.CURAND.RNG(Ptr{Nothing} @0x00000000045237a0, CuContext(0x0000000003da79a0, instance 88b5bef5e4efdc87), CuStream(0x000000000245c890, CuContext(0x0000000003da79a0, instance 88b5bef5e4efdc87)), 100)

julia> Random.seed!(rng)

julia> rngs = [deepcopy(rng) for _=1:Threads.nthreads()]
4-element Vector{CUDA.CURAND.RNG}:
 CUDA.CURAND.RNG(Ptr{Nothing} @0x00000000045237a0, CuContext(0x0000000003da79a0, instance 5ad56f4aa5973234), CuStream(0x000000000245c890, CuContext(0x0000000003da79a0, instance 5ad56f4aa5973234)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x00000000045237a0, CuContext(0x0000000003da79a0, instance 3a5f0052e3a11c11), CuStream(0x000000000245c890, CuContext(0x0000000003da79a0, instance 3a5f0052e3a11c11)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x00000000045237a0, CuContext(0x0000000003da79a0, instance 33d2f16d56891242), CuStream(0x000000000245c890, CuContext(0x0000000003da79a0, instance 33d2f16d56891242)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x00000000045237a0, CuContext(0x0000000003da79a0, instance c9b27c7d82482458), CuStream(0x000000000245c890, CuContext(0x0000000003da79a0, instance c9b27c7d82482458)), 100)

julia> Threads.@threads for _=1:10
          Random.seed!(rngs[Threads.threadid()])
       end

signal (11): Segmentation fault
in expression starting at REPL[11]:1
unknown function (ip: 0x2b976fe76eca)
unknown function (ip: 0x2b976fd14f8e)
cuMemFree_v2 at /lib64/libcuda.so.1 (unknown line)
unknown function (ip: 0x2b979a091e7f)
unknown function (ip: 0x2b979a0cd2bf)
curandGenerateSeeds at /groups/scicompsoft/home/arthurb/.julia/artifacts/b37afdc18b754625b23a968986ffe35252f9c875/lib/libcurand.so.10 (unknown line)
unsafe_curandGenerateSeeds at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/9T5Sq/lib/curand/libcurand.jl:173 [inlined]
macro expansion at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/9T5Sq/src/pool.jl:340 [inlined]
macro expansion at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/9T5Sq/lib/curand/error.jl:71 [inlined]
seed! at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/9T5Sq/lib/curand/random.jl:48
seed! at /groups/scicompsoft/home/arthurb/.julia/packages/CUDA/9T5Sq/lib/curand/random.jl:45
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2245 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2427
macro expansion at ./REPL[11]:2 [inlined]
#19#threadsfor_fun at ./threadingconstructs.jl:85
#19#threadsfor_fun at ./threadingconstructs.jl:52
unknown function (ip: 0x2b97500a668f)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2245 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2427
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1790 [inlined]
start_task at /buildworker/worker/package_linux64/build/src/task.c:881
Allocations: 11313970 (Pool: 11312008; Big: 1962); GC: 7
Segmentation fault

it might be specific to @threads because @spawn seems to work:

julia> Threads.@spawn Random.seed!(rngs[1])
Task (runnable) @0x00002ab946c6fde0

julia> wait(ans)
@bjarthur added the bug label Sep 21, 2021
@maleadt
Member

maleadt commented Sep 22, 2021

You can't deepcopy a CURAND RNG object; it contains a pointer to a CURAND library handle. As you can see in the output, that handle remains the same, so you're effectively calling seed! on the same object from different threads. Create a new handle instead:

julia> rngs = [CURAND.RNG() for _=1:Threads.nthreads()]
32-element Vector{CUDA.CURAND.RNG}:
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003f29ae0, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x00000000034f2bb0, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000002a7eb60, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003f497b0, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003d6de10, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003ce49c0, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003f4bc00, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003cc76b0, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003cea4a0, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003cbffb0, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003f43340, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003d6ea40, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003d09560, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003597e10, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003f2e2b0, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003cc51c0, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003f46950, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000002332b60, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003f4c9c0, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003f72400, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003f55110, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003f40ed0, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003f532f0, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003f405f0, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000002a7e7f0, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003f47370, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x00000000034c1320, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003f339a0, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003d21d50, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003f42e10, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003f500c0, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)
 CUDA.CURAND.RNG(Ptr{Nothing} @0x0000000003f72070, CuContext(0x00000000019d0880, instance 65eb41c103e15e63), CuStream(0x0000000001e3fc70, CuContext(0x00000000019d0880, instance 65eb41c103e15e63)), 100)

julia> Threads.@threads for _=1:10
          Random.seed!(rngs[Threads.threadid()])
       end

julia>
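The shared-handle behaviour described above can be reproduced without a GPU. The following is a minimal sketch, assuming a hypothetical `FakeRNG` type that stands in for any wrapper around a C library handle; it is not CUDA.jl's actual `CURAND.RNG` definition. It only illustrates that `deepcopy` copies a `Ptr` field bit for bit, so every "copy" refers to the same underlying resource.

```julia
# Hypothetical stand-in for a wrapper around a C library handle.
# This is NOT the real CURAND.RNG type, just an illustration.
mutable struct FakeRNG
    handle::Ptr{Cvoid}
end

# A small allocation plays the role of the library handle.
handle = Libc.malloc(8)
rng = FakeRNG(handle)

copies = [deepcopy(rng) for _ in 1:4]

# The wrappers are distinct Julia objects...
all_distinct = all(c !== rng for c in copies)
# ...but each copy still points at the same underlying handle,
# because a Ptr is a plain bits value and is copied as-is.
all_share_handle = all(c.handle == rng.handle for c in copies)

println("distinct wrappers: ", all_distinct)
println("shared handle:     ", all_share_handle)

Libc.free(handle)
```

With real library handles, calling a mutating function such as `seed!` on these copies from several threads would therefore operate on a single handle concurrently, which matches the identical `Ptr{Nothing}` values in the `deepcopy` output earlier in the thread.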

@maleadt closed this as completed Sep 22, 2021
@maleadt removed the bug label Sep 22, 2021
@bjarthur
Contributor Author

Maybe a new deepcopy method for CURAND.RNG should be defined that throws an error?

@maleadt
Member

maleadt commented Sep 23, 2021

Hmm, I understand why but I'm not inclined to, since that applies to so many CUDA.jl objects. We also can't guarantee to be able to make an exact copy, since there's lots of hidden state (e.g. with CURAND we can't recover the seed).
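For readers who want such a guard on their own handle-wrapping types, the erroring `deepcopy` suggested above could look roughly like the sketch below. It assumes a hypothetical `HandleWrapper` type and is not something CUDA.jl provides. It relies on the documented extension point `Base.deepcopy_internal(x, ::IdDict)`, which `deepcopy` calls internally.

```julia
# Hypothetical wrapper type, used only for illustration.
mutable struct HandleWrapper
    handle::Ptr{Cvoid}
end

# `deepcopy(x)` dispatches to `Base.deepcopy_internal(x, ::IdDict)`.
# Specializing it for the wrapper makes any deep copy fail loudly
# instead of silently sharing the handle.
function Base.deepcopy_internal(::HandleWrapper, ::IdDict)
    throw(ArgumentError(
        "HandleWrapper holds a library handle and cannot be deep-copied; " *
        "construct a new instance instead"))
end

w = HandleWrapper(C_NULL)

# Check that the guard triggers.
threw = try
    deepcopy(w)
    false
catch err
    err isa ArgumentError
end

println("deepcopy rejected: ", threw)
```

The guard also fires when the wrapper is nested inside another object being deep-copied, since the recursion goes through the same `deepcopy_internal` method.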
