Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heisenbug in "test guarded Random.seed!" #42752

Closed
tkf opened this issue Oct 21, 2021 · 17 comments · Fixed by #45212
Closed

Heisenbug in "test guarded Random.seed!" #42752

tkf opened this issue Oct 21, 2021 · 17 comments · Fixed by #45212
Assignees
Labels
bug Indicates an unexpected problem or unintended behavior ci Continuous integration heisenbug This bug occurs unpredictably randomness Random number generation and the Random stdlib rr trace included

Comments

@tkf
Copy link
Member

tkf commented Oct 21, 2021

Just as a record, I saw:

Error in testset Test:
Test Failed at /buildworker/worker/tester_linuxaarch64/build/share/julia/stdlib/v1.8/Test/test/runtests.jl:777
  Expression: orig == Random.default_rng()
   Evaluated: Xoshiro(0x947b9af2ba269897, 0x95ae8d9a31382b32, 0x8a532dc2baa61f2e, 0x7213512241a4dfbd) == TaskLocalRNG()
Error in testset Test:
Test Failed at /buildworker/worker/tester_linuxaarch64/build/share/julia/stdlib/v1.8/Test/test/runtests.jl:778
  Expression: rand(orig) == rand()
   Evaluated: 0.6209946085911393 == 0.7846500004305739

https://build.julialang.org/#/builders/7/builds/4863/steps/5/logs/stdio

in #42725. It vanished after re-running the test by merging master even though there doesn't seem a change that could fix the result a4903fd...76c2431.

@tkf tkf added the ci Continuous integration label Oct 21, 2021
@vchuravy vchuravy added randomness Random number generation and the Random stdlib heisenbug This bug occurs unpredictably labels Oct 21, 2021
@vtjnash
Copy link
Member

vtjnash commented Oct 21, 2021

I thought I was seeing this quite a bit on builds at random this week

@tkf
Copy link
Member Author

tkf commented Oct 21, 2021

Is it architecture-specific?

@Seelengrab
Copy link
Contributor

Seelengrab commented Oct 22, 2021

When I run the test in question with the global seed given from the build log, I get the same result as the orig Xoshiro on the left:

julia> function testfun()
           Random.seed!(0x2f5d1b5dc893fdf5ee2a1080933cd16d)
           @testset "test guarded Random.seed!" begin
           seed = rand(UInt)
           orig = copy(Random.default_rng())
           @test guardseed(()->rand(), seed) == guardseed(()->rand(),
           seed)
           @test guardseed(()->rand(Int), seed) ==
           guardseed(()->rand(Int), seed)
           r1, r2 = MersenneTwister(0), MersenneTwister(0)
           a, b = guardseed(r1) do
               Random.seed!(r1, 0)
               rand(r1), rand(r1, Int)
           end::Tuple{Float64,Int}
           c, d = guardseed(r2) do
               Random.seed!(r2, 0)
               rand(r2), rand(r2, Int)
           end::Tuple{Float64,Int}
           @test a == c == rand(r1) == rand(r2)
           @test b == d == rand(r1, Int) == rand(r2, Int)
           @show orig
           @show Random.default_rng()
           @test orig == Random.default_rng()
           @test rand(orig) == rand()
       end
       end
testfun (generic function with 1 method)

julia> testfun()
orig = Xoshiro(0x947b9af2ba269897, 0x95ae8d9a31382b32, 0x8a532dc2baa61f2e, 0x7213512241a4dfbd)
Random.default_rng() = TaskLocalRNG()
Test Summary:             | Pass  Total
test guarded Random.seed! |    6      6
Test.DefaultTestSet("test guarded Random.seed!", Any[], 6, false, false)

Sadly we don't have the other internal state in the log :/ I checked all combinations of 2 flipped bits, so 65536 RNGs, but none showed a match. We do have the result of the next rand call though, so we could try to reconstruct the internal state of the task local RNG that way. Unfortunately, Symbolics.jl doesn't support << on Num and I'm not well versed in Z3 🤷

We'd basically have to solve this:

s0::UInt, s3::UInt
tmp = s0 + s3
res = (((tmp << 23) | (tmp >> 41)) + s0) >> 11
res == 0x00191bda5154fb80

which I got by seeing that the random values are generated via x = rand(rng, UInt); Float64(x >> 11) * 0x1.0p-53. Since we have a floating point result, we can easily invert everything up to x = by doing UInt(0.7846500004305739 / 0x1.0p-53) (I checked that this works by making sure that it works for the other floating point value generated by orig), giving us the above hex number 0x00191bda5154fb80. Still, I can tell you that in s0 and s3 only the following bits matter, from MSB to LSB:

  • 13 - matter
  • 10 - don't matter
  • 41 - matter

This is due to the >> 11 at the end and propagating that requirement through everything, as long as bit 51 is set so that it "works" in s0 and s3.

@DilumAluthge DilumAluthge added the bug Indicates an unexpected problem or unintended behavior label Oct 22, 2021
@nalimilan
Copy link
Member

nalimilan commented Oct 23, 2021

FWIW I also saw it with 1.7.0-rc2 on x86_64, but with different values:

Test Failed at /builddir/build/BUILD/julia-1.7.0-rc2/build/usr/share/julia/stdlib/v1.7/Test/test/runtests.jl:741
  Expression: orig == Random.default_rng()
   Evaluated: Xoshiro(0x98eb785bda6da52d, 0xa71f2324bc045d76, 0x9739d712cfb7c053, 0xa001e265c6af9015) == TaskLocalRNG()
Error in testset Test:
Test Failed at /builddir/build/BUILD/julia-1.7.0-rc2/build/usr/share/julia/stdlib/v1.7/Test/test/runtests.jl:742
  Expression: rand(orig) == rand()
   Evaluated: 0.9755253173603154 == 0.45116343295850014

https://kojipkgs.fedoraproject.org//work/tasks/9369/77709369/build.log

(This was the first time I built 1.7.0-rc2 so it may not be so rare.)

@KristofferC
Copy link
Member

Maybe something to do with #42201? Although that was a while ago.

@Seelengrab
Copy link
Contributor

If we modify the test here to show the true state of Random.default_rng(), it will be much easier to debug because we can more easily distinguish bitflips from whole bytes missing etc. I doubt that it's a problem with our Xoshiro, since then I wouldn't be able to make the tests run through when running them with the same seed locally.

I tried to debug this by letting Z3 spit out a bunch of number pairs that would create those values, but there wasn't really a pattern in them. Please excuse the bad quality - I hadn't yet setup my screenshotting tool.

Picture of histogram

image

This screenshot shows the number of times a bit flip had to have occured in the generated numbers in each nibble (4-bit group) of each UInt. Z3 spit out 15 values with exactly 9 bitflips (non-consecutive) and didn't find any possible values with fewer. The largest number of bitflips possible to still find solutions is 18 (but again no pattern in those).

@tkf tkf changed the title Presumably rare failure from "test guarded Random.seed!" in linuxaarch64 Heisenbug in "test guarded Random.seed!" Oct 24, 2021
@Seelengrab
Copy link
Contributor

Seelengrab commented Oct 31, 2021

Is this expected/related behavior?

[sukera@tempman ~]$ julia -q -t1
julia> t = current_task()
Task (runnable) @0x00007f6723fdc010

julia> state = (t.rngState0, t.rngState1, t.rngState2, t.rngState3)
(0xf6e7025b1a108853, 0x1949f9d36755c8cf, 0xdc13b6c30461a2b5, 0xb199766836484420)

julia> using Random

julia> copy(Random.default_rng()) # yep, same state as the rng in our task
Xoshiro(0xf6e7025b1a108853, 0x1949f9d36755c8cf, 0xdc13b6c30461a2b5, 0xb199766836484420)

julia> Random.default_rng()
TaskLocalRNG()

julia> function Base.show(io::IO, ::MIME"text/plain", x::Random.TaskLocalRNG)
           t = current_task()
           write(io, "TaskLocalRNG(")
           show(io, MIME"text/plain"(), t.rngState0)
           write(io, ',')
           show(io, MIME"text/plain"(), t.rngState1)
           write(io, ',')
           show(io, MIME"text/plain"(), t.rngState2)
           write(io, ',')
           show(io, MIME"text/plain"(), t.rngState3)
           print(io, ')')
       end

julia> state1 = (t.rngState0, t.rngState1, t.rngState2, t.rngState3) # ok, our state didn't change..
(0xf6e7025b1a108853, 0x1949f9d36755c8cf, 0xdc13b6c30461a2b5, 0xb199766836484420)

julia> t = current_task()  # still the same task
Task (runnable) @0x00007f6723fdc010

julia> state1 = (t.rngState0, t.rngState1, t.rngState2, t.rngState3)  # still the same state
(0xf6e7025b1a108853, 0x1949f9d36755c8cf, 0xdc13b6c30461a2b5, 0xb199766836484420)

julia> copy(Random.default_rng()) # should just copy rng of current task, so far so good
Xoshiro(0xf6e7025b1a108853, 0x1949f9d36755c8cf, 0xdc13b6c30461a2b5, 0xb199766836484420)

julia> Random.default_rng()   # ?! shouldn't this be the same state we just copied from?
TaskLocalRNG(0x6c4866be67e41b35,0x190098b53e17879a,0xe342dc7b9c59fdeb,0xb156895802bc4cb0)

julia> using Base.Threads

julia> Threads.nthreads()
1

julia> versioninfo()
Julia Version 1.8.0-DEV.798
Commit 3ae25059fb (2021-10-22 06:13 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

I have no idea why the copy differs from the original - as far as I can tell, even the current task is the same, so no new RNG was split off! Am I doing something wrong with printing here? It's basically the same as for regular Xoshiro, and printing itself doesn't seem to change RNG either:

julia> Random.default_rng()
TaskLocalRNG(0x6c4866be67e41b35,0x190098b53e17879a,0xe342dc7b9c59fdeb,0xb156895802bc4cb0)

julia> Random.default_rng()
TaskLocalRNG(0x6c4866be67e41b35,0x190098b53e17879a,0xe342dc7b9c59fdeb,0xb156895802bc4cb0)

julia> Random.default_rng()
TaskLocalRNG(0x6c4866be67e41b35,0x190098b53e17879a,0xe342dc7b9c59fdeb,0xb156895802bc4cb0)

@Seelengrab
Copy link
Contributor

Seelengrab commented Oct 31, 2021

Ok, when I just dump(t) instead of printing the state manually, I get a very different picture. Seems like there's a task for printing/background stuff that's different from the REPL, which has a different RNG state. The REPL with our task/rng state is waiting diligently though - maybe something similar is going on here? Is there a somewhat easy way/tool to inspect task state of a running julia session, without messing with the julia session?

@vtjnash
Copy link
Member

vtjnash commented Nov 8, 2021

These numbers don't appear to be random, but is advanced exactly 8 numbers from where it was supposed to be generating:
0.6209946085911393 == 0.7846500004305739

julia> rng = Xoshiro(0x947b9af2ba269897, 0x95ae8d9a31382b32, 0x8a532dc2baa61f2e, 0x7213512241a4dfbd)
Xoshiro(0x947b9af2ba269897, 0x95ae8d9a31382b32, 0x8a532dc2baa61f2e, 0x7213512241a4dfbd)

julia> rand(rng)
0.6209946085911393

julia> rand(rng)
0.08217519705091403

julia> rand(rng)
0.7653532529104843

julia> rand(rng)
0.5444902979941714

julia> rand(rng)
0.696656487590396

julia> rand(rng)
0.40339486705752836

julia> rand(rng)
0.3460152995484739

julia> rand(rng)
0.3412018309199384

julia> rand(rng)
0.7846500004305739

@vtjnash
Copy link
Member

vtjnash commented Nov 8, 2021

which means the 2 states were probably:

orig = Xoshiro(0x7395c1c4acfb627a, 0x4c26b3294081a4a8, 0x4b551ddeca4f6ea5, 0x4da71ae5e526297b)
rng  = Xoshiro(0x947b9af2ba269897, 0x95ae8d9a31382b32, 0x8a532dc2baa61f2e, 0x7213512241a4dfbd)

@Seelengrab
Copy link
Contributor

Very interesting! That's also consistent with the other observation posted above

states
julia> orig = Xoshiro(0x98eb785bda6da52d, 0xa71f2324bc045d76, 0x9739d712cfb7c053, 0xa001e265c6af9015)
Xoshiro(0x98eb785bda6da52d, 0xa71f2324bc045d76, 0x9739d712cfb7c053, 0xa001e265c6af9015)

julia> rand(orig)
0.9755253173603154

julia> rand(orig)
0.6218595743287779

julia> rand(orig)
0.18401662900007387

julia> rand(orig)
0.4942449913515099

julia> rand(orig)
0.19740142179010944

julia> rand(orig)
0.5657614748780961

julia> rand(orig)
0.5698126213082862

julia> rand(orig)
0.6621331406776867

julia> orig
Xoshiro(0x23cdaf44f4fd1c54, 0x84dec7f31d026a9c, 0x6fdbeb04a2abd21a, 0x4cd9c15a6e89152c)

julia> rand(orig)
0.45116343295850014

so I'd be inclined to say that finding what advances RNGs by exactly 8 numbers makes it likely to find the cause.

@Seelengrab
Copy link
Contributor

The testfun for testing guarded seeds from above calls rand() with the default_rng() 4 times - is it possible that, for some reason, the task handling that testset is running twice?

@DilumAluthge
Copy link
Member

I just saw what appears to be the same bug, on tester_linux64. Full log: https://build.julialang.org/#/builders/69/builds/6230

Since this is on linux64, we have an rr trace.

@Seelengrab
Copy link
Contributor

Yep, exhibits the same behavior of being 8 rand calls behind!

@vtjnash
Copy link
Member

vtjnash commented Dec 10, 2021

Still broken, but somewhat less so?? (or more so??) https://build.julialang.org/#/builders/2/builds/76

We are still generating the wrong random number, and it is still off by 8, but now the contents of the copy and the Task are exactly the same.

@DilumAluthge DilumAluthge added the rr trace wanted An rr trace would help with debugging this issue - you can help out by creating one label Apr 12, 2022
@DilumAluthge
Copy link
Member

DilumAluthge commented Apr 18, 2022

I got an rr trace: https://buildkite.com/julialang/julia-master/builds/11189#1be3920d-621e-439a-8d43-7be5ba357a1b

Please note that the rr trace will be deleted from Buildkite in six months.

The global RNG seed was 0x5c2b5120001bf6fd2dfbf836dc500c5f.

Error in testset Test:
Test Failed at /cache/build/default-amdci5-1/julialang/julia-master/julia-1600cb97d2/share/julia/stdlib/v1.9/Test/test/runtests.jl:795
  Expression: orig == Random.default_rng()
   Evaluated: Xoshiro(0xf87f64b7bf553e41, 0x6bb192733ba008f5, 0x88582df1eaa56f74, 0xc4d459a683f010e3) == TaskLocalRNG()

Error in testset Test:
Test Failed at /cache/build/default-amdci5-1/julialang/julia-master/julia-1600cb97d2/share/julia/stdlib/v1.9/Test/test/runtests.jl:796
  Expression: rand(orig) == rand()
   Evaluated: 0.1548008544647953 == 0.9281513333593826

@DilumAluthge DilumAluthge added rr trace included and removed rr trace wanted An rr trace would help with debugging this issue - you can help out by creating one labels Apr 18, 2022
@vtjnash
Copy link
Member

vtjnash commented Apr 18, 2022

Here is the cause of the fault:

(rr) bt
#0  0x00007f64c6633201 in jl_genrandom (rngState=0x7f64661a1468) at /cache/build/default-amdci4-4/julialang/julia-master/src/task.c:726
#1  rng_split (to=0x7f64632c9880, from=0x7f64661a1430) at /cache/build/default-amdci4-4/julialang/julia-master/src/task.c:753
#2  ijl_new_task (start=0x7f646283de60, completion_future=<optimized out>, ssize=<optimized out>) at /cache/build/default-amdci4-4/julialang/julia-master/src/task.c:794
#3  0x00007f64af996040 in _Task () at boot.jl:444                        
#4  Task () at task.jl:5                                                                                                                                   
#5  Task () at task.jl:5                                            
#6  errormonitor () at task.jl:539
#7  julia_process_worker_67608 () at /cache/build/default-amdci4-4/julialang/julia-master/usr/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:328
#8  0x00007f64af99c00e in send_del_client_no_lock () at /cache/build/default-amdci4-4/julialang/julia-master/usr/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:306
#9  julia_finalize_ref_67633 () at /cache/build/default-amdci4-4/julialang/julia-master/usr/share/julia/stdlib/v1.9/Distributed/src/remotecall.jl:103
#10 0x00007f64af99c06c in jfptr_finalize_ref_67634.clone_1 () at abstractarray.jl:668
#11 0x00007f64c660b977 in _jl_invoke (world=<optimized out>, mfunc=<optimized out>, nargs=1, args=0x7f64a31503f8, F=0x7f64b2d320d0 <jl_system_image_data+36638224>)
    at /cache/build/default-amdci4-4/julialang/julia-master/src/gf.c:2373
#12 ijl_apply_generic (F=<optimized out>, args=args@entry=0x7f64a31503f8, nargs=nargs@entry=1) at /cache/build/default-amdci4-4/julialang/julia-master/src/gf.c:2574
#13 0x00007f64c6669d3d in jl_apply (nargs=2, args=0x7f64a31503f0) at /cache/build/default-amdci4-4/julialang/julia-master/src/julia.h:1838
#14 run_finalizer (o=<optimized out>, ff=<optimized out>, ct=<optimized out>) at /cache/build/default-amdci4-4/julialang/julia-master/src/gc.c:280
#15 0x00007f64c666acd5 in jl_gc_run_finalizers_in_list (ct=ct@entry=0x7f64661a1430, list=list@entry=0x7f64a3150570) at /cache/build/default-amdci4-4/julialang/julia-master/src/gc.c:370
#16 0x00007f64c666af69 in run_finalizers (ct=0x7f64661a1430) at /cache/build/default-amdci4-4/julialang/julia-master/src/gc.c:400
#17 run_finalizers (ct=0x7f64661a1430) at /cache/build/default-amdci4-4/julialang/julia-master/src/gc.c:378
#18 0x00007f64c666c1e9 in jl_gc_run_pending_finalizers (ct=<optimized out>, ct@entry=0x7f64661a1430) at /cache/build/default-amdci4-4/julialang/julia-master/src/gc.c:411
#19 0x00007f64c5cce88c in jl_mutex_unlock (lock=<optimized out>) at /cache/build/default-amdci4-4/julialang/julia-master/src/julia_locks.h:131
#20 jl_generate_fptr_impl (mi=<optimized out>, world=33829) at /cache/build/default-amdci4-4/julialang/julia-master/src/jitlayers.cpp:361
#21 0x00007f64c660aad8 in jl_compile_method_internal (world=33829, mi=0x7f649bd55410) at /cache/build/default-amdci4-4/julialang/julia-master/src/gf.c:2106
#22 jl_compile_method_internal (mi=0x7f649bd55410, world=33829) at /cache/build/default-amdci4-4/julialang/julia-master/src/gf.c:2047
#23 0x00007f64c660bb38 in _jl_invoke (world=33829, mfunc=<optimized out>, nargs=2, args=0x7f64a31508a8, F=0x7f64b5232a70 <jl_system_image_data+75438000>)
    at /cache/build/default-amdci4-4/julialang/julia-master/src/gf.c:2384
#24 ijl_apply_generic (F=<optimized out>, args=args@entry=0x7f64a31508a8, nargs=nargs@entry=2) at /cache/build/default-amdci4-4/julialang/julia-master/src/gf.c:2574
#25 0x00007f64c662c7bd in jl_apply (nargs=3, args=0x7f64a31508a0) at /cache/build/default-amdci4-4/julialang/julia-master/src/julia.h:1838
#26 do_call (args=args@entry=0x7f649bd39078, nargs=nargs@entry=3, s=s@entry=0x7f64a3151090) at /cache/build/default-amdci4-4/julialang/julia-master/src/interpreter.c:126
#27 0x00007f64c662bf77 in eval_value (e=e@entry=0x7f64a26eca70, s=s@entry=0x7f64a3151090) at /cache/build/default-amdci4-4/julialang/julia-master/src/interpreter.c:215
#28 0x00007f64c662d0ac in eval_stmt_value (s=0x7f64a3151090, stmt=0x7f64a26eca70) at /cache/build/default-amdci4-4/julialang/julia-master/src/interpreter.c:594
#29 eval_body (stmts=stmts@entry=0x3ebbdc00, s=s@entry=0x7f64a3151090, ip=<optimized out>, ip@entry=79, toplevel=toplevel@entry=1)
    at /cache/build/default-amdci4-4/julialang/julia-master/src/interpreter.c:594
#30 0x00007f64c662d750 in eval_body (stmts=stmts@entry=0x3ebbdc00, s=s@entry=0x7f64a3151090, ip=<optimized out>, ip@entry=40, toplevel=toplevel@entry=1)
    at /cache/build/default-amdci4-4/julialang/julia-master/src/interpreter.c:522
#31 0x00007f64c662d750 in eval_body (stmts=stmts@entry=0x3ebbdc00, s=s@entry=0x7f64a3151090, ip=<optimized out>, ip@entry=38, toplevel=toplevel@entry=1)
    at /cache/build/default-amdci4-4/julialang/julia-master/src/interpreter.c:522
#32 0x00007f64c662d750 in eval_body (stmts=<optimized out>, s=s@entry=0x7f64a3151090, ip=<optimized out>, ip@entry=0, toplevel=toplevel@entry=1)
    at /cache/build/default-amdci4-4/julialang/julia-master/src/interpreter.c:522
#33 0x00007f64c662e386 in jl_interpret_toplevel_thunk (m=m@entry=0x7f6463dff0d0, src=0x7f6468b50d90) at /cache/build/default-amdci4-4/julialang/julia-master/src/interpreter.c:750
#34 0x00007f64c66507df in jl_toplevel_eval_flex (m=m@entry=0x7f6463dff0d0, e=<optimized out>, fast=fast@entry=1, expanded=expanded@entry=0)
    at /cache/build/default-amdci4-4/julialang/julia-master/src/toplevel.c:912
#35 0x00007f64c665163b in jl_toplevel_eval_flex (m=m@entry=0x7f6463dff0d0, e=e@entry=0x7f646886cd50, fast=fast@entry=1, expanded=expanded@entry=0)
    at /cache/build/default-amdci4-4/julialang/julia-master/src/julia.h:996
#36 0x00007f64c6652b0c in ijl_toplevel_eval (m=m@entry=0x7f6463dff0d0, v=v@entry=0x7f646886cd50) at /cache/build/default-amdci4-4/julialang/julia-master/src/toplevel.c:921
#37 0x00007f64c6652c2b in ijl_toplevel_eval_in (m=0x7f6463dff0d0, ex=0x7f646886cd50) at /cache/build/default-amdci4-4/julialang/julia-master/src/toplevel.c:971
#38 0x00007f64b03eaa42 in eval () at boot.jl:370
#39 japi1_include_string_30440 () at loading.jl:1297

@JeffBezanson JeffBezanson self-assigned this May 6, 2022
JeffBezanson added a commit that referenced this issue May 6, 2022
JeffBezanson added a commit that referenced this issue Jul 12, 2022
KristofferC pushed a commit that referenced this issue Jul 17, 2022
KristofferC pushed a commit that referenced this issue Jul 17, 2022
ffucci pushed a commit to ffucci/julia that referenced this issue Aug 11, 2022
pcjentsch pushed a commit to pcjentsch/julia that referenced this issue Aug 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior ci Continuous integration heisenbug This bug occurs unpredictably randomness Random number generation and the Random stdlib rr trace included
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants