Recurrent intermittent travis test failure #4016

Closed
kmsquire opened this issue Aug 11, 2013 · 7 comments
Labels: bug

Comments

@kmsquire
Member

This happens on both clang (here and here) and gcc (here):

$ cd /tmp/julia/share/julia/test && /tmp/julia/bin/julia runtests.jl all
    From worker 4:       * numbers
    From worker 5:       * strings
    From worker 3:       * keywordargs
    From worker 6:       * unicode
    From worker 2:       * core
    From worker 7:       * collections
    From worker 9:       * remote
    From worker 8:       * hashing
    From worker 9:       * iostring
    From worker 3:       * arrayops
    From worker 9:       * linalg
    From worker 8:       * blas
    From worker 6:       * fft
    From worker 2:       * dsp
    From worker 7:       * sparse
    From worker 5:       * bitarray
    From worker 8:       * random
Worker 2 terminated.
ERROR: read: end of file
 in read at iobuffer.jl:68
 in read at stream.jl:609
 in anonymous at task.jl:797

ERROR: ProcessExitedException()
 in yield at multi.jl:1490
 in wait at task.jl:105
 in wait_full at multi.jl:545
 in remotecall_fetch at multi.jl:645
 in remotecall_fetch at multi.jl:650
 in anonymous at multi.jl:1332
at /tmp/julia/share/julia/test/runtests.jl:20
The command "/tmp/julia/bin/julia runtests.jl all" exited with 1.

It's also unclear where task.jl:797 is, since task.jl only has 164 lines; it possibly refers to stream.jl:797. The other backtrace locations are iobuffer.jl:68 and stream.jl:609.

I was looking to see if there might be a race condition in IOBuffer, e.g. where isopen() becomes false in wait_nb before data is written, or where the readnotify condition is notified before the buffer is filled, but I didn't see anything obvious.

@staticfloat
Member

I can get this on my OS X box as well: if I just run make testall in a loop, I eventually hit this.
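
For reference, a stress loop along these lines is one way to keep rerunning the suite until the failure shows up (rough sketch; it assumes a source checkout where make testall is the right target and that a failing run exits non-zero):

# rough sketch: rerun the full test suite until something fails
while make testall; do
    echo "tests passed, running again..."
done
echo "testall exited non-zero; see the failure output above"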

If there's anything I can do to help debug this, let me know.

@kmsquire
Member Author

So, if I had paid more attention I would have noticed that the problem occurs during the DSP tests, which is where the worker terminates. The following is sufficient to cause a segfault on two Linux systems that I tried:

julia> ;cd test
/home/kmsquire/Source/julia/test

julia> using Base.Test

julia> while true
           include("dsp.jl")
       end
Segmentation fault (core dumped)

@Keno
Member

Keno commented Aug 13, 2013

It might be worth valgrinding this one with MEMDEBUG enabled. I did that earlier today (unrelatedly) and saw a fair number of invalid reads/writes, though I can't rule out that those were caused by my changes.
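
As a rough sketch (build knobs and paths will vary by setup), one way to do that:

# 1. Rebuild with MEMDEBUG defined (e.g. pass -DMEMDEBUG in the C flags when
#    compiling src/gc.c) so allocations are easier for valgrind to follow.
# 2. Run the failing test under valgrind; --smc-check=all is needed because
#    julia generates code at runtime.
cd test
valgrind --smc-check=all --error-limit=no \
    ../usr/bin/julia -e 'using Base.Test; for i=1:20; include("dsp.jl"); end'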

@kmsquire
Member Author

@loladiro, will do. Right now, in the debugger, I can see that there's memory corruption.

julia> while true
           include("dsp.jl")
       end

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff72529a4 in pool_alloc (p=0x7ffff7fcdb68) at gc.c:489
489     p->freelist = p->freelist->next;
Missing separate debuginfos, use: debuginfo-install ncurses-libs-5.7-3.20090208.el6.x86_64
(gdb) backtrace
#0  0x00007ffff72529a4 in pool_alloc (p=0x7ffff7fcdb68) at gc.c:489
#1  0x00007ffff72540ef in allocobj (sz=368) at gc.c:981
#2  0x00007ffff7242008 in _new_array (atype=0x67fa60, ndims=1, dims=0x7fffffffb7f0) at array.c:80
#3  0x00007ffff7242c95 in jl_alloc_array_1d (atype=0x67fa60, nr=40) at array.c:297
#4  0x00007ffff0e382a9 in ?? ()
#5  0x00007fffffffb930 in ?? ()
#6  0x01007ffff7242008 in ?? ()
#7  0x0000000003c48350 in ?? ()
#8  0x0000004e00000000 in ?? ()
#9  0x000000000000000b in ?? ()
#10 0x0000000000000008 in ?? ()
#11 0x000000000000000b in ?? ()
#12 0x00007fffffffb960 in ?? ()
#13 0x0000000200000100 in ?? ()
#14 0x0000000000ad90c0 in ?? ()
#15 0x0000000000000580 in ?? ()
#16 0x0000000000000000 in ?? ()
(gdb) print p     
$1 = (pool_t *) 0x7ffff7fcdb68
(gdb) print *p
$2 = {osize = 384, pages = 0x3e5c380, freelist = 0x4009000000000000}
(gdb) print *(p.freelist)
Cannot access memory at address 0x4009000000000000
(gdb) 

@amitmurthy
Contributor

@Keno
Member

Keno commented Aug 13, 2013

Yup, that's the one I saw earlier today as well. I'm not quite sure, but I think it might be related to the size of the work array in gesdd, which was changed recently.

@kmsquire
Member Author

FWIW, I reported this upstream back when this problem originally appeared, and the documentation of ZGESDD was recently fixed (although the fix won't appear in an LAPACK release until sometime this summer).

kmsquire referenced this issue Mar 19, 2015
the size of the RWORK array in zgesdd was wrong.