BoundsError() in flush_gc_msgs #6297

Closed
carlobaldassi opened this issue Mar 28, 2014 · 9 comments
Labels
bug Indicates an unexpected problem or unintended behavior parallelism Parallel or distributed computation

Comments

@carlobaldassi
Member

I have occasionally seen this error in long-running jobs:

ERROR: BoundsError()
 in flush_gc_msgs at multi.jl:140
 in send_msg_ at multi.jl:164
 in remotecall_fetch at multi.jl:672
 in sync_end at task.jl:300

Line 140 of multi.jl is:

msgs = copy(w.del_msgs)

which seems a strange place to throw a BoundsError.

To give context, my code uses SharedArrays, and the error comes from within a @sync'd for block with @spawnat, something like:

@sync for p in ps
    @spawnat p begin
        out[p] = update(p, shrd, args)
    end
end

where ps is a list of processes, out and shrd are SharedArrays (shared among all ps).

I wouldn't know how to reproduce it, though. I'm reporting it for the record, and just in case someone can guess what's going on. I still have the Julia session where it happened open, if that can be of any use.

Version which was running (with 5 workers):

Julia Version 0.3.0-prerelease+2077
Commit 6b9fa29* (2014-03-17 20:45 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: AMD Opteron(tm) Processor 6282 SE              
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm
@amitmurthy
Contributor

If your program is calling addprocs/rmprocs dynamically, the value of p in out[p] = update(p, shrd, args) could extend beyond length(out), hence the BoundsError().

The BoundsError() could be the result of the remotecall_fetch; I have no explanation for why the printed stack shows line 140, though.
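
To illustrate the indexing concern above: in Julia the master process has id 1 and workers are numbered from 2, so indexing an array sized to the number of workers by process id can run past its end. The following is only a minimal sketch, assuming a current Julia release and using a plain array in place of the SharedArray; the ids in ps are hypothetical stand-ins for what workers() would return with 5 workers.

```julia
# Hypothetical worker ids; in a real session these would come from workers().
ps = [2, 3, 4, 5, 6]
out = zeros(Int, length(ps))   # one slot per worker, so valid indices are 1:5

# Indexing by process id: the last id (6) is past the end of `out`.
failed = try
    for p in ps
        out[p] = p
    end
    false
catch err
    err isa BoundsError
end

# Indexing by position keeps every write inside 1:length(out).
fill!(out, 0)
for (i, p) in enumerate(ps)
    out[i] = p
end
```

Here failed records whether the id-indexed loop threw a BoundsError, while the position-indexed loop fills every slot of out.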

@carlobaldassi
Member Author

Unfortunately that's not a possible explanation: I'm not changing the number of processes dynamically.

@amitmurthy
Contributor

If you still have the session open could you println(ps), println(length(out)) and println(out) ?

@carlobaldassi
Member Author

The session is open but I don't have access to those variables since they were local to a function (I removed that from the backtrace).

@amitmurthy
Contributor

OK. Anyways, just to be safe, your code should probably be changed to

@sync for (i,p) in enumerate(ps)
    @spawnat p begin
        out[i] = update(p, shrd, args)
    end
end

No?

@carlobaldassi
Member Author

Yes, that is basically what it actually does. I simplified too much wrt the actual code, sorry. I also didn't specify that that portion of the code runs thousands of times without issues before crashing.

@carlobaldassi
Member Author

So this bug is really killing me now. It happens randomly, but given enough time it shows up reliably, so the long simulations I'm running crash before they ever reach the end.

I have some more data though (12 MB of data, to be precise), but I'm not sure it's useful. If given directions, I could produce something more detailed (given a few days of the simulation running). For the time being, I put a try...catch block around the call to flush_gc_msgs inside send_msg_ and made it call dump and xdump on all variables:

try
    flush_gc_msgs(w)
catch
    open("dump.txt", "w") do f
        println(f, "worker:")
        dump(f, w)
        println(f, "worker (x):")
        xdump(f, w)
        println(f, "kind:")
        dump(f, kind)
        println(f, "kind (x):")
        xdump(f, kind)
        println(f, "args:")
        dump(f, args)
        println(f, "args (x):")
        xdump(f, args)
    end
    rethrow()
end

The result is collected in this compressed file.

More information: the job was running with 1 master and 25 worker processes (all local). I've seen it happening on 2 different machines. The stack trace was similar to the one reported before, however I've seen different traces in other cases, all of them ending with the same 2 calls, send_msg_ and flush_gc_msgs:

ERROR: BoundsError()
 in flush_gc_msgs at multi.jl:140
 in send_msg_ at multi.jl:165
 in remotecall_fetch at multi.jl:690
 in sync_end at task.jl:304
 in ... [etc. (my script)]

versioninfo:

julia> versioninfo()
Julia Version 0.3.0-prerelease+2579
Commit 036c6cc* (2014-04-10 07:17 UTC)
Platform Info:
  System: Linux (x86_64-linux-gnu)
  CPU: AMD Opteron(tm) Processor 6282 SE              
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

I'm now restarting the job, producing a different output file for each worker, and removing sys.so.
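
As a rough illustration of per-worker output, the dump-and-rethrow pattern can be wrapped in a small helper that names the file after an id. This is only a sketch assuming a current Julia release, not the code that was actually run: with_dump, its arguments, and the dump-<id>.txt naming are all hypothetical.

```julia
# Hypothetical helper: run `f`; on failure, dump `state` to a file named
# after `id` inside `dir`, then rethrow the original error.
function with_dump(f, id, state; dir = mktempdir())
    try
        return f()
    catch
        open(joinpath(dir, "dump-$(id).txt"), "w") do io
            println(io, "state:")
            dump(io, state)
        end
        rethrow()
    end
end

# Usage: a failing call leaves a dump file behind and still propagates the
# error; a successful call returns its value and writes nothing.
dir = mktempdir()
threw = try
    with_dump(() -> error("boom"), 7, [1, 2, 3]; dir = dir)
    false
catch
    true
end
ok = with_dump(() -> 42, 8, nothing; dir = dir)
```

Keeping one file per id means a later failure on another worker does not overwrite an earlier dump.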

@ihnorton
Member

Still an issue with the new GC?

@jakebolewski
Member

Please reopen if this is still an issue.
