failed worker startup #30031

Closed
bjarthur opened this issue Nov 14, 2018 · 7 comments · Fixed by #30172

Comments

@bjarthur
Contributor

Even if the remote cluster worker starts successfully, text is printed saying that it failed:

julia> using Distributed, ClusterManagers

julia> addprocs_lsf(1)
<<Waiting for dispatch ...>>
<<Starting on h11u01>>
	From failed worker startup:	Job <50204976> is submitted to default queue <interactive>.
1-element Array{Int64,1}:
 2

julia> fetch(@spawnat 2 myid())
2

julia> nprocs()
2

I believe all that needs to be changed is removing "failed" from this line.

Could be something wrong with this PR though, which was used in the MWE above.
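
For reference, the suggested change amounts to dropping one word from the message that read_worker_host_port() prints (a sketch of the wording only, not a verbatim diff of the Distributed source):

# current wording, as seen in the output above
println("\tFrom failed worker startup:\t", line)
# suggested wording, since the line may just be informational output
println("\tFrom worker startup:\t", line)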

@vchuravy
Member

We should have returned early from that function without printing:

if !isempty(bind_addr)
return bind_addr, port
end

But why are we calling that function anyway... How does LSF communicate which ip:port to use?

@bjarthur
Contributor Author

finally-blocks seem to be executed even if the try-block is returned from:

julia> function foo()
           try
               return 1
           finally
               @info 2
           end
       end
foo (generic function with 1 method)

julia> foo()
[ Info: 2
1

@StefanKarpinski
Member

finally-blocks seem to be executed even if the try-block is returned from

Yes, that's how they work.

@bjarthur
Contributor Author

And LSF communication is as expected, I believe:

$ bsub -I -cwd /home/arthurb -J julia-131471 /home/arthurb/bin/julia-1.0.2/bin/julia --worker=MwiE9htAxfw6LEMW 1>1.out 2>2.out
^C^C

$ cat 1.out 
Job <50277705> is submitted to default queue <interactive>.
julia_worker:9230#10.36.111.11
fatal: error thrown and no exception handler available.
InterruptException()
jl_run_once at /buildworker/worker/package_linux64/build/src/jl_uv.c:185
process_events at ./libuv.jl:98 [inlined]
wait at ./event.jl:246
task_done_hook at ./task.jl:309
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2184
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1537 [inlined]
finish_task at /buildworker/worker/package_linux64/build/src/task.c:233
start_task at /buildworker/worker/package_linux64/build/src/task.c:276
unknown function (ip: 0xffffffffffffffff)

$ cat 2.out 
<<Waiting for dispatch ...>>
<<Starting on h11u01>>
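
So the worker advertises its address by printing a julia_worker:port#host line to stdout, which the master then parses. A rough illustration of that format (Distributed uses its own internal parse_connection_info(); the regex here is only a sketch):

line = "julia_worker:9230#10.36.111.11"
m = match(r"^julia_worker:(\d+)#(.*)", line)
host = m.captures[2]                 # "10.36.111.11"
port = parse(UInt16, m.captures[1])  # 9230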

@bjarthur
Contributor Author

So read_worker_host_port() needs to be refactored so that it does not print the warning if parse_connection_info() succeeds.

Now I just need to figure out how to get addprocs_lsf() to not block the command line when the cluster is full and worker nodes end up pending in the queue...
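
A rough sketch of that refactor, with a flag tracking whether parsing ever succeeded (variable names and structure are illustrative, not the actual cluster.jl code; parse_connection_info() is internal to Distributed):

using Distributed

function read_worker_host_port_sketch(io::IO)
    pending = String[]   # worker output that is not the connection string
    parsed = false
    try
        while !eof(io)
            line = readline(io)
            bind_addr, port = Distributed.parse_connection_info(line)
            if !isempty(bind_addr)
                parsed = true
                return bind_addr, port
            end
            push!(pending, line)
        end
        error("Unable to read host:port string from worker.")
    finally
        # only label the buffered output as a failure if we never parsed an address
        if !parsed
            for line in pending
                println(stderr, "\tFrom failed worker startup:\t", line)
            end
        end
    end
end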

@bjarthur
Contributor Author

This @async and yield() do not seem to be returning control to the REPL. Does anyone have any idea why?

I base this deduction on the stack trace below, which results from breaking out of an addprocs() call whose worker gets stuck in the pending queue:

julia> using Distributed, ClusterManagers

julia> addprocs_lsf(1, `-n 49`)
<<Waiting for dispatch ...>>
^C	From failed worker startup:	Job <50362709> is submitted to default queue <interactive>.
ERROR: InterruptException:
process_events at ./libuv.jl:98 [inlined]
wait() at ./event.jl:246
wait(::Condition) at ./event.jl:46
stream_wait(::Timer, ::Condition) at ./stream.jl:47
wait at ./event.jl:375 [inlined]
sleep at ./event.jl:429 [inlined]
macro expansion at ./task.jl:264 [inlined]
read_worker_host_port(::Base.Process) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:270
connect(::LSFManager, ::Int64, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/managers.jl:397
create_worker(::LSFManager, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:505
setup_launched_worker(::LSFManager, ::WorkerConfig, ::Array{Int64,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:451
(::getfield(Distributed, Symbol("##47#50")){LSFManager,WorkerConfig})() at ./task.jl:259
Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:226
 [2] #addprocs_locked#44(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::LSFManager) at ./task.jl:266
 [3] addprocs_locked at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:376 [inlined]
 [4] #addprocs#43(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::LSFManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:369
 [5] #addprocs_lsf#35 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:363 [inlined]
 [6] addprocs_lsf(::Int64, ::Cmd) at /groups/scicompsoft/home/arthurb/.julia/dev/ClusterManagers/src/lsf.jl:40
 [7] top-level scope at none:0
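
For context, the sync_end frame in that trace comes from the @sync block inside addprocs, and a @sync block waits for every @async task it encloses, so a yield() inside it does not hand control back to the REPL. A minimal illustration (not the addprocs code itself):

@sync begin
    @async sleep(5)   # stands in for a worker stuck pending in the queue
    yield()           # lets the task start, but does not detach it from @sync
end
# reached only after the 5-second sleep completes, because @sync blocks
# until every enclosed @async task finishes
println("done")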

@bjarthur
Contributor Author

It would also be nice if the process were killed should the worker time out; otherwise it will remain in the pending queue. Is the right thing to do just to add a call to kill here?
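
A hedged sketch of what that could look like at the call site (the launch process handle, here called launch_proc, and the wrapper name are assumptions; read_worker_host_port() is internal to Distributed):

using Distributed

# illustrative only: if reading the host:port string throws (e.g. on timeout),
# terminate the launch process so the job does not linger in LSF's pending queue
function read_host_port_or_kill(launch_proc::Base.Process)
    try
        return Distributed.read_worker_host_port(launch_proc)
    catch
        kill(launch_proc)   # Base.kill sends SIGTERM by default
        rethrow()
    end
end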
