failed worker startup #30031

Closed
bjarthur opened this issue Nov 14, 2018 · 7 comments · Fixed by #30172

Comments

@bjarthur
Contributor

Even if the remote cluster worker starts successfully, text is printed saying that it failed:

julia> using Distributed, ClusterManagers

julia> addprocs_lsf(1)
<<Waiting for dispatch ...>>
<<Starting on h11u01>>
	From failed worker startup:	Job <50204976> is submitted to default queue <interactive>.
1-element Array{Int64,1}:
 2

julia> fetch(@spawnat 2 myid())
2

julia> nprocs()
2

I believe all that needs to be changed is removing "failed" from this line.

Could be something wrong with this PR though, which was used in the MWE above.
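
For reference, the suggested change amounts to dropping one word from the message that read_worker_host_port() prints (a sketch of the wording only, not a verbatim diff of the Distributed source):

# current wording, as seen in the output above
println("\tFrom failed worker startup:\t", line)
# suggested wording, since the line may just be informational output
println("\tFrom worker startup:\t", line)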

@vchuravy
Member

We should have returned early from that function without printing:

if !isempty(bind_addr)
return bind_addr, port
end

But why are we calling that function anyway... How does LSF communicate which ip:port to use?

@bjarthur
Contributor Author

finally-blocks seem to be executed even if the try-block is returned from:

julia> function foo()
           try
               return 1
           finally
               @info 2
           end
       end
foo (generic function with 1 method)

julia> foo()
[ Info: 2
1

@StefanKarpinski
Member

finally-blocks seem to be executed even if the try-block is returned from

Yes, that's how they work.

@bjarthur
Contributor Author

And LSF communication is as expected, I believe:

$ bsub -I -cwd /home/arthurb -J julia-131471 /home/arthurb/bin/julia-1.0.2/bin/julia --worker=MwiE9htAxfw6LEMW 1>1.out 2>2.out
^C^C

$ cat 1.out 
Job <50277705> is submitted to default queue <interactive>.
julia_worker:9230#10.36.111.11
fatal: error thrown and no exception handler available.
InterruptException()
jl_run_once at /buildworker/worker/package_linux64/build/src/jl_uv.c:185
process_events at ./libuv.jl:98 [inlined]
wait at ./event.jl:246
task_done_hook at ./task.jl:309
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2184
jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1537 [inlined]
finish_task at /buildworker/worker/package_linux64/build/src/task.c:233
start_task at /buildworker/worker/package_linux64/build/src/task.c:276
unknown function (ip: 0xffffffffffffffff)

$ cat 2.out 
<<Waiting for dispatch ...>>
<<Starting on h11u01>>
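
So the worker advertises its address by printing a julia_worker:port#host line to stdout, which the master then parses. A rough illustration of that format (Distributed uses its own internal parse_connection_info(); the regex here is only a sketch):

line = "julia_worker:9230#10.36.111.11"
m = match(r"^julia_worker:(\d+)#(.*)", line)
host = m.captures[2]                 # "10.36.111.11"
port = parse(UInt16, m.captures[1])  # 9230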

@bjarthur
Contributor Author

So read_worker_host_port() needs to be refactored so that it does not print the warning if parse_connection_info() succeeds.

Now I just need to figure out how to get addprocs_lsf() to not block the command line when the cluster is full and worker nodes end up pending in the queue...
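
A rough sketch of that refactor, with a flag tracking whether parsing ever succeeded (variable names and structure are illustrative, not the actual cluster.jl code; parse_connection_info() is internal to Distributed):

using Distributed

function read_worker_host_port_sketch(io::IO)
    pending = String[]   # worker output that is not the connection string
    parsed = false
    try
        while !eof(io)
            line = readline(io)
            bind_addr, port = Distributed.parse_connection_info(line)
            if !isempty(bind_addr)
                parsed = true
                return bind_addr, port
            end
            push!(pending, line)
        end
        error("Unable to read host:port string from worker.")
    finally
        # only label the buffered output as a failure if we never parsed an address
        if !parsed
            for line in pending
                println(stderr, "\tFrom failed worker startup:\t", line)
            end
        end
    end
end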

@bjarthur
Contributor Author

This @async and yield() do not seem to be returning control to the REPL. Does anyone have any idea why?

I base this deduction on the stack trace below, which results from breaking out of an addprocs() call whose worker gets stuck in the pending queue:

julia> using Distributed, ClusterManagers

julia> addprocs_lsf(1, `-n 49`)
<<Waiting for dispatch ...>>
^C	From failed worker startup:	Job <50362709> is submitted to default queue <interactive>.
ERROR: InterruptException:
process_events at ./libuv.jl:98 [inlined]
wait() at ./event.jl:246
wait(::Condition) at ./event.jl:46
stream_wait(::Timer, ::Condition) at ./stream.jl:47
wait at ./event.jl:375 [inlined]
sleep at ./event.jl:429 [inlined]
macro expansion at ./task.jl:264 [inlined]
read_worker_host_port(::Base.Process) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:270
connect(::LSFManager, ::Int64, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/managers.jl:397
create_worker(::LSFManager, ::WorkerConfig) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:505
setup_launched_worker(::LSFManager, ::WorkerConfig, ::Array{Int64,1}) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:451
(::getfield(Distributed, Symbol("##47#50")){LSFManager,WorkerConfig})() at ./task.jl:259
Stacktrace:
 [1] sync_end(::Array{Any,1}) at ./task.jl:226
 [2] #addprocs_locked#44(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::LSFManager) at ./task.jl:266
 [3] addprocs_locked at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:376 [inlined]
 [4] #addprocs#43(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::LSFManager) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:369
 [5] #addprocs_lsf#35 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.0/Distributed/src/cluster.jl:363 [inlined]
 [6] addprocs_lsf(::Int64, ::Cmd) at /groups/scicompsoft/home/arthurb/.julia/dev/ClusterManagers/src/lsf.jl:40
 [7] top-level scope at none:0
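
For context, the sync_end frame in that trace comes from the @sync block inside addprocs, and a @sync block waits for every @async task it encloses, so a yield() inside it does not hand control back to the REPL. A minimal illustration (not the addprocs code itself):

@sync begin
    @async sleep(5)   # stands in for a worker stuck pending in the queue
    yield()           # lets the task start, but does not detach it from @sync
end
# reached only after the 5-second sleep completes, because @sync blocks
# until every enclosed @async task finishes
println("done")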

@bjarthur
Contributor Author

It would also be nice if the process were killed should the worker time out; otherwise it will remain in the pending queue. Is the right thing to do just to add a call to kill here?
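
A hedged sketch of what that could look like at the call site (the launch process handle, here called launch_proc, and the wrapper name are assumptions; read_worker_host_port() is internal to Distributed):

using Distributed

# illustrative only: if reading the host:port string throws (e.g. on timeout),
# terminate the launch process so the job does not linger in LSF's pending queue
function read_host_port_or_kill(launch_proc::Base.Process)
    try
        return Distributed.read_worker_host_port(launch_proc)
    catch
        kill(launch_proc)   # Base.kill sends SIGTERM by default
        rethrow()
    end
end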
