failed worker startup #30031
We should have returned early from that function without printing: julia/stdlib/Distributed/src/cluster.jl, lines 286 to 288 in 453a7dd
But why are we calling that function anyway... How does LSF communicate which ip:port to use?
finally-blocks seem to be executed even if the try-block returns early.
Yes, that's how they work.
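A minimal sketch of that behaviour, using a hypothetical function `early_return_demo` (it is not code from cluster.jl): the `finally` clause still runs when the `try` block exits through `return`.

```julia
# Hypothetical demo function: records the order in which the blocks run.
function early_return_demo(events::Vector{String})
    try
        push!(events, "try")
        return :returned_early      # leaves the function from inside `try`
    finally
        push!(events, "finally")    # still executed before the return completes
    end
    push!(events, "after")          # never reached
    return :fell_through
end

events = String[]
result = early_return_demo(events)
# result == :returned_early and events == ["try", "finally"]
```

So any printing placed in a `finally` clause happens on the success path too, unless it is guarded.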
And LSF communication is as expected, I believe.
So read_worker_host_port() needs to be refactored so that it does not print the warning when parse_connection_info() succeeds. Now I just need to figure out how to keep addprocs_lsf() from blocking the command line when the cluster is full and worker nodes end up pending in the queue...
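One way such a refactoring could look, as a rough sketch only: keep the `finally` clause but guard the printing with a success flag. The names `read_host_port` and `parse_host_port` are hypothetical stand-ins, not the stdlib functions, and the `julia_worker:<port>#<host>` line format and the sample scheduler message are assumptions made for the illustration.

```julia
# Hypothetical stand-in for the parser: returns (host, port), or ("", 0)
# when the line does not look like "julia_worker:<port>#<host>".
function parse_host_port(line::AbstractString)
    m = match(r"^julia_worker:(\d+)#(.*)$", line)
    m === nothing && return ("", 0)
    return (String(m.captures[2]), parse(Int, m.captures[1]))
end

# Scans lines until a host:port line is found. Lines that did not match are
# echoed to `report` only when no host:port line was found.
function read_host_port(io::IO, report::IO=stdout; maxlines::Int=1000)
    skipped = String[]
    succeeded = false
    try
        for _ in 1:maxlines
            eof(io) && break
            line = readline(io)
            host, port = parse_host_port(line)
            if !isempty(host)
                succeeded = true
                return host, port
            end
            push!(skipped, line)    # informational line from the launcher
        end
        error("host:port string not found")
    finally
        if !succeeded               # the guard: stay quiet on success
            for line in skipped
                println(report, "\tFrom failed worker startup:\t", line)
            end
        end
    end
end

# Illustrative input: one scheduler message, then the worker's line.
ok_stream = IOBuffer("Job <123> is submitted to default queue.\njulia_worker:9009#10.0.0.1\n")
report_buf = IOBuffer()
host, port = read_host_port(ok_stream, report_buf)
```

With the guard in place, the skipped scheduler message is collected but not printed, because a host:port line was eventually found.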
This @async and yield() do not seem to be returning control to the REPL. Does anyone have any idea why? I base this deduction on the stack trace that results from breaking out of an addprocs() call for which the worker gets stuck in the pending queue.
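As an illustration of the general pattern (this is not the stack trace mentioned above, and whether it explains the behaviour of addprocs() is an assumption): when a function starts a task with `@async` and then polls it in a loop, the task runs concurrently, but the caller still blocks until the loop exits. Control comes back immediately only if the whole polling function is itself started in a task. `poll_with_timeout` and `never_ready` are hypothetical names.

```julia
# Polls a background task until it finishes or `timeout` seconds elapse.
function poll_with_timeout(f, timeout::Real)
    t0 = time()
    task = @async f()
    yield()                         # lets the task start; does not return to the caller's caller
    while !istaskdone(task) && (time() - t0) < timeout
        sleep(0.05)                 # the calling task stays inside this loop
    end
    return istaskdone(task) ? fetch(task) : :timed_out
end

# Stand-in for a worker that stays pending longer than the timeout.
never_ready() = (sleep(2); :ready)

# Called directly: the caller is blocked for the full timeout.
t0 = time()
blocking_result = poll_with_timeout(never_ready, 0.5)
blocking_elapsed = time() - t0

# Wrapped in a task: the call returns right away, the result is fetched later.
t0 = time()
background = @async poll_with_timeout(never_ready, 0.5)
launch_elapsed = time() - t0
background_result = fetch(background)
```

In a REPL, the second form is what would leave the prompt usable while the wait is in progress.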
It would also be nice if the process were killed when the worker times out; otherwise it will remain in the pending queue. Is the right thing to do to just add a call to
Even if a remote cluster worker starts successfully, text is printed which says that it failed.
I believe all that needs to be changed is removing "failed" from this line.
There could be something wrong with this PR though, which was used in the MWE above.