Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tar_make_clustermq not shutting down cleanly in recent versions of targets #265

Closed
6 tasks
mattwarkentin opened this issue Jan 11, 2021 · 5 comments
Closed
6 tasks
Assignees

Comments

@mattwarkentin
Copy link
Contributor

Prework

  • Read and agree to the code of conduct and contributing guidelines.
  • If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • Post a minimal reproducible example like this one so the maintainer can troubleshoot the problems you identify. A reproducible example is:
    • Runnable: post enough R code and data so any onlooker can create the error on their own computer.
    • Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
    • Readable: format your code according to the tidyverse style guide.

Description

Hi @wlandau,

Has anything changed with targets recently that would change the way a pipeline finishes? With the last couple installations I get unclean shutdowns for some of my clustermq workers, but the pipeline seems to actually finish successfully. Just seems to be related to how the main process is handled when finishing, perhaps?

targets::tar_make_clustermq(
  reporter = "timestamp",
  workers = 4L
)
Warning in self$crew$finalize() :
  Unclean shutdown for PIDs: 27540, 27547, 27558
Error in private$zmq$poll(sid, timeout) : 1 peer(s) lost
Error: callr subprocess failed: 1 peer(s) lost
Type .Last.error.trace to see where the error occured
Stack trace:

 Process 25513:
 1. targets::tar_make_clustermq(reporter = "timestamp", workers = 4L)
 2. targets:::callr_outer(targets_function = tar_make_clustermq_ ...
 3. targets:::trn(is.null(callr_function), callr_inner(target_sc ...
 4. base:::do.call(callr_function, prepare_callr_arguments(callr ...
 5. (function (func, args = list(), libpath = .libPaths(), repos ...
 6. callr:::get_result(output = out, options)
 7. throw(newerr, parent = remerr[[2]])

 x callr subprocess failed: 1 peer(s) lost 

 Process 27528:
 19. (function (targets_script, targets_function, targets_argumen ...
 20. base:::do.call(targets_function, targets_arguments)
 21. (function (pipeline, names_quosure, reporter, workers, log_w ...
 22. clustermq_init(pipeline = pipeline, names = names, queue = " ...
 23. self$run_clustermq()
 24. self$crew$cleanup()
 25. super$cleanup(quiet = quiet, timeout = timeout)
 26. self$receive_data(timeout = timeout, with_checks = FALSE)
 27. self$receive_data(timeout, with_checks = with_checks)
 28. private$zmq$poll(timeout = msec)
 29. private$zmq$poll(sid, timeout)
 30. base:::.handleSimpleError(function (e)  ...
 31. h(simpleError(msg, call))

 x 1 peer(s) lost 
@mattwarkentin
Copy link
Contributor Author

mattwarkentin commented Jan 11, 2021

Of note, I noticed you've added this message to the end of the build...

● 2021-01-11 18:10:37.22 -0500 GMT end pipeline

I get that message for tar_make(), but it does not show up when calling tar_make_clustermq(). So something isn't wrapping up properly with clustermq, it seems...

@wlandau
Copy link
Member

wlandau commented Jan 12, 2021

Are you using development clustermq by any chance? Using CRAN clustermq (0.8.95.1) HPC on SGE seems to work:

remotes::install_github("wlandau/targets")
#> Using github PAT from envvar GITHUB_PAT
#> Skipping install of 'targets' from a github remote, the SHA1 (0aeabe85) has not changed since last install.
#>  Use `force = TRUE` to force installation

packageDescription("targets")$GithubSHA1
#> [1] "0aeabe85431e5e10416f42253319cb5226558643"

library(targets)

writeLines("#$ -N {{ job_name }}
#$ -t 1-{{ n_jobs }}
#$ -j y
#$ -cwd
#$ -V
module load R/4.0.3
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker(\"{{ master }}\")'
", "cmq.tmpl")

tar_script({
  options(
    clustermq.scheduler = "sge",
    clustermq.template = "cmq.tmpl"
  )
  sleep <- function(x) {
    Sys.sleep(1)
    x
  }
  list(
    tar_target(x, seq_len(12L)),
    tar_target(y, sleep(x), pattern = map(x)),
    tar_target(z, sleep(y), pattern = map(y))
  )
})

tar_make_clustermq(workers = 4L)
#> ● run target x
#> ● run branch y_29239c8a
#> ● run branch y_7cc32924
#> ● run branch y_bd602d50
#> ● run branch y_05f206d7
#> ● run branch y_d8bc1b56
#> ● run branch y_356ae18f
#> ● run branch y_47b4a323
#> ● run branch y_9964667b
#> ● run branch y_a4d6927f
#> ● run branch y_00e82c73
#> ● run branch y_7ffb6cff
#> ● run branch y_e45d295b
#> ● run branch z_d0697b30
#> ● run branch z_959da4d3
#> ● run branch z_92be5b5f
#> ● run branch z_f698fbf4
#> ● run branch z_80b956fa
#> ● run branch z_2a3aa09e
#> ● run branch z_ab885e3c
#> ● run branch z_ae3fc0b8
#> ● run branch z_ce6eccbc
#> ● run branch z_c3b501f1
#> ● run branch z_f0291ce7
#> ● run branch z_d742c285
#> Master: [13.0s 2.6% CPU]; Worker: [avg 8.2% CPU, max 280.6 Mb]
#> ● end pipeline

tar_read(z)
#> [1]  1  2  3  4  5  6  7  8  9 10 11 12

But not after I install mschubert/clustermq@b61d8da.

targets::tar_make_clustermq(workers = 4L)
#> ● run target x
#>  ● run branch y_29239c8a
#>  ● run branch y_7cc32924
#>  ● run branch y_bd602d50
#>  ● run branch y_05f206d7
#>  ● run branch y_d8bc1b56
#>  ● run branch y_356ae18f
#>  ● run branch y_47b4a323
#>  ● run branch y_9964667b
#>  ● run branch y_a4d6927f
#>  ● run branch y_00e82c73
#>  ● run branch y_7ffb6cff
#>  ● run branch y_e45d295b
#>  ● run branch z_d0697b30
#>  ● run branch z_959da4d3
#>  ● run branch z_92be5b5f
#>  ● run branch z_f698fbf4
#>  ● run branch z_80b956fa
#>  ● run branch z_2a3aa09e
#>  ● run branch z_ab885e3c
#>  ● run branch z_ae3fc0b8
#>  ● run branch z_ce6eccbc
#>  ● run branch z_c3b501f1
#>  ● run branch z_f0291ce7run branch z_d742c285
#>  Error in private$zmq$poll(sid, timeout) : 1 peer(s) lost
#>  <CENSORED> has registered the job-array task 77764053.1 for deletion
#>  <CENSORED> has registered the job-array task 77764053.4 for deletion
#>  job 77764053.1 is already in deletion
#>  job 77764053.2 is already in deletion
#>  job 77764053.3 is already in deletion
#>  job 77764053.4 is already in deletion
#>  Error: callr subprocess failed: 1 peer(s) lost
#>  Type .Last.error.trace to see where the error occured

Doesn't seem to affect Q().

options(
  clustermq.scheduler = "sge",
  clustermq.template = "cmq.tmpl"
)
clustermq::Q(function(x) x, x = 1:3, n_jobs = 2)
#> Submitting 2 worker jobs (ID: cmq8202) ...
#> Running 3 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
#> Master: [1.9s 0.7% CPU]; Worker: [avg 33.1% CPU, max 248.6 Mb]

Maybe it's from a recent change to clustermq's worker API? @mschubert, what do you think?

@wlandau
Copy link
Member

wlandau commented Jan 12, 2021

If the new worker API really is supposed to be different, I am afraid all I can do for now is just try() to shut down the workers properly (e6e645f). targets needs to be compatible with both versions.

...unless the recommended way to shut down workers is now just workers$finalize() instead of if !workers$cleanup()) workers$finalize(). targets does the latter. @mschubert, if I should switch to the former, please let me know.

@wlandau
Copy link
Member

wlandau commented Jan 12, 2021

I wonder, could this be related to mschubert/clustermq#223?

@mschubert
Copy link

x 1 peer(s) lost

Yes, that's 223.

Unfortunately, it's difficult to solve because my dev code behaves differently than the ZeroMQ documentation says it should (or at least not as I expected after reading the docs)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants