tar_make_clustermq not shutting down cleanly in recent versions of targets #265

mattwarkentin · 2021-01-11T22:46:31Z

Prework

Read and agree to the code of conduct and contributing guidelines.
If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
Post a minimal reproducible example like this one so the maintainer can troubleshoot the problems you identify. A reproducible example is:
- Runnable: post enough R code and data so any onlooker can create the error on their own computer.
- Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
- Readable: format your code according to the tidyverse style guide.

Description

Has anything changed with targets recently that would change the way a pipeline finishes? With the last couple installations I get unclean shutdowns for some of my clustermq workers, but the pipeline seems to actually finish successfully. Just seems to be related to how the main process is handled when finishing, perhaps?

targets::tar_make_clustermq(
  reporter = "timestamp",
  workers = 4L
)

Warning in self$crew$finalize() :
  Unclean shutdown for PIDs: 27540, 27547, 27558
Error in private$zmq$poll(sid, timeout) : 1 peer(s) lost
Error: callr subprocess failed: 1 peer(s) lost
Type .Last.error.trace to see where the error occured

Stack trace:

 Process 25513:
 1. targets::tar_make_clustermq(reporter = "timestamp", workers = 4L)
 2. targets:::callr_outer(targets_function = tar_make_clustermq_ ...
 3. targets:::trn(is.null(callr_function), callr_inner(target_sc ...
 4. base:::do.call(callr_function, prepare_callr_arguments(callr ...
 5. (function (func, args = list(), libpath = .libPaths(), repos ...
 6. callr:::get_result(output = out, options)
 7. throw(newerr, parent = remerr[[2]])

 x callr subprocess failed: 1 peer(s) lost 

 Process 27528:
 19. (function (targets_script, targets_function, targets_argumen ...
 20. base:::do.call(targets_function, targets_arguments)
 21. (function (pipeline, names_quosure, reporter, workers, log_w ...
 22. clustermq_init(pipeline = pipeline, names = names, queue = " ...
 23. self$run_clustermq()
 24. self$crew$cleanup()
 25. super$cleanup(quiet = quiet, timeout = timeout)
 26. self$receive_data(timeout = timeout, with_checks = FALSE)
 27. self$receive_data(timeout, with_checks = with_checks)
 28. private$zmq$poll(timeout = msec)
 29. private$zmq$poll(sid, timeout)
 30. base:::.handleSimpleError(function (e)  ...
 31. h(simpleError(msg, call))

 x 1 peer(s) lost

The text was updated successfully, but these errors were encountered:

mattwarkentin · 2021-01-11T23:12:59Z

Of note, I noticed you've added this message to the end of the build...

● 2021-01-11 18:10:37.22 -0500 GMT end pipeline

I get that message for tar_make(), but it does not show up when calling tar_make_clustermq(). So something isn't wrapping up properly with clustermq, it seems...

wlandau · 2021-01-12T00:25:05Z

Are you using development clustermq by any chance? Using CRAN clustermq (0.8.95.1) HPC on SGE seems to work:

remotes::install_github("wlandau/targets")
#> Using github PAT from envvar GITHUB_PAT
#> Skipping install of 'targets' from a github remote, the SHA1 (0aeabe85) has not changed since last install.
#>  Use `force = TRUE` to force installation

packageDescription("targets")$GithubSHA1
#> [1] "0aeabe85431e5e10416f42253319cb5226558643"

library(targets)

writeLines("#$ -N {{ job_name }}
#$ -t 1-{{ n_jobs }}
#$ -j y
#$ -cwd
#$ -V
module load R/4.0.3
CMQ_AUTH={{ auth }} R --no-save --no-restore -e 'clustermq:::worker(\"{{ master }}\")'
", "cmq.tmpl")

tar_script({
  options(
    clustermq.scheduler = "sge",
    clustermq.template = "cmq.tmpl"
  )
  sleep <- function(x) {
    Sys.sleep(1)
    x
  }
  list(
    tar_target(x, seq_len(12L)),
    tar_target(y, sleep(x), pattern = map(x)),
    tar_target(z, sleep(y), pattern = map(y))
  )
})

tar_make_clustermq(workers = 4L)
#> ● run target x
#> ● run branch y_29239c8a
#> ● run branch y_7cc32924
#> ● run branch y_bd602d50
#> ● run branch y_05f206d7
#> ● run branch y_d8bc1b56
#> ● run branch y_356ae18f
#> ● run branch y_47b4a323
#> ● run branch y_9964667b
#> ● run branch y_a4d6927f
#> ● run branch y_00e82c73
#> ● run branch y_7ffb6cff
#> ● run branch y_e45d295b
#> ● run branch z_d0697b30
#> ● run branch z_959da4d3
#> ● run branch z_92be5b5f
#> ● run branch z_f698fbf4
#> ● run branch z_80b956fa
#> ● run branch z_2a3aa09e
#> ● run branch z_ab885e3c
#> ● run branch z_ae3fc0b8
#> ● run branch z_ce6eccbc
#> ● run branch z_c3b501f1
#> ● run branch z_f0291ce7
#> ● run branch z_d742c285
#> Master: [13.0s 2.6% CPU]; Worker: [avg 8.2% CPU, max 280.6 Mb]
#> ● end pipeline

tar_read(z)
#> [1]  1  2  3  4  5  6  7  8  9 10 11 12

But not after I install mschubert/clustermq@b61d8da.

targets::tar_make_clustermq(workers = 4L)
#> ● run target x
#>  ● run branch y_29239c8a
#>  ● run branch y_7cc32924
#>  ● run branch y_bd602d50
#>  ● run branch y_05f206d7
#>  ● run branch y_d8bc1b56
#>  ● run branch y_356ae18f
#>  ● run branch y_47b4a323
#>  ● run branch y_9964667b
#>  ● run branch y_a4d6927f
#>  ● run branch y_00e82c73
#>  ● run branch y_7ffb6cff
#>  ● run branch y_e45d295b
#>  ● run branch z_d0697b30
#>  ● run branch z_959da4d3
#>  ● run branch z_92be5b5f
#>  ● run branch z_f698fbf4
#>  ● run branch z_80b956fa
#>  ● run branch z_2a3aa09e
#>  ● run branch z_ab885e3c
#>  ● run branch z_ae3fc0b8
#>  ● run branch z_ce6eccbc
#>  ● run branch z_c3b501f1
#>  ● run branch z_f0291ce7
● run branch z_d742c285
#>  Error in private$zmq$poll(sid, timeout) : 1 peer(s) lost
#>  <CENSORED> has registered the job-array task 77764053.1 for deletion
#>  <CENSORED> has registered the job-array task 77764053.4 for deletion
#>  job 77764053.1 is already in deletion
#>  job 77764053.2 is already in deletion
#>  job 77764053.3 is already in deletion
#>  job 77764053.4 is already in deletion
#>  Error: callr subprocess failed: 1 peer(s) lost
#>  Type .Last.error.trace to see where the error occured

Doesn't seem to affect Q().

options(
  clustermq.scheduler = "sge",
  clustermq.template = "cmq.tmpl"
)
clustermq::Q(function(x) x, x = 1:3, n_jobs = 2)
#> Submitting 2 worker jobs (ID: cmq8202) ...
#> Running 3 calculations (0 objs/0 Mb common; 1 calls/chunk) ...
#> Master: [1.9s 0.7% CPU]; Worker: [avg 33.1% CPU, max 248.6 Mb]

Maybe it's from a recent change to clustermq's worker API? @mschubert, what do you think?

wlandau · 2021-01-12T00:41:11Z

If the new worker API really is supposed to be different, I am afraid all I can do for now is just try() to shut down the workers properly (e6e645f). targets needs to be compatible with both versions.

...unless the recommended way to shut down workers is now just workers$finalize() instead of if !workers$cleanup()) workers$finalize(). targets does the latter. @mschubert, if I should switch to the former, please let me know.

wlandau · 2021-01-12T00:42:45Z

I wonder, could this be related to mschubert/clustermq#223?

mschubert · 2021-01-12T09:17:20Z

x 1 peer(s) lost

Yes, that's 223.

Unfortunately, it's difficult to solve because my dev code behaves differently than the ZeroMQ documentation says it should (or at least not as I expected after reading the docs)

mattwarkentin added the topic: performance label Jan 11, 2021

mattwarkentin assigned wlandau Jan 11, 2021

wlandau-lilly closed this as completed in e6e645f Jan 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tar_make_clustermq not shutting down cleanly in recent versions of targets #265

tar_make_clustermq not shutting down cleanly in recent versions of targets #265

mattwarkentin commented Jan 11, 2021

mattwarkentin commented Jan 11, 2021 •

edited

Loading

wlandau commented Jan 12, 2021

wlandau commented Jan 12, 2021

wlandau commented Jan 12, 2021

mschubert commented Jan 12, 2021

tar_make_clustermq not shutting down cleanly in recent versions of targets #265

tar_make_clustermq not shutting down cleanly in recent versions of targets #265

Comments

mattwarkentin commented Jan 11, 2021

Prework

Description

mattwarkentin commented Jan 11, 2021 • edited Loading

wlandau commented Jan 12, 2021

wlandau commented Jan 12, 2021

wlandau commented Jan 12, 2021

mschubert commented Jan 12, 2021

mattwarkentin commented Jan 11, 2021 •

edited

Loading