Monitor classes for SLURM, PBS, and LSF #32
Comments
Hi @wlandau - just to confirm my understanding, you're proposing we add, for instance […]
Yes, exactly! On SGE, the hardest part for me was parsing job status information. I had to dig into the XML because the non-XML output from […]
Actually, first I would like to simplify this part by creating a common abstract parent class for all the monitors to inherit from...
I'll give some thought to SLURM. There are the usual SLURM commands (squeue, scancel, etc.) whose output we could parse, but there's also a DB (optional and typically used in larger installations) that could be queried. Maybe the former is better, at least in the short term, since not everyone will have the DB.
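For reference, querying the accounting database would presumably go through `sacct` (a hypothetical sketch, not something proposed in the thread; `sacct` only works where accounting storage is configured, which is exactly the availability concern raised here):

```r
# Sketch: list the current user's jobs from the SLURM accounting database
# with sacct, using pipe-delimited parsable output. Requires accounting
# to be enabled on the cluster.
user <- ps::ps_username()
text <- system2(
  "sacct",
  args = c(
    "-u", user,
    "--format=JobID,JobName,Partition,State,Elapsed,NNodes",
    "--parsable2",
    "--noheader"
  ),
  stdout = TRUE,
  wait = TRUE
)
out <- read.delim(
  textConnection(text),
  sep = "|",
  header = FALSE,
  col.names = c("JobID", "JobName", "Partition", "State", "Elapsed", "NNodes")
)
tibble::as_tibble(out)
```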
Thanks for looking into this! In the end I would prefer something that all/most SLURM users would be able to use. By the way, as of 8cf036b I created a parent monitor class that all cluster-specific monitors inherit from: https://github.com/wlandau/crew.cluster/blob/main/R/crew_monitor_cluster.R. This helps reduce duplicated code/docs. The SGE monitor is much shorter now and easy to copy: https://github.com/wlandau/crew.cluster/blob/main/R/crew_monitor_sge.R. Tests are at https://github.com/wlandau/crew.cluster/blob/main/tests/testthat/test-crew_monitor_sge.R and https://github.com/wlandau/crew.cluster/blob/main/tests/sge/monitor.R.
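As an illustration of that inheritance pattern (the class and method names below are made up for the sketch; the real classes are in the files linked above):

```r
# Illustrative sketch only: a cluster-agnostic parent holds shared state,
# and each scheduler-specific monitor adds only its own job-listing logic.
library(R6)

monitor_cluster <- R6Class(
  classname = "monitor_cluster",
  public = list(
    verbose = NULL,
    initialize = function(verbose = FALSE) {
      self$verbose <- verbose
    }
  )
)

monitor_slurm <- R6Class(
  classname = "monitor_slurm",
  inherit = monitor_cluster,
  public = list(
    jobs = function(user = ps::ps_username()) {
      # scheduler-specific piece: call squeue and return its raw output
      system2("squeue", args = c("-u", shQuote(user)), stdout = TRUE)
    }
  )
)
```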
To make sure I understand, the monitor is only for interactive use? So the data.frame which is output by […]
There are two options for getting job information from `squeue`. The first is the default fixed-width table:

```r
# this is the default format given in `man squeue`, but specify it
# in case some user's configuration is different
default_format <- "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
user <- ps::ps_username()  # current user
text <- system2(
  "squeue",
  args = shQuote(c("-u", user, "-o", default_format)),
  stdout = TRUE,
  stderr = if_any(private$.verbose, "", FALSE),
  wait = TRUE
)
con <- textConnection(text)
out <- read.fwf(
  con,
  widths = c(18, -1, 9, -1, 8, -1, 8, -1, 2, -1, 10, -1, 6, -1, 100),
  skip = 1,
  col.names = c("JOBID", "PARTITION", "NAME", "USER", "ST", "TIME", "NODES", "NODELIST_REASON"),
  strip.white = TRUE
)
tibble::as_tibble(out)
## A tibble: 7 × 8
# JOBID PARTITION NAME USER ST TIME NODES NODELIST_REASON
# <int> <chr> <chr> <chr> <chr> <chr> <int> <chr>
#1 20504876 small crew-Opt brfurnea R 52:46 1 r18c36
#2 20504877 small crew-Opt brfurnea R 52:46 1 r18c23
#3 20504863 small crew-Opt brfurnea R 52:50 1 r18c41
#4 20504851 small crew-Opt brfurnea R 53:06 1 r18c33
#5 20504854 small crew-Opt brfurnea R 53:06 1 r18c35
#6 20504857 small crew-Opt brfurnea R 53:06 1 r18c40
#7 20504848 small OptimOTU brfurnea R 53:35 1 r18c43
```

The second option is `--yaml`:

```r
text <- system2("squeue", args = shQuote("--yaml"), stdout = TRUE, stderr = FALSE, wait = TRUE)
length(text)
# [1] 269314
```

This output is so long both because there are a lot of jobs and because it gives all possible fields, more than 100 per job. My feeling is that option 1 is the way to go, despite the fact that fixed-width outputs may cut some values (for instance, NAME above).
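One possible mitigation for that truncation (a hedged sketch only, untested on a real cluster; the 30-character width is illustrative): request a wider job-name field in the format string and keep the `read.fwf()` widths in sync:

```r
# Sketch: widen the %j (job name) field to reduce truncation; the
# read.fwf() widths below must stay in sync with the format string.
user <- ps::ps_username()
wide_format <- "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %R"
text <- system2(
  "squeue",
  args = shQuote(c("-u", user, "-o", wide_format)),
  stdout = TRUE,
  wait = TRUE
)
out <- read.fwf(
  textConnection(text),
  widths = c(18, -1, 9, -1, 30, -1, 8, -1, 2, -1, 10, -1, 6, -1, 100),
  skip = 1,
  col.names = c("JOBID", "PARTITION", "NAME", "USER", "ST", "TIME", "NODES", "NODELIST_REASON"),
  strip.white = TRUE
)
tibble::as_tibble(out)
```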
That's a tough choice, and it's a shame that the more structured YAML-based output is so large. How large exactly, in terms of the size of the output and the execution time? I am concerned that subtle variations from cluster to cluster and odd things like spaces in job names could interfere with parsing the standard output.
On my cluster, […]
Yeah, monitor objects are just for interactive use. I think those performance metrics are not terrible as long as the documentation gives the user a heads up.
The yaml queue dump includes 111 fields for each job, some of which are themselves structured; e.g. one field is "job resources", which looks like this: […]
This code approximately recreates the default `squeue` columns, but from the `--yaml` output:

```r
library(purrr)  # for map() and %||%

user <- ps::ps_username()
monitor_cols <- c("job_id", "partition", "name", "user_name", "job_state",
                  "start_time", "node_count", "state_reason")
text <- system2(
  "squeue",
  args = "--yaml",
  stdout = TRUE,
  #stderr = if_any(private$.verbose, "", FALSE),
  wait = TRUE
)
yaml <- yaml::read_yaml(text = text)
out <- map(
  yaml$jobs,
  ~ tibble::new_tibble(
    c(
      map(.x[monitor_cols], ~ unlist(.x) %||% NA),
      list(nodes = paste(unlist(.x$job_resources$nodes), collapse = ",") %||% NA)
    )
  )
)
out <- do.call(vctrs::vec_rbind, out)
out <- out[out$user_name == user, ]
out$start_time <- as.POSIXct(out$start_time, origin = "1970-01-01")
out
```

```
# A tibble: 14 × 9
   job_id partition name user_name job_state start_time node_count
    <int> <chr> <chr> <chr> <chr> <dttm> <int>
 1 20386512 longrun R_Moth… guilbaul RUNNING 2024-02-09 09:05:33 1
 2 20386513 longrun R_Moth… guilbaul RUNNING 2024-02-09 09:05:33 1
 3 20386514 longrun R_Moth… guilbaul RUNNING 2024-02-09 09:05:33 1
 4 20386515 longrun R_Moth… guilbaul RUNNING 2024-02-09 09:05:33 1
 5 20386516 longrun R_Moth… guilbaul RUNNING 2024-02-09 09:05:33 1
 6 20386517 longrun R_Moth… guilbaul RUNNING 2024-02-09 09:05:33 1
 7 20386509 longrun R_Moth… guilbaul RUNNING 2024-02-09 09:05:33 1
 8 20446032 longrun R_Moth… guilbaul RUNNING 2024-02-14 09:27:25 1
 9 20446033 longrun R_Moth… guilbaul RUNNING 2024-02-14 09:27:25 1
10 20446034 longrun R_Moth… guilbaul RUNNING 2024-02-14 09:27:25 1
11 20446035 longrun R_Moth… guilbaul RUNNING 2024-02-14 09:27:25 1
12 20446036 longrun R_Moth… guilbaul RUNNING 2024-02-14 09:27:25 1
13 20446037 longrun R_Moth… guilbaul RUNNING 2024-02-14 09:27:25 1
14 20446004 longrun R_Moth… guilbaul RUNNING 2024-02-14 09:27:25 1
# ℹ 2 more variables: state_reason <chr>, nodes <chr>
```
Nice! Got time for a PR?
Sorry, I was pulled away from this thread by work. The YAML option looks like a much better approach than parsing the `squeue` table, but I think it requires an extra plugin and a minimum SLURM version. It would be worth adding a warning or something. See: "Why am I getting the following error: 'Unable to find plugin: serializer/json'?".
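A rough sketch of such a check (a hypothetical helper, not part of crew.cluster; it assumes a missing serializer plugin makes `squeue --yaml` exit with a nonzero status):

```r
# Hypothetical helper: returns TRUE if `squeue --yaml` works on this cluster,
# so a SLURM monitor could warn and fall back to the fixed-width table otherwise.
# Note: this runs a full queue query, so the result is worth caching.
squeue_supports_yaml <- function() {
  out <- suppressWarnings(
    system2("squeue", args = "--yaml", stdout = TRUE, stderr = TRUE)
  )
  status <- attr(out, "status")
  is.null(status) || identical(status, 0L)
}
```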
It looks like the LSF job output can similarly be parsed either using the fixed-width table or JSON (see the example below). This would add a dependency on `jsonlite`:

```r
user <- ps::ps_username()
text <- system2(
  "bjobs",
  args = c("-o 'user jobid job_name stat queue slots mem start_time run_time'", "-json"),
  stdout = TRUE,
  #stderr = if_any(private$.verbose, "", FALSE),
  wait = TRUE
)
json <- jsonlite::fromJSON(text)
out <- json$RECORDS
out
```

```
  USER    JOBID    JOB_NAME STAT QUEUE               SLOTS MEM         START_TIME   RUN_TIME
1 mglevin 25900189 bash     RUN  voltron_interactive 1     8 Mbytes    Feb 29 09:12 313 second(s)
2 mglevin 25900201 bash     RUN  voltron_interactive 1     2 Mbytes    Feb 29 09:17 22 second(s)
3 mglevin 25665912 rstudio  RUN  voltron_rstudio     2     87.9 Gbytes Feb 26 15:36 236482 second(s)
```
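As a possible follow-up (a sketch only, untested; it assumes `bjobs -u <user>` filters to one user, and the RUN_TIME formatting may differ by LSF version), the JSON records could be tidied into the same kind of tibble the other monitors return:

```r
# Sketch: filter to the current user via `bjobs -u` and convert RUN_TIME
# strings like "313 second(s)" into integer seconds.
user <- ps::ps_username()
text <- system2(
  "bjobs",
  args = c(
    "-u", user,
    "-o", shQuote("user jobid job_name stat queue slots mem start_time run_time"),
    "-json"
  ),
  stdout = TRUE,
  wait = TRUE
)
records <- jsonlite::fromJSON(text)$RECORDS
records$RUN_TIME <- as.integer(sub(" second\\(s\\)$", "", records$RUN_TIME))
tibble::as_tibble(records)
```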
Awesome! Would you be willing to open a PR?
just here to say hi, still early days for me with {crew} but I'm excited to learn, I have access to SLURM and PBS, and I'm reading along |
Proposal
crew.cluster 0.2.0 supports a new "monitor" class to help list and terminate SGE jobs from R instead of the command line. https://wlandau.github.io/crew.cluster/index.html#monitoring shows an example using `crew_monitor_sge()`. Currently only SGE is supported. I would like to add other monitor classes for other clusters, but I do not have access to SLURM, PBS, or LSF. cc'ing @nviets, @brendanf, and/or @mglev1n, in case you are interested.