use cases for partition + qos or queues #4306

Open
ryanday36 opened this issue Apr 27, 2022 · 1 comment

ryanday36 commented Apr 27, 2022

Here are the basic things that we do with the combination of partitions and qos in Slurm, and with just queues in LSF (a rough Slurm sketch of items 1–5 follows the list):

  1. set default limits on all user jobs that are trying to access a given set of nodes.
    E.g. all users with a current bank have access to a ‘pbatch’ partition + ‘normal’ qos (Slurm) or just a ‘pbatch’ queue (LSF), which is a set of nodes with certain limits on max job size, time limit, etc., as well as a baseline priority factor. They also have access to a ‘pdebug’ partition (Slurm) or queue (LSF), which is a different set of nodes with different limits.
  2. allow some users to ignore those limits.
    E.g. users / jobs may be given access to an ‘exempt’ qos (Slurm) which overrides the pbatch partition limits or an ‘exempt’ queue (LSF) which starts jobs on the same set of nodes as pbatch, but doesn’t have limits on job size, time limit, etc.
  3. give users a big priority boost.
    E.g. users / jobs may be given access to an ‘expedite’ qos (Slurm) or ‘expedite’ queue (LSF) which does the same thing as ‘exempt’, but also has a higher baseline priority factor than the ‘normal’ qos (Slurm) or ‘pbatch’ queue (LSF).
  4. restrict access to a given set of nodes.
    E.g. we often define a ‘pall’ partition (Slurm) or queue (LSF) that has all of the nodes and only give specific users access to it during a DAT.
  5. allow some user jobs to be pre-empted.
    E.g. expired banks may still be given access to the ‘pbatch’ queue + a ‘standby’ qos (Slurm) or just the ‘standby’ queue (LSF). Jobs in this queue / qos are not subject to the same limits as the ‘normal/exempt/expedite’ qos or queues, but they have a lower baseline priority factor and they can be pre-empted (cancelled) by jobs that are submitted to the other queues.
  6. (stretch goal, not something we currently do at LLNL) maintain a dynamic pool of nodes for interactive use.
    E.g. this is something that LANL is doing for debug/interactive jobs and is looking at using for CI jobs, but we haven’t implemented it here at LLNL. It actually uses a dynamic reservation in Slurm rather than queues or qos, but I could see it being implemented with queues. The idea is that there are a small number of nodes in a reservation that can only be used by short, small jobs. When a job starts on that reservation, idle nodes get added to the reservation (up to some maximum size) so that there are effectively always some idle nodes available for interactive / debug use.
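
To make items 1–5 concrete, here is a rough sketch of how they can be expressed on the Slurm side. All names, node ranges, and limit values below are made up for illustration; they are not our production settings.

    # slurm.conf (illustrative values only)
    PreemptType=preempt/qos
    PreemptMode=CANCEL

    # 1. default limits on pbatch/pdebug; 4. restricted 'pall' partition for DATs
    PartitionName=pbatch Nodes=node[0001-1000] Default=YES MaxTime=24:00:00 MaxNodes=256 PriorityTier=1 AllowQos=normal,exempt,expedite,standby
    PartitionName=pdebug Nodes=node[1001-1016] MaxTime=1:00:00 MaxNodes=4
    PartitionName=pall   Nodes=node[0001-1016] AllowGroups=dat-users Hidden=YES

    # qos definitions via sacctmgr
    sacctmgr add qos normal   priority=100
    sacctmgr add qos exempt   priority=100  flags=PartitionMaxNodes,PartitionTimeLimit   # 2. may exceed pbatch limits
    sacctmgr add qos expedite priority=5000 flags=PartitionMaxNodes,PartitionTimeLimit   # 3. exempt + priority boost
    sacctmgr add qos standby  priority=0    preemptmode=cancel                           # 5. preemptable (cancelled)
    sacctmgr modify qos normal,exempt,expedite set preempt=standby                       # these may cancel standby jobs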

Generally, users or user+bank combos are given access to specific queues / qos by administrators and can then submit their jobs directly to those queues / with that qos. Administrators can also change the queue / qos of a specific job even if the user doesn’t otherwise have access to that queue / qos.
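
As a rough illustration of that last point in Slurm terms (the user name, bank, and job id below are placeholders):

    # grant the 'expedite' qos to a particular user under a particular bank
    sacctmgr modify user where name=someuser account=somebank set qos+=expedite

    # an administrator retargets an already-submitted job, regardless of the user's own access
    scontrol update jobid=123456 qos=expedite partition=pbatch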

dongahn (Member) commented Apr 28, 2022

This is outstanding. I'm summarizing this in my multi-level queue scheduler architecture as we speak. Thanks @ryanday36!
