use cases for partition + qos or queues #4306

Open
ryanday36 opened this issue Apr 27, 2022 · 1 comment

ryanday36 commented Apr 27, 2022

Here are the basic things that we do with the combination of partitions and qos in Slurm, and with just queues in LSF (a rough Slurm sketch of items 1–5 follows the list):

  1. set default limits on all user jobs that are trying to access a given set of nodes.
    E.g. all users with a current bank have access to a ‘pbatch’ partition + ‘normal’ qos (Slurm) or just a ‘pbatch’ queue (LSF), which is a set of nodes with certain limits on max job size, time limit, etc., as well as a baseline priority factor. They also have access to a ‘pdebug’ partition (Slurm) or queue (LSF), which is a different set of nodes with different limits.
  2. allow some users to ignore those limits.
    E.g. users / jobs may be given access to an ‘exempt’ qos (Slurm) which overrides the pbatch partition limits or an ‘exempt’ queue (LSF) which starts jobs on the same set of nodes as pbatch, but doesn’t have limits on job size, time limit, etc.
  3. give users a big priority boost.
    E.g. users / jobs may be given access to an ‘expedite’ qos (Slurm) or ‘expedite’ queue (LSF) which does the same thing as ‘exempt’, but also has a higher baseline priority factor than the ‘normal’ qos (Slurm) or ‘pbatch’ queue (LSF).
  4. restrict access to a given set of nodes.
    E.g. we often define a ‘pall’ partition (Slurm) or queue (LSF) that has all of the nodes and only give specific users access to it during a DAT.
  5. allow some user jobs to be pre-empted.
    E.g. expired banks may still be given access to the ‘pbatch’ queue + a ‘standby’ qos (Slurm) or just the ‘standby’ queue (LSF). Jobs in this queue / qos are not subject to the same limits as the ‘normal/exempt/expedite’ qos or queues, but they have a lower baseline priority factor and they can be pre-empted (cancelled) by jobs that are submitted to the other queues.
  6. (stretch goal, not something we currently do at LLNL) maintain a dynamic pool of nodes for interactive use.
    E.g. this is something that LANL is doing for debug/interactive jobs and is looking at using for CI jobs, but we haven’t implemented it here at LLNL. It actually uses a dynamic reservation in Slurm rather than queues or qos, but I could see it being implemented with queues. The idea is that there are a small number of nodes in a reservation that can only be used by short, small jobs. When a job starts on that reservation, idle nodes get added to the reservation (up to some maximum size) so that there are effectively always some idle nodes available for interactive / debug use.
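
To make items 1–5 concrete, here is a rough sketch of how they can be expressed on the Slurm side. All names, node ranges, and limit values below are made up for illustration; they are not our production settings.

    # slurm.conf (illustrative values only)
    PreemptType=preempt/qos
    PreemptMode=CANCEL

    # 1. default limits on pbatch/pdebug; 4. restricted 'pall' partition for DATs
    PartitionName=pbatch Nodes=node[0001-1000] Default=YES MaxTime=24:00:00 MaxNodes=256 PriorityTier=1 AllowQos=normal,exempt,expedite,standby
    PartitionName=pdebug Nodes=node[1001-1016] MaxTime=1:00:00 MaxNodes=4
    PartitionName=pall   Nodes=node[0001-1016] AllowGroups=dat-users Hidden=YES

    # qos definitions via sacctmgr
    sacctmgr add qos normal   priority=100
    sacctmgr add qos exempt   priority=100  flags=PartitionMaxNodes,PartitionTimeLimit   # 2. may exceed pbatch limits
    sacctmgr add qos expedite priority=5000 flags=PartitionMaxNodes,PartitionTimeLimit   # 3. exempt + priority boost
    sacctmgr add qos standby  priority=0    preemptmode=cancel                           # 5. preemptable (cancelled)
    sacctmgr modify qos normal,exempt,expedite set preempt=standby                       # these may cancel standby jobs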

Generally, users or user+bank combos are given access to specific queues / qos by administrators and can then submit their jobs directly to those queues / with that qos. Administrators can also change the queue / qos of a specific job even if the user doesn’t otherwise have access to that queue / qos.
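
As a rough illustration of that last point in Slurm terms (the user name, bank, and job id below are placeholders):

    # grant the 'expedite' qos to a particular user under a particular bank
    sacctmgr modify user where name=someuser account=somebank set qos+=expedite

    # an administrator retargets an already-submitted job, regardless of the user's own access
    scontrol update jobid=123456 qos=expedite partition=pbatch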

dongahn (Member) commented Apr 28, 2022

This is outstanding. I'm summarizing this in my multi-level queue scheduler architecture as we speak. Thanks @ryanday36!
