diff --git a/doc/speed-manual.pdf b/doc/speed-manual.pdf index 8c0aaec..c25883d 100644 Binary files a/doc/speed-manual.pdf and b/doc/speed-manual.pdf differ diff --git a/doc/speed-manual.tex b/doc/speed-manual.tex index 7a89053..cc1a35e 100644 --- a/doc/speed-manual.tex +++ b/doc/speed-manual.tex @@ -610,6 +610,17 @@ \subsubsection{Anaconda} from the python distribution while \texttt{conda install} installs modules from anaconda's repository. +% ------------------------------------------------------------------------------ +\paragraph{Conda Env without -{}-prefix:} + +If you don't want to use the \texttt{-{}-prefix} option every time you create a new environment, and you don't want to use the default \texttt{\$HOME} location, +create a new directory and set the following variables to point to the newly created directory, e.g.: +\begin{verbatim} + setenv CONDA_ENVS_PATH /speed-scratch/$USER/condas + setenv CONDA_PKGS_DIRS /speed-scratch/$USER/condas/pkg +\end{verbatim} +If you want to make these changes permanent, add the variables to your \texttt{.tcshrc} or \texttt{.bashrc} (depending on the default shell you are using). + % ------------------------------------------------------------------------------ \subsubsection{Python} \label{sect:python-venv} diff --git a/doc/web/index.html b/doc/web/index.html index 50850f8..d5bd9c7 100644 --- a/doc/web/index.html +++ b/doc/web/index.html @@ -75,48 +75,48 @@

Contents


 2.10 SSH Keys For MPI
 2.11 Creating Virtual Environments
  2.11.1 Anaconda -
  2.11.2 Python -
 2.12 Example Job Script: Fluent -
 2.13 Example Job: efficientdet -
 2.14 Java Jobs -
 2.15 Scheduling On The GPU Nodes -
  2.15.1 P6 on Multi-GPU, Multi-Node -
  2.15.2 CUDA -
  2.15.3 Special Notes for sending CUDA jobs to the GPU Queue -
  2.15.4 OpenISS Examples -
 2.16 Singularity Containers -
3 Conclusion -
 3.1 Important Limitations -
 3.2 Tips/Tricks -
 3.3 Use Cases -
A History -
 A.1 Acknowledgments -
 A.2 Migration from UGE to SLURM -
 A.3 Phases -
  A.3.1 Phase 4 -
  A.3.2 Phase 3 -
  A.3.3 Phase 2 -
  A.3.4 Phase 1 -
B Frequently Asked Questions -
 B.1 Where do I learn about Linux? -
 B.2 How to use the “bash shell” on Speed? -
  B.2.1 How do I set bash as my login shell? -
  B.2.2 How do I move into a bash shell on Speed? -
  B.2.3 How do I use the bash shell in an interactive session on Speed? -
  B.2.4 How do I run scripts written in bash on Speed? -
 B.3 How to resolve “Disk quota exceeded” errors? -
  B.3.1 Probable Cause -
  B.3.2 Possible Solutions -
  B.3.3 Example of setting working directories for COMSOL -
  B.3.4 Example of setting working directories for Python Modules -
 B.4 How do I check my job’s status? - - - -
 B.5 Why is my job pending when nodes are empty? -
  B.5.1 Disabled nodes -
  B.5.2 Error in job submit request. -
C Sister Facilities +
  2.11.2 Python +
 2.12 Example Job Script: Fluent +
 2.13 Example Job: efficientdet +
 2.14 Java Jobs +
 2.15 Scheduling On The GPU Nodes +
  2.15.1 P6 on Multi-GPU, Multi-Node +
  2.15.2 CUDA +
  2.15.3 Special Notes for sending CUDA jobs to the GPU Queue +
  2.15.4 OpenISS Examples +
 2.16 Singularity Containers +
3 Conclusion +
 3.1 Important Limitations +
 3.2 Tips/Tricks +
 3.3 Use Cases +
A History +
 A.1 Acknowledgments +
 A.2 Migration from UGE to SLURM +
 A.3 Phases +
  A.3.1 Phase 4 +
  A.3.2 Phase 3 +
  A.3.3 Phase 2 +
  A.3.4 Phase 1 +
B Frequently Asked Questions +
 B.1 Where do I learn about Linux? +
 B.2 How to use the “bash shell” on Speed? +
  B.2.1 How do I set bash as my login shell? +
  B.2.2 How do I move into a bash shell on Speed? +
  B.2.3 How do I use the bash shell in an interactive session on Speed? +
  B.2.4 How do I run scripts written in bash on Speed? +
 B.3 How to resolve “Disk quota exceeded” errors? +
  B.3.1 Probable Cause +
  B.3.2 Possible Solutions +
  B.3.3 Example of setting working directories for COMSOL +
  B.3.4 Example of setting working directories for Python Modules +
 B.4 How do I check my job’s status? + + + +
 B.5 Why is my job pending when nodes are empty? +
  B.5.1 Disabled nodes +
  B.5.2 Error in job submit request. +
C Sister Facilities
Annotated Bibliography @@ -346,25 +346,26 @@


-#SBATCH --account=speed1 --mem=100M -t 600 -J job-name
-#SBATCH --gpus=2 --mail-type=ALL -t 600 --mail-user=YOUR_USERNAME
+#SBATCH --mem=100M -t 600 -J <job-name> -A <slurm account>
+#SBATCH -p pg --gpus=2 --mail-type=ALL
 

We use srun for every complex compute step inside the script. Use interactive jobs to set up virtual environments, compilation, and debugging. salloc is preferred, as it allows multiple steps. srun can start -interactive jobs as well (see Section 2.8). Required and common job parameters: job-name (J), -mail-type, mem, ntasks (n), cpus-per-task, account, -p (partition). -

+interactive jobs as well (see Section 2.8). Required and common job parameters: memory (mem), +time (t), job-name (J), slurm project account (A), partition (p), mail-type, ntasks (n), +cpus-per-task. +
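To make these concrete, here is a minimal sketch of a batch script that pulls the required parameters together (the job name, account placeholder, and ./my_program are illustrative, not prescribed by the manual):

     #!/encs/bin/bash
     #SBATCH -J minimal-example          ## hypothetical job name
     #SBATCH -A <slurm account>          ## your SLURM project account
     #SBATCH -p ps                       ## CPU partition
     #SBATCH -t 30 --mem=1G -n 1 -c 1    ## time (minutes), memory, tasks, cores
     #SBATCH --mail-type=ALL

     srun ./my_program                   ## hypothetical program; run heavy compute steps via srun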

2.1 Getting Started

-

Before getting started, please review the “What Speed is” (Section 1.4) and “What Speed is Not” +

Before getting started, please review the “What Speed is” (Section 1.4) and “What Speed is Not” (Section 1.5). Once your GCS ENCS account has been granted access to “Speed”, use your GCS ENCS account credentials to create an SSH connection to speed (an alias for speed-submit.encs.concordia.ca). All users are expected to have a basic understanding of Linux and its commonly used commands (see Appendix B.1 for resources). -

+

2.1.1 SSH Connections
-

Requirements to create connections to Speed: +

Requirements to create connections to Speed:

  1. An active GCS ENCS user account, which has permission to connect to Speed (see Section 1.7). @@ -374,11 +375,11 @@
  2. Windows systems require a terminal emulator such as PuTTY, Cygwin, or MobaXterm.
  3. -
  4. macOS systems do have a Terminal app for this or xterm that comes with XQuarz.
-

Open up a terminal window and type in the following SSH command being sure to replace +

  • macOS systems do have a Terminal app for this or xterm that comes with XQuartz.
  • +

    Open up a terminal window and type in the following SSH command, being sure to replace <ENCSusername> with your ENCS account’s username. @@ -387,8 +388,8 @@

     ssh <ENCSusername>@speed.encs.concordia.ca
     
    -

    -

    Read the AITS FAQ: How do I securely connect to a GCS server? +

    +

    Read the AITS FAQ: How do I securely connect to a GCS server?

    2.1.2 Environment Set Up
    @@ -413,10 +414,10 @@
    2.

    Note: If a “command not found” error appears after you log in to speed, your user account may have defunct Grid Engine environment commands. See Appendix A.2 to learn how to prevent this error on login. -

    +

    2.2 Job Submission Basics

    -

    Preparing your job for submission is fairly straightforward. Start by basing your job script on one of the +

    Preparing your job for submission is fairly straightforward. Start by basing your job script on one of the examples available in the src/ directory of our GitHub’s (https://github.com/NAG-DevOps/speed-hpc). Job scripts are broken into four main sections:

    @@ -426,7 +427,7 @@

    2.
  • Module Loads
  • User Scripting
  • -

    You can clone the tip of our repository to get the examples to start with or download them +

    You can clone the tip of our repository to get the examples to start with or download them individually via a browser or command line: @@ -436,8 +437,8 @@

    2. git clone --depth=1 https://github.com/NAG-DevOps/speed-hpc.git cd speed-hpc/src -

    -

    Then to quickly run some sample jobs, you can: +

    +

    Then to quickly run some sample jobs, you can: @@ -448,7 +449,7 @@

    2. sbatch -p ps -t 10 manual.sh sbatch -p pg -t 10 lambdal-singularity.sh -

    +

    2.2.1 Directives
    @@ -466,36 +467,30 @@
    2.2.1 #SBATCH --job-name=<jobname>        ## or -J. Give the job a name #SBATCH --mail-type=<type>          ## Set type of email notifications -#SBATCH --mail-user=<YOUR_USERNAME>@encs.concordia.ca #SBATCH --chdir=<directory>         ## or -D, Set working directory where output files will go #SBATCH --nodes=1                   ## or -N, Node count required for the job #SBATCH --ntasks=1                  ## or -n, Number of tasks to be launched #SBATCH --cpus-per-task=<corecount> ## or -c, Core count requested, e.g. 8 cores #SBATCH --mem=<memory>              ## Assign memory for this job, e.g., 32G memory per node -

    -

    Replace the following to adjust the job script for your project(s) +

    +

    Replace the following to adjust the job script for your project(s)

    1. <jobname> with a job name for the job
    2. -
    3. <YOUR_USERNAME> with your GCS username -
    4. -
    5. <directory> with the fullpath to your job’s working directory, e.g., where your code, +
    6. <directory> with the fullpath to your job’s working directory, e.g., where your code, source files and where the standard output files will be written to. By default, --chdir sets the current directory as the job’s working directory
    7. -
    8. <type> with the type of e-mail notifications you wish to receive. Valid options are: NONE, +
    9. <type> with the type of e-mail notifications you wish to receive. Valid options are: NONE, BEGIN, END, FAIL, REQUEUE, ALL
    10. -
    11. <corecount> with the degree of multithreaded parallelism (i.e., cores) allocated to your +
    12. <corecount> with the degree of multithreaded parallelism (i.e., cores) allocated to your job. Up to 32 by default.
    13. -
    14. <memory> with the amount of memory, in GB, that you want to be allocated per node. Up +
    15. <memory> with the amount of memory, in GB, that you want to be allocated per node. Up to 500 depending on the node. NOTE: All jobs MUST set a value for the --mem option.
    - - - -

    Example with short option equivalents: +

    Example with short option equivalents: @@ -503,15 +498,14 @@

#SBATCH -J tmpdir                   ## Job’s name set to ’tmpdir’ #SBATCH --mail-type=ALL             ## Receive all email type notifications -#SBATCH --mail-user=a_user@encs.concordia.ca #SBATCH -D ./                       ## Use current directory as working directory #SBATCH -N 1                        ## Node count required for the job #SBATCH -n 1                        ## Number of tasks to be launched #SBATCH -c 1                        ## Request 1 core #SBATCH --mem=32G                   ## Allocate 32G memory per node -

    -

    If you are unsure about memory footprints, err on assigning a generous memory space to +

    +

    If you are unsure about memory footprints, err on assigning a generous memory space to your job, so that it does not get prematurely terminated. You can refine --mem values for future jobs by monitoring the size of a job’s active memory space on speed-submit with: @@ -523,8 +517,8 @@

    2.2.1 -

    -

    This can be customized to show specific columns: +

    +

    This can be customized to show specific columns: @@ -533,11 +527,11 @@

    2.2.1 -

    -

    Memory-footprint values are also provided for completed jobs in the final e-mail notification as +

    +

    Memory-footprint values are also provided for completed jobs in the final e-mail notification as “maxvmsize”. Jobs that request a low-memory footprint are more likely to load on a busy cluster. -

    Other essential options are --time, or -t, and --account, or -A.

    +

    Other essential options are --time, or -t, and --account, or -A.

    • --time=<time> – is the estimate of wall clock time required for your job to run. As previously mentioned, the maximum is 7 days for batch and 24 hours for interactive jobs. @@ -550,13 +544,13 @@
      2.2.1 aits, vidpro, gipsy, ai2, mpackir, cmos, among others.
    -

    +

    2.2.2 Module Loads
    -

    As your job will run on a compute or GPU “Speed” node, and not the submit node, any software that +

    As your job will run on a compute or GPU “Speed” node, and not the submit node, any software that is needed must be loaded by the job script. Software is loaded within the script just as it would be from the command line. -

    To see a list of which modules are available, execute the following from the command line on +

    To see a list of which modules are available, execute the following from the command line on speed-submit. @@ -565,8 +559,8 @@

     module avail
     
    -

    -

    To list for a particular program (matlab, for example): +

    +

    To list for a particular program (matlab, for example): @@ -574,8 +568,8 @@

     module -t avail matlab
     
    -

    -

    Which, of course, can be shortened to match all that start with a particular letter: +

    +

    Which, of course, can be shortened to match all that start with a particular letter: @@ -583,8 +577,8 @@

     module -t avail m
     
    -

    -

    Insert the following in your script to load the matlab/R2020a) module: +

    +

    Insert the following in your script to load the matlab/R2020a module: @@ -592,9 +586,9 @@

     module load matlab/R2020a/default
     
    -

    -

    Use, unload, in place of, load, to remove a module from active use. -

    To list loaded modules: +

    +

    Use, unload, in place of, load, to remove a module from active use. +

    To list loaded modules: @@ -602,8 +596,8 @@

     module list
     
    -

    -

    To purge all software in your working environment: +

    +

    To purge all software in your working environment: @@ -611,8 +605,8 @@

     module purge
     
    -

    -

    Typically, only the module load command will be used in your script. +

    +

    Typically, only the module load command will be used in your script.

    2.2.3 User Scripting
    @@ -787,19 +781,17 @@

seff [job-ID]: reports on the efficiency of a job’s cpu and memory utilization. Don’t execute it on RUNNING jobs (only on completed/finished jobs), as the efficiency statistics may be misleading. -

    If you define the following directives in your batch script, you will receive seff output in your - email when your job is finished. +

    If you define the following directive in your batch script, your ENCS email address will receive + an email with seff output when your job is finished.

          #SBATCH --mail-type=ALL
    -     #SBATCH --mail-user=USER_NAME@encs.concordia.ca
    -     ## Replace USER_NAME with your encs username.
     
    -

    -

    Output example: +

    +

    Output example: @@ -817,13 +809,13 @@

    -

    +

    -

    +

    2.5 Advanced sbatch Options

    -

    In addition to the basic sbatch options presented earlier, there are a few additional options that are +

    In addition to the basic sbatch options presented earlier, there are a few additional options that are generally useful:

      @@ -848,19 +840,19 @@

    • --depend=[state:job-ID]: run this job only when job [job-ID] finishes. Held jobs appear in the queue.
    -

    The many sbatch options available are read with, man sbatch. Also note that sbatch options can +

    The many sbatch options available are read with, man sbatch. Also note that sbatch options can be specified during the job-submission command, and these override existing script options (if present). The syntax is, sbatch [options] PATHTOSCRIPT, but unlike in the script, the options are specified without the leading #SBATCH (e.g., sbatch -J sub-test --chdir=./ --mem=1G ./tcsh.sh). -

    +

    2.6 Array Jobs

    -

    Array jobs are those that start a batch job or a parallel job multiple times. Each iteration of the job +

    Array jobs are those that start a batch job or a parallel job multiple times. Each iteration of the job array is called a task and receives a unique job ID. Array jobs are only supported for batch jobs; submit time is \(< 1\) second, compared to repeatedly submitting the same regular job over and over, even from a script. -

    To submit an array job, use the --array option of the sbatch command as follows: +

    To submit an array job, use the --array option of the sbatch command as follows: @@ -868,15 +860,15 @@

 sbatch --array=n-m[:s] <batch_script>
     
    -

    -

    -t Option Syntax:

    +

    +

    --array Option Syntax:

    • n: indicates the start-id.
    • m: indicates the max-id.
    • s: indicates the step size.
    -

    Examples:

    +

    Examples:

    • sbatch --array=1-50000 -N1 -i my_in_%a -o my_out_%a array.sh: submits a job with 50000 elements, %a maps to the task-id between 1 and 50K. @@ -888,20 +880,20 @@


    • sbatch --array=3-15:3 array.sh: submits a job with 5 tasks numbered consecutively with step size 3 (task-ids 3,6,9,12,15).
    -

    Output files for Array Jobs: -

    The default and output and error-files are slurm-job_id_task_id.out. This means that Speed +

    Output files for Array Jobs: +

    The default output and error files are slurm-job_id_task_id.out. This means that Speed creates an output and an error file for each task generated by the array job, as well as one for the super-ordinate array job. To alter this behavior, use the -o and -e options of sbatch. -

    For more details about Array Job options, please review the manual pages for sbatch by executing +

    For more details about Array Job options, please review the manual pages for sbatch by executing the following at the command line on speed-submit man sbatch. -

    +

    2.7 Requesting Multiple Cores (i.e., Multithreading Jobs)

    -

    For jobs that can take advantage of multiple machine cores, up to 32 cores (per job) can be requested +

    For jobs that can take advantage of multiple machine cores, up to 32 cores (per job) can be requested in your script with: @@ -910,8 +902,8 @@

    #SBATCH -n [#cores for processes] -

    -

    or +

    +

    or @@ -920,19 +912,19 @@

    -

    -

    Both sbatch and salloc support -n on the command line, and it should always be used either in +

    +

    Both sbatch and salloc support -n on the command line, and it should always be used either in the script or on the command line as the default \(n=1\). Do not request more cores than you think will be useful, as larger-core jobs are more difficult to schedule. On the flip side, though, if you are going to be running a program that scales out to the maximum single-machine core count available, please (please) request 32 cores, to avoid node oversubscription (i.e., to avoid overloading the CPUs). -

    Important note about --ntasks or --ntasks-per-node (-n) talks about processes (usually the +

    Important note: --ntasks or --ntasks-per-node (-n) refers to processes (usually the ones run with srun), while --cpus-per-task (-c) corresponds to threads per process. Some programs consider them equivalent, some don’t. Fluent, for example, uses --ntasks-per-node=8 and --cpus-per-task=1, while others just set --cpus-per-task=8 and --ntasks-per-node=1. If one of them is not \(1\), then some applications need to be told to use \(n*c\) total cores. -
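    For instance, a hedged sketch of such a hybrid request (the application name and its thread flag are hypothetical):

     #SBATCH --ntasks-per-node=4    ## 4 processes
     #SBATCH --cpus-per-task=8      ## 8 threads per process, so n*c = 32 cores in total

     srun ./hybrid_app --threads=$SLURM_CPUS_PER_TASK    ## hypothetical program that must be told its thread count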

    Core count associated with a job appears under, “AllocCPUS”, in the, qacct -j, output. +

    Core count associated with a job appears under, “AllocCPUS”, in the, sacct -j, output. @@ -957,17 +949,17 @@

    -

    -

    +

    +

    2.8 Interactive Jobs

    -

    Job sessions can be interactive, instead of batch (script) based. Such sessions can be useful for testing, +

    Job sessions can be interactive, instead of batch (script) based. Such sessions can be useful for testing, debugging, and optimising code and resource requirements, conda or python virtual environments setup, or any likewise preparatory work prior to batch submission. -

    +

    2.8.1 Command Line
    -

    To request an interactive job session, use, salloc [options], similarly to a sbatch command-line +

    To request an interactive job session, use, salloc [options], similarly to a sbatch command-line job, e.g., @@ -976,11 +968,11 @@

    2.8.1
     salloc -J interactive-test --mem=1G -p ps -n 8
     
    -

    Inside the allocated salloc session you can run shell commands as usual; it is recommended to use +

    Inside the allocated salloc session you can run shell commands as usual; it is recommended to use srun for the heavy compute steps inside salloc. If it is a quick, short job, just to compile something on a GPU node, e.g., you can use an interactive srun directly (note that no srun can run within srun), e.g., a 1 hour allocation: -

    For tcsh: +

    For tcsh: @@ -988,8 +980,8 @@

    2.8.1
     srun --pty -n 8 -p pg --gpus=1 --mem=1Gb -t 60 /encs/bin/tcsh
     
    -

    -

    For bash: +

    +

    For bash: @@ -997,18 +989,18 @@

    2.8.1
     srun --pty -n 8 -p pg --gpus=1 --mem=1Gb -t 60 /encs/bin/bash
     
    -

    -

    +

    +

    2.8.2 Graphical Applications
    -

    If you need to run an on-Speed graphical-based UI application (e.g., MALTLAB, Abaqus CME, etc.), +

    If you need to run an on-Speed graphical UI application (e.g., MATLAB, Abaqus CME, etc.), or an IDE (PyCharm, VSCode, Eclipse) to develop and test your job’s code interactively, you need to enable X11 forwarding from your client machine to Speed, then to the compute node. To do so: -

    +

    1. -

      you need to run an X server on your client machine, such as,

      +

      you need to run an X server on your client machine, such as,

      • on Windows: MobaXterm with X turned on, or Xming + PuTTY with X11 forwarding, or XOrg under Cygwin @@ -1016,17 +1008,17 @@
        on macOS: XQuartz – use its xterm and ssh -X
      • on Linux just use ssh -X speed.encs.concordia.ca
      -

      See https://www.concordia.ca/ginacody/aits/support/faq/xserver.html for +

      See https://www.concordia.ca/ginacody/aits/support/faq/xserver.html for details.

    2. -

      verify your X connection was properly forwarded by printing the DISPLAY variable: -

      echo $DISPLAY If it has no output, then your X forwarding is not on and you may need to +

      verify your X connection was properly forwarded by printing the DISPLAY variable: +

      echo $DISPLAY If it has no output, then your X forwarding is not on and you may need to re-login to Speed.

    3. -

      Use the --x11 with salloc or srun: -

      salloc ... --x11=first ... +

      Use the --x11 with salloc or srun: +

      salloc ... --x11=first ... @@ -1034,7 +1026,7 @@

      Once landed on a compute node, verify DISPLAY again.
    4. -

      While running under scheduler, create a run-user directory and set the variable +

      While running under scheduler, create a run-user directory and set the variable XDG_RUNTIME_DIR. @@ -1044,15 +1036,15 @@

      +

    5. -

      Launch your graphical application: -

      module load the required version, then matlab, or abaqus cme, etc.

    -

    Here’s an example of starting PyCharm (see Figure 4), of which we made a sample local +

    Launch your graphical application: +

    module load the required version, then matlab, or abaqus cme, etc.
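    For step 4 above, a sketch of the run-user directory setup (tcsh syntax; placing the directory under your speed-scratch space is an assumption, consistent with the VSCode example in Section 2.8.6):

     mkdir -p /speed-scratch/$USER/run-user
     setenv XDG_RUNTIME_DIR /speed-scratch/$USER/run-user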

    +

    Here’s an example of starting PyCharm (see Figure 4), of which we made a sample local installation. You can make a similar install under your own directory. If using VSCode, it’s currently only supported with the --no-sandbox option.
    -

    BASH version: +

    BASH version: @@ -1070,8 +1062,8 @@

    -

    TCSH version: +

    +

    TCSH version: @@ -1091,7 +1083,7 @@

    +

    @@ -1102,7 +1094,7 @@

    Figure 4: PyCharm Starting up on a Speed Node
    @@ -1110,17 +1102,17 @@
    2.8.3 Jupyter Notebooks in Singularity
    -

    This is an example of running Jupyter notebooks together with Singularity (more on Singularity see +

    This is an example of running Jupyter notebooks together with Singularity (more on Singularity in Section 2.16). Here we are using one of the OpenISS-derived containers (see Section 2.15.4 as well). -

    +

    1. Use the --x11 with salloc or srun as described in the above example
    2. Load Singularity module module load singularity/3.10.4/default
    3. -

      Execute this Singularity command on a single line. It’s best to save it in a shell script that you +

      Execute this Singularity command on a single line. It’s best to save it in a shell script that you could call, since it’s long. @@ -1132,10 +1124,10 @@

      -

      +

    4. -

      Create an ssh tunnel between your computer and the node (speed-XX) where Jupyter is +

      Create an ssh tunnel between your computer and the node (speed-XX) where Jupyter is running (Using speed-submit as a “jump server”) (Preferably: PuTTY, see Figure 5 and Figure 6) @@ -1145,10 +1137,10 @@

      ssh -L 8888:speed-XX:8888 YOUR_USER@speed-submit.encs.concordia.ca -

      Don’t close the tunnel. +

      Don’t close the tunnel.

    5. -

      Open a browser, and copy your Jupyter’s token, in the screenshot example in Figure 7; each +

      Open a browser, and copy your Jupyter token (see the screenshot example in Figure 7); each time the token will be different, as it is printed to you in the terminal. @@ -1157,7 +1149,7 @@

      http://localhost:8888/?token=5a52e6c0c7dfc111008a803e5303371ed0462d3d547ac3fb -

      +

    6. Work with your notebook.
    @@ -1207,11 +1199,11 @@
    2.8.4 Jupyter Labs in Conda and Pytorch
    -

    This is an example of Jupyter Labs running in a Conda environment, with Pytorch +

    This is an example of Jupyter Labs running in a Conda environment, with Pytorch

    • -

      Environment preparation: for the FIRST time: +

      Environment preparation: for the FIRST time:

      1. Go to your speed-scratch directory: cd /speed-scratch/$USER
      2. @@ -1222,7 +1214,7 @@
        Open an Interactive session: salloc --mem=50G --gpus=1 -ppg (or -ppt)
      3. -

        Set env. variables, conda environment, jupyter+pytorch installation +

        Set env. variables, conda environment, jupyter+pytorch installation @@ -1238,13 +1230,13 @@

        -

      +

    • -

      Running Jupyter Labs, from speed-submit: +

      Running Jupyter Labs, from speed-submit:

      1. -

        Open an Interactive session: salloc --mem=50G --gpus=1 -ppg (or -ppt) +

        Open an Interactive session: salloc --mem=50G --gpus=1 -ppg (or -ppt) @@ -1258,7 +1250,7 @@

        -

        +

      2. Verify which port the system has assigned to Jupyter: http://localhost:XXXX/lab?token=
      3. @@ -1266,15 +1258,15 @@
      4. Open a browser and type: localhost:XXXX (port assigned)
    -

    +

    2.8.5 Jupyter Labs + Pytorch in Python venv
    -

    This is an example of Jupyter Labs running in a Python Virtual environment (venv), with +

    This is an example of Jupyter Labs running in a Python Virtual environment (venv), with Pytorch

    • -

      Environment preparation: for the FIRST time: +

      Environment preparation: for the FIRST time:

      1. Go to your speed-scratch directory: cd /speed-scratch/$USER
      2. @@ -1284,7 +1276,7 @@
      3. -

        Create Python venv and install jupyterlab+pytorch +

        Create Python venv and install jupyterlab+pytorch @@ -1300,13 +1292,13 @@

        -

      +

    • -

      Running Jupyter Labs, from speed-submit: +

      Running Jupyter Labs, from speed-submit:

      1. -

        Open an Interactive session: salloc --mem=50G --gpus=1 --constraint=el9 +

        Open an Interactive session: salloc --mem=50G --gpus=1 --constraint=el9 @@ -1318,7 +1310,7 @@

        -

        +

      2. Verify which port the system has assigned to Jupyter: http://localhost:XXXX/lab?token=
      3. @@ -1326,16 +1318,16 @@
      4. Open a browser and type: localhost:XXXX (port assigned)
    -

    +

    2.8.6 VScode
    -

    This is an example of running VScode, it’s similar to Jupyter notebooks, but it doesn’t use containers. +

    This is an example of running VSCode; it’s similar to Jupyter notebooks, but it doesn’t use containers. This is the Web version; a local (workstation) to remote (speed node) version exists too, but it is for advanced users (no support, execute it at your own risk).

    • -

      Environment preparation: for the FIRST time: +

      Environment preparation: for the FIRST time:

      1. Go to your speed-scratch directory: cd /speed-scratch/$USER
      2. @@ -1351,7 +1343,7 @@
        2.8.6 Create this directory: mkdir -p /speed-scratch/$USER/run-user
    • -

      Running VScode +

      Running VScode

      1. Go to your vscode directory: cd /speed-scratch/$USER/vscode
      2. @@ -1361,7 +1353,7 @@
        2.8.6 $USER/run-user
      3. -

        Run VScode, change the port if needed. +

        Run VScode, change the port if needed. @@ -1370,14 +1362,14 @@

        2.8.6 -

        +

      4. SSH Tunnel creation: similar to Jupyter, see Section 2.8.3
      5. Open a browser and type: localhost:8080
      6. -

        If the browser asks for password: +

        If the browser asks for password: @@ -1385,7 +1377,7 @@

        2.8.6 cat /speed-scratch/$USER/vscode/home/.config/code-server/config.yaml -

        +

    @@ -1405,7 +1397,7 @@
    2.8.6

    2.9 Scheduler Environment Variables

    -

    The scheduler presents a number of environment variables that can be used in your jobs. You can +

    The scheduler presents a number of environment variables that can be used in your jobs. You can invoke env or printenv in your job to see what those are (most begin with the prefix SLURM). Some of the more useful ones are:

    @@ -1425,49 +1417,48 @@

    $SLURM_ARRAY_TASK_ID=for array jobs (see Section 2.6).
  • -

    See a more complete list here: +

    See a more complete list here:

  • -

    In Figure 9 is a sample script, using some of these. +

    In Figure 9 is a sample script, using some of these.

    - + -
    #!/encs/bin/tcsh 
    +
    #!/encs/bin/tcsh 
      
     #SBATCH --job-name=tmpdir      ## Give the job a name 
     #SBATCH --mail-type=ALL        ## Receive all email type notifications 
    -#SBATCH --mail-user=YOUR_USER_NAME@encs.concordia.ca 
    -#SBATCH --chdir=./             ## Use currect directory as working directory 
    -#SBATCH --nodes=1 
    -#SBATCH --ntasks=1 
    -#SBATCH --cpus-per-task=8      ## Request 8 cores 
    -#SBATCH --mem=32G              ## Assign 32G memory per node 
    - 
    -cd $TMPDIR 
    -mkdir input 
    -rsync -av $SLURM_SUBMIT_DIR/references/ input/ 
    -mkdir results 
    -srun STAR --inFiles $TMPDIR/input --parallel $SRUN_CPUS_PER_TASK --outFiles $TMPDIR/results 
    -rsync -av $TMPDIR/results/ $SLURM_SUBMIT_DIR/processed/
+#SBATCH --chdir=./             ## Use current directory as working directory 
    +#SBATCH --nodes=1 
    +#SBATCH --ntasks=1 
    +#SBATCH --cpus-per-task=8      ## Request 8 cores 
    +#SBATCH --mem=32G              ## Assign 32G memory per node 
    + 
    +cd $TMPDIR 
    +mkdir input 
    +rsync -av $SLURM_SUBMIT_DIR/references/ input/ 
    +mkdir results 
    +srun STAR --inFiles $TMPDIR/input --parallel $SRUN_CPUS_PER_TASK --outFiles $TMPDIR/results 
    +rsync -av $TMPDIR/results/ $SLURM_SUBMIT_DIR/processed/
     
    -
    Figure 9: Source code for tmpdir.sh
    +
    Figure 9: Source code for tmpdir.sh

    2.10 SSH Keys For MPI

    -

    Some programs effect their parallel processing via MPI (which is a communication protocol). An +

    Some programs effect their parallel processing via MPI (which is a communication protocol). An example of such software is Fluent. MPI needs to have ‘passwordless login’ set up, which means SSH keys. In your NFS-mounted home directory:

    @@ -1482,19 +1473,19 @@

  • Set file permissions of authorized_keys to 600; of your NFS-mounted home to 700 (note that you likely will not have to do anything here, as most people will have those permissions by default).
  • -
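    A sketch of the typical key setup (the ed25519 key type and the empty passphrase, needed for passwordless login, are assumptions):

     ssh-keygen -t ed25519                                ## accept the default location; leave the passphrase empty
     cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
     chmod 600 ~/.ssh/authorized_keys
     chmod 700 ~                                          ## usually already set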

    +

    2.11 Creating Virtual Environments

    -

    The following documentation is specific to the Speed HPC Facility at the Gina Cody School of +

    The following documentation is specific to the Speed HPC Facility at the Gina Cody School of Engineering and Computer Science. Virtual environments are typically instantiated via Conda or Python. Another option is Singularity, detailed in Section 2.16. Usually, virtual environments are created once during an interactive session before submitting a batch job to the scheduler. The job script submitted to the scheduler is then written to (1) activate the virtual environment, (2) use it, and (3) close it at the end of the job. -

    +

    2.11.1 Anaconda
    -

    Request an interactive session in the queue you wish to submit your jobs to (e.g., salloc -p pg +

    Request an interactive session in the queue you wish to submit your jobs to (e.g., salloc -p pg --gpus=1 for GPU jobs). Once your interactive session has started, create an anaconda environment in your speed-scratch directory by using the prefix option when executing conda create. For example, @@ -1510,11 +1501,11 @@

    2.11.1 -

    -

    Note: Without the prefix option, the conda create command creates the environment in a_user’s +

    +

    Note: Without the prefix option, the conda create command creates the environment in a_user’s home directory by default.

    -

    List Environments. +

    List Environments. To view your conda environments, type: conda info --envs @@ -1526,9 +1517,9 @@

    2.11.1 -

    +

    -

    Activate an Environment. +

    Activate an Environment. Activate the environment speedscratcha_usermyconda as follows @@ -1537,7 +1528,7 @@

    2.11.1 conda activate /speed-scratch/a_user/myconda -

    After activating your environment, add pip to your environment by using +

    After activating your environment, add pip to your environment by using @@ -1545,11 +1536,11 @@

    2.11.1 conda install pip -

    This will install pip and pip’s dependencies, including python, into the environment. +

    This will install pip and pip’s dependencies, including python, into the environment.

    • -

      A consolidated example using Conda: +

      A consolidated example using Conda: @@ -1567,26 +1558,41 @@

      2.11.1 -

      +

    • No Space left error: Read our Github HERE
    -

    Important Note: pip (and pip3) are used to install modules from the python distribution while +

    Important Note: pip (and pip3) are used to install modules from the python distribution while conda install installs modules from anaconda’s repository. -

    -
    2.11.2 Python
    -

    Setting up a Python virtual environment is fairly straightforward. The first step is to request an +

    Conda Env without –prefix: + If you don’t want to use the prefix option every time you create a new environment, and you +don’t want to use the default $HOME location, create a new directory and set the following variables to point to +the newly created directory, e.g.: + + + +

    +
    +setenv CONDA_ENVS_PATH /speed-scratch/$USER/condas
    +setenv CONDA_PKGS_DIRS /speed-scratch/$USER/condas/pkg
    +
    +

    If you want to make these changes permanent, add the variables to your .tcshrc or .bashrc +(depending on the default shell you are using). +
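    The setenv lines above are tcsh syntax; for bash, the equivalent lines to add to .bashrc would be (a sketch):

     export CONDA_ENVS_PATH=/speed-scratch/$USER/condas
     export CONDA_PKGS_DIRS=/speed-scratch/$USER/condas/pkg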

    +

    +
    2.11.2 Python
    +

    Setting up a Python virtual environment is fairly straightforward. The first step is to request an interactive session in the queue you wish to submit your jobs to. -

    We have a simple example that use a Python virtual environment: +

    We have a simple example that uses a Python virtual environment:

    • -

      Using Python Venv +

      Using Python Venv

      -
      +     
            salloc -p pg --gpus=1 --mem=10GB -A <slurm account name>
            cd /speed-scratch/$USER
            module load python/3.9.1/default
      @@ -1599,58 +1605,57 @@ 
      2.11.2 -

      +

    • See, e.g., gurobi-with-python.sh
    -

    Important Note: partition ps is used for CPU jobs, partitions pg, pt are used for GPU jobs, no +

    Important Note: partition ps is used for CPU jobs, while partitions pg and pt are used for GPU jobs; there is no need to use --gpus= when preparing environments for CPU jobs.

    -

    2.12 Example Job Script: Fluent

    +

    2.12 Example Job Script: Fluent

    - + -
    #!/encs/bin/tcsh 
    - 
    -#SBATCH --job-name=flu10000    ## Give the job a name 
    -#SBATCH --mail-type=ALL        ## Receive all email type notifications 
    -#SBATCH --mail-user=YOUR_USER_NAME@encs.concordia.ca 
    -#SBATCH --chdir=./             ## Use currect directory as working directory 
    -#SBATCH --nodes=1              ## Number of nodes to run on 
    -#SBATCH --ntasks-per-node=32   ## Number of cores 
    -#SBATCH --cpus-per-task=1      ## Number of MPI threads 
    -#SBATCH --mem=160G             ## Assign 160G memory per node 
    - 
    -date 
    - 
    -module avail ansys 
    - 
    -module load ansys/19.2/default 
    -cd $TMPDIR 
    - 
    -set FLUENTNODES = "‘scontrol␣show␣hostnames‘" 
    -set FLUENTNODES = ‘echo $FLUENTNODES | tr ’ ’ ’,’‘ 
    - 
    -date 
    - 
    -srun fluent 3ddp \ 
    -        -g -t$SLURM_NTASKS \ 
    -        -g-cnf=$FLUENTNODES \ 
    -        -i $SLURM_SUBMIT_DIR/fluentdata/info.jou > call.txt 
    - 
    -date 
    - 
    -srun rsync -av $TMPDIR/ $SLURM_SUBMIT_DIR/fluentparallel/ 
    - 
    -date
    +
    #!/encs/bin/tcsh 
    + 
    +#SBATCH --job-name=flu10000    ## Give the job a name 
    +#SBATCH --mail-type=ALL        ## Receive all email type notifications 
+#SBATCH --chdir=./             ## Use current directory as working directory 
    +#SBATCH --nodes=1              ## Number of nodes to run on 
    +#SBATCH --ntasks-per-node=32   ## Number of cores 
    +#SBATCH --cpus-per-task=1      ## Number of MPI threads 
    +#SBATCH --mem=160G             ## Assign 160G memory per node 
    + 
    +date 
    + 
    +module avail ansys 
    + 
    +module load ansys/19.2/default 
    +cd $TMPDIR 
    + 
+set FLUENTNODES = "`scontrol show hostnames`" 
+set FLUENTNODES = `echo $FLUENTNODES | tr ' ' ','` 
    + 
    +date 
    + 
    +srun fluent 3ddp \ 
    +        -g -t$SLURM_NTASKS \ 
    +        -g-cnf=$FLUENTNODES \ 
    +        -i $SLURM_SUBMIT_DIR/fluentdata/info.jou > call.txt 
    + 
    +date 
    + 
    +srun rsync -av $TMPDIR/ $SLURM_SUBMIT_DIR/fluentparallel/ 
    + 
    +date
     
    -
    Figure 10: Source code for fluent.sh
    +
    Figure 10: Source code for fluent.sh
    @@ -1665,7 +1670,7 @@

    Caveat: take care with journal-file paths.

    -

    2.13 Example Job: efficientdet

    +

    2.13 Example Job: efficientdet

    The following steps, describing how to create an efficientdet environment on Speed, were submitted by a member of Dr. Amer’s research group.

    @@ -1686,7 +1691,7 @@

    +
     pip install tensorflow==2.7.0
     pip install lxml>=4.6.1
     pip install absl-py>=0.10.0
    @@ -1705,7 +1710,7 @@ 

    -

    2.14 Java Jobs

    +

    2.14 Java Jobs

    Jobs that call java have a memory overhead, which needs to be taken into account when assigning a value to --mem. Even the most basic java call, java -Xmx1G -version, will need to have, --mem=5G, with the 4-GB difference representing the memory overhead. Note that this memory @@ -1714,7 +1719,7 @@

    2.14 314G.
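    A hedged sketch of the corresponding job-script fragment (the module name and application jar are hypothetical; check module avail for the installed java versions):

     #SBATCH --mem=5G                   ## 1G heap plus roughly 4G of JVM overhead
     module load java                   ## load the required java module/version
     srun java -Xmx1G -jar myapp.jar    ## hypothetical application jar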

    -

    2.15 Scheduling On The GPU Nodes

    +

    2.15 Scheduling On The GPU Nodes

    The primary cluster has two GPU nodes, each with six Tesla (CUDA-compatible) P6 cards: each card has 2048 cores and 16GB of RAM. Though note that the P6 is mainly a single-precision card, so unless you need the GPU double precision, double-precision calculations will be faster on a CPU @@ -1725,7 +1730,7 @@

    +
     #SBATCH --gpus=[1|2]
     

    @@ -1735,7 +1740,7 @@

    +
     sbatch -p pg ./<myscript>.sh
     

    @@ -1744,7 +1749,7 @@

    +
     ssh <username>@speed[-05|-17|37-43] nvidia-smi
     

    @@ -1753,7 +1758,7 @@

    +
     sinfo -p pg --long --Node
     

    @@ -1769,7 +1774,7 @@

    +
     [serguei@speed-submit src] % sinfo -p pg --long --Node
     Thu Oct 19 22:31:04 2023
     NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
    @@ -1797,7 +1802,7 @@ 

    +
     [serguei@speed-submit src] % squeue -p pg -o "%15N %.6D %7P %.11T %.4c %.8z %.6m %.8d %.6w %.8f %20G %20E"
     NODELIST         NODES PARTITI       STATE MIN_    S:C:T MIN_ME MIN_TMP_  WCKEY FEATURES GROUP DEPENDENCY
     speed-05             1 pg          RUNNING    1    *:*:*     1G        0 (null)   (null) 11929     (null)
    @@ -1810,7 +1815,7 @@ 

    -
    2.15.1 P6 on Multi-GPU, Multi-Node
    +
    2.15.1 P6 on Multi-GPU, Multi-Node

    As described above, P6 cards are not compatible with Distribute and DataParallel functions (Pytorch, Tensorflow) when running on multiple GPUs. One workaround is to run the job multi-node, with a single GPU per node; for example: @@ -1818,7 +1823,7 @@

    +
     #SBATCH --nodes=2
     #SBATCH --gpus-per-node=1
     
    @@ -1828,7 +1833,7 @@

    -
    2.15.2 CUDA
    +
    2.15.2 CUDA

    When calling CUDA within job scripts, it is important to create a link to the desired CUDA libraries and set the runtime link path to the same libraries. For example, to use the cuda-11.5 libraries, specify the following in your Makefile. @@ -1836,7 +1841,7 @@

    2.15.2

    -
    +   
     -L/encs/pkg/cuda-11.5/root/lib64 -Wl,-rpath,/encs/pkg/cuda-11.5/root/lib64
     

    @@ -1844,14 +1849,14 @@

    2.15.2 load gcc/8.4 or module load gcc/9.3
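    Putting the above flags together, a sketch of a compile-and-link line (assuming nvcc is available once the appropriate cuda and gcc modules are loaded; file names are illustrative):

     nvcc devicequery.cu -o devicequery \
         -L/encs/pkg/cuda-11.5/root/lib64 -Wl,-rpath,/encs/pkg/cuda-11.5/root/lib64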

    -
    2.15.3 Special Notes for sending CUDA jobs to the GPU Queue
    +
    2.15.3 Special Notes for sending CUDA jobs to the GPU Queue

    Interactive jobs (Section 2.8) must be submitted to the GPU partition in order to compile and link. We have several versions of CUDA installed in:

    -
    +   
     /encs/pkg/cuda-11.5/root/
     /encs/pkg/cuda-10.2/root/
     /encs/pkg/cuda-9.2/root
    @@ -1861,15 +1866,15 @@ 
    usrlocalcuda with one of the above.

    -
    2.15.4 OpenISS Examples
    +
    2.15.4 OpenISS Examples

    These represent more comprehensive, research-like examples of jobs for computer vision and other tasks with much longer runtimes (subject to the number of epochs and other parameters), derived from the actual research work of students and their theses. These jobs require the use of CUDA and GPUs. These examples are available as “native” jobs on Speed and as Singularity containers.

    -

    OpenISS and REID - +

    OpenISS and REID + The example openiss-reid-speed.sh illustrates a job for a computer-vision based person re-identification (e.g., motion capture-based tracking for stage performance) part of the OpenISS project by Haotao Lai [10] using TensorFlow and Keras. The fork of the original repo [12] adjusted to @@ -1884,8 +1889,8 @@

    2.15 -

    OpenISS and YOLOv3 - +

    OpenISS and YOLOv3 + The related code, using the YOLOv3 framework, is in the fork of the original repo [11], adjusted to run on Speed, here:

    @@ -1906,7 +1911,7 @@
    2.15
  • https://github.com/NAG-DevOps/speed-hpc/tree/master/src#openiss-yolov3
  • -

    2.16 Singularity Containers

    +

    2.16 Singularity Containers

    If the /encs software tree does not have the required software readily available, another option is to run Singularity containers. We run the EL7 flavor of Linux, and if some projects require Ubuntu or other distributions, it is possible to run that software as a container, including the ones @@ -1938,7 +1943,7 @@

    2

    -
    +   
     /speed-scratch/nag-public:
     
     openiss-cuda-conda-jupyter.sif
    @@ -1980,7 +1985,7 @@ 

    2

    -
    +   
     salloc --gpus=1 -n8 --mem=4Gb -t60
     cd /speed-scratch/$USER/
     singularity pull openiss-cuda-devicequery.sif docker://openiss/openiss-cuda-devicequery
    @@ -1991,16 +1996,16 @@ 

    2

    This method can be used for converting Docker containers directly on Speed. On GPU nodes, make sure to pass the --nv flag to Singularity, so its containers can access the GPUs. See the linked example. -
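    For instance, a sketch reusing the container pulled above:

     srun singularity run --nv openiss-cuda-devicequery.sif    ## --nv exposes the host GPUs inside the container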

    +

    -

    3 Conclusion

    -

    The cluster is, “first come, first served”, until it fills, and then job position in the queue is +

    3 Conclusion

    +

    The cluster is, “first come, first served”, until it fills, and then job position in the queue is based upon past usage. The scheduler does attempt to fill gaps, though, so sometimes a single-core job of lower priority will schedule before a multi-core job of higher priority, for example. -

    +

    -

    3.1 Important Limitations

    +

    3.1 Important Limitations

    • New users are restricted to a total of 32 cores: write to rt-ex-hpc@encs.concordia.ca if you need more temporarily (192 is the maximum, or, 6 jobs of 32 cores each). @@ -2009,9 +2014,9 @@

      3. interactive jobs, see Section 2.8).

    • -

      Scripts can live in your NFS-provided home, but any substantial data need to be in your +

      Scripts can live in your NFS-provided home, but any substantial data need to be in your cluster-specific directory (located at /speed-scratch/<ENCSusername>/). -

      NFS is great for acute activity, but is not ideal for chronic activity. Any data that a job will +

      NFS is great for acute activity, but is not ideal for chronic activity. Any data that a job will read more than once should be copied at the start to the scratch disk of a compute node using $TMPDIR (and, perhaps, $SLURM_SUBMIT_DIR), any intermediary job data should be produced in $TMPDIR, and once a job is near to finishing, those data should be copied @@ -2031,7 +2036,7 @@

      3.

    -

    3.2 Tips/Tricks

    +

    3.2 Tips/Tricks

    • Files/scripts must have Linux line breaks in them (not Windows ones). Use file command to verify; and dos2unix command to convert. @@ -2052,24 +2057,24 @@

      3.2

    • E-mail, rt-ex-hpc AT encs.concordia.ca, with any concerns/questions.
    -

    +

    -

    3.3 Use Cases

    +

    3.3 Use Cases

    • -

      HPC Committee’s initial batch about 6 students (end of 2019):

      +

      HPC Committee’s initial batch about 6 students (end of 2019):

      • 10000 iterations job in Fluent finished in \(<26\) hours vs. 46 hours in Calcul Quebec
    • -

      NAG’s MAC spoofer analyzer [1817], such as https://github.com/smokhov/atsm/tree/master/examples/flucid +

      NAG’s MAC spoofer analyzer [1817], such as https://github.com/smokhov/atsm/tree/master/examples/flucid

      • compilation of forensic computing reasoning cases about false or true positives of hardware address spoofing in the labs
    • -

      S4 LAB/GIPSY R&D Group’s:

      +

      S4 LAB/GIPSY R&D Group’s:

      • MARFCAT and MARFPCAT (OSS signal processing and machine learning tools for vulnerable and weak code analysis and network packet capture analysis) [20156] @@ -2114,7 +2119,7 @@

        3.3 https://doi.org/10.1177/0278364920913945

      • -

        The work “Haotao Lai. An OpenISS framework specialization for deep learning-based +

        The work “Haotao Lai. An OpenISS framework specialization for deep learning-based person re-identification. Master’s thesis, Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada, August 2019. https://spectrum.library.concordia.ca/id/eprint/985788/” using TensorFlow and Keras @@ -2128,15 +2133,15 @@

        3.3

      • Haotao Lai et al. OpenISS keras-yolo3 v0.1.0, June 2021. https://github.com/OpenISS/openiss-yolov3
      -

      and theirs forks by the team. +

      and their forks by the team.

    -

    +

    -

    A History

    -

    +

    A History

    +

    -

    A.1 Acknowledgments

    +

    A.1 Acknowledgments

    • The first 6 (to 6.5) versions of this manual and early UGE job script samples, Singularity testing and user support were produced/done by Dr. Scott Bunnell during his time at @@ -2146,102 +2151,102 @@

    • Dr. Tariq Daradkeh was our IT Instructional Specialist from August 2022 to September 2023, working on the scheduler, scheduling research, end user support, and integration - of examples, such as YOLOv3 in Section 2.15.4, among other tasks. We have a continued + of examples, such as YOLOv3 in Section 2.15.4, among other tasks. We have a continued collaboration on HPC/scheduling research.
    -

    +

    -

    A.2 Migration from UGE to SLURM

    -

    For long term users who started off with Grid Engine here are some resources to make a transition +

    A.2 Migration from UGE to SLURM

    +

    For long-term users who started off with Grid Engine, here are some resources to ease the transition and map the job submission process.

    • -

      Queues are called “partitions” in SLURM. Our mapping from the GE queues to SLURM +

      Queues are called “partitions” in SLURM. Our mapping from the GE queues to SLURM partitions is as follows:

      -
      +     
            GE  => SLURM
            s.q    ps
            g.q    pg
            a.q    pa
       
      -

      We also have a new partition pt that covers SPEED2 nodes, which previously did not +

      We also have a new partition pt that covers SPEED2 nodes, which previously did not exist.

    • -

      Commands and command options mappings are found in Figure 11 from
      https://slurm.schedmd.com/rosetta.pdf
      https://slurm.schedmd.com/pdfs/summary.pdf
      Other related helpful resources from similar organizations who either used SLURM for awhile or +

      Commands and command options mappings are found in Figure 11 from
      https://slurm.schedmd.com/rosetta.pdf
      https://slurm.schedmd.com/pdfs/summary.pdf
      Other related helpful resources from similar organizations who either used SLURM for a while or also transitioned to it:
      https://docs.alliancecan.ca/wiki/Running_jobs
      https://www.depts.ttu.edu/hpcc/userguides/general_guides/Conversion_Table_1.pdf
      https://docs.mpcdf.mpg.de/doc/computing/clusters/aux/migration-from-sge-to-slurm

      Figure 11: Rosetta Mappings of Scheduler Commands from SchedMD
    • -

      NOTE: If you have used UGE commands in the past you probably still have these lines there; +

      NOTE: If you have used UGE commands in the past you probably still have these lines there; they should now be removed, as they have no use in SLURM and will start giving “command not found” errors on login when the software is removed: -

      csh/tcsh: Sample .tcshrc file: +

      csh/tcsh: Sample .tcshrc file:

      -
      +     
            # Speed environment set up
            if ($HOSTNAME == speed-submit.encs.concordia.ca) then
               source /local/pkg/uge-8.6.3/root/default/common/settings.csh
            endif
       
      -

      -

      Bourne shell/bash: Sample .bashrc file: +

      +

      Bourne shell/bash: Sample .bashrc file:

      -
      +     
            # Speed environment set up
            if [ $HOSTNAME = "speed-submit.encs.concordia.ca" ]; then
                . /local/pkg/uge-8.6.3/root/default/common/settings.sh
                printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile
            fi
       
      -

      -

      Note that you will need to either log out and back in, or execute a new shell, for the +

      +

      Note that you will need to either log out and back in, or execute a new shell, for the environment changes in the updated .tcshrc or .bashrc file to be applied (important).

    -

    +

    -

    A.3 Phases

    -

    Brief summary of Speed evolution phases. -

    +

    A.3 Phases

    +

    Brief summary of Speed evolution phases. +

    -
    A.3.1 Phase 4
    -

    Phase 4 had 7 SuperMicro servers with 4x A100 80GB GPUs each added, dubbed as “SPEED2”. We +

    A.3.1 Phase 4
    +

    Phase 4 had 7 SuperMicro servers with 4x A100 80GB GPUs each added, dubbed as “SPEED2”. We also moved from Grid Engine to SLURM. -

    +

    -
    A.3.2 Phase 3
    -

    Phase 3 had 4 vidpro nodes added from Dr. Amer totalling 6x P6 and 6x V100 GPUs +

    A.3.2 Phase 3
    +

    Phase 3 had 4 vidpro nodes added from Dr. Amer totalling 6x P6 and 6x V100 GPUs added. -

    +

    -
    A.3.3 Phase 2
    -

    Phase 2 saw 6x NVIDIA Tesla P6 added and 8x more compute nodes. The P6s replaced 4x of FirePro +

    A.3.3 Phase 2
    +

    Phase 2 saw 6x NVIDIA Tesla P6 added and 8x more compute nodes. The P6s replaced 4x of FirePro S7150. -

    +

    -
    A.3.4 Phase 1
    -

    Phase 1 of Speed was of the following configuration: +

    A.3.4 Phase 1
    +

    Phase 1 of Speed was of the following configuration:

    • Sixteen, 32-core nodes, each with 512 GB of memory and approximately 1 TB of @@ -2251,20 +2256,20 @@
      A.3.4

    -

    B Frequently Asked Questions

    +

    B Frequently Asked Questions

    -

    B.1 Where do I learn about Linux?

    +

    B.1 Where do I learn about Linux?

    All Speed users are expected to have a basic understanding of Linux and its commonly used commands.

    -
    Software Carpentry
    +
    Software Carpentry

    Software Carpentry provides free resources to learn software, including a workshop on the Unix shell. https://software-carpentry.org/lessons/

    -
    Udemy
    +
    Udemy

    There are a number of Udemy courses, including free ones, that will assist you in learning Linux. Active Concordia faculty, staff and students have access to Udemy courses. The course Linux Mastery: Master the Linux Command Line in 11.5 Hours is a good starting point for @@ -2275,25 +2280,25 @@

    Udemy

    -

    B.2 How to use the “bash shell” on Speed?

    +

    B.2 How to use the “bash shell” on Speed?

    This section describes how to use the “bash shell” on Speed. Review Section 2.1.2 to ensure that your bash environment is set up.

    -
    B.2.1 How do I set bash as my login shell?
    +
    B.2.1 How do I set bash as my login shell?

    In order to set your default login shell to bash on Speed, your login shell on all GCS servers must be changed to bash. To make this change, create a ticket with the Service Desk (or email help at concordia.ca) to request that bash become your default login shell for your ENCS user account on all GCS servers.

    -
    B.2.2 How do I move into a bash shell on Speed?
    +
    B.2.2 How do I move into a bash shell on Speed?

    To move to the bash shell, type bash at the command prompt. For example:

    -
    +   
     [speed-submit] [/home/a/a_user] > bash
     bash-4.4$ echo $0
     bash
    @@ -2303,7 +2308,7 @@ 
    bash-4.4$ after entering the bash shell.

    -
    B.2.3 How do I use the bash shell in an interactive session on Speed?
    +
    B.2.3 How do I use the bash shell in an interactive session on Speed?

    Below are examples of how to use bash as a shell in your interactive job sessions with both the salloc and srun commands.

    @@ -2313,41 +2318,41 @@
    srun --mem=50G -n 5 --pty /encs/bin/bash

    Note: Make sure the interactive job requests memory, cores, etc.

    -
    B.2.4 How do I run scripts written in bash on Speed?
    +
    B.2.4 How do I run scripts written in bash on Speed?

    To execute bash scripts on Speed:

      -
    1. Ensure that the shebang of your bash job script is #!/encs/bin/bash +
    2. Ensure that the shebang of your bash job script is #!/encs/bin/bash
    3. -
    4. Use the sbatch command to submit your job script to the scheduler.
    +
  • Use the sbatch command to submit your job script to the scheduler.
  • The Speed GitHub contains a sample bash job script.

    -

    B.3 How to resolve “Disk quota exceeded” errors?

    +

    B.3 How to resolve “Disk quota exceeded” errors?

    -
    B.3.1 Probable Cause
    +
    B.3.1 Probable Cause

    The “Disk quota exceeded” Error occurs when your application has run out of disk space to write to. On Speed this error can be returned when:

      -
    1. Your NFS-provided home is full and cannot be written to. You can verify this using quota +
    2. Your NFS-provided home is full and cannot be written to. You can verify this using quota and bigfiles commands.
    3. -
    4. The /tmp directory on the speed node your application is running on is full and cannot +
    5. The /tmp directory on the speed node your application is running on is full and cannot be written to.

    -
    B.3.2 Possible Solutions
    +
    B.3.2 Possible Solutions

      -
    1. Use the --chdir job script option to set the directory that the job script is submitted +
    2. Use the --chdir job script option to set the directory that the job script is submitted from the job working directory. The job working directory is the directory that the job will write output files in.
    3. -
    4. +
5. The use of local disk space is generally recommended for IO-intensive operations. However, as the size of /tmp on speed nodes is 1TB, it can be necessary for scripts to store temporary data elsewhere. Review the documentation for each module called within your script to determine @@ -2368,7 +2373,7 @@

      B.

      -
      +         
                mkdir -m 750 /speed-scratch/$USER/output
                 
       
      @@ -2380,7 +2385,7 @@
      B.

      -
      +         
                mkdir -m 750 /speed-scratch/$USER/recovery
       

      @@ -2391,7 +2396,7 @@

      B.

      In the above example, $USER is an environment variable containing your ENCS username.

      -
      B.3.3 Example of setting working directories for COMSOL
      +
      B.3.3 Example of setting working directories for COMSOL
      • Create directories for recovery, temporary, and configuration files. For example, to create these @@ -2400,7 +2405,7 @@

        +
              mkdir -m 750 -p /speed-scratch/$USER/comsol/{recovery,tmp,config}
         

        @@ -2412,7 +2417,7 @@

        +
              -recoverydir /speed-scratch/$USER/comsol/recovery
              -tmpdir /speed-scratch/$USER/comsol/tmp
              -configuration/speed-scratch/$USER/comsol/config
        @@ -2421,7 +2426,7 @@ 
        In the above example, $USER is an environment variable containing your ENCS username.

        -
        B.3.4 Example of setting working directories for Python Modules
        +
        B.3.4 Example of setting working directories for Python Modules

        By default, when adding a python module, the /tmp directory is set as the temporary repository for file downloads. The size of the /tmp directory on speed-submit is too small for pytorch. To add a python module:

        @@ -2432,7 +2437,7 @@
        +
                mkdir /speed-scratch/$USER/tmp
         

        @@ -2443,7 +2448,7 @@

        +
                setenv TMPDIR /speed-scratch/$USER/tmp
         

        @@ -2452,17 +2457,17 @@

        In the above example, $USER is an environment variable containing your ENCS username.

        -

        B.4 How do I check my job’s status?

        +

        B.4 How do I check my job’s status?

        When a job with a job id of 1234 is running or terminated, the status of that job can be tracked using ‘sacct -j 1234’. squeue -j 1234 can show the job while it is sitting in the queue as well. Long-term statistics on the job after it has terminated can be found using sstat -j 1234 after slurmctld purges its tracking state into the database.
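        For example (a sketch; the --format field list is illustrative, see man sacct for the full set):

         sacct -j 1234 --format=JobID,JobName,State,Elapsed,MaxRSS
         squeue -j 1234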

        -

        B.5 Why is my job pending when nodes are empty?

        +

        B.5 Why is my job pending when nodes are empty?

        -
        B.5.1 Disabled nodes
        +
        B.5.1 Disabled nodes

        It is possible that one or a number of the Speed nodes are disabled. Nodes are disabled if they require maintenance. To verify if Speed nodes are disabled, see if they are in a draining or drained state: @@ -2470,7 +2475,7 @@

        B.5.1

        -
        +   
         [serguei@speed-submit src] % sinfo --long --Node
         Thu Oct 19 21:25:12 2023
         NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
        @@ -2519,17 +2524,17 @@ 
        B.5.1 and the disabled nodes have a state of idle.

        -
        B.5.2 Error in job submit request.
        +
        B.5.2 Error in job submit request.

        It is possible that your job is pending because the job requested resources that are not available within Speed. To verify why job id 1234 is not running, execute ‘sacct -j 1234’. A summary of the reasons is available via the squeue command. -

        +

        -

        C Sister Facilities

        -

        Below is a list of resources and facilities similar to Speed at various capacities. Depending on your +

        C Sister Facilities

        +

        Below is a list of resources and facilities similar to Speed at various capacities. Depending on your research group and needs, they might be available to you. They are not managed by HPC/NAG of AITS, so contact their respective representatives.

        @@ -2548,7 +2553,7 @@

        C
      • -

        There are various Lambda Labs other GPU servers and like computers acquired by individual +

        There are various Lambda Labs and other GPU servers, and similar computers, acquired by individual researchers; if you are a member of their research group, contact them directly. These resources are not managed by us.

          @@ -2583,8 +2588,8 @@

          C -

          References

          +

          +

          References

          diff --git a/src/README.md b/src/README.md index e39b2a1..55bde13 100644 --- a/src/README.md +++ b/src/README.md @@ -179,7 +179,7 @@ This will install pip and pip's dependencies, including python. #### No Space left error when creating Conda Environment -You are using your /home directory as conda default directory, the tarballs and pkgs are using all the space +You are using your `$HOME` directory as conda's default directory, and the tarballs and pkgs are using all the space `conda clean --all --dry-run` will show you the size of tarballs, packages, caches `conda clean --all` will wipe out all unused packages, caches and tarballs @@ -204,6 +204,15 @@ setenv CONDA_PKGS_DIRS $TMP/pkgs conda create -p $TMP/Venv-Name python==3.11 conda activate $TMP/Venv-Name ``` +#### Conda envs without prefix +If you don't want to use the `--prefix` option every time you create a new environment and you don't want to use the default `$HOME` directory, create a new directory and set the CONDA_ENVS_PATH and CONDA_PKGS_DIRS variables to point to the newly created directory, e.g.: + +``` +setenv CONDA_ENVS_PATH /speed-scratch/$USER/condas +setenv CONDA_PKGS_DIRS /speed-scratch/$USER/condas/pkg +``` + +If you want to make these changes permanent, add the variables to your .tcshrc or .bashrc (depending on the default shell you are using). ### efficientdet