From b80461989cb1872654e5bee929fb83a739fa0df3 Mon Sep 17 00:00:00 2001
From: Serguei Mokhov
This document primarily presents a quick start guide to the usage of the Gina Cody
+ This document primarily presents a quick start guide to the usage of the Gina Cody
School of Engineering and Computer Science compute server farm called “Speed” – the
GCS ENCS Speed cluster, managed by HPC/NAG of GCS ENCS, Concordia University,
Montreal, Canada.
@@ -39,76 +39,78 @@ This document contains basic information required to use “Speed” as well as tips and tricks,
+ This document contains basic information required to use “Speed” as well as tips and tricks,
examples, and references to projects and papers that have used Speed. User contributions of sample
jobs and/or references are welcome. Details are sent to the hpc-ml mailing list.
-
+
+
We receive support from the rest of AITS teams, such as NAG, SAG, FIS, and DOG.
-
+ We receive support from the rest of AITS teams, such as NAG, SAG, FIS, and DOG.
+
+
Prepare them for big clusters:
+
We have a great deal of open-source software available and installed on Speed – various Python,
+ We have a great deal of open-source software available and installed on Speed – various Python,
CUDA versions, C++/Java compilers, OpenGL, OpenFOAM, OpenCV, TensorFlow, OpenMPI,
-OpenISS, MARF [18], etc. There are also a number of commercial packages, subject to
-licensing contributions, available, such as MATLAB [7, 17], Abaqus [1], Ansys, Fluent [2],
+OpenISS, MARF [21], etc. There are also a number of commercial packages, subject to
+licensing contributions, available, such as MATLAB [10, 20], Abaqus [1], Ansys, Fluent [2],
etc.
- To see the packages available, run ls -al /encs/pkg/ on speed.encs.
- In particular, there are over 2200 programs available in /encs/bin and /encs/pkg under Scientific
+ To see the packages available, run ls -al /encs/pkg/ on speed.encs.
+ In particular, there are over 2200 programs available in /encs/bin and /encs/pkg under Scientific
Linux 7 (EL7).
Popular concrete examples:
Popular examples mentioned (loaded with, module):
+
After reviewing the “What Speed is” (Section 1.4) and “What Speed is Not” (Section 1.5), request
+ After reviewing the “What Speed is” (Section 1.4) and “What Speed is Not” (Section 1.5), request
access to the “Speed” cluster by emailing: rt-ex-hpc AT encs.concordia.ca. Faculty
and staff may request access directly. Students must include the following in their
message:
@@ -263,22 +265,22 @@
+
In these instructions, anything bracketed like so, <>, indicates a label/value to be replaced (the entire
+ In these instructions, anything bracketed like so, <>, indicates a label/value to be replaced (the entire
bracketed term needs replacement).
-
+
Before getting started, please review the “What Speed is” (Section 1.4) and “What Speed is Not”
+ Before getting started, please review the “What Speed is” (Section 1.4) and “What Speed is Not”
(Section 1.5). Once your GCS ENCS account has been granted access to “Speed”, use
your GCS ENCS account credentials to create an SSH connection to speed (an alias for
speed-submit.encs.concordia.ca).
-
+
Requirements to create connections to Speed:
+ Requirements to create connections to Speed:
@@ -289,7 +291,7 @@ Open up a terminal window and type in the following SSH command being sure to replace
+ Open up a terminal window and type in the following SSH command being sure to replace
<ENCSusername> with your ENCS account’s username.
@@ -298,18 +300,18 @@
- All users are expected to have a basic understanding of Linux and its commonly used
+
+ All users are expected to have a basic understanding of Linux and its commonly used
commands.
-
+
After creating an SSH connection to “Speed”, you will need to source the “Altair Grid Engine
+ After creating an SSH connection to “Speed”, you will need to source the “Altair Grid Engine
(AGE)” scheduler’s settings file. Sourcing the settings file will set the environment variables required
to execute scheduler commands.
- Based on the UNIX shell type, choose one of the following commands to source the settings
+ Based on the UNIX shell type, choose one of the following commands to source the settings
file.
- csh/tcsh:
+ csh/tcsh:
@@ -317,8 +319,8 @@
- Bourne shell/bash:
+
+ Bourne shell/bash:
@@ -326,8 +328,8 @@
- In order to set up the default ENCS bash shell, executing the following command is also
+
+ In order to set up the default ENCS bash shell, executing the following command is also
required:
@@ -336,10 +338,10 @@
- To verify that you have access to the scheduler commands execute qstat -f -u "*". If an error is
+
+ To verify that you have access to the scheduler commands execute qstat -f -u "*". If an error is
returned, attempt sourcing the settings file again.
- The next step is to copy a job template to your home directory and to set up your cluster-specific
+ The next step is to copy a job template to your home directory and to set up your cluster-specific
storage. Execute the following command from within your home directory. (To move to your home
directory, type cd at the Linux prompt and press Enter.)
@@ -349,15 +351,15 @@
- Tip: Add the source command to your shell-startup script.
- Tip: the default shell for GCS ENCS users is tcsh. If you would like to use bash, please contact
+
+ Tip: Add the source command to your shell-startup script.
+ Tip: the default shell for GCS ENCS users is tcsh. If you would like to use bash, please contact
rt-ex-hpc AT encs.concordia.ca.
- For new ENCS Users, and/or those who don’t have a shell-startup script, based on your shell
+ For new ENCS Users, and/or those who don’t have a shell-startup script, based on your shell
type use one of the following commands to copy a startup script from nul-uge’s home directory to
your home directory. (To move to your home directory, type cd at the Linux prompt and press
Enter.)
- csh/tcsh:
+ csh/tcsh:
@@ -365,8 +367,8 @@
- Bourne shell/bash:
+
+ Bourne shell/bash:
@@ -374,11 +376,11 @@
- Users who already have a shell-startup script, use a text editor, such as vim or emacs, to add the
+
+ Users who already have a shell-startup script, use a text editor, such as vim or emacs, to add the
source request to your existing shell-startup environment (i.e., to the .tcshrc file in your home
directory).
- csh/tcsh: Sample .tcshrc file:
+ csh/tcsh: Sample .tcshrc file:
@@ -389,8 +391,8 @@
- Bourne shell/bash: Sample .bashrc file:
+
+ Bourne shell/bash: Sample .bashrc file:
@@ -402,34 +404,34 @@
- Note that you will need to either log out and back in, or execute a new shell, for the environment
+
+ Note that you will need to either log out and back in, or execute a new shell, for the environment
changes in the updated .tcshrc or .bashrc file to be applied (important).
-
+
Preparing your job for submission is fairly straightforward. Editing a copy of the template.sh you
+ Preparing your job for submission is fairly straightforward. Editing a copy of the template.sh you
copied into your home directory in Section 2.1.2 is a good place to start. You can also use a job
script example from our GitHub’s (https://github.com/NAG-DevOps/speed-hpc) “src” directory
and base your job on it.
- Job scripts are broken into four main sections:
Speed: The GCS ENCS Cluster
Concordia University
Montreal, Quebec, Canada
rt-ex-hpc AT encs.concordia.ca
-
Abstract
-
Contents
1.1 Resources
-
1.2 Team
-
1.3 What Speed Comprises
-
1.4 What Speed Is Ideal For
-
1.5 What Speed Is Not
-
1.6 Available Software
-
1.7 Requesting Access
-
2 Job Management
-
2.1 Getting Started
-
2.1.1 SSH Connections
-
2.1.2 Environment Set Up
-
2.2 Job Submission Basics
-
2.2.1 Directives
-
2.2.2 Module Loads
-
2.2.3 User Scripting
-
2.3 Sample Job Script
-
2.4 Common Job Management Commands Summary
-
2.5 Advanced qsub Options
-
2.6 Array Jobs
-
-
-
-
2.7 Requesting Multiple Cores (i.e., Multithreading Jobs)
-
2.8 Interactive Jobs
-
2.9 Scheduler Environment Variables
-
2.10 SSH Keys For MPI
-
2.11 Creating Virtual Environments
-
2.11.1 Anaconda
-
2.12 Example Job Script: Fluent
-
2.13 Example Job: efficientdet
-
2.14 Java Jobs
-
2.15 Scheduling On The GPU Nodes
-
2.15.1 CUDA
-
2.15.2 Special Notes for sending CUDA jobs to the GPU Queue
-
3 Conclusion
-
3.1 Important Limitations
-
3.2 Tips/Tricks
-
3.3 Use Cases
-
A History
-
A.1 Acknowledgments
-
A.2 Phase 3
-
A.3 Phase 2
-
A.4 Phase 1
-
B Frequently Asked Questions
-
B.1 Where do I learn about Linux?
-
B.2 How to use the “bash shell” on Speed?
-
B.2.1 How do I set bash as my login shell?
-
B.2.2 How do I move into a bash shell on Speed?
-
B.2.3 How do I run scripts written in bash on Speed?
-
B.3 How to resolve“Disk quota exceeded” errors?
-
B.3.1 Probable Cause
-
B.3.2 Possible Solutions
-
B.3.3 Example of setting working directories for COMSOL
-
B.3.4 Example of setting working directories for Python Modules
-
B.4 How do I check my job’s status?
-
B.5 Why is my job pending when nodes are empty?
-
B.5.1 Disabled nodes
-
B.5.2 Error in job submit request.
-
C Sister Facilities
-
Annotated Bibliography
+ 1 Introduction
+
1.1 Resources
+
1.2 Team
+
1.3 What Speed Comprises
+
1.4 What Speed Is Ideal For
+
1.5 What Speed Is Not
+
1.6 Available Software
+
1.7 Requesting Access
+
2 Job Management
+
2.1 Getting Started
+
2.1.1 SSH Connections
+
2.1.2 Environment Set Up
+
2.2 Job Submission Basics
+
2.2.1 Directives
+
2.2.2 Module Loads
+
2.2.3 User Scripting
+
2.3 Sample Job Script
+
2.4 Common Job Management Commands Summary
+
2.5 Advanced qsub Options
+
2.6 Array Jobs
+
+
+
+
2.7 Requesting Multiple Cores (i.e., Multithreading Jobs)
+
2.8 Interactive Jobs
+
2.9 Scheduler Environment Variables
+
2.10 SSH Keys For MPI
+
2.11 Creating Virtual Environments
+
2.11.1 Anaconda
+
2.12 Example Job Script: Fluent
+
2.13 Example Job: efficientdet
+
2.14 Java Jobs
+
2.15 Scheduling On The GPU Nodes
+
2.15.1 CUDA
+
2.15.2 Special Notes for sending CUDA jobs to the GPU Queue
+
2.15.3 OpenISS Examples
+
2.16 Singularity Containers
+
3 Conclusion
+
3.1 Important Limitations
+
3.2 Tips/Tricks
+
3.3 Use Cases
+
A History
+
A.1 Acknowledgments
+
A.2 Phase 3
+
A.3 Phase 2
+
A.4 Phase 1
+
B Frequently Asked Questions
+
B.1 Where do I learn about Linux?
+
B.2 How to use the “bash shell” on Speed?
+
B.2.1 How do I set bash as my login shell?
+
B.2.2 How do I move into a bash shell on Speed?
+
B.2.3 How do I run scripts written in bash on Speed?
+
B.3 How to resolve “Disk quota exceeded” errors?
+
B.3.1 Probable Cause
+
B.3.2 Possible Solutions
+
B.3.3 Example of setting working directories for COMSOL
+
B.3.4 Example of setting working directories for Python Modules
+
B.4 How do I check my job’s status?
+
B.5 Why is my job pending when nodes are empty?
+
B.5.1 Disabled nodes
+
B.5.2 Error in job submit request.
+
C Sister Facilities
+
Annotated Bibliography
1 Introduction
-1.1 Resources
@@ -121,9 +123,9 @@
-1.1
1.2 Team
@@ -138,8 +140,8 @@ 1.2
1.3 What Speed Comprises
@@ -155,7 +157,7 @@
1.3
-
1.4 What Speed Is Ideal For
@@ -163,7 +165,7 @@
-
partial data sets.
1.6 Available Software
-
-1.6
1.6
-
1.7 Requesting Access
-1.7
2 Job Management
-2.1 Getting Started
-2.1.1 SSH Connections
-2.1.1
VPN requires a Concordia netname.
2.1.1
ssh <ENCSusername>@speed.encs.concordia.ca
-2.1.2 Environment Set Up
-2.
source /local/pkg/uge-8.6.3/root/default/common/settings.csh
-2.
. /local/pkg/uge-8.6.3/root/default/common/settings.sh
-2.
printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile
-2.
cp /home/n/nul-uge/template.sh . && mkdir /speed-scratch/$USER
-2.
cp /home/n/nul-uge/.tcshrc .
-2.
cp /home/n/nul-uge/.bashrc .
-2.
source /local/pkg/uge-8.6.3/root/default/common/settings.csh
endif
-
2.
printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile
fi
-
2.2 Job Submission Basics
-
Job scripts are broken into four main sections:
+
Directives are comments included at the beginning of a job script that set the shell and the options for the job scheduler.
The shebang directive is always the first line of a script. In your job script, this directive sets which shell your script’s commands will run in. On “Speed”, we recommend that your script use a shell from the /encs/bin directory.
To use the tcsh shell, start your script with: #!/encs/bin/tcsh
For bash, start with: #!/encs/bin/bash
Directives that start with "#$" set the options for the cluster’s “Altair Grid Engine (AGE)” scheduler. The script template, template.sh, provides the essentials:
@@ -442,15 +444,15 @@
Replace <jobname> with the name that you want your cluster job to have; -cwd makes the current working directory the “job working directory”, and your standard output file will appear there; -m bea provides e-mail notifications (begin/end/abort); replace <corecount> with the degree of (multithreaded) parallelism (i.e., cores) you attach to your job (up to 32), and be sure to delete or comment out the #$ -pe smp parameter if it is not relevant; replace <memory> with the value (in GB) that you want your job’s memory space to be (up to 500). All jobs MUST have a memory-space assignment.
If you are unsure about memory footprints, err on the side of assigning a generous memory space to your job so that it does not get prematurely terminated (the value given to h_vmem is a hard memory ceiling). You can refine h_vmem values for future jobs by monitoring the size of a job’s active memory space on speed-submit with:
@@ -461,17 +463,17 @@
qstat -j <jobID> | grep maxvmem
Memory-footprint values are also provided for completed jobs in the final e-mail notification (as “Max vmem”).
Jobs that request a low-memory footprint are more likely to load on a busy cluster.
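The directives described above could be combined into a job-script header along the following lines. This is only a sketch and not the contents of template.sh: the job name, core count, and memory value are placeholder assumptions chosen within the stated limits (32 cores, 500 GB), and JOB_NAME is a variable Grid Engine sets inside a job that the original text does not mention.

```shell
#!/encs/bin/bash
# Hypothetical job name (stands in for <jobname>).
#$ -N sketch-job
# Use the submission directory as the job working directory.
#$ -cwd
# E-mail notifications at begin/end/abort.
#$ -m bea
# Assumed core count (stands in for <corecount>); delete if not relevant.
#$ -pe smp 4
# Assumed hard memory ceiling (stands in for <memory>).
#$ -l h_vmem=8G

# JOB_NAME is set by the scheduler inside a job; fall back to the
# placeholder name when the script is run outside the scheduler.
JOB_LABEL="${JOB_NAME:-sketch-job}"
echo "Job ${JOB_LABEL} running in $(pwd)"
```

Comments are kept on their own lines rather than after each directive, so that the scheduler only ever sees options on the "#$" lines.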
As your job will run on a compute or GPU “Speed” node, and not the submit node, any software that is needed must be loaded by the job script. Software is loaded within the script just as it would be from the command line.
To see a list of which modules are available, execute the following from the command line on speed-submit.
@@ -480,8 +482,8 @@
module avail
To list the modules for a particular program (matlab, for example):
@@ -489,8 +491,8 @@
module -t avail matlab
Which, of course, can be shortened to match all that start with a particular letter:
@@ -498,8 +500,8 @@
module -t avail m
Insert the following in your script to load the matlab/R2020a module:
@@ -507,9 +509,9 @@
module load matlab/R2020a/default
Use unload in place of load to remove a module from active use.
To list loaded modules:
@@ -517,8 +519,8 @@
module list
To purge all software in your working environment:
@@ -526,28 +528,28 @@
module purge
Typically, only the module load command will be used in your script.
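As an illustration of loading software from within a job script, the fragment below guards the module load call so that the script reports a clear message when run somewhere the module command is not defined. The variable MODULE_STATUS is a hypothetical name introduced for this sketch; the module name is the one used in the text above.

```shell
# Load MATLAB only if the `module` command is available in this shell.
if command -v module >/dev/null 2>&1 && module load matlab/R2020a/default; then
    MODULE_STATUS="loaded"
    # Record the active modules in the job's output file.
    module list
else
    MODULE_STATUS="unavailable"
    echo "module load failed or module command not found"
fi
echo "matlab module status: ${MODULE_STATUS}"
```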
The last part of the job script is the scripting that will be executed by the job. This part of the job script includes all commands required to set up and execute the task your script has been written to do. Any Linux command can be used at this step. This section can be a simple call to an executable or a complex loop which iterates through a series of commands.
Every software program has a unique execution framework. It is the responsibility of the script’s author (e.g., you) to know what is required for the software used in your script by reviewing the software’s documentation. Regardless of which software your script calls, your script should be written so that the software knows the location of the input and output files as well as the degree of parallelism. Note that the cluster-specific environment variable, NSLOTS, resolves to the value provided to the scheduler in the -pe smp option.
Jobs which touch data-input and data-output files more than once should make use of TMPDIR, a scheduler-provided working space almost 1 TB in size. TMPDIR is created when a job starts, and exists on the local disk of the compute node executing your job. Using TMPDIR results in faster I/O operations than those to and from shared storage (which is provided over NFS).
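The staging pattern this implies (copy inputs to local disk, work there, copy results back) might be sketched as follows. This is an assumption-laden illustration, not the contents of templateTMPDIR.sh: the references/ and results/ directory names and the stand-in "program" (a plain ls) are hypothetical, and the fallbacks only exist so the fragment also runs outside the scheduler.

```shell
# Inside a job the scheduler sets both variables; the fallbacks are
# assumptions so the sketch can be tried on any machine.
WORKDIR="${SGE_O_WORKDIR:-$PWD}"
SCRATCH="${TMPDIR:-$(mktemp -d)}"

cd "$SCRATCH" || exit 1
mkdir -p input results

# Stage input data from shared storage onto the node-local disk.
if [ -d "$WORKDIR/references" ]; then
    cp -r "$WORKDIR/references/." input/
fi

# Placeholder for the real program: it reads input/ and writes results/.
ls input > results/listing.txt

# Copy results back to shared storage before the job ends.
mkdir -p "$WORKDIR/results"
cp -r results/. "$WORKDIR/results/"
cd "$WORKDIR" || exit 1
```

Copying results back explicitly matters because TMPDIR lives on the compute node's local disk and is tied to the lifetime of the job.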
A sample job script using TMPDIR is available at /home/n/nul-uge/templateTMPDIR.sh: the job is instructed to change to $TMPDIR, to make the new directory input, to copy data from $SGE_O_WORKDIR/references/ to input/ ($SGE_O_WORKDIR represents the current working directory), to make the new directory results, to execute the program (which takes input from
@@ -555,10 +557,10 @@
Now, let’s look at a basic job script, tcsh.sh in Figure 1 (you can copy it from our GitHub page or from /home/n/nul-uge).
The first line is the shell declaration (also known as a shebang) and sets the shell to tcsh. The lines that begin with #$ are directives for the scheduler.
The script then:
The scheduler command, qsub, is used to submit (non-interactive) jobs. From an ssh session on speed-submit, submit this job with qsub ./tcsh.sh. You will see, "Your job X ("qsub-test") has been submitted". The command, qstat, can be used to look at the status of the cluster: qstat -f -u "*". You will see something like this:
@@ -659,25 +661,25 @@
Remember that you only have 30 seconds before the job is essentially over, so if you do not see a similar output, either adjust the sleep time in the script, or execute the qstat statement more quickly. The qstat output listed above shows you that your job is running on node speed-05, that it has a job number of 144, that it was started at 16:39:30 on 12/03/2018, and that it is a single-core job (the default).
Once the job finishes, there will be a new file in the directory that the job was started from, with the syntax of, "job name".o"job number", so in this example the file is qsub-test.o144. This file represents the standard output (and error, if there is any) of the job in question. If you look at the contents of your newly created file, you will see that it contains the output of the module list command. Important information is often written to this file.
Congratulations on your first job!
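The output-file naming rule described above can be expressed as a small helper. The function name job_output_file is hypothetical and exists only for this sketch; the job name and number are the ones from the example.

```shell
# Build the standard-output file name from a job name and job number,
# following the "job name".o"job number" pattern.
job_output_file() {
    printf '%s.o%s\n' "$1" "$2"
}

# For the example job, this prints: qsub-test.o144
job_output_file qsub-test 144
```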
Here are useful job-management commands:
In addition to the basic qsub options presented earlier, there are a few additional options that are generally useful:
Array jobs are those that start a batch job or a parallel job multiple times. Each iteration of the job array is called a task and receives a unique job ID.
To submit an array job, use the -t option of the qsub command as follows:
@@ -744,15 +746,15 @@
qsub -t n[-m[:s]] <batch_script>
-t Option Syntax:
Examples:
Output files for Array Jobs:
The default output and error files are job_name.[o|e]job_id and job_name.[o|e]job_id.task_id. This means that Speed creates an output and an error file for each task generated by the array job as well as one for the super-ordinate array job. To alter this behavior use the -o and -e options of qsub.
For more details about Array Job options, please review the manual pages for qsub by executing the following at the command line on speed-submit: man qsub.
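To make the n-m:s range syntax concrete, the sketch below lists the task IDs such a range would produce and shows how a task might select its own input. The helper array_task_ids and the input_<id>.dat naming are hypothetical; SGE_TASK_ID is the variable Grid Engine sets for each array task, which the text above does not mention, so treat its use here as an assumption.

```shell
# List the task IDs generated by a range n-m with optional step s
# (the step defaults to 1).
array_task_ids() {
    # $1 = n (first ID), $2 = m (last ID), $3 = s (step)
    seq "$1" "${3:-1}" "$2"
}

# IDs for a submission such as: qsub -t 2-10:4 <batch_script>
array_task_ids 2 10 4

# Inside the batch script, each task can pick its own input file.
# Outside an array job the variable is unset, so default to 1 here.
TASK_ID="${SGE_TASK_ID:-1}"
INPUT_FILE="input_${TASK_ID}.dat"
echo "Task ${TASK_ID} would read ${INPUT_FILE}"
```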
For jobs that can take advantage of multiple machine cores, up to 32 cores (per job) can be requested in your script with:
@@ -779,26 +781,26 @@
Do not request more cores than you think will be useful, as larger-core jobs are more difficult to schedule. On the flip side, though, if you are going to be running a program that scales out to the maximum single-machine core count available, please (please) request 32 cores, to avoid node oversubscription (i.e., to avoid overloading the CPUs).
Core count associated with a job appears under “states” in the qstat -f -u "*" output.
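One way to keep a program's thread count in line with the cores actually granted is to derive it from NSLOTS inside the job script. The sketch below assumes an OpenMP-style program that honours OMP_NUM_THREADS (a standard OpenMP variable, not something the text above prescribes); clamp_cores is a hypothetical helper that keeps the value within the 1 to 32 range.

```shell
# Clamp a requested core count to the 1..32 range allowed per job.
clamp_cores() {
    n="$1"
    if [ "$n" -lt 1 ]; then n=1; fi
    if [ "$n" -gt 32 ]; then n=32; fi
    echo "$n"
}

# NSLOTS is set by the scheduler from the -pe smp request; default to 1
# when the script is run outside a job.
THREADS="$(clamp_cores "${NSLOTS:-1}")"
export OMP_NUM_THREADS="$THREADS"
echo "Using ${OMP_NUM_THREADS} thread(s)"
```

Deriving the thread count from NSLOTS rather than hard-coding it avoids the oversubscription problem when the core request in the directive changes.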
Job sessions can be interactive, instead of batch (script) based. Such sessions can be useful for testing and optimising code and resource requirements prior to batch submission. To request an interactive job session, use qlogin [options], similarly to a qsub command-line job (e.g., qlogin -N qlogin-test -l h_vmem=1G). Note that the options that are available for qsub are not necessarily available for qlogin, notably -cwd and -v.
The scheduler presents a number of environment variables that can be used in your jobs. Three of the more useful are TMPDIR, SGE_O_WORKDIR, and NSLOTS:
Figure 2 shows a sample script using all three.
The following documentation is specific to the Speed HPC Facility at the Gina Cody School of Engineering and Computer Science.
To create an anaconda environment in your speed-scratch directory, use the prefix option when executing conda create. For example, to create an anaconda environment for a_user, execute the following at the command line:
@@ -876,11 +878,11 @@
Note: Without the prefix option, the conda create command creates the environment in a_user’s home directory by default.
List Environments. To view your conda environments, type: conda info --envs
@@ -892,9 +894,9 @@
Activate an Environment. Activate the environment /speed-scratch/a_user/myconda as follows:
@@ -903,7 +905,7 @@
After activating your environment, add pip to your environment by using:
@@ -911,10 +913,10 @@
This will install pip and pip’s dependencies, including python, into the environment.
Important Note: pip (and pip3) install modules from the Python Package Index (PyPI), while conda install installs modules from anaconda’s repository.
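The environment set-up steps could be rehearsed with a dry-run wrapper before touching the cluster. This is a sketch under assumptions: the prefix path, the environment name myconda, and the run helper are illustrative, and the exact commands in the manual's elided listings may differ. The conda subcommands and flags used (create --prefix, info --envs, install) are standard conda CLI.

```shell
# Hypothetical environment location under the user's speed-scratch space.
ENV_PREFIX="/speed-scratch/${USER:-a_user}/myconda"

# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create the environment under speed-scratch rather than the home directory.
run conda create --yes --prefix "$ENV_PREFIX"
# Confirm that the new environment is listed.
run conda info --envs
# Add pip (and its dependencies) to that environment.
run conda install --yes --prefix "$ENV_PREFIX" pip
```

Setting DRY_RUN=0 would execute the same commands for real, which only makes sense on a node where conda is available.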