diff --git a/README.md b/README.md index 78c25fd..bc81afc 100644 --- a/README.md +++ b/README.md @@ -11,18 +11,18 @@ Speed: Gina Cody School HPC Facility: Scripts, Tools, and Refs * [Overview Slides](https://docs.google.com/presentation/d/1zu4OQBU7mbj0e34Wr3ILXLPWomkhBgqGZ8j8xYrLf44/edit?usp=sharing) * [AITS Service Desk](https://www.concordia.ca/ginacody/aits.html) -## Examples ## +## Examples * [`src/`](src/) -- sample job scripts * [`doc/`](doc/) -- user manual sources -## Contributing and TODO ## +## Contributing and TODO * [Public issue tracker](https://github.com/NAG-DevOps/speed-hpc/issues) * [Contributions (pull requests)](https://github.com/NAG-DevOps/speed-hpc/pulls) are welcome for your sample job scripts or links/references (subject to reviews) * For Internal access and support requests, please see the GCS Speed Facility link above -### Contributors ### +### Contributors * See the overall contributors [here](https://github.com/NAG-DevOps/speed-hpc/graphs/contributors) * [Serguei A. Mokhov](https://github.com/smokhov) -- project lead @@ -30,13 +30,13 @@ Speed: Gina Cody School HPC Facility: Scripts, Tools, and Refs * [Anh H Nguyen](https://github.com/aaanh) contributed the [HTML](https://nag-devops.github.io/speed-hpc/) version of the manual and its generation off our LaTeX sources as well as the corresponding [devcontainer](https://github.com/NAG-DevOps/speed-hpc/tree/master/doc/.devcontainer) environment * The initial Grid Engine V6 manual was written by Dr. Scott Bunnell -## References ## +## References -### Conferences ### +### Conferences * Serguei Mokhov, Jonathan Llewellyn, Carlos Alarcon Meza, Tariq Daradkeh, and Gillian Roper. 2023. **The use of Containers in OpenGL, ML and HPC for Teaching and Research Support.** In ACM SIGGRAPH 2023 Posters (SIGGRAPH '23). Association for Computing Machinery, New York, NY, USA, Article 49, 1–2. 
[DOI: 10.1145/3588028.3603676](https://doi.org/10.1145/3588028.3603676) -### Related Repositories ### +### Related Repositories * [OpenISS Dockerfiles](https://github.com/NAG-DevOps/openiss-dockerfiles) -- the source of the Docker containers for the above poster as well as Singularity images based off it for Speed * Sample repositories of complete, more complex projects (beyond the baby-job examples), based on the work of students and their theses: @@ -44,10 +44,19 @@ Speed: Gina Cody School HPC Facility: Scripts, Tools, and Refs * https://github.com/NAG-DevOps/openiss-reid-tfk * https://github.com/NAG-DevOps/kg-recommendation-framework -### Technical ### +### Technical + +* [Slurm Workload Manager](https://en.wikipedia.org/wiki/Slurm_Workload_Manager) +* [Linux and other tutorials from Software Carpentry](https://software-carpentry.org/lessons/) +* [Digital Research Alliance of Canada SLURM Examples](https://docs.alliancecan.ca/wiki/Running_jobs) +* Concordia's subscription to [Udemy resources](https://www.concordia.ca/it/services/udemy.html) +* [NVIDIA Tesla P6](https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/solutions/resources/documents1/Tesla-P6-Product-Brief.pdf) +* [AMD Tonga FirePro S7100X](https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#FirePro_Server_Series_(S000x/Sxx_000)) + +### Legacy + +Speed no longer runs Grid Engine; these are provided for reference only. 
* [Altair Grid Engine (AGE)](https://www.altair.com/grid-engine/) (formerly [Univa Grid Engine (UGE)](https://en.wikipedia.org/wiki/Univa_Grid_Engine)) * [UGE User Guide for version 8.6.3 (current version running on speed)](https://github.com/NAG-DevOps/speed-hpc/blob/master/doc/UsersGuideGE.pdf) * [Altair product documentation](https://community.altair.com/community?id=altair_product_documentation) -* [NVIDIA Tesla P6](https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/solutions/resources/documents1/Tesla-P6-Product-Brief.pdf) -* [AMD Tonga FirePro S7100X](https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#FirePro_Server_Series_(S000x/Sxx_000)) diff --git a/SECURITY.md b/SECURITY.md index 0ded7a1..b58364a 100644 --- a/SECURITY.md +++ b/SECURITY.md @@ -4,8 +4,9 @@ | Version | Supported | | ------- | ------------------ | -| 6.6.x | :white_check_mark: | -| 6.5.x | :x: | +| 7.x | :white_check_mark: | +| 6.6.x | :x: | +| 6.5.x | :x: | | < 6.5 | :x: | ## Reporting a Vulnerability diff --git a/doc/GE/AdminsGuideGE.pdf b/doc/GE/AdminsGuideGE.pdf deleted file mode 100644 index 5e6a47a..0000000 Binary files a/doc/GE/AdminsGuideGE.pdf and /dev/null differ diff --git a/doc/GE/IntroductionGE.pdf b/doc/GE/IntroductionGE.pdf deleted file mode 100644 index abbb61e..0000000 Binary files a/doc/GE/IntroductionGE.pdf and /dev/null differ diff --git a/doc/GE/ManpageReferenceGE.pdf b/doc/GE/ManpageReferenceGE.pdf deleted file mode 100644 index 6ca866f..0000000 Binary files a/doc/GE/ManpageReferenceGE.pdf and /dev/null differ diff --git a/doc/GE/TroubleShootingQuickReferenceGE.pdf b/doc/GE/TroubleShootingQuickReferenceGE.pdf deleted file mode 100644 index 4b38b4a..0000000 Binary files a/doc/GE/TroubleShootingQuickReferenceGE.pdf and /dev/null differ diff --git a/doc/GE/UsersGuideGE.pdf b/doc/GE/UsersGuideGE.pdf deleted file mode 100644 index 6d8c6d5..0000000 Binary files a/doc/GE/UsersGuideGE.pdf and /dev/null differ diff --git 
a/doc/images/pycharm.png b/doc/images/pycharm.png new file mode 100644 index 0000000..d595105 Binary files /dev/null and b/doc/images/pycharm.png differ diff --git a/doc/images/rosetta-mapping.png b/doc/images/rosetta-mapping.png new file mode 100644 index 0000000..a1150b9 Binary files /dev/null and b/doc/images/rosetta-mapping.png differ diff --git a/doc/images/slurm-arch.png b/doc/images/slurm-arch.png new file mode 100644 index 0000000..0a1f3fc Binary files /dev/null and b/doc/images/slurm-arch.png differ diff --git a/doc/images/speed-pics.png b/doc/images/speed-pics.png new file mode 100644 index 0000000..080ead2 Binary files /dev/null and b/doc/images/speed-pics.png differ diff --git a/doc/scheduler-directives.tex b/doc/scheduler-directives.tex index 4665a38..61d6e0d 100644 --- a/doc/scheduler-directives.tex +++ b/doc/scheduler-directives.tex @@ -1,51 +1,119 @@ -% ------------------------------------------------------------------------------ +% ------------------------------------------------------------------------------ \subsubsection{Directives} +\label{sect:directives} Directives are comments included at the beginning of a job script that set the shell and the options for the job scheduler. - +% The shebang directive is always the first line of a script. In your job script, this directive sets which shell your script's commands will run in. On ``Speed'', we recommend that your script use a shell from the \texttt{/encs/bin} directory. -To use the \texttt{tcsh} shell, start your script with: \verb|#!/encs/bin/tcsh| +To use the \texttt{tcsh} shell, start your script with \verb|#!/encs/bin/tcsh|. +% +For \texttt{bash}, start with \verb|#!/encs/bin/bash|. +% +Directives that start with \verb|#SBATCH|, set the options for the cluster's +SLURM scheduler. 
The script template, \texttt{template.sh}, +provides the essentials: -For \texttt{bash}, start with: \verb|#!/encs/bin/bash| +%\begin{verbatim} +%#$ -N +%#$ -cwd +%#$ -m bea +%#$ -pe smp +%#$ -l h_vmem=G +%\end{verbatim} +\begin{verbatim} +#SBATCH --job-name=tmpdir ## Give the job a name +#SBATCH --mail-type=ALL ## Receive all email type notifications +#SBATCH --mail-user=$USER@encs.concordia.ca +#SBATCH --chdir=./ ## Use current directory as working directory +#SBATCH --nodes=1 +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task= ## Request, e.g. 8 cores +#SBATCH --mem= ## Assign, e.g., 32G memory per node +\end{verbatim} -Directives that start with \verb|"#$"|, set the options for the cluster's -``Altair Grid Engine (AGE)'' scheduler. The script template, \file{template.sh}, -provides the essentials: +and its short option equivalents: \begin{verbatim} -#$ -N -#$ -cwd -#$ -m bea -#$ -pe smp -#$ -l h_vmem=G +#SBATCH -J tmpdir ## Give the job a name +#SBATCH --mail-type=ALL ## Receive all email type notifications +#SBATCH --mail-user=$USER@encs.concordia.ca +#SBATCH --chdir=./ ## Use current directory as working directory +#SBATCH -N 1 +#SBATCH -n 8 ## Request 8 cores +#SBATCH --mem=32G ## Assign 32G memory per node \end{verbatim} Replace, \verb++, with the name that you want your cluster job to have; -\option{-cwd}, makes the current working directory the ``job working directory'', -and your standard output file will appear here; \option{-m bea}, provides e-mail -notifications (begin/end/abort); replace, \verb++, with the degree of -(multithreaded) parallelism (i.e., cores) you attach to your job (up to 32), -be sure to delete or comment out the \verb| #$ -pe smp | parameter if it -is not relevant; replace, \verb++, with the value (in GB), that you want -your job's memory space to be (up to 500), and all jobs MUST have a memory-space -assignment. 
+\option{--chdir}, makes the current working directory the ``job working directory'', +and your standard output file will appear here; \option{--mail-type}, provides e-mail +notifications (success, error, etc. or all); replace, \verb++, with the degree of +(multithreaded) parallelism (i.e., cores) you attach to your job (up to 32 by default). +%be sure to delete or comment out the \verb| #$ -pe smp | parameter if it +%is not relevant; +Replace, \verb++, with the value (in GB), that you want +your job's memory space to be (up to 500 depending on the node), and all jobs MUST have a memory-space +assignment. +% If you are unsure about memory footprints, err on assigning a generous -memory space to your job so that it does not get prematurely terminated -(the value given to \api{h\_vmem} is a hard memory ceiling). You can refine -\api{h\_vmem} values for future jobs by monitoring the size of a job's active +memory space to your job, so that it does not get prematurely terminated. +%(the value given to \api{h\_vmem} is a hard memory ceiling). +You can refine +%\api{h\_vmem} +\option{--mem} +values for future jobs by monitoring the size of a job's active memory space on \texttt{speed-submit} with: +%\begin{verbatim} +%qstat -j | grep maxvmem +%\end{verbatim} + \begin{verbatim} -qstat -j | grep maxvmem +sacct -j +sstat -j \end{verbatim} -Memory-footprint values are also provided for completed jobs in the final -e-mail notification (as, ``Max vmem''). +\noindent +This can be customized to show specific columns: + +\begin{verbatim} +sacct -o jobid,maxvmsize,ntasks%7,tresusageouttot%25 -j +sstat -o jobid,maxvmsize,ntasks%7,tresusageouttot%25 -j +\end{verbatim} +Memory-footprint values are also provided for completed jobs in the final +e-mail notification (as, ``maxvmsize''). +% \emph{Jobs that request a low-memory footprint are more likely to load on a busy cluster.} + +Other essential options are \option{-t} and \option{-A}. 
+% +\begin{itemize} +\item +\option{-t} -- an estimate of how long your job will run; it is +used in computing your job's scheduling priority. The maximums, as already mentioned, +are 7 days for batch jobs and 24 hours for interactive jobs. Requesting less +time may get your job scheduled sooner. There is no single ``best'' value; +it is usually determined empirically from past runs. + +\item +\option{-A} -- the account (project or association) to which the job's resource usage is attributed. This is usually +your research or supervisor group, a project, or some other +association. When moving from GE to SLURM, we ported most users to +two default accounts, \texttt{speed1} and \texttt{speed2}. These +are generic catch-all accounts to use if you are unsure which applies. +Normally, our introductory e-mail states which account to use, which may +be your default account. For example, +\texttt{aits}, +\texttt{vidpro}, +\texttt{gipsy}, +\texttt{ai2}, +\texttt{mpackir}, or +\texttt{cmos}, among others. + +\end{itemize} diff --git a/doc/scheduler-env.tex b/doc/scheduler-env.tex index 2707e55..830f589 100644 --- a/doc/scheduler-env.tex +++ b/doc/scheduler-env.tex @@ -1,34 +1,42 @@ -% ------------------------------------------------------------------------------ +% ------------------------------------------------------------------------------ \subsubsection{Environment Set Up} \label{sect:envsetup} -After creating an SSH connection to ``Speed'', you will need to source -the ``Altair Grid Engine (AGE)'' scheduler's settings file. -Sourcing the settings file will set the environment variables required to -execute scheduler commands. - -Based on the UNIX shell type, choose one of the following commands to source -the settings file. - -csh/\tool{tcsh}: -\begin{verbatim} -source /local/pkg/uge-8.6.3/root/default/common/settings.csh -\end{verbatim} - -Bourne shell/\tool{bash}: -\begin{verbatim} -. 
/local/pkg/uge-8.6.3/root/default/common/settings.sh -\end{verbatim} - -In order to set up the default ENCS bash shell, executing the following command -is also required: -\begin{verbatim} -printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile -\end{verbatim} - -To verify that you have access to the scheduler commands execute -\texttt{qstat -f -u "*"}. If an error is returned, attempt sourcing -the settings file again. +After creating an SSH connection to Speed, you will need to +make sure the \tool{srun}, \tool{sbatch}, and \tool{salloc} +commands are available to you. +Type the command name at the command prompt and press Enter. +If a command is not available (e.g., ``command not found'' is returned), +make sure that \texttt{/local/bin} is in your \api{\$PATH}. +To view your \api{\$PATH}, type \texttt{echo \$PATH} at the prompt. +% +%source +%the ``Altair Grid Engine (AGE)'' scheduler's settings file. +%Sourcing the settings file will set the environment variables required to +%execute scheduler commands. +% +%Based on the UNIX shell type, choose one of the following commands to source +%the settings file. +% +%csh/\tool{tcsh}: +%\begin{verbatim} +%source /local/pkg/uge-8.6.3/root/default/common/settings.csh +%\end{verbatim} +% +%Bourne shell/\tool{bash}: +%\begin{verbatim} +%. /local/pkg/uge-8.6.3/root/default/common/settings.sh +%\end{verbatim} +% +%In order to set up the default ENCS bash shell, executing the following command +%is also required: +%\begin{verbatim} +%printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile +%\end{verbatim} +% +%To verify that you have access to the scheduler commands execute +%\texttt{qstat -f -u "*"}. If an error is returned, attempt sourcing +%the settings file again. The next step is to copy a job template to your home directory and to set up your cluster-specific storage. 
Execute the following command from within your @@ -39,50 +47,50 @@ \subsubsection{Environment Set Up} cp /home/n/nul-uge/template.sh . && mkdir /speed-scratch/$USER \end{verbatim} -\textbf{Tip:} Add the source command to your shell-startup script. +%\textbf{Tip:} Add the source command to your shell-startup script. \textbf{Tip:} the default shell for GCS ENCS users is \tool{tcsh}. If you would like to use \tool{bash}, please contact \texttt{rt-ex-hpc AT encs.concordia.ca}. -For \textbf{new ENCS Users}, and/or those who don't have a shell-startup script, -based on your shell type use one of the following commands to copy a start up script -from \texttt{nul-uge}'s. home directory to your home directory. (To move to your home -directory, type \tool{cd} at the Linux prompt and press \texttt{Enter}.) - -csh/\tool{tcsh}: -\begin{verbatim} -cp /home/n/nul-uge/.tcshrc . -\end{verbatim} - -Bourne shell/\tool{bash}: -\begin{verbatim} -cp /home/n/nul-uge/.bashrc . -\end{verbatim} - -Users who already have a shell-startup script, use a text editor, such as -\tool{vim} or \tool{emacs}, to add the source request to your existing -shell-startup environment (i.e., to the \file{.tcshrc} file in your home directory). - -csh/\tool{tcsh}: -Sample \file{.tcshrc} file: -\begin{verbatim} -# Speed environment set up -if ($HOSTNAME == speed-submit.encs.concordia.ca) then - source /local/pkg/uge-8.6.3/root/default/common/settings.csh -endif -\end{verbatim} - -Bourne shell/\tool{bash}: -Sample \file{.bashrc} file: -\begin{verbatim} -# Speed environment set up -if [ $HOSTNAME = "speed-submit.encs.concordia.ca" ]; then - . /local/pkg/uge-8.6.3/root/default/common/settings.sh - printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile -fi -\end{verbatim} - -Note that you will need to either log out and back in, or execute a new shell, -for the environment changes in the updated \file{.tcshrc} or \file{.bashrc} file to be applied -(\textbf{important}). 
+%For \textbf{new GCS ENCS Users}, and/or those who don't have a shell-startup script, +%based on your shell type use one of the following commands to copy a start up script +%from \texttt{nul-uge}'s home directory to your home directory. (To move to your home +%directory, type \tool{cd} at the Linux prompt and press \texttt{Enter}.) + +%csh/\tool{tcsh}: +%\begin{verbatim} +%cp /home/n/nul-uge/.tcshrc . +%\end{verbatim} + +%Bourne shell/\tool{bash}: +%\begin{verbatim} +%cp /home/n/nul-uge/.bashrc . +%\end{verbatim} + +%Users who already have a shell-startup script, can use a text editor, such as +%\tool{vim} or \tool{emacs}, to add the source request to your existing +%shell-startup environment (i.e., to the \file{.tcshrc} file in your home directory). + +%csh/\tool{tcsh}: +%Sample \file{.tcshrc} file: +%\begin{verbatim} +%# Speed environment set up +%if ($HOSTNAME == speed-submit.encs.concordia.ca) then + %source /local/pkg/uge-8.6.3/root/default/common/settings.csh +%endif +%\end{verbatim} +% +%Bourne shell/\tool{bash}: +%Sample \file{.bashrc} file: +%\begin{verbatim} +%# Speed environment set up +%if [ $HOSTNAME = "speed-submit.encs.concordia.ca" ]; then + %. /local/pkg/uge-8.6.3/root/default/common/settings.sh + %printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile +%fi +%\end{verbatim} + +Note, if you are getting ``command not found'' error(s) when logging in, you +probably have old Grid Engine environment commands. Remove them +as per \xa{appdx:uge-to-slurm}. 
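The PATH check and one-time setup described in this section can be sketched as a single script (a sketch, assuming \tool{bash}; the \texttt{/local/bin}, \texttt{template.sh}, and \texttt{/speed-scratch} paths are the ones given above, and the copy step is guarded so the script is a harmless no-op off-cluster):

```shell
#!/encs/bin/bash
# Verify that the SLURM client commands are reachable, as described above.
have_cmd() { command -v "$1" >/dev/null 2>&1 && echo yes || echo no; }

for cmd in srun sbatch salloc; do
    if [ "$(have_cmd "$cmd")" = no ]; then
        echo "$cmd: command not found -- check that /local/bin is in \$PATH"
    fi
done

# One-time setup from this section: copy the job template and create
# your cluster-specific scratch directory (skipped off-cluster).
if [ -f /home/n/nul-uge/template.sh ]; then
    cp /home/n/nul-uge/template.sh . && mkdir -p "/speed-scratch/$USER"
fi
```

Run it once after your first login; if it reports any command as not found, adjust your \api{\$PATH} as described above.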
diff --git a/doc/scheduler-faq.tex b/doc/scheduler-faq.tex index 3d0eed5..f19cca6 100644 --- a/doc/scheduler-faq.tex +++ b/doc/scheduler-faq.tex @@ -1,9 +1,10 @@ -% ------------------------------------------------------------------------------ +% ------------------------------------------------------------------------------ \section{Frequently Asked Questions} \label{sect:faqs} % ------------------------------------------------------------------------------ \subsection{Where do I learn about Linux?} +\label{sect:faqs-linux} All Speed users are expected to have a basic understanding of Linux and its commonly used commands. @@ -33,7 +34,8 @@ \subsection{How to use the ``bash shell'' on Speed?} \subsubsection{How do I set bash as my login shell?} In order to set your login shell to bash on Speed, your login shell on all GCS servers must be changed to bash. -To make this change, create a ticket with the Service Desk (or email help at concordia.ca) to request that bash become your default login shell for your ENCS user account on all GCS servers. +To make this change, create a ticket with the Service Desk (or email \texttt{help at concordia.ca}) to +request that bash become your default login shell for your ENCS user account on all GCS servers. % ------------------------------------------------------------------------------ \subsubsection{How do I move into a bash shell on Speed?} @@ -48,18 +50,33 @@ \subsubsection{How do I move into a bash shell on Speed?} Note how the command prompt changed from \verb![speed-submit] [/home/a/a_user] >! to \verb!bash-4.4$! after entering the bash shell. +% ------------------------------------------------------------------------------ +\subsubsection{How do I use the bash shell in an interactive session on Speed?} +% The language here is unclear. 
TODO Update for clarity and provide examples +Either of the commands below will allocate an interactive job session (make sure job request settings, such +as memory, cores, etc., are set) with \tool{bash} as the shell on the compute nodes: + +\begin{itemize} + \item \texttt{salloc ... /encs/bin/bash} + \item \texttt{srun ... --pty /encs/bin/bash} +\end{itemize} + % ------------------------------------------------------------------------------ \subsubsection{How do I run scripts written in bash on Speed?} To execute bash scripts on Speed: \begin{enumerate} \item -Ensure that the shebang of your bash job script is \verb!#!/encs/bin/bash! +Ensure that the shebang of your bash job script is \verb+#!/encs/bin/bash+ \item -Use the qsub command to submit your job script to the scheduler. +Use the \tool{sbatch} command to submit your job script to the scheduler. \end{enumerate} -The Speed GitHub contains a sample \href{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/bash.sh}{bash job script}. +The Speed GitHub contains a sample +\href + {https://github.com/NAG-DevOps/speed-hpc/blob/master/src/bash.sh} + {bash job script}. % ------------------------------------------------------------------------------ \subsection{How to resolve ``Disk quota exceeded'' errors?} % ------------------------------------------------------------------------------ \subsubsection{Probable Cause} -The \texttt{``Disk quota exceeded''} Error occurs when your application has run out of disk space to write to. On Speed this error can be returned when: +The ``\texttt{Disk quota exceeded}'' error occurs when your application has +run out of disk space to write to. On Speed this error can be returned when: + \begin{enumerate} \item -The \texttt{/tmp} directory on the speed node your application is running on is full and cannot be written to. - \item Your NFS-provided home is full and cannot be written to. 
+You can verify this using \tool{quota} and \tool{bigfiles} commands. + \item +The \texttt{/tmp} directory on the speed node your application is running on is full and cannot be written to. \end{enumerate} % ------------------------------------------------------------------------------ \subsubsection{Possible Solutions} \begin{enumerate} \item -Use the \textbf{-cwd} job script option to set the directory that the job +Use the \option{--chdir} job script option to set the directory that the job script is submitted from as the \texttt{job working directory}. The \texttt{job working directory} is the directory that the job will write output files in. \item -The use local disk space is generally recommended for IO intensive operations. However, as the size of \texttt{/tmp} on speed nodes -is \texttt{1GB} it can be necessary for scripts to store temporary data -elsewhere. +The use of local disk space is generally recommended for IO-intensive operations. +However, as the size of \texttt{/tmp} on speed nodes +is \texttt{1TB}, it can be necessary for scripts to store temporary data +elsewhere. Review the documentation for each module called within your script to determine how to set working directories for that application. The basic steps for this solution are: @@ -97,7 +118,7 @@ \subsubsection{Possible Solutions} \item Create a working directory in speed-scratch for output files. For example, this command will create a subdirectory called \textbf{output} - in your \verb!speed-scratch! directory: + in your \verb!speed-scratch! directory: \begin{verbatim} mkdir -m 750 /speed-scratch/$USER/output \end{verbatim} @@ -107,7 +128,8 @@ \subsubsection{Possible Solutions} mkdir -m 750 /speed-scratch/$USER/recovery \end{verbatim} \item - Update the job script to write output to the subdirectories you created in your \verb!speed-scratch! directory, e.g., \verb!/speed-scratch/$USER/output!. 
+ Update the job script to write output to the subdirectories you created in + your \verb!speed-scratch! directory, e.g., \verb!/speed-scratch/$USER/output!. \end{itemize} \end{enumerate} In the above example, \verb!$USER! is an environment variable containing your ENCS username. @@ -131,6 +153,7 @@ \subsubsection{Example of setting working directories for \tool{COMSOL}} -configuration/speed-scratch/$USER/comsol/config \end{verbatim} \end{itemize} + In the above example, \verb!$USER! is an environment variable containing your ENCS username. % ------------------------------------------------------------------------------ @@ -159,9 +182,14 @@ \subsubsection{Example of setting working directories for \tool{Python Modules}} % ------------------------------------------------------------------------------ \subsection{How do I check my job's status?} -When a job with a job id of 1234 is running, the status of that job can be tracked using \verb!`qstat -j 1234`!. -Likewise, if the job is pending, the \verb!`qstat -j 1234`! command will report as to why the job is not scheduled or running. -Once the job has finished, or has been killed, the \textbf{qacct} command must be used to query the job's status, e.g., \verb!`qaact -j [jobid]`!. +%When a job with a job id of 1234 is running, the status of that job can be tracked using \verb!`qstat -j 1234`!. +%Likewise, if the job is pending, the \verb!`qstat -j 1234`! command will report as to why the job is not scheduled or running. +%Once the job has finished, or has been killed, the \textbf{qacct} command must be used to query the job's status, e.g., \verb!`qaact -j [jobid]`!. +When a job with a job id of 1234 is running or terminated, the status of that job can be tracked using `\verb!sacct -j 1234!'. +\texttt{squeue -j 1234} can also show the job while it is sitting in the queue. 
+Long-term statistics on the job after it has terminated can be found using +`\verb!sacct -j 1234!' even after \tool{slurmctld} purges the job's tracking state +into the accounting database (\tool{sstat} reports only on jobs that are still running). % ------------------------------------------------------------------------------ \subsection{Why is my job pending when nodes are empty?} @@ -169,35 +197,81 @@ \subsection{Why is my job pending when nodes are empty?} % ------------------------------------------------------------------------------ \subsubsection{Disabled nodes} -It is possible that a (or a number of) the Speed nodes are disabled. Nodes are disabled if they require maintenance. -To verify if Speed nodes are disabled, request the current list of disabled nodes from qstat. - +It is possible that one or a number of the Speed nodes are disabled. Nodes are disabled if they require maintenance. +To verify if Speed nodes are disabled, see if they are in a draining or drained state: + +%\begin{verbatim} %qstat -f -qs d %queuename qtype resv/used/tot. load_avg arch states %--------------------------------------------------------------------------------- %g.q@speed-05.encs.concordia.ca BIP 0/0/32 0.27 lx-amd64 d %--------------------------------------------------------------------------------- %s.q@speed-07.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 d %--------------------------------------------------------------------------------- %s.q@speed-10.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 d %--------------------------------------------------------------------------------- %s.q@speed-16.encs.concordia.ca BIP 0/0/32 0.02 lx-amd64 d %--------------------------------------------------------------------------------- %s.q@speed-19.encs.concordia.ca BIP 0/0/32 0.03 lx-amd64 d %--------------------------------------------------------------------------------- %s.q@speed-24.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 d %--------------------------------------------------------------------------------- %s.q@speed-36.encs.concordia.ca BIP 
0/0/32 0.03 lx-amd64 d +%\end{verbatim} + +\scriptsize \begin{verbatim} -qstat -f -qs d -queuename qtype resv/used/tot. load_avg arch states ---------------------------------------------------------------------------------- -g.q@speed-05.encs.concordia.ca BIP 0/0/32 0.27 lx-amd64 d ---------------------------------------------------------------------------------- -s.q@speed-07.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 d ---------------------------------------------------------------------------------- -s.q@speed-10.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 d ---------------------------------------------------------------------------------- -s.q@speed-16.encs.concordia.ca BIP 0/0/32 0.02 lx-amd64 d ---------------------------------------------------------------------------------- -s.q@speed-19.encs.concordia.ca BIP 0/0/32 0.03 lx-amd64 d ---------------------------------------------------------------------------------- -s.q@speed-24.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 d ---------------------------------------------------------------------------------- -s.q@speed-36.encs.concordia.ca BIP 0/0/32 0.03 lx-amd64 d +[serguei@speed-submit src] % sinfo --long --Node +Thu Oct 19 21:25:12 2023 +NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON +speed-01 1 pa idle 32 2:16:1 257458 0 1 gpu16 none +speed-03 1 pa idle 32 2:16:1 257458 0 1 gpu32 none +speed-05 1 pg idle 32 2:16:1 515490 0 1 gpu16 none +speed-07 1 ps* mixed 32 2:16:1 515490 0 1 cpu32 none +speed-08 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE +speed-09 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE +speed-10 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE +speed-11 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none +speed-12 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE +speed-15 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE +speed-16 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE +speed-17 1 pg drained 32 2:16:1 515490 0 1 gpu16 UGE +speed-19 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none 
+speed-20 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE +speed-21 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE +speed-22 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE +speed-23 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none +speed-24 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none +speed-25 1 pg idle 32 2:16:1 257458 0 1 gpu32 none +speed-25 1 pa idle 32 2:16:1 257458 0 1 gpu32 none +speed-27 1 pg idle 32 2:16:1 257458 0 1 gpu32 none +speed-27 1 pa idle 32 2:16:1 257458 0 1 gpu32 none +speed-29 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none +speed-30 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE +speed-31 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE +speed-32 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE +speed-33 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none +speed-34 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none +speed-35 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE +speed-36 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE +speed-37 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none +speed-38 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none +speed-39 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none +speed-40 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none +speed-41 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none +speed-42 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none +speed-43 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none \end{verbatim} +\normalsize -Note how the all of the Speed nodes in the above list have a state of \textbf{d}, or disabled. +Note which nodes are in the \textbf{drained} state; the reason a node is drained is shown in the reason column. -Your job will run once the maintenance has been completed and the disabled nodes have been enabled. +Your job will run once an occupied node becomes available, or once the maintenance has been completed and the drained nodes return to the \textbf{idle} state. 
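For example, the \texttt{sinfo --long --Node} listing can be narrowed down to just the drained nodes and their reasons with a short \tool{awk} filter (a sketch; the column positions match the sample listing above, and \texttt{sinfo -R} provides a similar built-in summary of reasons):

```shell
# Filter a `sinfo --long --Node` listing down to drained nodes.
# In the sample listing, column 4 is STATE and column 11 is REASON.
filter_drained() {
    awk '$4 == "drained" { printf "%s (reason: %s)\n", $1, $11 }'
}

# Demo on two rows in the same format as the listing above:
printf '%s\n' \
  'speed-08 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE' \
  'speed-11 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none' \
  | filter_drained
# -> speed-08 (reason: UGE)
```

On the cluster, pipe the live listing through the filter: \texttt{sinfo --long --Node | filter\_drained}.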
% ------------------------------------------------------------------------------ \subsubsection{Error in job submit request.} It is possible that your job is pending, because the job requested resources that are not available within Speed. -To verify why pending job with job id 1234 is not running, execute \verb!`qstat -j 1234`! -and review the messages in the \textbf{scheduling info:} section. +To verify why job id 1234 is not running, execute `\verb!sacct -j 1234!'. +A summary of the reasons is available via the \tool{squeue} command. +%and review the messages in the \textbf{scheduling info:} section. diff --git a/doc/scheduler-job-examples.tex b/doc/scheduler-job-examples.tex index 7bc25aa..95e23d9 100644 --- a/doc/scheduler-job-examples.tex +++ b/doc/scheduler-job-examples.tex @@ -1,4 +1,4 @@ -% ------------------------------------------------------------------------------ +% ------------------------------------------------------------------------------ \subsection{Example Job Script: Fluent} \begin{figure}[htpb] @@ -8,13 +8,16 @@ \subsection{Example Job Script: Fluent} \end{figure} The job script in \xf{fig:fluent.sh} runs Fluent in parallel over 32 cores. -Of note, we have requested e-mail notifications (\texttt{-m}), are defining the -parallel environment for, \tool{fluent}, with, \texttt{-sgepe smp} (\textbf{very +%Of note, we have requested e-mail notifications (\texttt{-m}), are defining the +Of note, we have requested e-mail notifications (\texttt{--mail-type}), are defining the +%parallel environment for, \tool{fluent}, with, \texttt{-sgepe smp} (\textbf{very +parallel environment for, \tool{fluent}, with, \texttt{-t\$SLURM\_NTASKS} and \texttt{-g-cnf=\$FLUENTNODES} (\textbf{very important}), and are setting \api{\$TMPDIR} as the in-job location for the ``moment'' \file{rfile.out} file (in-job, because the last line of the script copies everything from \api{\$TMPDIR} to a directory in the user's NFS-mounted home). 
Job progress can be monitored by examining the standard-out file (e.g., -\file{flu10000.o249}), and/or by examining the ``moment'' file in +%\file{flu10000.o249}), and/or by examining the ``moment'' file in +\texttt{slurm-249.out}), and/or by examining the ``moment'' file in \texttt{/disk/nobackup/} (hint: it starts with your job-ID) on the node running the job. \textbf{Caveat:} take care with journal-file file paths. @@ -26,14 +29,22 @@ \subsection{Example Job: efficientdet} \begin{itemize} \item - Enter your ENCS user account's speed-scratch directory + Enter your ENCS user account's speed-scratch directory\\ \verb!cd /speed-scratch/! \item + Next + \begin{itemize} + \item load python \verb!module load python/3.8.3! + \item create virtual environment \verb!python3 -m venv ! + \item activate virtual environment \verb!source /bin/activate.csh! + \item install DL packages for Efficientdet + \end{itemize} \end{itemize} +\small \begin{verbatim} pip install tensorflow==2.7.0 pip install lxml>=4.6.1 @@ -50,19 +61,24 @@ \subsection{Example Job: efficientdet} pip install Cython>=0.29.13 pip install git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI \end{verbatim} +\normalsize % ------------------------------------------------------------------------------ \subsection{Java Jobs} +\label{sect:java} Jobs that call \tool{java} have a memory overhead, which needs to be taken -into account when assigning a value to \api{h\_vmem}. Even the most basic +%into account when assigning a value to \api{h\_vmem}. Even the most basic +into account when assigning a value to \option{--mem}. Even the most basic \tool{java} call, \texttt{java -Xmx1G -version}, will need to have, -\texttt{-l h\_vmem=5G}, with the 4-GB difference representing the memory overhead. +%\texttt{-l h\_vmem=5G}, with the 4-GB difference representing the memory overhead. +\texttt{--mem=5G}, with the 4-GB difference representing the memory overhead. 
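+As a minimal illustration of this rule, a batch script for the basic call
+above might read as follows (the shebang and job name are illustrative; the
+\texttt{--mem=5G} value follows the overhead rule just stated):
+
+\small
+\begin{verbatim}
+#!/encs/bin/tcsh
+#SBATCH --job-name=java-version
+#SBATCH --mem=5G
+
+java -Xmx1G -version
+\end{verbatim}
+\normalsize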
Note that this memory overhead grows proportionally with the value of \texttt{-Xmx}. To give you an idea, when \texttt{-Xmx} has a value of 100G, -\api{h\_vmem} has to be at least 106G; for 200G, at least 211G; for 300G, at least 314G. +%\api{h\_vmem} has to be at least 106G; for 200G, at least 211G; for 300G, at least 314G. +\option{--mem} has to be at least 106G; for 200G, at least 211G; for 300G, at least 314G. -% TODO: add a MARF Java job +% TODO: add MARF and GIPSY Java jobs % ------------------------------------------------------------------------------ \subsection{Scheduling On The GPU Nodes} @@ -72,50 +88,63 @@ \subsection{Scheduling On The GPU Nodes} is mainly a single-precision card, so unless you need the GPU double precision, double-precision calculations will be faster on a CPU node. -Job scripts for the GPU queue differ in that they do not need these -statements: - -\begin{verbatim} -#$ -pe smp -#$ -l h_vmem=G -\end{verbatim} - -But do need this statement, which attaches either a single GPU, or, two +Job scripts for the GPU queue differ in that they +%do not need these +%statements: +% +%\begin{verbatim} +%#$ -pe smp +%#$ -l h_vmem=G +%\end{verbatim} +% +%But do +need this statement, which attaches either a single GPU, or, two GPUs, to the job: +%\begin{verbatim} +%#$ -l gpu=[1|2] +%\end{verbatim} \begin{verbatim} -#$ -l gpu=[1|2] +#SBATCH --gpus=[1|2] \end{verbatim} -Single-GPU jobs are granted 5~CPU cores and 80GB of system memory, and -dual-GPU jobs are granted 10~CPU cores and 160GB of system memory. A -total of \emph{four} GPUs can be actively attached to any one user at any given -time. +% TODO: verify accuracy +% Single-GPU jobs are granted 5~CPU cores and 80GB of system memory, and +% dual-GPU jobs are granted 10~CPU cores and 160GB of system memory. A +% total of \emph{four} GPUs can be actively attached to any one user at any given +% time. 
-Once that your job script is ready, you can submit it to the GPU queue
+Once your job script is ready, you can submit it to the GPU partition (queue)
 with:

+%\begin{verbatim}
+%qsub -q g.q ./.sh
+%\end{verbatim}
 \begin{verbatim}
-qsub -q g.q ./.sh
+sbatch -p pg ./.sh
 \end{verbatim}

 And you can query \tool{nvidia-smi} on the node that is running your job with:

 \begin{verbatim}
-ssh @speed[-05|-17] nvidia-smi
+ssh @speed-[05|17|37-43] nvidia-smi
 \end{verbatim}

 Status of the GPU queue can be queried with:

+%\begin{verbatim}
+%qstat -f -u "*" -q g.q
+%\end{verbatim}
 \begin{verbatim}
-qstat -f -u "*" -q g.q
+sinfo -p pg --long --Node
 \end{verbatim}

+\noindent
 \textbf{Very important note} regarding TensorFlow and PyTorch: if you are
 planning to run TensorFlow and/or PyTorch multi-GPU jobs,
-do not use the \api{tf.distribute} and/or\\
+\textbf{do not} use the \api{tf.distribute} and/or\\
 \api{torch.nn.DataParallel}
-functions, as they will crash the compute node (100\% certainty).
+functions on \textbf{speed-01,05,17}, as they will crash the compute node (100\% certainty).
 This appears to be the current hardware's architecture's defect.
 %
 The workaround is to either
@@ -128,58 +157,112 @@ \subsection{Scheduling On The GPU Nodes}
 \textbf{Important}

 \vspace{10pt}

-Users without permission to use the GPU nodes can submit jobs to the \texttt{g.q}
-queue but those jobs will hang and never run.
-
-There are two GPUs in both \texttt{speed-05} and \texttt{speed-17}, and one
-in \texttt{speed-19}. Their availability is seen with, \texttt{qstat -F g}
-(note the capital):
-
-\small
+%Users without permission to use the GPU nodes can submit jobs to the \texttt{g.q}
+Users without permission to use the GPU nodes can submit jobs to the \texttt{pg}
+partition, but those jobs will hang and never run.
+%
+%There are two GPUs in both \texttt{speed-05} and \texttt{speed-17}, and one
+%in \texttt{speed-19}.
+Their availability is seen with: +%, \texttt{qstat -F g} (note the capital): +% +%\small +%\begin{verbatim} +%queuename qtype resv/used/tot. load_avg arch states +%--------------------------------------------------------------------------------- +%... +%--------------------------------------------------------------------------------- +%g.q@speed-05.encs.concordia.ca BIP 0/0/32 0.04 lx-amd64 + %hc:gpu=6 +%--------------------------------------------------------------------------------- +%g.q@speed-17.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 + %hc:gpu=6 +%--------------------------------------------------------------------------------- +%... +%--------------------------------------------------------------------------------- +%s.q@speed-19.encs.concordia.ca BIP 0/32/32 32.37 lx-amd64 + %hc:gpu=1 +%--------------------------------------------------------------------------------- +%etc. +%\end{verbatim} +%\normalsize + +\scriptsize \begin{verbatim} -queuename qtype resv/used/tot. load_avg arch states ---------------------------------------------------------------------------------- -... ---------------------------------------------------------------------------------- -g.q@speed-05.encs.concordia.ca BIP 0/0/32 0.04 lx-amd64 - hc:gpu=6 ---------------------------------------------------------------------------------- -g.q@speed-17.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 - hc:gpu=6 ---------------------------------------------------------------------------------- -... ---------------------------------------------------------------------------------- -s.q@speed-19.encs.concordia.ca BIP 0/32/32 32.37 lx-amd64 - hc:gpu=1 ---------------------------------------------------------------------------------- -etc. 
+[serguei@speed-submit src] % sinfo -p pg --long --Node +Thu Oct 19 22:31:04 2023 +NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON +speed-05 1 pg idle 32 2:16:1 515490 0 1 gpu16 none +speed-17 1 pg drained 32 2:16:1 515490 0 1 gpu16 UGE +speed-25 1 pg idle 32 2:16:1 257458 0 1 gpu32 none +speed-27 1 pg idle 32 2:16:1 257458 0 1 gpu32 none +[serguei@speed-submit src] % sinfo -p pt --long --Node +Thu Oct 19 22:32:39 2023 +NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON +speed-37 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none +speed-38 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none +speed-39 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none +speed-40 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none +speed-41 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none +speed-42 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none +speed-43 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none \end{verbatim} -\normalsize -This status demonstrates that all five are available (i.e., have not been +This status demonstrates that most are available (i.e., have not been requested as resources). To specifically request a GPU node, add, -\texttt{-l g=[\#GPUs]}, to your \tool{qsub} (statement/script) or -\tool{qlogin} (statement) request. For example, -\texttt{qsub -l h\_vmem=1G -l g=1 ./count.sh}. You -will see that this job has been assigned to one of the GPU nodes: - -\small +%\texttt{-l g=[\#GPUs]}, to your \tool{qsub} (statement/script) or +\texttt{--gpus=[\#GPUs]}, to your \tool{sbatch} (statement/script) or +%\tool{qlogin} (statement) request. For example, +\tool{salloc} (statement) request. For example, +%\texttt{qsub -l h\_vmem=1G -l g=1 ./count.sh}. You +\texttt{sbatch -t 10 --mem=1G --gpus=1 -p pg ./tcsh.sh}. You +will see that this job has been assigned to one of the GPU nodes. + +%\small +%\begin{verbatim} +%queuename qtype resv/used/tot. 
load_avg arch states +%--------------------------------------------------------------------------------- +%g.q@speed-05.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 hc:gpu=6 +%--------------------------------------------------------------------------------- +%g.q@speed-17.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 hc:gpu=6 +%--------------------------------------------------------------------------------- +%s.q@speed-19.encs.concordia.ca BIP 0/1/32 0.04 lx-amd64 hc:gpu=0 (haff=1.000000) + %538 100.00000 count.sh sbunnell r 03/07/2019 02:39:39 1 +%--------------------------------------------------------------------------------- +%etc. +%\end{verbatim} +%\normalsize + +%\small +%\begin{verbatim} +%queuename qtype resv/used/tot. load_avg arch states +%--------------------------------------------------------------------------------- +%g.q@speed-05.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 hc:gpu=6 +%--------------------------------------------------------------------------------- +%g.q@speed-17.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 hc:gpu=6 +%--------------------------------------------------------------------------------- +%s.q@speed-19.encs.concordia.ca BIP 0/1/32 0.04 lx-amd64 hc:gpu=0 (haff=1.000000) + %538 100.00000 count.sh sbunnell r 03/07/2019 02:39:39 1 +%--------------------------------------------------------------------------------- +%etc. +%\end{verbatim} +%\normalsize + +\scriptsize \begin{verbatim} -queuename qtype resv/used/tot. 
load_avg arch states ---------------------------------------------------------------------------------- -g.q@speed-05.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 hc:gpu=6 ---------------------------------------------------------------------------------- -g.q@speed-17.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 hc:gpu=6 ---------------------------------------------------------------------------------- -s.q@speed-19.encs.concordia.ca BIP 0/1/32 0.04 lx-amd64 hc:gpu=0 (haff=1.000000) - 538 100.00000 count.sh sbunnell r 03/07/2019 02:39:39 1 ---------------------------------------------------------------------------------- -etc. +[serguei@speed-submit src] % squeue -p pg -o "%15N %.6D %7P %.11T %.4c %.8z %.6m %.8d %.6w %.8f %20G %20E" +NODELIST NODES PARTITI STATE MIN_ S:C:T MIN_ME MIN_TMP_ WCKEY FEATURES GROUP DEPENDENCY +speed-05 1 pg RUNNING 1 *:*:* 1G 0 (null) (null) 11929 (null) +[serguei@speed-submit src] % sinfo -p pg -o "%15N %.6D %7P %.11T %.4c %.8z %.6m %.8d %.6w %.8f %20G %20E" +NODELIST NODES PARTITI STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE GRES REASON +speed-17 1 pg drained 32 2:16:1 515490 0 1 gpu16 gpu:6 UGE +speed-05 1 pg mixed 32 2:16:1 515490 0 1 gpu16 gpu:6 none +speed-[25,27] 2 pg idle 32 2:16:1 257458 0 1 gpu32 gpu:2 none \end{verbatim} \normalsize -And that there are no more GPUs available on that node (\texttt{hc:gpu=0}). Note -that no more than two GPUs can be requested for any one job. +%And that there are no more GPUs available on that node (\texttt{hc:gpu=0}). +%Note that no more than two GPUs can be requested for any one job. % ------------------------------------------------------------------------------ \subsubsection{CUDA} @@ -187,7 +270,7 @@ \subsubsection{CUDA} When calling \tool{CUDA} within job scripts, it is important to create a link to the desired \tool{CUDA} libraries and set the runtime link path to the same libraries. For example, to use the \texttt{cuda-11.5} libraries, specify the following in -your Makefile. 
+your \texttt{Makefile}. \begin{verbatim} -L/encs/pkg/cuda-11.5/root/lib64 -Wl,-rpath,/encs/pkg/cuda-11.5/root/lib64 @@ -202,11 +285,13 @@ \subsubsection{CUDA} % ------------------------------------------------------------------------------ \subsubsection{Special Notes for sending CUDA jobs to the GPU Queue} -It is not possible to create a \texttt{qlogin} session on to a node in the -\textbf{GPU Queue} (\texttt{g.q}). As direct logins to these nodes is not -available, jobs must be submitted to the \textbf{GPU Queue} in order to compile +%It is not possible to create a \texttt{qlogin} session on to a node in the +%\textbf{GPU Queue} (\texttt{g.q}). As direct logins to these nodes is not +%available, +Interactive +jobs (\xs{sect:interactive-jobs}) must be submitted to the \textbf{GPU partition} in order to compile and link. - +% We have several versions of CUDA installed in: \begin{verbatim} /encs/pkg/cuda-11.5/root/ @@ -214,7 +299,7 @@ \subsubsection{Special Notes for sending CUDA jobs to the GPU Queue} /encs/pkg/cuda-9.2/root \end{verbatim} -For CUDA to compile properly for the GPU queue, edit your \texttt{Makefile} +For CUDA to compile properly for the GPU partition, edit your \texttt{Makefile} replacing \option{\/usr\/local\/cuda} with one of the above. % ------------------------------------------------------------------------------ @@ -328,7 +413,7 @@ \subsection{Singularity Containers} Some of them can be ran in both batch or interactive mode, some make more sense to run interactively. They cover some basics with CUDA, OpenGL rendering, and computer vision tasks as examples from the OpenISS -library and other libraries, including the base images that use diffrent +library and other libraries, including the base images that use different distros. We also include Jupyter notebook example with Conda support. 
 \begin{verbatim}

@@ -387,7 +472,7 @@ \subsection{Singularity Containers}

 \small
 \begin{verbatim}
-qlogin
+salloc --gpus=1 -n8 -t60
 cd /speed-scratch/$USER/
 singularity pull openiss-cuda-devicequery.sif docker://openiss/openiss-cuda-devicequery
 INFO: Converting OCI blobs to SIF format

diff --git a/doc/scheduler-scripting.tex b/doc/scheduler-scripting.tex
index df086d5..5281031 100644
--- a/doc/scheduler-scripting.tex
+++ b/doc/scheduler-scripting.tex
@@ -1,5 +1,6 @@
-% ------------------------------------------------------------------------------
+% ------------------------------------------------------------------------------
 \subsubsection{User Scripting}
+\label{sect:scripting}

 The last part the job script is the scripting that will be executed by the job.
 This part of the job script includes all commands required to set up and
@@ -7,13 +8,18 @@ \subsubsection{User Scripting}
 at this step. This section can be a simple call to an executable or a complex
 loop which iterates through a series of commands.

+As a best practice, any compute-heavy step should be prefixed with
+\tool{srun}, so that it runs as its own job step.
+
 Every software program has a unique execution framework. It is the
 responsibility of the script's author (e.g., you) to know what is required
 for the software used in your script by reviewing the software's
 documentation. Regardless of which software your script calls, your script
 should be written so that the software knows the
-location of the input and output files as well as the degree of parallelism.
-Note that the cluster-specific environment variable, \api{NSLOTS}, resolves
-to the value provided to the scheduler in the \option{-pe smp} option.
+location of the input and output files as well as the degree of parallelism.
+%
+% GE:
+%Note that the cluster-specific environment variable, \api{NSLOTS}, resolves
+%to the value provided to the scheduler in the \option{-pe smp} option.
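+For example, a script with several compute-heavy stages could mark each one
+as its own job step (the program names here are purely illustrative):
+
+\small
+\begin{verbatim}
+srun ./preprocess $TMPDIR/input $TMPDIR/results
+srun ./analyze $TMPDIR/results
+\end{verbatim}
+\normalsize
+
+Each \tool{srun} invocation is then recorded as a separate step in the
+accounting tools, which makes it easier to see which stage failed.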
Jobs which touch data-input and data-output files more than once, should make use of \api{TMPDIR}, a scheduler-provided working space almost 1~TB in size. @@ -23,12 +29,15 @@ \subsubsection{User Scripting} An sample job script using \api{TMPDIR} is available at \texttt{/home/n/nul-uge/templateTMPDIR.sh}: the job is instructed to change to \api{\$TMPDIR}, to make the new directory \texttt{input}, to copy data from -\texttt{\$SGE\_O\_WORKDIR/references/} to \texttt{input/} (\texttt{\$SGE\_O\_WORKDIR} represents the +%\texttt{\$SGE\_O\_WORKDIR/references/} to \texttt{input/} (\texttt{\$SGE\_O\_WORKDIR} represents the +\texttt{\$SLURM\_SUBMIT\_DIR/references/} to \texttt{input/} (\texttt{\$SLURM\_SUBMIT\_DIR} represents the current working directory), to make the new directory \texttt{results}, to execute the program (which takes input from \texttt{\$TMPDIR/input/} and writes output to \texttt{\$TMPDIR/results/}), and finally to copy the total end results to an existing directory, \texttt{processed}, that is located in the current -working directory. TMPDIR only exists for the duration of the job, though, +working directory. +% TODO: verify: +TMPDIR only exists for the duration of the job, though, so it is very important to copy relevant results from it at job's end. % ------------------------------------------------------------------------------ @@ -44,12 +53,17 @@ \subsection{Sample Job Script} \end{figure} The first line is the shell declaration (also know as a shebang) and sets the shell to \emph{tcsh}. -The lines that begin with \texttt{\#\$} are directives for the scheduler. +%The lines that begin with \texttt{\#\$} are directives for the scheduler. +The lines that begin with \texttt{\#SBATCH} are directives for the scheduler. 
\begin{itemize}
- \item \texttt{-N} sets \emph{qsub-test} as the jobname
- \item \texttt{-cwd} tells the scheduler to execute the job from the current working directory
- \item \texttt{-l h\_vmem=1GB} requests and assigns 1GB of memory to the job. CPU jobs \emph{require} the \texttt{-l h\_vmem} option to be set.
+ %\item \texttt{-N} sets \emph{qsub-test} as the jobname
+ \item \texttt{-J} (or \option{--job-name}) sets \emph{tcsh-test} as the job name
+ %\item \texttt{-cwd} tells the scheduler to execute the job from the current working directory
+ \item \texttt{--chdir} tells the scheduler to execute the job from the current working directory
+ %\item \texttt{-l h\_vmem=1GB} requests and assigns 1GB of memory to the job. CPU jobs \emph{require} the \texttt{-l h\_vmem} option to be set.
+ \item \texttt{--mem=1GB} requests and assigns 1GB of memory to the job.
+ Generally, jobs \emph{require} the \texttt{--mem} option to be set.
\end{itemize}

The script then:
@@ -60,80 +74,109 @@ \subsection{Sample Job Script}
 \item Prints the list of loaded modules into a file
 \end{itemize}

-The scheduler command, \tool{qsub}, is used to submit (non-interactive) jobs.
-From an ssh session on speed-submit, submit this job with \texttt{qsub ./tcsh.sh}. You will see,
-\texttt{"Your job X ("qsub-test") has been submitted"}. The command, \tool{qstat}, can be used
-to look at the status of the cluster: \texttt{qstat -f -u "*"}. You will see
-something like this:
-
+%The scheduler command, \tool{qsub}, is used to submit (non-interactive) jobs.
+The scheduler command, \tool{sbatch}, is used to submit (non-interactive) jobs.
+%From an ssh session on speed-submit, submit this job with \texttt{qsub ./tcsh.sh}.
+From an ssh session on speed-submit, submit this job with \texttt{sbatch ./tcsh.sh}.
+%You will see, \texttt{"Your job X ("qsub-test") has been submitted"}.
+You will see \texttt{"Submitted batch job 2653"}, where 2653 is the assigned job ID.
+%The command, \tool{qstat}, can be used +The commands, \tool{squeue} and \tool{sinfo} can be used +%to look at the status of the cluster: \texttt{qstat -f -u "*"}. +to look at the status of the cluster: \texttt{squeue -l}. +You will see something like this: + +%\small +%\begin{verbatim} +%queuename qtype resv/used/tot. load_avg arch states +%--------------------------------------------------------------------------------- +%a.q@speed-01.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 +%--------------------------------------------------------------------------------- +%a.q@speed-03.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 +%--------------------------------------------------------------------------------- +%a.q@speed-25.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 +%--------------------------------------------------------------------------------- +%a.q@speed-27.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 +%--------------------------------------------------------------------------------- +%g.q@speed-05.encs.concordia.ca BIP 0/0/32 0.02 lx-amd64 + %144 100.00000 qsub-test nul-uge r 12/03/2018 16:39:30 1 + %62624 0.09843 case_talle x_yzabc r 11/09/2021 16:50:09 32 +%--------------------------------------------------------------------------------- +%g.q@speed-17.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 +%--------------------------------------------------------------------------------- +%s.q@speed-07.encs.concordia.ca BIP 0/0/32 0.04 lx-amd64 +%--------------------------------------------------------------------------------- +%s.q@speed-08.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 +%--------------------------------------------------------------------------------- +%s.q@speed-09.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 +%--------------------------------------------------------------------------------- +%s.q@speed-10.encs.concordia.ca BIP 0/32/32 32.72 lx-amd64 + %62624 0.09843 case_talle x_yzabc r 11/09/2021 16:50:09 32 
+%--------------------------------------------------------------------------------- +%s.q@speed-11.encs.concordia.ca BIP 0/32/32 32.08 lx-amd64 + %62679 0.14212 CWLR_DF a_bcdef r 11/10/2021 17:25:19 32 +%--------------------------------------------------------------------------------- +%s.q@speed-12.encs.concordia.ca BIP 0/32/32 32.10 lx-amd64 + %62749 0.09000 CLOUDY z_abc r 11/11/2021 21:58:12 32 +%--------------------------------------------------------------------------------- +%s.q@speed-15.encs.concordia.ca BIP 0/4/32 0.03 lx-amd64 + %62753 82.47478 matlabLDPa b_bpxez r 11/12/2021 08:49:52 4 +%--------------------------------------------------------------------------------- +%s.q@speed-16.encs.concordia.ca BIP 0/32/32 32.31 lx-amd64 + %62751 0.09000 CLOUDY z_abc r 11/12/2021 06:03:54 32 +%--------------------------------------------------------------------------------- +%s.q@speed-19.encs.concordia.ca BIP 0/32/32 32.22 lx-amd64 +%--------------------------------------------------------------------------------- +%... +%--------------------------------------------------------------------------------- +%s.q@speed-35.encs.concordia.ca BIP 0/32/32 2.78 lx-amd64 + %62754 7.22952 qlogin-tes a_tiyuu r 11/12/2021 10:31:06 32 +%--------------------------------------------------------------------------------- +%s.q@speed-36.encs.concordia.ca BIP 0/0/32 0.03 lx-amd64 +%etc. \small \begin{verbatim} -queuename qtype resv/used/tot. 
load_avg arch states ---------------------------------------------------------------------------------- -a.q@speed-01.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 ---------------------------------------------------------------------------------- -a.q@speed-03.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 ---------------------------------------------------------------------------------- -a.q@speed-25.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 ---------------------------------------------------------------------------------- -a.q@speed-27.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 ---------------------------------------------------------------------------------- -g.q@speed-05.encs.concordia.ca BIP 0/0/32 0.02 lx-amd64 - 144 100.00000 qsub-test nul-uge r 12/03/2018 16:39:30 1 - 62624 0.09843 case_talle x_yzabc r 11/09/2021 16:50:09 32 ---------------------------------------------------------------------------------- -g.q@speed-17.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 ---------------------------------------------------------------------------------- -s.q@speed-07.encs.concordia.ca BIP 0/0/32 0.04 lx-amd64 ---------------------------------------------------------------------------------- -s.q@speed-08.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 ---------------------------------------------------------------------------------- -s.q@speed-09.encs.concordia.ca BIP 0/0/32 0.01 lx-amd64 ---------------------------------------------------------------------------------- -s.q@speed-10.encs.concordia.ca BIP 0/32/32 32.72 lx-amd64 - 62624 0.09843 case_talle x_yzabc r 11/09/2021 16:50:09 32 ---------------------------------------------------------------------------------- -s.q@speed-11.encs.concordia.ca BIP 0/32/32 32.08 lx-amd64 - 62679 0.14212 CWLR_DF a_bcdef r 11/10/2021 17:25:19 32 ---------------------------------------------------------------------------------- -s.q@speed-12.encs.concordia.ca BIP 0/32/32 32.10 lx-amd64 - 62749 0.09000 CLOUDY z_abc r 11/11/2021 21:58:12 32 
---------------------------------------------------------------------------------- -s.q@speed-15.encs.concordia.ca BIP 0/4/32 0.03 lx-amd64 - 62753 82.47478 matlabLDPa b_bpxez r 11/12/2021 08:49:52 4 ---------------------------------------------------------------------------------- -s.q@speed-16.encs.concordia.ca BIP 0/32/32 32.31 lx-amd64 - 62751 0.09000 CLOUDY z_abc r 11/12/2021 06:03:54 32 ---------------------------------------------------------------------------------- -s.q@speed-19.encs.concordia.ca BIP 0/32/32 32.22 lx-amd64 ---------------------------------------------------------------------------------- -... ---------------------------------------------------------------------------------- -s.q@speed-35.encs.concordia.ca BIP 0/32/32 2.78 lx-amd64 - 62754 7.22952 qlogin-tes a_tiyuu r 11/12/2021 10:31:06 32 ---------------------------------------------------------------------------------- -s.q@speed-36.encs.concordia.ca BIP 0/0/32 0.03 lx-amd64 -etc. +[serguei@speed-submit src] % squeue -l +Thu Oct 19 11:38:54 2023 +JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON) + 2641 ps interact b_user RUNNING 19:16:09 1-00:00:00 1 speed-07 + 2652 ps interact a_user RUNNING 41:40 1-00:00:00 1 speed-07 + 2654 ps tcsh-tes serguei RUNNING 0:01 7-00:00:00 1 speed-07 +[serguei@speed-submit src] % sinfo +PARTITION AVAIL TIMELIMIT NODES STATE NODELIST +ps* up 7-00:00:00 14 drain speed-[08-10,12,15-16,20-22,30-32,35-36] +ps* up 7-00:00:00 1 mix speed-07 +ps* up 7-00:00:00 7 idle speed-[11,19,23-24,29,33-34] +pg up 1-00:00:00 1 drain speed-17 +pg up 1-00:00:00 3 idle speed-[05,25,27] +pt up 7-00:00:00 7 idle speed-[37-43] +pa up 7-00:00:00 4 idle speed-[01,03,25,27] \end{verbatim} \normalsize Remember that you only have 30 seconds before the job is essentially over, so if you do not see a similar output, either adjust the sleep time in the -script, or execute the \tool{qstat} statement more quickly. 
The \tool{qstat} +%script, or execute the \tool{qstat} statement more quickly. The \tool{qstat} +script, or execute the \tool{sbatch} statement more quickly. The \tool{squeue} output listed above shows you that your job is -running on node \texttt{speed-05}, that it has a job number of 144, that it -was started at 16:39:30 on 12/03/2018, and that it is a single-core job (the -default). +running on node \texttt{speed-07}, that it has a job number of 2654, +its time limit of 7 days, etc. +% TODO +%, that it +%was started at 16:39:30 on 12/03/2018, and that it is a single-core job (the +%default). Once the job finishes, there will be a new file in the directory that the job -was started from, with the syntax of, \texttt{"job name".o"job number"}, so -in this example the file is, qsub \file{test.o144}. This file represents the +%was started from, with the syntax of, \texttt{"job name".o"job number"}, so +was started from, with the syntax of, \texttt{slurm-"job id".out}, so +%in this example the file is, qsub \file{test.o144}. This file represents the +in this example the file is, \file{slurm-2654.out}. This file represents the standard output (and error, if there is any) of the job in question. If you look at the contents of your newly created file, you will see that it contains the output of the, \texttt{module list} command. Important information is often written to this file. - -Congratulations on your first job! +% +%Congratulations on your first job! 
% ------------------------------------------------------------------------------
\subsection{Common Job Management Commands Summary}

@@ -142,81 +185,160 @@ \subsection{Common Job Management Commands Summary}
 Here are useful job-management commands:

 \begin{itemize}
+%\item
+%\texttt{qsub ./.sh}: once that your job script is ready,
+%on \texttt{speed-submit} you can submit it using this
 \item
-\texttt{qsub ./.sh}: once that your job script is ready,
+\texttt{sbatch -A -t --mem=20G -p ./.sh}: once your job script is ready,
 on \texttt{speed-submit} you can submit it using this

 \item
-\texttt{qstat -f -u }: you can check the status of your job(s)
+%\texttt{qstat -f -u }: you can check the status of your job(s)
+\texttt{squeue -u }: you can check the status of your job(s)

 \item
-\texttt{qstat -f -u "*"}: display cluster status for all users.
+%\texttt{qstat -f -u "*"}: display cluster status for all users.
+\texttt{squeue}: display cluster status for all users.
+\option{-A} shows per account (e.g., \texttt{vidpro}, \texttt{gipsy},
+\texttt{speed1}, \texttt{ai2}, \texttt{aits}, etc.),
+\option{-p} per partition (\texttt{ps}, \texttt{pg}, \texttt{pt}, \texttt{pa}),
+and others. See \texttt{man squeue} for details.

 \item
-\texttt{qstat -j [job-ID]}: display job information for [job-ID] (said job may be actually running, or waiting in the queue).
+%\texttt{qstat -j [job-ID]}: display job information for [job-ID] (said job may be actually running, or waiting in the queue).
+\texttt{squeue --job [job-ID]}: display job information for [job-ID] (said job may be actually running, or waiting in the queue).

 \item
-\texttt{qdel [job-ID]}: delete job [job-ID].
+\texttt{squeue -las}: displays individual job steps (when you use \tool{srun}, this makes it easier to see which step failed, for debugging).

 \item
-\texttt{qhold [job-ID]}: hold queued job, [job-ID], from running.
+\verb+watch -n 1 "sinfo -Nel -pps,pt,pg,pa && squeue -la"+: view \tool{sinfo} information and watch the queue for your job(s).
+%\item +%\texttt{qdel [job-ID]}: delete job [job-ID]. \item -\texttt{qrls [job-ID]}: release held job [job-ID]. +\texttt{scancel [job-ID]}: cancel job [job-ID]. +%\item +%\texttt{qhold [job-ID]}: hold queued job, [job-ID], from running. \item -\texttt{qacct -j [job-ID]}: get job stats. for completed job [job-ID]. \api{maxvmem} is one of the more useful stats. +\texttt{scontrol hold [job-ID]}: hold queued job, [job-ID], from running. + +%\item +%\texttt{qrls [job-ID]}: release held job [job-ID]. +\item +\texttt{scontrol release [job-ID]}: release held job [job-ID]. + +\item +%\texttt{qacct -j [job-ID]}: get job stats. for completed job [job-ID]. \api{maxvmem} is one of the more useful stats. +\texttt{sacct -j [job-ID]}: get job stats. +%for completed job [job-ID]. +\api{maxvmem} is one of the more useful stats that you can elect to display +as a format option. + +\small +\begin{verbatim} +% sacct -j 2654 +JobID JobName Partition Account AllocCPUS State ExitCode +------------ ---------- ---------- ---------- ---------- ---------- -------- +2654 tcsh-test ps speed1 1 COMPLETED 0:0 +2654.batch batch speed1 1 COMPLETED 0:0 +2654.extern extern speed1 1 COMPLETED 0:0 +% sacct -j 2654 -o jobid,user,account,MaxVMSize,Reason%10,TRESUsageOutMax%30 +JobID User Account MaxVMSize Reason TRESUsageOutMax +------------ --------- ---------- ---------- ---------- ---------------------- +2654 serguei speed1 None +2654.batch speed1 296840K energy=0,fs/disk=1975 +2654.extern speed1 296312K energy=0,fs/disk=343 +\end{verbatim} +\normalsize + +See \texttt{man sacct} or \texttt{sacct -e} for details of the +available formatting options. You can define your preferred +default format in the \api{SACCT\_FORMAT} environment variable +in your \texttt{.cshrc} or \texttt{.bashrc} files. 
+
\end{itemize}
% ------------------------------------------------------------------------------
-\subsection{Advanced \tool{qsub} Options}
+%\subsection{Advanced \tool{qsub} Options}
+\subsection{Advanced \tool{sbatch} Options}
+\label{sect:submit-options}
\label{sect:qsub-options}
-In addition to the basic \tool{qsub} options presented earlier, there are a
+In addition to the basic \tool{sbatch} options presented earlier, there are a
few additional options that are generally useful:
\begin{itemize}
\item
-\texttt{-m bea}: requests that the scheduler e-mail you when a job (b)egins;
-(e)nds; (a)borts. Mail is sent to the default address of,
-\texttt{"username@encs.concordia.ca"}, unless a different address is supplied (see,
-\texttt{-M}). The report sent when a job ends includes job
+%\texttt{-m bea}: requests that the scheduler e-mail you when a job (b)egins;
+%(e)nds; (a)borts. Mail is sent to the default address of,
+%\texttt{"username@encs.concordia.ca"}, unless a different address is supplied (see,
+%\texttt{-M}). The report sent when a job ends includes job
+%runtime, as well as the maximum memory value hit (\api{maxvmem}).
+\texttt{--mail-type=TYPE}: requests that the scheduler e-mail you when a job changes
+state, where \texttt{TYPE} is one of \texttt{ALL}, \texttt{BEGIN}, \texttt{END}, or \texttt{FAIL}.
+% TODO: verify
+Mail is sent to the default address of, \\
+\texttt{"<username>@encs.concordia.ca"}, which you can consult
+via \texttt{webmail.encs} (over the VPN), on \texttt{login.encs} via \tool{alpine},
+or set up forwarding to your @concordia.ca address or offsite,
+unless a different address is supplied (see, \texttt{--mail-user}).
+% TODO: double-check
+The report sent when a job ends includes job
runtime, as well as the maximum memory value hit (\api{maxvmem}).
\item
-\texttt{-M email@domain.com}: requests that the scheduler use this e-mail
-notification address, rather than the default (see, \texttt{-m}).
+%\texttt{-M email@domain.com}: requests that the scheduler use this e-mail
+%notification address, rather than the default (see, \texttt{-m}).
+\texttt{--mail-user=email@domain.com}: requests that the scheduler use this e-mail
notification address, rather than the default (see, \texttt{--mail-type}).
\item
-\texttt{-v variable[=value]}: exports an environment variable that can be used by the script.
+%\texttt{-v variable[=value]}: exports an environment variable that can be used by the script.
+\texttt{--export=[ALL | NONE | variables]}: exports environment variable(s) that can be used by the script.
\item
-\texttt{-l h\_rt=[hour]:[min]:[sec]}: sets a job runtime of HH:MM:SS. Note
-that if you give a single number, that represents \emph{seconds}, not hours.
+%\texttt{-l h\_rt=[hour]:[min]:[sec]}: sets a job runtime of HH:MM:SS. Note
+%that if you give a single number, that represents \emph{seconds}, not hours.
+\texttt{-t [min]} or \texttt{-t [days-hh:mm:ss]}: sets the job runtime in minutes or days-hours:minutes:seconds. Note
+that if you give a single number, that represents \emph{minutes}, not hours.
\item
-\texttt{-hold\_jid [job-ID]}: run this job only when job [job-ID] finishes. Held jobs appear in the queue.
-The many \tool{qsub} options available are read with, \texttt{man qsub}. Also
-note that \tool{qsub} options can be specified during the job-submission
-command, and these \emph{override} existing script options (if present). The
-syntax is, \texttt{qsub [options] PATHTOSCRIPT}, but unlike in the script,
-the options are specified without the leading \verb+#$+
-(e.g., \texttt{qsub -N qsub-test -cwd -l h\_vmem=1G ./tcsh.sh}).
+%\texttt{-hold\_jid [job-ID]}: run this job only when job [job-ID] finishes. Held jobs appear in the queue.
+\texttt{--dependency=[state:job-ID]}: run this job only when job [job-ID] finishes. Held jobs appear in the queue.
\end{itemize}
+The many \tool{sbatch} options available are read with, \texttt{man sbatch}.
Also
+note that \tool{sbatch} options can be specified during the job-submission
+command, and these \emph{override} existing script options (if present). The
+syntax is, \texttt{sbatch [options] PATHTOSCRIPT}, but unlike in the script,
+the options are specified without the leading \verb+#SBATCH+
+(e.g., \texttt{sbatch -J sub-test --chdir=./ --mem=1G ./tcsh.sh}).
+
+
% ------------------------------------------------------------------------------
\subsection{Array Jobs}
+\label{sect:array-jobs}
Array jobs are those that start a batch job or a parallel job multiple times.
Each iteration of the job array is called a task and receives a unique job ID.
+Arrays are only supported for batch jobs; the submit time is $< 1$ second, compared
+to repeatedly submitting the same regular job over and over, even from a script.
-To submit an array job, use the \texttt{\-t} option of the \texttt{qsub}
+%To submit an array job, use the \texttt{\-t} option of the \texttt{qsub}
+%command as follows:
+To submit an array job, use the \option{--array} option of the \texttt{sbatch}
command as follows:
+%\begin{verbatim}
+%qsub -t n[-m[:s]]
+%\end{verbatim}
\begin{verbatim}
-qsub -t n[-m[:s]]
+sbatch --array=n[-m[:s]]
\end{verbatim}
\textbf{\option{--array} Option Syntax:}
@@ -232,27 +354,38 @@ \subsection{Array Jobs}
\textbf{Examples:}
\begin{itemize}
\item
-\texttt{qsub -t 10 array.sh}: submits a job with 1 task where the task-id is 10.
+\verb+sbatch --array=1-50000 -N1 -i my_in_%a -o my_out_%a array.sh+: submits a job with 50000 elements,
+\%a maps to the task-id between 1 and 50K.
\item
-\texttt{qsub -t 1-10 array.sh}: submits a job with 10 tasks numbered consecutively from 1 to 10.
+%\texttt{qsub -t 10 array.sh}: submits a job with 1 task where the task-id is 10.
+\texttt{sbatch --array=10 array.sh}: submits a job with 1 task where the task-id is 10.
\item
-%\texttt{qsub -t 1-10 array.sh}: submits a job with 10 tasks numbered consecutively from 1 to 10.
+\texttt{sbatch --array=1-10 array.sh}: submits a job with 10 tasks numbered consecutively from 1 to 10.
+\item
+%\texttt{qsub -t 3-15:3 array.sh}: submits a jobs with 5 tasks numbered consecutively with step size 3
+\texttt{sbatch --array=3-15:3 array.sh}: submits a job with 5 tasks numbered consecutively with step size 3
(task-ids 3,6,9,12,15).
\end{itemize}
\textbf{Output files for Array Jobs:}
-The default and output and error-files are
-\option{job\_name.[o|e]job\_id} and\\
-\option{job\_name.[o|e]job\_id.task\_id}.
+The default output and error-files are
+%\option{job\_name.[o|e]job\_id} and\\
+%\option{job\_name.[o|e]job\_id.task\_id}.
+\texttt{slurm-job\_id\_task\_id.out}.
%
This means that Speed creates an output and an error-file for each task
generated by the array-job as well as one for the super-ordinate array-job.
To alter this behavior use the \option{-o} and \option{-e} options of
-\tool{qsub}.
+%\tool{qsub}.
+\tool{sbatch}.
For more details about Array Job options, please review the manual pages for
-\option{qsub} by executing the following at the command line on speed-submit
-\tool{man qsub}.
+%\option{qsub} by executing the following at the command line on speed-submit
+%\tool{man qsub}.
+\tool{sbatch} by executing the following at the command line on speed-submit:
+\texttt{man sbatch}.
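+As a concrete sketch (the script and helper names here are hypothetical), a
+minimal array-job script that uses the task ID to pick its input could look
+like this:
+
+\begin{verbatim}
+#!/encs/bin/tcsh
+
+#SBATCH -J array-demo
+#SBATCH --mem=1G
+#SBATCH -t 30
+#SBATCH --array=1-10
+
+# Each task receives its own index in SLURM_ARRAY_TASK_ID;
+# use it to select a data file or a parameter set.
+echo "Task $SLURM_ARRAY_TASK_ID of job $SLURM_ARRAY_JOB_ID"
+srun ./process-chunk.sh $SLURM_ARRAY_TASK_ID
+\end{verbatim}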
% ------------------------------------------------------------------------------
\subsection{Requesting Multiple Cores (i.e., Multithreading Jobs)}
@@ -260,56 +393,235 @@
For jobs that can take advantage of multiple machine cores, up to 32 cores
(per job) can be requested in your script with:
+%\begin{verbatim}
+%#$ -pe smp [#cores]
+%\end{verbatim}
\begin{verbatim}
-#$ -pe smp [#cores]
+#SBATCH -n [#cores]
\end{verbatim}
+Both \tool{sbatch} and \tool{salloc} support \option{-n} on the command line,
+and it should always be set explicitly, either in the script or on the command
+line, as the default is $n=1$.
\textbf{Do not request more cores than you think will be useful}, as
larger-core jobs are more difficult to schedule. On the flip side, though,
if you are going to be running a program that scales out to the maximum
single-machine core count available, please (please) request 32 cores, to
avoid node oversubscription (i.e., to avoid overloading the CPUs).
-Core count associated with a job appears under, ``states'', in the,
-\texttt{qstat -f -u "*"}, output.
+\textbf{Important:} \option{--ntasks} or \option{--ntasks-per-node}
+(\option{-n}) refers to processes (usually the ones run with \tool{srun});
+\option{--cpus-per-task} (\option{-c}) corresponds to threads per process.
+Some programs consider them equivalent; some don't. Fluent, for example,
+uses \option{--ntasks-per-node=8} and \option{--cpus-per-task=1}, while
+some just set \option{--cpus-per-task=8} and \option{--ntasks-per-node=1}.
+If one of them is not $1$, then some applications need to be told to
+use $n \times c$ total cores.
+
+
+Core count associated with a job appears under,
+%``states'', in the, \texttt{qstat -f -u "*"}, output.
+``AllocCPUS'', in the, \texttt{sacct -j}, output.
+
+\small
+\begin{verbatim}
+[serguei@speed-submit src] % squeue -l
+Thu Oct 19 20:32:32 2023
+JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
+ 2652 ps interact a_user RUNNING 9:35:18 1-00:00:00 1 speed-07
+[serguei@speed-submit src] % sacct -j 2652
+JobID JobName Partition Account AllocCPUS State ExitCode
+------------ ---------- ---------- ---------- ---------- ---------- --------
+2652 interacti+ ps speed1 20 RUNNING 0:0
+2652.intera+ interacti+ speed1 20 RUNNING 0:0
+2652.extern extern speed1 20 RUNNING 0:0
+2652.0 gydra_pmi+ speed1 20 COMPLETED 0:0
+2652.1 gydra_pmi+ speed1 20 COMPLETED 0:0
+2652.2 gydra_pmi+ speed1 20 FAILED 7:0
+2652.3 gydra_pmi+ speed1 20 FAILED 7:0
+2652.4 gydra_pmi+ speed1 20 COMPLETED 0:0
+2652.5 gydra_pmi+ speed1 20 COMPLETED 0:0
+2652.6 gydra_pmi+ speed1 20 COMPLETED 0:0
+2652.7 gydra_pmi+ speed1 20 COMPLETED 0:0
+\end{verbatim}
+\normalsize
% ------------------------------------------------------------------------------
\subsection{Interactive Jobs}
+\label{sect:interactive-jobs}
Job sessions can be interactive, instead of batch (script) based. Such
-sessions can be useful for testing and optimising code and resource
-requirements prior to batch submission. To request an interactive job
-session, use, \texttt{qlogin [options]}, similarly to a
-\tool{qsub} command-line job (e.g., \texttt{qlogin -N qlogin-test -l h\_vmem=1G}).
-Note that the options that are available for \tool{qsub} are not necessarily
-available for \tool{qlogin}, notably, \texttt{-cwd}, and, \texttt{-v}.
+sessions can be useful for testing, debugging, and optimising code and resource
+requirements, for setting up conda or Python virtual environments, or for any
+similar preparatory work prior to batch submission.
+ +% ------------------------------------------------------------------------------ +\subsubsection{Command Line} + +To request an interactive job +%session, use, \texttt{qlogin [options]}, similarly to a +%\tool{qsub} command-line job (e.g., \texttt{qlogin -N qlogin-test -l h\_vmem=1G}). +session, use, \texttt{salloc [options]}, similarly to a +\tool{sbatch} command-line job, e.g., +% +\begin{verbatim} +salloc -J interactive-test --mem=1G -p ps -n 8 +\end{verbatim} +% +%Note that the options that are available for \tool{qsub} are not necessarily +%available for \tool{qlogin}, notably, \texttt{-cwd}, and, \texttt{-v}. +% +Inside the allocated \tool{salloc} session you can run shell +commands as usual; it is recommended to use \tool{srun} for +the heavy compute steps inside \tool{salloc}. +% +If it is a quick a short job just to compile something, e.g., on +a GPU node you can use an interactive \tool{srun} directly +(note no \tool{srun} can run within \tool{srun}), e.g., a 1 hour +allocation: + +For \tool{tcsh}: +\begin{verbatim} +srun --pty -n 8 -p pg --gpus=1 -t 60 /encs/bin/tcsh +\end{verbatim} + +For \tool{bash}: +\begin{verbatim} +srun --pty -n 8 -p pg --gpus=1 -t 60 /encs/bin/bash +\end{verbatim} + +% ------------------------------------------------------------------------------ +\subsubsection{Graphical Applications} + +If you need to run an on-Speed graphical-based UI application (e.g., MALTLAB, +Abaqus CME, etc.), or an IDE (PyCharm, VSCode, Eclipse) +to develop and test your job's code interactively you need to enable +X11-forwarding from your client machine to speed then to the compute node. 
+To do so:
+
+\begin{enumerate}
+\item
+you need to run an X server on your client machine, such as:
+\begin{itemize}
+\item on Windows: MobaXterm with X turned on, or Xming + PuTTY with X11 forwarding, or XOrg under Cygwin
+\item on macOS: XQuartz -- use its \tool{xterm} and \texttt{ssh -X}
+\item on Linux just use \texttt{ssh -X speed.encs.concordia.ca}
+\end{itemize}
+
+See \url{https://www.concordia.ca/ginacody/aits/support/faq/xserver.html}
+for details.
+
+\item
+verify your X connection was properly forwarded by printing the \api{DISPLAY} variable:
+
+\verb+echo $DISPLAY+
+
+If it has no output, then your X forwarding is not on and you may need to re-login to Speed.
+
+\item
+Use the \option{--x11} option with \tool{salloc} or \tool{srun}:
+
+\verb+salloc ... --x11=first ...+
+
+\item
+Once landed on a compute node, verify \api{DISPLAY} again.
+
+\item
+While running under the scheduler, unset \api{XDG\_RUNTIME\_DIR}.
+
+\item
+Launch your graphical application:
+
+\texttt{module load} the required version, then
+\tool{matlab}, or \tool{abaqus cae}, etc.
+\end{enumerate}
+
+Here's an example of starting PyCharm, of which we made a sample local installation.
+You can make a similar install under your own directory. If using VSCode, it's
+currently only supported with the \tool{--no-sandbox} option.
+
+\scriptsize
+\begin{verbatim}
+bash-3.2$ ssh -X speed (XQuartz xterm, PuTTY or MobaXterm have X11 forwarding too)
+serguei@speed's password:
+[serguei@speed-submit ~] % echo $DISPLAY
+localhost:14.0
+[serguei@speed-submit ~] % srun -p ps --pty --x11=first --mem 4000 -t 0-06:00 /encs/bin/bash
+bash-4.4$ echo $DISPLAY
+localhost:77.0
+bash-4.4$ hostname
+speed-01.encs.concordia.ca
+bash-4.4$ unset XDG_RUNTIME_DIR
+bash-4.4$ /speed-scratch/nag-public/bin/pycharm.sh
+\end{verbatim}
+\normalsize
+
+\begin{figure}[htpb]
+ \includegraphics[width=\columnwidth]{images/pycharm}
+ \caption{PyCharm Starting up on a Speed Node}
+ \label{fig:pycharm}
+\end{figure}
+
% ------------------------------------------------------------------------------
\subsection{Scheduler Environment Variables}
+\label{sect:env-vars}
The scheduler presents a number of environment variables that can be used in
-your jobs. Three of the more useful are \api{TMPDIR}, \api{SGE\_O\_WORKDIR},
-and \api{NSLOTS}:
+your jobs. You can invoke \tool{env} or \tool{printenv} in your
+job to know what those are (most begin with the prefix \texttt{SLURM}).
+%
+Some of the more useful ones are:
+%\api{TMPDIR}, \api{SGE\_O\_WORKDIR}, and \api{NSLOTS}:
\begin{itemize}
\item
+% TODO: verify temporal existence
+\api{\$TMPDIR} -- the path to the job's temporary space on the node. It
\emph{only} exists for the duration of the job, so if data in the temporary space
are important, they absolutely need to be accessed before the job terminates.
+%\item
+%\api{\$SGE\_O\_WORKDIR}=the path to the job's working directory (likely an
+%NFS-mounted path). If, \texttt{-cwd}, was stipulated, that path is taken; if not,
+%the path defaults to your home directory.
\item
-\api{\$SGE\_O\_WORKDIR}=the path to the job's working directory (likely an
-NFS-mounted path).
If, \texttt{-cwd}, was stipulated, that path is taken; if not,
+\api{\$SLURM\_SUBMIT\_DIR} -- the path to the job's working directory (likely an
+NFS-mounted path). If, \option{--chdir}, was stipulated, that path is taken; if not,
+% TODO: verify if home or current:
the path defaults to your home directory.
+% TODO: SLURM does not appear to have this
+% SLURM_NTASKS
+%\item
+%\api{\$NSLOTS}=the number of cores requested for the job. This variable can
+%be used in place of hardcoded thread-request declarations.
+
+\item
+\api{\$SLURM\_JOBID} -- your current job's ID, useful for some manipulation
+and reporting.
+
\item
-\api{\$NSLOTS}=the number of cores requested for the job. This variable can
-be used in place of hardcoded thread-request declarations.
+\api{\$SLURM\_JOB\_NODELIST} -- nodes participating in your job.
+
+\item
+\api{\$SLURM\_ARRAY\_TASK\_ID} -- for array jobs (see \xs{sect:array-jobs}).
+
+\item
+See a more complete list here:
+
+\small
+\begin{itemize}
+\item
+\url{https://slurm.schedmd.com/srun.html#SECTION_INPUT-ENVIRONMENT-VARIABLES}
+\item
+\url{https://slurm.schedmd.com/srun.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES}
+\end{itemize}
+\normalsize
\end{itemize}
\noindent
-In \xf{fig:tmpdir.sh} is a sample script, using all three.
+In \xf{fig:tmpdir.sh} is a sample script, using some of these.
\begin{figure}[htpb]
\lstinputlisting[language=csh,frame=single,basicstyle=\footnotesize\ttfamily]{tmpdir.sh}
diff --git a/doc/scheduler-tips.tex b/doc/scheduler-tips.tex
index 2d49108..a9ebee2 100644
--- a/doc/scheduler-tips.tex
+++ b/doc/scheduler-tips.tex
@@ -1,31 +1,38 @@
-% ------------------------------------------------------------------------------
+% ------------------------------------------------------------------------------
\subsection{Tips/Tricks}
\label{sect:tips}
\begin{itemize}
\item
Files/scripts must have Linux line breaks in them (not Windows ones).
+Use the \tool{file} command to verify, and the \tool{dos2unix} command
+to convert.
+ \item -Use \tool{rsync}, not \tool{scp}, when moving data around. +Use \tool{rsync}, not \tool{scp}, when moving a lot of data around. + \item If you are going to move many many files between NFS-mounted storage and the cluster, \tool{tar} everything up first. + \item If you intend to use a different shell (e.g., \tool{bash}~\cite{aosa-book-vol1-bash}), -you will need to source a different scheduler file, and will need to -change the shell declaration in your script(s). -\item -The load displayed in \tool{qstat} by default is \api{np\_load}, which is -load/\#cores. That means that a load of, ``1'', which represents a fully active -core, is displayed as $0.03$ on the node in question, as there are 32 cores -on a node. To display load ``as is'' (such that a node with a fully active -core displays a load of approximately $1.00$), add the following to your -\file{.tcshrc} file: \texttt{setenv SGE\_LOAD\_AVG load\_avg} +%you will need to source a different scheduler file, and +you will need to change the shell declaration in your script(s). + +% TODO: +%\item +%The load displayed in \tool{qstat} by default is \api{np\_load}, which is +%load/\#cores. That means that a load of, ``1'', which represents a fully active +%core, is displayed as $0.03$ on the node in question, as there are 32 cores +%on a node. To display load ``as is'' (such that a node with a fully active +%core displays a load of approximately $1.00$), add the following to your +%\file{.tcshrc} file: \texttt{setenv SGE\_LOAD\_AVG load\_avg} \item -Try to request resources that closely match what your job will use: +\textbf{Try to request resources that closely match what your job will use: requesting many more cores or much more memory than will be needed makes a -job more difficult to schedule when resources are scarce. +job more difficult to schedule when resources are scarce.} \item E-mail, \texttt{rt-ex-hpc AT encs.concordia.ca}, with any concerns/questions. 
diff --git a/doc/speed-manual.bib b/doc/speed-manual.bib index a610a82..5559909 100644 --- a/doc/speed-manual.bib +++ b/doc/speed-manual.bib @@ -35615,3 +35615,43 @@ @inproceedings address = {Aberdeen, UK}, note = {\url{https://arxiv.org/abs/2309.05829} and \url{https://github.com/goutamyg/MVT}} } + +@article +{ + cfd-modeling-turbine-2023, + author = {Belabes, Belkacem and Paraschivoiu, Marius}, + title = {{CFD} modeling of vertical-axis wind turbine wake interaction}, + journal = {Transactions of the Canadian Society for Mechanical Engineering}, + volume = {}, + number = {}, + pages = {1--10}, + year = 2023, + doi = {10.1139/tcsme-2022-0149}, + note = {\url{https://doi.org/10.1139/tcsme-2022-0149}} +} + +@inproceedings +{ + cfd-vaxis-turbine-wake-2022, + author = {Belabes, Belkacem and Paraschivoiu, Marius}, + title = {{CFD} Study of the aerodynamic performance of a Vertical Axis Wind Turbine in the wake of another turbine}, + booktitle = {Proceedings of the CSME International Congress}, + pages = {}, + year = 2022, + doi = {10.7939/r3-rker-1746}, + note = {\url{https://doi.org/10.7939/r3-rker-1746}} +} + +@article +{ + numerical-turbulence-vawt-2021, + author = {Belkacem Belabes and Marius Paraschivoiu}, + title = {Numerical study of the effect of turbulence intensity on {VAWT} performance}, + journal = {Energy}, + volume = {233}, + pages = {121139}, + year = 2021, + issn = {0360-5442}, + doi = {10.1016/j.energy.2021.121139}, + note = {\url{https://doi.org/10.1016/j.energy.2021.121139}} +} diff --git a/doc/speed-manual.cfg b/doc/speed-manual.cfg index 82a9037..2047b67 100644 --- a/doc/speed-manual.cfg +++ b/doc/speed-manual.cfg @@ -10,5 +10,12 @@ } } } + +% https://tex.stackexchange.com/questions/698669 +\makeatletter + \def\Hy@PageAnchorSlidesPlain{}% + \def\Hy@PageAnchorSlide{}% +\makeatother + \begin{document} -\EndPreamble \ No newline at end of file +\EndPreamble diff --git a/doc/speed-manual.pdf b/doc/speed-manual.pdf index ecac9c3..e544037 100644 Binary 
files a/doc/speed-manual.pdf and b/doc/speed-manual.pdf differ diff --git a/doc/speed-manual.tex b/doc/speed-manual.tex index e5e9e5b..5e0ad3a 100644 --- a/doc/speed-manual.tex +++ b/doc/speed-manual.tex @@ -35,7 +35,9 @@ % Previously VI %\date{Version 6.5} %\date{\textbf{Version 6.6-dev-07}} -\date{\textbf{Version 6.6} (final GE version)} +%\date{\textbf{Version 6.6} (final GE version)} +%\date{\textbf{Version 7.0-dev-01}} +\date{\textbf{Version 7.0}} % Authors are joined by \and and their affiliations are on the % subsequent lines separated by \\ just like the article class @@ -46,7 +48,10 @@ \and Gillian A. Roper \and - Network, Security and HPC Group\footnote{The group acknowledges the initial manual version VI produced by Dr.~Scott Bunnell while with us.}\\ + Carlos Alarcón Meza +\and + Network, Security and HPC Group\footnote{The group acknowledges the initial manual version VI produced by Dr.~Scott Bunnell while with us + as well as Dr.~Tariq Daradkeh for his instructional support of the users and contribution of examples.}\\ \affiliation{Gina Cody School of Engineering and Computer Science}\\ \affiliation{Concordia University}\\ \affiliation{Montreal, Quebec, Canada}\\ @@ -56,9 +61,10 @@ % \authorrunning{} has to be set for the shorter version of the authors' names; % otherwise a warning will be rendered in the running heads. 
%
-\authorrunning{Mokhov, Roper, NAG/HPC, GCS ENCS}
+\authorrunning{Mokhov, Roper, Alarcón Meza, NAG/HPC, GCS ENCS}
\indexedauthor{Mokhov, Serguei}
\indexedauthor{Roper, Gillian}
+\indexedauthor{Alarcón Meza, Carlos}
\indexedauthor{NAG/HPC}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -67,12 +73,10 @@
% ------------------------------------------------------------------------------
\begin{abstract}
-This document primarily presents a quick start
-guide to the usage of the Gina Cody School of
-Engineering and Computer Science compute server farm
-called ``Speed'' -- the GCS ENCS Speed cluster,
-managed by HPC/NAG of GCS ENCS, Concordia University,
-Montreal, Canada.
+This document presents a quick start guide to the usage of the Gina Cody School
+of Engineering and Computer Science compute server farm called ``Speed'' -- the
+GCS Speed cluster, managed by the HPC/NAG group of the Academic Information
+Technology Services (AITS) at GCS, Concordia University, Montreal, Canada.
\end{abstract}
% ------------------------------------------------------------------------------
@@ -84,9 +88,21 @@
\section{Introduction}
This document contains basic information required to use ``Speed'' as well as
tips and tricks, examples, and references to projects and papers that have used Speed.
-User contributions of sample jobs and/or references are welcome.
+User contributions of sample jobs and/or references are welcome.
Details are sent to the \texttt{hpc-ml} mailing list.
+\textbf{Note:} On October 20, 2023, with workshops prior, we completed the migration to SLURM (see \xf{fig:slurm-arch})
+from Grid Engine (UGE/AGE) as our job scheduler, so this manual has been ported to use SLURM's
+syntax and commands. If you are a long-time GE user, see \xa{appdx:uge-to-slurm} for key highlights
+of the changes needed to translate your GE jobs to SLURM, as well as environment changes.
+These changes are also elaborated on throughout this document and in our examples,
+in case you wish to revisit them.
+
+If you wish to cite this work in your acknowledgements, you can use
+our general DOI found on our GitHub page
+\url{https://dx.doi.org/10.5281/zenodo.5683642} or a specific
+version of the manual and scripts from that link individually.
+
% ------------------------------------------------------------------------------
\subsection{Resources}
@@ -112,74 +128,109 @@ \subsection{Resources}
All Speed users are subscribed to the \texttt{hpc-ml} mailing list.
-\item
-\href
- {https://docs.google.com/presentation/d/1zu4OQBU7mbj0e34Wr3ILXLPWomkhBgqGZ8j8xYrLf44}
- {Speed Server Farm Presentation 2022}~\cite{speed-intro-preso}.
+% TODO: for now comment out for 7.0; if/when we update that
+% preso, we will re-link it here. However, keep the citation.
+\nocite{speed-intro-preso}
+%\item
+%\href
+% {https://docs.google.com/presentation/d/1zu4OQBU7mbj0e34Wr3ILXLPWomkhBgqGZ8j8xYrLf44}
+% {Speed Server Farm Presentation 2022}~\cite{speed-intro-preso}.
\end{itemize}
% ------------------------------------------------------------------------------
\subsection{Team}
+\label{sect:speed-team}
+
+Speed is supported by:
\begin{itemize}
\item
-Serguei Mokhov, PhD, Manager, Networks, Security and HPC
+Serguei Mokhov, PhD, Manager, Networks, Security and HPC, AITS
\item
-Gillian Roper, Senior Administrator, System, Information Technology
+Gillian Roper, Senior Systems Administrator, HPC, AITS
\item
-Carlos Alarcón Meza, Administrator, System, High Performance Computing and Networking, Information Technology
- \item
-Tariq Daradkeh, PhD, IT Instructional Specialist, Information Technology
+Carlos Alarcón Meza, Systems Administrator, HPC and Networking, AITS
+ %\item
+%Tariq Daradkeh, PhD, IT Instructional Specialist, Information Technology
\end{itemize}
-We receive support from the rest of AITS teams, such
-as NAG, SAG, FIS, and DOG.
+\noindent
+We receive support from the rest of AITS teams, such as NAG, SAG, FIS, and DOG.\\
+%
+\url{https://www.concordia.ca/ginacody/aits.html}
% ------------------------------------------------------------------------------
-\subsection{What Speed Comprises}
+\subsection{What Speed Consists of}
+\label{sect:speed-arch}
\begin{itemize}
\item
Twenty four (24) 32-core compute nodes, each with 512~GB of memory and
-approximately 1~TB of volatile-scratch disk space.
+approximately 1~TB of local volatile-scratch disk space (pictured in \xf{fig:speed-pics}).
+
\item
Twelve (12) NVIDIA Tesla P6 GPUs, with 16~GB of memory (compatible with the
CUDA, OpenGL, OpenCL, and Vulkan APIs).
\item
-One AMD FirePro S7150 GPUs, with 8~GB of memory (compatible with the
-Direct~X, OpenGL, OpenCL, and Vulkan APIs).
+4 VIDPRO nodes, with 6 P6 cards and 6 V100 cards (32~GB), and
+256~GB of RAM.
+\item
+7 new SPEED2 servers, each with 64 CPU cores and 4x A100 80~GB GPUs (each
+partitioned into 4x 20~GB), with larger local storage for \api{TMPDIR}.
+
+\item
+One AMD FirePro S7150 GPU, with 8~GB of memory (compatible with the
+Direct~X, OpenGL, OpenCL, and Vulkan APIs).
\end{itemize}
+\begin{figure}[htpb]
+\includegraphics[width=\columnwidth]{images/speed-pics}
+\caption{Speed}
+\label{fig:speed-pics}
+\end{figure}
+
+\begin{figure}[htpb]
+\includegraphics[width=\columnwidth]{images/slurm-arch}
+\caption{Speed SLURM Architecture}
+\label{fig:slurm-arch}
+\end{figure}
+
+
% ------------------------------------------------------------------------------
\subsection{What Speed Is Ideal For}
\label{sect:speed-is-for}
\begin{itemize}
\item
-To design and develop, test and run parallel, batch, and other algorithms, scripts with partial data sets.
+To design and develop, test and run parallel, batch, and other algorithms,
+scripts with partial data sets. ``Speed'' has been optimised for compute jobs
+that are multi-core aware, require a large memory space, or are iteration
+intensive.
\item Prepare them for big clusters: \begin{itemize} \item - Digital Alliance (Calcul Quebec and Compute Canada) + Digital Research Alliance of Canada (Calcul Quebec and Compute Canada) \item Cloud platforms \end{itemize} \item Jobs that are too demanding for a desktop. \item -Single-core batch jobs; multithreaded jobs up to 32 cores (i.e., a single machine). +Single-core batch jobs; multithreaded jobs typically up to 32 cores (i.e., a single machine). \item -Anything that can fit into a 500-GB memory space and a scratch space of approximately 1~TB. +Multi-node multi-core jobs (MPI). +\item +Anything that can fit into a 500-GB memory space and a \textbf{scratch} space of approximately 10~TB. \item CPU-based jobs. \item -CUDA GPU jobs (\texttt{speed-05}, \texttt{speed-17}). +CUDA GPU jobs (\texttt{speed-01|-03|-05}, \texttt{speed-17}, \texttt{speed-37}--\texttt{speed-43}). \item -Non-CUDA GPU jobs using OpenCL (\texttt{speed-19} and \texttt{speed-05|17}). +Non-CUDA GPU jobs using OpenCL (\texttt{speed-19} and \texttt{-01|03|05|17|25|27|37-43}). \end{itemize} % ------------------------------------------------------------------------------ @@ -188,9 +239,9 @@ \subsection{What Speed Is Not} \begin{itemize} \item Speed is not a web host and does not host websites. -\item Speed is not meant for CI automation deployments for Ansible or similar tools. +\item Speed is not meant for Continuous Integration (CI) automation deployments for Ansible or similar tools. \item Does not run Kubernetes or other container orchestration software. -\item Does not run Docker. (Note: Speed does run Singularity and many Docker containers can be converted to Singularity containers with a single command.) +\item Does not run Docker. (\textbf{Note:} Speed does run Singularity and many Docker containers can be converted to Singularity containers with a single command. See \xs{sect:singularity-containers}.) \item Speed is not for jobs executed outside of the scheduler. 
(Jobs running outside of the scheduler will be killed and all data lost.) \end{itemize} @@ -198,39 +249,41 @@ \subsection{What Speed Is Not} \subsection{Available Software} We have a great number of open-source software available and installed -on Speed~--~various Python, CUDA versions, {\cpp}/{\java} compilers, OpenGL, +on ``Speed''~--~various Python, CUDA versions, {\cpp}/{\java} compilers, OpenGL, OpenFOAM, OpenCV, TensorFlow, OpenMPI, OpenISS, {\marf}~\cite{marf}, etc. There are also a number of commercial packages, subject to licensing contributions, available, such as MATLAB~\cite{matlab,scholarpedia-matlab}, Abaqus~\cite{abaqus}, Ansys, Fluent~\cite{fluent}, etc. To see the packages available, run \texttt{ls -al /encs/pkg/} on \texttt{speed.encs}. - +% In particular, there are over 2200 programs available in \texttt{/encs/bin} and \texttt{/encs/pkg} under Scientific Linux 7 (EL7). +We are building an equivalent array of programs for the EL9 SPEED2 nodes. \begin{itemize} \item Popular concrete examples: \begin{itemize} \item -MATLAB (R2016b, R2018a, R2018b) +MATLAB (R2016b, R2018a, R2018b, ...) \item -Fluent (19.2) +Fluent (19.2, ...) \item -Singularity (Docker-like container), can run other OS's apps, like Ubuntu's, converted Docker containers. +Singularity containers (see \xs{sect:singularity-containers}) can run other +operating systems and Linux distributions, like Ubuntu's, as well as +converted Docker containers. \end{itemize} \item We do our best to accommodate custom software requests. -Python environments can be used to have user-custom installs -in the scratch directory. +Python environments can use user-custom installs +from within the scratch directory. \item -A number of specific environments are available, too. 
- \item
-Popular examples mentioned (loaded with, \tool{module}):
+A number of specific environments are available and
+can be loaded using the \tool{module} command:
\begin{itemize}
\item
-Python (2.3.0 - 3.5.1)
+Python (2.3.x - 3.11.x)
\item
Gurobi (7.0.1, 7.5.0, 8.0.0, 8.1.0)
\item
@@ -246,12 +299,13 @@ \subsection{Available Software}

% ------------------------------------------------------------------------------
\subsection{Requesting Access}
+\label{sect:access}

After reviewing the ``What Speed is'' (\xs{sect:speed-is-for}) and
``What Speed is Not'' (\xs{sect:speed-is-not}), request access to the
``Speed'' cluster by emailing: \texttt{rt-ex-hpc AT encs.concordia.ca}.
%
-Faculty and staff may request the access directly.
+GCS ENCS faculty and staff may request access directly.
Students must include the following in their message:

\begin{itemize}
@@ -260,12 +314,42 @@ \subsection{Requesting Access}
\item Written request from the supervisor or instructor for the ENCS username to be granted access to ``Speed''
\end{itemize}

+Non-GCS faculty/students need to get a ``sponsor'' within GCS, so that
+your guest GCS ENCS account is created first. A sponsor can be any GCS faculty member
+you collaborate with. Failing that, request approval from our Dean's Office
+via our Associate Deans Drs.\ Eddie Hoi Ng or Emad Shihab.
+%
+External entities collaborating with GCS Concordia researchers
+should also go through the Dean's Office for approvals.
+%
+Non-GCS students taking a GCS course do have their GCS ENCS account created automatically,
+but still need the course instructor's approval to use the service.
+
% ------------------------------------------------------------------------------
\section{Job Management}
\label{sect:job-management}

In these instructions, anything bracketed like so, \verb+<>+, indicates a
label/value to be replaced (the entire bracketed term needs replacement).
+%
+We use SLURM as the Workload Manager.
+It supports primarily two types of jobs: batch and interactive.
+Batch jobs are used to run unattended tasks.
+
+TL;DR:
+Job instructions in a script start with the \verb+#SBATCH+ prefix, for example:
+\begin{verbatim}
+#SBATCH --account=speed1 --mem=100M -t 600 -J job-name
+#SBATCH --gpus=2 --mail-type=ALL -t 600 --mail-user=$USER
+\end{verbatim}
+%
+We use \tool{srun} for every complex compute step inside the script.
+Use interactive jobs to set up virtual environments, compile code, and debug.
+\tool{salloc} is preferred, as it allows multiple job steps.
+\tool{srun} can start interactive jobs as well (see \xs{sect:interactive-jobs}).
+Required and common job parameters: job-name (J), mail-type, mem, ntasks (n),
+cpus-per-task, account, -p (partition).
+
% ------------------------------------------------------------------------------
\subsection{Getting Started}

@@ -275,19 +359,26 @@ \subsection{Getting Started}
Once your GCS ENCS account has been granted access to ``Speed'',
use your GCS ENCS account credentials to create an SSH connection to
\texttt{speed} (an alias for \texttt{speed-submit.encs.concordia.ca}).
+%
+All users are expected to have a basic understanding of
+Linux and its commonly used commands (see \xa{sect:faqs-linux} for resources).

% ------------------------------------------------------------------------------
\subsubsection{SSH Connections}
+\label{sect:ssh}

Requirements to create connections to Speed:

\begin{enumerate}
\item
-An active \textbf{ENCS user account} which has permission to connect to Speed.
+An active \textbf{GCS ENCS user account}, which has permission to connect to Speed
+(see \xs{sect:access}).
\item
If you are off campus, an active connection to Concordia's VPN.
Accessing Concordia's VPN requires a Concordia \textbf{netname}.
\item
-Windows systems require a terminal emulator such as PuTTY (or MobaXterm).
+Windows systems require a terminal emulator such as PuTTY, Cygwin, or MobaXterm.
+ \item
+macOS systems have a built-in Terminal app for this, or \tool{xterm}, which comes with XQuartz.
\end{enumerate}

Open up a terminal window and type in the following SSH command being sure to replace
@@ -297,8 +388,12 @@ \subsubsection{SSH Connections}
ssh @speed.encs.concordia.ca
\end{verbatim}

-All users are expected to have a basic understanding of Linux and its
-commonly used commands.
+\noindent
+Read the AITS FAQ:
+\href
+{https://www.concordia.ca/ginacody/aits/support/faq/ssh-to-gcs.html}
+{How do I securely connect to a GCS server?}
+
% ------------------------------------------------------------------------------
% TMP scheduler-specific section
@@ -307,25 +402,47 @@ \subsubsection{SSH Connections}

% ------------------------------------------------------------------------------
\subsection{Job Submission Basics}

-Preparing your job for submission is fairly straightforward. Editing a copy
-of the \file{template.sh} you moved into your home directory during
-\xs{sect:envsetup} is a good place to start. You can also use a job script
-example from our GitHub's (\url{https://github.com/NAG-DevOps/speed-hpc}) ``src''
-directory and base your job on it.
-
+Preparing your job for submission is fairly straightforward.
+Start by basing your job script on one of the examples available in the \texttt{src/}
+directory of our GitHub repository (\url{https://github.com/NAG-DevOps/speed-hpc}).
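To make the structure concrete before dissecting it, here is a minimal illustrative skeleton (the job name, memory value, and module shown are assumptions for illustration, not Speed-specific requirements; since `#SBATCH` directives are shell comments, the body below also runs under plain `bash`):

```shell
#!/encs/bin/bash

# Directives section: read by the scheduler, ignored by the shell.
#SBATCH --job-name=demo   ## hypothetical job name
#SBATCH --mem=1G          ## memory per node
#SBATCH -t 10             ## 10-minute time limit

# Module loads section -- on Speed you would load needed software, e.g.:
# module load python/3.8

# User scripting section: the actual work of the job.
msg="Job running on $(hostname)"
echo "$msg"
```

Such a script would be submitted with, e.g., `sbatch -p ps demo.sh`, following the sample invocations in this section.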
+%
Job scripts are broken into four main sections:
+
\begin{itemize}
\item Directives
\item Module Loads
\item User Scripting
\end{itemize}

+You can clone the tip of our repository to get the examples to start
+with, or download them individually via a browser or command line:
+
+\small
+\begin{verbatim}
+git clone --depth=1 https://github.com/NAG-DevOps/speed-hpc.git
+cd speed-hpc/src
+\end{verbatim}
+\normalsize
+
+\noindent
+Then to quickly run some sample jobs, you can:
+\small
+\begin{verbatim}
+sbatch -p ps -t 10 bash.sh
+sbatch -p ps -t 10 env.sh
+sbatch -p ps -t 10 manual.sh
+sbatch -p pg -t 10 lambdal-singularity.sh
+\end{verbatim}
+\normalsize
+
+
% ------------------------------------------------------------------------------
% TMP scheduler-specific section
\input{scheduler-directives}

% ------------------------------------------------------------------------------
\subsubsection{Module Loads}
+\label{sect:modules}

As your job will run on a compute or GPU ``Speed'' node, and not the submit node,
any software that is needed must be loaded by the job script. Software is loaded
@@ -379,6 +496,7 @@ \subsubsection{Module Loads}

% ------------------------------------------------------------------------------
\subsection{SSH Keys For MPI}
+\label{sect:ssh-mpi}

Some programs effect their parallel processing via MPI (which is a
communication protocol). An example of such software is Fluent. MPI needs to
@@ -405,13 +523,17 @@ \subsection{Creating Virtual Environments}

The following documentation is specific to the \textbf{Speed}
HPC Facility at the Gina Cody School of Engineering and Computer Science.
+%
+Virtual environments are typically instantiated via Conda or Python.
+Another option is Singularity, detailed in \xs{sect:singularity-containers}.
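As a hedged sketch of the general pattern, keeping both the environment and any temporary install files under scratch space rather than in the NFS home (here `SCRATCH` is only a stand-in for `/speed-scratch/$USER` on Speed):

```shell
# Illustrative only: SCRATCH is a stand-in for /speed-scratch/$USER on Speed.
SCRATCH="${SCRATCH:-/tmp/${USER:-demo}-scratch}"
mkdir -p "$SCRATCH/tmp"
export TMPDIR="$SCRATCH/tmp"   # package managers' temp files now go to scratch, not /tmp
# A Conda environment would then be created in scratch with, e.g.:
#   conda create --prefix "$SCRATCH/myconda"
echo "TMPDIR=$TMPDIR"
```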
% ------------------------------------------------------------------------------
\subsubsection{Anaconda}
+\label{sect:conda-venv}

To create an anaconda environment in your speed-scratch directory, use the
\texttt{\-\-prefix} option when executing \texttt{conda create}. For example,
to create an anaconda environment for
-\texttt{ai\_user}, execute the following at the command line:
+\texttt{a\_user}, execute the following at the command line:

\begin{verbatim}
conda create --prefix /speed-scratch/a_user/myconda
\end{verbatim}
@@ -420,7 +542,7 @@ \subsubsection{Anaconda}
\vspace{10pt}
\noindent
\textbf{Note:} Without the \texttt{\-\-prefix} option, the \texttt{conda create} command creates the
-environment in texttt{a\_user}'s home directory by default.
+environment in \texttt{a\_user}'s home directory by default.
\vspace{10pt}

% ------------------------------------------------------------------------------
@@ -456,6 +578,24 @@ \subsubsection{Anaconda}
anaconda's repository.
\vspace{10pt}

+% ------------------------------------------------------------------------------
+\subsubsection{Python}
+\label{sect:python-venv}
+
+Setting up a Python virtual environment is fairly straightforward.
+We have a simple example that uses a Python virtual environment:
+
+\begin{itemize}
+ \item
+\href
+{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/gurobi-with-python.sh}
+{\texttt{gurobi-with-python.sh}}
+ %\item
+ %\href
+ %{}
+ %{}
+\end{itemize}
+
% ------------------------------------------------------------------------------
% TMP scheduler-specific section
\input{scheduler-job-examples}

% ------------------------------------------------------------------------------
@@ -476,11 +616,11 @@ \subsection{Important Limitations}

\begin{itemize}
\item
New users are restricted to a total of 32 cores: write to \url{rt-ex-hpc@encs.concordia.ca}
-if you need more temporarily (256 is the maximum possible, or, 8 jobs of 32 cores each).
+if you need more temporarily (192 is the maximum, that is, 6 jobs of 32 cores each).
\item -Job sessions are a maximum of one week in length (only 24 hours, though, -for interactive jobs). +Batch job sessions are a maximum of one week in length (only 24 hours, though, +for interactive jobs, see \xs{sect:interactive-jobs}). \item Scripts can live in your NFS-provided home, but any substantial data need @@ -490,18 +630,18 @@ \subsection{Important Limitations} NFS is great for acute activity, but is not ideal for chronic activity. Any data that a job will read more than once should be copied at the start to the scratch disk of a -compute node using \api{\$TMPDIR} (and, perhaps, \api{\$SGE\_O\_WORKDIR}), +compute node using \api{\$TMPDIR} (and, perhaps, \api{\$SLURM\_SUBMIT\_DIR}), any intermediary job data should be produced in \api{\$TMPDIR}, and once a job is near to finishing, those data should be copied to your NFS-mounted home (or other NFS-mounted space) from \api{\$TMPDIR} (to, perhaps, -\api{\$SGE\_O\_WORKDIR}). In other words, IO-intensive operations should be effected +\api{\$SLURM\_SUBMIT\_DIR}). In other words, IO-intensive operations should be effected locally whenever possible, saving network activity for the start and end of jobs. \item Your current resource allocation is based upon past usage, which is an amalgamation of approximately one week's worth of past wallclock (i.e., time -spent on the node(s)) and CPU activity (on the node(s)). +spent on the node(s)) and compute activity (on the node(s)). \item Jobs should NEVER be run outside of the province of the scheduler. 
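The stage-in/stage-out pattern described above can be sketched as follows (an illustration only; in a real job, TMPDIR and SLURM_SUBMIT_DIR are set by the scheduler, while here stand-in defaults and a made-up directory layout are used):

```shell
# Sketch of the stage-in / compute locally / stage-out pattern.
# Outside SLURM these variables are unset, so provide stand-in defaults.
SLURM_SUBMIT_DIR="${SLURM_SUBMIT_DIR:-$PWD}"
TMPDIR="${TMPDIR:-/tmp}"

workdir="$TMPDIR/job-$$"        # per-job area on the node's local scratch disk
mkdir -p "$workdir"

# Stage in: copy data read more than once from NFS to the local disk (if present).
if [ -d "$SLURM_SUBMIT_DIR/input" ]; then cp -r "$SLURM_SUBMIT_DIR/input" "$workdir/"; fi

# Compute: all intermediary IO stays local under $workdir.
echo "result" > "$workdir/output.txt"

# Stage out: copy final results back to NFS-mounted space near the job's end.
cp "$workdir/output.txt" "$SLURM_SUBMIT_DIR/"
```

This keeps IO-intensive operations local, reserving network activity for the start and end of the job, as the text above recommends.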
@@ -552,6 +692,12 @@ \subsection{Use Cases}
\item
\bibentry{Gopal2023Mob}
\item
+\bibentry{cfd-modeling-turbine-2023}
+\item
+\bibentry{cfd-vaxis-turbine-wake-2022}
+\item
+\bibentry{numerical-turbulence-vawt-2021}
+\item
\bibentry{niksirat2020}
\item
@@ -575,7 +721,7 @@ \section{History}

% ------------------------------------------------------------------------------
\subsection{Acknowledgments}
-\label{sect:scott-acks}
+\label{sect:acks}

\begin{itemize}
\item
@@ -585,22 +731,109 @@ \subsection{Acknowledgments}
him for his contributions.
\item
The HTML version with devcontainer support was contributed by Anh H Nguyen.
+ \item
+Dr.~Tariq Daradkeh was our IT Instructional Specialist from August 2022 to September 2023,
+working on the scheduler, scheduling research, end user support, and integration of
+examples, such as YOLOv3 in \xs{sect:openiss-yolov3}, among other tasks. We have a continued
+collaboration on HPC/scheduling research.
\end{itemize}

% ------------------------------------------------------------------------------
-\subsection{Phase 3}
+\subsection{Migration from UGE to SLURM}
+\label{appdx:uge-to-slurm}
+
+For long-term users who started off with Grid Engine, here are some resources
+to help with the transition and the mapping of the job submission process.
+
+\begin{itemize}
+\item
+Queues are called ``partitions'' in SLURM. Our mapping from the GE queues
+to SLURM partitions is as follows:
+\begin{verbatim}
+GE => SLURM
+s.q ps
+g.q pg
+a.q pa
+\end{verbatim}
+We also have a new partition \texttt{pt} that covers SPEED2 nodes,
+which previously did not exist.
+
+\item
+Command and command-option mappings are found in \xf{fig:rosetta-mappings} from\\
+\url{https://slurm.schedmd.com/rosetta.pdf}\\
+\url{https://slurm.schedmd.com/pdfs/summary.pdf}\\
+Other related helpful resources come from similar organizations that either have used
+SLURM for a while or also transitioned to it:\\
+\small
+\url{https://docs.alliancecan.ca/wiki/Running_jobs}\\
+\url{https://www.depts.ttu.edu/hpcc/userguides/general_guides/Conversion_Table_1.pdf}\\
+\url{https://docs.mpcdf.mpg.de/doc/computing/clusters/aux/migration-from-sge-to-slurm}
+\normalsize
+
+\begin{figure}[htpb]
+\includegraphics[width=\columnwidth]{images/rosetta-mapping}
+\caption{Rosetta Mappings of Scheduler Commands from SchedMD}
+\label{fig:rosetta-mappings}
+\end{figure}
+
+\item
+\noindent
+\textbf{NOTE:} If you have used UGE commands in the past, you probably still have
+lines like the following in your shell startup files; \textbf{they should now be removed},
+as they have no use in SLURM and
+will start giving ``command not found'' errors on login when the software is removed:
+
+csh/\tool{tcsh}: Sample \file{.tcshrc} file:
+\begin{verbatim}
+# Speed environment set up
+if ($HOSTNAME == speed-submit.encs.concordia.ca) then
+   source /local/pkg/uge-8.6.3/root/default/common/settings.csh
+endif
+\end{verbatim}
+
+Bourne shell/\tool{bash}: Sample \file{.bashrc} file:
+\begin{verbatim}
+# Speed environment set up
+if [ $HOSTNAME = "speed-submit.encs.concordia.ca" ]; then
+    . /local/pkg/uge-8.6.3/root/default/common/settings.sh
+    printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile
+fi
+\end{verbatim}
+
+Note that you will need to either log out and back in, or execute a new shell,
+for the environment changes in the updated \file{.tcshrc} or \file{.bashrc} file to be applied
+(\textbf{important}).
+
+
+\end{itemize}
+
+% ------------------------------------------------------------------------------
+\subsection{Phases}
+\label{sect:phases}
+
+A brief summary of the Speed evolution phases follows.
+
+% ------------------------------------------------------------------------------
+\subsubsection{Phase 4}
+
+Phase 4 added 7 SuperMicro servers with 4x A100 80GB GPUs each,
+dubbed ``SPEED2''. We also moved from Grid Engine to SLURM.
+
+% ------------------------------------------------------------------------------
+\subsubsection{Phase 3}

Phase 3 had 4 vidpro nodes added from Dr.~Amer totalling 6x P6 and 6x V100
GPUs.

% ------------------------------------------------------------------------------
-\subsection{Phase 2}
+\subsubsection{Phase 2}

Phase 2 saw 6x NVIDIA Tesla P6 added and 8x more compute nodes. The P6s replaced
4x of FirePro S7150.

% ------------------------------------------------------------------------------
-\subsection{Phase 1}
+\subsubsection{Phase 1}

Phase 1 of Speed was of the following configuration:

@@ -665,7 +898,13 @@ \section{Sister Facilities}
Contact Thomas Beaudry for details and how to obtain access.
\item
Digital Research Alliance Canada (Compute Canada / Calcul Quebec),\\
-\url{https://alliancecan.ca/}
+\url{https://alliancecan.ca/}. Follow
+\href
+{https://alliancecan.ca/en/services/advanced-research-computing/account-management/apply-account}
+{this link}
+for information on how to obtain access (students need to be sponsored
+by their supervising faculty members, who should create accounts
+first). Their SLURM examples are here: \url{https://docs.alliancecan.ca/wiki/Running_jobs}
\end{itemize}

diff --git a/src/README.md b/src/README.md
index 996578b..0d040cc 100644
--- a/src/README.md
+++ b/src/README.md
@@ -27,31 +27,36 @@ These are examples either trivial or some are more elaborate. Some are described
# Creating Environments and Compiling Code on Speed

## Correct Procedure
+
### Overview of preparing environments, compiling code and testing

-- Create a qlogin session to the queue you wish to run your jobs
-(e.g.
qlogin -q g.q -l gpu=1 for GPU jobs )
-- Within the qlogin session, create and activate an Anaconda environment in
-your /speed-scratch/ directory using the instructions found in Section 2.11.1 of the manual:
+
+- Create an `salloc` session to the partition you wish to run your jobs
+(e.g., `salloc -p pg --gpus=1` for GPU jobs)
+- Within the `salloc` session, create and activate an Anaconda environment in
+your `/speed-scratch/` directory using the instructions found in Section 2.11.1 of the manual:
https://nag-devops.github.io/speed-hpc/#creating-virtual-environments
- Compile your code within the environment.
- Test your code with a limited data set.
-- Once you are satisfied with your test results, exit your qlogin session.
+- Once you are satisfied with your test results, exit your `salloc` session.

### Once your environment and code have been tested
+
- Create a job script. (see https://nag-devops.github.io/speed-hpc/#job-submission-basics)
- Remember to activate your Anaconda environment in the user scripting section
-- Use the qsub command to submit your job script to the correct queue
+- Use the `sbatch` command to submit your job script to the correct partition and account

### Do not use the submit node to create environments or compile code

-- Speed-submit is a virtual machine intended to submit user jobs to
-the grid engine's scheduler. It is not intended to compile or run code.
-- Importantly, speed-submit does not have GPU drivers. This means that code compiled on speed-submit will not be compiled against GPU drivers.
-- Processes run outside of the scheduler on speed-submit will be killed and you will lose your work.
-## PIP
-By default, pip installs packages to a system-wide default location.
+- `speed-submit` is a virtual machine intended to submit user jobs to
+the job scheduler. It is not intended to compile or run code.
+- **Importantly**, `speed-submit` does not have GPU drivers.
This means that code compiled on `speed-submit` will not be compiled against proper GPU drivers.
+- Processes run outside of the scheduler on `speed-submit` will be killed and you will lose your work.
+
+### `pip`
+
+By default, `pip` installs packages to a system-wide default location.

-Creating environments via pip shound NOT be done outside of an Anaconda environment.
+Creating environments via `pip` should NOT be done outside of an Anaconda environment.

Why you should create an Anaconda environment and not use pip directly from the command line:
@@ -67,6 +72,7 @@ Virtual Environment Creation documentation. The following documentation is speci
### Anaconda

#### Load the Anaconda module
+
To view the Anaconda modules available, run

`module avail anaconda`
@@ -158,7 +164,7 @@ cd /speed-scratch/$USER/
### Speed Setup and Development Environment Preparation

The pre-requisites to prepare the virtual development environment using anaconda are explained in [speed manual](https://github.com/NAG-DevOps/speed-hpc/blob/master/doc/speed-manual.pdf) section 3, please check that for more information.
-1. Make sure you are in speed-scratch directory. Then Download OpenISS yolo3 project from [Github website](https://github.com/NAG-DevOps/openiss-yolov3) to your speed-scratch proper diectory.
+1. Make sure you are in your speed-scratch directory, then download the OpenISS YOLOv3 project from [GitHub](https://github.com/NAG-DevOps/openiss-yolov3) to the proper speed-scratch directory.
```
cd /speed-scratch/$USER/
git clone --depth=1 https://github.com/NAG-DevOps/openiss-yolov3.git
```
@@ -199,60 +205,45 @@ conda env remove -p /speed-scratch/$USER/YOLO

### Run Interactive Script

File `openiss-yolo-interactive.sh` is the Speed script to run the video example; to run it, follow these steps:
-1. Run interactive job we need to keep `ssh -X` option enabled and `xming` server in your windows working.
-2.
The `qsub` is not the proper command since we have to keep direct ssh connection to the computational node, so `qlogin` will be used.
-3. Enter `qlogin` in the `speed-submit`. The `qlogin` will find an approriate computational node then it will allow you to have direct `ssh -X` login to that node. Make sure you are in the right directory and activate conda environment again.
+1. To run an interactive job, keep the `ssh -X` option enabled and make sure the `Xming` server is running on your Windows machine (MobaXterm provides an alternative; on macOS use XQuartz).
+2. `sbatch` is not the proper command here, since we have to keep a direct ssh connection to the compute node, so `salloc` will be used.
+3. Enter `salloc` on `speed-submit`. `salloc` will find an appropriate compute node and then allow you a direct `ssh -X` login to that node. Make sure you are in the right directory and activate the conda environment again.
```
-qlogin
+salloc --x11=first -t 60 -n 16 --mem=40G -p pg
cd /speed-scratch/$USER/openiss-yolov3
conda activate /speed-scratch/$USER/YOLO
```
4. Before you run the script you need to add permission access to the project files, then run the script `./openiss-yolo-interactive.sh`
```
-chmod +rwx *
+chmod u+x *.sh
./openiss-yolo-interactive.sh
```
5. A pop-up window will show a classified live video.

-Please note that since we have limited number of node with GPU support `qlogin` is not allowed to direct you to login to these servers you will be directed to the available computation nodes in the cluster with CPU support only.
+Please note that since we have a limited number of nodes with GPU support, interactive `salloc` sessions are time-limited to a maximum of 24 hours.

### Run Non-interactive Script
+
Before you run the script you need to add permission access to the project files using the `chmod` command.
```
-chmod +rwx *
+chmod u+x *.sh
```
-To run the script you will use `qsub`, you can run the task on CPU or gpu computation node as follwoing:
+To run the script you will use `sbatch`; you can run the task on CPU or GPU compute nodes as follows:

1. For CPU nodes use `openiss-yolo-cpu.sh` file
```
- qsub ./openiss-yolo-cpu.sh
-```
-
-2. For GPU nodes use `openiss-yolo-gpu.sh` file with option -q to specify only gpu queue (g.q) submission.
-```
-qsub -q g.q ./openiss-yolo-gpu.sh
+sbatch ./openiss-yolo-cpu.sh
```
-3. Once your job is allocated to a note, activate your conda environment
-```
-qlogin
-cd /speed-scratch/$USER/SpeedYolo
-conda activate /speed-scratch/$USER/YOLOInteractive
+2. For GPU nodes use `openiss-yolo-gpu.sh` file with option -p to specify a GPU partition (`pg`) for submission.
```
-4. Before you run the script you need to add permission access to the project files, then start run the script `./openiss-yolo-interactive.sh`
-```
-chmod +rwx *
-./openiss-yolo-interactive.sh
+sbatch -p pg ./openiss-yolo-gpu.sh
```
-5. A pop up window will show a classifed live video.
-
-Please note that since we have limited number of node with GPU support `qlogin` is not allowed to direct you to login to these server you will be directed to the availabel computation nodes in the cluster with CPU support only.
-
For Tiny YOLOv3, proceed similarly; just specify the model path and anchor path with `--model model_file` and `--anchors anchor_file`.

### Performance comparison

-Time is in minutes, run Yolo with different hardware configurations GPU types V100 and Tesla P6. Please note that there is an issue to run Yolo project on more than one GPU in case of teasla P6. The project use keras.utils library calling `multi_gpu_model()` function, which cause hardware faluts and force to restart the server. GPU name for V100 (gpu32), for P6 (gpu) you can find that in scripts shell.
+Time is in minutes, running YOLO with different hardware configurations: GPU types V100 and Tesla P6. Please note that there is an issue running the YOLO project on more than one GPU in the case of the Tesla P6. The project uses the keras.utils library calling the `multi_gpu_model()` function, which causes hardware faults and forces a server restart. The GPU name for V100 is (gpu32) and for P6 (gpu16); you can find these in the shell scripts.

| 1GPU-P6 | 1GPU-V100 | 2GPU-V100 | 32CPU |
| --------------|-------------- |-------------- |----------------|
| 22.18 | 17.18 | 23.13 | 60.47 |


-## OpenISS-reid-tfk ##
+## OpenISS-reid-tfk

The following steps will provide the information required to execute the *OpenISS Person Re-Identification Baseline* Project (https://github.com/NAG-DevOps/openiss-reid-tfk) on *SPEED*

-### Environment ###
+### Environment

-The pre-requisites to prepare the environment are located in `environment.yml`. (https://github.com/NAG-DevOps/openiss-reid-tfk)
+The pre-requisites to prepare the environment are located in `environment.yml` (https://github.com/NAG-DevOps/openiss-reid-tfk).
Using a test dataset (Market1501) and 120 epochs as an example, we ran the script and the results were the following:
@@ -283,29 +274,29 @@ TEST DATASET: Market1501
----
Gallery images: 15913

-### Configuration and execution ###
+### Configuration and execution

- Log into Speed, go to your speed-scratch directory: `cd /speed-scratch/$USER/`
- Clone the repo from https://github.com/NAG-DevOps/openiss-reid-tfk
-- Download the dataset: go to datasets/ and run get_dataset_market1501.sh
-- In reid.py set the epochs (g_epochs=120 by default)
-- Download openiss-reid-speed.sh from this repository
-- On environment.yml comment or uncomment tensorflow accordingly (for CPU or GPU, GPU is default)
-- On openiss-reid-speed.sh comment or uncomment the secction accordingly (for CPU or GPU)
+- Download the dataset: go to `datasets/` and run `get_dataset_market1501.sh`
+- In `reid.py` set the epochs (`g_epochs=120` by default)
+- Download `openiss-reid-speed.sh` from this repository
+- On `environment.yml` comment or uncomment tensorflow accordingly (for CPU or GPU, GPU is default)
+- On `openiss-reid-speed.sh` comment or uncomment the section accordingly (for CPU or GPU)
- Submit the job:
- On CPUs nodes: `qsub ./openiss-reid-speed.sh`
+ On CPU nodes: `sbatch ./openiss-reid-speed.sh`
- On GPUs nodes: `qsub -q g.q ./openiss-reid-speed.sh`
+ On GPU nodes: `sbatch -p pg ./openiss-reid-speed.sh`

**IMPORTANT**
-Modify the script `openiss-reid-speed.sh` to setup the job to be ready for CPUs or GPUs nodes; h_vmem= and gpu= CAN'T be enabled at the same time, more information about these parameters on https://github.com/NAG-DevOps/speed-hpc/blob/master/doc/speed-manual.pdf
+Modify the script `openiss-reid-speed.sh` to set up the job for CPU or GPU nodes; `--mem=` and `--gpus=` in particular; see more information about these parameters at https://github.com/NAG-DevOps/speed-hpc/blob/master/doc/speed-manual.pdf

-## CUDA ##
+## CUDA

-When calling CUDA within job scripts, it is
important to create a link to the desired CUDA libraries and set the runtime link path to the same libraries. For example, to use the cuda-11.5 libraries, specify the following in your Makefile.
+When calling CUDA within job scripts, it is important to create a link to the desired CUDA libraries and set the runtime link path to the same libraries. For example, to use the `cuda-11.5` libraries, specify the following in your `Makefile`.
```
-L/encs/pkg/cuda-11.5/root/lib64 -Wl,-rpath,/encs/pkg/cuda-11.5/root/lib64
```
@@ -314,9 +305,9 @@ In your job script, specify the version of `gcc` to use prior to calling cuda. F
or
`module load gcc/9.3`

-### Special Notes for sending CUDA jobs to the GPU Queue (`g.q`)
+### Special Notes for sending CUDA jobs to the GPU Partition (`pg`)

-It is not possible to create an interactive `qlogin` session to **GPU Queue** (`g.q`) nodes. As direct login to these nodes is not available, batch jobs must be submitted to the **GPU Queue** with `qsub` in order to compile and link.
+Interactive jobs (easier to debug) should be submitted to the **GPU partition** (`pg`) with `salloc` in order to compile and link CUDA code.

We have several versions of CUDA installed in:
```
@@ -329,7 +320,7 @@ For CUDA to compile properly for the GPU queue, edit your `Makefile` replacing `
## Python Modules

-By default when adding a python module /tmp is used for the temporary repository of files downloaded. /tmp on speed_submit is too small for pytorch.
+By default, when adding a python module, `/tmp` is used as the temporary repository for downloaded files. `/tmp` on speed-submit is too small for PyTorch.
To add a python module:
@@ -339,4 +330,4 @@ To add a python module:
- `setenv TMPDIR /speed-scratch/$USER/tmp`
- Attempt the installation of pytorch

-Where `$USER` is an environment variable containing your encs_username
+Where `$USER` is an environment variable containing your GCS ENCS username
diff --git a/src/array.sh b/src/array.sh
new file mode 100755
index 0000000..e3f6a9e
--- /dev/null
+++ b/src/array.sh
@@ -0,0 +1,17 @@
+#!/encs/bin/tcsh
+
+#SBATCH -J arrayexample
+#SBATCH -c 1
+#SBATCH -N 1
+#SBATCH -t 0-2:00
+#SBATCH --array=1-30
+#SBATCH -o myprogram%A_%a.out
+# "%A" is replaced by the job ID and "%a" with the array index
+#SBATCH -e myprogram%A_%a.err
+
+echo "Would be input shard: input$SLURM_ARRAY_TASK_ID.dat"
+#/myprogram input$SLURM_ARRAY_TASK_ID.dat
+
+sleep 10
+
+# EOF
diff --git a/src/bash.sh b/src/bash.sh
index 69201f5..4d40eed 100755
--- a/src/bash.sh
+++ b/src/bash.sh
@@ -1,8 +1,8 @@
#!/encs/bin/bash

-#$ -N qsub-test
-#$ -cwd
-#$ -l h_vmem=1G
+#SBATCH -J bash-test ## --job-name
+#SBATCH --mem=1G ## memory per node
+#SBATCH --chdir=./ ## Set current directory as working directory

sleep 30

@@ -10,4 +10,4 @@ sleep 30
.
/encs/pkg/modules-3.2.10/root/Modules/3.2.10/init/bash
module load gurobi/8.1.0

-module list
\ No newline at end of file
+module list
diff --git a/src/comsol.sh b/src/comsol.sh
index 7137fef..8c71883 100755
--- a/src/comsol.sh
+++ b/src/comsol.sh
@@ -1,15 +1,21 @@
#!/encs/bin/tcsh

##
-## Job Scheduler options
+## SLURM options
##

-#$ -N comsole_job # job name
-#$ -cwd # Run from directory that script is in, e.g., your speed-scratch directory
-#$ -m bea # Email notifications at job's start and end, or on abort
-#$ -pe smp 8 # Request 8 slots from parellel environment 'smp'
-#$ -l h_vmem=500G # set resource value h_vmem (hard virtual memory size) to 500G
-
+#SBATCH --job-name=comsol_job ## Give the job a name
+#SBATCH --mail-type=ALL ## Receive all email type notifications
+#SBATCH --mail-user=$USER@encs.concordia.ca
+#SBATCH --nodes=1
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=8 ## Request 8 cpus
+#SBATCH --mem=500G ## Assign 500G memory per node
+
+# Note:
+# By default, SLURM sets the working directory to the directory the job is executed from.
+# To set a different working directory use the --chdir= SBATCH option.
+
##
## Job to run
##
@@ -25,11 +31,11 @@ setenv LMCOMSOL_LICENSE_FILE

# Execute the comsol batch command
# Note: review comsol batch -help for options available
-comsol batch -inputfile \
+srun comsol batch -inputfile \
-outputfile \
-batchlog

echo "$0 : Done!"
date

-#EOF
+# EOF
diff --git a/src/efficientdet.sh b/src/efficientdet.sh
index 5891a93..6ad4021 100755
--- a/src/efficientdet.sh
+++ b/src/efficientdet.sh
@@ -1,7 +1,7 @@
#!/encs/bin/tcsh

##
-## This script was submitted by a member of Dr. Amer's Research Group
+## This script was initially submitted by a member of Dr.
Amer's Research Group
##
##
@@ -10,16 +10,23 @@
##
##

##
-## Job Scheduler options
+## SLURM options
##

-#$ -N efficientdet_pascal
-#$ -cwd
-#$ -pe smp 8
-#$ -l h_vmem=128G
-#$ -l gpu=2
+#SBATCH --job-name=efficientdet_pascal
+#SBATCH --mail-type=ALL ## Receive all email type notifications
+#SBATCH --mail-user=$USER@encs.concordia.ca

-cd /speed-scratch/
+# Request GPU in Dr. Amer's partition pa
+#SBATCH --partition=pa
+#SBATCH --nodes=1
+#SBATCH --cpus-per-task=8
+#SBATCH --ntasks=1
+#SBATCH --gpus-per-node=2
+
+#SBATCH --mem=128G ## Assign memory per node
+
+cd /speed-scratch/$USER

module load python/3.8.3
module load cuda/11.5
@@ -27,7 +34,7 @@ source envs/tf/bin/activate.csh

cd code/automl/efficientdet

-python3 main.py --mode=train_and_eval \
+srun python3 main.py --mode=train_and_eval \
    --train_file_pattern=tfrecord/'pascal-*-of-00100.tfrecord' \
    --val_file_pattern=tfrecord/'val-*-of-00032.tfrecord' \
    --model_name='efficientdet-d0' \
diff --git a/src/env.sh b/src/env.sh
new file mode 100755
index 0000000..41384d4
--- /dev/null
+++ b/src/env.sh
@@ -0,0 +1,20 @@
+#!/encs/bin/tcsh
+
+#SBATCH --job-name=envs ## Give the job a name
+#SBATCH --mail-type=ALL ## Receive all email type notifications
+#SBATCH --mail-user=$USER@encs.concordia.ca
+#SBATCH --chdir=./ ## Use current directory as working directory
+#SBATCH --nodes=1
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=1 ## Request 1 cpu
+#SBATCH --mem=1G ## Assign 1G memory per node
+
+# Reset TMPDIR to a larger storage
+mkdir -p /speed-scratch/$USER/tmp
+setenv TMPDIR /speed-scratch/$USER/tmp
+
+date
+srun env
+date
+
+# EOF
diff --git a/src/fluent.sh b/src/fluent.sh
index 8001593..29f6d46 100755
--- a/src/fluent.sh
+++ b/src/fluent.sh
@@ -1,14 +1,33 @@
#!/encs/bin/tcsh

-#$ -N flu10000
-#$ -cwd
-#$ -m bea
-#$ -pe smp 8
-#$ -l h_vmem=160G
+#SBATCH --job-name=flu10000 ## Give the job a name
+#SBATCH --mail-type=ALL ## Receive all email type notifications
+#SBATCH --mail-user=$USER@encs.concordia.ca
+#SBATCH --chdir=./
## Use currect directory as working directory +#SBATCH --nodes=1 ## Number of nodes to run on +#SBATCH --ntasks-per-node=32 ## Number of cores +#SBATCH --cpus-per-task=1 ## Number of MPI threads +#SBATCH --mem=160G ## Assign 160G memory per node -module load ansys/19.0/default +date + +module avail ansys + +module load ansys/19.2/default cd $TMPDIR -fluent 3ddp -g -i $SGE_O_WORKDIR/fluentdata/info.jou -sgepe smp > call.txt +set FLUENTNODES = "`scontrol show hostnames`" +set FLUENTNODES = `echo $FLUENTNODES | tr ' ' ','` + +date + +srun fluent 3ddp \ + -g -t$SLURM_NTASKS \ + -g-cnf=$FLUENTNODES \ + -i $SLURM_SUBMIT_DIR/fluentdata/info.jou > call.txt + +date + +srun rsync -av $TMPDIR/ $SLURM_SUBMIT_DIR/fluentparallel/ -rsync -av $TMPDIR/ $SGE_O_WORKDIR/fluentparallel/ +date diff --git a/src/gurobi-with-python.sh b/src/gurobi-with-python.sh index 32fe910..327c23e 100755 --- a/src/gurobi-with-python.sh +++ b/src/gurobi-with-python.sh @@ -9,29 +9,38 @@ ## or create it inside $TMPDIR on fly as a part of your job ##################################################################################################################### -#$ -N MY_JOB -#$ -cwd -#$ -m bea -#$ -pe smp 8 -#$ -l h_vmem=150G +## SLURM options + +#SBATCH --job-name=gurobi-with-python ## Give the job a name +#SBATCH --mail-type=ALL ## Receive all email type notifications +#SBATCH --mail-user=$USER@encs.concordia.ca +#SBATCH --chdir=./ ## Use currect directory as working directory (default) + ## stored as $SLURM_SUBMIT_DIR +#SBATCH --cpus-per-task=8 ## Request 8 cpus +#SBATCH --mem=150G ## Assign memory per node ##PUT YOUR MODULE LOADS HERE module load gurobi/9.0.2/default module load python/3.7.7/default +## Create environment variables +setenv workdir $PWD +mkdir -p /speed-scratch/$USER/tmp +setenv TMPDIR /speed-scratch/$USER/tmp ## Create a virtual Python environment (env) in $TMPDIR -python3.7 -m venv $TMPDIR/env +srun python3.7 -m venv $TMPDIR/env ## Activate the new environment source 
$TMPDIR/env/bin/activate.csh ## Install gurobipy module cd $GUROBI_HOME -python3.7 setup.py build --build-base /tmp/${USER} install +srun python3.7 setup.py build --build-base /tmp/${USER} install -## return to workDir -cd $SGE_O_WORKDIR +## return to working directory +cd $workdir ## Now, instead of using 'gurobi.sh MY_PYTHON_SCRIPT.py', you can use -python MY_PYTHON_SCRIPT.py +srun python MY_PYTHON_SCRIPT.py ## inside MY_PYTHON_SCRIPT.py, you can use ## from gurobipy import * ## import multiprocessing as mp + diff --git a/src/lambdal-singularity.sh b/src/lambdal-singularity.sh index 6532b42..73486f6 100755 --- a/src/lambdal-singularity.sh +++ b/src/lambdal-singularity.sh @@ -1,7 +1,7 @@ #!/encs/bin/bash # Serguei Mokhov -# UGE-based job invocation script +# SLURM-based job invocation script # Singulairy container for Lambda Labs Software Stack @@ -9,39 +9,30 @@ ## Job scheduler options ## -# Run from the current directory where this script is -#$ -cwd - -# How many GPUs (currently limit is set 2 max for Speed 5 and 17) -#$ -l gpu=2 - -# High value of memory requested -#$ -l h_vmem=20G -#$ -ac hv=8 - -# Number of cores requested (approx). -# Be conservative -#$ -pe smp 4 - -# Email notifications -#$ -m bea +#SBATCH --job-name=lambdal ## Give the job a name +#SBATCH --mail-type=ALL ## Receive all email type notifications +#SBATCH --mail-user=$USER@encs.concordia.ca +#SBATCH --chdir=./ ## Use currect directory as working directory (default) +## Any partition, usually on the command line that has GPUs +##SBATCH --partition=pg ## Use the GPU partition (specify here or at command line wirh -p option) +#SBATCH --gpus=1 ## How many GPUs (currently limit is set 2 max for Speed 5 and 17) +#SBATCH --mem=20G ## Assign memory +#SBATCH --export=ALL,hv=8 ## Export all environment variables and set a value for the hv variable ## ## Job to run ## -# -# Run on GPU nodes like, `qsub -q g.q ...' -# - echo "$0 : about to run gcs-lambdalabs-singularity on Speed..." 
date +env + # time will simply measure and print runtime # sigularity run -- running the image # then whatever script you need to run inside the container -SINGULARITY=/encs/pkg/singularity-3.7.0/root/bin/singularity +SINGULARITY=/encs/pkg/singularity-3.10.4/root/bin/singularity # bind mount the current directory, the user's speed-scratch # directory, nettemp @@ -51,14 +42,10 @@ SINGULARITY_BIND=$PWD:/speed-pwd,/speed-scratch/$USER:/my-speed-scratch,/nettemp echo "Singularity will bind mount: $SINGULARITY_BIND for user: $USER" - time \ - $SINGULARITY run --nv /speed-scratch/nag-public/gcs-lambdalabs-stack.sif \ + srun $SINGULARITY run --nv /speed-scratch/nag-public/gcs-lambdalabs-stack.sif \ /usr/bin/python3 -c 'import torch; print(torch.rand(5, 5).cuda()); print(\"I love Lambda Stack!\")' -time \ - $SINGULARITY exec touch /my-speed-scratch/test1 - echo "$0 : Done!" date diff --git a/src/manual.sh b/src/manual.sh index d57eb29..df56f3c 100755 --- a/src/manual.sh +++ b/src/manual.sh @@ -4,12 +4,13 @@ ## Job Scheduler options ## -#$ -N speed-manual # job name -#$ -cwd # Run from directory that script is in, e.g., your speed-scratch directory -#$ -m bea # Email notifications at job's start and end, or on abort -#$ -pe smp 2 # Request 2 slots from parellel environment 'smp' -#$ -l h_vmem=1G # set resource value h_vmem (hard virtual memory size) to 1G - +#SBATCH --job-name=speed-manual ## Give the job a name +#SBATCH --mail-type=ALL ## Receive all email type notifications +#SBATCH --mail-user=$USER@encs.concordia.ca +#SBATCH --chdir=./ ## Use currect directory as working directory +#SBATCH --cpus-per-task=2 ## Request 2 cpus +#SBATCH --mem=1G ## Assign memory per node + ## ## Job to run ## @@ -21,17 +22,19 @@ date # Pull speed-hpc sources latest commit only to avoid # downloading all the history. For fun time the longer # running commands. 
-time git clone --depth 1 --branch master https://github.com/NAG-DevOps/speed-hpc.git +time srun git clone --depth 1 --branch master https://github.com/NAG-DevOps/speed-hpc.git # We need to be in the doc directory cd speed-hpc/doc pwd # Generate PDF manual -time make +time srun make # Generate the HTML manual -time make html +# 2023 TeXLive HTML generation gives obscure error +setenv PATH "/encs/pkg/texlive-20220405/root/bin/x86_64-linux:$PATH" +time srun make html # Report generated files ls -al *.pdf web/* diff --git a/src/matlab-sge.sh b/src/matlab-sge.sh deleted file mode 100755 index a95939e..0000000 --- a/src/matlab-sge.sh +++ /dev/null @@ -1,27 +0,0 @@ -#!/encs/bin/tcsh - -# Assigns a name to the job -#$ -N matlab-job-test - -# Tells the scheduler to execute the job from the current working directory -#$ -cwd - -# Sends e-mail notications (begin/end/abort) -#$ -m bea - -# How many GPUs (currently limit is set 2 max for Speed 5 and 17) -#$ -l gpu=2 - -# Loads matlab module version -module load matlab/R2022a/default - -# Displays to the output file whether the module(s) is(are) loaded correctly -module list - -# Matlab parameters - add/modify/remove them according your script - -# -nodesktop -nodisplay: Launches Matlab in the terminal -# -nojvm : Tells Matlab to run without the Java Virtual Machine to reduce overhead -# test.m : Matlab commands file - -matlab -nodisplay -nodesktop -nojvm < test.m diff --git a/src/matlab-slurm.sh b/src/matlab-slurm.sh new file mode 100755 index 0000000..7157f6d --- /dev/null +++ b/src/matlab-slurm.sh @@ -0,0 +1,22 @@ +#!/encs/bin/tcsh + +#SBATCH --job-name=matlab-job ## Give the job a name +#SBATCH --mail-type=ALL ## Receive all email type notifications +#SBATCH --mail-user=$USER@encs.concordia.ca +#SBATCH --chdir=./ ## Use currect directory as working directory (default) +#SBATCH --partition=pg-gpu ## Use the GPU partition (specify here or at command line wirh -p option) +#SBATCH --gpus=2 ## How many GPUs (currently limit is 
set 2 max for Speed 5 and 17) + +# Loads matlab module version +module load matlab/R2022a/default + +# Displays to the output file whether the module(s) is(are) loaded correctly +module list + +# Matlab parameters - add/modify/remove them according your script + +# -nodesktop -nodisplay: Launches Matlab in the terminal +# -nojvm : Tells Matlab to run without the Java Virtual Machine to reduce overhead +# test.m : Matlab commands file + +srun matlab -nodisplay -nodesktop -nojvm < test.m diff --git a/src/msfp-speed-job.sh b/src/msfp-speed-job.sh index 9289710..49e08cf 100755 --- a/src/msfp-speed-job.sh +++ b/src/msfp-speed-job.sh @@ -1,26 +1,26 @@ #!/encs/bin/tcsh # Serguei Mokhov -# UGE-based job invocation script +# SLURM job invocation script # mac-spoofer-flucid-processor is a Perl script -- the actual job ## -## Job scheduler options +## SLURM SBATCH options ## -# Run from the current directory where this script is -#$ -cwd +#SBATCH --job-name=mac-spoofer-flucid-processor ## Set the job's name +#SBATCH --mem=20G ## Set memory per node +#SBATCH --chdir=./ ## Set current directory as working directory +## Export all SLURM_* environment variables and the explicitely defined hv variable +#SBATCH --export=ALL,hv=8 -# High value of memory requeted -#$ -l h_vmem=20G -#$ -ac hv=8 - -# Number of cores requested (approx). -# Includes 4 Perl/Java processes per claim type -#$ -pe smp 8 +## Includes 4 Perl/Java processes per claim type +## Note: SLURM default is one task per node. +#SBATCH --cpus-per-task=8 ## Allocate 8 cpus per task # Notifications -#$ -m bea +#SBATCH --mail-type=ALL ## Receive all email type notifications +#SBATCH --mail-user=$USER@encs.concordia.ca ## ## Job to run @@ -28,9 +28,10 @@ # # The $RT variable is initialized on the command line -# via -v to qsub, like, `qsub -v RT=123 ...' +# before calling this script. 
Example for tcsh +# setenv RT 123 echo "$0 : about to run mac-spoofer-flucid-processor on Speed" -mac-spoofer-flucid-processor $RT +srun mac-spoofer-flucid-processor $RT # EOF diff --git a/src/openiss-reid-speed.sh b/src/openiss-reid-speed.sh index c080e2b..11b0c50 100755 --- a/src/openiss-reid-speed.sh +++ b/src/openiss-reid-speed.sh @@ -1,28 +1,28 @@ #!/encs/bin/tcsh # Give job a name -#$ -N openiss-reid +#SBATCH -J openiss-reid # Send an email when the job starts, finishes or if it is aborted. -#$ -m bea +#SBATCH --mail-type=ALL # Specify the output file name -#$ -o reid-tfk.log +#SBATCH -o openiss-reid-tfk.log # Set output directory to current -#$ -cwd +#SBATCH --chdir=./ -# Request CPU - comment this section if the job WON'T use CPU -# #$ -pe smp 32 -# #$ -l h_vmem=32G +# Request CPU +#SBATCH -n 32 +#SBATCH --mem=32G # Request GPU - comment this section if the job WON'T use GPU -#$ -l gpu=1 +#SBATCH --gpus=1 # Execute the script module load anaconda/default conda env create -f environment.yml -p /speed-scratch/$USER/reid-venv conda activate /speed-scratch/$USER/reid-venv -python reid.py +srun python reid.py conda deactivate conda env remove -p /speed-scratch/$USER/reid-venv diff --git a/src/openiss-yolo-cpu.sh b/src/openiss-yolo-cpu.sh index f5f6aa9..5fdf023 100755 --- a/src/openiss-yolo-cpu.sh +++ b/src/openiss-yolo-cpu.sh @@ -1,28 +1,27 @@ #!/encs/bin/tcsh # Give job a name -#$ -N oi-yolo-batch-cpu +#SBATCH -J oi-yolo-batch-cpu # Set output directory to current -#$ -cwd +#SBATCH --chdir=./ # Send an email when the job starts, finishes or if it is aborted. 
-#$ -m bea +#SBATCH --mail-type=ALL # Request GPU -# #$ -l gpu=2 +# #SBATCH --gpus=2 # Request CPU with maximum memoy size = 80GB -#$ -l h_vmem=80G +#SBATCH --mem=80G # Request CPU slots -#$ -pe smp 16 +#SBATCH -n 16 #sleep 30 -# Specify the output file name in our case we commntes that system will genreate file with the same name of the job -# -o name.qlog - +# Specify the output file name +#SBATCH -o openiss-yolo-batch-cpu.log conda activate /speed-scratch/$USER/YOLO @@ -30,6 +29,6 @@ conda activate /speed-scratch/$USER/YOLO #python yolo_video.py --model model_data/yolo.h5 --classes model_data/coco_classes.txt --image --gpu_num 2 # Video example -python yolo_video.py --input video/v1.avi --output video/001.avi #--gpu_num 2 +srun python yolo_video.py --input video/v1.avi --output video/001.avi #--gpu_num 2 conda deactivate diff --git a/src/openiss-yolo-gpu.sh b/src/openiss-yolo-gpu.sh index 3c480d1..6815356 100755 --- a/src/openiss-yolo-gpu.sh +++ b/src/openiss-yolo-gpu.sh @@ -1,25 +1,27 @@ #!/encs/bin/tcsh # Give job a name -#$ -N oi-yolo-gpu +#SBATCH -J oi-yolo-gpu # Set output directory to current -#$ -cwd +#SBATCH --chdir=./ # Send an email when the job starts, finishes or if it is aborted. 
-#$ -m bea +#SBATCH --mail-type=ALL # Request GPU -#$ -l gpu=2 +#SBATCH --gpus=2 # Request CPU with maximum memoy size = 40GB -# #$ -l h_vmem=40G +#SBATCH --mem=40G -#sleep 30 +# Request CPU slots +#SBATCH -n 16 -# Specify the output file name in our case we commntes that system will genreate file with the same name of the job -# -o name.qlog +#sleep 30 +# Specify the output file name +#SBATCH -o openiss-yolo-batch-gpu.log conda activate /speed-scratch/$USER/YOLO @@ -27,6 +29,6 @@ conda activate /speed-scratch/$USER/YOLO #python yolo_video.py --model model_data/yolo.h5 --classes model_data/coco_classes.txt --image --gpu_num 2 # Video example -python yolo_video.py --input video/v1.avi --output video/002.avi --gpu_num 2 +srun python yolo_video.py --input video/v1.avi --output video/002.avi --gpu_num 2 conda deactivate diff --git a/src/openiss-yolo-interactive.sh b/src/openiss-yolo-interactive.sh index 66316f3..9e2534c 100755 --- a/src/openiss-yolo-interactive.sh +++ b/src/openiss-yolo-interactive.sh @@ -1,28 +1,7 @@ #!/encs/bin/tcsh -## since it is qlogin no need to configure cluster setting because qlogin choosed the proper computational node -# Give job a name -#$ -N oi-yolo-interactive - -# Set output directory to current -# #$ -cwd - -# Send an email when the job starts, finishes or if it is aborted. 
-# #$ -m bea - -# Request GPU -# #$ -l gpu=2 - -# Request CPU with maximum memoy size = 80GB -# #$ -l h_vmem=80G - -# Request CPU slots -# #$ -pe smp 16 - -#sleep 30 - -# Specify the output file name in our case we commntes that system will genreate file with the same name of the job -# # -o name.qlog +## since it is salloc no need to configure cluster setting because +## it would choose the proper computational node conda activate /speed-scratch/$USER/YOLO @@ -30,6 +9,6 @@ conda activate /speed-scratch/$USER/YOLO #python yolo_video.py --model model_data/yolo.h5 --classes model_data/coco_classes.txt --image # Video example -python yolo_video.py --input video/v1.avi --output video/003.avi --interactive +srun python yolo_video.py --input video/v1.avi --output video/003.avi --interactive conda deactivate diff --git a/src/tcsh.sh b/src/tcsh.sh index 62cc8ea..7eb1c26 100755 --- a/src/tcsh.sh +++ b/src/tcsh.sh @@ -1,8 +1,7 @@ #!/encs/bin/tcsh -#$ -N qsub-test -#$ -cwd -#$ -l h_vmem=1G +#SBATCH --job-name=tcsh-test +#SBATCH --mem=1G sleep 30 module load gurobi/8.1.0 diff --git a/src/tmpdir.sh b/src/tmpdir.sh index 2f3c361..b02ba23 100755 --- a/src/tmpdir.sh +++ b/src/tmpdir.sh @@ -1,13 +1,17 @@ #!/encs/bin/tcsh -#$ -N envs -#$ -cwd -#$ -pe smp 8 -#$ -l h_vmem=32G +#SBATCH --job-name=tmpdir ## Give the job a name +#SBATCH --mail-type=ALL ## Receive all email type notifications +#SBATCH --mail-user=$USER +#SBATCH --chdir=./ ## Use currect directory as working directory +#SBATCH --nodes=1 +#SBATCH --ntasks=1 +#SBATCH --cpus-per-task=8 ## Request 8 cores +#SBATCH --mem=32G ## Assign 32G memory per node cd $TMPDIR mkdir input -rsync -av $SGE_O_WORKDIR/references/ input/ +rsync -av $SLURM_SUBMIT_DIR/references/ input/ mkdir results -STAR --inFiles $TMPDIR/input --parallel $NSLOTS --outFiles $TMPDIR/results -rsync -av $TMPDIR/results/ $SGE_O_WORKDIR/processed/ +srun STAR --inFiles $TMPDIR/input --parallel $SRUN_CPUS_PER_TASK --outFiles $TMPDIR/results +rsync -av $TMPDIR/results/ 
$SLURM_SUBMIT_DIR/processed/
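The fluent.sh hunk above builds a comma-separated node list for Fluent's `-cnf=` option from the output of `scontrol show hostnames`, which prints one hostname per line. The conversion step can be sketched in POSIX shell (the script itself does this in tcsh via word splitting and `tr ' ' ','`; the hostnames below are illustrative stand-ins for real `scontrol` output):

```shell
# scontrol show hostnames emits one hostname per line; Fluent's -cnf=
# option wants a single comma-separated list. tr maps each newline to a
# comma; printf (no trailing newline) avoids a dangling comma at the end.
hostnames="speed-05
speed-17"

fluentnodes=$(printf '%s' "$hostnames" | tr '\n' ',')
echo "$fluentnodes"   # speed-05,speed-17
```

The tcsh version in the script reaches the same result in two steps because tcsh backquote substitution already collapses newlines into word boundaries, leaving only spaces to translate.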