Merge pull request #58 from NAG-DevOps/manual-release7.3
Manual release 7.3 updates
smokhov authored Dec 20, 2024
2 parents c15f027 + af95fb9 commit 204ea6e
Showing 29 changed files with 4,230 additions and 2,615 deletions.
227 changes: 227 additions & 0 deletions doc/appendix/faq.tex
@@ -0,0 +1,227 @@
% -----------------------------------------------------------------------------
% B Frequently Asked Questions
% -----------------------------------------------------------------------------
\section{Frequently Asked Questions}
\label{sect:faqs}

% B.1 Where do I learn about Linux?
% -------------------------------------------------------------
\subsection{Where do I learn about Linux?}
\label{sect:faqs-linux}

All Speed users are expected to have a basic understanding of Linux and its commonly used commands.
Here are some recommended resources:

\paragraph*{Software Carpentry}:
Software Carpentry provides free resources to learn software, including a workshop on the Unix shell.
Visit \href{https://software-carpentry.org/lessons/}{Software Carpentry Lessons} to learn more.

\paragraph*{Udemy}:
There are numerous Udemy courses, including free ones, that will help you learn Linux.
Active Concordia faculty, staff and students have access to Udemy courses.
A recommended starting point for beginners is the course ``Linux Mastery: Master the Linux Command Line in 11.5 Hours''.
Visit \href{https://www.concordia.ca/it/services/udemy.html}{Concordia's Udemy page} to learn how Concordians can access Udemy.

% B.2 How to use bash shell on Speed?
% -------------------------------------------------------------
\subsection{How to use bash shell on Speed?}
\label{sect:faqs-bash}

This section explains how to use the bash shell on the Speed cluster.

\subsubsection{How do I set bash as my login shell?}
To set your default login shell to bash on Speed, your login shell on all GCS servers must be changed to bash.
To make this change, create a ticket with the Service Desk (or email \texttt{help at concordia.ca}) to
request that bash become your default login shell for your ENCS user account on all GCS servers.

\subsubsection{How do I move into a bash shell on Speed?}
To move to the bash shell, type \textbf{bash} at the command prompt:
\begin{verbatim}
[speed-submit] [/home/a/a_user] > bash
bash-4.4$ echo $0
bash
\end{verbatim}
\noindent\textbf{Note} how the command prompt changes from
``\verb![speed-submit] [/home/a/a_user] >!'' to ``\verb!bash-4.4$!'' after entering the bash shell.

\subsubsection{How do I use the bash shell in an interactive session on Speed?}
Below are examples of how to use \tool{bash} as a shell in your interactive job sessions
with both the \tool{salloc} and \tool{srun} commands.
\begin{itemize}
\item \texttt{salloc -ppt --mem=100G -N 1 -n 10 /encs/bin/bash}
\item \texttt{srun --mem=50G -n 5 --pty /encs/bin/bash}
\end{itemize}
\noindent\textbf{Note:} Make sure the interactive job requests memory, cores, etc.

\subsubsection{How do I run scripts written in bash on \tool{Speed}?}
To execute bash scripts on Speed:
\begin{enumerate}
\item Ensure that the shebang of your bash job script is \verb+#!/encs/bin/bash+
\item Use the \tool{sbatch} command to submit your job script to the scheduler.
\end{enumerate}
\noindent Check Speed GitHub for a \href{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/bash.sh}{sample bash job script}.
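
\noindent As a minimal sketch (not the official sample linked above), a bash job script
could look like the following. The job name, the resource values, and the \texttt{echo}
payload are placeholder assumptions; adjust them to your workload.
\begin{verbatim}
#!/encs/bin/bash
#SBATCH --job-name=bash-demo   # hypothetical job name
#SBATCH --mem=1G               # memory request (placeholder value)
#SBATCH -n 1                   # number of tasks (placeholder value)

# Payload: report the bash version and the node running the job.
echo "bash $BASH_VERSION on $(hostname)"
\end{verbatim}
\noindent If the script is saved as \texttt{bash-demo.sh} (a hypothetical file name),
it can be submitted with \texttt{sbatch bash-demo.sh}.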

% B.3 How to resolve “Disk quota exceeded” errors?
% -------------------------------------------------------------
\subsection{How to resolve ``Disk quota exceeded'' errors?}
\label{sect:quota-exceeded}

\subsubsection{Probable Cause}
The ``\texttt{Disk quota exceeded}'' error occurs when your application has
run out of disk space to write to. On \tool{Speed}, this error can be returned when:
\begin{enumerate}
\item The NFS-provided home is full and cannot be written to.
You can verify this using the \tool{quota} and \tool{bigfiles} commands.
\item The ``\texttt{/tmp}'' directory on the Speed node where your application is running is full and cannot be written to.
\end{enumerate}

\subsubsection{Possible Solutions}
\begin{enumerate}
\item Use the \option{--chdir} job script option to set the job working directory.
This is the directory where the job will write output files.

\item Although local disk space is recommended for I/O-intensive operations, the
``\texttt{/tmp}'' directory on \tool{Speed} nodes is limited to 1~TB, so it may be necessary
to store temporary data elsewhere.
The basic steps are:
\begin{itemize}
\item
Determine how to set working directories for each module used in your job script.
\item
Create a working directory in \tool{speed-scratch} for output files:
\begin{verbatim}
mkdir -m 750 /speed-scratch/$USER/output
\end{verbatim}
\item
Create a subdirectory for recovery files:
\begin{verbatim}
mkdir -m 750 /speed-scratch/$USER/recovery
\end{verbatim}
\item
Update the job script to write output to the directories created in your \tool{speed-scratch} directory,
e.g., \verb!/speed-scratch/$USER/output!.
\end{itemize}
\end{enumerate}
\noindent In the above example, \verb!$USER! is an environment variable containing your ENCS username.
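
\noindent As an illustration of the first solution, the working directory can also be given on the
\tool{sbatch} command line, where the shell expands \verb!$USER! before the job is submitted.
The script name \texttt{job.sh} is a hypothetical placeholder:
\begin{verbatim}
sbatch --chdir=/speed-scratch/$USER/output job.sh
\end{verbatim}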

\subsubsection{Example of setting working directories for \tool{COMSOL}}
\begin{itemize}
\item Create directories for recovery, temporary, and configuration files.
\begin{verbatim}
mkdir -m 750 -p /speed-scratch/$USER/comsol/{recovery,tmp,config}
\end{verbatim}
\item Add the following command switches to the COMSOL command to use the directories created above:
\begin{verbatim}
-recoverydir /speed-scratch/$USER/comsol/recovery
-tmpdir /speed-scratch/$USER/comsol/tmp
-configuration /speed-scratch/$USER/comsol/config
\end{verbatim}
\end{itemize}
\noindent In the above example, \verb!$USER! is an environment variable containing your ENCS username.

\subsubsection{Example of setting working directories for \tool{Python Modules}}
By default, when a Python module is added, the \texttt{/tmp} directory is used as the temporary location for downloaded files.
The \texttt{/tmp} directory on \verb!speed-submit! is too small for PyTorch.
To add a Python module:
\begin{itemize}
\item Create your own tmp directory in your \verb!speed-scratch! directory:
\begin{verbatim}
mkdir /speed-scratch/$USER/tmp
\end{verbatim}
\item Set \texttt{TMPDIR} to the temporary directory you created:
\begin{verbatim}
setenv TMPDIR /speed-scratch/$USER/tmp
\end{verbatim}
\item Attempt the installation of PyTorch
\end{itemize}
\noindent In the above example, \verb!$USER! is an environment variable containing your ENCS username.
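
\noindent The \texttt{setenv} command above is \tool{tcsh} syntax. In \tool{bash}, the
equivalent is \texttt{export}. The sketch below assumes that a Python environment is already
active and that PyTorch is installed with \tool{pip} under the package name \texttt{torch};
adapt the last line to your own installation method:
\begin{verbatim}
mkdir -p /speed-scratch/$USER/tmp
export TMPDIR=/speed-scratch/$USER/tmp
pip install torch
\end{verbatim}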

% B.4 How do I check my job's status?
% -------------------------------------------------------------
\subsection{How do I check my job's status?}
\label{sect:faq-job-status}

When a job with a job ID of 1234 is running or has terminated, you can check its status using the following commands:
\begin{itemize}
\item Use the ``sacct'' command to view the status of a job:
\begin{verbatim}
sacct -j 1234
\end{verbatim}
\item Use the ``squeue'' command to see if the job is sitting in the queue:
\begin{verbatim}
squeue -j 1234
\end{verbatim}
\item Use the ``sstat'' command to view resource-usage statistics for a job while it is running
(once the job has terminated and the \tool{slurmctld} has purged it from its tracking state into the database, use ``sacct'' instead):
\begin{verbatim}
sstat -j 1234
\end{verbatim}
\end{itemize}
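
\noindent The default \tool{sacct} output can be narrowed to the fields of interest with the
\option{--format} option. The field selection below is only one possible choice:
\begin{verbatim}
sacct -j 1234 --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS
\end{verbatim}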

% B.5 Why is my job pending when nodes are empty?
% -------------------------------------------------------------
\subsection{Why is my job pending when nodes are empty?}

\subsubsection{Disabled nodes}
It is possible that one or more of the Speed nodes are disabled for maintenance.
To verify if Speed nodes are disabled, check if they are in a draining or drained state:

\small
\begin{verbatim}
[serguei@speed-submit src] % sinfo --long --Node
Thu Oct 19 21:25:12 2023
NODELIST NODES PARTITION STATE   CPUS S:C:T  MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
speed-01 1     pa        idle    32   2:16:1 257458 0        1      gpu16    none
speed-03 1     pa        idle    32   2:16:1 257458 0        1      gpu32    none
speed-05 1     pg        idle    32   2:16:1 515490 0        1      gpu16    none
speed-07 1     ps*       mixed   32   2:16:1 515490 0        1      cpu32    none
speed-08 1     ps*       drained 32   2:16:1 515490 0        1      cpu32    UGE
speed-09 1     ps*       drained 32   2:16:1 515490 0        1      cpu32    UGE
speed-10 1     ps*       drained 32   2:16:1 515490 0        1      cpu32    UGE
speed-11 1     ps*       idle    32   2:16:1 515490 0        1      cpu32    none
speed-12 1     ps*       drained 32   2:16:1 515490 0        1      cpu32    UGE
speed-15 1     ps*       drained 32   2:16:1 515490 0        1      cpu32    UGE
speed-16 1     ps*       drained 32   2:16:1 515490 0        1      cpu32    UGE
speed-17 1     pg        drained 32   2:16:1 515490 0        1      gpu16    UGE
speed-19 1     ps*       idle    32   2:16:1 515490 0        1      cpu32    none
speed-20 1     ps*       drained 32   2:16:1 515490 0        1      cpu32    UGE
speed-21 1     ps*       drained 32   2:16:1 515490 0        1      cpu32    UGE
speed-22 1     ps*       drained 32   2:16:1 515490 0        1      cpu32    UGE
speed-23 1     ps*       idle    32   2:16:1 515490 0        1      cpu32    none
speed-24 1     ps*       idle    32   2:16:1 515490 0        1      cpu32    none
speed-25 1     pg        idle    32   2:16:1 257458 0        1      gpu32    none
speed-25 1     pa        idle    32   2:16:1 257458 0        1      gpu32    none
speed-27 1     pg        idle    32   2:16:1 257458 0        1      gpu32    none
speed-27 1     pa        idle    32   2:16:1 257458 0        1      gpu32    none
speed-29 1     ps*       idle    32   2:16:1 515490 0        1      cpu32    none
speed-30 1     ps*       drained 32   2:16:1 515490 0        1      cpu32    UGE
speed-31 1     ps*       drained 32   2:16:1 515490 0        1      cpu32    UGE
speed-32 1     ps*       drained 32   2:16:1 515490 0        1      cpu32    UGE
speed-33 1     ps*       idle    32   2:16:1 515490 0        1      cpu32    none
speed-34 1     ps*       idle    32   2:16:1 515490 0        1      cpu32    none
speed-35 1     ps*       drained 32   2:16:1 515490 0        1      cpu32    UGE
speed-36 1     ps*       drained 32   2:16:1 515490 0        1      cpu32    UGE
speed-37 1     pt        idle    256  2:64:2 980275 0        1      gpu20,mi none
speed-38 1     pt        idle    256  2:64:2 980275 0        1      gpu20,mi none
speed-39 1     pt        idle    256  2:64:2 980275 0        1      gpu20,mi none
speed-40 1     pt        idle    256  2:64:2 980275 0        1      gpu20,mi none
speed-41 1     pt        idle    256  2:64:2 980275 0        1      gpu20,mi none
speed-42 1     pt        idle    256  2:64:2 980275 0        1      gpu20,mi none
speed-43 1     pt        idle    256  2:64:2 980275 0        1      gpu20,mi none
\end{verbatim}
\normalsize

\noindent Note which nodes are in the \textbf{drained} state.
The reason for the drained state can be found in the \textbf{REASON} column.
Your job will run once an occupied node becomes available, or once the maintenance is completed
and the disabled nodes return to the \textbf{idle} state.
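
\noindent To list only the nodes that are down, drained, or failing, together with the recorded
reason, \tool{sinfo} also offers the \option{--list-reasons} option (short form \option{-R}):
\begin{verbatim}
sinfo --list-reasons
\end{verbatim}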

\subsubsection{Error in job submit request}
It is possible that your job is pending because it requested resources that are not available within Speed.
To verify why job ID 1234 is not running, execute:
\begin{verbatim}
sacct -j 1234
\end{verbatim}

\noindent A summary of the reasons can be obtained via the \tool{squeue} command.
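
\noindent For example, the pending reason can be printed with a custom \tool{squeue} output format,
where \texttt{\%r} is the reason field. The column widths below are arbitrary choices:
\begin{verbatim}
squeue -j 1234 -o "%.10i %.9P %.10T %r"
\end{verbatim}
\noindent The \tool{scontrol} command shows the same reason along with the full job record:
\begin{verbatim}
scontrol show job 1234
\end{verbatim}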
116 changes: 116 additions & 0 deletions doc/appendix/history.tex
@@ -0,0 +1,116 @@
% -----------------------------------------------------------------------------
% A History
% -----------------------------------------------------------------------------
\section{History}
\label{sect:history}

% A.1 Acknowledgments
% -------------------------------------------------------------
\subsection{Acknowledgments}
\label{sect:acks}

\begin{itemize}
\item
The first 6 to 6.5 versions of this manual, early UGE job script samples, Singularity testing, and user support
were the work of Dr.~Scott Bunnell during his time at Concordia as a part of the NAG/HPC group.
We thank him for his contributions.
\item
The HTML version with devcontainer support was contributed by Anh H Nguyen.
\item
Dr.~Tariq Daradkeh was our IT Instructional Specialist from August 2022 to September 2023,
working on the scheduler, scheduling research, end-user support, and the integration of
examples, such as YOLOv3 in \xs{sect:openiss-yolov3}, among other tasks. We continue to
collaborate on HPC/scheduling research (see~\cite{job-failure-prediction-compsysarch2024}).
\end{itemize}

% A.2 Migration from UGE to SLURM
% -------------------------------------------------------------
\subsection{Migration from UGE to SLURM}
\label{appdx:uge-to-slurm}

For long-term users who started with Grid Engine, here are some resources
to ease the transition and map familiar concepts to the SLURM job submission process.

\begin{itemize}
\item
Queues are called ``partitions'' in SLURM. Our mapping from the GE queues to SLURM partitions is as follows:
\begin{verbatim}
GE => SLURM
s.q ps
g.q pg
a.q pa
\end{verbatim}
We also have a new partition \texttt{pt} that covers SPEED2 nodes, which previously did not exist.

\item
Commands and command options mappings are found in \xf{fig:rosetta-mappings} from:\\
\url{https://slurm.schedmd.com/rosetta.pdf}\\
\url{https://slurm.schedmd.com/pdfs/summary.pdf}\\
Other helpful resources from similar organizations that have either used SLURM for a while or also transitioned to it:\\
\url{https://docs.alliancecan.ca/wiki/Running_jobs}\\
\url{https://www.depts.ttu.edu/hpcc/userguides/general_guides/Conversion_Table_1.pdf}\\
\url{https://docs.mpcdf.mpg.de/doc/computing/clusters/aux/migration-from-sge-to-slurm}

\begin{figure}[htpb]
\includegraphics[width=\columnwidth]{images/rosetta-mapping}
\caption{Rosetta Mappings of Scheduler Commands from SchedMD}
\label{fig:rosetta-mappings}
\end{figure}

\item
\textbf{NOTE:} If you have used UGE commands in the past, your shell startup files probably still
contain lines like the ones below; \textbf{they should now be removed}, as they have no use in SLURM and
will start giving ``command not found'' errors on login when the software is removed:

csh/\tool{tcsh}: sample \file{.tcshrc} file:
\begin{verbatim}
# Speed environment set up
if ($HOSTNAME == speed-submit.encs.concordia.ca) then
source /local/pkg/uge-8.6.3/root/default/common/settings.csh
endif
\end{verbatim}

Bourne shell/\tool{bash}: sample \file{.bashrc} file:
\begin{verbatim}
# Speed environment set up
if [ $HOSTNAME = "speed-submit.encs.concordia.ca" ]; then
. /local/pkg/uge-8.6.3/root/default/common/settings.sh
printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile
fi
\end{verbatim}

\textbf{IMPORTANT NOTE:} you will need to either log out and back in, or execute a new shell,
for the environment changes in the updated \file{.tcshrc} or \file{.bashrc} file to be applied.
\end{itemize}

% A.3 Phases
% -------------------------------------------------------------
\subsection{Phases}
\label{sect:phases}

Brief summary of Speed evolution phases:

\subsubsection{Phase 5}
Phase 5 saw incorporation of the Salus, Magic, and Nebular
subclusters (see \xf{fig:speed-architecture-full}).

\subsubsection{Phase 4}
In Phase 4, 7 SuperMicro servers with 4x A100 80GB GPUs each were added,
dubbed ``SPEED2''. We also moved from Grid Engine to SLURM.

\subsubsection{Phase 3}
In Phase 3, 4 vidpro nodes from Dr.~Amer were added, totalling 6x P6 and 6x V100
GPUs.

\subsubsection{Phase 2}
Phase 2 saw 6x NVIDIA Tesla P6 GPUs and 8x more compute nodes added.
The P6s replaced 4x of the FirePro S7150 GPUs.

\subsubsection{Phase 1}
Phase 1 of Speed was of the following configuration:
\begin{itemize}
\item
Sixteen 32-core nodes, each with 512~GB of memory and approximately 1~TB of volatile-scratch disk space.
\item
Five AMD FirePro S7150 GPUs, with 8~GB of memory (compatible with the DirectX, OpenGL, OpenCL, and Vulkan APIs).
\end{itemize}