\documentclass{easychair}
%\documentclass[draft]{easychair}
% https://en.wikibooks.org/wiki/LaTeX/Source_Code_Listings
\usepackage{listings}
% For inline citations
\usepackage{bibentry}
\nobibliography*
% For multicolumn itemized lists
\usepackage{multicol}
% Down to the level of the paragraph (4)
\setcounter{secnumdepth}{4}
\setcounter{tocdepth}{4}
% Folders with images
\makeatletter
\providecommand*{\input@path}{}
\g@addto@macro\input@path{{../src/}{src/}}% append
\g@addto@macro\input@path{{../doc/images/}{images/}}% append
\makeatother
\input{commands}
%% Document
%%
\begin{document}
% ------------------------------------------------------------------------------
%% Front Matter
%%
% Regular title as in the article class.
%
\title{Speed: The GCS ENCS Cluster}
% \titlerunning{} has to be set to either the main title or its shorter
% version for the running heads. Use {\sf} for highlighting your system
% name, application, or a tool.
%
\titlerunning{Speed: The GCS ENCS Cluster}
% Previously VI
%\date{Version 6.5}
%\date{\textbf{Version 6.6-dev-07}}
%\date{\textbf{Version 6.6} (final GE version)}
%\date{\textbf{Version 7.0-dev-01}}
%\date{\textbf{Version 7.0}}
%\date{\textbf{Version 7.1}}
\date{\textbf{Version 7.2}}
% Authors are joined by \and and their affiliations are on the
% subsequent lines separated by \\ just like the article class
% allows.
%
\author{
Serguei A. Mokhov
\and
Gillian A. Roper
\and
Carlos Alarcón Meza
\and
Farah Salhany
\and
Network, Security and HPC Group\footnote{The group acknowledges the initial manual version VI produced by Dr.~Scott Bunnell while with us,
as well as Dr.~Tariq Daradkeh for his instructional support of the users and contribution of examples.}\\
\affiliation{Gina Cody School of Engineering and Computer Science}\\
\affiliation{Concordia University}\\
\affiliation{Montreal, Quebec, Canada}\\
\affiliation{\url{rt-ex-hpc~AT~encs.concordia.ca}}\\
}
% \authorrunning{} has to be set for the shorter version of the authors' names;
% otherwise a warning will be rendered in the running heads.
%
\authorrunning{Mokhov, Roper, Alarcón Meza, Salhany, NAG/HPC, GCS ENCS}
\indexedauthor{Mokhov, Serguei}
\indexedauthor{Roper, Gillian}
\indexedauthor{Alarcón Meza, Carlos}
\indexedauthor{Salhany, Farah}
\indexedauthor{NAG/HPC}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\maketitle
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% ------------------------------------------------------------------------------
\begin{abstract}
This document serves as a quick start guide to using the Gina Cody School of Engineering and Computer Science (GCS ENCS)
compute server farm, known as ``Speed,'' which is managed by the HPC/NAG group of the
Academic Information Technology Services (AITS) at GCS, Concordia University, Montreal, Canada.
\end{abstract}
% ------------------------------------------------------------------------------
\tableofcontents
\clearpage
% ------------------------------------------------------------------------------
% 1 Introduction
% ------------------------------------------------------------------------------
\section{Introduction}
\label{sect:introduction}
This document contains basic information required to use ``Speed'', along with tips,
tricks, examples, and references to projects and papers that have used Speed.
User contributions of sample jobs and/or references are welcome.\\
\noindent
\textbf{Note:} On October 20, 2023, we completed the migration to SLURM
from Grid Engine (UGE/AGE) as our job scheduler.
This manual has been updated to use SLURM's syntax and commands.
If you are a long-time GE user, refer to \xa{appdx:uge-to-slurm} for key highlights needed to
translate your GE jobs to SLURM as well as environment changes.
These changes are also elaborated throughout this document and our examples.
% ------------------------------------------------------------------------------
\subsection{Citing Us}
\label{sect:citing-speed-hpc}
If you wish to cite this work in your acknowledgements, you can use our general DOI found on our GitHub page,
\url{https://dx.doi.org/10.5281/zenodo.5683642}, or cite a specific version of the manual and scripts individually from that link.
You can also use the ``cite this repository'' feature of GitHub.
% ----------------------------- 1.1 Resources ----------------------------------
% ------------------------------------------------------------------------------
\subsection{Resources}
\label{sect:resources}
\begin{itemize}
\item
Public GitHub page where the manual and sample job scripts are maintained:\\
\url{https://github.com/NAG-DevOps/speed-hpc}
\begin{itemize}
\item Pull requests (PRs) are subject to review and are welcome:\\
\url{https://github.com/NAG-DevOps/speed-hpc/pulls}
\end{itemize}
\item
Speed Manual:
\begin{itemize}
\item PDF version of the manual:\\
\url{https://github.com/NAG-DevOps/speed-hpc/blob/master/doc/speed-manual.pdf}
\item HTML version of the manual:\\
\url{https://nag-devops.github.io/speed-hpc/}
\end{itemize}
\item
Concordia's official page for the ``Speed'' cluster, which includes access request instructions:\\
\url{https://www.concordia.ca/ginacody/aits/speed.html}
\item
All Speed users are subscribed to the \texttt{hpc-ml} mailing list.
\end{itemize}
% TODO: for now comment out for 7.0; if when we update that
% preso, we will re-link it here. However, keep the citation.
\nocite{speed-intro-preso}
%\item
%\href
% {https://docs.google.com/presentation/d/1zu4OQBU7mbj0e34Wr3ILXLPWomkhBgqGZ8j8xYrLf44}
% {Speed Server Farm Presentation 2022}~\cite{speed-intro-preso}.
% ----------------------------- 1.2 Team ---------------------------------------
% ------------------------------------------------------------------------------
\subsection{Team}
\label{sect:speed-team}
Speed is supported by:
\begin{itemize}
\item
Serguei Mokhov, PhD, Manager, Networks, Security and HPC, AITS
\item
Gillian Roper, Senior Systems Administrator, HPC, AITS
\item
Carlos Alarcón Meza, Systems Administrator, HPC and Networking, AITS
\item
Farah Salhany, IT Instructional Specialist, AITS
\end{itemize}
\noindent We receive support from the rest of the AITS teams, such as NAG, SAG, FIS, and DOG.\\
\url{https://www.concordia.ca/ginacody/aits.html}
% ----------------------------- 1.3 What Speed Consists of ---------------------
% ------------------------------------------------------------------------------
\subsection{What Speed Consists of}
\label{sect:speed-arch}
\begin{itemize}
\item
Twenty-four (24) 32-core compute nodes, each with 512~GB of memory and
approximately 1~TB of local volatile-scratch disk space (pictured in \xf{fig:speed-pics}).
\item
Twelve (12) NVIDIA Tesla P6 GPUs, with 16~GB of GPU memory (compatible with the
CUDA, OpenGL, OpenCL, and Vulkan APIs).
\item
4 VIDPRO nodes (ECE, Dr.~Amer), with 6 P6 cards, 6 V100 cards (32~GB), and
256~GB of RAM.
\item
7 new SPEED2 servers, each with 256 CPU cores and 4x~A100 80~GB GPUs (each partitioned
into 4x~20~GB MIGs), with larger local storage for TMPDIR (see \xf{fig:speed-architecture-full}).
\item
One AMD FirePro S7150 GPU, with 8~GB of memory (compatible with the
DirectX, OpenGL, OpenCL, and Vulkan APIs).
\item
Salus compute node (CSSE CLAC, Drs.~Bergler and Kosseim), with 56 cores and 728~GB of RAM, see \xf{fig:speed-architecture-full}.
\item
Magic subcluster partition (ECE, Dr.~Khendek, 11 nodes, see \xf{fig:speed-architecture-full}).
\item
Nebular subcluster partition (CIISE, Drs.~Yan, Assi, Ghafouri, et al., Nebulae GPU node with 2x RTX 6000 Ada 48GB cards,
Stellar compute node, and Matrix 177TB storage/compute node, see \xf{fig:speed-architecture-full}).
\end{itemize}
\begin{figure}[htpb]
\centering
\includegraphics[width=\columnwidth]{images/speed-pics}
\caption{Speed}
\label{fig:speed-pics}
\end{figure}
\begin{figure}[htpb]
\centering
\includegraphics[width=\columnwidth]{images/speed-architecture-full}
\caption{Speed Cluster Hardware Architecture}
\label{fig:speed-architecture-full}
\end{figure}
\begin{figure}[htpb]
\centering
\includegraphics[width=\columnwidth]{images/slurm-arch}
\caption{Speed SLURM Architecture}
\label{fig:slurm-arch}
\end{figure}
% ----------------------------- 1.4 What Speed Is Ideal For --------------------
% ------------------------------------------------------------------------------
\subsection{What Speed Is Ideal For}
\label{sect:speed-is-for}
\begin{itemize}
\item
Design, develop, test, and run parallel, batch, and other algorithms and scripts with partial data sets.
``Speed'' has been optimized for compute jobs that are multi-core aware,
require a large memory space, or are iteration intensive.
\item
Prepare jobs for large clusters such as:
\begin{itemize}
\item Digital Research Alliance of Canada (Calcul Quebec and Compute Canada)
\item Cloud platforms
\end{itemize}
\item
Jobs that are too demanding for a desktop.
\item
Single-core batch jobs; multithreaded jobs typically up to 32 cores (i.e., a single machine).
\item
Multi-node multi-core jobs (MPI).
\item
Anything that can fit into a 500-GB memory space and a \textbf{speed scratch} space of approximately 10~TB.
\item
CPU-based jobs.
\item
CUDA GPU jobs.
\item
Non-CUDA GPU jobs using OpenCL.
\end{itemize}
% ----------------------------- 1.5 What Speed Is Not --------------------------
% ------------------------------------------------------------------------------
\subsection{What Speed Is Not}
\label{sect:speed-is-not}
\begin{itemize}
\item Speed is not a web host and does not host websites.
\item Speed is not meant for Continuous Integration (CI) automation deployments for Ansible or similar tools.
\item Speed does not run Kubernetes or other container orchestration software.
\item Speed does not run Docker. (\textbf{Note:} Speed does run Singularity, and many Docker containers can be converted to Singularity
containers with a single command. See \xs{sect:singularity-containers}.)
\item Speed is not for jobs executed outside of the scheduler. (Jobs running outside of the scheduler will be killed and all data lost.)
\end{itemize}
% ----------------------------- 1.6 Available Software -------------------------
% ------------------------------------------------------------------------------
\subsection{Available Software}
\label{sect:available-software}
There is a wide range of open-source and commercial software installed and available on ``Speed.''
This includes Abaqus~\cite{abaqus}, AllenNLP, Anaconda, ANSYS, Bazel,
COMSOL, CPLEX, CUDA, Eclipse, Fluent~\cite{fluent}, Gurobi, MATLAB~\cite{matlab,scholarpedia-matlab},
OMNeT++, OpenCV, OpenFOAM, OpenMPI, OpenPMIx, ParaView, PyTorch, QEMU, R, Rust, and Singularity among others.
Programming environments include various versions of Python, C++/Java compilers, TensorFlow, OpenGL, OpenISS, and {\marf}~\cite{marf}.\\
In particular, there are over 2200 programs available in \texttt{/encs/bin} and \texttt{/encs/pkg} under Scientific Linux 7 (EL7).
We are building an equivalent array of programs for the EL9 SPEED2 nodes. To see the packages available, run \texttt{ls -al /encs/pkg/} on \texttt{speed.encs}.
See a complete list in \xa{sect:software-details}.\\
\noindent
\textbf{Note:} We do our best to accommodate custom software requests. Python environments can use user-custom installs
from within the scratch directory.
% ----------------------------- 1.7 Requesting Access --------------------------
% ------------------------------------------------------------------------------
\subsection{Requesting Access}
\label{sect:access-requests}
After reviewing the ``What Speed is'' (\xs{sect:speed-is-for}) and
``What Speed is Not'' (\xs{sect:speed-is-not}), request access to the ``Speed''
cluster by emailing: \texttt{rt-ex-hpc AT encs.concordia.ca}.
\begin{itemize}
\item GCS ENCS faculty and staff may request access directly.
\item GCS students must include the following in their request message:
\begin{itemize}
\item GCS ENCS username
\item Name and email (CC) of the approver -- either a supervisor, course instructor,
or a department representative (e.g., in the case of undergraduate or M.Eng.\ students it
can be the Chair, associate chair, a technical officer, or a department administrator) for approval.
\item Written request from the
%supervisor or instructor
approver
for the GCS ENCS username to be granted access to ``Speed.''
\end{itemize}
\item Non-GCS students taking a GCS course will have their GCS ENCS account created automatically, but still need the course instructor's approval to use the service.
\item Non-GCS faculty and students need to get a ``sponsor'' within GCS, so that a guest GCS ENCS account is created first. A sponsor can be any GCS faculty member
you collaborate with. Failing that, request approval from our Dean's Office
via our Associate Deans, Drs.~Eddie Hoi Ng or Emad Shihab.
\item External entities collaborating with GCS Concordia researchers should also go through the Dean's Office for approvals.
\end{itemize}
% The web page is currently less detailed than the above.
%For detailed instructions, refer to the Concordia
%\href{https://www.concordia.ca/ginacody/aits/speed.html}{Computing (HPC) Facility: Speed} webpage.
% ------------------------------------------------------------------------------
% 2 Job Management
% ------------------------------------------------------------------------------
\section{Job Management}
\label{sect:job-management}
We use SLURM as the workload manager. It supports primarily two types of jobs: batch and interactive.
Batch jobs are used to run unattended tasks, whereas
interactive jobs are ideal for setting up virtual environments, compilation, and debugging.\\
\noindent \textbf{Note:} In the following instructions, anything bracketed like \verb+<>+ indicates a
label/value to be replaced (the entire bracketed term needs replacement).\\
\noindent Job instructions in a script start with the \verb+#SBATCH+ prefix, for example:
\begin{verbatim}
#SBATCH --mem=100M -t 600 -J <job-name> -A <slurm account>
#SBATCH -p pg --gpus=2 --mail-type=ALL
\end{verbatim}
%
For complex compute steps within a script, use \tool{srun}. We recommend using \tool{salloc} for interactive jobs as it supports multiple steps.
However, \tool{srun} can also be used to start interactive jobs (see \xs{sect:interactive-jobs}).
%
Common and required job parameters include:
%
\begin{multicols}{2}
\begin{itemize}
\item
memory (\option{--mem}),
\item
time (\option{-t}),
\item
\option{--job-name} (\option{-J}),
\item
slurm project account (\option{-A}),
\item
partition (\option{-p}),
\item
mail type (\option{--mail-type}),
\item
ntasks (\option{-n}),
\item
CPUs per task (\option{--cpus-per-task}).
\end{itemize}
\end{multicols}
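\noindent For illustration, a job script header combining these parameters might look like the following
(a minimal sketch; the job name, account, time, and partition values are placeholders to adapt to your case):
\begin{verbatim}
#SBATCH --job-name=myjob --mem=4G -t 60
#SBATCH -A <slurm account> -p ps --mail-type=ALL
#SBATCH -n 1 --cpus-per-task=4
\end{verbatim}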
% -------------- 2.1 Getting Started ------------------------
% -----------------------------------------------------------
\subsection{Getting Started}
\label{sect:getting-started}
Before getting started, please review the ``What Speed is'' (\xs{sect:speed-is-for})
and ``What Speed is Not'' (\xs{sect:speed-is-not}).
Once your GCS ENCS account has been granted access to ``Speed'',
use your GCS ENCS account credentials to create an SSH connection to
\texttt{speed} (an alias for \texttt{speed-submit.encs.concordia.ca}).\\
All users are expected to have a basic understanding of
Linux and its commonly used commands (see \xa{sect:faqs} for resources).
% 2.1.1 SSH Connections
% -----------------------
\subsubsection{SSH Connections}
\label{sect:ssh}
Requirements to create connections to ``Speed'':
\begin{enumerate}
\item \textbf{Active GCS ENCS user account:} Ensure you have an active GCS ENCS user account with
permission to connect to Speed (see \xs{sect:access-requests}).
\item \textbf{VPN Connection} (for off-campus access): If you are off-campus, you will need to establish an active connection to Concordia's VPN,
which requires a Concordia netname.
\item \textbf{Terminal Emulator for Windows:} On Windows systems, use a terminal emulator such as PuTTY, Cygwin, or MobaXterm.
\item \textbf{Terminal for macOS:} macOS systems have a built-in Terminal app or \tool{xterm} that comes with XQuartz.
\end{enumerate}
\noindent To create an SSH connection to Speed, open a terminal window and type the following command, replacing \verb!<ENCSusername>! with your ENCS account's username:
\begin{verbatim}
ssh <ENCSusername>@speed.encs.concordia.ca
\end{verbatim}
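\noindent Optionally, to avoid retyping the full host name, you can add an alias to the SSH client
configuration on your own machine (a convenience sketch; the alias name \texttt{speed} is arbitrary):
\begin{verbatim}
# ~/.ssh/config on your local machine
Host speed
    HostName speed.encs.concordia.ca
    User <ENCSusername>
\end{verbatim}
\noindent after which \texttt{ssh speed} is sufficient.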
\noindent For detailed instructions on securely connecting to a GCS server, refer to the AITS FAQ:
\href{https://www.concordia.ca/ginacody/aits/support/faq/ssh-to-gcs.html}{How do I securely connect to a GCS server?}
% 2.1.2 Environment Set Up
% --------------------------
% TMP scheduler-specific section
\subsubsection{Environment Set Up}
\label{sect:envsetup}
\input{scheduler-env}
% -------------- 2.2 Job Submission Basics ------------------
% -----------------------------------------------------------
\subsection{Job Submission Basics}
\label{sect:job-submission-basics}
Preparing your job for submission is fairly straightforward.
Start by basing your job script on one of the examples available in the \texttt{src/}
directory of our \href{https://github.com/NAG-DevOps/speed-hpc}{GitHub repository}.
You can clone the repository via the command line to get the examples:
\begin{verbatim}
git clone --depth=1 https://github.com/NAG-DevOps/speed-hpc.git
cd speed-hpc/src
\end{verbatim}
\noindent The job script is a shell script that contains directives, module loads, and user scripting.
To quickly run some sample jobs, use the following commands:
\begin{verbatim}
sbatch -p ps -t 10 env.sh
sbatch -p ps -t 10 bash.sh
sbatch -p ps -t 10 manual.sh
sbatch -p pg -t 10 lambdal-singularity.sh
\end{verbatim}
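\noindent As a minimal illustration of a job script's anatomy (directives, module loads, and user scripting),
consider the following sketch; the module, job, and program names are placeholders to replace with your own:
\begin{verbatim}
#!/bin/bash
#SBATCH --job-name=example          ## give the job a name
#SBATCH --mem=1G -t 10 -p ps        ## memory, time limit (minutes), partition
module load python/3.11.0/default   ## load the software the job depends on
srun python my_script.py            ## user scripting: the actual work
\end{verbatim}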
% 2.2.1 Directives
% -------------------
% TMP scheduler-specific section
\subsubsection{Directives}
\label{sect:directives}
\input{scheduler-directives}
% 2.2.2 Module Loads
% -------------------
%\subsubsection{Module Loads}
\subsubsection{Working with Modules}
\label{sect:modules}
After setting the directives in your job script, the next section typically involves loading
the necessary software modules. The \tool{module} command is used to manage the user environment;
make sure to load all the modules your job depends on. You can check available modules with the
\texttt{module avail} command. Loading the correct modules ensures that your environment is properly
set up for execution.\\
\noindent To list the modules for a particular program (\tool{matlab}, for example):
%
\small
\begin{verbatim}
module avail
module -t avail matlab ## show the list for a particular program (e.g., matlab)
module -t avail m ## show the list for all programs starting with m
\end{verbatim}
\normalsize
\noindent For example, insert the following in your script to load the \tool{matlab/R2023a} module:
\begin{verbatim}
module load matlab/R2023a/default
\end{verbatim}
\noindent
\textbf{Note:} you can remove a module from active use by replacing \option{load} with \option{unload}.\\
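\noindent For example, to unload the \tool{matlab} module loaded above:
\begin{verbatim}
module unload matlab/R2023a/default
\end{verbatim}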
\noindent To list loaded modules:
\begin{verbatim}
module list
\end{verbatim}
\noindent To purge all software in your working environment:
\begin{verbatim}
module purge
\end{verbatim}
% 2.2.3 User Scripting
% -------------------
% TMP scheduler-specific section
\subsubsection{User Scripting}
\label{sect:scripting}
\input{scheduler-scripting}
% scheduler-scripting also includes:
% 2.3 Sample Job Script
% 2.4 Common Job Management Commands Summary
% 2.5 Advanced sbatch Options
% 2.6 Array Jobs
% 2.7 Requesting Multiple Cores
% 2.8 Interactive Jobs
% 2.8.1 Command Line
% 2.8.2 Graphical Applications
% 2.8.3 Jupyter Notebooks in Singularity
% 2.8.4 JupyterLab in Conda and Pytorch
% 2.8.5 JupyterLab + Pytorch in Python venv
% 2.8.6 Visual Studio Code
% -------------- 2.9 Scheduler Environment Variables ----------
% -------------------------------------------------------------
\subsection{Scheduler Environment Variables}
\label{sect:env-vars}
The scheduler provides several environment variables that can be useful in your job scripts.
These variables can be accessed within the job using commands like \tool{env} or \tool{printenv}.
Many of these variables start with the prefix \texttt{SLURM}.\\
\noindent Here are some of the most useful environment variables:
\begin{itemize}
\item
\api{\$TMPDIR} (and \api{\$SLURM\_TMPDIR}):
% TODO: verify temporal existence
This is the path to the job's temporary space on the node. It \emph{only} exists for the duration of the job.
If you need the data from this temporary space, ensure you copy it before the job terminates.
\item
\api{\$SLURM\_SUBMIT\_DIR}:
The path to the job's working directory (likely an NFS-mounted path).
If \option{--chdir} was stipulated, that path is taken; if not,
the path defaults to your home directory.
\item
\api{\$SLURM\_JOBID}:
This variable holds the current job's ID, which is useful for job
manipulation and reporting within the job's process.
\item
\api{\$SLURM\_NTASKS}: the number of cores requested for the job. This variable can
be used in place of hardcoded thread-request declarations, e.g., for
Fluent or similar.
\item
\api{\$SLURM\_JOB\_NODELIST}:
This lists the nodes participating in your job.
\item \api{\$SLURM\_ARRAY\_TASK\_ID}:
For array jobs, this variable represents the task ID
(refer to \xs{sect:array-jobs} for more details on array jobs).
\end{itemize}
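\noindent As a brief illustration of how these variables are typically used together (a minimal sketch;
the input and program names are placeholders, and a full example is in \xf{fig:tmpdir.sh}):
\begin{verbatim}
cd $TMPDIR                            ## work on the node-local scratch space
cp $SLURM_SUBMIT_DIR/input.dat .      ## stage input once at the start
./my_program input.dat > output.dat   ## intermediate I/O stays local to the node
cp output.dat $SLURM_SUBMIT_DIR/      ## copy results back before the job ends
\end{verbatim}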
\noindent
For a more comprehensive list of environment variables, refer to the SLURM documentation for
\href{https://slurm.schedmd.com/srun.html#SECTION_INPUT-ENVIRONMENT-VARIABLES}{Input Environment Variables} and
\href{https://slurm.schedmd.com/srun.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES}{Output Environment Variables}.\\
\noindent
An example script that utilizes some of these environment variables
is in \xf{fig:tmpdir.sh}.
\begin{figure}[htpb]
\lstinputlisting[language=csh,frame=single,basicstyle=\scriptsize\ttfamily]{tmpdir.sh}
\caption{Source code for \file{tmpdir.sh}}
\label{fig:tmpdir.sh}
\end{figure}
% -------------- 2.10 SSH Keys for MPI ------------------------
% -------------------------------------------------------------
\subsection{SSH Keys for MPI}
\label{sect:ssh-mpi}
Some programs, such as Fluent, utilize MPI (Message Passing Interface) for parallel processing.
MPI requires `passwordless login', which is achieved through SSH keys. Here are the steps to set up SSH keys for MPI:
\begin{itemize}
\item
Navigate to the \texttt{.ssh} directory
\begin{verbatim}
cd ~/.ssh
\end{verbatim}
\item
Generate a new SSH key pair (Accept the default location and leave the passphrase blank)
\begin{verbatim}
ssh-keygen -t ed25519
\end{verbatim}
\item
Authorize the Public Key:
\begin{verbatim}
cat id_ed25519.pub >> authorized_keys
\end{verbatim}
If the \texttt{\href{https://www.ssh.com/academy/ssh/authorized-keys-file}{authorized\_keys}} file does not exist, use
\begin{verbatim}
cat id_ed25519.pub > authorized_keys
\end{verbatim}
\item
Set permissions: ensure the correct permissions are set for the `authorized\_keys' file and your home directory
(most users will already have these permissions by default):
\begin{verbatim}
chmod 600 ~/.ssh/authorized_keys
chmod 700 ~
\end{verbatim}
\end{itemize}
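\noindent To confirm that passwordless login works, you can try connecting from within the cluster to a node
where your job is running (a quick check; replace \verb+<node-name>+ with an entry from \api{\$SLURM\_JOB\_NODELIST}):
\begin{verbatim}
ssh <node-name> hostname
\end{verbatim}
\noindent If the node's host name is printed without a password prompt, the keys are set up correctly.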
% -------------- 2.11 Creating Virtual Environments -----------
% -------------------------------------------------------------
\subsection{Creating Virtual Environments}
\label{sect:environments}
\label{sect:examples-venv}
The following documentation is specific to \textbf{Speed}.
%HPC Facility at the
%Gina Cody School of Engineering and Computer Science.
Other clusters may have their own requirements.
%
Virtual environments are typically created using Conda or Python.
Another option is Singularity (detailed in \xs{sect:singularity-containers}).
These environments are usually created once during an interactive session
before submitting a batch job to the scheduler.
%
The job script submitted to the scheduler should:
\begin{enumerate}
\item Activate the virtual environment.
\item Use the virtual environment.
\item Deactivate the virtual environment at the end of the job.
\end{enumerate}
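\noindent In a batch script, these three steps might look like the following sketch (using a Conda environment here;
the environment path and program name are placeholders):
\begin{verbatim}
module load anaconda3/2023.03/default
conda activate /speed-scratch/$USER/myconda   ## 1. activate
python my_experiment.py                       ## 2. use
conda deactivate                              ## 3. deactivate
\end{verbatim}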
% 2.11.1 Anaconda
% -------------------
\subsubsection{Anaconda}
\label{sect:conda-venv}
To create an Anaconda environment, follow these steps:
\begin{enumerate}
\item Request an interactive session
\begin{verbatim}
salloc -p pg --gpus=1
\end{verbatim}
\item
Load the Anaconda module and create your Anaconda environment in your speed-scratch directory by using
the \option{--prefix} option (without this option, the environment will be created in your home directory by default).
\begin{verbatim}
module load anaconda3/2023.03/default
conda create --prefix /speed-scratch/$USER/myconda
\end{verbatim}
\item
List environments (to view your conda environment)
\begin{verbatim}
conda info --envs
# conda environments:
#
base * /encs/pkg/anaconda3-2023.03/root
/speed-scratch/a_user/myconda
\end{verbatim}
\item
Activate the environment
\begin{verbatim}
conda activate /speed-scratch/$USER/myconda
\end{verbatim}
\item
Add \tool{pip} to your environment (this will install \tool{pip} and \tool{pip}'s dependencies,
including \tool{python}, into the environment.)
\begin{verbatim}
conda install pip
\end{verbatim}
\end{enumerate}
\noindent
A consolidated example using Conda:
\begin{verbatim}
salloc -p pg --gpus=1 --mem=10G -A <slurm account name>
cd /speed-scratch/$USER
module load python/3.11.0/default
conda create -p /speed-scratch/$USER/pytorch-env
conda activate /speed-scratch/$USER/pytorch-env
conda install python=3.11.0
pip3 install torch torchvision torchaudio --index-url \
https://download.pytorch.org/whl/cu117
....
conda deactivate
exit # end the salloc session
\end{verbatim}
\noindent
If you encounter a \textbf{``no space left'' error} while creating Conda environments, please refer to
\xa{sect:quota-exceeded}. Most likely, you forgot the \option{--prefix} option or the environment variables described below.\\
\noindent
\textbf{Important Note:} \tool{pip} (and \tool{pip3}) are package installers for Python. When you use
\texttt{pip install}, it installs packages from the Python Package Index (PyPI), whereas,
\texttt{conda install} installs packages from Anaconda's repository.
% -----------------------------------------------------------------------------
\paragraph{Conda Env without \option{--prefix}}
If you don't want to use the \option{--prefix} option every time you create a new environment and
do not want to use the default home directory, you can create a new directory and set the following
variables to point to the newly created directory, e.g.:
\begin{verbatim}
mkdir -p /speed-scratch/$USER/conda
setenv CONDA_ENVS_PATH /speed-scratch/$USER/conda
setenv CONDA_PKGS_DIRS /speed-scratch/$USER/conda/pkg
\end{verbatim}
\noindent
If you want to make these changes permanent, add the variables to your \texttt{.tcshrc}
or \texttt{.bashrc} (depending on the default shell you are using).
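\noindent For \tool{bash} users, the equivalent settings are (same directories, \tool{bash} syntax):
\begin{verbatim}
mkdir -p /speed-scratch/$USER/conda
export CONDA_ENVS_PATH=/speed-scratch/$USER/conda
export CONDA_PKGS_DIRS=/speed-scratch/$USER/conda/pkg
\end{verbatim}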
% 2.11.2 Python
% -----------------------------------------------------------------------------
\subsubsection{Python}
\label{sect:python-venv}
Setting up a Python virtual environment is straightforward.
Here is an example that uses a Python virtual environment:
\begin{verbatim}
salloc -p pg --gpus=1 --mem=10G -A <slurm account name>
cd /speed-scratch/$USER
module load python/3.9.1/default
mkdir -p /speed-scratch/$USER/tmp
setenv TMPDIR /speed-scratch/$USER/tmp
setenv TMP /speed-scratch/$USER/tmp
python -m venv $TMPDIR/testenv   ## testenv is the name of the virtual environment
source /speed-scratch/$USER/tmp/testenv/bin/activate.csh
pip install modules...
deactivate
exit
\end{verbatim}
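\noindent \textbf{Note:} the example above uses \tool{tcsh} syntax (\texttt{setenv}, \texttt{activate.csh}).
If your shell is \tool{bash}, the equivalent lines are:
\begin{verbatim}
export TMPDIR=/speed-scratch/$USER/tmp
export TMP=/speed-scratch/$USER/tmp
python -m venv $TMPDIR/testenv
source /speed-scratch/$USER/tmp/testenv/bin/activate
\end{verbatim}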
\noindent
See, e.g.,
\href
{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/gurobi-with-python.sh}
{\texttt{gurobi-with-python.sh}}\\
\noindent
\textbf{Important Note:} our partition \texttt{ps} is used for CPU jobs, while \texttt{pg},
\texttt{pt}, and \texttt{cl} are used for GPU jobs. You do not need to use \option{--gpus}
when preparing environments for CPU jobs.\\
\noindent
\textbf{Note:} Python environments are also preferred over Conda
on some clusters; see the note in~\xs{sect:jupyterlabs-venv}.
% -------------- 2.12 Example Job Script: Fluent --------------
% -------------------------------------------------------------
% TMP scheduler-specific section
% TODO: delete the file and move the content here
\input{scheduler-job-examples}
% scheduler-job-examples includes:
% 2.12 Sample Job Script: fluent
% 2.13 Example Job Script: EfficientDet
% 2.14 Java Jobs
% 2.15 Scheduling on the GPU Nodes
% 2.15.1 P6 on Multi-GPU, Multi-Node
% 2.15.2 CUDA
% 2.15.3 Special Notes for Sending CUDA Jobs to the GPU Queue
% 2.15.4 OpenISS Examples
% 2.16 Singularity Containers
% ------------------------------------------------------------------------------
% 3 Conclusion
% ------------------------------------------------------------------------------
\section{Conclusion}
\label{sect:conclusion}
The cluster operates on a ``first-come, first-served'' basis until it reaches full capacity.
After that, job positions in the queue are determined based on past usage.
The scheduler does attempt to fill gaps, so occasionally, a single-core job with lower priority
may be scheduled before a multi-core job with higher priority.
% -------------- 3.1 Important Limitations --------------------
% -------------------------------------------------------------
\subsection{Important Limitations}
\label{sect:limitations}
While Speed is a powerful tool, it is essential to recognize its limitations to use it effectively:
\begin{itemize}
\item
New users are limited to a total of 32 cores and 4 GPUs. If you need more cores temporarily,
%(up to 192 cores or six jobs of 32 cores each),
please contact \texttt{rt-ex-hpc AT encs.concordia.ca}.
\item
Batch job sessions can run for a maximum of one week.
Interactive jobs are limited to 24 hours (see \xs{sect:interactive-jobs}).
\item
Scripts can live in your NFS-provided home directory, but substantial data
should be stored in your cluster-specific directory (located at \verb+/speed-scratch/<ENCSusername>/+).
NFS is suitable for short-term activities but not for long-term operations.
\textbf{Data that a job will read multiple times} should be copied at the start to the scratch disk of a compute node using
\api{\$TMPDIR} (and possibly \api{\$SLURM\_SUBMIT\_DIR}).
Intermediate job data should be produced in \api{\$TMPDIR}, and once a job is near completion,
these data should be copied to your NFS-mounted home directory (or other NFS-mounted space).
\textbf{In other words, IO-intensive operations should be performed locally whenever possible,
reserving network activity for the start and end of jobs.}
\item
Your current resource allocation is based on past usage,
which considers approximately one week's worth of past wall clock time
(time spent on the node(s)) and compute activity (on the node(s)).
\item
Jobs must always be run within the scheduler's system. Repeat offenders who
run jobs outside the scheduler risk losing cluster access.
\end{itemize}
% -------------- 3.2 Tips/Tricks ------------------------------
% -------------------------------------------------------------
\subsection{Tips/Tricks}
\label{sect:tips}
\begin{itemize}
\item
Ensure that files and scripts have Linux line breaks.
Use the \tool{file} command to verify and \tool{dos2unix} to convert if necessary (see the examples after this list).
\item
Use \tool{rsync} (preferred over \tool{scp}) for copying or moving large amounts of data.
\item
Before transferring a large number of files between NFS-mounted storage and
the cluster, compress the files into a \tool{tar} archive.
\item
If you plan to use a different shell (e.g., \tool{bash}~\cite{aosa-book-vol1-bash}),
change the shell declaration at the beginning of your script(s).
\item
Request resources (cores, memory, GPUs) that closely match the actual needs of your job.
Requesting significantly more than necessary can make your job harder to schedule when
resources are limited. Always check the efficiency of your job with \tool{seff}
and/or the \option{--mail-type=ALL} notifications, and adjust your job parameters accordingly.
\item
For any concerns or questions, email \texttt{rt-ex-hpc AT encs.concordia.ca}.
\end{itemize}
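\noindent For example (following up on the tips above; the file and directory names are placeholders):
\begin{verbatim}
file job.sh                       ## verify line endings (CRLF vs. LF)
dos2unix job.sh                   ## convert to Linux line endings if needed
tar -czf data.tar.gz data/        ## bundle many small files before transferring
rsync -avz data.tar.gz \
  <ENCSusername>@speed.encs.concordia.ca:/speed-scratch/<ENCSusername>/
\end{verbatim}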
% -------------- 3.3 Use Cases --------------------------------
% -------------------------------------------------------------
\subsection{Use Cases}
\label{sect:cases}
\begin{itemize}
\item
HPC Committee's initial batch of about 6 students (end of 2019):
\begin{itemize}
\item A 10,000-iteration Fluent job finished in $<26$ hours vs.\ 46 hours on Calcul Quebec
\end{itemize}
\item
NAG's MAC spoofer analyzer~\cite{mac-spoofer-analyzer-intro-c3s2e2014,mac-spoofer-analyzer-detail-fps2014},
such as \url{https://github.com/smokhov/atsm/tree/master/examples/flucid}
\begin{itemize}
\item compilation of forensic computing reasoning cases about false or true positives of hardware address spoofing in the labs
\end{itemize}
\item
S4 LAB/GIPSY R\&D Group's work:
\begin{itemize}
\item MARFCAT and MARFPCAT (OSS signal processing and machine learning tools for
vulnerable and weak code analysis and network packet capture
analysis)~\cite{marfcat-nlp-ai2014,marfcat-sate2010-nist,fingerprinting-mal-traffic}
\item Web service data conversion and analysis
\item {\flucid} encoders (translation of large log data into {\flucid}~\cite{mokhov-phd-thesis-2013} for forensic analysis)
\item Genomic alignment exercises
\end{itemize}
\item \textbf{Best Paper award}, \bibentry{job-failure-prediction-compsysarch2024}
% RT521027
\item \bibentry{unsteady-wake-ouedraogo_essel_2023}
\item \bibentry{effects-reynolds-ouedraogo_essel_2024}
\item \bibentry{nozzle-effects-APS_2024}
\item \bibentry{effects-reynolds-APS-ouedraogo_essel_2024}
\item \bibentry{oi-containers-poster-siggraph2023}
\item \bibentry{Gopal2024Sep}
\item \bibentry{Gopal2023Mob}
% the next one is not visible (it produces an error)
%\item \bibentry{roof-mounted-vawt-2023}
\item \bibentry{root-mounted-vawt-corner-2023}
\item \bibentry{cfd-modeling-turbine-2023}
\item \bibentry{small-vaxis-turbine-corner-2022}
\item \bibentry{cfd-vaxis-turbine-wake-2022}
\item \bibentry{numerical-turbulence-vawt-2021}
\item \bibentry{niksirat2020}
\item The work ``\bibentry{lai-haotao-mcthesis19}'' using TensorFlow and Keras on OpenISS
adjusted to run on Speed based on the repositories:
\begin{itemize}
\item \bibentry{openiss-reid-tfk} and
\item \bibentry{openiss-yolov3}
\end{itemize}
and their forks by the team.
\end{itemize}
% ------------------------------------------------------------------------------
\appendix
% ------------------------------------------------------------------------------
% A History
% ------------------------------------------------------------------------------
\section{History}
% A.1 Acknowledgments
% -------------------------------------------------------------
\subsection{Acknowledgments}
\label{sect:acks}
\begin{itemize}
\item
The first 6 to 6.5 versions of this manual, the early UGE job script samples,
Singularity testing, and user support were produced by Dr.~Scott Bunnell
during his time at Concordia as part of the NAG/HPC group. We thank
him for his contributions.
\item
The HTML version with devcontainer support was contributed by Anh H Nguyen.
\item
Dr.~Tariq Daradkeh was our IT Instructional Specialist from August 2022 to September 2023,
working on the scheduler, scheduling research, end-user support, and the integration of
examples, such as YOLOv3 in \xs{sect:openiss-yolov3}, among other tasks. We continue to
collaborate on HPC/scheduling research (see~\cite{job-failure-prediction-compsysarch2024}).
\end{itemize}
% A.2 Migration from UGE to SLURM
% -------------------------------------------------------------
\subsection{Migration from UGE to SLURM}
\label{appdx:uge-to-slurm}
For long-term users who started off with Grid Engine, here are some resources
to help transition and map your job submission process to SLURM.
\begin{itemize}
\item
Queues are called ``partitions'' in SLURM. Our mapping from the GE queues
to SLURM partitions is as follows:
\begin{verbatim}
GE => SLURM
s.q ps
g.q pg
a.q pa
\end{verbatim}
We also have a new partition, \texttt{pt}, covering the SPEED2 nodes,
which did not exist under GE.
\item
Commands and command options mappings are found in \xf{fig:rosetta-mappings} from\\
\url{https://slurm.schedmd.com/rosetta.pdf}\\
\url{https://slurm.schedmd.com/pdfs/summary.pdf}\\
Other related helpful resources from similar organizations that either have used
SLURM for a while or have also transitioned to it:\\
\url{https://docs.alliancecan.ca/wiki/Running_jobs}\\
\url{https://www.depts.ttu.edu/hpcc/userguides/general_guides/Conversion_Table_1.pdf}\\
\url{https://docs.mpcdf.mpg.de/doc/computing/clusters/aux/migration-from-sge-to-slurm}
\begin{figure}[htpb]
\includegraphics[width=\columnwidth]{images/rosetta-mapping}
\caption{Rosetta Mappings of Scheduler Commands from SchedMD}
\label{fig:rosetta-mappings}
\end{figure}
\item
\noindent
\textbf{NOTE:} If you have used UGE commands in the past, you probably still have lines like the following
in your shell startup files; \textbf{they should now be removed}, as they have no use in SLURM and
will start giving ``command not found'' errors on login once the software is removed.
For csh/\tool{tcsh}, a sample \file{.tcshrc} file:
\begin{verbatim}
# Speed environment set up
if ($HOSTNAME == speed-submit.encs.concordia.ca) then
source /local/pkg/uge-8.6.3/root/default/common/settings.csh