To create and activate a conda environment on the HPC2 node, you can use the following commands:
module load conda3/4.X
conda create --name <environment_name> python=3.8
conda activate <environment_name>
These commands will create a conda environment named <environment_name> with Python version 3.8 and activate it. You can replace <environment_name> with the name of the environment that you want to create.
After you have created the conda environment, you can activate it in a new session using the following commands:
module load conda3/4.X
conda activate <environment_name>
These commands will activate the conda environment with the name <environment_name>. You can replace <environment_name> with the name of the environment that you want to activate.
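For example, assuming you want an environment named regression_dl (the name used in the job script later in these notes) with Python 3.8, the full sequence would be:
module load conda3/4.X
conda create --name regression_dl python=3.8
conda activate regression_dl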
You can find more information about creating and activating conda environments in the Conda documentation.
This is a collection of SLURM commands that I have found useful when running jobs on the HPC2 node.
You can find more information about SLURM commands in the SLURM documentation.
To submit a job, you can use either the sbatch or srun command. I have only used sbatch, so I will show an example of using this command below.
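For reference, srun runs a command directly instead of queueing a script; treat the line below as an untested sketch that assumes the gpu-qi partition allows interactive sessions:
srun --partition=gpu-qi --gres=gpu:1 --cpus-per-task=4 --mem=16G --time=01:00:00 --pty bash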
To submit a job to the HPC2 node, you will need to create a job script. Below is an example of a job script called job_script.sh. It sets the job's configuration parameters, activates a conda environment, and runs a Python file:
#!/bin/bash -l
#SBATCH -J training_irf_job
#SBATCH --mem=150G
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:2
#SBATCH --time=15-00:00:00
#SBATCH --partition=gpu-qi
#SBATCH --mail-user=<your_email>
#SBATCH --mail-type=ALL
#SBATCH -o training_irf_jobs/training_irf_job-%j.output
#SBATCH -e training_irf_jobs/training_irf_job-%j.error
# Load conda, activate the environment, and run the training script
module load conda3/4.X
conda activate regression_dl
python training_irf_models.py
The comment lines beginning with #SBATCH set the parameters for your job. The parameters that I've set for this job are:
- -J training_irf_job: The name of the job, which is training_irf_job in this case
- --mem=150G: The amount of memory that you want to allocate to the job
- --cpus-per-task=4: The number of CPUs that you want to allocate to the job
- --gres=gpu:2: The number of GPUs that you want to allocate to the job
- --time=15-00:00:00: The amount of time that you want to allocate to the job (15 days here)
- --partition=gpu-qi: The partition that you want to allocate the job to
- --mail-user=<your_email>: The email address that you want to receive notifications at
  - Note: After speaking with IT, this directive is not functional on the HPC2 node.
- --mail-type=ALL: The type of notifications that you want to receive
  - Note: After speaking with IT, this directive is not functional on the HPC2 node.
- -o training_irf_jobs/training_irf_job-%j.output: The output file for the job, where %j is replaced with the job ID (see the note after this list)
- -e training_irf_jobs/training_irf_job-%j.error: The error file for the job
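One caveat with the -o and -e paths above: SLURM does not create missing directories for output files, so the training_irf_jobs directory has to exist before the job starts or the output and error files will not be written. Since the paths are relative, create the directory once in the folder you submit the job from:
mkdir -p training_irf_jobs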
After you have created the job script, you can submit the job to the HPC2 node using the following command:
sbatch job_script.sh
This file should be sufficient for our purposes. If you need to add more parameters, you can refer to the SLURM documentation.
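If the submission is accepted, sbatch prints the ID assigned to the job, which is what the monitoring and cancellation commands below expect; the confirmation looks something like:
Submitted batch job 12345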
To monitor the jobs that you have submitted to the HPC2 node, you can use the following commands:
- squeue: This command will show you all of the jobs that are currently queued or running on the HPC2 node
- squeue -u nsabzwar: This command will show you all of the jobs for the user nsabzwar
- squeue -p gpu-qi: This command will show you all of the jobs in the partition gpu-qi
- sacct: This command will show you the status of the jobs that you have submitted to the HPC2 node
- sacct -u nsabzwar: This command will show you the status of the jobs submitted by the user nsabzwar
- sacct -X -o jobId,start,end,state: This command will show you the job ID, start time, end time, and state of the jobs that you have submitted, without listing individual job steps
- scontrol show job <job_id>: This command will show you the details of a specific job that you have submitted to the HPC2 node
- scontrol show job 12345: This command will show you the details of the job with the job ID 12345
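These filters can also be combined. For example, assuming the same username and partition as above, the first command below lists only my jobs on gpu-qi and the second gives a compact accounting summary of my jobs:
squeue -u nsabzwar -p gpu-qi
sacct -u nsabzwar -X -o jobId,start,end,state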
You can find more information about these commands in the SLURM documentation.
To cancel a job that you have submitted to the HPC2 node, you can use the following command:
- scancel <job_id>: This command will cancel the job with the job ID <job_id>
- scancel 12345: This command will cancel the job with the job ID 12345
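If you need to cancel all of your jobs at once rather than a single job ID, scancel also accepts a user filter; assuming the same username as above:
scancel -u nsabzwar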
You can find more information about this command in the SLURM documentation.