About

This repository showcases example workflows that use the high-performance computing (HPC) cluster at Florida Atlantic University (FAU) to scale up scientific computing tasks. The workflows are intended as a starting point for researchers at the Max Planck Florida Institute for Neuroscience (MPFI). For detailed guides on specific topics, please refer to the Knowledge Base maintained by the FAU HPC team. You can also request help from the HPC team by submitting a ticket under Services > Research Computing/HPC > Research Application Assistance. Note that this requires an FAU login (see below).

Example Workflows

Specific examples are provided that elaborate on the following topics:

  • Running MATLAB
    • submit jobs in a loop (see the sketch after this list)
    • utilize temporary scratch directories
  • Running Python
    • install custom packages
    • use GPU nodes
    • workflow for DeepLabCut
    • restart unfinished jobs

Please read below for general information on how to access the HPC cluster and submit jobs.

Accessing the clusters

To get access to the HPC cluster, you need an FAU login. Contact the MPFI IT team to request an account. Logging in requires two-factor authentication with the Duo Mobile app.

The Open OnDemand web interface provides user-friendly and immediate access to the HPC cluster. You can

  • upload/download files
  • run interactive sessions, e.g. for
    • Jupyter notebooks
    • MATLAB
    • virtual Linux desktop
  • open a terminal on the login node

If you use the terminal regularly, I highly recommend setting up SSH access and using Zsh. Note that you can also use WSL when working on a Windows machine.
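
A minimal ~/.ssh/config entry could look like the sketch below; the alias, host name, and user name are placeholders, so use the login node address given in the Knowledge Base:

# ~/.ssh/config -- host name and user name are placeholders
Host fau-hpc
    HostName <login-node-address>    # replace with the address from the Knowledge Base
    User <your-fau-username>

With this entry, ssh fau-hpc opens a terminal on the login node.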

There are multiple ways to transfer files between your local machine and the HPC cluster. On Windows, I recommend mounting the HPC cluster as a network drive.
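
From a terminal, scp or rsync also work; a short sketch, assuming the fau-hpc SSH alias from above and a placeholder remote path:

# Copy a local folder to the cluster (the remote path is a placeholder)
rsync -avz ./my_data/ fau-hpc:/path/to/project/my_data/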

Some resources from the Knowledge Base to get started:

Job scheduler

Submitting jobs

The HPC cluster uses the SLURM job scheduler. Jobs are submitted using the sbatch command:

sbatch submit.sh

where submit.sh is a submit script that holds instructions for the job scheduler. Example submit scripts for different tasks are provided in the subfolders of this repository.

The lines in the submit script starting with #SBATCH are instructions for the job scheduler, such as which queue to use, how many CPU cores to request, and how to name the output and error files.

#SBATCH --partition=shortq7
#SBATCH --nodes=1
#SBATCH --cpus-per-task=24
#SBATCH --output="%x.%j.out"
#SBATCH --error="%x.%j.err"

This is equivalent to adding the following flags to the sbatch command:

sbatch --partition=shortq7 --nodes=1 --cpus-per-task=24 --output="%x.%j.out" --error="%x.%j.err" submit.sh

The --output and --error flags specify the file names for the text output and errors created during the job, where %x is the job name and %j is the unique job ID assigned by the job scheduler.
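
Putting the directives together, a complete minimal submit script could look like the sketch below; the module name and the final command are placeholders, and the real examples are in the subfolders of this repository:

#!/bin/bash
#SBATCH --partition=shortq7
#SBATCH --nodes=1
#SBATCH --cpus-per-task=24
#SBATCH --output="%x.%j.out"
#SBATCH --error="%x.%j.err"

# Placeholder payload: load the software you need and run your analysis.
module load matlab
matlab -batch "my_analysis"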

Monitoring jobs

You can monitor the status of your jobs using the squeue command:

squeue -u $USER
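
The default squeue output is fairly terse; a custom format string (see the squeue man page for the codes) can add the details you usually want:

# Job ID, name, state, elapsed time, node count, and nodes/reason
squeue -u $USER -o "%.10i %.25j %.8T %.10M %.6D %R"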

To cancel a job, use the scancel command:

scancel JOB_ID
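
scancel also accepts filters, which is useful when cleaning up many jobs at once:

# Cancel all of your jobs
scancel -u $USER

# Cancel only your pending (not yet running) jobs
scancel -u $USER --state=PENDING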

Show available resources

You can show the available queues and their time limits using the sinfo command:

sinfo

Note that by default you only have access to shortq7, shortq7-gpu, mediumq7, and longq7.
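
To list just the partitions with their time limits, node counts, and node states, a custom sinfo format works well:

# Partition, time limit, node count, and node state
sinfo -o "%.14P %.12l %.6D %.10t"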

Another useful tool is the gnodes script from slurm-utils, which provides a graphical representation of all queues and their utilization. Copy the file to the HPC cluster and run it from the same directory:

gnodes
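
A short sketch of one way to get gnodes onto the cluster, assuming the fau-hpc SSH alias from above and that you have already downloaded the script locally:

# On your local machine: copy the script to your home directory on the cluster
scp gnodes fau-hpc:~/

# On the login node: make it executable and run it
chmod +x gnodes
./gnodes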