Skip to content

Latest commit

 

History

History
453 lines (339 loc) · 21 KB

README.md

File metadata and controls

453 lines (339 loc) · 21 KB

Examples

This directory has example job scripts and some tips and tricks how to run certcain things.

TOC

Sample Jobs

These are examples either trivial or some are more elaborate. Some are described in the manual more in detail or vice versa. The examples were written by the Speed team as well as contributed by the users or a result of solving a problem of some kind.

  • Basic examples:
    • tcsh.sh -- default tcsh job script example
    • tmpdir.sh -- example use of TMPDIR on a local node
    • bash.sh -- example use with bash shell as opposed to tcsh
    • manual.sh -- example job to compile our very manual here to PDF and HTML using LaTeX
    • poppler.txt -- Interactive job example: PDF rendering using poppler and pdf2image; instructions and code ready to paste.
  • Common packages:
    • fluent.sh -- Fluent job
    • comsol.sh -- Comsol job
    • matlab-sge.sh -- MATLAB job
  • Advanced or research examples:
    • msfp-speed-job.sh -- MAC Spoofer Investigation starter job script (for detailes see here and here)
    • efficientdet.sh -- efficientdet with Conda environment described below
    • gurobi-with-python.sh -- using Gurobi with Python and Python virtual environment
    • pytorch-multicpu.txt -- using Pytorch with Python virtual environment to run on CPUs; with instructions and code ready to paste.
    • pytorch-multinode-multigpu.sh -- using Pytorch with Python virtual environment to run on Multinodes and MultiGpus
    • lambdal-singularity.sh -- an example use of the Singularity container to run LambdaLabs software stack on the GPU node. The container was built from the docker image as a source.
    • openfoam-multinode.sh -- an example using OpenFoam, icoFoam solver to run on Multinodes-multicpus
    • openiss-reid-speed.sh -- OpenISS computer vision exame for re-edentification, see more in its section
    • openiss-yolo-cpu.sh, openiss-yolo-gpu.sh, and openiss-yolo-interactive.sh -- OpenISS examples with YOLO, related to reid, see more in the corresponding section

Creating Environments and Compiling Code on Speed

Correct Procedure

Overview of preparing environments, compiling code and testing

  • Create an salloc session to the queue you wish to run your jobs (e.g., salloc -p pg --gpus=1 for GPU jobs)
  • Within the salloc session, create and activate an Anaconda environment in your /speed-scratch/ directory using the instructions found in Section 2.11.1 of the manual: https://nag-devops.github.io/speed-hpc/#creating-virtual-environments
  • Compile your code within the environment.
  • Test your code with a limited data set.
  • Once you are satisfied with your test results, exit your salloc session.

Once your environment and code have been tested

Do not use the submit node to create environments or compile code

  • speed-submit is a virtual machine intended to submit user jobs to the job scheduler. It is not intended to compile or run code.
  • Importantly, speed-submit does not have GPU drivers. This means that code compiled on speed-submit will not be compiled against proper GPU drivers.
  • Processes run outside of the scheduler on speed-submit will be killed and you will lose your work.

pip

By default, pip installs packages to a system-wide default location.

Creating environments via pip shound NOT be done outside of an Anaconda environment.

Why you should create an Anaconda environment and not use pip directly from the command line:

  • Using pip directly from the command line affects the system wide environment. If all users use pip in this way, the packages and versions installed via pip may change while your jobs run.
  • Creating Anaconda environments allows you to fully control what python packages, and their versions, are within that environment.
  • It is possible to create multiple conda environments for your different projects.

Environments

Virtual Environment Creation documentation. The following documentation is specific to Speed.

Anaconda

Load the Anaconda module

To view the Anaconda modules available, run module avail anaconda

Load the desired version of anaconda using the module load command.

For example: module load anaconda3/2023.03/default

Initialize Shell

To initialize your shell, run conda init <SHELL_NAME>

The default shell for ENCS accounts is tcsh. Therefore, to initialize your default shell run conda init tcsh

Create an Environment

To create an anaconda environment in your speed-scratch directory, use the --prefix option when executing conda create.

For example: conda create --prefix /speed-scratch/$USER/myconda

Where $USER is an environment variable containing your encs_username

Without the --prefix option, conda create creates the environment in your home directory by default.

List Environments

To view your conda environments, type conda info --envs

# conda environments:
#
base                  *  /encs/pkg/anaconda3-2023.03/root
                         /speed-scratch/<encs_username>/myconda

Activate an Environment

Activate the environment /speed-scratch/<encs_username>/myconda as follows

conda activate /speed-scratch/$USER/myconda

After activating your environment, add pip to your environment by using

conda install pip

This will install pip and pip's dependencies, including python.

Important Note: pip (and pip3) are used to install modules from the python distribution while conda install installs modules from anaconda's repository.

No Space left error when creating Conda Environment

You are using your $HOME directory as conda default directory, the tarballs and pkgs are using all the space

conda clean --all --dry-run will show you the size of tarballs, packages, caches conda clean -all will wipe-out all unused packages, caches and tarballs

If the conda clean hasn't freed enough space, try to set change the location of Conda pkgs to another directory, e.g:

setenv CONDA_PKGS_DIRS /speed-scratch/$USER/tmp/pkgs

Example: Create Conda Environment in Speed

On speed-submit: salloc --mem=10Gb -n1 -pps

On the node where the interactive session is running:

setenv TMPDIR /speed-scratch/$USER/tmp
setenv TMP /speed-scratch/$USER/tmp
module load anaconda3/2023.03/default
setenv CONDA_PKGS_DIRS $TMP/pkgs
conda create -p $TMP/Venv-Name python==3.11
conda activate $TMP/Venv-Name

Conda envs without prefix

If you don't want to use the --prefix option everytime you create a new environment and you don't want to use the default $HOME directory, create a new directory and set CONDA_ENVS_PATH and CONDA_PKGS_DIRS variables to point to the new created directory, e.g:

setenv CONDA_ENVS_PATH /speed-scratch/$USER/condas
setenv CONDA_PKGS_DIRS /speed-scratch/$USER/condas/pkg

If you want to make these changes permanent, add the variables to your .tcshrc or .bashrc (depending on the default shell you are using)

efficientdet

The following steps describing how to create an efficientdet environment on speed, were submitted by a member of Dr. Amer's Research Group.

  • Enter your ENCS user account's speed-scratch directory cd /speed-scratch/$USER
  • load python module load python/3.8.3
  • create virtual environment python3 -m venv my_env_name
  • activate virtual environment source my_env_name/bin/activate.csh
  • install DL packages for Efficientdet
pip install tensorflow==2.7.0
pip install lxml>=4.6.1
pip install absl-py>=0.10.0
pip install matplotlib>=3.0.3
pip install numpy>=1.19.4
pip install Pillow>=6.0.0
pip install PyYAML>=5.1
pip install six>=1.15.0
pip install tensorflow-addons>=0.12
pip install tensorflow-hub>=0.11
pip install neural-structured-learning>=1.3.1
pip install tensorflow-model-optimization>=0.5
pip install Cython>=0.29.13
pip install git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI

Diviner Tools

Diviner Tools is a custom library for pre-processing Diviner RDR LVL1 Channel 7 data by Chantelle Dubois.

OpenFoam-multinode

This example is taken from OpenFoam tutorials section: $FOAM_TUTORIALS/incompressible/icoFoam/cavity/cavity

  1. Go to your speed-scratch directory: cd /speed-scratch/$USER
  2. open a salloc session
  3. Load OpenFoam module: module load OpenFOAM/v2306/default
  4. Copy the cavity example to your speed-scratch space: cp -r $FOAM_TUTORIALS/incompressible/icoFoam/cavity/cavity/ .
  5. Modify cavity/system/decomposeParDict: Remove coeffs section and modify the following: numberOfSubdomains 10; method scotch;
  6. Exit the salloc session, go to the cavity directory and run the script: sbatch --mem=10Gb -pps --constraint=el9 openfoam-multinode.sh

OpenISS-yolov3

This is a case study example on image classification, for more details please visit OpenISS keras-yolo3.

Prerequisites

Images and Videos

Images and videos can be from any source, but a sample video and images are provided in video and image folders in the OpenISS-YOLOv3 Github repository.

YOLOv3 Weights

The YOLOv3 weights can be downloaded from YOLO website. However the script provided includes a command to wget the weights from the link above.

Environment Setup

To set up the virtual development environment, refer to section 2.11 of the Speed manual Creating Virtual Environments for detailed information.

Configuration and execution

  • Log into SPEED and navigate to your speed-scratch directory:

    ssh [email protected]
    cd /speed-scratch/$USER/
    

    Note: To see a live video in an interactive session, enable X11 forwarding. Linux can run X11, however, to run X server on:

    • Windows: use MobaXterm or Putty
    • MacOS: use XQuarz with its xterm

    For more information refer to How to Launch X11 applications

  • Clone the OpenISS-YOLOv3 Github repository

    git clone --depth=1 https://github.com/NAG-DevOps/openiss-yolov3.git
    cd /speed-scratch/$USER/openiss-yolov3
    

Run Non-interactive Script

The script performs the following:

  • Configures job resources and paths for Conda environments.
  • Creates, or activates the Conda environment, and installs required packages if necessary.
  • Downloads YOLOv3 weights.
  • Converts the Darknet YOLO model to Keras format.
  • Runs YOLO inference on a sample video.
  • Deactivates the Conda environment and exits.

Run Interactive Script

Note To run interactive job we need to use ssh -X

  • Request resources with salloc command

    salloc --x11=first --mem=60G -n 32 --gpus=1 -p pt
    
  • Download and run openiss-yolo-interactive.sh script from Speed-HPC Github repository. You need to add permission access to the project files.

    chmod u+x *.sh 
    ./openiss-yolo-interactive.sh
    
  • A pop up window will show a classifed live video.

The script does the following:

  • Prepare and create Conda environment based on environment.yml
  • Download YOLOv3 Weights
  • Convert the Darknet YOLO model into a Keras model using convert.py
  • Run YOLO inference on a sample video in an intaractive mode

Note: If you need to delete the created virtual environment

  conda deactivate
  conda env remove -p /speed-scratch/$USER/envs/yolo_env

For Tiny YOLOv3, it can be run in the same way, but you will need to specify model path and anchor path with --model model_file and --anchors anchor_file.

Performance comparison

Time is in minutes, run Yolo with different hardware configurations GPU types V100 and Tesla P6. Please note that there is an issue to run Yolo project on more than one GPU in case of tesla P6. The project uses keras.utils library calling multi_gpu_model() function, which cause hardware faluts and force to restart the server. GPU name for V100 is gpu32, and for P6 is gpu16, you can find that in scripts shell.

1GPU-P6 1GPU-V100 2GPU-V100 32CPU
22.45 17.15 23.33 60.42
22.15 17.54 23.08 60.18
22.18 17.18 23.13 60.47

OpenISS Person Re-Identification Baseline

The following are the steps required to run the OpenISS Person Re-Identification Baseline Project (https://github.com/NAG-DevOps/openiss-reid-tfk) on the Speed cluster. This implementatoin is based on tensorflow and keras

Prerequisites

Dataset

Using the Market1501 dataset which consist of

  • Train images: 12,936
  • Query images: 3,368
  • Gallery images: 15,913

Running for 10 epochs as an example, the results for different Speed configurations were:

  • Using GPU: 29 minute
  • Using CPUs (32 cores): 6 hours and 49 minute

Environment Setup

The environment setup instructions are located in environment.yml (https://github.com/NAG-DevOps/openiss-reid-tfk). Ensure all dependencies are correctly installed.

Configuration and execution

  • Log into Speed and navigate to your speed-scratch directory:

    ssh [email protected]
    cd /speed-scratch/$USER/
    
  • Clone the GitHub repo from https://github.com/NAG-DevOps/openiss-reid-tfk

  • Download the dataset: Navigate to the datasets/ directory, make the script executable, and run get_dataset_market1501.sh:

    chmod u+x *.sh && ./get_dataset_market1501.sh
    
  • Download openiss-reid-speed.sh execution script from this repository.

  • In reid.py set the number of epochs (g_epochs=120 by default)

  • In environment.yml comment/uncomment the TensorFlow section depending on whether you are running on CPU or GPU. GPU is enabled by default.

  • In openiss-reid-speed.sh comment/uncomment the resource allocation lines for either CPU or GPU, depending on the target node (GPU is default). Ensure that only one type (CPU or GPU) is requested.

  • Submit the job:

    For CPU nodes: sbatch ./openiss-reid-speed.sh

    For GPU nodes: sbatch -p pg ./openiss-reid-speed.sh

CUDA

When calling CUDA within job scripts, it is important to create a link to the desired CUDA libraries and set the runtime link path to the same libraries. For example, to use the cuda-11.5 libraries, specify the following in your Makefile.

-L/encs/pkg/cuda-11.5/root/lib64 -Wl,-rpath,/encs/pkg/cuda-11.5/root/lib64

In your job script, specify the version of gcc to use prior to calling cuda. For example: module load gcc/8.4 or module load gcc/9.3

Special Notes for sending CUDA jobs to the GPU Partition (pg)

Interactive jobs (easier to debug) should be submitted to the GPU Queue with salloc in order to compile and link CUDA code.

We have several versions of CUDA installed in:

/encs/pkg/cuda-11.5/root/
/encs/pkg/cuda-10.2/root/
/encs/pkg/cuda-9.2/root

For CUDA to compile properly for the GPU queue, edit your Makefile replacing /usr/local/cuda with one of the above.

Jupyter notebook example: Jupyter-Pytorch-CUDA

Example prepared to run on speed, extracted from: https://developers.redhat.com/learning/learn:openshift-data-science:configure-jupyter-notebook-use-gpus-aiml-modeling/resource/resources:how-examine-gpu-resources-pytorch

From speed-submit:

  • Download gpu-ml-model.ipynb from this github to your /speed-scratch/$USER space
  • salloc --mem=10Gb --gpus=1

From the node (interactive session):

  • module load singularity/3.10.4/default
  • srun singularity exec -B $PWD\:/speed-pwd,/speed-scratch/$USER\:/my-speed-scratch,/nettemp --env SHELL=/bin/bash --nv /speed-scratch/nag-public/jupyter-pytorch-cuda.sif /bin/bash -c '/opt/conda/bin/jupyter notebook --no-browser --notebook-dir=/speed-pwd --ip="*" --port=8888 --allow-root'
  • Follow the steps described in: https://nag-devops.github.io/speed-hpc/#jupyter-notebooks
  • When Jupyter is running on the browser, open gpu-ml-model.ipynb and run each cell

Python Modules

By default when adding a python module /tmp is used for the temporary repository of files downloaded. /tmp on speed-submit is too small for pytorch.

To add a python module:

  • First create you own tmp directory in /speed-scratch
    • mkdir /speed-scratch/$USER/tmp
  • Use the tmp direcrtory you created
    • setenv TMPDIR /speed-scratch/$USER/tmp
  • Attempt the installation of pytorch

Where $USER is an environment variable containing your GCS ENCS username