JABAS (Joint Adaptive Batching and Automatic Scaling) is a novel DNN training system for heterogeneous GPU clusters. The major components of JABAS are IIDP, a DNN training framework that provides the same theoretical convergence rate as distributed SGD on a heterogeneous GPU cluster; a fine-grained adaptive batching technique with dynamic configuration; and a coarse-grained automatic resource scaling technique that leverages the prediction of global batch size changes for an epoch to auto-scale GPU resources optimally.
For more details, please refer to the EuroSys '25 paper entitled JABAS: Joint Adaptive Batching and Automatic Scaling for DNN Training on Heterogeneous GPUs (link will be uploaded).
- Ubuntu >= 16.04
- Anaconda3 4.13.0
- Python 3.8
- NVIDIA driver >= 450.80.02
- CUDA 11.1
- cuDNN 8.2.1
- Remote storage (e.g., NFS, AWS S3)
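Before starting, the node can be sanity-checked against these requirements with standard commands (a minimal check; remote storage is covered by the NFS guideline below):

nvidia-smi       # NVIDIA driver version should be >= 450.80.02
lsb_release -a   # Ubuntu release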
Install CUDA and cuDNN
- Download the CUDA toolkit [link]. Make sure that `/usr/local/cuda` is linked to `/usr/local/cuda-11.1`.
- Download cuDNN [link].
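If `/usr/local/cuda` does not point to the 11.1 toolkit, the symlink can be checked and re-linked as follows (a sketch; adjust paths to your installation):

ls -l /usr/local/cuda                                # show the current symlink target
sudo ln -sfn /usr/local/cuda-11.1 /usr/local/cuda    # point it to CUDA 11.1 if needed
/usr/local/cuda/bin/nvcc --version                   # should report CUDA 11.1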
Install Anaconda (Optional) - If Anaconda has already been installed, skip this step.
wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh
bash Anaconda3-2020.02-Linux-x86_64.sh
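To confirm the installation, open a new shell (or `source ~/.bashrc`, assuming the installer was allowed to initialize your shell) and check that conda is available:

conda --version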
Prepare conda environment
CONDA_ENV=jabas
conda create -n $CONDA_ENV python=3.8 -y
conda activate $CONDA_ENV
conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses -y
conda install -c pytorch magma-cuda111 -y # For CUDA 11.1
Install IIDP by building from source
BASE=$HOME # Set the custom base path
JABAS_HOME=$BASE/JABAS
IIDP_HOME=$JABAS_HOME/IIDP
PYTORCH_HOME=$BASE/pytorch
VISION_HOME=$BASE/vision
cd $BASE
git clone --recursive https://github.com/unist-ssl/JABAS
git clone --recursive -b v1.8.1 https://github.com/pytorch/pytorch.git
cd $PYTORCH_HOME
patch -p1 < $IIDP_HOME/pytorch.patch
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py install
cd $BASE
git clone -b v0.9.1 https://github.com/pytorch/vision.git
cd $VISION_HOME
pip install pillow==10.4.0
python setup.py install
cd $IIDP_HOME
pip install -r requirements.txt
python setup.py install
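As a quick sanity check, verify that the patched PyTorch build sees CUDA and that IIDP is importable (the `iidp` module name is inferred from the repository layout and may differ):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
python -c "import iidp"   # assumed package name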
Install JABAS
cd $JABAS_HOME
pip install -r requirements.txt
python setup.py install
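Similarly, JABAS can be checked for importability (the `jabas` package name is an assumption based on the source tree described later in this README):

python -c "import jabas; print(jabas.__file__)"   # assumed package name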
We provide a guideline to set up NFS (Network File System), assuming the mount point is `/mnt`.
Install NFS packages
sudo apt-get install nfs-kernel-server nfs-common -y
Make NFS mount directory
USER=`whoami`
sudo mkdir /mnt/nfs
sudo chown -R $USER:$USER /mnt/nfs
sudo chmod -R 777 /mnt/nfs
Set up the /etc/exports file
In the `/etc/exports` file, write the configuration below.
$ sudo vi /etc/exports
----------------------------------------------------
/mnt/nfs *(rw,no_root_squash,no_all_squash,async)
----------------------------------------------------
On the main node (NFS server):
Start NFS server
sudo systemctl restart nfs-server
Check if NFS server status is active
sudo systemctl status nfs-server
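You can also confirm that `/mnt/nfs` is actually exported with the options from `/etc/exports`; `exportfs` ships with nfs-kernel-server:

sudo exportfs -v    # should list /mnt/nfs with rw,async,no_root_squash
sudo exportfs -ra   # re-read /etc/exports without restarting, if you edit it later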
On NFS client nodes:
Set the NFS server IP and mount the NFS directory
NFS_SERVER_IP=<your NFS server IP>
sudo mount -t nfs $NFS_SERVER_IP:/mnt/nfs /mnt/nfs
Check if NFS directory is mounted
df -h
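Optionally, to make the mount persist across reboots, an `/etc/fstab` entry such as the one below can be added on each client (not required by JABAS itself):

echo "$NFS_SERVER_IP:/mnt/nfs /mnt/nfs nfs defaults 0 0" | sudo tee -a /etc/fstab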
- `IIDP/iidp/`: Implementation of IIDP.
- `jabas/train/trainer.py`: Core runtime of JABAS that coordinates adaptive batching and automatic scaling.
- `jabas/profiler/` and `IIDP/iidp/profiler/`: Profiler.
- `jabas/config/` and `IIDP/iidp/config/`: Configuration solver.
- `jabas/elastic/`: Main code for elastic training. The gRPC communication stack (`jabas/elastic/runtime/`) is mainly borrowed from stanford-futuredata/gavel.
- `examples/`: Example (benchmark) code for JABAS.
Refer to the README.md in the `examples/resnet50/quickstart/` directory.
Refer to the README.md in the `examples/` directory.