DeepSpeed ZeRO-Infinity
Warning: This is experimental. If you run into issues, let us know in the Issues section so we can help you fix them or figure them out.
Also, many of these options will not work well, or at all, on anything other than DeepSpeed "stage 3". DeepSpeed is a tough install, and stage 3 is often unsupported on GPUs other than the V100 and A100. Cards with a similar enough architecture, such as the RTX 2000 and RTX 3000 series, could work but currently have a tough time with it.
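To see which GPU you are working with before going further, a check along these lines may help. It is only a sketch: the hardware query assumes PyTorch is already installed, and the helper name `is_reference_gpu` and its simple name-matching rule are illustrative assumptions, not part of DeepSpeed or DALLE-pytorch.

```python
def is_reference_gpu(device_name: str) -> bool:
    """Return True if the name looks like a V100 or A100 (hypothetical helper).

    This is a plain substring match on the device name. It says nothing
    about whether other cards will actually work with stage 3.
    """
    name = device_name.upper()
    return "V100" in name or "A100" in name


def report_gpus() -> None:
    """Print each visible GPU, its compute capability, and the match result."""
    try:
        import torch  # imported lazily so is_reference_gpu works without it
    except ImportError:
        print("PyTorch is not installed; cannot query GPUs.")
        return
    if not torch.cuda.is_available():
        print("No CUDA device is visible to PyTorch.")
        return
    for idx in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(idx)
        major, minor = torch.cuda.get_device_capability(idx)
        note = "V100/A100" if is_reference_gpu(name) else "other card, stage 3 may struggle"
        print(f"GPU {idx}: {name} (compute capability {major}.{minor}) -> {note}")


if __name__ == "__main__":
    report_gpus()
```

A card that is not matched here is not necessarily unusable; the check only tells you whether you are on one of the two GPUs named above.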
Dependencies:
- llvm-9-dev
- cmake
- gcc
- python3.8.x
- deepspeed
- libaio-dev
- cudatoolkit=10.2 or 11.1 # Doesn't work on 11.2 unfortunately.
- pytorch=1.8.*
apt install -y libaio-dev gcc cmake llvm-9-dev
python -V # Check your version
# For CUDA 11.1 - change if you have a different version. CUDA 11.2 not supported.
pip3 install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
pip3 install deepspeed
pip3 install dalle-pytorch
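Once the packages are installed, a quick sanity check like the following can confirm that the versions line up with the dependency list. This is a sketch under assumptions: the function names `cuda_is_supported` and `check_environment` are made up for illustration, and the supported set simply mirrors the versions listed above.

```python
import importlib.util
from typing import Optional

# Toolkit versions taken from the dependency list (11.2 is not supported).
SUPPORTED_CUDA = {"10.2", "11.1"}


def cuda_is_supported(cuda_version: Optional[str]) -> bool:
    """True if a version string such as '11.1' is one of the listed toolkits."""
    if cuda_version is None:  # CPU-only PyTorch builds report no CUDA version
        return False
    major_minor = ".".join(cuda_version.split(".")[:2])
    return major_minor in SUPPORTED_CUDA


def check_environment() -> bool:
    """Print what is installed and return True if it matches the list above."""
    try:
        import torch
    except ImportError:
        print("torch is not installed")
        return False
    cuda = torch.version.cuda
    has_deepspeed = importlib.util.find_spec("deepspeed") is not None
    print(f"torch {torch.__version__}, built for CUDA {cuda}, "
          f"deepspeed installed: {has_deepspeed}")
    return (
        torch.__version__.startswith("1.8.")
        and cuda_is_supported(cuda)
        and has_deepspeed
    )


if __name__ == "__main__":
    print("environment OK" if check_environment() else "environment needs attention")
```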
At the time of this writing, Pop!_OS 20.04 still uses system76-cuda-10.2 and system76-cudnn-10.2 in its "latest" release. On 20.10, system76-cuda-latest will give you cuda-toolkit-11.2. So if you're on Pop!_OS 20.10 (not 20.04), be sure to install system76-cuda-11.1 and system76-cudnn-11.1 instead.
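The package choice described above can be written down as a small lookup. This is an illustrative sketch only: `system76_cuda_packages` is a hypothetical helper, and the mapping reflects the state of the Pop!_OS repositories as described in this document, which may have changed since.

```python
from typing import Tuple


def system76_cuda_packages(pop_os_release: str) -> Tuple[str, str]:
    """Return the (cuda, cudnn) apt package names for a Pop!_OS release.

    Hypothetical helper; only the two releases discussed above are covered.
    """
    if pop_os_release == "20.04":
        # "latest" still resolves to CUDA 10.2 here, which is supported.
        return ("system76-cuda-latest", "system76-cudnn-latest")
    if pop_os_release == "20.10":
        # "latest" would give CUDA 11.2, so pin 11.1 instead.
        return ("system76-cuda-11.1", "system76-cudnn-11.1")
    raise ValueError(f"No recommendation for Pop!_OS {pop_os_release!r}")


if __name__ == "__main__":
    for release in ("20.04", "20.10"):
        cuda_pkg, cudnn_pkg = system76_cuda_packages(release)
        print(f"Pop!_OS {release}: sudo apt install {cuda_pkg} {cudnn_pkg}")
```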
sudo apt install system76-cuda-latest
sudo apt install system76-cudnn-latest
sudo update-alternatives --config cuda
# Choose the most recent version of cuda-toolkit you see here.
# After you're done - to switch back to your original cuda-toolkit version, just run:
sudo update-alternatives --config cuda
In your train_dalle.py there is a dictionary named "deepspeed_config" that you need to change. Many more parameters are available to tinker with; they are listed in the DeepSpeed ZeRO JSON config documentation.
deepspeed_config = {
    "zero_optimization": {
        "stage": 3,
        # Offload the model parameters. If you have an NVMe drive, use the 'nvme' option.
        # Otherwise, use 'cpu' and remove the `nvme_path` line.
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/path/to/nvme/folder",
        },
        # Offload the optimizer state. If you have an NVMe drive, use the 'nvme' option.
        # Otherwise, use 'cpu' and remove the `nvme_path` line.
        "offload_optimizer": {
            "device": "nvme",  # options are 'none', 'cpu', 'nvme'
            "nvme_path": "/path/to/nvme/folder",
        },
    },
    # Override pytorch's Adam optim with `FusedAdam` (just called Adam here).
    "optimizer": {
        "type": "Adam",  # You can also use AdamW here
        "params": {
            "lr": LEARNING_RATE,
        },
    },
    "train_batch_size": BATCH_SIZE,
    "gradient_clipping": GRAD_CLIP_NORM,
    "fp16": {
        "enabled": args.fp16,
    },
}
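Since the 'cpu' and 'nvme' variants differ only in the offload blocks, one way to avoid editing the dictionary by hand is to build it from a helper. The sketch below is an assumption about how you might structure this, not how train_dalle.py does it: `offload_block` and `build_deepspeed_config` are hypothetical names, and the values in the usage section are placeholders.

```python
import json
from typing import Optional


def offload_block(device: str, nvme_path: Optional[str] = None) -> dict:
    """Build an offload section; `nvme_path` is only included for 'nvme'."""
    if device not in ("none", "cpu", "nvme"):
        raise ValueError(f"unknown offload device: {device!r}")
    block = {"device": device}
    if device == "nvme":
        if not nvme_path:
            raise ValueError("nvme_path is required when device is 'nvme'")
        block["nvme_path"] = nvme_path
    return block


def build_deepspeed_config(learning_rate: float, batch_size: int,
                           grad_clip_norm: float, fp16: bool,
                           device: str = "cpu",
                           nvme_path: Optional[str] = None) -> dict:
    """Assemble a stage 3 config with the same keys as the example above."""
    return {
        "zero_optimization": {
            "stage": 3,
            "offload_param": offload_block(device, nvme_path),
            "offload_optimizer": offload_block(device, nvme_path),
        },
        "optimizer": {"type": "Adam", "params": {"lr": learning_rate}},
        "train_batch_size": batch_size,
        "gradient_clipping": grad_clip_norm,
        "fp16": {"enabled": fp16},
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    config = build_deepspeed_config(3e-4, 4, 0.5, True, device="cpu")
    print(json.dumps(config, indent=2))
```

Switching to NVMe offload is then a matter of passing `device="nvme"` together with the path to your NVMe folder.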