feat(train): support deepspeed #1849
Conversation
I've discovered that torch.autocast integrates seamlessly with DeepSpeed, which saves us a significant amount of manual tensor-type casting, especially when fp16/bf16 is enabled:

with torch.cuda.amp.autocast(cache_enabled=False):
    loss = model_wrapped_by_deepspeed_initialize(inputs)
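For context, a minimal sketch of how this can look in the training loop (the variable names and the forward signature below are placeholders, not the exact code in this PR):

```python
import torch

# `model_engine` is assumed to be the DeepSpeedEngine returned by
# deepspeed.initialize(...); `dataloader` is assumed to yield (feats, labels).
for feats, labels in dataloader:
    feats = feats.to(model_engine.device)
    labels = labels.to(model_engine.device)
    # autocast performs the fp16/bf16 casts, so the model code can stay fp32.
    with torch.cuda.amp.autocast(cache_enabled=False):
        loss = model_engine(feats, labels)  # placeholder forward returning a loss
    model_engine.backward(loss)  # DeepSpeed applies its own loss scaling
    model_engine.step()
```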
The script I used to test the 1.8B model:

#!/bin/bash
# Copyright [2023-05-10] <[email protected], Xingchen Song>
size="1.8B"
stage=stage2
dir=u2pp_conformer_deepspeed_shard_nccl_${size}_${stage}
rm -rf tensorboard/$dir
rm -rf exp/$dir
if [ -d "/usr/local/cuda" ]; then
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
#export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}:/usr/local/lib/openmpi/:/usr/local/nccl_2.4.7-1+cuda10.0_x86_64/lib
export CUDA_HOME=/usr/local/cuda
export CFLAGS="-I$CUDA_HOME/include $CFLAGS"
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
#export LIBRARY_PATH=/usr/local/nccl_2.4.7-1+cuda10.0_x86_64/lib/:$LIBRARY_PATH
export CUDA_PATH=$CUDA_HOME
fi
bash run.sh \
--deepspeed true \
--train_set "train" \
--data_type "shard" \
--stage 4 --stop_stage 4 \
--deepspeed_save_states "model+optimizer" \
--deepspeed_config conf/ds_$stage.json \
--train_config conf/train_u2++_conformer_1.8B.yaml \
--dir exp/$dir
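The conf/ds_stage2.json referenced by --deepspeed_config is not included in this thread; below is a rough sketch of what a ZeRO stage-2, bf16 config could contain (the values are illustrative, not necessarily the ones used in this PR). Written as a Python dict, it could equally be passed to deepspeed.initialize through its config argument:

```python
import deepspeed

# Illustrative ZeRO stage-2 settings; the real conf/ds_stage2.json may differ.
ds_config = {
    "train_micro_batch_size_per_gpu": 16,  # per-device batch size
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 5.0,
    "bf16": {"enabled": True},   # bf16 training, as in the 1.8B run below
    "fp16": {"enabled": False},
    "zero_optimization": {
        "stage": 2,                  # shard optimizer states and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# `model` is assumed to be an already constructed wenet model.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
```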
I think this PR is ready for final review. We can merge it so that others can start experimenting, and then fix whatever needs to be fixed. cc @robin1001
LGTM.
done in #2055
Brief
This PR integrates DeepSpeed into WeNet, which enables:
Initial result (part-1):
I believe we can get the same results under the same training configuration (the current minor differences may be due to variations in batch size and the number of GPUs). This initial result indicates that the DeepSpeed integration is correct.
Initial result (part-2):
We can clearly see that, under the same training configuration, DeepSpeed + bfloat16 is better than torch DDP + float16.
From the TensorBoard logs, we can conclude that torch DDP and DeepSpeed follow the same trend in train_loss/cv_loss/lr.
Benchmark: Training speed on small model
info: 4 × RTX 3090 (24 GB), fp32 training, 8 dataloader workers with prefetch 500, batch size 32, NCCL backend
Benchmark: Ability to train the 1.8B model with an efficient batch size
info: 4 × RTX 3090 (24 GB), batch size 16 per device, bf16 training, 8 dataloader workers with prefetch 500, NCCL backend
about 50 minutes per epoch
TODO
Limitations