- Follow steps in README.md
- Launch the script as described in 2.2 Run this recipe for DeBERTa.

  If running on AzureML,

      cd huggingface/script
      python hf-ort.py --gpu_cluster_name <gpu_cluster_name> --hf_model deberta-v2-xxlarge --run_config ort

  If running locally,

      cd huggingface/script
      python hf-ort.py --hf_model deberta-v2-xxlarge --run_config ort --process_count <process_count> --local_run

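
The two launch commands above differ only in a few flags. As a minimal sketch, the argument list could be assembled programmatically as shown below. The helper `build_command` is hypothetical and not part of the recipe; it only uses the flags shown above, and the value `8` for `--process_count` is a placeholder.

```python
import shlex


def build_command(hf_model="deberta-v2-xxlarge", run_config="ort",
                  gpu_cluster_name=None, process_count=None):
    """Assemble the hf-ort.py command line from the flags shown above.

    Hypothetical helper: pass gpu_cluster_name for an AzureML run,
    or process_count for a local run (exactly one of the two).
    """
    if (gpu_cluster_name is None) == (process_count is None):
        raise ValueError("pass exactly one of gpu_cluster_name or process_count")

    cmd = ["python", "hf-ort.py"]
    if gpu_cluster_name is not None:
        # AzureML run: the cluster name selects the remote compute target.
        cmd += ["--gpu_cluster_name", gpu_cluster_name]
    cmd += ["--hf_model", hf_model, "--run_config", run_config]
    if process_count is not None:
        # Local run: process count plus the --local_run switch.
        cmd += ["--process_count", str(process_count), "--local_run"]
    return cmd


# Print the local-run command; it is meant to be run from huggingface/script.
print(shlex.join(build_command(process_count=8)))
```

The returned list can be passed to `subprocess.run(cmd, cwd="huggingface/script")` if you prefer to launch the script from Python.
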
| Run configuration | PyTorch (samples/sec) | ORTModule (samples/sec) | Gain |
|---|---|---|---|
| fp16 | 47.22 | 59.19 | 25.3% |
| fp16 with deepspeed stage 1 | 48.55 | 63.42 | 30.6% |
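
The Gain column is the relative throughput improvement of ORTModule over PyTorch. A short sketch of that arithmetic, using the values from the table (the function name `gain_percent` is only illustrative):

```python
def gain_percent(pytorch_sps, ort_sps):
    """Relative throughput gain of ORTModule over PyTorch, in percent."""
    return (ort_sps / pytorch_sps - 1.0) * 100.0


# Throughput values (samples/sec) copied from the table above.
results = {
    "fp16": (47.22, 59.19),
    "fp16 with deepspeed stage 1": (48.55, 63.42),
}

for name, (pytorch_sps, ort_sps) in results.items():
    print(f"{name}: {gain_percent(pytorch_sps, ort_sps):.1f}%")
```
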
These numbers are the average samples/sec over 10 runs on ND40rs_v2 VMs (V100 32G x 8) with CUDA 11, the stable release onnxruntime_training-1.8.0%2Bcu111-cp36-cp36m-manylinux2014_x86_64.whl, and a batch size of 4. A CUDA 10.2 option is also available through the --use_cu102 flag. Please check the dependency details in the Dockerfile. We look at the stable_train_samples_per_second metric in the log, which discards the first step because it includes setup time. Also note that because ORTModule takes some time to do its initial setup, a smaller --max_steps value may lead to a longer total run time for ORTModule compared to PyTorch. However, if you want fine-tuning to finish faster, set --max_steps to a smaller value. Lastly, we do not recommend running this recipe on NC series VMs, which use an older GPU architecture (K80).
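
To illustrate why the first step is discarded and why a small --max_steps can favor PyTorch in total run time, here is a rough sketch. The exact computation behind stable_train_samples_per_second is not shown in this document; the code assumes it is the number of samples processed after the first step divided by the time spent after the first step. Both function names, the step times, and the setup times are hypothetical values chosen for illustration, not measurements. Only the two throughputs come from the fp16 row of the table.

```python
def stable_samples_per_second(step_times, batch_size):
    """Throughput that ignores the first step (assumed to contain setup time)."""
    stable = step_times[1:]
    if not stable:
        raise ValueError("need at least two steps")
    return batch_size * len(stable) / sum(stable)


def total_run_time(setup_seconds, num_samples, samples_per_second):
    """Rough total time: one-off setup plus steady-state training."""
    return setup_seconds + num_samples / samples_per_second


# Illustrative step times in seconds: a slow first step, then steady state.
step_times = [30.0, 0.5, 0.5, 0.5]
print(stable_samples_per_second(step_times, batch_size=4))

# Hypothetical setup times (10 s vs 60 s) with the fp16 throughputs from the table.
# Fewer steps means fewer samples, so the one-off setup cost dominates.
for num_samples in (400, 40000):
    pytorch_time = total_run_time(10.0, num_samples, 47.22)
    ort_time = total_run_time(60.0, num_samples, 59.19)
    print(num_samples, round(pytorch_time, 1), round(ort_time, 1))
```

Under these assumed setup times, the short run finishes sooner with PyTorch, while the long run finishes sooner with ORTModule because its higher steady-state throughput outweighs the longer setup.
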
We are currently investigating how to run a batch size larger than 4 for DeBERTa.