Run Instruction

  1. Follow the steps in README.md.
  2. Launch the script as described in section 2.2 "Run this recipe" for DeBERTa.

If running on AzureML:

cd huggingface/script
python hf-ort.py --gpu_cluster_name <gpu_cluster_name> --hf_model deberta-v2-xxlarge --run_config ort

If running locally:

cd huggingface/script
python hf-ort.py --hf_model deberta-v2-xxlarge --run_config ort --process_count <process_count> --local_run
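
For a local run, --process_count should typically match the number of visible GPUs (for example, 8 on an ND40rs_v2 VM). A minimal sketch of picking it automatically, assuming PyTorch is installed and the command is launched from huggingface/script:

```python
# Sketch (assumption): set --process_count to the number of visible GPUs,
# e.g. 8 on an ND40rs_v2 VM (V100 32GB x 8), then launch the local run.
import subprocess

import torch

process_count = torch.cuda.device_count()

cmd = [
    "python", "hf-ort.py",
    "--hf_model", "deberta-v2-xxlarge",
    "--run_config", "ort",
    "--process_count", str(process_count),
    "--local_run",
]
subprocess.run(cmd, check=True)  # run from huggingface/script
```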

Performance Comparison

| Run configuration | PyTorch (samples/sec) | ORTModule (samples/sec) | Gain |
| --- | --- | --- | --- |
| fp16 | 47.22 | 59.19 | 25.3% |
| fp16 with DeepSpeed stage 1 | 48.55 | 63.42 | 30.6% |
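
The Gain column is the relative throughput improvement of ORTModule over PyTorch. A quick check of the numbers above (not part of the recipe):

```python
# Relative throughput gain: (ORTModule - PyTorch) / PyTorch, using samples/sec.
def gain(pytorch_sps: float, ortmodule_sps: float) -> float:
    return (ortmodule_sps - pytorch_sps) / pytorch_sps

print(f"fp16:                     {gain(47.22, 59.19):.1%}")  # 25.3%
print(f"fp16 + DeepSpeed stage 1: {gain(48.55, 63.42):.1%}")  # 30.6%
```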

These numbers are the average samples/sec over 10 runs on ND40rs_v2 VMs (V100 32GB x 8), CUDA 11, with the stable release onnxruntime_training-1.8.0+cu111-cp36-cp36m-manylinux2014_x86_64.whl and a batch size of 4. A CUDA 10.2 option is also available through the --use_cu102 flag. Please check dependency details in the Dockerfile. We report the stable_train_samples_per_second metric from the log, which discards the first step since it includes setup time. Also note that because ORTModule takes some time for its initial setup, a smaller --max_steps value may lead to a longer total run time for ORTModule compared to PyTorch; if you want finetuning to finish faster, reduce --max_steps. Lastly, we do not recommend running this recipe on NC-series VMs, which use the older K80 architecture. We are currently investigating how to run a batch size larger than 4 for DeBERTa.
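
If you want to aggregate stable_train_samples_per_second across several runs yourself, a sketch like the following may help; the exact log format and log file location are assumptions and may differ from what hf-ort.py actually emits:

```python
# Sketch (assumed log format): collect stable_train_samples_per_second values
# from per-run logs and average them, as done for the table above over 10 runs.
import re
from pathlib import Path
from statistics import mean

PATTERN = re.compile(r"stable_train_samples_per_second\D*([0-9]+\.?[0-9]*)")

def metric_from_log(path: Path) -> float:
    # Take the last reported value in the log for this run.
    matches = PATTERN.findall(path.read_text())
    return float(matches[-1])

# "logs" is a hypothetical directory holding one log file per run.
values = [metric_from_log(p) for p in Path("logs").glob("*.txt")]
print(f"average samples/sec over {len(values)} runs: {mean(values):.2f}")
```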

Convergence

Loss