This repository is the implementation for the paper:
Yifan Ding, Nicholas Botzer, Tim Weninger. HetSeq: Distributed GPU Training on Heterogeneous Infrastructure, Proc. of Association for the Advancement of Artificial Intelligence (AAAI) Innovative Application of Artificial Intelligence, February 2021.
Author: Yifan Ding ([email protected])
arXiv paper available: https://arxiv.org/abs/2009.14783
Documentation available: https://hetseq.readthedocs.io
Medium Towards Data Science post: Training BERT at a University
The documentation includes the Distributed Setting, Scripts to Run HetSeq, Extending HetSeq, Parameter Explanation, and Code Reference.
HetSeq is a distributed neural network platform designed to run on heterogeneous infrastructure with a common scientific shared file system. It can be run directly from the command line over SSH or through a task queue submission system, without root privileges or any extra packages. It takes care of data index randomization and assignment to different GPUs in the multi-node, multi-GPU setting. Users can easily extend HetSeq to many other models with minimal effort.
HetSeq requires installation of PyTorch with GPU support and NCCL.
- Create and activate a conda virtual environment with Python 3.7.4 (recommended):
$ conda create --name hetseq
$ conda activate hetseq
$ conda install python=3.7.4
- Clone the repository and install the necessary packages:
$ git clone https://github.com/yifding/hetseq.git
$ cd /path/to/hetseq
$ pip install -r requirements.txt
$ pip install --editable .
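HetSeq requires PyTorch with GPU support and NCCL (see above). As a suggested sanity check (not part of the HetSeq scripts), the following one-liner verifies that the environment provides both; each value should print True on a correctly configured GPU node.

$ python -c "import torch; print(torch.cuda.is_available(), torch.distributed.is_nccl_available())"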
- To run BERT: download the data files, including the training corpus, model configuration, and BPE dictionary. The test corpus is available from here, and the full data from this link. Download test_DATA.zip for a test run or DATA.zip for a full run, unzip it, and place the `preprocessing/` directory inside the package directory (see the setup sketch after this list). Available corpora under `preprocessing/`:
  - phase one of the BERT training corpus: `preprocessing/hdf5_lower_case_1_seq_len_128.../wikicorpus_en/`
  - phase two of the BERT training corpus: `preprocessing/hdf5_lower_case_1_seq_len_512.../wikicorpus_en/`
  - sample test for phase one: `preprocessing/test128/`
  - sample test for phase two: `preprocessing/test512/`
- See NVIDIA-pytorch-BERT, google_original_BERT, and the BERT paper for more information.
- The currently provided corpus is generated from NVIDIA-pytorch-BERT with Wikipedia data (the book data is not available).
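For example, to set up the sample test data, a sketch assuming test_DATA.zip was downloaded into the repository root and unpacks into `preprocessing/` (the exact archive layout may differ):

$ cd /path/to/hetseq
$ unzip test_DATA.zip
$ ls preprocessing/test128/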
- Scripts to run HetSeq are available at https://hetseq.readthedocs.io/en/master/examples.html.
HetSeq can be executed on a single GPU on a single node, multiple GPUs on a single node, or multiple GPUs across multiple nodes. The main logic is defined in train.py. The key distributed parameters are listed below, followed by a launch sketch.
- --distributed-init-method: the distributed initialization method, e.g. "tcp://10.32.82.207:11111" (TCP, for multiple nodes) or "file:///hetseq/communicate.txt" (shared file, for multiple nodes).
- --distributed-world-size: total number of GPUs used in the training.
- --distributed-gpus: the number of GPUs on the current node.
- --distributed-rank: the rank/index of the first GPU used on the current node.
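As a sketch, the commands below launch training on two nodes with 4 GPUs each (8 GPUs in total) using TCP initialization. The IP address is the example from above, and `[task and data arguments]` is a placeholder for the model-specific options, which should be taken from the example scripts in the documentation.

On the first node (hosting rank 0):

$ python3 train.py [task and data arguments] \
    --distributed-init-method tcp://10.32.82.207:11111 \
    --distributed-world-size 8 \
    --distributed-gpus 4 \
    --distributed-rank 0

On the second node (its first GPU has rank 4):

$ python3 train.py [task and data arguments] \
    --distributed-init-method tcp://10.32.82.207:11111 \
    --distributed-world-size 8 \
    --distributed-gpus 4 \
    --distributed-rank 4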
Performance of BERT training on nodes with 4 GPUs each (speedup is relative to the single-node run; expansion is speedup divided by the number of nodes):
nodes | GPUs | epochs | batch size | steps | avg. time per step | training time | training loss | expansion | speedup |
---|---|---|---|---|---|---|---|---|---|
1 | 4 | 5 | 128 | 267,139 | 2.60s | 7.19d | 0.026 | 1 | 1 |
2 | 8 | 5 | 256 | 133,570 | 2.69s | 4.19d | 0.028 | 0.86 | 1.72 |
4 | 16 | 5 | 512 | 66,785 | 2.794s | 2.23d | 0.031 | 0.81 | 3.22 |
8 | 32 | 5 | 1024 | 33,393 | 3.126s | 1.21d | 0.055 | 0.74 | 5.94 |
Note: loading the BERT data takes a while.
Known limitations and future work:
- Continued training from a checkpoint is not currently supported.
- The MNIST dataset download does not support multiple GPUs.
- The BERT preprocessing pipeline is not included.
- Interfaces to datasets/transformers are not included.
- HetSeq does not yet support installation from pip.
- Separate/combined evaluation is not included.
- fp16 support is not included.
This repository is MIT-licensed. It is built on fairseq, NVIDIA-BERT, and PyTorch.
Please send us an e-mail or leave a comment on GitHub if you have any questions.
Copyright (c) 2020 Yifan Ding and Weninger Lab