HetSeq: Distributed GPU Training on Heterogeneous Infrastructure

This is our code implementation for the paper:

Yifan Ding, Nicholas Botzer, Tim Weninger. HetSeq: Distributed GPU Training on Heterogeneous Infrastructure, Proc. of the Association for the Advancement of Artificial Intelligence (AAAI), Innovative Applications of Artificial Intelligence, February 2021.

Author: Yifan Ding ([email protected])

arXiv paper available: https://arxiv.org/abs/2009.14783

Documentation available: https://hetseq.readthedocs.io

Medium Towards Data Science post: Training BERT at a University

Documentation includes Distributed Setting, Scripts to Run HetSeq, Extending HetSeq, Parameter Explanation and Code Reference.

Overview

HetSeq is a distributed neural network platform designed to run on heterogeneous infrastructure with a common shared file system, as typically found in scientific computing environments. It can be run directly from the command line over SSH or through a task-queue submission system, without administrative privileges or any extra packages. It takes care of data index randomization and assignment to different GPUs in multi-node, multi-GPU settings. Users can easily extend HetSeq to other models with minimal effort.

HetSeq requires installation of PyTorch with GPU support and NCCL.
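
A quick way to confirm these prerequisites before attempting a distributed run (a minimal sketch, not part of HetSeq itself):

```bash
# Environment check (not part of HetSeq): verify that PyTorch sees the GPUs
# and was built with NCCL support.
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import torch; print('NCCL available:', torch.distributed.is_nccl_available())"
nvidia-smi    # lists the GPUs visible on the current node
```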

Installation

  1. Create and activate a conda virtual environment with Python 3.7.4 (recommended):
$ conda create --name hetseq
$ conda activate hetseq
$ conda install python=3.7.4
  2. Clone the repository and install the necessary packages:
$ git clone https://github.com/yifding/hetseq.git
$ cd /path/to/hetseq
$ pip install -r requirements.txt
$ pip install --editable .
  3. To run BERT: download the data files, including the training corpus, model configuration, and BPE dictionary. The test corpus is available from here, the full data from this link. Download test_DATA.zip for a test run or DATA.zip for a full run, unzip it, and place the preprocessing/ directory inside the package directory (a short sketch of this step follows the list). Available corpora under preprocessing/:
  • phase one of BERT training corpus: preprocessing/hdf5_lower_case_1_seq_len_128.../wikicorpus_en/
  • phase two of BERT training corpus: preprocessing/hdf5_lower_case_1_seq_len_512.../wikicorpus_en/
  • sample test for phase one: preprocessing/test128/
  • sample test for phase two: preprocessing/test512/
  • see NVIDIA-pytorch-BERT, google_original_BERT, and the BERT paper for more information.
  • the currently provided corpus is generated with NVIDIA-pytorch-BERT from Wikipedia data (book data is not available).
  4. Scripts for running HetSeq are available at https://hetseq.readthedocs.io/en/master/examples.html.
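
As a rough sketch of step 3 (the download links themselves are in the documentation; only the file and directory names mentioned above are used here):

```bash
# Sketch of step 3, assuming test_DATA.zip has already been downloaded.
cd /path/to/hetseq
unzip /path/to/test_DATA.zip
# After unpacking, the sample corpora should be in place:
ls preprocessing/test128/     # sample data for phase one of BERT training
ls preprocessing/test512/     # sample data for phase two of BERT training
```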

Distributed Configuration

HetSeq can be executed on a single GPU on a single node, on multiple GPUs on a single node, or on multiple GPUs across multiple nodes. The main logic is defined in train.py. A sketch of a multi-node launch follows the list below.

  • --distributed-init-method: defines the initialization method, e.g. "tcp://10.32.82.207:11111" (TCP for multiple nodes) or "file:///hetseq/communicate.txt" (shared file for multiple nodes).
  • --distributed-world-size: the total number of GPUs used in the training.
  • --distributed-gpus: the number of GPUs on the current node.
  • --distributed-rank: the rank/index of the first GPU used on the current node.
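
For example, a run on two nodes with 4 GPUs each (8 GPUs in total) could be launched roughly as follows. This is a hedged sketch: only the distributed flags listed above are taken from this README; task-specific arguments (task name, data paths, etc.) are omitted and should be taken from the example scripts in the documentation.

```bash
# Sketch of a two-node, 4-GPUs-per-node launch using the flags described above.

# On node 0 (IP 10.32.82.207); its first GPU has global rank 0:
python train.py \
    --distributed-init-method tcp://10.32.82.207:11111 \
    --distributed-world-size 8 \
    --distributed-gpus 4 \
    --distributed-rank 0
    # ...plus the task-specific arguments from the example scripts

# On node 1; its first GPU has global rank 4:
python train.py \
    --distributed-init-method tcp://10.32.82.207:11111 \
    --distributed-world-size 8 \
    --distributed-gpus 4 \
    --distributed-rank 4
    # ...plus the task-specific arguments from the example scripts
```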

Performance table

Running BERT on nodes with 4 GPUs each.

| nodes | GPUs | epochs | batch size | steps | avg. time per step | training time | training loss | expansion | speedup |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 4 | 5 | 128 | 267,139 | 2.60s | 7.19d | 0.026 | 1 | 1 |
| 2 | 8 | 5 | 256 | 133,570 | 2.69s | 4.19d | 0.028 | 0.86 | 1.72 |
| 4 | 16 | 5 | 512 | 66,785 | 2.794s | 2.23d | 0.031 | 0.81 | 3.22 |
| 8 | 32 | 5 | 1024 | 33,393 | 3.126s | 1.21d | 0.055 | 0.74 | 5.94 |

Notice and tips

Loading the BERT data takes a while.

Known issues

  • resuming (continuing) training from a checkpoint is not currently supported
  • the MNIST dataset download does not support multiple GPUs

Future patches

  • the BERT data preprocessing pipeline is not included
  • an interface to datasets/transformers is not included
  • HetSeq does not yet support installation from pip
  • separate/combined evaluation is not included
  • fp16 support

License

This repository is MIT-licensed. It is built on fairseq, NVIDIA-BERT, and PyTorch.

Please send us an e-mail or leave comments on GitHub if you have any questions.

Copyright (c) 2020 Yifan Ding and Weninger Lab