
~ New ~ : Project Final Report Available at http://eric-haibin-lin.github.io/mxnet

CURRENT STATE OF THE PROJECT

In the past 2 weeks, we have implemented a general bucket-based training scheduler for the Penn Treebank model on top of the mxnet framework. With this training scheduler, we achieved a 3.5x speedup compared to training without bucketing, and a 1.3x speedup compared to training with static bucketing, on a dataset of approximately 1 million words. Furthermore, our bucket-based approach is general-purpose and can easily be applied to other training tasks. We are actively working with the authors of mxnet to integrate our approach into the mxnet framework. We have done extensive work analysing and improving the performance of data parallelism with bucketing on multiple GPUs in a single machine, and our first version of the code has been merged into the mxnet master branch. In the coming weeks, we are looking into model parallelism with bucketing over multiple GPUs on a single machine and, if time allows, we will test our approach's speedup on a more complicated LSTM task (English-German translation) with distributed training on multiple machines and a much larger dataset.

Project Schedule Review

  • Week 1: project proposal, background study, reviewing existing codebase (done)
  • Week 2: basic performance benchmark (done)
  • Week 3-4: Apply the bucket based approach to data parallelism training and benchmark performance (done)
  • Week 5-6: Implement the slicing-bunch based approach and compare its performance (original plan)
  • Week 4-5: Apply the bucket based approach to model parallelism and benchmark performance (new plan)
  • Week 5-6: Evaluate the bucket based approach in distributed machine learning task (new plan)

In short, we have achieved what we planned in the schedule.

THE PROBLEM WE'RE SOLVING

We are tackling multi-GPU training speedup for the Recurrent Neural Network (RNN) model. RNNs are networks with loops in them, allowing information to persist. The model on the left side of the figure is a typical RNN whose output h_t is fed as part of the input to the next time step. If the input is a sequence x_1, ..., x_t, the model needs to be unrolled as shown on the right side of the figure. The arrows in the model indicate data dependencies during training.

An unrolled RNN
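
Concretely, the recurrence illustrated above can be written in a standard vanilla-RNN form (the symbol names below are ours, not taken from the figure):

    h_t = \tanh\left( W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h \right)

where x_t is the input at step t and h_t is the hidden state passed on to step t+1.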

When unrolled, the length of the LSTM-RNN equals the length of the input sequence. Consider the following two input sentences (word sequences):

  1. Although preliminary findings were reported more than a year ago the latest results appear in today's new England journal of medicine a forum likely to bring new attention to the problem.
  2. There is no asbestos in our products now.

Then the unrolled LSTM-RNN models for these two sentences have different lengths, which means different amounts of computation. The variance in unrolled length does no harm in a single-threaded, sequential setting. During training, however, we cannot process one sentence after another; at that rate it might take longer than your Ph.D. to finish training. Thus, we need to train sentences in parallel.

Here comes the problem. Variance in sentence length means variance in training workload within a mini-batch (usually we train 32-100 examples at a time as a mini-batch). However, it wastes too much memory and time to unroll a separate LSTM model for every sentence in a mini-batch when the lengths vary significantly. In practice, machine learning frameworks pad non-informative words at the end of short sentences so that all models within the same batch have the same length. Clearly, this wastes computation: padding the short sentence above to the length of the long one wastes well over a third of the computation in that two-sentence batch.

What's more, the same problem arises in almost every learning task that uses an LSTM model. For example, in machine translation the training data consists of pairs of sentences in different languages, which requires more insight into the task to allocate training data. Thus, we want to design a general data-aligning method that can be computed efficiently. To illustrate the problem, we denote wasted computation with white LSTM cells in the following figure (we used a mini-batch of size 2 for illustration purposes).

[Figure: a padded mini-batch of 2 sentences; white LSTM cells mark wasted computation]

The end goal of our project is to design a general bucketing policy for LSTM-based training. We will either generate buckets automatically or provide a simple API for users to define their workload without having to dig into the parallel implementation. Our first version of auto bucketing has already been merged into the master branch of mxnet.

OVERVIEW OF WORKFLOW

Here is an overview of how we train an LSTM model:

1. Generate buckets based on the dataset.
2. Sample data into mini-batches using the buckets.
3. For each mini-batch:
   a. Forward the data through the unrolled LSTM model of the matching bucket length.
   b. Compute the loss function for the mini-batch.
   c. Use the KVStore to update the parameters of the model.
4. Repeat step 3 until every mini-batch has been visited.
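
The following is a minimal, framework-agnostic sketch of steps 1 and 2 (bucket assignment and mini-batch sampling). The bucket keys, the `pad_id`, and the helper names are hypothetical and chosen for illustration; this is not the exact mxnet implementation.

```python
import random
from collections import defaultdict

def assign_buckets(sentences, bucket_keys):
    """Put each sentence (a list of token ids) into the smallest bucket that fits it.
    Sentences longer than the largest bucket are truncated to that bucket."""
    buckets = defaultdict(list)
    for sent in sentences:
        for key in sorted(bucket_keys):
            if len(sent) <= key:
                buckets[key].append(sent)
                break
        else:
            buckets[max(bucket_keys)].append(sent[:max(bucket_keys)])
    return buckets

def iterate_batches(buckets, batch_size, pad_id=0):
    """Yield (bucket_key, batch) pairs; sentences are padded only up to the
    bucket length, not to the global maximum length."""
    for key, sents in buckets.items():
        random.shuffle(sents)
        for i in range(0, len(sents), batch_size):
            batch = [s + [pad_id] * (key - len(s)) for s in sents[i:i + batch_size]]
            yield key, batch

# Usage sketch: group the corpus with the default bucket keys, then train each
# mini-batch on the unrolled model whose length equals its bucket key.
# buckets = assign_buckets(corpus, bucket_keys=[10, 20, 30, 40, 50, 60])
# for key, batch in iterate_batches(buckets, batch_size=32):
#     ...  # forward/backward on the model of length `key`, update via KVStore
```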

MAIN IDEA OF BUCKETING

The approach we take to solve this problem is called bucketing. The basic idea is to re-organize model construction and initialization so that only a minimal computation cost is paid to align the inputs of each mini-batch to a same-length model. For example, the models used to train the previous two mini-batches are re-organized as shown in the following figure.

[Figure: the same two mini-batches re-organized with bucketing; each mini-batch runs on a model unrolled to its own bucket length]
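
The core pattern is that an unrolled model is built (and cached) per bucket key rather than once at the maximum length. The sketch below illustrates the idea only; `build_unrolled_lstm` is a hypothetical constructor standing in for mxnet's symbol-building code.

```python
class BucketedModels:
    """Lazily build and cache one unrolled LSTM per bucket key, so each
    mini-batch runs on a model exactly as long as its bucket."""

    def __init__(self, build_unrolled_lstm):
        self._build = build_unrolled_lstm   # hypothetical: seq_len -> model
        self._cache = {}                    # bucket_key -> unrolled model

    def get(self, bucket_key):
        if bucket_key not in self._cache:
            self._cache[bucket_key] = self._build(seq_len=bucket_key)
        return self._cache[bucket_key]
```

All unrolled models share the same underlying parameters, so training on any bucket updates a single model.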

With the bucketing idea in mind, we further illustrate two ways to train LSTM models in parallel: data parallelism and model parallelism. Bucketing helps speed up both.

DATA PARALLELISM WITH BUCKETING

One way to parallelize RNN training is data parallelism. The unit of parallelism is the data; in our case, the unit we divide the workload on is the sentence. The figure below shows how training a small batch of 2 sentences is assigned to 2 GPUs. Each GPU is responsible for forwarding its embedded vectors; the GPUs then combine the loss and backpropagate the gradients.

[Figure: data parallelism; a mini-batch of 2 sentences split across 2 GPUs]

To get good performance, GPUs should ideally not sit idle for long while waiting for others to complete their work. Aligned data of similar length therefore helps speed up training: our approach selects batch_size sentences of the same or similar length to form each batch, which gives the best training performance.
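
The following sketch shows one data-parallel update on a bucketed mini-batch, with the gradient aggregation written out explicitly. `grad_fn`, `devices`, and the plain SGD update are hypothetical placeholders for mxnet's executors and KVStore machinery.

```python
def data_parallel_step(params, batch, devices, grad_fn, lr=0.1):
    """One SGD step: split a bucketed mini-batch across GPUs, compute
    gradients per device, then aggregate into the shared parameters."""
    # split the mini-batch evenly across the GPUs (round-robin)
    shards = [batch[i::len(devices)] for i in range(len(devices))]
    # every shard has the same bucket length, so no device waits on padding
    grads = [grad_fn(dev, shard) for dev, shard in zip(devices, shards)]
    # aggregate gradients; this shared update is where contention appears
    for name in params:
        total = sum(g[name] for g in grads)
        params[name] -= lr * total / len(devices)
    return params
```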

MODEL PARALLELISM WITH BUCKETING

Model parallelism has been under heated discussion in applied machine learning recently. It was originally designed for the very large convolutional layers in GoogleNet; we borrow the idea and place each layer on one GPU. The unit of model parallelism is a layer of the neural network. The benefit is that a GPU does not have to keep all layers in memory, which relieves the memory limitation in large-scale tasks (for example, machine translation).

[Figure: model parallelism; each LSTM layer is placed on a different GPU and activations are pipelined between them]

In the figure above, we assign different layers of the LSTM model to different GPUs. After GPU 1 finishes computing layer 1 for the first sentence, the output is handed to GPU 0; at the same time, GPU 1 fetches the next sentence and keeps training. This differs significantly from data parallelism: there is no contention to update a shared model at the end of each iteration, and most of the communication happens while pipelining intermediate results between GPUs.

As with any pipeline, there is a bottleneck if one stage is clearly slower than the others. Here, good bucketing ensures that the sentences in a mini-batch are of the same length, which keeps the stages balanced and gives good performance.
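
The sketch below mimics the layer-per-GPU pipeline with worker threads and queues standing in for GPU-to-GPU transfers. `layer_fns` (one forward function per GPU) is a hypothetical placeholder; the real implementation streams tensors between devices inside mxnet.

```python
from queue import Queue
from threading import Thread

def pipeline_forward(layer_fns, batches):
    """Push mini-batches through a chain of stages, one stage per GPU.
    While stage k works on batch i, stage k-1 already works on batch i+1."""
    queues = [Queue() for _ in range(len(layer_fns) + 1)]

    def worker(fn, q_in, q_out):
        while True:
            item = q_in.get()
            if item is None:        # sentinel: shut this stage down
                q_out.put(None)
                return
            q_out.put(fn(item))     # hand activations to the next stage

    for fn, q_in, q_out in zip(layer_fns, queues, queues[1:]):
        Thread(target=worker, args=(fn, q_in, q_out), daemon=True).start()

    for b in batches:
        queues[0].put(b)
    queues[0].put(None)

    outputs = []
    while True:
        item = queues[-1].get()
        if item is None:
            return outputs
        outputs.append(item)
```

If every stage takes roughly the same time, which is what same-length bucketed batches give us, the pipeline stays full and no GPU idles.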

PRELIMINARY RESULTS

We used a single multi-GPU machine with 4 GeForce GTX 750 Ti GPUs installed. Each GPU has 2GB of memory.

We trained an LSTM-RNN language model with 200 hidden units, 2 LSTM layers, and 200 embedding units for 25 epochs on the Penn Treebank dataset (approx. 1 million words). The figures below show training time for different numbers of GPUs: the left one uses batch size 32, the right one batch size 128.

[Figure: training time vs. number of GPUs; left panel: batch size 32, right panel: batch size 128]

We notice that the trends differ between the two figures. When the batch size is 32, the arithmetic intensity of training per batch is lower than with batch size 128, since there is less data to process per batch. Training taking longer with more GPUs is caused by the cost of communication between GPUs: each GPU has to access and update the same shared model parameters (over 100MB of parameters) after each iteration, and more GPUs lead to more contention on this shared resource. There is also a fixed cost for setting up the training model on each GPU (e.g. data and parameter movement). These factors explain why training does not scale well when the batch size is small. When the batch size is increased to 128, however, training time drops as the number of GPUs increases because the computation intensity is higher.

We also measured training time with different strategies: (1) no bucketing, (2) static default bucketing, and (3) auto-generated bucketing based on the dataset distribution. The results show that methods (2) and (3) outperform the baseline (1). Method (2) uses a default bucket setting with ranges [0-10, 10-20, 20-30, 30-40, 40-50, 50-60+], which is reasonable for the PTB dataset. Even this static default bucketing speeds up training to around 3x, since lots of padding and wasted computation are avoided. Method (3) generates more fine-grained buckets according to the dataset and achieves an even larger speedup than method (2). The cost of generating the buckets themselves is negligible compared to the total training cost.
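
Below is a hedged sketch of how bucket keys can be auto-generated from the length distribution of the dataset; the equal-frequency heuristic here is for illustration and may differ from the exact heuristic in our mxnet patch.

```python
def auto_buckets(sentence_lengths, num_buckets):
    """Choose bucket keys so that each bucket covers roughly the same number
    of sentences (equal-frequency splits of the length distribution)."""
    lengths = sorted(sentence_lengths)
    keys = []
    for i in range(1, num_buckets + 1):
        # upper length bound of the i-th equal-frequency slice
        idx = min(len(lengths) - 1, i * len(lengths) // num_buckets - 1)
        if not keys or lengths[idx] > keys[-1]:
            keys.append(lengths[idx])
    return keys

# e.g. keys = auto_buckets([len(s) for s in corpus], num_buckets=10)
```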

[Figure: training-time speedup of no bucketing, default bucketing, and auto-generated bucketing on 1 to 3 GPUs]

The figure above shows the speedup of no bucketing, default bucketing, and auto bucketing on 1 to 3 GPUs. Our method achieves the best speedup.

[Figure: training loss per epoch for different batch sizes (default bucketing, 1 GPU)]

We also analysed the impact of batch size on the convergence rate of the model. We trained the same LSTM model with default bucketing on 1 GPU; the loss per epoch is shown in the figure above. We found that a smaller batch size leads to faster convergence per epoch. This reminds us that although a larger batch size increases arithmetic intensity and gives faster parallel training per epoch, the model does not necessarily converge faster. We need to keep in mind that training with a small batch size requires fewer epochs for the model to converge.

APPENDIX

Raw data for bucketing experiment, batch size 32

| Bucket | Batch size | 1 GPU (mins) | 2 GPUs (mins) | 3 GPUs (mins) | 4 GPUs (mins) |
|---|---|---|---|---|---|
| no bucket | 32 | 124 | 132 | 152 | hang |
| 10, 20, 30, 40, 50, 60 | 32 | 45 | 54 | 62 | hang |
| auto-generated | 32 | 34 | 45 | 51 | 62 |

Raw data for bucketing experiment, batch size 128

| Bucket | Batch size | 1 GPU (mins) | 2 GPUs (mins) | 3 GPUs (mins) | 4 GPUs (mins) |
|---|---|---|---|---|---|
| no bucket | 128 | 91 | 58 | 63 | 68 |
| 10, 20, 30, 40, 50, 60 | 128 | 27 | 21 | 19 | 26 |
| auto-generated | 128 | 20 | 16 | 18 | 20 |

Memory consumption with data parallelism training

| Batch size | Memory (MB) |
|---|---|
| 32 | 367 |
| 64 | 515 |
| 128 | 764 |

WHAT WE CAN PRESENT ON MAY 9TH

We can present the training-time results of our extended mxnet on the PTB training task. Specifically, we can present:

  • A raw data table of training times on different numbers of GPUs.
  • A graph comparing training time between our approach, the old version of mxnet, and other learning frameworks.
  • A graph of the study of our method's influence on convergence rate.

Hopefully we can also report results/progress on an English-French machine translation task. We are working on this with the authors of mxnet.
