
Summary

In this project, we focus on training LSTMs faster using GPUs. First, we implemented bucketing to reduce unnecessary workload. Then we reasoned about the pros and cons of Data Parallelism (DP) and Model Parallelism (MP). In our experiments, model parallelism scales better in multi-GPU training.

Using model parallelism and bucketing, our implementation achieves a 5.2x speedup of LSTM training on 4 GPUs compared to 1 GPU, on the Penn Treebank dataset of approximately 1 million words.


Use Bucketing to Speed Up LSTM Training

What is an LSTM

Long short-term memory (LSTM) is a widely used machine learning model that has proved quite successful in applications such as automatic speech recognition and machine translation [1][2].

The figure below shows, on the left, a typical LSTM network used to predict the next word in a sentence: x is the input, A and B are two LSTM cells that evaluate the input based on the model parameters, and h is the output (the prediction of the next word). LSTM is effective at prediction because of its recurrent nature: the output of one step is used as the input of the next. The equivalent unrolled structure is shown on the right-hand side, here predicting "Fun" as the next word of "Parallel programming is"; the arrows indicate the data dependencies in the network.

[Figure: an LSTM network (left) and its unrolled equivalent (right) predicting the next word]

As you can see, LSTM evaluation is inherently hard to parallelize due to its complex data dependencies. LSTM training, which adds further dependencies in reverse order during the back-propagation phase, is even harder.
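To make this dependency concrete, here is a short sketch in plain Python (`lstm_cell` is a hypothetical stand-in for the real per-step cell computation, not code from our implementation): each unrolled step needs the hidden and cell states produced by the previous step, so the steps cannot run independently.

```python
def unrolled_lstm(inputs, lstm_cell, h0, c0):
    """Evaluate an unrolled LSTM over one input sequence.

    `lstm_cell` maps (x_t, h_prev, c_prev) -> (h_t, c_t); here it is a
    hypothetical stand-in for the real per-step LSTM computation.
    """
    h, c = h0, c0
    outputs = []
    for x_t in inputs:                # step t cannot start before step t-1 finishes:
        h, c = lstm_cell(x_t, h, c)   # it needs h_{t-1} and c_{t-1}
        outputs.append(h)             # h_t is the prediction fed to the next step
    return outputs

# toy cell just to make the dependency chain executable
outs = unrolled_lstm([1, 2, 3], lambda x, h, c: (h + x, c + x), h0=0, c0=0)
print(outs)  # [1, 3, 6] -- each output depends on everything before it
```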

Padding data leads to unnecessary workload

LSTM models are trained on GPUs in mini-batches using stochastic gradient descent, where a batch of input sequences (of potentially different lengths) is trained together. The model parameters have to be updated with the gradient before the next mini-batch. Since it is not practical to unroll a separate model for each sequence in the mini-batch, a common practice is to pad the inputs with NULL values so that they can all be trained with the same unrolled model. For example, the gray cells in the figure below denote the padding for the short sequences in batch 1 and batch 2.

[Figure: gray cells mark the NULL padding for the short sequences in batch 1 and batch 2]

Obviously, mini-batches with large amounts of NULL padding waste a lot of computation and data movement during training.
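As a concrete illustration (a minimal sketch of the general practice, not our actual data pipeline; the pad value and token ids are made up), every sequence in a mini-batch is padded up to the longest one, and every padded position still has to flow through the unrolled network:

```python
def pad_batch(sequences, pad_value=0):
    """Pad all sequences in a mini-batch to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

batch = [[4, 8, 15], [16, 23], [42]]   # token ids of three sentences of different lengths
print(pad_batch(batch))                # [[4, 8, 15], [16, 23, 0], [42, 0, 0]]
```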

Bucketing reduces padding by gathering sequences of similar length

To reduce the overhead caused by variable-length input sequences in mini-batches, we applied a technique called bucketing. The main idea is to collect sequences of similar lengths and arrange them into the same training batch so as to avoid NULL padding as much as possible, which reduces unnecessary workload. This technique was recently introduced by TensorFlow [3].

If we apply bucketing to the previous example of NULL-padded batches, we get one bucket of size 8 and another of size 4. As we can see, the amount of padding is significantly reduced, which in turn reduces the time spent moving data and computing gradients for each mini-batch.

[Figure: the same data after bucketing, with much less padding]

Auto-generating buckets from the data gives the best speedup on LSTM training

In our implementation, we generate the buckets automatically from the input using a greedy algorithm.

The main idea is to sort the inputs according to their lengths and greedily assign them to buckets so that the number of items in each bucket is divisible by the batch size. In this way, the padding within each bucket is small. The figure below shows the original input sequences on the left-hand side and the bucket assignment on the right-hand side.

[Figure: original input sequences (left) and their bucket assignment (right)]
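A minimal sketch of the greedy assignment is shown below. This is our own simplification for illustration (it drops the leftover sequences that do not fill a whole batch and merges adjacent batches whose longest sequences match), not the exact code used in the experiments.

```python
def make_buckets(sequences, batch_size):
    """Greedily group length-sorted sequences into buckets.

    Each bucket holds a multiple of `batch_size` sequences, so every
    mini-batch is drawn from a single bucket and padded only up to the
    longest sequence in that bucket.
    """
    ordered = sorted(sequences, key=len)
    usable = len(ordered) - len(ordered) % batch_size   # drop the incomplete tail
    buckets = []
    for i in range(0, usable, batch_size):
        batch = ordered[i:i + batch_size]
        # extend the previous bucket if this batch is no longer than its contents
        if buckets and len(batch[-1]) == len(buckets[-1][-1]):
            buckets[-1].extend(batch)
        else:
            buckets.append(batch)
    return buckets

buckets = make_buckets([[1]] * 8 + [[1, 2, 3]] * 4, batch_size=4)
print([len(b) for b in buckets])   # [8, 4]: eight short sequences in one bucket, four long ones in another
```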

The computational overhead of generating such buckets is negligible compared to the training time. In the following figure, we measure the speedup of LSTM training with user-configured buckets and with auto-generated buckets, compared to training without bucketing. The experiment was done with the Penn Treebank dataset on a GTX 750 GPU. As you can see, training with auto-generated buckets gains the most speedup (3.35x).

[Figure: speedup of user-configured and auto-generated buckets over training without bucketing]

Using Data Parallelism or Model Parallelism to Train LSTM on Multiple GPUs

Training LSTM on multiple GPUs is beneficial but challenging

Exploiting more parallelism from multiple GPUs is always alluring to people who want to train neural networks fast, and we want to know how training speed scales when we have more GPUs to play with. However, because of GPU-GPU and GPU-CPU communication costs, training with multiple GPUs is challenging. In general, there are two approaches: Data Parallelism, where we split the data over GPUs to divide the workload, and Model Parallelism, where we assign different layers of the model to different GPUs.

DP 101: Data is the primitive

In Data Parallelism, we split the data over GPUs to divide the workload. In our case, the primitive along which we divide the workload is the sentence. The figure below shows how training a small batch of 2 sentences is assigned to 2 GPUs. Each GPU is responsible for the forward pass on its embedded vectors; the losses are then combined and the gradients back-propagated.

[Figure: a mini-batch of 2 sentences split across 2 GPUs for data-parallel training]

Data Parallelism is easier to implement: the programmer can simply run the same instance on each GPU against different data. Another benefit of Data Parallelism is its good workload balance.
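The pattern looks roughly like the sketch below (framework-agnostic NumPy with a toy linear model standing in for the LSTM; shard counts and learning rate are made up): each "GPU" computes a gradient on its own shard of the mini-batch, and the gradients are aggregated before a single shared parameter update.

```python
import numpy as np

def local_gradient(weights, x_shard, y_shard):
    """Squared-loss gradient for a toy linear model (stand-in for the LSTM)."""
    pred = x_shard @ weights
    return x_shard.T @ (pred - y_shard) / len(x_shard)

np.random.seed(0)
x, y = np.random.randn(32, 8), np.random.randn(32, 1)   # one mini-batch of 32 "sentences"
weights = np.zeros((8, 1))

shards = np.array_split(np.arange(32), 2)                # one shard per "GPU"
grads = [local_gradient(weights, x[idx], y[idx]) for idx in shards]
weights -= 0.1 * sum(grads) / len(grads)                 # aggregate, then one shared update
```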

DP does not scale well to multiple GPUs

However, Data Parallelism does not scale well to multiple GPUs. The two figures below show the speedup of Data Parallelism on multiple GPUs with batch sizes of 32 and 290, respectively. When the batch size is 32, adding GPUs only makes performance worse. When the batch size is 290, where the computational intensity is higher, adding more GPUs still helps little.

[Figures: data-parallel speedup on multiple GPUs with batch sizes 32 and 290]

The overhead mainly comes from the communication cost of multiple GPUs updating shared gradients concurrently. So far, no great solution has been proposed for maintaining real-time consistency of shared parameters. In MXNet, a parameter store called KVStore is responsible for aggregating and updating gradients. The following table shows that communication with this centralized store takes up an increasing share of time as the number of GPUs goes from 1 to 4. The batch size is 32 in this experiment.

[Table (image): share of time spent on KVStore communication for 1 to 4 GPUs, batch size 32]
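For reference, the push/pull pattern with MXNet's KVStore looks roughly like the sketch below (a single toy key and made-up shapes; this is not our training code): every GPU pushes its local gradient for a key, the store aggregates them, and every GPU pulls the result back before the next mini-batch.

```python
import mxnet as mx

kv = mx.kv.create('device')                  # aggregate on GPU; 'local' aggregates on CPU
shape = (4, 4)
kv.init(0, mx.nd.zeros(shape))               # key 0 holds one toy parameter slot

# each GPU pushes its local gradient for key 0; the KVStore sums them
grads = [mx.nd.ones(shape, ctx=mx.gpu(i)) for i in range(4)]
kv.push(0, grads)

# every GPU pulls the aggregated value back before the next mini-batch
outs = [mx.nd.zeros(shape, ctx=mx.gpu(i)) for i in range(4)]
kv.pull(0, out=outs)
```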

MP 101: Using multiple GPUs as a pipeline

Model parallelism has been under heated discussion in applied machine learning recently. It was originally designed for the very large convolutional layers in GoogleNet. We borrow the idea and place each layer on one GPU. The primitive for model parallelism is a layer of the neural network model. The benefit is that each GPU does not have to keep all the layers in memory, which relieves the memory limitation in large-scale tasks (for example, machine translation).

[Figure: model parallelism assigns each LSTM layer to a different GPU, forming a pipeline]

In the figure above, we assign different LSTM layers to different GPUs. After GPU 1 finishes computing layer 1 for the first sentence, the output is handed to GPU 2; at the same time, GPU 1 fetches the next sentence and starts working on it. This differs significantly from data parallelism: there is no contention to update a shared model at the end of each iteration, and most of the communication happens while pipelining intermediate results between GPUs.
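In MXNet this placement can be expressed with context groups. The sketch below is a minimal illustration with a toy two-layer network rather than our full LSTM (group names, layer sizes, and shapes are made up): each stage is tagged with a `ctx_group` attribute, and the groups are mapped to physical GPUs when the symbol is bound. The overlap between stages then comes from the engine scheduling operations that do not depend on each other concurrently.

```python
import mxnet as mx

# tag each stage of the network with a context group
with mx.AttrScope(ctx_group='stage1'):
    data = mx.sym.Variable('data')
    fc1 = mx.sym.FullyConnected(data=data, num_hidden=128, name='fc1')
with mx.AttrScope(ctx_group='stage2'):
    fc2 = mx.sym.FullyConnected(data=fc1, num_hidden=64, name='fc2')

# map each group to a physical GPU; intermediate results are copied between them
group2ctx = {'stage1': mx.gpu(0), 'stage2': mx.gpu(1)}
executor = fc2.simple_bind(ctx=mx.gpu(0), group2ctx=group2ctx, data=(32, 256))
executor.forward()
```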

Experimental results show MP scales better

We set up experiments to see how Model Parallelism scales. We trained an 8-layer LSTM model with 400 hidden units, plus an encoder and a decoder, on the Penn Treebank task, in both a bucketing and a non-bucketing version. The results, displayed in the figure below, show that on 4 GPUs Model Parallelism achieves about a 2x speedup without bucketing and around a 5.2x speedup with bucketing.

[Figure: model-parallel speedup on up to 4 GPUs, with and without bucketing]

MP has lower communication cost

We believe Model Parallelism scales better because it has a lower communication cost. One reason is that there is little contention in MP's data transmission. In DP, contention arises from multiple GPUs trying to update the same address space. In MP, the data transmitted between sender and receiver, usually under 1 MB, does not saturate the memory bandwidth.
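A rough back-of-envelope comparison illustrates the gap. The numbers below are our own assumptions based on the 8-layer, 400-hidden-unit model and batch size 32 used above (the input size is assumed equal to the hidden size, and the encoder/decoder are ignored):

```python
# Approximate communication volume per mini-batch.
# An LSTM layer with hidden size h and input size x has about 4 * (h*(h + x) + h)
# parameters (four gates, each with a weight matrix and a bias).
hidden, layers, batch, bytes_per_float = 400, 8, 32, 4

params_per_layer = 4 * (hidden * (hidden + hidden) + hidden)   # assume input size == hidden
total_params = layers * params_per_layer

dp_bytes = total_params * bytes_per_float      # DP: each GPU pushes a full gradient
mp_bytes = batch * hidden * bytes_per_float    # MP: one layer's activations cross a GPU boundary

print("DP gradient push per GPU per batch: %.1f MB" % (dp_bytes / 1e6))   # ~41 MB
print("MP activation transfer per hop:     %.1f KB" % (mp_bytes / 1e3))   # ~51 KB
```

Even with generous error bars, the per-hop traffic of MP is orders of magnitude smaller than the per-batch gradient traffic of DP, and there is no shared destination to contend for.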

Implementing MP requires good knowledge of workload

Implementing Model Parallelism requires good knowledge of the training task in order to partition the network across the GPUs. Although a detailed analysis is beyond the scope of a course project, we can lay down some general principles:

  • Place neighboring layers on the same GPU to avoid data transmission.
  • Balance the workload between GPUs to avoid bottlenecks in the pipeline.
  • Remember that different kinds of layers have different computation and memory characteristics.
[Figure: two candidate layer-to-GPU partitions of the 8-layer model]

Let us take a quick look at the two pipelines above. Both have 8 layers, with a decoder and an encoder layer. Clearly, based on our first principle, it is unwise to place neighboring layers on separate GPUs. We also want to balance the workload across GPUs. Here the LSTM layers, although consuming less memory than the decoder/encoder layers, take up more computation time because of the dependencies in the unrolled LSTM. Thus the partition on the left is faster than the one on the right because its workload is more even.

Some preliminary experimental results also support this analysis. In the table below, each group of numbers denotes the layer_ids placed on a specific GPU. We can see that the first partition, which is the left one in the figure above, achieves the best speed.

GPU 1    GPU 2    GPU 3    GPU 4    Time
0,1      2,3      4,5      6,7      243
0,4      1,5      2,6      3,7      308
0        1,2,3    4,5,6    7        276
0,1      2,3      4,5,6    7        275
0,1,2    3,4,5    6,7      -        269
0,3,6    1,4,7    2,5      -        345

Final Conclusion: MP is better than DP for training LSTM on multiple GPUs

In conclusion, Model Parallelism is better for training LSTM models in multi-GPU settings. Model Parallelism has a lower communication cost than Data Parallelism, which helps it scale better on multiple GPUs. On the other hand, Model Parallelism requires more expertise in both machine learning and systems in order to achieve a good workload balance. In our experiments, shown in the chart below, Model Parallelism also converges faster than Data Parallelism.

                     Data Parallelism    Model Parallelism
Communication cost   High                Low
Workload balance     Balanced            Requires insight into both ML and systems
Convergence speed    Slower              Faster
[Figure: training convergence of Data Parallelism and Model Parallelism]

Appendix

Speedup of Model Parallelism and Data Parallelism on the same model: 8-layer LSTM
[Figure: speedup comparison of Model Parallelism and Data Parallelism]

Convergence of Model Parallelism and Data Parallelism

[Figure: convergence curves of Model Parallelism and Data Parallelism]

  • DP: 376 sec/epoch
  • MP: 257 sec/epoch

Work Division

An equal amount of work was done by each of us.

Reference

[1] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[2] Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural computation, 12(10), 2451-2471.

[3] Sequence-to-Sequence Models. (2016, May 7). Retrieved from https://www.tensorflow.org/versions/r0.8/tutorials/seq2seq/index.html