A Large-Scale PyTorch Language Model trained on the 1-Billion Word (LM1B) / (GBW) dataset
- 39.98 Perplexity after 5 training epochs using LSTM Language Model with Adam Optimizer
- Trained in ~26 hours using 1 Nvidia V100 GPU (~5.1 hours per epoch) with 2048 batch size (~10.7 GB GPU memory)
- 46.47 Perplexity after 5 training epochs on a 1-layer, 2048-unit, 256-projection LSTM Language Model [3]
- Trained for 3 days using 1 Nvidia P100 GPU (~12.5 hours per epoch)
- Implemented Sampled Softmax and Log-Uniform Sampler functions
Type | LM Memory Size | GPU |
---|---|---|
w/o tied weights | ~9 GB | Nvidia 1080 TI, Nvidia Titan X |
w/ tied weights [6] | ~7 GB | Nvidia 1070 or higher |
- There is an option to tie the word embedding and softmax weight matrices together to save GPU memory.
Parameter | Value |
---|---|
# Epochs | 5 |
Training Batch Size | 128 |
Evaluation Batch Size | 1 |
BPTT | 20 |
Embedding Size | 256 |
Hidden Size | 2048 |
Projection Size | 256 |
Tied Embedding + Softmax | False |
# Layers | 1 |
Optimizer | AdaGrad |
Learning Rate | 0.10 |
Gradient Clipping | 1.00 |
Dropout | 0.01 |
Weight-Decay (L2 Penalty) | 1e-6 |
- Download Google Billion Word Dataset for Torch - Link
- Run "process_gbw.py" on the "train_data.th7" file to create the "train_data.sid" file
- Install Cython framework and build Log_Uniform Sampler
- Convert Torch data tensors to PyTorch tensor format (Requires Pytorch v0.4.1)
I leverage the GBW data preprocessed for the Torch framework. (See Torch GBW) Each data tensor contains all the words in data partition. The "train_data.sid" file marks the start and end positions for each independent sentence. The preprocessing step and "train_data.sid" file speeds up loading the massive training data.
- Data Tensors - (test_data, valid_data, train_data, train_small, train_tiny) - (#words x 2) matrix - (sentence id, word id)
- Sentence ID Tensor - (#sentences x 2) matrix - (start position, sentence length)
- Download 1-Billion Word Dataset - Link
The Torch Data Format loads the entire dataset at once, so it requires at least 32 GB of memory. The original format partitions the dataset into smaller chunks, but it runs slower.