fixing links
soumith committed Jul 25, 2016
1 parent 782fd63 commit 6553bd5
Showing 1 changed file with 6 additions and 6 deletions.
12 changes: 6 additions & 6 deletions blog/_posts/2016-07-25-nce.md
@@ -111,7 +111,7 @@ but they are not the only kind of model that can be used to model language.
There are also the more advanced Long Short Term Memory (LSTM) models [[3],[4],[5]](#nce.ref), which
have special gated cells that facilitate the backpropagation of gradients through longer sequences.

-![lstm](https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LSTM.png)
+<p align='center'><img width="100%" src="https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LSTM.png"></p>

The exact implementation is as follows:

@@ -366,7 +366,7 @@ For a `FloatTensor` or `CudaTensor`, that single tensor will take up 20GB of memory.
The number can double for `gradInput` (i.e. gradients with respect to the input),
and double again, as both `Linear` and `SoftMax` store a copy of the `output`.

-![Scale of output layer buffers with Linear](https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LM-Linear.png)
+<p align='center'><img width="100%" src="https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LM-Linear.png"></p>

Excluding parameters and their gradients, the above figure outlines the approximate memory consumption of a 4-layer LSTM with 2048 units with a `seqlen=50`.
Even if somehow you can find a way to put 80GB on a GPU (or distribute it over many), you still run into the problem of
@@ -400,7 +400,7 @@ nn.Sequential():add(nn.Linear(inputsize, #trainset.ivocab)):add(nn.LogSoftMax())
For evaluating perplexity, the model still implements `Linear` + `SoftMax`.
NCE is useful for reducing the memory consumption during training (compare to the figure above):

-![Scale of output layer buffers with NCE](https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LM-NCE.png)
+<p align='center'><img width="100%" src="https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LM-NCE.png"></p>

Along with the [NCECriterion](https://github.com/Element-Research/dpnn#nn.NCECriterion),
the `NCEModule` implements the algorithm described in [[1]](#nce.ref).
@@ -541,7 +541,7 @@ As can be observed in the previous section, training a 2-layer LSTM with only 25
generated samples. The model needs much more capacity than what can fit on a 12GB GPU.
For parameters and their gradients, a 4x2048 LSTM model requires the following:

-![LM parameter memory consumption](https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LM-params.png)
+<p align='center'><img width="100%" src="https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LM-params.png"></p>

This doesn't include all the intermediate buffers required for the different modules (outlined in [NCE section](#nce.nce)).
The solution was, of course, to distribute the model over more GPUs.
@@ -687,14 +687,14 @@ The following figure outlines the learning curves for the above 4x2048 LSTM mode
The figure plots the NCE training and validation error for the model, which is the error output by the `NCEModule`.
Test set error isn't plotted as doing so for any epoch requires about 3 hours because test set inference uses `Linear` + `SoftMax` with `batchsize=1`.

-![LSTM NCE Learning curves](https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LSTM-NCE-curve.png)
+<p align='center'><img width="100%" src="https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LSTM-NCE-curve.png"></p>

As you can see, most of the learning is done in the first epochs.
Nevertheless, the training and validation error are consistently reduced as training progresses.

The following figure compares the validation learning curves (again, NCE error) for a small 2x250 LSTM (no dropout) and a big 4x2048 LSTM (with dropout).

-![Small vs Big LSTM](https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/small-vs-big-lstm.png)
+<p align='center'><img width="100%" src="https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/small-vs-big-lstm.png"></p>

What I find impressive about this figure is how quickly the higher-capacity model bests the lower-capacity model.
This clearly demonstrates the importance of capacity when optimizing large-scale language models.