
github.com/torch/torch.github.io.git
author    Soumith Chintala <soumith@fb.com>    2016-07-26 01:19:57 +0300
committer Soumith Chintala <soumith@fb.com>    2016-07-26 01:19:57 +0300
commit    6553bd5f7e91a6b72c4a3787461f0a425181152c (patch)
tree      68bcb93446e8a08e8810881d8f8b989ae9f62c10
parent    782fd6307ca13c840737ee5f6f0d0780210fbc4d (diff)

fixing links

 blog/_posts/2016-07-25-nce.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/blog/_posts/2016-07-25-nce.md b/blog/_posts/2016-07-25-nce.md
index a738e16..89862a5 100644
--- a/blog/_posts/2016-07-25-nce.md
+++ b/blog/_posts/2016-07-25-nce.md
@@ -111,7 +111,7 @@ but they are not the only kind of model that can be used to model language.
There are also the more advanced Long Short Term Memory (LSTM) models [[3],[4],[5]](#nce.ref), which
have special gated cells that facilitate the backpropagation of gradients through longer sequences.
-![lstm](https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LSTM.png)
+<p align='center'><img width="100%" src="https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LSTM.png"></p>
The exact implementation is as follows:
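(The full listing falls outside this hunk. Purely as orientation, a minimal sketch of a stacked LSTM body using the Element-Research `rnn` package might look like the following; the layer sizes are illustrative, not the post's configuration.)

```lua
require 'rnn'  -- Element-Research rnn package (provides SeqLSTM)

-- layer sizes here are illustrative, not the post's configuration
local inputsize, hiddensize, nlayer = 2048, 2048, 4

local lstm = nn.Sequential()
for i = 1, nlayer do
   -- each SeqLSTM consumes a seqlen x batchsize x inputsize tensor
   lstm:add(nn.SeqLSTM(i == 1 and inputsize or hiddensize, hiddensize))
end
```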
@@ -366,7 +366,7 @@ For a `FloatTensor` or `CudaTensor`, that single tensor will take up 20GB of mem
That number can double for `gradInput` (i.e. gradients with respect to the input),
and double again as both `Linear` and `SoftMax` store a copy for the `output`.
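As a back-of-the-envelope check of the 20GB figure (the `batchsize` and `vocabsize` below are assumed values for illustration; only `seqlen=50` appears in the post):

```lua
-- rough check of the 20GB figure; batchsize and vocabsize are assumed
-- values for illustration (only seqlen=50 is taken from the post)
local batchsize, seqlen, vocabsize = 128, 50, 800000
local bytes = batchsize * seqlen * vocabsize * 4  -- 4 bytes per float
print(string.format("%.1f GB", bytes / 1e9))      -- roughly 20 GB for this single output buffer
```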
-![Scale of output layer buffers with Linear](https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LM-Linear.png)
+<p align='center'><img width="100%" src="https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LM-Linear.png"></p>
Excluding parameters and their gradients, the above figure outlines the approximate memory consumption of a 4-layer LSTM with 2048 units and a `seqlen=50`.
Even if you could somehow find a way to put 80GB on a GPU (or distribute it over many), you still run into the problem of
@@ -400,7 +400,7 @@ nn.Sequential():add(nn.Linear(inputsize, #trainset.ivocab)):add(nn.LogSoftMax())
For evaluating perplexity, the model still implements `Linear` + `SoftMax`.
NCE is useful for reducing the memory consumption during training (compare to the figure above):
-![Scale of output layer buffers with NCE](https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LM-NCE.png)
+<p align='center'><img width="100%" src="https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LM-NCE.png"></p>
Along with the [NCECriterion](https://github.com/Element-Research/dpnn#nn.NCECriterion),
the `NCEModule` implements the algorithm described in [[1]](#nce.ref).
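A minimal sketch of swapping in the NCE output layer with `dpnn`, reusing `inputsize` and `trainset.ivocab` from the snippet above; the noise-sample count `k` and the `unigram` tensor are placeholders rather than the post's actual settings:

```lua
require 'dpnn'  -- provides NCEModule and NCECriterion

-- placeholder values for illustration; the post's actual settings may differ
local k = 25                                                 -- noise samples drawn per target word
local unigram = torch.FloatTensor(#trainset.ivocab):fill(1)  -- stand-in noise distribution (word frequencies in practice)

local outputlayer = nn.NCEModule(inputsize, #trainset.ivocab, k, unigram)
local criterion = nn.NCECriterion()
```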
@@ -541,7 +541,7 @@ As can be observed in the previous section, training a 2-layer LSTM with only 25
generated samples. The model needs much more capacity than what can fit on a 12GB GPU.
For parameters and their gradients, a 4x2048 LSTM model requires the following:
-![LM parameter memory consumption](https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LM-params.png)
+<p align='center'><img width="100%" src="https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LM-params.png"></p>
This doesn't include all the intermediate buffers required for the different modules (outlined in the [NCE section](#nce.nce)).
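A rough count under assumed sizes (the vocabulary, embedding and hidden sizes below are illustrative; the figure above has the actual breakdown):

```lua
-- rough parameter count; all sizes are assumptions for illustration
local vocabsize, embedsize, hiddensize, nlayer = 800000, 2048, 2048, 4

local lookup = vocabsize * embedsize                    -- input word embeddings (LookupTable)
local output = vocabsize * hiddensize + vocabsize       -- NCE linear weights + bias
local lstm   = nlayer * 4 * (2 * hiddensize * hiddensize + hiddensize)  -- 4 gates: input, recurrent and bias weights
local params = lookup + output + lstm

-- parameters plus their gradients, 4 bytes per float
print(string.format("%.1f GB", 2 * params * 4 / 1e9))   -- roughly 27 GB under these assumptions
```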
The solution was of course to distribute the model over more GPUs.
@@ -687,14 +687,14 @@ The following figure outlines the learning curves for the above 4x2048 LSTM mode
The figure plots the NCE training and validation error for the model, which is the error output by the `NCEModule`.
Test set error isn't plotted, as doing so for a single epoch requires about 3 hours: test set inference uses `Linear` + `SoftMax` with `batchsize=1`.
-![LSTM NCE Learning curves](https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LSTM-NCE-curve.png)
+<p align='center'><img width="100%" src="https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/LSTM-NCE-curve.png"></p>
As you can see, most of the learning is done in the first epochs.
Nevertheless, the training and validation errors are consistently reduced as training progresses.
The following figure compares the validation learning curves (again, NCE error) for a small 2x250 LSTM (no dropout) and a big 4x2048 LSTM (with dropout).
-![Small vs Big LSTM](https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/small-vs-big-lstm.png)
+<p align='center'><img width="100%" src="https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/small-vs-big-lstm.png"></p>
What I find impressive about this figure is how quickly the higher-capacity model bests the lower-capacity model.
This clearly demonstrates the importance of capacity when optimizing large-scale language models.