From 259333edd7df4be28257aa7e08b92e79e9aefbf1 Mon Sep 17 00:00:00 2001
From: nicholas-leonard
Date: Wed, 20 Jul 2016 11:55:44 -0400
Subject: nce++

---
 blog/_posts/2016-05-11-nce.md | 127 +++++++++++++++++++++++-------------------
 1 file changed, 70 insertions(+), 57 deletions(-)

diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md
index 11ee839..195579c 100644
--- a/blog/_posts/2016-05-11-nce.md
+++ b/blog/_posts/2016-05-11-nce.md
@@ -7,12 +7,28 @@ excerpt: TODO
picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/output_52iFki.gif
---
-
-
-In the past couple of months we have seen increased interest in generative character-level
-recurrent neural network (RNN) models like [char-rnn](https://github.com/karpathy/char-rnn)
+
+
+ * Word versus character level language models
+ * Recurrent neural network language models
+ * Loading the Google billion words dataset
+ * Building a multi-layer LSTM
+ * Training and evaluation scripts
+ * Results
+
+In this blog post, we use Torch and noise contrastive estimation (NCE) [[2]](#nce.ref)
+to train a multi-GPU recurrent neural network language model (RNNLM)
+on the Google billion words (GBW) dataset [[7]](#nce.ref).
+This blog post is the result of many months of work.
+The sheer size of the dataset led us to contribute some novel open-source modules, criteria and even a multi-GPU tensor.
+We also provide scripts so that you can train and evaluate your own language models.
+
+## Word versus character level language models
+
+In recent months you may have noticed increased interest in generative character-level
+RNNLMs like [char-rnn](https://github.com/karpathy/char-rnn)
and the more recent [torch-rnn](https://github.com/jcjohnson/torch-rnn).
-These models are very interesting as they can be used to generate sequences of text like:
+These models are very interesting as they can be used to generate sequences of characters like the following:

```lua


the developers built or trying to run the patch to Jagex.

```

The above was generated one character at a time using a sample of [reddit](https://www.reddit.com/) comments.
-As you can see for yourself, the general structure of the generated text looks good at first view.
+As you can see for yourself, the general structure of the generated text looks good at first glance.
The tags are opened and closed appropriately.
The first sentence looks good: `I liked this game so much!!` and it is related to the subreddit of the post: `Diablo`.
But reading the rest of it, we can start to see the limitations of char-level language models.
The spelling of individual words looks great, but
-the meaning of the next sentence is difficult to understand.
-
-## Word-Level vs Char-Level Language Models
+the meaning of the next sentence is difficult to understand (it is also very long).

In this blog post we will show how Torch can be used to train a large-scale word-level language model to generate
-independent sentences. Word-level models have an important advantage of char-level models.
+independent sentences. Word-level models have an important advantage over char-level models.
Take the following sequence as an example (a quote from Robert A. Heinlein):

```
@@ -53,30 +67,22 @@ have a really small vocabulary. For example, the Google Billion Words dataset wi
compared to 800,000 words (after pruning low-frequency tokens).
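
As a rough illustration (a minimal sketch, not the preprocessing actually used for GBW), the following Lua snippet tokenizes a short sentence at the character level and at the word level:

```lua
-- Tokenize one sentence two ways: as characters and as words.
local sentence = "Progress isn't made by early risers."

local chars, words = {}, {}
for c in sentence:gmatch(".") do   -- one token per character
   table.insert(chars, c)
end
for w in sentence:gmatch("%S+") do -- one token per whitespace-delimited word
   table.insert(words, w)
end

print(#chars, #words) -- prints 36 and 6: about 6x more character tokens than word tokens
```

The character sequence is several times longer, but each of its tokens is drawn from a vocabulary of at most a few hundred characters rather than hundreds of thousands of words.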
In practice this means that char-level models will require less memory and have faster inference than their word-level counterparts.

-## Output Layer Bottleneck
+## Recurrent neural network language models

-With its small vocabulary of 10000 words, the Penn Tree Bank dataset is relatively easy to use to build word-level language models.
-The output layer is still tractable to compute for both training and inference, especially for GPUs.
-For these smaller vocabularies, the output layer is basically a `Linear` followed by a `SoftMax`:
+Our task is to build a language model which will maximize the likelihood of the
+next word given the history of previous words in the sentence.
+The following figure illustrates a Simple Recurrent Neural Network (Simple RNN) language model:

-```lua
-outputlayer = nn.Sequential()
-   :add(nn.Linear(hiddensize, vocabsize))
-   :add(nn.SoftMax())
-```
+![rnnlm](images/rnnlm.png)

-However, when training with large vocabularies, like the 793471 words that makes up
-the Google Billion Words (GBW) dataset [[1]](#nce.ref),
-the output layer quickly becomes a bottle neck.
-If you are training your model with a `batchsize = 32` (number of sequences per batch) and a `seqlen = 100`
-(size of sequence to backpropagate through time),
-the output of that layer will have shape `seqlen x batchsize x vocabsize`, or `32 x 100 x 793471`.
-For a `FloatTensor` or `CudaTensor`, that single tensor will take up 10.156GB of memory.
-The number can be double for gradients, and doubled again as both Linear and SoftMax store a copy for the output.
-If somehow you can find a way to put >40GB on a GPU (or distribute it over many), you then run in the problem of
-forward/backward propagating through that `outputlayer` in a reasonable time-frame.
+So for this particular example, the model should maximize "is" given "what", and then "the" given "is" and so on.
+The RNN has an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the likelihood of the following target words.
+Simple RNNs are not the only kind of model that can be used to model language.
+There are also the more advanced Long Short Term Memory (LSTM) models [[3],[4],[5]](#nce.ref), which
+have special gated cells that facilitate the backpropagation of gradients through longer sequences.
+LSTMs can learn dependencies separated by much longer time-steps.
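
Concretely, a Simple RNN computes a hidden state `h[t] = sigmoid(Wx * x[t] + Wh * h[t-1])` and predicts the next word from `h[t]`. The following is a minimal sketch of a single time-step, written with plain `nn` modules and hypothetical sizes rather than the `rnn` package used for the actual models in this post:

```lua
require 'nn'

local hiddensize, vocabsize = 250, 10000 -- hypothetical sizes

-- h[t] = sigmoid(Wx * x[t] + Wh * h[t-1])
local i2h = nn.Linear(hiddensize, hiddensize) -- input (word embedding) to hidden
local h2h = nn.Linear(hiddensize, hiddensize) -- previous hidden to hidden (the recurrence)
-- y[t] = softmax(Wy * h[t]) : a distribution over the next word
local h2y = nn.Sequential()
   :add(nn.Linear(hiddensize, vocabsize))
   :add(nn.SoftMax())

-- x is the embedding of the current word, prevh is the previous hidden state h[t-1]
local function step(x, prevh)
   local h = nn.Sigmoid():forward(i2h:forward(x) + h2h:forward(prevh))
   local y = h2y:forward(h) -- probability of each vocabulary word being the next word
   return h, y
end

local h = torch.zeros(hiddensize) -- h[0]
local x = torch.randn(hiddensize) -- embedding of the current word (made up here)
local newh, y = step(x, h)
```
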

-## GBW Data Loader
+## Loading the Google billion words dataset

For our word-level language model we use the GBW dataset. The dataset is different from Penn Tree Bank in that sentences are
@@ -233,24 +239,9 @@ means that the data is iterated in sequence.
Each sentence in the GBW dataset is encapsulated by `<S>` and `</S>` tokens to indicate the start and end of the sequence, respectively.
Each token is mapped to an integer. So for example, you can see that `</S>` is mapped to integer `793470` in the above example.
-Now that we feel confident in our dataset, lets look at the model.
-
-## RNNLM
-
-Our task is to build a language model which will maximize the likelihood of the
-next word given the history of previous words in the sentence.
-The following figure illustrates the a Simple Recurrent Neural Network (Simple RNN) language model:
-
-![rnnlm](https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/rnnlm.png)
-
-So for this particular example, the model should maximize "is" given "what", and then "the" given "is" and so on.
-The RNN as an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the following target words.
-Simple RNNs are not the only kind of model that can be used model language.
-There are also the more advanced Long Short Term Memory (LSTM) models [[3],[4],[5]](#nce.ref), which
-have special gated cells that facilitate the backpropagation of gradients through longer sequences.
-LSTMs can learn dependencies seperated by much longer time-steps .
+Now that we feel confident in our dataset, let's look at the model.

-## Multi-layer LSTM
+## Building a multi-layer LSTM

The input layer of the `lm` model is a lookup table:

@@ -287,7 +278,29 @@ each time-step:
lm:add(nn.SplitTable(1))
```

-### Noise Contrastive Estimation
+### Output layer bottleneck
+
+With its small vocabulary of 10000 words, the Penn Tree Bank dataset is relatively easy to use to build word-level language models.
+The output layer is still computationally tractable for both training and inference, especially for GPUs.
+For these smaller vocabularies, the output layer is basically a `Linear` followed by a `SoftMax`:
+
+```lua
+outputlayer = nn.Sequential()
+   :add(nn.Linear(hiddensize, vocabsize))
+   :add(nn.SoftMax())
+```
+
+However, when training with large vocabularies, like the 793471 words that make up the GBW dataset,
+the output layer quickly becomes a bottleneck.
+If you are training your model with a `batchsize = 32` (number of sequences per batch) and a `seqlen = 100`
+(size of sequence to backpropagate through time),
+the output of that layer will have shape `seqlen x batchsize x vocabsize`, or `100 x 32 x 793471`.
+For a `FloatTensor` or `CudaTensor`, that single tensor will take up 10.156GB of memory.
+That number must be doubled for the gradients, and doubled again as both `Linear` and `SoftMax` store a copy of their output.
+Even if you could somehow fit >40GB on a GPU (or distribute it over many), you would still run into the problem of
+forward/backward propagating through that `outputlayer` in a reasonable time-frame.
+
+### Noise contrastive estimation

The output layer of the LM uses Noise Contrastive Estimation (NCE) to speed up training and reduce memory consumption:

@@ -340,11 +353,11 @@ Reference [[2]](#nce.ref) implement a faster version where the noise samples are
This makes the code a bit faster as the more efficient [torch.addmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.addmm) can be used.
This faster NCE version described in [[2]](#nce.ref) is the default implementation of the `NCEModule`.
Sampling per batch-row can be turned on with `NCEModule.rownoise=true`.

-## Scripts
+## Training and evaluation scripts

The experiments presented here use three scripts: two for training and one for evaluation.

-### Single-GPU Training Script
+### Single-GPU training script

We provide training scripts for a single GPU via the [noise-contrastive-estimate.lua](https://github.com/Element-Research/rnn/blob/master/examples/noise-contrastive-estimate.lua) script.
Running the following on a 12GB NVIDIA Titan X should result in a test set perplexity of 65.6 after 321 epochs:

@@ -377,8 +390,9 @@ nn.Serial @ nn.Sequential {

To use about one third less memory, you can set the momentum to 0.

-### Evaluation Script
+### Evaluation script

+The evaluation script can be used to measure perplexity on the test set or to sample independent sentences.
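
Perplexity is the exponentiated mean negative log-likelihood per word, so lower is better. As a minimal sketch of that computation (assuming a hypothetical tensor `nll` of per-word negative log-likelihoods; this is not the actual internals of the script):

```lua
-- PPL = exp( (1/N) * sum_i nll[i] ), where nll[i] = -log p(word_i | previous words)
local function perplexity(nll)
   return math.exp(nll:mean())
end

-- for example, a mean NLL of about 3.70 nats per word corresponds to a perplexity of roughly 40.6
```
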
To evaluate a saved model, you can use the [evaluate-rnnlm.lua](https://github.com/Element-Research/rnn/blob/master/scripts/evaluate-rnnlm.lua) script:

```bash
@@ -407,11 +421,10 @@ The `--temperature` flag can be reduced to make the sampling more deterministic.
It was last modified at 23.31 GMT on Saturday 22 December 2009 .
He told the newspaper the prosecution had been treating the small boy as " a young man who was playing for a while . "
We are astounded that our employees are not made aware of the risks and risks they are pursuing during this period of time , " he said .
- " I had a right to come up with the idea .
- But the truth
+ " I had a right to come up with the idea .
```

-### Multi-GPU Training Script
+### Multi-GPU training script

As can be observed in the previous section, training a 2-layer LSTM with only 250 hidden units will not yield the best generated samples.
The model needs much more capacity than what can fit on a 12GB GPU.

@@ -543,7 +556,7 @@ nn.Serial @ nn.Sequential {

## Results

On the 4-layer LSTM with 2048 hidden units, the authors of [[1]](#nce.ref) obtain 43.2 perplexity on the GBW test set.
-After early-stopping on a sub-set of the validation set (at 100 epochs of training), our model was able to reach *40.61* perplexity.
+After early-stopping on a sub-set of the validation set (at 100 epochs of training, where 1 epoch is 128 sequences x 400k words/sequence), our model was able to reach *40.61* perplexity.

This model was run on 4x12GB NVIDIA Titan X GPUs.
Training requires approximately 40GB of memory, distributed across the 4 GPU devices.

As in the original paper, we do not make use of momentum as it provides little benefit.
Training runs at about 3800 words/second.

-### Generated Samples
+### Generating sentences

Here are 8 sentences sampled independently from the 4-layer LSTM with a `temperature` of 0.7:

```
 
 
 
 
 
 
 
 Later he was driven to a nearby house where he was later found to be severely ill .
```

-Not bad, right?

### Learning Curves


4. *S Hochreiter, J Schmidhuber*, [Long Short Term Memory](http://web.eecs.utk.edu/~itamar/courses/ECE-692/Bobby_paper1.pdf)
5. *A Graves, A Mohamed, G Hinton*, [Speech Recognition with Deep Recurrent Neural Networks](http://arxiv.org/pdf/1303.5778.pdf)
6. *K Greff, RK Srivastava, J Koutník*, [LSTM: A Search Space Odyssey](http://arxiv.org/pdf/1503.04069)
+7. *C Chelba, T Mikolov, M Schuster, Q Ge, T Brants, P Koehn, T Robinson*, [One billion word benchmark for measuring progress in statistical language modeling](http://arxiv.org/pdf/1312.3005)
-- 
cgit v1.2.3