| author | nicholas-leonard <nick@nikopia.org> | 2016-07-20 19:11:52 +0300 |
|---|---|---|
| committer | nicholas-leonard <nick@nikopia.org> | 2016-07-20 19:11:52 +0300 |
| commit | 147898b12aec04f85badb85a0a6bf348ac529b51 (patch) | |
| tree | 1ecf534a4b09b40ff7fcfbbbef4140e6ed4f276b | |
| parent | 259333edd7df4be28257aa7e08b92e79e9aefbf1 (diff) | |
-rw-r--r-- | blog/_posts/2016-05-11-nce.md | 42 |
1 file changed, 27 insertions(+), 15 deletions(-)
diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md
index 195579c..f1ff0bd 100644
--- a/blog/_posts/2016-05-11-nce.md
+++ b/blog/_posts/2016-05-11-nce.md
@@ -3,26 +3,30 @@ layout: post
 title: Noise Contrastive Estimation
 comments: True
 author: nicholas-leonard
-excerpt: TODO
+excerpt: Noise contrastive estimation is used
+to train a multi-GPU recurrent neural network language model
+on the Google billion words dataset.
 picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/output_52iFki.gif
 ---

 <!---# Noise contrastive estimation for the Google billion words dataset -->

- * Word versus character level language models
- * Recurrent neural network language models
- * Loading the Google billion words dataset
- * Building a multi-layer LSTM
- * Training and evaluation scripts
- * Results
+ * [Word versus character level language models](#nce.char)
+ * [Recurrent neural network language models](#nce.rnnlm)
+ * [Loading the Google billion words dataset](#nce.gbw)
+ * [Building a multi-layer LSTM](#nce.lstm)
+ * [Training and evaluation scripts](#nce.script)
+ * [Results](#nce.result)
+ * [References](#nce.ref)

-In this blog post, we use Torch to use noise contrastive estimation (NCE) [[2]](#nce.ref)
+In this Torch blog post, we use noise contrastive estimation (NCE) [[2]](#nce.ref)
 to train a multi-GPU recurrent neural network language model (RNNLM)
 on the Google billion words (GBW) dataset [[7]](#nce.ref).
 This blog post is the result of many months of work.
 The enormity of the dataset caused us to contribute some novel open-source modules, criteria and even a multi-GPU tensor.
 We also provide scripts so that you can train and evaluate your own language models.

+<a name='nce.char'></a>
 ## Word versus character level language models

 In recent months you may have noticed increased interest in generative character-level
@@ -60,28 +64,32 @@ Progress isn't made by early risers. It's made by lazy men trying to find easier

 After tokenization, the word-level model might view this sequence as containing 22 tokens.
 On the other hand, the char-level model will view this sequence as containing 102 tokens.
 This longer sequence makes the task of the char-level model harder than the word-level model, as it
-must take into account dependencies between more tokens over more time-steps.
+must take into account dependencies between more tokens over more time-steps.[[8]](#nce.ref)

 The main advantage of char-level over word-level language models is that they
-have a really small vocabulary. For example, the Google Billion Words dataset will contain approximately 800 characters
+have a really small vocabulary. For example, the GBW dataset will contain approximately 800 characters
 compared to 800,000 words (after pruning low-frequency tokens).
 In practice this means that char-level models will require less memory and have faster inference than their word-level counterparts.

+<a name='nce.rnnlm'></a>
 ## Recurrent neural network language models

-Our task is to build a language model which will maximize the likelihood of the
+Our task is to build a language model which maximizes the likelihood of the
 next word given the history of previous words in the sentence.
-The following figure illustrates the a Simple Recurrent Neural Network (Simple RNN) language model:
+The following figure illustrates a simple recurrent neural network (Simple RNN) language model:

 ![rnnlm](images/rnnlm.png)

-So for this particular example, the model should maximize "is" given "what", and then "the" given "is" and so on.
-The RNN has an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the following target words.
+For this particular example, the model should maximize "is" given "what", and then "the" given "is" and so on.
+The Simple RNN has an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the following target words.
 Simple RNNs are not the only kind of model that can be used to model language.
 There are also the more advanced Long Short Term Memory (LSTM) models [[3],[4],[5]](#nce.ref),
 which have special gated cells that facilitate the backpropagation of gradients through longer sequences.
-LSTMs can learn dependencies seperated by much longer time-steps .
+LSTMs can learn dependencies separated by much longer time-steps.
+Like convolutions, these LSTM layers can also be stacked to form deeper models.
+In the Building

+<a name='nce.gbw'></a>
 ## Loading the Google billion words dataset

 For our word-level language model we use the GBW dataset.
@@ -241,6 +249,7 @@ start and end of the sequence, respectively. Each token is mapped to an integer.
 you can see that `<S>` is mapped to integer `793470` in the above example.

 Now that we feel confident in our dataset, let's look at the model.

+<a name='nce.lstm'></a>
 ## Building a multi-layer LSTM

 The input layer of the `lm` model is a lookup table:
@@ -353,6 +362,7 @@ Reference [[2]](#nce.ref) implements a faster version where the noise samples are
 This makes the code a bit faster as the more efficient [torch.addmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.addmm) can be used.
 This faster NCE version described in [[2]](#nce.ref) is the default implementation of the `NCEModule`.
 Sampling per batch-row can be turned on with `NCEModule.rownoise=true`.

+<a name='nce.script'></a>
 ## Training and evaluation scripts

 The experiments presented here use three scripts: two for training and one for evaluation.
@@ -553,6 +563,7 @@ nn.Serial @ nn.Sequential {
 }
 ```

+<a name='nce.result'></a>
 ## Results

 On the 4-layer LSTM with 2048 hidden units, [[1]](#nce.ref) obtain 43.2 perplexity on the GBW test set.
@@ -594,3 +605,4 @@ Here are 8 sentences sampled independently from the 4-layer LSTM with a `tempera
 5. *A Graves, A Mohamed, G Hinton*, [Speech Recognition with Deep Recurrent Neural Networks](http://arxiv.org/pdf/1303.5778.pdf)
 6. *K Greff, RK Srivastava, J Koutník*, [LSTM: A Search Space Odyssey](http://arxiv.org/pdf/1503.04069)
 7. *C Chelba, T Mikolov, M Schuster, Q Ge, T Brants, P Koehn, T Robinson*, [One billion word benchmark for measuring progress in statistical language modeling](http://arxiv.org/pdf/1312.3005)
+8. *A Graves*, [Generating Sequences With Recurrent Neural Networks](http://arxiv.org/pdf/1308.0850v5.pdf)
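The NCE objective that the `NCEModule` hunks above refer to replaces the softmax over the full 800,000-word vocabulary with a binary classification between the target word and `k` samples drawn from a noise distribution. As a rough illustration of the math from [[2]](#nce.ref), here is a minimal Python sketch of the per-example objective; the function name and signature are hypothetical and this is not the Torch `NCEModule` API:

```python
import math

def nce_loss(p_target, pn_target, p_noise_samples, pn_noise_samples, k):
    """Per-example NCE loss for one (history, target) pair with k noise samples.

    p_*  : unnormalized model scores exp(s(w, h)); NCE lets the model
           self-normalize, so these stand in for p_model(w | h).
    pn_* : probabilities of the same words under the noise (e.g. unigram)
           distribution.
    """
    # Posterior that a word came from the data rather than the noise:
    # P(D=1 | w, h) = p_model / (p_model + k * p_noise)
    def p_true(p_model, p_noise):
        return p_model / (p_model + k * p_noise)

    # The target word should be classified as data (D=1) ...
    loss = -math.log(p_true(p_target, pn_target))
    # ... and each of the k noise samples as noise (D=0).
    for pm, pn in zip(p_noise_samples, pn_noise_samples):
        loss -= math.log(1.0 - p_true(pm, pn))
    return loss
```

Only the target and the `k` noise words are scored, so the cost per example is independent of the vocabulary size; as the model's score for the target grows relative to `k * p_noise`, the loss approaches zero.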