| author | nicholas-leonard <nick@nikopia.org> | 2016-07-20 19:11:52 +0300 |
|---|---|---|
| committer | nicholas-leonard <nick@nikopia.org> | 2016-07-20 19:11:52 +0300 |
| commit | 147898b12aec04f85badb85a0a6bf348ac529b51 (patch) | |
| tree | 1ecf534a4b09b40ff7fcfbbbef4140e6ed4f276b | |
| parent | 259333edd7df4be28257aa7e08b92e79e9aefbf1 (diff) | |
-rw-r--r-- | blog/_posts/2016-05-11-nce.md | 42 |
1 file changed, 27 insertions(+), 15 deletions(-)
diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md
index 195579c..f1ff0bd 100644
--- a/blog/_posts/2016-05-11-nce.md
+++ b/blog/_posts/2016-05-11-nce.md
@@ -3,26 +3,30 @@ layout: post
 title: Noise Contrastive Estimation
 comments: True
 author: nicholas-leonard
-excerpt: TODO
+excerpt: Noise contrastive estimation is used
+to train a multi-GPU recurrent neural network language model
+on the Google billion words dataset.
 picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/output_52iFki.gif
 ---

 <!---# Noise contrastive estimation for the Google billion words dataset -->

- * Word versus character level language models
- * Recurrent neural network language models
- * Loading the Google billion words dataset
- * Building a multi-layer LSTM
- * Training and evaluation scripts
- * Results
+ * [Word versus character level language models](#nce.char)
+ * [Recurrent neural network language models](#nce.rnnlm)
+ * [Loading the Google billion words dataset](#nce.gbw)
+ * [Building a multi-layer LSTM](#nce.lstm)
+ * [Training and evaluation scripts](#nce.script)
+ * [Results](#nce.result)
+ * [References](#nce.ref)

-In this blog post, we use Torch to use noise contrastive estimation (NCE) [[2]](#nce.ref)
+In this Torch blog post, we use noise contrastive estimation (NCE) [[2]](#nce.ref)
 to train a multi-GPU recurrent neural network language model (RNNLM)
 on the Google billion words (GBW) dataset [[7]](#nce.ref).
 This blog post is the result of many months of work.
 The enormity of the dataset caused us to contribute some novel open-source modules, criteria and even a multi-GPU tensor.
 We also provide scripts so that you can train and evaluate your own language models.

+<a name='nce.char'></a>
 ## Word versus character level language models

 In recent months you may have noticed increased interest in generative character-level
@@ -60,28 +64,32 @@ Progress isn't made by early risers. It's made by lazy men trying to find easier

 After tokenization, the word-level model might view this sequence as containing 22 tokens.
 On the other hand, the char-level model will view this sequence as containing 102 tokens.
 This longer sequence makes the task of the char-level model harder than the word-level model, as it
-must take into account dependencies between more tokens over more time-steps.
+must take into account dependencies between more tokens over more time-steps.[[8]](#nce.ref)

 The main advantage of char-level over word-level language models is that they
-have a really small vocabulary. For example, the Google Billion Words dataset will contain approximately 800 characters
+have a really small vocabulary. For example, the GBW dataset will contain approximately 800 characters
 compared to 800,000 words (after pruning low-frequency tokens).
 In practice this means that char-level models will require less memory and have faster inference than their word-level counterparts.

+<a name='nce.rnnlm'></a>
 ## Recurrent neural network language models

-Our task is to build a language model which will maximize the likelihood of the
+Our task is to build a language model which maximizes the likelihood of the
 next word given the history of previous words in the sentence.
-The following figure illustrates the a Simple Recurrent Neural Network (Simple RNN) language model:
+The following figure illustrates a simple recurrent neural network (Simple RNN) language model:

 ![rnnlm](images/rnnlm.png)

-So for this particular example, the model should maximize "is" given "what", and then "the" given "is" and so on.
-The RNN has an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the following target words.
+For this particular example, the model should maximize "is" given "what", and then "the" given "is" and so on.
+The Simple RNN has an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the following target words.
 Simple RNNs are not the only kind of model that can be used to model language.
 There are also the more advanced Long Short Term Memory (LSTM) models [[3],[4],[5]](#nce.ref),
 which have special gated cells that facilitate the backpropagation of gradients through longer sequences.
-LSTMs can learn dependencies seperated by much longer time-steps .
+LSTMs can learn dependencies separated by much longer time-steps.
+Like convolutions, these LSTM layers can also be stacked to form deeper models.
+In the Building

+<a name='nce.gbw'></a>
 ## Loading the Google billion words dataset

 For our word-level language model we use the GBW dataset.
@@ -241,6 +249,7 @@ start and end of the sequence, respectively. Each token is mapped to an integer.
 you can see that `<S>` is mapped to integer `793470` in the above example.

 Now that we feel confident in our dataset, let's look at the model.

+<a name='nce.lstm'></a>
 ## Building a multi-layer LSTM

 The input layer of the `lm` model is a lookup table:
@@ -353,6 +362,7 @@ Reference [[2]](#nce.ref) implements a faster version where the noise samples are
 This makes the code a bit faster as the more efficient [torch.addmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.addmm) can be used.
 This faster NCE version described in [[2]](#nce.ref) is the default implementation of the `NCEModule`.
 Sampling per batch-row can be turned on with `NCEModule.rownoise=true`.

+<a name='nce.script'></a>
 ## Training and evaluation scripts

 The experiments presented here use three scripts: two for training and one for evaluation.
@@ -553,6 +563,7 @@ nn.Serial @ nn.Sequential {
 }
 ```

+<a name='nce.result'></a>
 ## Results

 On the 4-layer LSTM with 2048 hidden units, [[1]](#nce.ref) obtain 43.2 perplexity on the GBW test set.
@@ -594,3 +605,4 @@ Here are 8 sentences sampled independently from the 4-layer LSTM with a `tempera
 5. *A Graves, A Mohamed, G Hinton*, [Speech Recognition with Deep Recurrent Neural Networks](http://arxiv.org/pdf/1303.5778.pdf)
 6. *K Greff, RK Srivastava, J Koutník*, [LSTM: A Search Space Odyssey](http://arxiv.org/pdf/1503.04069)
 7. *C Chelba, T Mikolov, M Schuster, Q Ge, T Brants, P Koehn, T Robinson*, [One billion word benchmark for measuring progress in statistical language modeling](http://arxiv.org/pdf/1312.3005)
+8. *A Graves*, [Generating Sequences With Recurrent Neural Networks](http://arxiv.org/pdf/1308.0850v5.pdf)
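The NCE objective that the `NCEModule` hunks above refer to replaces the softmax over the full 800,000-word vocabulary with a binary classification between the target word and `k` samples drawn from a noise distribution. As a rough illustration of the math from [[2]](#nce.ref), here is a minimal Python sketch of the per-example objective; the function name and signature are hypothetical and this is not the Torch `NCEModule` API:

```python
import math

def nce_loss(p_target, pn_target, p_noise_samples, pn_noise_samples, k):
    """Per-example NCE loss for one (history, target) pair with k noise samples.

    p_*  : unnormalized model scores exp(s(w, h)); NCE lets the model
           self-normalize, so these stand in for p_model(w | h).
    pn_* : probabilities of the same words under the noise (e.g. unigram)
           distribution.
    """
    # Posterior that a word came from the data rather than the noise:
    # P(D=1 | w, h) = p_model / (p_model + k * p_noise)
    def p_true(p_model, p_noise):
        return p_model / (p_model + k * p_noise)

    # The target word should be classified as data (D=1) ...
    loss = -math.log(p_true(p_target, pn_target))
    # ... and each of the k noise samples as noise (D=0).
    for pm, pn in zip(p_noise_samples, pn_noise_samples):
        loss -= math.log(1.0 - p_true(pm, pn))
    return loss
```

Only the target and the `k` noise words are scored, so the cost per example is independent of the vocabulary size; as the model's score for the target grows relative to `k * p_noise`, the loss approaches zero.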