author     Soumith Chintala <soumith@gmail.com>    2016-07-26 00:42:44 +0300
committer  GitHub <noreply@github.com>             2016-07-26 00:42:44 +0300
commit     289eebac6c879a08e92da0af15a78e1d3c83c2cb (patch)
tree       2710d53720798590228e2ae8a7eb7949738be5f7
parent     236ede5c88ce1571eb70a7b830b1451e46e16db8 (diff)
parent     3aac64502c9083d7db4c6c0b6c1397eef5d9ba76 (diff)
Merge pull request #56 from nicholas-leonard/nce
Noise Contrastive Estimate
-rw-r--r--  blog/_posts/2016-05-11-nce.md             | 972
-rw-r--r--  blog/_posts/images/LM-Linear.png          | bin 0 -> 10672 bytes
-rw-r--r--  blog/_posts/images/LM-NCE.png             | bin 0 -> 5671 bytes
-rw-r--r--  blog/_posts/images/LM-params.png          | bin 0 -> 8196 bytes
-rw-r--r--  blog/_posts/images/LSTM-NCE-curve.png     | bin 0 -> 38640 bytes
-rw-r--r--  blog/_posts/images/LSTM.png               | bin 0 -> 82623 bytes
-rw-r--r--  blog/_posts/images/rnnlm.png              | bin 0 -> 17199 bytes
-rw-r--r--  blog/_posts/images/small-vs-big-lstm.png  | bin 0 -> 38671 bytes
8 files changed, 972 insertions, 0 deletions
diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md
new file mode 100644
index 0000000..70a6bcb
--- /dev/null
+++ b/blog/_posts/2016-05-11-nce.md
@@ -0,0 +1,972 @@
---
layout: post
title: Language modeling a billion words
comments: True
author: nicholas-leonard
excerpt: Noise contrastive estimation is used to train a multi-GPU recurrent neural network language model on the Google billion words dataset.
picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/rnnlm.png
---

<!---# Language modeling a billion words -->

 * [Word versus character language models](#nce.char)
 * [Recurrent neural network language models](#nce.rnnlm)
 * [Loading the Google billion words dataset](#nce.gbw)
 * [Building a multi-layer LSTM](#nce.lstm)
 * [Training and evaluation scripts](#nce.script)
 * [Results](#nce.result)
 * [Future work](#nce.future)
 * [References](#nce.ref)

In this Torch blog post, we use noise contrastive estimation (NCE) [[2]](#nce.ref)
to train a multi-GPU recurrent neural network language model (RNNLM)
on the Google billion words (GBW) dataset [[7]](#nce.ref).
The work presented here is the result of many months of on-and-off work.
The enormity of the dataset caused us to contribute some novel open-source Torch modules, criteria and even a multi-GPU tensor.
We also provide scripts so that you can train and evaluate your own language models.

If you are only interested in generated samples, perplexity and learning curves, please jump to the [results section](#nce.result).

<a name='nce.char'></a>
## Word versus character language models

In recent months you may have noticed increased interest in generative character-level
RNNLMs like [char-rnn](https://github.com/karpathy/char-rnn)
and the more recent [torch-rnn](https://github.com/jcjohnson/torch-rnn).
These models are very interesting as they can be used to generate sequences of characters like the following:

```lua
<post>
Diablo
<comment score=1>
I liked this game so much!! Hope telling that numbers' benefits and
features never found out at that level is a total breeze
because it's not even a developer/voice opening and rusher runs
the game against so many people having noticeable purchases of selling
the developers built or trying to run the patch to Jagex.
</comment>
```

The above was generated one character at a time using a sample of [reddit](https://www.reddit.com/) comments.
As you can see for yourself, the general structure of the generated text looks good, at first view.
The tags are opened and closed appropriately. The first sentence looks good: `I liked this game so much!!`
and it is related to the subreddit of the post: `Diablo`. But reading the rest of it, we can
start to see the limitations of char-level language models. The spelling of individual words looks great, but
the meaning of the next sentence is difficult to understand (it is also very long).

In this blog post we will show how Torch can be used to train a large-scale word-level language model to generate
independent sentences. Word-level models have an important advantage over char-level models.
Take the following sequence as an example (a quote from Robert A. Heinlein):

```
Progress isn't made by early risers. It's made by lazy men trying to find easier ways to do something.
```

After tokenization, the word-level model might view this sequence as containing 22 tokens.
The char-level model, on the other hand, will view this sequence as containing 102 tokens.
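To make the comparison concrete, here is a quick count (a sketch; the exact word count depends on the tokenizer, which typically also splits punctuation and contractions):

```lua
-- count whitespace-separated tokens versus characters for the quote above
local quote = "Progress isn't made by early risers. It's made by lazy men trying to find easier ways to do something."
local nwords = 0
for _ in quote:gmatch("%S+") do nwords = nwords + 1 end
print(nwords) -- 19 whitespace-separated tokens (~22 after splitting punctuation and contractions)
print(#quote) -- 102 characters
```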
This longer sequence makes the task of the character model harder than that of the word model, as it
must take into account dependencies between more tokens over more time-steps.
Another issue with character language models is that they need to learn spelling in
addition to syntax, semantics, etc.
In any case, word language models will typically have lower error than character models [[8]](#nce.ref).

The main advantage of character over word language models is that they
have a really small vocabulary. For example, the GBW dataset will contain approximately 800 characters
compared to 800,000 words (after pruning low-frequency tokens). In practice this means that character models will
require less memory and have faster inference than their word counterparts.
Another advantage is that they do not require tokenization as a preprocessing step.

<a name='nce.rnnlm'></a>
## Recurrent neural network language models

Our task is to build a language model which maximizes the likelihood of the
next word given the history of previous words in the sentence.
The following figure illustrates the workings of a simple recurrent neural network (Simple RNN) language model:

![rnnlm](images/rnnlm.png)

The exact implementation is as follows:

```lua
h[t] = σ(W[x->h]x[t] + W[h->h]h[t−1] + b[1->h]) (1)
y[t] = softmax(W[h->y]h[t] + b[1->y])           (2)
```

For this particular example, the model should maximize "is" given "what", and then "the" given "is", and so on.
The Simple RNN has an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the likelihood of the remaining words in the sequence.
Internally, the Simple RNN has parameters from input to hidden (word embeddings), hidden to hidden (recurrent connections) and hidden to output (output embeddings that feed into a softmax).
The input to hidden parameters consist of a `LookupTable` that learns to represent each word as a vector.
These vectors form an embedding space for words.
The input `x[t]` to the `LookupTable` is a unique integer associated with the word `w[t]`.
The embedding vector for that word is obtained by indexing the embedding space `W[x->h]`, which we represent by `W[x->h]x[t]`.
The hidden to hidden parameters model the temporal dependencies of words by generating a hidden state `h[t]` given `h[t-1]` and `x[t]`.
This is where the actual recurrence takes place, as `h[t]` is a function of `h[t-1]` (and word `x[t]`).
The hidden to output layer does an affine transform (i.e. a `Linear` module: `W[h->y]h[t] + b[1->y]`) followed by a `softmax`.
This estimates a probability distribution `y[t]` over the next word given the previous words, which are embodied by the hidden state `h[t]`.
The criterion is to maximize the likelihood of the next word `w[t+1]` given previous words:
`P(w[t+1]|w[1],w[2],...,w[t])`.

Simple RNNs are easy to build using the [rnn](https://github.com/Element-Research/rnn) package (see the [simple RNN example](https://github.com/Element-Research/rnn/blob/master/examples/simple-recurrence-network.lua)),
but they are not the only kind of model that can be used to model language.
There are also the more advanced Long Short Term Memory (LSTM) models [[3],[4],[5]](#nce.ref), which
have special gated cells that facilitate the backpropagation of gradients through longer sequences.

![lstm](images/LSTM.png)

The exact implementation is as follows:

```lua
i[t] = σ(W[x->i]x[t] + W[h->i]h[t−1] + b[1->i])    (3)
f[t] = σ(W[x->f]x[t] + W[h->f]h[t−1] + b[1->f])    (4)
z[t] = tanh(W[x->c]x[t] + W[h->c]h[t−1] + b[1->c]) (5)
c[t] = f[t]c[t−1] + i[t]z[t]                       (6)
o[t] = σ(W[x->o]x[t] + W[h->o]h[t−1] + b[1->o])    (7)
h[t] = o[t]tanh(c[t])                              (8)
```

The main advantage is that LSTMs can learn dependencies between words separated by many more time-steps.
They are not as prone to vanishing gradients, as the different gates can preserve the gradients during back-propagation.
To create a LM, the word embeddings (`W[x->h]x[t]` in eq. 1) would be fed to the LSTM, and the resulting hidden state would be fed to eq. 2.

The error of a language model is traditionally measured using perplexity.
Perplexity is a measure of how surprised the model is to see a sequence of text.
If you feed it a sequence of words, and for each successive word the model is able to
predict with high likelihood what word comes next, it will have low perplexity.
If the next word in the sequence `s` of length `T` is indexed by `s[t]` and the model-inferred likelihood is `y[t]`, such that
the likelihood of that word is `y[t][s[t]]`, then the perplexity of that sequence of words is:

```
                  log(y[1][s[1]]) + log(y[2][s[2]]) + ... + log(y[T][s[T]])
PPL(s,y) = exp( - --------------------------------------------------------- )
                                             T
```

The lower the perplexity, the better.
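To make the definition concrete, here is a minimal sketch of that computation (the names are illustrative: `y` is a `T x vocabsize` tensor of next-word likelihoods, `s` a `LongTensor` of target word indices):

```lua
require 'torch'

-- PPL(s,y) = exp(-(1/T) * sum over t of log(y[t][s[t]]))
local function perplexity(y, s)
   local sumlog = 0
   for t = 1, s:size(1) do
      sumlog = sumlog + math.log(y[t][s[t]])
   end
   return math.exp(-sumlog / s:size(1))
end

-- a model that is uniform over 10 words is maximally surprised: PPL = 10
print(perplexity(torch.Tensor(5, 10):fill(0.1), torch.LongTensor{1,2,3,4,5}))
```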
<a name='nce.gbw'></a>
## Loading the Google billion words dataset

For our word-level language model we use the GBW dataset.
The dataset differs from the Penn Tree Bank in that sentences are
kept independent of each other. Our dataset thus consists of a set of
independent variable-length sequences. The dataset can be easily loaded using
the [dataload](https://github.com/Element-Research/dataload) package:

```lua
local dl = require 'dataload'
local train, valid, test = dl.loadGBW(batchsize)
```

The above will automatically download the data if not found on disk and
return the training, validation and test sets.
These are [dl.MultiSequence](https://github.com/Element-Research/dataload#dl.MultiSequence) instances
which have the following constructor:

```lua
dataloader = dl.MultiSequence(sequences, batchsize)
```

The `sequences` argument is a Lua table or [tds.Vector](https://github.com/torch/tds#d--tdsvec--tbl)
where each element is a Tensor containing an independent sequence. For example:

```lua
sequences = {
   torch.LongTensor{424,158,115,667,28,505,228},
   torch.LongTensor{389,456,188},
   torch.LongTensor{77,172,760,687,552,529}
}
batchsize = 2
dataloader = dl.MultiSequence(sequences, batchsize)
```

Note how the sequences vary in length.
Like all [dl.DataLoader](https://github.com/Element-Research/dataload#dl.DataLoader) sub-classes, the
`dl.MultiSequence` loader provides a method for sub-sampling a batch of `inputs` and `targets` from the dataset:

```lua
local inputs, targets = dataloader:sub(1, 10)
```

The `sub` method takes the `start` and `end` indices of sub-sequences to index.
Internally, these indices are only used to determine the length (`seqlen`) of the requested multi-sequences.
Each successive call to `sub` will return multi-sequences contiguous to the previous ones.

The returned `inputs` and `targets` are `seqlen x batchsize [x inputsize]`
tensors containing a batch of 2 multi-sequences, each containing 8 time-steps.
Starting with the `inputs`:

```lua
print(inputs)
   0    0
 424   77
 158  172
 115  760
 667  687
  28  552
 505    0
   0  424
[torch.DoubleTensor of size 8x2]
```

Each column is a vector containing potentially multiple sequences, i.e. a multi-sequence.
Independent sequences are separated by zeros. In the next section, we will see how the
[rnn](https://github.com/Element-Research/rnn) package can use these zero-masked time-steps to
efficiently forget its hidden state between independent sequences (at the granularity of columns).
For now, notice how the original `sequences` are contained in the returned `inputs` and separated by zeros.

The `targets` are similar to the `inputs`, but use masks of 1 to separate sequences (as `ClassNLLCriterion` will otherwise complain).
As is typical in language models, the task is to predict the next word, such that the `targets` are delayed by one time-step
with respect to the commensurate `inputs`:

```lua
print(targets)
   1    1
 158  172
 115  760
 667  687
  28  552
 505  529
 228    1
   1  158
[torch.DoubleTensor of size 8x2]
```

The `train`, `valid` and `test` sets returned by the call to `dl.loadGBW` have the same properties as the above,
except that the dataset is much bigger (it has one billion words). For debugging and such, we can choose to
load a smaller subset of the training set. This will load much faster than the default training set file:

```lua
local train, valid, test = dl.loadGBW({2,2,2}, 'train_tiny.th7')
```

The above will use a `batchsize` of 2 for all sets.
Iteration through the dataloader is made easier using the [subiter](https://github.com/Element-Research/dataload#iterator-subiterbatchsize-epochsize-) method:

```lua
local seqlen, epochsize = 3, 10
for i, inputs, targets in train:subiter(seqlen, epochsize) do
   print("T = " .. i)
   print(inputs)
end
```

which will output:

```lua
T = 3
      0       0
 793470  793470
 211427    6697
[torch.DoubleTensor of size 3x2]

T = 6
 477149  400396
 720601  213235
 660496  368322
[torch.DoubleTensor of size 3x2]

T = 9
 676607   61007
 161927  767587
 248714  635004
[torch.DoubleTensor of size 3x2]

T = 10
 280570  130510
[torch.DoubleTensor of size 1x2]
```

We could also return the above batches as one big chunk instead:

```lua
train:reset() -- resets the internal sequence iterator
print(train:sub(1,10))
      0       0
 793470  793470
 211427    6697
 477149  400396
 720601  213235
 660496  368322
 676607   61007
 161927  767587
 248714  635004
 280570  130510
[torch.DoubleTensor of size 10x2]
```

Notice how the above small batches are aligned with this big chunk, which
means that the data is iterated in sequence.

Each sentence in the GBW dataset is encapsulated by `<S>` and `</S>` tokens to indicate the
start and end of the sequence, respectively. Each token is mapped to an integer. So for example,
you can see that `<S>` is mapped to integer `793470` in the above example.
Now that we feel confident in our dataset, let's look at the model.
<a name='nce.lstm'></a>
## Building a multi-layer LSTM

In this section, we get down to the business of actually building our multi-layer LSTM.
Starting from the input layer, we will introduce NCE once we get to the output layer.

The input layer of the `lm` model is a lookup table:

```lua
lm = nn.Sequential()

-- input layer (i.e. word embedding space)
local lookup = nn.LookupTableMaskZero(#trainset.ivocab, opt.inputsize)
lm:add(lookup) -- input is seqlen x batchsize
```

We use [LookupTableMaskZero](https://github.com/Element-Research/rnn#rnn.LookupTableMaskZero), a sub-class of `LookupTable`,
to learn word embeddings. The main difference is that it supports zero-indexes, which are forwarded as zero-tensors.
Then we have the actual multi-layer LSTM implementation, which uses the [SeqLSTM](https://github.com/Element-Research/rnn#rnn.SeqLSTM) module:

```lua
local inputsize = opt.inputsize
for i,hiddensize in ipairs(opt.hiddensize) do
   local rnn = nn.SeqLSTM(inputsize, hiddensize)
   rnn.maskzero = true
   lm:add(rnn)
   if opt.dropout > 0 then
      lm:add(nn.Dropout(opt.dropout))
   end
   inputsize = hiddensize
end
```

As demonstrated in the [rnn-benchmarks](https://github.com/glample/rnn-benchmarks#lstm) repository, the `SeqLSTM` implementation is very fast.
Next we split the output of the SeqLSTM (which is a `seqlen x batchsize x outputsize` Tensor) into a table containing a `batchsize x outputsize` tensor for
each time-step:

```lua
lm:add(nn.SplitTable(1))
```

### The problem: bottleneck at the output layer

With its small vocabulary of 10000 words, the Penn Tree Bank dataset is relatively easy to use to build word-level language models.
The output layer is still computationally tractable for both training and inference, especially for GPUs.
For these smaller vocabularies, the output layer is basically a `Linear` followed by a `SoftMax`:

```lua
outputlayer = nn.Sequential()
   :add(nn.Linear(hiddensize, vocabsize))
   :add(nn.SoftMax())
```

However, when training with large vocabularies, like the 793471 words that make up the GBW dataset,
the output layer quickly becomes a bottleneck.
For example, if you are training your model with a `batchsize = 128` (number of sequences per batch) and a `seqlen = 50`
(size of sequence to backpropagate through time),
the output of that layer will have shape `seqlen x batchsize x vocabsize`, or `50 x 128 x 793471`.
For a `FloatTensor` or `CudaTensor`, that single tensor will take up 20GB of memory!
The number doubles for the `gradInput` (i.e. gradients with respect to input),
and doubles again as both `Linear` and `SoftMax` store a copy for the `output`.
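A quick back-of-envelope check of that figure:

```lua
-- memory of a single seqlen x batchsize x vocabsize FloatTensor
local seqlen, batchsize, vocabsize = 50, 128, 793471
print(seqlen * batchsize * vocabsize * 4 / 1e9) -- 4 bytes per float: ~20.3GB
```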
![Scale of output layer buffers with Linear](images/LM-Linear.png)

Excluding parameters and their gradients, the above figure outlines the approximate memory consumption of a 4-layer LSTM with 2048 units with a `seqlen=50`.
Even if you could somehow find a way to put 80GB on a GPU (or distribute it over many), you would still run into the problem of
forward/backward propagating through that `outputlayer` in a reasonable time-frame.

<a name='nce.nce'></a>
### The solution: noise contrastive estimation

The output layer of the LM uses NCE to speed up training and reduce memory consumption:

```lua
local unigram = trainset.wordfreq:float()
local ncemodule = nn.NCEModule(inputsize, #trainset.ivocab, opt.k, unigram, opt.Z)

-- NCE requires {input, target} as inputs
lm = nn.Sequential()
   :add(nn.ParallelTable()
      :add(lm):add(nn.Identity()))
   :add(nn.ZipTable()) -- {{x1,x2,...}, {t1,t2,...}} -> {{x1,t1},{x2,t2},...}

-- encapsulate stepmodule into a Sequencer
lm:add(nn.Sequencer(nn.MaskZero(ncemodule, 1)))
```

The [NCEModule](https://github.com/Element-Research/dpnn#nn.NCEModule) is a more efficient version of:

```lua
nn.Sequential():add(nn.Linear(inputsize, #trainset.ivocab)):add(nn.LogSoftMax())
```

For evaluating perplexity, the model still implements `Linear` + `SoftMax`.
NCE is useful for reducing the memory consumption during training (compare to the figure above):

![Scale of output layer buffers with NCE](images/LM-NCE.png)

Along with the [NCECriterion](https://github.com/Element-Research/dpnn#nn.NCECriterion),
the `NCEModule` implements the algorithm described in [[1]](#nce.ref).
I won't go into the details of the algorithm as it involves a lot of math which is more appropriately detailed in the reference papers.
The way it works is that for each target word (the likelihood of which we want to maximize),
`k` words are sampled from a noise distribution, which is typically the unigram distribution.

Remember that a softmax is basically:

```lua
                exp(x[i])
y[i] = --------------------------------- (9)
       exp(x[1])+exp(x[2])+...+exp(x[n])
```

where `x[i]` is the `i`-th output of the output `Linear` layer.
The above denominator is the cause of the bottleneck, as the `Linear` needs to be computed for each output `x[i]`.
For an `n=793471` vocabulary, this is prohibitively expensive.
NCE gets around this problem by replacing the denominator of eq. 9 with a constant `Z` during training:

```lua
       exp(x[i])
y[i] = --------- (10)
           Z
```

Now this is not what actually happens during training, as back-propagating through the above will not produce gradients
for the `x[j]` where `j~=i` (`j` not equal to `i`).
Notice that backpropagating through eq. 9 will produce gradients for all outputs `x` of the `Linear` (i.e. for all `i`).
Another problem with eq. 10 is that nothing is pushing `exp(x[1])+exp(x[2])+...+exp(x[n])` to approximate `Z`.
What NCE does is formulate the problem such that `k` noise samples can be included in the equation to
both make sure that some (at most `k`) negative samples (i.e. `x[j]` where `j~=i`) get gradients and that the denominator of eq. 9 approximates the denominator of eq. 10.
The `k` noise samples are sampled from a noise distribution, i.e. the unigram distribution.
The output layer `Linear` need only be computed for the target and noise-sampled words, which is where the efficiency is gained.

The `unigram` variable above is a tensor of size 793470 where each element is the frequency of the commensurate word in the corpus.
Sampling from such a large distribution using something like [torch.multinomial](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.multinomial)
can become a bottleneck during training.
So we implemented a more efficient version in [torch.AliasMultinomial](https://github.com/nicholas-leonard/torchx/blob/master/AliasMultinomial.lua).
The latter multinomial sampler requires more setup time than the former, but this isn't a problem as the unigram distribution is constant.
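For intuition, here is a minimal Lua sketch of the alias method idea (an illustration of the technique, not the actual `torch.AliasMultinomial` code): setup is O(n), after which every sample costs O(1), regardless of vocabulary size:

```lua
require 'torch'

-- setup: scale probabilities so they average to 1, then pair each
-- below-average slot with an above-average "alias" that fills the slack
local function aliasSetup(probs)
   local n = probs:size(1)
   local prob = probs:clone():mul(n / probs:sum())
   local alias = torch.LongTensor(n):fill(1)
   local small, large = {}, {}
   for i = 1, n do
      table.insert(prob[i] < 1 and small or large, i)
   end
   while #small > 0 and #large > 0 do
      local s, l = table.remove(small), table.remove(large)
      alias[s] = l
      prob[l] = prob[l] - (1 - prob[s])
      table.insert(prob[l] < 1 and small or large, l)
   end
   -- leftovers are exactly 1 up to floating point error
   for _, i in ipairs(small) do prob[i] = 1 end
   for _, i in ipairs(large) do prob[i] = 1 end
   return prob, alias
end

-- draw: pick a slot uniformly, then keep it or take its alias
local function aliasDraw(prob, alias)
   local i = torch.random(1, prob:size(1))
   return torch.uniform() < prob[i] and i or alias[i]
end

-- sanity check against the input distribution
local prob, alias = aliasSetup(torch.Tensor{5, 2, 1, 1, 1})
local counts = torch.zeros(5)
for _ = 1, 100000 do
   local i = aliasDraw(prob, alias)
   counts[i] = counts[i] + 1
end
print(counts:div(counts:sum())) -- approx. 0.5, 0.2, 0.1, 0.1, 0.1
```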
NCE uses the noise samples to approximate a normalization term `Z`, where the output distribution is `exp(x[i])/Z` and `x[i]` is the output of the `Linear` for word `i`.
For the Softmax, which NCE tries to approximate, `Z` is the sum of `exp(x[i'])` over all words `i'`.
For NCE, `Z` is typically fixed to `Z=1`.
Our initial experiments found that setting `Z` to `Z=N*mean(exp(x[i]))`
(where `N` is the number of words and the `mean` is approximated over a small batch of word samples `i`)
gave much better results, but this was because we weren't appropriately initializing the output layer parameters.

One notable aspect of NCE papers (there are many) is that they often forget to mention the importance of this parameter initialization.
Setting `Z=1` is only really possible if the `NCEModule.bias` is initialized to `bias[i] = -log(N)`.
This is what the authors of [[2]](#nce.ref) use, although it isn't mentioned in the paper (I contacted one of the authors to find out).

Sampling `k` noise samples per time-step and per batch-row means that the `NCEModule` needs to internally use something like
[torch.baddbmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.baddbmm) to compute the `output`.
The authors of [[2]](#nce.ref) implement a faster version where the noise samples are drawn once and used for the entire batch (but still once for each time-step).
This makes the code a bit faster, as the more efficient [torch.addmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.addmm) can be used instead of `torch.baddbmm`.
This faster NCE version described in [[2]](#nce.ref) is the default implementation of the `NCEModule`. Sampling per batch-row can be turned on with `NCEModule.rownoise=true`.

<a name='nce.script'></a>
## Training and evaluation scripts

The experiments presented here use three scripts: two for training (you only need to use one) and one for evaluation.
The training scripts only differ in the number of GPUs to use.
Both train a language model on the training set and do early-stopping on the validation set.
The evaluation script is used to measure the perplexity of a trained model on the test set, or to generate sentences.

### Single-GPU training script

We provide a single-GPU training script via [noise-contrastive-estimate.lua](https://github.com/Element-Research/rnn/blob/master/examples/noise-contrastive-estimate.lua).
Running the following on a 12GB NVIDIA Titan X should result in a test set perplexity of 65.6 after 321 epochs:

```bash
th examples/noise-contrastive-estimate.lua --cuda --device 2 --startlr 1 --saturate 300 --cutoff 10 --progress --uniform 0.1 --seqlen 50 --batchsize 128 --trainsize 400000 --validsize 40000 --hiddensize '{250,250}' --k 400 --minlr 0.001 --momentum 0.9
```

The resulting model will look like this:

```lua
nn.Serial @ nn.Sequential {
  [input -> (1) -> (2) -> (3) -> output]
  (1): nn.ParallelTable {
    input
      |`-> (1): nn.Sequential {
      |      [input -> (1) -> (2) -> (3) -> (4) -> output]
      |      (1): nn.LookupTableMaskZero
      |      (2): nn.SeqLSTM
      |      (3): nn.SeqLSTM
      |      (4): nn.SplitTable
      |    }
      |`-> (2): nn.Identity
       ... -> output
  }
  (2): nn.ZipTable
  (3): nn.Sequencer @ nn.Recursor @ nn.MaskZero @ nn.NCEModule(250 -> 793471)
}
```

To use about one third less memory, you can set the momentum to 0.
<a name='nce.eval'></a>
### Evaluation script

The evaluation script can be used to measure perplexity on the test set or to sample independent sentences.
To evaluate a saved model, you can use the [evaluate-rnnlm.lua](https://github.com/Element-Research/rnn/blob/master/scripts/evaluate-rnnlm.lua) script:

```bash
th scripts/evaluate-rnnlm.lua --xplogpath /home/nicholas14/save/rnnlm/gbw:uranus:1466538423:1.t7 --cuda
```

where you should replace `/home/nicholas14/save/rnnlm/gbw:uranus:1466538423:1.t7` with the path to your own trained model.
Evaluating on the test set can take a while, as it must use the less efficient `Linear` + `SoftMax`, and thus a very small batch size (so as not to use too much memory).

The evaluation script can also be used to generate samples from the language model:

```bash
th scripts/evaluate-rnnlm.lua --xplogpath /home/nicholas14/save/rnnlm/gbw:uranus:1466790001:1.t7 --cuda --nsample 200 --temperature 0.7
```

The `--nsample` flag specifies how many tokens to sample. The first token input to the language model is the start-of-sentence tag (`<S>`).
When the end-of-sentence tag (`</S>`) is sampled, the model's hidden states are set to zero, such that each sentence is sampled independently.
The `--temperature` flag can be reduced to make the sampling more deterministic.

```xml
<S> There were a number of players in the starting lineup during the season and in recent weeks , in recent years , some fans have been frustrated . </S>
<S> WASHINGTON ( Reuters ) - The government plans to cut greenhouse gases by as much as 12 % on the global economy , a new report said . </S>
<S> One of the most important things about the day was that the two companies had just been guilty of the same nature . </S>
<S> " It has been as much a bit of a public service as a public organisation . </S>
<S> In a nutshell , it 's not only the fate of the economy . </S>
<S> It was last modified at 23.31 GMT on Saturday 22 December 2009 . </S>
<S> He told the newspaper the prosecution had been treating the small boy as " a young man who was playing for a while . </S>
<S> " We are astounded that our employees are not made aware of the risks and risks they are pursuing during this period of time , " he said . </S>
<S> " I had a right to come up with the idea . </S>
```

### Multi-GPU training script

As can be observed in the previous section, training a 2-layer LSTM with only 250 hidden units will not yield the best
generated samples. The model needs much more capacity than what can fit on a 12GB GPU.
For parameters and their gradients, a 4x2048 LSTM model requires the following:

![LM parameter memory consumption](images/LM-params.png)

This doesn't include all the intermediate buffers required for the different modules (outlined in the [NCE section](#nce.nce)).
The solution was of course to distribute the model over more GPUs.
The [multigpu-nce-rnnlm.lua](https://github.com/Element-Research/rnn/blob/master/examples/multigpu-nce-rnnlm.lua) script is thus provided to train a language model on four GPUs.

It uses the [GPU](https://github.com/torch/nn/blob/master/doc/simple.md#nn.GPU) module (which we contributed to [nn](https://github.com/torch/nn)) to decorate modules such that
all their operations and memory are hosted on a specified device.
The `GPU` module won't parallelize kernel execution over different GPU-devices.
But it does allow us to distribute large models over devices.
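For example, a two-layer perceptron could be split over two devices along these lines (a minimal sketch; the layer sizes are arbitrary):

```lua
require 'cunn'

-- each Linear is hosted on its own device; nn.GPU takes care of moving
-- the activations and gradients between devices during forward/backward
local mlp = nn.Sequential()
mlp:add(nn.GPU(nn.Linear(10000, 5000), 1):cuda()) -- parameters and buffers on device 1
mlp:add(nn.GPU(nn.Linear(5000, 10000), 2):cuda()) -- parameters and buffers on device 2
```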
For our LM, the input word embeddings (i.e. `LookupTableMaskZero`) and output layer (i.e. `NCEModule`) take up most of the memory.
The first was pretty easy to distribute:

```lua
lm = nn.Sequential()
lm:add(nn.Convert())

-- input layer (i.e. word embedding space)
local concat = nn.Concat(3)
for device=1,2 do
   local inputsize = device == 1 and torch.floor(opt.inputsize/2) or torch.ceil(opt.inputsize/2)
   local lookup = nn.LookupTableMaskZero(#trainset.ivocab, inputsize)
   lookup.maxnormout = -1 -- prevent weird maxnormout behaviour
   concat:add(nn.GPU(lookup, device):cuda()) -- input is seqlen x batchsize
end
```

Basically, the embedding space is split into two tables.
For a 2048 unit embedding space, half, i.e. 1024 units, are located on each of two devices.
We use [Concat](https://github.com/torch/nn/blob/master/doc/containers.md#nn.Concat) to concatenate them back together after a `forward`.

For the hidden layers (i.e. `SeqLSTM`), we just distribute them on the devices used by the input layer.
The hidden layers use up little memory (approximately 1GB each), so they aren't the problem.
We locate them on the same devices as the input layer because the output layer requires more memory (for buffers) on its own devices.

```lua
local inputsize = opt.inputsize
for i,hiddensize in ipairs(opt.hiddensize) do
   local rnn = nn.SeqLSTM(inputsize, hiddensize)
   rnn.maskzero = true
   local device = i <= #opt.hiddensize/2 and 1 or 2
   lm:add(nn.GPU(rnn, device):cuda())
   if opt.dropout > 0 then
      lm:add(nn.GPU(nn.Dropout(opt.dropout), device):cuda())
   end
   inputsize = hiddensize
end

lm:add(nn.GPU(nn.SplitTable(1), 3):cuda())
```

The `NCEModule` was a bit more difficult to distribute, as it cannot be parallelized as easily as the `LookupTableMaskZero`.
Our solution was to provide a simple [multicuda()](https://github.com/Element-Research/dpnn/blob/26edf00f7f22edd1e090619bb10528557cede4df/NCEModule.lua#L419-L439)
method to distribute the `weight` and `gradWeight` on different devices.
This is accomplished by swapping the weight tensors for our own: [torch.MultiCudaTensor](https://github.com/nicholas-leonard/torchx/blob/master/MultiCudaTensor.lua).
Lua has no strict type-checking system, so you can fake a tensor by creating a `torch.class` table with the same methods.
To save time, the current version of `MultiCudaTensor` only supports the operations required by the `NCEModule`.
The advantage of this approach is that it requires minimal changes to the `NCEModule` and maintains backward compatibility without requiring redundant code or excessive refactoring.
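To illustrate the duck-typing idea, here is a toy sketch (made-up class and method names, shown with CPU tensors; the real `MultiCudaTensor` keeps each half on its own GPU and switches devices per operation):

```lua
require 'torch'

-- a stand-in for a 6x4 weight tensor whose rows live in two separate halves
local MultiTensor = torch.class('MultiTensor')

function MultiTensor:__init(half1, half2)
   self.halves = {half1, half2}
end

-- implement only the operations the consumer actually calls, e.g. row lookup
function MultiTensor:indexRows(indices)
   local n = self.halves[1]:size(1)
   local out = torch.Tensor(indices:size(1), self.halves[1]:size(2))
   for i = 1, indices:size(1) do
      local j = indices[i]
      out[i]:copy(j <= n and self.halves[1][j] or self.halves[2][j - n])
   end
   return out
end

local w = MultiTensor(torch.randn(3, 4), torch.randn(3, 4))
print(w:indexRows(torch.LongTensor{1, 5})) -- rows 1 and 5 of the virtual 6x4 weight
```

Back to the actual model, the distributed output layer is assembled as follows: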
```lua
-- output layer
local unigram = trainset.wordfreq:float()
ncemodule = nn.NCEModule(inputsize, #trainset.ivocab, opt.k, unigram, opt.Z)
ncemodule:reset() -- initializes bias to get approx. Z = 1
ncemodule.batchnoise = not opt.rownoise
-- distribute weight, gradWeight and momentum on devices 3 and 4
ncemodule:multicuda(3,4)

-- NCE requires {input, target} as inputs
lm = nn.Sequential()
   :add(nn.ParallelTable()
      :add(lm):add(nn.Identity()))
   :add(nn.ZipTable()) -- {{x1,x2,...}, {t1,t2,...}} -> {{x1,t1},{x2,t2},...}

-- encapsulate stepmodule into a Sequencer
local masked = nn.MaskZero(ncemodule, 1):cuda()
lm:add(nn.GPU(nn.Sequencer(masked), 3, opt.device):cuda())
```

To reproduce the results in [[2]](#nce.ref), run the following:

```bash
th examples/multigpu-nce-rnnlm.lua --startlr 0.7 --saturate 300 --minlr 0.001 --cutoff 10 --progress --uniform 0.1 --seqlen 50 --batchsize 128 --trainsize 400000 --validsize 40000 --hiddensize '{2048,2048,2048,2048}' --dropout 0.2 --k 400 --Z 1 --momentum -1
```

Notable differences from the paper are the following:
 * we use [gradient norm clipping](https://github.com/Element-Research/dpnn#nn.Module.gradParamClip) [[3]](#nce.ref) (with a `cutoff` norm of 10) to counter exploding and vanishing gradients;
 * they use an adaptive learning rate schedule (which isn't specified in the paper). We linearly decay from a learning rate of 0.7 (which they also start from) such that it reaches 0.001 after 300 epochs;
 * we use `k=400` samples whereas they use `k=100`. Why? I didn't see a major drop in speed, so why not?
 * we use a sequence length of `seqlen=50` for truncated BPTT. They use 100 (again, not in the paper). The average length of sentences in the dataset is 27, so 50 is more than enough.

Like them, we use a `dropout=0.2` between LSTM layers.
This is what the resulting model looks like:

```lua
nn.Serial @ nn.Sequential {
  [input -> (1) -> (2) -> (3) -> output]
  (1): nn.ParallelTable {
    input
      |`-> (1): nn.Sequential {
      |      [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> output]
      |      (1): nn.Convert
      |      (2): nn.GPU(2) @ nn.Concat {
      |        input
      |          |`-> (1): nn.GPU(1) @ nn.LookupTableMaskZero
      |          |`-> (2): nn.GPU(2) @ nn.LookupTableMaskZero
      |           ... -> output
      |      }
      |      (3): nn.GPU(2) @ nn.Dropout(0.2, busy)
      |      (4): nn.GPU(1) @ nn.SeqLSTM
      |      (5): nn.GPU(1) @ nn.Dropout(0.2, busy)
      |      (6): nn.GPU(1) @ nn.SeqLSTM
      |      (7): nn.GPU(1) @ nn.Dropout(0.2, busy)
      |      (8): nn.GPU(2) @ nn.SeqLSTM
      |      (9): nn.GPU(2) @ nn.Dropout(0.2, busy)
      |      (10): nn.GPU(2) @ nn.SeqLSTM
      |      (11): nn.GPU(2) @ nn.Dropout(0.2, busy)
      |      (12): nn.GPU(3) @ nn.SplitTable
      |    }
      |`-> (2): nn.Identity
       ... -> output
  }
  (2): nn.ZipTable
  (3): nn.GPU(3) @ nn.Sequencer @ nn.Recursor @ nn.MaskZero @ nn.NCEModule(2048 -> 793471)
}
```

<a name='nce.result'></a>
## Results

On the 4-layer LSTM with 2048 hidden units, [[1]](#nce.ref) obtain 43.2 perplexity on the GBW test set.
After early-stopping on a sub-set of the validation set (at 100 epochs of training, where 1 epoch is 128 sequences x 400k words/sequence), our model was able to reach *40.61* perplexity.

This model was run on 4x12GB NVIDIA Titan X GPUs.
Training requires approximately 40GB of memory distributed across the 4 GPU devices, and 2-3 weeks of training.
As in the original paper, we do not make use of momentum, as it provides little benefit and requires about 50% more memory.

Training runs at about 3800 words/second.
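Those throughput and training-time numbers are consistent with each other; a rough sanity check:

```lua
-- words per epoch divided by words per second, over the 100 epochs
local wordsPerEpoch = 128 * 400000         -- 128 sequence-columns x 400k time-steps
local seconds = 100 * wordsPerEpoch / 3800
print(seconds / 86400)                     -- ~15.6 days, i.e. the quoted 2-3 weeks
```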
### Learning curves

The following figure outlines the learning curves for the above 4x2048 LSTM model.
The figure plots the NCE training and validation error for the model, which is the error output by the `NCEModule`.
Test set error isn't plotted, as doing so for any epoch requires about 3 hours because test set inference uses `Linear` + `SoftMax` with `batchsize=1`.

![LSTM NCE Learning curves](images/LSTM-NCE-curve.png)

As you can see, most of the learning is done in the first epochs.
Nevertheless, the training and validation error are consistently reduced as training progresses.

The following figure compares the validation learning curves (again, NCE error) for a small 2x250 LSTM (no dropout) and a big 4x2048 LSTM (with dropout).

![Small vs Big LSTM](images/small-vs-big-lstm.png)

What I find impressive about this figure is how quickly the higher-capacity model bests the lower-capacity model.
This clearly demonstrates the importance of capacity when optimizing large-scale language models.

### Generating sentences

Here are some sentences sampled independently from the 4-layer LSTM with a `temperature` of 0.7:

```xml
<S> The first , for a lot of reasons , is the " Asian Glory " : an American military outpost in the middle of an Iranian desert . </S>
<S> But the first new stage of the project will be a new <UNK> tunnel linking the new terminal with the new terminal at the airport . </S>
<S> The White House said Bush would also sign a memorandum of understanding with Iraq , which will allow the Americans to take part in the poll . </S>
<S> The folks who have campaigned for his nomination know that he is in a fight for survival . </S>
<S> The three survivors , including a woman whose name was withheld and not authorized to speak , were buried Saturday in a makeshift cemetery in the town and seven people were killed in the town of Eldoret , which lies around a dozen miles ( 40 kilometers ) southwest of Kathmandu . </S>
<S> The art of the garden was created by pouring water over a small brick wall and revealing that an older , more polished design was leading to the creation of a new house in the district . </S>
<S> She added : " The club has not made any concession to the club 's fans and was not notified of the fact they had reached an agreement with the club . </S>
<S> The Times has learnt that the former officer who fired the fatal shots must have known about the fatal carnage . </S>
<S> Obama supporters say they 're worried about the impact of the healthcare and energy policies of Congress . </S>
<S> Not to mention the painful changes to the way that women are treated in the workplace . </S>
<S> The dollar stood at 14.38 yen ( <UNK> ) and <UNK> Swiss francs ( <UNK> ) . </S>
<S> The current , the more intractable <UNK> , the <UNK> and the <UNK> about a lot of priorities . </S>
<S> The job , which could possibly be completed in 2011 , needs to be approved in a new compact between the two companies . </S>
<S> " The most important thing for me is to get back to the top , " he said . </S>
<S> It was a one-year ban and the right to a penalty . </S>
<S> The government of president Michelle Bachelet has promised to maintain a " strong and systematic " military presence in key areas and to tackle any issue of violence , including kidnappings . </S>
<S> The six were scheduled to return to Washington on Wednesday . </S>
<S> " It 's a ... mistake , " he said . </S>
<S> The government 's offensive against the rebels and insurgents has been criticized by the United Nations and UN agencies . </S>
<S> " Our <UNK> model is not much different from many of its competitors , " said Richard Bangs , CEO of the National Center for Science in the Public Interest in Chicago . </S>
<S> He is now a large part of a group of young people who are spending less time studying and work in the city . </S>
<S> He said he was confident that while he and his wife would have been comfortable working with him , he would be able to get them to do so . </S>
<S> The summer 's financial meltdown is the worst in decades . </S>
<S> It was a good night for Stuart Broad , who took the ball to Ravi Bopara at short leg to leave England on 88 for five at lunch . </S>
<S> And even for those who worked for them , almost everything was at risk . </S>
<S> The new strategy is all part of a stepped-up war against Taliban and al-Qaida militants in northwest Pakistan . </S>
<S> The governor 's office says the proposal is based on a vision of an outsider in the town who wants to preserve the state 's image . </S>
<S> " The fact that there is no evidence to support the claim made by the government is entirely convincing and that Dr Mohamed will have to be detained for a further two years , " he said . </S>
<S> The country 's tiny nuclear power plants were the first to use nuclear technology , and the first such reactors in the world . </S>
<S> " What is also important about this is that we can go back to the way we worked and work and fight , " he says . </S>
<S> And while he has been the star of " The Wire " and " The Office , " Mr. Murphy has been a careful , intelligent , engaging competitor for years . </S>
<S> On our return to the water , we found a large abandoned house . </S>
<S> The national average for a gallon of regular gas was $ 5.99 for the week ending Jan . </S>
<S> The vote was a rare early start for the contest , which was held after a partial recount in 26 percent of the vote . </S>
<S> The first one was a show of force by a few , but the second was an attempt to show that the country was serious about peace . </S>
<S> It was a little more than half an hour after the first reports of a shooting . </S>
<S> The central bank is expected to cut interest rates further by purchasing more than $ 100 billion of commercial paper and Treasuries this week . </S>
<S> Easy , it 's said , to have a child with autism . </S>
<S> He said : " I am very disappointed with the outcome because the board has not committed itself . </S>
<S> " There is a great deal of tension between us , " said Mr C. </S>
<S> The odds that the Fed will keep its benchmark interest rate unchanged are at least half as much as they were at the end of 2008 . </S>
<S> For them , investors have come to see that : a ) the government will maintain a stake in banks and ( 2 ) the threat of financial regulation and supervision ; and ( 3 ) it will not be able to raise enough capital from the private sector to support the economy . </S>
<S> The court heard he had been drinking and drank alcohol at the time of the attack . </S>
<S> " The whole thing is quite a bit more intense . </S>
<S> This is a very important project and one that we are working closely with . </S>
<S> " We are confident that in this economy and in the current economy , we will continue to grow , " said John Lipsky , who chaired the IMF 's board of governors for several weeks . </S>
<S> The researchers said they found no differences among how men drank and whether they were obese . </S>
<S> Even though there are many brands that have low voice and no connection to the Internet , the iPhone is a great deal for consumers . </S>
<S> The £ 7m project is a new project for the city of Milton Keynes and aims to launch a new challenge for the British Government . </S>
<S> But he was not without sympathy for his father . </S>
```

The syntax seems quite reasonable, especially when comparing it to the previous results obtained from the [single-GPU 2x250 LSTM](#nce.eval).
However, in some cases, the semantics, i.e. the meaning of the words, is not so good.
For example, the sentence

```xml
<S> Easy , it 's said , to have a child with autism . </S>
```

would make more sense, to me at least, by replacing `Easy` with `Not easy`.

On the other hand, sentences like this one demonstrate good semantics:

```xml
<S> The government of president Michelle Bachelet has promised to maintain a " strong and systematic " military presence in key areas and to tackle any issue of violence , including kidnappings . </S>
```

[Michelle Bachelet](https://en.wikipedia.org/wiki/Michelle_Bachelet) was actually a president of Chile.
In her earlier life, she was also [kidnapped by military men](https://www.theguardian.com/world/2005/nov/22/chile.gender), so it kind of makes sense that she would be strong on the issue of kidnappings.

Here is an example of some weird semantics:

```xml
<S> Even though there are many brands that have low voice and no connection to the Internet , the iPhone is a great deal for consumers . </S>
```

The first part about `low voice` doesn't mean anything to me.
And I fail to see how there being `many brands that have no connection to the Internet` relates to `the iPhone is a great deal for consumers`.
But of course, all these sentences are generated independently, so the LM needs to learn to generate a meaning on the fly.
This is hard, as there is no context to the sentence being generated.

In any case, I am quite happy with the results as they are definitely some of the most natural-looking synthetic sentences I have seen so far.

<a name='nce.future'></a>
## Future work

I am currently working on a language modeling dataset based on one month of [reddit.com](https://www.reddit.com/) data.
Each sequence is basically a reddit submission consisting of a `TITLE`, `SELFTEXT` (or `URL`), `SCORE`, `AUTHOR` and a thread of `COMMENTS`.
These sequences are much longer (average of 205 tokens) than the sentences that make up the GBW dataset (average of 26 tokens).
Training is still underway, but to pique your interest, this is an example of generated data (indentation and line breaks added for clarity):

```xml
<SUBMISSION>
  <AUTHOR> http://www.reddit.com/u/[deleted] </AUTHOR>
  <SCORE> 0 </SCORE>
  <TITLE>
    [ WP ] You take a picture of a big bang .
    You discover an alien that lives in the center of the planet in an unknown way .
    You can say " what the fuck is that ? "
  </TITLE>
  <COMMENTS>
    <CoMMeNT>
      <ScoRE> 2 </ScoRE>
      <AuTHoR> http://www.reddit.com/u/Nev2k </AuTHoR>
      <BodY>
        I have a question .
        When i was younger , my parents had a house that had a living room in it .
        One that was only a small portion of an entire level .
        This was a month before i got my money .
        If i was living in a house with a " legacy " i would make some mistakes .
        When i was a child , i did n't know how to do shit about the house .
        My parents got me into my own house and i never found a place to live .
        So i decide to go to college .
        I was so freaked out , i didnt have the drive to see them .
        I never had a job , i was n't going anywhere .
        I was so happy .
        I knew i was going to be there .
        I gave myself a job and my parents came .
        That 's when i realized that i was in the wrong .
        So i started to go .
        I couldnt decide how long i wanted to live in this country .
        I was so excited about the future .
        I had a job .
        I saved my money .
        I did n't have a job .
        I went to a highschool in a small town .
        I had a job .
        A job .
        I did n't know what to do .
        I was terrified of losing my job .
        So i borrowed my $ 1000 in an hour .
        I could n't afford to pay my rent .
        I was so low on money .
        I had my parents and i got into a free college .
        I got in touch with my parents .
        All of my friends were dead .
        I was still with my family for a week .
        I became a good parent .
        I was a good choice .
        When i got on my HSS i was going to go to my parents ' house .
        I started to judge my parents .
        I had a minor problem .
        My parents .
        I was so fucking bad .
        My sister had a voice that was very loud .
        I 'm sure my cousins were in a place where i could just hear my voice .
        I felt like i was supposed to be angry .
        I was so angry .
        To cope with this .
        My dad and i were both on break and i felt so alone .
        I got unconscious and my mum left .
        When I got to college , i was back in school .
        I was a good kid .
        I was happy .
        And I told myself I was ready .
        I told my parents .
        They always talked about how they were going to be a good mom , and that I was going to be ready for that .
        They always wanted to help me .
        I did n't know what to do .
        I had to .
        I tried to go back to my dad , because I knew a lot about my mom .
        I loved her .
        I cared about her .
        We cared for our family .
        The time together was my only relationship .
        I loved my heart .
        And I hated my mother .
        I chose it .
        I cried . I cried . I cried . I cried . I cried . I cried . I cried .
        The tears were gone .
        I cried . I cried . I cried . I cried . I cried . I cried . I cried . I cried . I cried . I cried .
        I do n't know how to do it .
        I do n't know how to deal with it .
        I ca n't feel my emotions .
        I ca n't get out of bed .
        I ca n't sleep .
        I ca n't tell my friends .
        I just need to leave .
        I want to leave .
        I hate myself .
        I hate feeling like I 'm being selfish .
        I feel like I 'm not good enough anymore .
        I need to find a new job .
        I hate that I have to get my shit together .
        I love my job .
        I 'm having a hard time .
        Why do I need to get a job ?
        I have no job .
        I have n't been feeling good lately .
        I feel like I 'm going to be so much worse in the long run .
        I feel so alone .
        I ca n't believe I 'm so sad about going through my entire life .
      </BodY>
      <AuTHoR> http://www.reddit.com/u/Scarbarella </AuTHoR>
    </CoMMeNT>
  </COMMENTS>
  <SUBREDDIT> http://www.reddit.com/r/offmychest </SUBREDDIT>
  <SELFTEXT>
    I do n't know what to do anymore .
    I feel like I 'm going to die and I 'm going to be sick because I have no more friends .
    I do n't know what to do about my depression and I do n't know where to go from here .
    I do n't know how I do because I know I 'm scared of being alone .
    Any advice would be appreciated .
    Love .
  </SELFTEXT>
</SUBMISSION>
```
+Conditioned on the opening `<SUBMISSION>` token, this generated sequence, although imperfect, is incredibly human. +Reading through the comment, I feel like I am reading a story written by an actual (somewhat schizophrenic) person. +The ability to similuate human creativity is one of the reasons I am so interested in using reddit data for language modeling. + +A less depressing sample is the following, which concerns the [Destiny](https://en.wikipedia.org/wiki/Destiny_(video_game)) video game: + +```xml +<SUBMISSION> + <SUBREDDIT> http://www.reddit.com/r/DestinyTheGame </SUBREDDIT> + <TITLE> + Does anyone have a link to the Destiny Grimoire that I can use to get my Xbox 360 to play ? + </TITLE> + <COMMENTS> + <CoMMeNT> + <AuTHoR> http://www.reddit.com/u/CursedSun </AuTHoR> + <BodY> + I 'd love to have a weekly reset . + </BodY> + <ScoRE> 1 </ScoRE> + </CoMMeNT> + </COMMENTS> + <SCORE> 0 </SCORE> + <SELFTEXT> + I have a few friends who are willing to help me out . + If I get to the point where I 'm not going to have to go through all the weekly raids , I 'll have to " complete " the raid . + I 'm doing the Weekly strike and then doing the Weekly ( and hopefully also the Weekly ) on Monday . + I 'm not planning to get the chest , but I am getting my first exotic that I just got done from my first Crota raid . + I 'm not sure how well it would work for the Nightfall and Weekly , but I do n't want to loose my progress . + I 'd love to get some other people to help me , and I 'm open to all suggestions . + I have a lot of experience with this stuff , so I figured it 's a good idea to know if I 'm getting the right answer . + I 'm truly sorry for the inconvenience . + </SELFTEXT> + <AUTHOR> <OOV> </AUTHOR> +</SUBMISSION> +``` + +For those not familiar with this game, terms like +[Grimoire](http://destiny.wikia.com/wiki/Grimoire), [weekly reset](https://www.vg247.com/tag/destiny-weekly-reset/), +[raids](http://destiny.wikia.com/wiki/Raid), [Nightfall stike](http://destiny.wikia.com/wiki/Weekly_Nightfall_Strike), +[exotics](http://destiny.wikia.com/wiki/Exotic) and [Crota raid](http://destiny.wikia.com/wiki/Crota%27s_End) +may seem odd. But these are all part of the game vocabulary. + +The particular model (a 4x1572 LSTM with dropout) only backpropagates through 50 time-steps. +What I would like to see is for the `COMMENTS` to actually answer the question posed by the `TITLE` and `SELFTEXT`. +This is a very difficult semantic problem which I hope the Reddit dataset will help solve. +More to follow in my next Torch blog post. + +<a name='nce.ref'></a> +## References + +1. *A Mnih, YW Teh*, [A fast and simple algorithm for training neural probabilistic language models](https://www.cs.toronto.edu/%7Eamnih/papers/ncelm.pdf) +2. *B Zoph, A Vaswani, J May, K Knight*, [Simple, Fast Noise-Contrastive Estimation for Large RNN Vocabularies](http://www.isi.edu/natural-language/mt/simple-fast-noise.pdf) +3. *R Pascanu, T Mikolov, Y Bengio*, [On the difficulty of training Recurrent Neural Networks](http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf) +4. *S Hochreiter, J Schmidhuber*, [Long Short Term Memory](http://web.eecs.utk.edu/~itamar/courses/ECE-692/Bobby_paper1.pdf) +5. *A Graves, A Mohamed, G Hinton*, [Speech Recognition with Deep Recurrent Neural Networks](http://arxiv.org/pdf/1303.5778.pdf) +6. *K Greff, RK Srivastava, J Koutník*, [LSTM: A Search Space Odyssey](http://arxiv.org/pdf/1503.04069) +7. 
7. *C Chelba, T Mikolov, M Schuster, Q Ge, T Brants, P Koehn, T Robinson*, [One billion word benchmark for measuring progress in statistical language modeling](http://arxiv.org/pdf/1312.3005)
8. *A Graves*, [Generating Sequences With Recurrent Neural Networks, table 1](http://arxiv.org/pdf/1308.0850v5.pdf)
diff --git a/blog/_posts/images/LM-Linear.png b/blog/_posts/images/LM-Linear.png
new file mode 100644
index 0000000..46c92d9
--- /dev/null
+++ b/blog/_posts/images/LM-Linear.png
Binary files differ
diff --git a/blog/_posts/images/LM-NCE.png b/blog/_posts/images/LM-NCE.png
new file mode 100644
index 0000000..39b6fad
--- /dev/null
+++ b/blog/_posts/images/LM-NCE.png
Binary files differ
diff --git a/blog/_posts/images/LM-params.png b/blog/_posts/images/LM-params.png
new file mode 100644
index 0000000..1ae0e05
--- /dev/null
+++ b/blog/_posts/images/LM-params.png
Binary files differ
diff --git a/blog/_posts/images/LSTM-NCE-curve.png b/blog/_posts/images/LSTM-NCE-curve.png
new file mode 100644
index 0000000..447cc56
--- /dev/null
+++ b/blog/_posts/images/LSTM-NCE-curve.png
Binary files differ
diff --git a/blog/_posts/images/LSTM.png b/blog/_posts/images/LSTM.png
new file mode 100644
index 0000000..80c6067
--- /dev/null
+++ b/blog/_posts/images/LSTM.png
Binary files differ
diff --git a/blog/_posts/images/rnnlm.png b/blog/_posts/images/rnnlm.png
new file mode 100644
index 0000000..ab8b7d3
--- /dev/null
+++ b/blog/_posts/images/rnnlm.png
Binary files differ
diff --git a/blog/_posts/images/small-vs-big-lstm.png b/blog/_posts/images/small-vs-big-lstm.png
new file mode 100644
index 0000000..b580afd
--- /dev/null
+++ b/blog/_posts/images/small-vs-big-lstm.png
Binary files differ