From 21bb018b80ad29f73910c3557020a320c6e0d659 Mon Sep 17 00:00:00 2001 From: nicholas-leonard Date: Tue, 19 Jul 2016 16:53:25 -0400 Subject: initial commit for NCE --- blog/_posts/2016-05-11-nce.md | 583 ++++++++++++++++++++++++++++++++++++++++++ blog/_posts/images/rnnlm.png | Bin 0 -> 17199 bytes 2 files changed, 583 insertions(+) create mode 100644 blog/_posts/2016-05-11-nce.md create mode 100644 blog/_posts/images/rnnlm.png diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md new file mode 100644 index 0000000..11ee839 --- /dev/null +++ b/blog/_posts/2016-05-11-nce.md @@ -0,0 +1,583 @@ +--- +layout: post +title: Noise Contrastive Estimation +comments: True +author: nicholas-leonard +excerpt: TODO +picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/output_52iFki.gif +--- + + + +In the past couple of months we have seen increased interest in generative character-level +recurrent neural network (RNN) models like [char-rnn](https://github.com/karpathy/char-rnn) +and the more recent [torch-rnn](https://github.com/jcjohnson/torch-rnn). +These models are very interesting as they can be used to generate sequences of text like: + +```lua + +Diablo + +I liked this game so much!! Hope telling that numbers' benefits and +features never found out at that level is a total breeze +because it's not even a developer/voice opening and rusher runs +the game against so many people having noticeable purchases of selling +the developers built or trying to run the patch to Jagex. + +``` + +The above was generated one character at a time using a sample of [reddit](https://www.reddit.com/) comments. +As you can see for yourself, the general structure of the generated text looks good at first view. +The tags are opened and closed appropriately. The first sentence looks good: `I liked this game so much!!` +and it is related to the subreddit of the post: `Diablo`. But reading the rest of it, we can +start to see the limitations of char-level language models. The spelling of individual words looks great, but +the meaning of the next sentence is difficult to understand. + +## Word-Level vs Char-Level Language Models + +In this blog post we will show how Torch can be used to train a large-scale word-level language model to generate +independent sentences. Word-level models have an important advantage of char-level models. +Take the following sequence as an example (a quote from Robert A. Heinlein): + +``` +Progress isn't made by early risers. It's made by lazy men trying to find easier ways to do something. +``` + +After tokenization, the word-level model might view this sequence as containing 22 tokens. +On the other hand, the char-level will view this sequence as containing 102 tokens. +This longer sequence makes the task of the char-level model harder than the word-level model, as it +must take into account dependencies between more tokens over more time-steps. + +The main advantage of char-level over word-level language models is that they +have a really small vocabulary. For example, the Google Billion Words dataset will contain approximately 800 characters +compared to 800,000 words (after pruning low-frequency tokens). In practice this means that char-level models will +require less memory and have faster inference than their word-level counterparts. + +## Output Layer Bottleneck + +With its small vocabulary of 10000 words, the Penn Tree Bank dataset is relatively easy to use to build word-level language models. 
+The output layer is still tractable to compute for both training and inference, especially for GPUs. +For these smaller vocabularies, the output layer is basically a `Linear` followed by a `SoftMax`: + +```lua +outputlayer = nn.Sequential() + :add(nn.Linear(hiddensize, vocabsize)) + :add(nn.SoftMax()) +``` + +However, when training with large vocabularies, like the 793471 words that makes up +the Google Billion Words (GBW) dataset [[1]](#nce.ref), +the output layer quickly becomes a bottle neck. +If you are training your model with a `batchsize = 32` (number of sequences per batch) and a `seqlen = 100` +(size of sequence to backpropagate through time), +the output of that layer will have shape `seqlen x batchsize x vocabsize`, or `32 x 100 x 793471`. +For a `FloatTensor` or `CudaTensor`, that single tensor will take up 10.156GB of memory. +The number can be double for gradients, and doubled again as both Linear and SoftMax store a copy for the output. +If somehow you can find a way to put >40GB on a GPU (or distribute it over many), you then run in the problem of +forward/backward propagating through that `outputlayer` in a reasonable time-frame. + +## GBW Data Loader + +For our word-level language model we use the GBW dataset. +The dataset is different from Penn Tree Bank in that sentences are +kept independent of each other. So then our dataset consists of a set of +independent variable-length sequences. The dataset can be easily loaded using +the [dataload](https://github.com/Element-Research/dataload) package: + +```lua +local dl = require 'dataload' +local train, valid, test = dl.loadGBW(batchsize) +``` + +The above will automatically download the data if not found on disk and +return the training, validation and test set. +These are [dl.MultiSequence](https://github.com/Element-Research/dataload#dl.MultiSequence) instances +which have the following constructor: + +```lua +dataloader = dl.MultiSequence(sequences, batchsize) +``` + +The `sequences` argument is a Lua table or [tds.Vector](https://github.com/torch/tds#d--tdsvec--tbl) +where each element is a Tensor containing an independent sequence. For example: + +```lua +sequences = { + torch.LongTensor{424,158,115,667,28,505,228}, + torch.LongTensor{389,456,188}, + torch.LongTensor{77,172,760,687,552,529} +} +batchsize = 2 +dataloader = dl.MultiSequence(sequences, batchsize) +``` + +Note how the sequences vary in length. +Like all [dl.DataLoader](https://github.com/Element-Research/dataload#dl.DataLoader) sub-classes, the +`dl.MultiSequence` loader provides a method for sub-sampling a batch of `inputs` and `targets` from the dataset: + +```lua +local inputs, targets = dataloader:sub(1, 10) +``` + +The `sub` method takes the `start` and `end` indices of sub-sequences to index. +Internally, these indices are only used to determine length (`seqlen`) of the requested multi-sequences. +Each successive call to `sub` will return multi-sequences contiguous to the previous ones. + +The returned `inputs` and `targets` are `seqlen x batchsize [x inputsize]` +tensors containg a batch of 2 multi-sequences, each containing 8 time-steps. +Starting with the `inputs` : + +```lua +print(inputs) + 0 0 + 424 77 + 158 172 + 115 760 + 667 687 + 28 552 + 505 0 + 0 424 +[torch.DoubleTensor of size 8x2] +``` + +Each column is vector containing potentially multiple sequences, i.e. a multi-sequence. +Independent sequences are seperated by zeros. 
We will see later how the +[rnn](https://github.com/Element-Research/rnn) package can use these zero-masked time-steps to +efficiently forget its hidden state between independent sequences, at the granularity of columns. +For now, notice how the original `sequences` are contained in the returned `inputs` and separated by zeros. + +The `targets` are similar to the `inputs`, but use masks of 1 to separate sequences (as `ClassNLLCriterion` will otherwise complain). +As is typical in language models, the task is to predict the next word, such that the `targets` are delayed by one time-step +with respect to the commensurate `inputs`: + +```lua +print(targets) + 1 1 + 158 172 + 115 760 + 667 687 + 28 552 + 505 529 + 228 1 + 1 158 +[torch.DoubleTensor of size 8x2] +``` + +The `train`, `valid` and `test` returned by the call to `dl.loadGBW` have the same properties as the above. +Except that the dataset is much bigger (it has one billion words). For debugging and such, we can choose to +load a smaller subset of the training set. This will load much faster than the default training set file: + +```lua +local train, valid, test = dl.loadGBW({2,2,2}, 'train_tiny.th7') +``` + +The above will use a `batchsize` of 2 for all sets. +Iteration through the dataloader is made easier using the [subiter](https://github.com/Element-Research/dataload#iterator-subiterbatchsize-epochsize-) : + +``` +local seqlen, epochsize = 3, 10 +for i, inputs, targets in train:subiter(seqlen, epochsize) do + print("T = " .. i) + print(inputs) +end +``` + +Which will output: + +```lua +T = 3 + 0 0 + 793470 793470 + 211427 6697 +[torch.DoubleTensor of size 3x2] + +T = 6 + 477149 400396 + 720601 213235 + 660496 368322 +[torch.DoubleTensor of size 3x2] + +T = 9 + 676607 61007 + 161927 767587 + 248714 635004 +[torch.DoubleTensor of size 3x2] + +T = 10 + 280570 130510 +[torch.DoubleTensor of size 1x2] + +``` + +We could also return the above batches as one big chunk instead: + +```lua +train:reset() -- resets the internal sequence iterator +print(train:sub(1,10)) + 0 0 + 793470 793470 + 211427 6697 + 477149 400396 + 720601 213235 + 660496 368322 + 676607 61007 + 161927 767587 + 248714 635004 + 280570 130510 +[torch.DoubleTensor of size 10x2] +``` + +Notice how the above small batches are aligned with this big chunk. Which +means that the data is iterated in sequence. + +Each sentence in the GBW dataset is encapsulated by `` and `` tokens to indicate the +start and end of the sequence, respectively. Each token is mapped to an integer. So for example, +you can see that `` is mapped to integer `793470` in the above example. +Now that we feel confident in our dataset, lets look at the model. + +## RNNLM + +Our task is to build a language model which will maximize the likelihood of the +next word given the history of previous words in the sentence. +The following figure illustrates the a Simple Recurrent Neural Network (Simple RNN) language model: + +![rnnlm](https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/rnnlm.png) + +So for this particular example, the model should maximize "is" given "what", and then "the" given "is" and so on. +The RNN as an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the following target words. +Simple RNNs are not the only kind of model that can be used model language. 
+There are also the more advanced Long Short Term Memory (LSTM) models [[3],[4],[5]](#nce.ref), which +have special gated cells that facilitate the backpropagation of gradients through longer sequences. +LSTMs can learn dependencies seperated by much longer time-steps . + +## Multi-layer LSTM + +The input layer of the the `lm` model is a lookup table : + +```lua +lm = nn.Sequential() + +-- input layer (i.e. word embedding space) +local lookup = nn.LookupTableMaskZero(#trainset.ivocab, opt.inputsize) +lm:add(lookup) -- input is seqlen x batchsize +``` + +A sub-class of `LookupTable`, we use the [LookupTableMaskZero](https://github.com/Element-Research/rnn#rnn.LookupTableMaskZero) +to learn word embeddings. The main difference is that it supports zero-indexes, which are forwarded as zero-tensors. +Then we have the actual multi-layer LSTM implementation, which uses the fast [SeqLSTM](https://github.com/Element-Research/rnn#rnn.SeqLSTM) module: + +```lua +local inputsize = opt.inputsize +for i,hiddensize in ipairs(opt.hiddensize) do + local rnn = nn.SeqLSTM(inputsize, hiddensize) + rnn.maskzero = true + lm:add(rnn) + if opt.dropout > 0 then + lm:add(nn.Dropout(opt.dropout)) + end + inputsize = hiddensize +end +``` + +The `SeqLSTM` implemention is very fast and it benchmarked by the [rnn-benchmarks](https://github.com/glample/rnn-benchmarks#lstm). +Next we split the output of the SeqLSTM (which is a `seqlen x batchsize x outputsize` Tensor) into a table containing a `batchsize x outputsize` tensor for +each time-step: + +```lua +lm:add(nn.SplitTable(1)) +``` + +### Noise Contrastive Estimation + +The output layer of the LM uses Noise Contrastive Estimation (NCE) to speed up training and reduce memory consumption: + +```lua +local unigram = trainset.wordfreq:float() +local ncemodule = nn.NCEModule(inputsize, #trainset.ivocab, opt.k, unigram, opt.Z) + +-- NCE requires {input, target} as inputs +lm = nn.Sequential() + :add(nn.ParallelTable() + :add(lm):add(nn.Identity())) + :add(nn.ZipTable()) -- {{x1,x2,...}, {t1,t2,...}} -> {{x1,t1},{x2,t2},...} + +-- encapsulate stepmodule into a Sequencer +lm:add(nn.Sequencer(nn.MaskZero(ncemodule, 1))) +``` + +The [NCEModule](https://github.com/Element-Research/dpnn#nn.NCEModule) is a more efficient version of: + +```lua +nn.Sequential():add(nn.Linear(inputsize, #trainset.ivocab)):add(nn.LogSoftMax()) +``` + +Along with the [NCECriterion](https://github.com/Element-Research/dpnn#nn.NCECriterion), +the `NCEModule` implements the algorithm is described in [[1]](#nce.ref). +I won't go into the details of the algorithm as it involves a lot of math. +The way it works is that for each target word (the likelihood of which we want to maximize), +`k` words are sampled from a noise distribution, which is typically the unigram distribution. +The `unigram` above is a tensor of size 793470 where each element is the frequency of the commensurate word in the corpus. + +Sampling from such a large distribution using something like [torch.multinomial](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.multinomial) +can become a bottleneck during training. +So we implemented a more efficient version in [torch.AliasMultinomial](https://github.com/nicholas-leonard/torchx/blob/master/AliasMultinomial.lua). +The latter multinomial sampler requires more setup time than the former, but this isn't a problem as the unigram distribution is constant. 
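+
+To give a rough idea of why this trade-off pays off, here is a minimal Lua sketch of Walker's alias method, the technique behind such samplers. It is only an illustration of the idea (the function names are hypothetical); it is not the `torch.AliasMultinomial` API:
+
+```lua
+-- setup is O(K), but each subsequent draw is O(1),
+-- whereas a naive multinomial draw is O(K) per sample.
+local function buildAlias(probs) -- probs: table of K probabilities summing to 1
+   local K = #probs
+   local prob, alias, small, large = {}, {}, {}, {}
+   for i = 1, K do
+      prob[i] = probs[i] * K
+      table.insert(prob[i] < 1 and small or large, i)
+   end
+   while #small > 0 and #large > 0 do
+      local s, l = table.remove(small), table.remove(large)
+      alias[s] = l
+      prob[l] = prob[l] + prob[s] - 1
+      table.insert(prob[l] < 1 and small or large, l)
+   end
+   return prob, alias
+end
+
+local function drawAlias(prob, alias)
+   local i = math.random(#prob)                            -- pick a bucket uniformly
+   return math.random() < prob[i] and i or (alias[i] or i) -- keep it or follow its alias
+end
+
+local prob, alias = buildAlias{0.5, 0.25, 0.125, 0.125}
+print(drawAlias(prob, alias)) -- e.g. 1
+```
+
+Since the unigram distribution never changes during training, `buildAlias` runs only once, while `drawAlias` is called for every one of the `k` noise samples, which is where the speed-up comes from.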
+ +NCE uses the noise samples to approximate a normalization term `Z` where the output distribution is `exp(x[i])/Z` and `x[i]` is the output of the `Linear` for word `i`. +For the Softmax, which NCE tries to approximate, the `Z` is the sum over the `exp(x[i'])` over all words `i'`. +For NCE, the `Z` is typically fixed to `Z=1`. +Our initial experiments found that setting `Z` to `Z=N*mean(exp(x[i]))` +(where `N` is the number of words and the `mean` is approximated over a small batch of word samples `i`) +gave much better results. + +One notable aspect of NCE papers (there are many) is that they often forget to mention the importance of parameter initialization. +Setting `Z=1` is only really possible if the `NCEModule.bias` is initialized to `bias[i] = -log(N)`. +This is what the authors of [[2]](#nce.ref) use, although it isn't mentioned in the paper (I contacted one of the authors to find out). + +Sampling `k` noise samples per time-step and per batch-row means that the `NCEModule` needs to internally use something like +[torch.baddbmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.baddbmm) to compute the `output`. +Reference [[2]](#nce.ref) implement a faster version where the noise samples are drawn once and used for the entire batch (but still once for each time-step). +This make the code a bit faster as the more efficient [torch.addmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.addmm) can be used. +This faster NCE version described in [[2]](#nce.ref) is the default implementation of the `NCEModule`. Sampling per batch-row can be turned on with `NCEModule.rownoise=true`. + +## Scripts + +The experiments presented here use three scripts: two for training and one for evaluation. + +### Single-GPU Training Script + +We provide training scripts for a single gpu via the [noise-contrastive-estimate.lua](https://github.com/Element-Research/rnn/blob/master/examples/noise-contrastive-estimate.lua) script. +Running the following on a 12GB NVIDIA Titan X should resulted in a test set perplexity of 65.6 after 321 epochs: + +```bash +th examples/noise-contrastive-estimate.lua --cuda --device 2 --startlr 1 --saturate 300 --cutoff 10 --progress --uniform 0.1 --seqlen 50 --batchsize 128 --trainsize 400000 --validsize 40000 --hiddensize '{250,250}' --k 400 --minlr 0.001 --momentum 0.9 +``` + +The resulting model will look like this: + +```lua +nn.Serial @ nn.Sequential { + [input -> (1) -> (2) -> (3) -> output] + (1): nn.ParallelTable { + input + |`-> (1): nn.Sequential { + | [input -> (1) -> (2) -> (3) -> (4) -> output] + | (1): nn.LookupTableMaskZero + | (2): nn.SeqLSTM + | (3): nn.SeqLSTM + | (4): nn.SplitTable + | } + |`-> (2): nn.Identity + ... -> output + } + (2): nn.ZipTable + (3): nn.Sequencer @ nn.Recursor @ nn.MaskZero @ nn.NCEModule(250 -> 793471) +} +``` + +To use about one third less memory, you can set momentum of 0. + +### Evaluation Script + +To evaluate a saved model, you can use the [evaluate-rnnlm.lua](https://github.com/Element-Research/rnn/blob/master/scripts/evaluate-rnnlm.lua) script: + +```bash +th scripts/evaluate-rnnlm.lua --xplogpath /home/nicholas14/save/rnnlm/gbw:uranus:1466538423:1.t7 --cuda +``` + +where you should replace `/home/nicholas14/save/rnnlm/gbw:uranus:1466538423:1.t7` with the path to your own trained model. +Evaluating on the test set can take a while as it must use the less efficient `Linear` + `SoftMax`, and thus a very small batch size (so as not to use too much memory). 
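+
+Conceptually, full-softmax evaluation boils down to something like the following sketch, where `hiddensize`, `vocabsize`, `hidden` (a `batchsize x hiddensize` tensor of hidden states) and `targets` are hypothetical placeholders, and we assume the trained `NCEModule` exposes its `weight` and `bias` tensors; the actual script handles these details for you:
+
+```lua
+-- build an equivalent Linear + LogSoftMax output layer from the NCE parameters
+local fullsoftmax = nn.Sequential()
+   :add(nn.Linear(hiddensize, vocabsize))
+   :add(nn.LogSoftMax())
+fullsoftmax:get(1).weight:copy(ncemodule.weight)
+fullsoftmax:get(1).bias:copy(ncemodule.bias)
+
+-- perplexity is the exponential of the mean negative log-likelihood per word
+local logprobs = fullsoftmax:forward(hidden)
+local nll = nn.ClassNLLCriterion():forward(logprobs, targets)
+print('perplexity', math.exp(nll))
+```
+
+This is also why evaluation is so much slower than training: the `Linear` is applied over all 793471 words for every single prediction.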
+ +The evaluation script can also be used to generate samples from the language model: + +```bash +th scripts/evaluate-rnnlm.lua --xplogpath /home/nicholas14/save/rnnlm/gbw:uranus:1466790001:1.t7 --cuda --nsample 200 --temperature 0.7 +``` + +The `--nsample` flag specifies how many tokens to sample. The first token input to the language model is the start-of-sentence tag (``). +When the end-of-sentence tag (``), the model's hidden states are set to zero, such that each sentence is sampled independently. +The `--temperature` flag can be reduced to make the sampling more deterministic. + +```xml + There were a number of players in the starting lineup during the season and in recent weeks , in recent years , some fans have been frustrated . + WASHINGTON ( Reuters ) - The government plans to cut greenhouse gases by as much as 12 % on the global economy , a new report said . + One of the most important things about the day was that the two companies had just been guilty of the same nature . + " It has been as much a bit of a public service as a public organisation . + In a nutshell , it 's not only the fate of the economy . + It was last modified at 23.31 GMT on Saturday 22 December 2009 . + He told the newspaper the prosecution had been treating the small boy as " a young man who was playing for a while . + " We are astounded that our employees are not made aware of the risks and risks they are pursuing during this period of time , " he said . + " I had a right to come up with the idea . + But the truth +``` + +### Multi-GPU Training Script + +As can be observed in the previous section, training a 2-layer LSTM with only 250 hidden units will not yield the best +generated samples. The model needs much more capacity than what can fit on a 12GB GPU. +The [multigpu-nce-rnnlm.lua](https://github.com/Element-Research/rnn/blob/master/examples/multigpu-nce-rnnlm.lua) script can be used +to train a model on four GPUs. + +It uses the [GPU](https://github.com/torch/nn/blob/master/doc/simple.md#nn.GPU) to decorate modules such that +all their operations and memory are hosted on a specified device. +The `GPU` module won't parallelize kernel execution over different GPU-devices. +But it does allow us to distribute large models over devices. + +For our LM, the input word embeddings (i.e. `LookupTableMaskZero`) and output layer (i.e. `NCEModule`) take up most of the memory. +The first was pretty easy to distribute: + +```lua +lm = nn.Sequential() +lm:add(nn.Convert()) + +-- input layer (i.e. word embedding space) +local concat = nn.Concat(3) +for device=1,2 do + local inputsize = device == 1 and torch.floor(opt.inputsize/2) or torch.ceil(opt.inputsize/2) + local lookup = nn.LookupTableMaskZero(#trainset.ivocab, inputsize) + lookup.maxnormout = -1 -- prevent weird maxnormout behaviour + concat:add(nn.GPU(lookup, device):cuda()) -- input is seqlen x batchsize +end +``` + +Basically, the embedding space is split into two tables. +For a 2048 unit embedding space, half, i.e. 1024 units, are located on each of two devices. +We use `Concat` to concatenate them back together after a `forward`. + +For the hidden layers (i.e. `SeqLSTM`), we just distribute them on the devices used by the input layer. +The hidden layers use up little memory (approximately 1GB each) so they aren't the problem. +We locate them on the same devices as the input layer as the output layer utilizes more memory (for buffers). 
+ +```lua +local inputsize = opt.inputsize +for i,hiddensize in ipairs(opt.hiddensize) do + local rnn = nn.SeqLSTM(inputsize, hiddensize) + rnn.maskzero = true + local device = i <= #opt.hiddensize/2 and 1 or 2 + lm:add(nn.GPU(rnn, device):cuda()) + if opt.dropout > 0 then + lm:add(nn.GPU(nn.Dropout(opt.dropout), device):cuda()) + end + inputsize = hiddensize +end + +lm:add(nn.GPU(nn.SplitTable(1), 3):cuda()) +``` + +The `NCEModule` was a bit more difficult to distribute as it cannot be so easily parallelized as `LookupTableMaskZero`. +Our solution was to provide a simple [multicuda()](https://github.com/Element-Research/dpnn/blob/26edf00f7f22edd1e090619bb10528557cede4df/NCEModule.lua#L419-L439) +method to distribute the `weight` on `gradWeight` on different devices. +This is accomplished by swaping the weight tensors for our own : [torch.MultiCudaTensor](https://github.com/nicholas-leonard/torchx/blob/master/MultiCudaTensor.lua). +Lua has no severe type-checking system, so you can fake a tensor by creating a `torch.class` table with the same methods. +To save time, the current version of `MultiCudaTensor` only supports the operations required by the NCEModule. +The advantage of this approach is that it requires minimal changes to the NCEModule and maintains backward compatiblity without requiring redundant code or excessive refactoring. + +```lua +-- output layer +local unigram = trainset.wordfreq:float() +ncemodule = nn.NCEModule(inputsize, #trainset.ivocab, opt.k, unigram, opt.Z) +ncemodule:reset() -- initializes bias to get approx. Z = 1 +ncemodule.batchnoise = not opt.rownoise +-- distribute weight, gradWeight and momentum on devices 3 and 4 +ncemodule:multicuda(3,4) + +-- NCE requires {input, target} as inputs +lm = nn.Sequential() + :add(nn.ParallelTable() + :add(lm):add(nn.Identity())) + :add(nn.ZipTable()) -- {{x1,x2,...}, {t1,t2,...}} -> {{x1,t1},{x2,t2},...} + +-- encapsulate stepmodule into a Sequencer +local masked = nn.MaskZero(ncemodule, 1):cuda() +lm:add(nn.GPU(nn.Sequencer(masked), 3, opt.device):cuda()) +``` + +To reproduce the results in [[2]](#nce.ref) run the following: + +```bash +th examples/multigpu-nce-rnnlm.lua --startlr 0.7 --saturate 300 --minlr 0.001 --cutoff 10 --progress --uniform 0.1 --seqlen 50 --batchsize 128 --trainsize 400000 --validsize 40000 --hiddensize '{2048,2048,2048,2048}' --dropout 0.2 --k 400 --Z 1 --momentum -1 +``` + +Notable differences to paper are the following: + * we use a [gradient norm clipping](https://github.com/Element-Research/dpnn#nn.Module.gradParamClip) [[3]](#nce.ref) (with a `cutoff` norm of 10) to counter exploding and vanishing gradient; + * they use an adaptive learning rate schedule (which isn't specified in the paper). We linearly decay from a learning rate of 0.7 (which they also start from) such that it reaches 0.001 after 300 epochs; + * we use `k=400` samples whereas they use `k=100`. Why? I didn't see a major drop in speed, so why not? + * we use a sequence length of `seqlen=50` for Truncated BPTT. They use 100 (again, not in the paper). The average length of sentences in the dataset is 27 so 50 is more than enough. + +Like them, we use a `dropout=0.2` between LSTM layers. 
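+
+For reference, the linear decay mentioned in the list above (`--startlr 0.7 --saturate 300 --minlr 0.001`) amounts to something like the following sketch; the training script's own bookkeeping may differ slightly:
+
+```lua
+local startlr, minlr, saturate = 0.7, 0.001, 300
+local lr = startlr
+for epoch = 1, 400 do
+   -- ... train for one epoch using learning rate lr ...
+   lr = lr + (minlr - startlr) / saturate -- decrease by a constant amount each epoch
+   lr = math.max(lr, minlr)               -- hold at minlr once saturated
+end
+```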
+This is what the resulting model looks like: + +```lua +nn.Serial @ nn.Sequential { +  [input -> (1) -> (2) -> (3) -> output] +  (1): nn.ParallelTable { +    input +      |`-> (1): nn.Sequential { +      |      [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> output] +      |      (1): nn.Convert +      |      (2): nn.GPU(2) @ nn.Concat { +      |        input +      |          |`-> (1): nn.GPU(1) @ nn.LookupTableMaskZero +      |          |`-> (2): nn.GPU(2) @ nn.LookupTableMaskZero +      |           ... -> output +      |      } +      |      (3): nn.GPU(2) @ nn.Dropout(0.2, busy) +      |      (4): nn.GPU(1) @ nn.SeqLSTM +      |      (5): nn.GPU(1) @ nn.Dropout(0.2, busy) +      |      (6): nn.GPU(1) @ nn.SeqLSTM +      |      (7): nn.GPU(1) @ nn.Dropout(0.2, busy) +      |      (8): nn.GPU(2) @ nn.SeqLSTM +      |      (9): nn.GPU(2) @ nn.Dropout(0.2, busy) +      |      (10): nn.GPU(2) @ nn.SeqLSTM +      |      (11): nn.GPU(2) @ nn.Dropout(0.2, busy) +      |      (12): nn.GPU(3) @ nn.SplitTable +      |    } +      |`-> (2): nn.Identity +       ... -> output +  } +  (2): nn.ZipTable +  (3): nn.GPU(3) @ nn.Sequencer @ nn.Recursor @ nn.MaskZero @ nn.NCEModule(2048 -> 793471) +} +``` + +## Results + +On the 4-layer LSTM with 2048 hidden units, [[1]](#nce.ref) obtain 43.2 perplexity on the GBW test set. +After early-stopping on a sub-set of the validation set (at 100 epochs of training), our model was able to reach *40.61* perplexity. + +This model was run on 4x12GB NVIDIA Titan X GPUs. +Training requires approximately 40GB of memory, distributed across the 4 GPU devices. +As in the original paper, we do not make use of momentum as it provides little benefit and requires 1/2 more memory. + +Training runs at about 3800 words/second. + +### Generated Samples + +Here are 8 sentences sampled independently from the 4-layer LSTM with a `temperature` or 0.7: + +```xml + The company said its net profit rose to $ 289 million , or 96 cents per share , in the three months ended on March 31 compared with $ 173 million , or $ 0.68 a share , a year ago . + But I 've been a bit disappointed with our performance , " said Wenger . + The first is an even bigger problem . + The next big thing for him is he will be able to tell the world he is thinking about his future . + The new rules have been added to the legislation so that they don 't have to be approved for public use . + The Pentagon 's top counter-terrorism official , who has been in charge of a new system of intelligence collection and inspection , wrote in an e-mail message that while the new system could be easily implemented , it remains an option . + " I was trying to get a glass of water . + Later he was driven to a nearby house where he was later found to be severely ill . +``` + +Not bad, right? + +### Learning Curves + + + + +## References + +1. *A Mnih, YW Teh*, [A fast and simple algorithm for training neural probabilistic language models](https://www.cs.toronto.edu/%7Eamnih/papers/ncelm.pdf) +2. *B Zoph, A Vaswani, J May, K Knight*, [Simple, Fast Noise-Contrastive Estimation for Large RNN Vocabularies](http://www.isi.edu/natural-language/mt/simple-fast-noise.pdf) +3. *R Pascanu, T Mikolov, Y Bengio*, [On the difficulty of training Recurrent Neural Networks](http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf) +4. *S Hochreiter, J Schmidhuber*, [Long Short Term Memory](http://web.eecs.utk.edu/~itamar/courses/ECE-692/Bobby_paper1.pdf) +5. 
*A Graves, A Mohamed, G Hinton*, [Speech Recognition with Deep Recurrent Neural Networks](http://arxiv.org/pdf/1303.5778.pdf) +6. *K Greff, RK Srivastava, J Koutník*, [LSTM: A Search Space Odyssey](http://arxiv.org/pdf/1503.04069) diff --git a/blog/_posts/images/rnnlm.png b/blog/_posts/images/rnnlm.png new file mode 100644 index 0000000..ab8b7d3 Binary files /dev/null and b/blog/_posts/images/rnnlm.png differ -- cgit v1.2.3 From 259333edd7df4be28257aa7e08b92e79e9aefbf1 Mon Sep 17 00:00:00 2001 From: nicholas-leonard Date: Wed, 20 Jul 2016 11:55:44 -0400 Subject: nce++ --- blog/_posts/2016-05-11-nce.md | 127 +++++++++++++++++++++++------------------- 1 file changed, 70 insertions(+), 57 deletions(-) diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md index 11ee839..195579c 100644 --- a/blog/_posts/2016-05-11-nce.md +++ b/blog/_posts/2016-05-11-nce.md @@ -7,12 +7,28 @@ excerpt: TODO picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/output_52iFki.gif --- - - -In the past couple of months we have seen increased interest in generative character-level -recurrent neural network (RNN) models like [char-rnn](https://github.com/karpathy/char-rnn) + + + * Word versus character level language models + * Recurrent neural network language models + * Loading the Google billion words dataset + * Building a multi-layer LSTM + * Training and evaluation scripts + * Results + +In this blog post, we use Torch to use noise contrastive estimation (NCE) [[2]](#nce.ref) +to train a multi-GPU recurrent neural network language model (RNNLM) +on the Google billion words (GBW) dataset [[7]](#nce.ref). +This blog post is the result of many months of work. +The enormity of the dataset caused us to contribute some novel open-source modules, criteria and even a multi-GPU tensor. +We also provide scripts so that you can train and evaluate your own language models. + +## Word versus character level language models + +In recent months you may have noticed increased interest in generative character-level +RNNLMs like [char-rnn](https://github.com/karpathy/char-rnn) and the more recent [torch-rnn](https://github.com/jcjohnson/torch-rnn). -These models are very interesting as they can be used to generate sequences of text like: +These models are very interesting as they can be used to generate sequences of characters like the following: ```lua @@ -27,16 +43,14 @@ the developers built or trying to run the patch to Jagex. ``` The above was generated one character at a time using a sample of [reddit](https://www.reddit.com/) comments. -As you can see for yourself, the general structure of the generated text looks good at first view. +As you can see for yourself, the general structure of the generated text looks good, at first view. The tags are opened and closed appropriately. The first sentence looks good: `I liked this game so much!!` and it is related to the subreddit of the post: `Diablo`. But reading the rest of it, we can start to see the limitations of char-level language models. The spelling of individual words looks great, but -the meaning of the next sentence is difficult to understand. - -## Word-Level vs Char-Level Language Models +the meaning of the next sentence is difficult to understand (it is also very long). In this blog post we will show how Torch can be used to train a large-scale word-level language model to generate -independent sentences. Word-level models have an important advantage of char-level models. +independent sentences. 
Word-level models have an important advantage over char-level models. Take the following sequence as an example (a quote from Robert A. Heinlein): ``` @@ -53,30 +67,22 @@ have a really small vocabulary. For example, the Google Billion Words dataset wi compared to 800,000 words (after pruning low-frequency tokens). In practice this means that char-level models will require less memory and have faster inference than their word-level counterparts. -## Output Layer Bottleneck +## Recurrent neural network language models -With its small vocabulary of 10000 words, the Penn Tree Bank dataset is relatively easy to use to build word-level language models. -The output layer is still tractable to compute for both training and inference, especially for GPUs. -For these smaller vocabularies, the output layer is basically a `Linear` followed by a `SoftMax`: +Our task is to build a language model which will maximize the likelihood of the +next word given the history of previous words in the sentence. +The following figure illustrates the a Simple Recurrent Neural Network (Simple RNN) language model: -```lua -outputlayer = nn.Sequential() - :add(nn.Linear(hiddensize, vocabsize)) - :add(nn.SoftMax()) -``` +![rnnlm](images/rnnlm.png) -However, when training with large vocabularies, like the 793471 words that makes up -the Google Billion Words (GBW) dataset [[1]](#nce.ref), -the output layer quickly becomes a bottle neck. -If you are training your model with a `batchsize = 32` (number of sequences per batch) and a `seqlen = 100` -(size of sequence to backpropagate through time), -the output of that layer will have shape `seqlen x batchsize x vocabsize`, or `32 x 100 x 793471`. -For a `FloatTensor` or `CudaTensor`, that single tensor will take up 10.156GB of memory. -The number can be double for gradients, and doubled again as both Linear and SoftMax store a copy for the output. -If somehow you can find a way to put >40GB on a GPU (or distribute it over many), you then run in the problem of -forward/backward propagating through that `outputlayer` in a reasonable time-frame. +So for this particular example, the model should maximize "is" given "what", and then "the" given "is" and so on. +The RNN has an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the following target words. +Simple RNNs are not the only kind of model that can be used model language. +There are also the more advanced Long Short Term Memory (LSTM) models [[3],[4],[5]](#nce.ref), which +have special gated cells that facilitate the backpropagation of gradients through longer sequences. +LSTMs can learn dependencies seperated by much longer time-steps . -## GBW Data Loader +## Loading the Google billion words dataset For our word-level language model we use the GBW dataset. The dataset is different from Penn Tree Bank in that sentences are @@ -233,24 +239,9 @@ means that the data is iterated in sequence. Each sentence in the GBW dataset is encapsulated by `` and `` tokens to indicate the start and end of the sequence, respectively. Each token is mapped to an integer. So for example, you can see that `` is mapped to integer `793470` in the above example. -Now that we feel confident in our dataset, lets look at the model. - -## RNNLM - -Our task is to build a language model which will maximize the likelihood of the -next word given the history of previous words in the sentence. 
-The following figure illustrates the a Simple Recurrent Neural Network (Simple RNN) language model: - -![rnnlm](https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/rnnlm.png) - -So for this particular example, the model should maximize "is" given "what", and then "the" given "is" and so on. -The RNN as an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the following target words. -Simple RNNs are not the only kind of model that can be used model language. -There are also the more advanced Long Short Term Memory (LSTM) models [[3],[4],[5]](#nce.ref), which -have special gated cells that facilitate the backpropagation of gradients through longer sequences. -LSTMs can learn dependencies seperated by much longer time-steps . +Now that we feel confident in our dataset, lets look at the model. -## Multi-layer LSTM +## Building a multi-layer LSTM The input layer of the the `lm` model is a lookup table : @@ -287,7 +278,29 @@ each time-step: lm:add(nn.SplitTable(1)) ``` -### Noise Contrastive Estimation +### Output layer bottleneck + +With its small vocabulary of 10000 words, the Penn Tree Bank dataset is relatively easy to use to build word-level language models. +The output layer is still computationally tractable for both training and inference, especially for GPUs. +For these smaller vocabularies, the output layer is basically a `Linear` followed by a `SoftMax`: + +```lua +outputlayer = nn.Sequential() + :add(nn.Linear(hiddensize, vocabsize)) + :add(nn.SoftMax()) +``` + +However, when training with large vocabularies, like the 793471 words that makes up the GBW dataset , +the output layer quickly becomes a bottle neck. +If you are training your model with a `batchsize = 32` (number of sequences per batch) and a `seqlen = 100` +(size of sequence to backpropagate through time), +the output of that layer will have shape `seqlen x batchsize x vocabsize`, or `32 x 100 x 793471`. +For a `FloatTensor` or `CudaTensor`, that single tensor will take up 10.156GB of memory. +The number can be double for gradients, and doubled again as both Linear and SoftMax store a copy for the output. +If somehow you can find a way to put >40GB on a GPU (or distribute it over many), you then run in the problem of +forward/backward propagating through that `outputlayer` in a reasonable time-frame. + +### Noise contrastive estimation The output layer of the LM uses Noise Contrastive Estimation (NCE) to speed up training and reduce memory consumption: @@ -340,11 +353,11 @@ Reference [[2]](#nce.ref) implement a faster version where the noise samples are This make the code a bit faster as the more efficient [torch.addmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.addmm) can be used. This faster NCE version described in [[2]](#nce.ref) is the default implementation of the `NCEModule`. Sampling per batch-row can be turned on with `NCEModule.rownoise=true`. -## Scripts +## Training and evaluation scripts The experiments presented here use three scripts: two for training and one for evaluation. -### Single-GPU Training Script +### Single-GPU training script We provide training scripts for a single gpu via the [noise-contrastive-estimate.lua](https://github.com/Element-Research/rnn/blob/master/examples/noise-contrastive-estimate.lua) script. 
Running the following on a 12GB NVIDIA Titan X should resulted in a test set perplexity of 65.6 after 321 epochs: @@ -377,8 +390,9 @@ nn.Serial @ nn.Sequential { To use about one third less memory, you can set momentum of 0. -### Evaluation Script +### Evaluation script +The evaluation script can be used to measure perplexity on the test set or sample independent sentences. To evaluate a saved model, you can use the [evaluate-rnnlm.lua](https://github.com/Element-Research/rnn/blob/master/scripts/evaluate-rnnlm.lua) script: ```bash @@ -407,11 +421,10 @@ The `--temperature` flag can be reduced to make the sampling more deterministic. It was last modified at 23.31 GMT on Saturday 22 December 2009 . He told the newspaper the prosecution had been treating the small boy as " a young man who was playing for a while . " We are astounded that our employees are not made aware of the risks and risks they are pursuing during this period of time , " he said . - " I had a right to come up with the idea . - But the truth + " I had a right to come up with the idea . ``` -### Multi-GPU Training Script +### Multi-GPU training script As can be observed in the previous section, training a 2-layer LSTM with only 250 hidden units will not yield the best generated samples. The model needs much more capacity than what can fit on a 12GB GPU. @@ -543,7 +556,7 @@ nn.Serial @ nn.Sequential { ## Results On the 4-layer LSTM with 2048 hidden units, [[1]](#nce.ref) obtain 43.2 perplexity on the GBW test set. -After early-stopping on a sub-set of the validation set (at 100 epochs of training), our model was able to reach *40.61* perplexity. +After early-stopping on a sub-set of the validation set (at 100 epochs of training where 1 epoch is 128 sequences x 400k words/sequence), our model was able to reach *40.61* perplexity. This model was run on 4x12GB NVIDIA Titan X GPUs. Training requires approximately 40GB of memory, distributed across the 4 GPU devices. @@ -551,7 +564,7 @@ As in the original paper, we do not make use of momentum as it provides little b Training runs at about 3800 words/second. -### Generated Samples +### Generating Sentences Here are 8 sentences sampled independently from the 4-layer LSTM with a `temperature` or 0.7: @@ -566,7 +579,6 @@ Here are 8 sentences sampled independently from the 4-layer LSTM with a `tempera Later he was driven to a nearby house where he was later found to be severely ill . ``` -Not bad, right? ### Learning Curves @@ -581,3 +593,4 @@ Not bad, right? 4. *S Hochreiter, J Schmidhuber*, [Long Short Term Memory](http://web.eecs.utk.edu/~itamar/courses/ECE-692/Bobby_paper1.pdf) 5. *A Graves, A Mohamed, G Hinton*, [Speech Recognition with Deep Recurrent Neural Networks](http://arxiv.org/pdf/1303.5778.pdf) 6. *K Greff, RK Srivastava, J Koutník*, [LSTM: A Search Space Odyssey](http://arxiv.org/pdf/1503.04069) +7. 
*C Chelba, T Mikolov, M Schuster, Q Ge, T Brants, P Koehn, T Robinson*, [One billion word benchmark for measuring progress in statistical language modeling](http://arxiv.org/pdf/1312.3005) -- cgit v1.2.3 From 147898b12aec04f85badb85a0a6bf348ac529b51 Mon Sep 17 00:00:00 2001 From: nicholas-leonard Date: Wed, 20 Jul 2016 12:11:52 -0400 Subject: more links --- blog/_posts/2016-05-11-nce.md | 42 +++++++++++++++++++++++++++--------------- 1 file changed, 27 insertions(+), 15 deletions(-) diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md index 195579c..f1ff0bd 100644 --- a/blog/_posts/2016-05-11-nce.md +++ b/blog/_posts/2016-05-11-nce.md @@ -3,26 +3,30 @@ layout: post title: Noise Contrastive Estimation comments: True author: nicholas-leonard -excerpt: TODO +excerpt: Noise contrastive estimation is used +to train a multi-GPU recurrent neural network language model +on the Google billion words dataset. picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/output_52iFki.gif --- - * Word versus character level language models - * Recurrent neural network language models - * Loading the Google billion words dataset - * Building a multi-layer LSTM - * Training and evaluation scripts - * Results + * [Word versus character level language models](#nce.char) + * [Recurrent neural network language models](#nce.rnnlm) + * [Loading the Google billion words dataset](#nce.gbw) + * [Building a multi-layer LSTM](#nce.lstm) + * [Training and evaluation scripts](#nce.script) + * [Results](#nce.result) + * [References](#nce.ref) -In this blog post, we use Torch to use noise contrastive estimation (NCE) [[2]](#nce.ref) +In this Torch blog post, we use noise contrastive estimation (NCE) [[2]](#nce.ref) to train a multi-GPU recurrent neural network language model (RNNLM) on the Google billion words (GBW) dataset [[7]](#nce.ref). This blog post is the result of many months of work. The enormity of the dataset caused us to contribute some novel open-source modules, criteria and even a multi-GPU tensor. We also provide scripts so that you can train and evaluate your own language models. + ## Word versus character level language models In recent months you may have noticed increased interest in generative character-level @@ -60,28 +64,32 @@ Progress isn't made by early risers. It's made by lazy men trying to find easier After tokenization, the word-level model might view this sequence as containing 22 tokens. On the other hand, the char-level will view this sequence as containing 102 tokens. This longer sequence makes the task of the char-level model harder than the word-level model, as it -must take into account dependencies between more tokens over more time-steps. +must take into account dependencies between more tokens over more time-steps.[[8]](#nce.ref) The main advantage of char-level over word-level language models is that they -have a really small vocabulary. For example, the Google Billion Words dataset will contain approximately 800 characters +have a really small vocabulary. For example, the GBW dataset will contain approximately 800 characters compared to 800,000 words (after pruning low-frequency tokens). In practice this means that char-level models will require less memory and have faster inference than their word-level counterparts. 
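+
+To make that difference concrete, here is a rough back-of-the-envelope comparison, assuming for illustration a 250-unit hidden layer feeding the output layer (the size used later in the single-GPU script):
+
+```lua
+local hiddensize = 250
+print('char-level output weights:', hiddensize * 800)    -- ~2e5 parameters
+print('word-level output weights:', hiddensize * 800000) -- ~2e8 parameters, a 1000x difference
+```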
+ ## Recurrent neural network language models -Our task is to build a language model which will maximize the likelihood of the +Our task is to build a language model which maximizes the likelihood of the next word given the history of previous words in the sentence. -The following figure illustrates the a Simple Recurrent Neural Network (Simple RNN) language model: +The following figure illustrates the a simple recurrent neural network (Simple RNN) language model: ![rnnlm](images/rnnlm.png) -So for this particular example, the model should maximize "is" given "what", and then "the" given "is" and so on. -The RNN has an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the following target words. +For this particular example, the model should maximize "is" given "what", and then "the" given "is" and so on. +The Simple RNN has an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the following target words. Simple RNNs are not the only kind of model that can be used model language. There are also the more advanced Long Short Term Memory (LSTM) models [[3],[4],[5]](#nce.ref), which have special gated cells that facilitate the backpropagation of gradients through longer sequences. -LSTMs can learn dependencies seperated by much longer time-steps . +LSTMs can learn dependencies seperated between much longer time-steps. +Like convolutions, these LSTM layers can also be stacked to form deeper models. +In the Building + ## Loading the Google billion words dataset For our word-level language model we use the GBW dataset. @@ -241,6 +249,7 @@ start and end of the sequence, respectively. Each token is mapped to an integer. you can see that `` is mapped to integer `793470` in the above example. Now that we feel confident in our dataset, lets look at the model. + ## Building a multi-layer LSTM The input layer of the the `lm` model is a lookup table : @@ -353,6 +362,7 @@ Reference [[2]](#nce.ref) implement a faster version where the noise samples are This make the code a bit faster as the more efficient [torch.addmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.addmm) can be used. This faster NCE version described in [[2]](#nce.ref) is the default implementation of the `NCEModule`. Sampling per batch-row can be turned on with `NCEModule.rownoise=true`. + ## Training and evaluation scripts The experiments presented here use three scripts: two for training and one for evaluation. @@ -553,6 +563,7 @@ nn.Serial @ nn.Sequential { } ``` + ## Results On the 4-layer LSTM with 2048 hidden units, [[1]](#nce.ref) obtain 43.2 perplexity on the GBW test set. @@ -594,3 +605,4 @@ Here are 8 sentences sampled independently from the 4-layer LSTM with a `tempera 5. *A Graves, A Mohamed, G Hinton*, [Speech Recognition with Deep Recurrent Neural Networks](http://arxiv.org/pdf/1303.5778.pdf) 6. *K Greff, RK Srivastava, J Koutník*, [LSTM: A Search Space Odyssey](http://arxiv.org/pdf/1503.04069) 7. *C Chelba, T Mikolov, M Schuster, Q Ge, T Brants, P Koehn, T Robinson*, [One billion word benchmark for measuring progress in statistical language modeling](http://arxiv.org/pdf/1312.3005) +8. 
*A Graves*, [Generating Sequences With Recurrent Neural Networks](http://arxiv.org/pdf/1308.0850v5.pdf) -- cgit v1.2.3 From dd653411b4670fe571eaee13b53a69b566cd04ec Mon Sep 17 00:00:00 2001 From: nicholas-leonard Date: Thu, 21 Jul 2016 12:33:20 -0400 Subject: results++ --- blog/_posts/2016-05-11-nce.md | 85 ++++++++++++++++++++++++++++------ blog/_posts/images/LSTM-NCE-curve.png | Bin 0 -> 38640 bytes 2 files changed, 72 insertions(+), 13 deletions(-) create mode 100644 blog/_posts/images/LSTM-NCE-curve.png diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md index f1ff0bd..310c63f 100644 --- a/blog/_posts/2016-05-11-nce.md +++ b/blog/_posts/2016-05-11-nce.md @@ -3,9 +3,7 @@ layout: post title: Noise Contrastive Estimation comments: True author: nicholas-leonard -excerpt: Noise contrastive estimation is used -to train a multi-GPU recurrent neural network language model -on the Google billion words dataset. +excerpt: Noise contrastive estimation is used to train a multi-GPU recurrent neural network language model on the Google billion words dataset. picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/output_52iFki.gif --- @@ -400,6 +398,7 @@ nn.Serial @ nn.Sequential { To use about one third less memory, you can set momentum of 0. + ### Evaluation script The evaluation script can be used to measure perplexity on the test set or sample independent sentences. @@ -577,23 +576,83 @@ Training runs at about 3800 words/second. ### Generating Sentences -Here are 8 sentences sampled independently from the 4-layer LSTM with a `temperature` or 0.7: +Here are some sentences sampled independently from the 4-layer LSTM with a `temperature` or 0.7: ```xml - The company said its net profit rose to $ 289 million , or 96 cents per share , in the three months ended on March 31 compared with $ 173 million , or $ 0.68 a share , a year ago . - But I 've been a bit disappointed with our performance , " said Wenger . - The first is an even bigger problem . - The next big thing for him is he will be able to tell the world he is thinking about his future . - The new rules have been added to the legislation so that they don 't have to be approved for public use . - The Pentagon 's top counter-terrorism official , who has been in charge of a new system of intelligence collection and inspection , wrote in an e-mail message that while the new system could be easily implemented , it remains an option . - " I was trying to get a glass of water . - Later he was driven to a nearby house where he was later found to be severely ill . + The first , for a lot of reasons , is the " Asian Glory " : an American military outpost in the middle of an Iranian desert . + But the first new stage of the project will be a new tunnel linking the new terminal with the new terminal at the airport . + The White House said Bush would also sign a memorandum of understanding with Iraq , which will allow the Americans to take part in the poll . + The folks who have campaigned for his nomination know that he is in a fight for survival . + The three survivors , including a woman whose name was withheld and not authorized to speak , were buried Saturday in a makeshift cemetery in the town and seven people were killed in the town of Eldoret , which lies around a dozen miles ( 40 kilometers ) southwest of Kathmandu . 
+ The art of the garden was created by pouring water over a small brick wall and revealing that an older , more polished design was leading to the creation of a new house in the district . + She added : " The club has not made any concession to the club 's fans and was not notified of the fact they had reached an agreement with the club . + The Times has learnt that the former officer who fired the fatal shots must have known about the fatal carnage . + Obama supporters say they 're worried about the impact of the healthcare and energy policies of Congress . + Not to mention the painful changes to the way that women are treated in the workplace . + The dollar stood at 14.38 yen ( ) and Swiss francs ( ) . + The current , the more intractable , the and the about a lot of priorities . + The job , which could possibly be completed in 2011 , needs to be approved in a new compact between the two companies . + " The most important thing for me is to get back to the top , " he said . + It was a one-year ban and the right to a penalty . + The government of president Michelle Bachelet has promised to maintain a " strong and systematic " military presence in key areas and to tackle any issue of violence , including kidnappings . + The six were scheduled to return to Washington on Wednesday . + " It 's a ... mistake , " he said . + The government 's offensive against the rebels and insurgents has been criticized by the United Nations and UN agencies . + " Our model is not much different from many of its competitors , " said Richard Bangs , CEO of the National Center for Science in the Public Interest in Chicago . + He is now a large part of a group of young people who are spending less time studying and work in the city . + He said he was confident that while he and his wife would have been comfortable working with him , he would be able to get them to do so . + The summer 's financial meltdown is the worst in decades . + It was a good night for Stuart Broad , who took the ball to Ravi Bopara at short leg to leave England on 88 for five at lunch . + And even for those who worked for them , almost everything was at risk . + The new strategy is all part of a stepped-up war against Taliban and al-Qaida militants in northwest Pakistan . + The governor 's office says the proposal is based on a vision of an outsider in the town who wants to preserve the state 's image . + " The fact that there is no evidence to support the claim made by the government is entirely convincing and that Dr Mohamed will have to be detained for a further two years , " he said . + The country 's tiny nuclear power plants were the first to use nuclear technology , and the first such reactors in the world . + " What is also important about this is that we can go back to the way we worked and work and fight , " he says . + And while he has been the star of " The Wire " and " The Office , " Mr. Murphy has been a careful , intelligent , engaging competitor for years . + On our return to the water , we found a large abandoned house . + The national average for a gallon of regular gas was $ 5.99 for the week ending Jan . + The vote was a rare early start for the contest , which was held after a partial recount in 26 percent of the vote . + The first one was a show of force by a few , but the second was an attempt to show that the country was serious about peace . + It was a little more than half an hour after the first reports of a shooting . 
+ The central bank is expected to cut interest rates further by purchasing more than $ 100 billion of commercial paper and Treasuries this week . + Easy , it 's said , to have a child with autism . + He said : " I am very disappointed with the outcome because the board has not committed itself . + " There is a great deal of tension between us , " said Mr C. + The odds that the Fed will keep its benchmark interest rate unchanged are at least half as much as they were at the end of 2008 . + For them , investors have come to see that : a ) the government will maintain a stake in banks and ( 2 ) the threat of financial regulation and supervision ; and ( 3 ) it will not be able to raise enough capital from the private sector to support the economy . + The court heard he had been drinking and drank alcohol at the time of the attack . + " The whole thing is quite a bit more intense . + This is a very important project and one that we are working closely with . + " We are confident that in this economy and in the current economy , we will continue to grow , " said John Lipsky , who chaired the IMF 's board of governors for several weeks . + The researchers said they found no differences among how men drank and whether they were obese . + Even though there are many brands that have low voice and no connection to the Internet , the iPhone is a great deal for consumers . + The £ 7m project is a new project for the city of Milton Keynes and aims to launch a new challenge for the British Government . + But he was not without sympathy for his father . ``` +The syntax seems quite reasonable, especially when comparing it to the previous results obtained from the [single-GPU 2x250 LSTM](#nce.eval). +However, in some cases, the semantics, i.e. the meaning of the words, is not so good. +For example, the sentence ` Easy , it 's said , to have a child with autism . ` would make more sense, to me at least, by replacing `Easy` with `Not easy`. +But then again, this sentence as nice semantics: ` The government of president Michelle Bachelet has promised to maintain a " strong and systematic " military presence in key areas and to tackle any issue of violence , including kidnappings . `. +[Michelle Bachelet](https://en.wikipedia.org/wiki/Michelle_Bachelet) was actually a president of Chile. +In her earlier life, she was also [kidnapped by military men](https://www.theguardian.com/world/2005/nov/22/chile.gender), so it kind of makes sense that she would be strong on the issue of kidnappings. +Here is an example of some weird semantics : ` Even though there are many brands that have low voice and no connection to the Internet , the iPhone is a great deal for consumers . ` +The first part about `load voice` doesn't mean anything to me. +And I fail to see how there being `many brands that have no connection to the Internet` relates to `the iPhone is a great deal for consumers`. +But of course, all these sentences are generated independently, so the LM needs to learn to generate a meaning on the fly. +This is hard as there is no context to the sentence. ### Learning Curves +The following figure outlines the learning curves for the above model. +The figure plots the NCE training and validation error for the model, which is the error output but the `NCEModule`. +Test set error isn't plotted as doing so for any epoch requires about 3 hours because test set inference uses `Linear` + `SoftMax` with `batchsize=1`. 
+![LSTM NCE Learning curves](images/LSTM-NCE-curve.png) + +As you can see, most of the learning is done in the first epochs. +Nevertheless, the training and validation error are consistently reduced training progresses. ## References @@ -605,4 +664,4 @@ Here are 8 sentences sampled independently from the 4-layer LSTM with a `tempera 5. *A Graves, A Mohamed, G Hinton*, [Speech Recognition with Deep Recurrent Neural Networks](http://arxiv.org/pdf/1303.5778.pdf) 6. *K Greff, RK Srivastava, J Koutník*, [LSTM: A Search Space Odyssey](http://arxiv.org/pdf/1503.04069) 7. *C Chelba, T Mikolov, M Schuster, Q Ge, T Brants, P Koehn, T Robinson*, [One billion word benchmark for measuring progress in statistical language modeling](http://arxiv.org/pdf/1312.3005) -8. *A Graves*, [Generating Sequences With Recurrent Neural Networks](http://arxiv.org/pdf/1308.0850v5.pdf) +8. *A Graves*, [Generating Sequences With Recurrent Neural Networks, table 1](http://arxiv.org/pdf/1308.0850v5.pdf) diff --git a/blog/_posts/images/LSTM-NCE-curve.png b/blog/_posts/images/LSTM-NCE-curve.png new file mode 100644 index 0000000..447cc56 Binary files /dev/null and b/blog/_posts/images/LSTM-NCE-curve.png differ -- cgit v1.2.3 From 44bba0b99649a02798671d02713450c0bc5fb8eb Mon Sep 17 00:00:00 2001 From: nicholas-leonard Date: Thu, 21 Jul 2016 16:19:47 -0400 Subject: final verision? --- blog/_posts/2016-05-11-nce.md | 329 +++++++++++++++++++++++++++---- blog/_posts/images/LSTM.png | Bin 0 -> 82623 bytes blog/_posts/images/small-vs-big-lstm.png | Bin 0 -> 38671 bytes 3 files changed, 293 insertions(+), 36 deletions(-) create mode 100644 blog/_posts/images/LSTM.png create mode 100644 blog/_posts/images/small-vs-big-lstm.png diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md index 310c63f..92a46dc 100644 --- a/blog/_posts/2016-05-11-nce.md +++ b/blog/_posts/2016-05-11-nce.md @@ -20,8 +20,8 @@ picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_po In this Torch blog post, we use noise contrastive estimation (NCE) [[2]](#nce.ref) to train a multi-GPU recurrent neural network language model (RNNLM) on the Google billion words (GBW) dataset [[7]](#nce.ref). -This blog post is the result of many months of work. -The enormity of the dataset caused us to contribute some novel open-source modules, criteria and even a multi-GPU tensor. +The work presented here is the result of many months of on-and-off work. +The enormity of the dataset caused us to contribute some novel open-source Torch modules, criteria and even a multi-GPU tensor. We also provide scripts so that you can train and evaluate your own language models. @@ -74,18 +74,52 @@ require less memory and have faster inference than their word-level counterparts Our task is to build a language model which maximizes the likelihood of the next word given the history of previous words in the sentence. -The following figure illustrates the a simple recurrent neural network (Simple RNN) language model: +The following figure illustrates the workings of a simple recurrent neural network (Simple RNN) language model: ![rnnlm](images/rnnlm.png) +The exact implementation is as follows: + +```lua +h[t] = σ(W[x->h]x[t] + W[h->h]h[t−1] + b[1->h]) (1) +y[t] = softmax(W[x->y]h[t] + b[1->y]) (2) +``` + For this particular example, the model should maximize "is" given "what", and then "the" given "is" and so on. 
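
To make equations 1 and 2 a bit more concrete, here is a minimal single time-step sketch written with plain torch tensor operations. The sizes and variable names below are invented for illustration only; the actual models in this post are built from the `rnn` package modules described next.

```lua
require 'torch'

-- illustrative sizes only (not the ones used later in this post)
local vocabsize, hiddensize = 10, 5

-- Simple RNN parameters (eq. 1 and 2)
local W_xh = torch.randn(vocabsize, hiddensize)  -- word embeddings, one row per word id
local W_hh = torch.randn(hiddensize, hiddensize) -- recurrent (hidden to hidden) weights
local b_h  = torch.zeros(hiddensize)
local W_hy = torch.randn(vocabsize, hiddensize)  -- hidden to output weights
local b_y  = torch.zeros(vocabsize)

-- one time-step: word id x[t] and previous hidden state h[t-1]
local function step(xt, hprev)
   -- eq. 1 : h[t] = sigmoid(W[x->h]x[t] + W[h->h]h[t-1] + b[1->h])
   local ht = torch.sigmoid(W_xh[xt] + W_hh * hprev + b_h)
   -- eq. 2 : y[t] = softmax(W[x->y]h[t] + b[1->y])
   local scores = W_hy * ht + b_y
   local yt = torch.exp(scores - scores:max())
   yt:div(yt:sum())
   return ht, yt
end

local h = torch.zeros(hiddensize)
local y
h, y = step(3, h) -- feed in word id 3
print(y:sum())    -- y sums to (approximately) 1: a distribution over the next word
```

In practice you would never unroll the recurrence by hand like this; the `rnn` package provides modules that handle the sequence dimension and backpropagation through time for you.
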
-The Simple RNN has an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the following target words.
-Simple RNNs are not the only kind of model that can be used model language.
+The Simple RNN has an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the likelihood of the remaining words in the sequence.
+Internally, the Simple RNN has parameters from input to hidden (word embeddings), hidden to hidden (recurrent connections) and hidden to output (output embeddings that feed into a softmax).
+The input to hidden parameters consist of a `LookupTable` that learns to represent each word as a vector.
+These vectors form an embedding space for words.
+The input `x[t]` to the `LookupTable` is a unique integer associated with the word `w[t]`.
+The embedding vector for that word is obtained by indexing the embedding space `W[x->h]`, which we represent by `W[x->h]x[t]`.
+The hidden to hidden parameters model the temporal dependencies of words by generating a hidden state `h[t]` given `h[t-1]` and `x[t]`.
+This is where the actual recurrence takes place, as `h[t]` is a function of `h[t-1]` (and of the word `x[t]`).
+The hidden to output layer does an affine transform (i.e. a `Linear` module: `W[x->y]h[t] + b[1->y]`) followed by a `softmax`.
+This estimates a probability distribution `y[t]` over the next word given the previous words, a history which is embodied by the hidden state `h[t]`.
+The criterion is to maximize the likelihood of the next word `w[t+1]` given the previous words:
+`P(w[t+1]|w[1],w[2],...,w[t])`.
+
+Simple RNNs are easy to build using the [rnn](https://github.com/Element-Research/rnn) package (see the [simple RNN example](https://github.com/Element-Research/rnn/blob/master/examples/simple-recurrence-network.lua)),
+but they are not the only kind of model that can be used to model language.
 There are also the more advanced Long Short Term Memory (LSTM) models [[3],[4],[5]](#nce.ref),
 which have special gated cells that facilitate the backpropagation of gradients through longer sequences.
-LSTMs can learn dependencies seperated between much longer time-steps.
-Like convolutions, these LSTM layers can also be stacked to form deeper models.
-In the Building
+
+![lstm](images/lstm.png)
+
+The exact implementation is as follows:
+
+```lua
+i[t] = σ(W[x->i]x[t] + W[h->i]h[t−1] + b[1->i])    (3)
+f[t] = σ(W[x->f]x[t] + W[h->f]h[t−1] + b[1->f])    (4)
+z[t] = tanh(W[x->c]x[t] + W[h->c]h[t−1] + b[1->c]) (5)
+c[t] = f[t]c[t−1] + i[t]z[t]                       (6)
+o[t] = σ(W[x->o]x[t] + W[h->o]h[t−1] + b[1->o])    (7)
+h[t] = o[t]tanh(c[t])                              (8)
+```
+
+The main advantage is that LSTMs can learn dependencies between words separated by many more time-steps.
+They are not as prone to the vanishing gradient problem, as the different gates can preserve the gradients during back-propagation.
+To create an LM, the word embeddings (`W[x->h]x[t]` in eq. 1) would be fed to the LSTM, and the resulting hidden state would be fed to eq. 2.
 
 ## Loading the Google billion words dataset
 
@@ -152,10 +186,10 @@ print(inputs)
 [torch.DoubleTensor of size 8x2]
 ```
 
-Each column is vector containing potentially multiple sequences, i.e. a multi-sequence.
-Independent sequences are seperated by zeros. We will see later how the
+Each column is a vector containing potentially multiple sequences, i.e. a multi-sequence.
+Independent sequences are separated by zeros.
In the next section, we will see how the [rnn](https://github.com/Element-Research/rnn) package can use these zero-masked time-steps to -efficiently forget its hidden state between independent sequences, at the granularity of columns. +efficiently forget its hidden state between independent sequences (at the granularity of columns). For now, notice how the original `sequences` are contained in the returned `inputs` and separated by zeros. The `targets` are similar to the `inputs`, but use masks of 1 to separate sequences (as `ClassNLLCriterion` will otherwise complain). @@ -250,6 +284,9 @@ Now that we feel confident in our dataset, lets look at the model. ## Building a multi-layer LSTM +In this section, we get down to the business of actually building our multi-layer LSTM. +We will introduce NCE once we get to the output layer, starting from the input layer. + The input layer of the the `lm` model is a lookup table : ```lua @@ -277,7 +314,7 @@ for i,hiddensize in ipairs(opt.hiddensize) do end ``` -The `SeqLSTM` implemention is very fast and it benchmarked by the [rnn-benchmarks](https://github.com/glample/rnn-benchmarks#lstm). +The `SeqLSTM` implemention is very fast and it benchmarked by the [rnn-benchmarks](https://github.com/glample/rnn-benchmarks#lstm) repository. Next we split the output of the SeqLSTM (which is a `seqlen x batchsize x outputsize` Tensor) into a table containing a `batchsize x outputsize` tensor for each time-step: @@ -285,7 +322,7 @@ each time-step: lm:add(nn.SplitTable(1)) ``` -### Output layer bottleneck +### The problem: bottleneck at the output layer With its small vocabulary of 10000 words, the Penn Tree Bank dataset is relatively easy to use to build word-level language models. The output layer is still computationally tractable for both training and inference, especially for GPUs. @@ -298,16 +335,16 @@ outputlayer = nn.Sequential() ``` However, when training with large vocabularies, like the 793471 words that makes up the GBW dataset , -the output layer quickly becomes a bottle neck. -If you are training your model with a `batchsize = 32` (number of sequences per batch) and a `seqlen = 100` +the output layer quickly becomes a bottleneck. +For example, if you are training your model with a `batchsize = 32` (number of sequences per batch) and a `seqlen = 100` (size of sequence to backpropagate through time), the output of that layer will have shape `seqlen x batchsize x vocabsize`, or `32 x 100 x 793471`. -For a `FloatTensor` or `CudaTensor`, that single tensor will take up 10.156GB of memory. +For a `FloatTensor` or `CudaTensor`, that single tensor will take up 10.156GB of memory! The number can be double for gradients, and doubled again as both Linear and SoftMax store a copy for the output. If somehow you can find a way to put >40GB on a GPU (or distribute it over many), you then run in the problem of forward/backward propagating through that `outputlayer` in a reasonable time-frame. -### Noise contrastive estimation +### The solution: noise contrastive estimation The output layer of the LM uses Noise Contrastive Estimation (NCE) to speed up training and reduce memory consumption: @@ -333,11 +370,39 @@ nn.Sequential():add(nn.Linear(inputsize, #trainset.ivocab)):add(nn.LogSoftMax()) Along with the [NCECriterion](https://github.com/Element-Research/dpnn#nn.NCECriterion), the `NCEModule` implements the algorithm is described in [[1]](#nce.ref). -I won't go into the details of the algorithm as it involves a lot of math. 
+I won't go into the details of the algorithm as it involves a lot of math which is more appropriately detailed in the reference papers. The way it works is that for each target word (the likelihood of which we want to maximize), `k` words are sampled from a noise distribution, which is typically the unigram distribution. -The `unigram` above is a tensor of size 793470 where each element is the frequency of the commensurate word in the corpus. +Remember that a softmax is basically: + +```lua + exp(x[i]) +y[i] = --------------------------------- (9) + exp(x[1])+exp(x[2])+...+exp(x[n]) +``` + +where `x[i]` is the `i`-th output of the output `Linear` layer. +The above denominator is the cause of the bottleneck as the `Linear` needs to be computed for each output `x[i]`. +For a `n=797470` vocabulary, this is prohibitively expensive. +NCE goes around this problem by replacing the denominator of eq. 9 with a constant `Z` during training: + +```lua + exp(x[i]) +y[i] = ------------ (10) + Z +``` + +Now this is not what actually happens during training as back-propagating through the above will not produce gradients +for the `x[j]` where `j~=i` (`j` not equal `i`). +Notice that backpropagating through eq. 9 will produce gradients for all outputs `x` of the `Linear` (i.e. for all `i`). +Another problem with eq. 10 is that nothing is pushing `exp(x[1])+exp(x[2])+...+exp(x[n])` to approximate `Z`. +What NCE does is formulate the problem such that `k` noise samples can be included in the equation to +both make sure that some (at most `k`) negative samples (i.e. `x[j]` where `j`) get gradients and that the denominator of eq. 9 approximates the denominator of eq. 10. +The `k` noise samples are sampled from a noise distribution, i.e. the unigram distribution. +The output layer `Linear` need only be computed for the target and noise-sampled words, which is where the efficiency is gained. + +The `unigram` variable above is a tensor of size 793470 where each element is the frequency of the commensurate word in the corpus. Sampling from such a large distribution using something like [torch.multinomial](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.multinomial) can become a bottleneck during training. So we implemented a more efficient version in [torch.AliasMultinomial](https://github.com/nicholas-leonard/torchx/blob/master/AliasMultinomial.lua). @@ -348,22 +413,25 @@ For the Softmax, which NCE tries to approximate, the `Z` is the sum over the `ex For NCE, the `Z` is typically fixed to `Z=1`. Our initial experiments found that setting `Z` to `Z=N*mean(exp(x[i]))` (where `N` is the number of words and the `mean` is approximated over a small batch of word samples `i`) -gave much better results. +gave much better results, but this is because we weren't appropriately initializing the output layer parameters. -One notable aspect of NCE papers (there are many) is that they often forget to mention the importance of parameter initialization. +One notable aspect of NCE papers (there are many) is that they often forget to mention the importance of this parameter initialization. Setting `Z=1` is only really possible if the `NCEModule.bias` is initialized to `bias[i] = -log(N)`. This is what the authors of [[2]](#nce.ref) use, although it isn't mentioned in the paper (I contacted one of the authors to find out). 
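
To make the role of `Z`, the bias initialization and the `k` noise samples a little more concrete, here is a rough, self-contained sketch of the quantities NCE works with, using plain torch calls. Everything below (sizes, names, the toy unigram tensor) is invented for illustration; the actual sampling, batching and gradient computations are handled by the `NCEModule` and `NCECriterion`.

```lua
require 'torch'

-- illustrative sizes: vocabulary size, hidden size, number of noise samples
local N, hiddensize, k = 1000, 8, 25

local W = torch.randn(N, hiddensize):mul(0.01) -- output weights (one row per word)
local b = torch.Tensor(N):fill(-math.log(N))   -- bias initialized to -log(N), so that Z=1 is sensible

local unigram = torch.rand(N)                  -- stand-in for the real word-frequency tensor
local Pn = unigram / unigram:sum()             -- noise distribution (unigram probabilities)

-- eq. 10 with Z=1 : unnormalized likelihood of word w given hidden state h
local function score(w, h)
   return math.exp(W[w]:dot(h) + b[w])
end

-- NCE posterior that w was drawn from the data rather than from the noise distribution
local function prob_data(w, h)
   local s = score(w, h)
   return s / (s + k * Pn[w])
end

local h = torch.randn(hiddensize)                  -- hidden state for one time-step
local target = 42                                  -- next word whose likelihood we maximize
local noise = torch.multinomial(unigram, k, true)  -- k words sampled from the noise distribution

-- only k+1 rows of W are touched, instead of all N rows as in a full SoftMax
print(prob_data(target, h))
for i = 1, k do
   print(prob_data(noise[i], h))
end
```

Training pushes the posterior towards 1 for target words and towards 0 for the sampled noise words, which is what nudges the model's unnormalized scores to behave like properly normalized probabilities without ever computing the full denominator.
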
Sampling `k` noise samples per time-step and per batch-row means that the `NCEModule` needs to internally use something like [torch.baddbmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.baddbmm) to compute the `output`. Reference [[2]](#nce.ref) implement a faster version where the noise samples are drawn once and used for the entire batch (but still once for each time-step). -This make the code a bit faster as the more efficient [torch.addmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.addmm) can be used. +This makes the code a bit faster as the more efficient [torch.addmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.addmm) can be used instead of `torch.baddbmm`. This faster NCE version described in [[2]](#nce.ref) is the default implementation of the `NCEModule`. Sampling per batch-row can be turned on with `NCEModule.rownoise=true`. ## Training and evaluation scripts -The experiments presented here use three scripts: two for training and one for evaluation. +The experiments presented here use three scripts: two for training (you only need to use one) and one for evaluation. +The training scripts only differ in the amount of GPUs to use. +Both train a language model on the training set and do early-stopping on the validation set. +The evaluation script is used to measure the perplexity of a trained model on the test set, or to generate sentences. ### Single-GPU training script @@ -492,7 +560,7 @@ method to distribute the `weight` on `gradWeight` on different devices. This is accomplished by swaping the weight tensors for our own : [torch.MultiCudaTensor](https://github.com/nicholas-leonard/torchx/blob/master/MultiCudaTensor.lua). Lua has no severe type-checking system, so you can fake a tensor by creating a `torch.class` table with the same methods. To save time, the current version of `MultiCudaTensor` only supports the operations required by the NCEModule. -The advantage of this approach is that it requires minimal changes to the NCEModule and maintains backward compatiblity without requiring redundant code or excessive refactoring. +The advantage of this approach is that it requires minimal changes to the `NCEModule` and maintains backward compatiblity without requiring redundant code or excessive refactoring. ```lua -- output layer @@ -569,11 +637,29 @@ On the 4-layer LSTM with 2048 hidden units, [[1]](#nce.ref) obtain 43.2 perplexi After early-stopping on a sub-set of the validation set (at 100 epochs of training where 1 epoch is 128 sequences x 400k words/sequence), our model was able to reach *40.61* perplexity. This model was run on 4x12GB NVIDIA Titan X GPUs. -Training requires approximately 40GB of memory, distributed across the 4 GPU devices. +Training requires approximately 40GB of memory distributed across the 4 GPU devices, and 2-3 weeks of training. As in the original paper, we do not make use of momentum as it provides little benefit and requires 1/2 more memory. Training runs at about 3800 words/second. +### Learning Curves + +The following figure outlines the learning curves for the above 4x2048 LSTM model. +The figure plots the NCE training and validation error for the model, which is the error output but the `NCEModule`. +Test set error isn't plotted as doing so for any epoch requires about 3 hours because test set inference uses `Linear` + `SoftMax` with `batchsize=1`. + +![LSTM NCE Learning curves](images/LSTM-NCE-curve.png) + +As you can see, most of the learning is done in the first epochs. 
+Nevertheless, the training and validation error are consistently reduced training progresses. + +The following figure compares the valiation learning curves (again, NCE error) for a small 2x250 LSTM (no dropout) and big 4x2048 LSTM (with dropout). + +![Small vs Big LSTM](images/small-vs-big-lstm.png) + +What I find impressive about this figure is how quickly the higher-capacity model bests the lower-capacity model. +This clearly demonstrates the importance of capacity when optimizing large-scale language models. + ### Generating Sentences Here are some sentences sampled independently from the 4-layer LSTM with a `temperature` or 0.7: @@ -633,26 +719,197 @@ Here are some sentences sampled independently from the 4-layer LSTM with a `temp The syntax seems quite reasonable, especially when comparing it to the previous results obtained from the [single-GPU 2x250 LSTM](#nce.eval). However, in some cases, the semantics, i.e. the meaning of the words, is not so good. -For example, the sentence ` Easy , it 's said , to have a child with autism . ` would make more sense, to me at least, by replacing `Easy` with `Not easy`. -But then again, this sentence as nice semantics: ` The government of president Michelle Bachelet has promised to maintain a " strong and systematic " military presence in key areas and to tackle any issue of violence , including kidnappings . `. +For example, the sentence +```xml + Easy , it 's said , to have a child with autism . +``` +would make more sense, to me at least, by replacing `Easy` with `Not easy`. + +On the other hand, sentences like this one demonstrate good semantics: + +```xml + The government of president Michelle Bachelet has promised to maintain a " strong and systematic " military presence in key areas and to tackle any issue of violence , including kidnappings . `. +``` + [Michelle Bachelet](https://en.wikipedia.org/wiki/Michelle_Bachelet) was actually a president of Chile. In her earlier life, she was also [kidnapped by military men](https://www.theguardian.com/world/2005/nov/22/chile.gender), so it kind of makes sense that she would be strong on the issue of kidnappings. -Here is an example of some weird semantics : ` Even though there are many brands that have low voice and no connection to the Internet , the iPhone is a great deal for consumers . ` + +Here is an example of some weird semantics : + +```xml + Even though there are many brands that have low voice and no connection to the Internet , the iPhone is a great deal for consumers . +``` + The first part about `load voice` doesn't mean anything to me. And I fail to see how there being `many brands that have no connection to the Internet` relates to `the iPhone is a great deal for consumers`. But of course, all these sentences are generated independently, so the LM needs to learn to generate a meaning on the fly. -This is hard as there is no context to the sentence. +This is hard as there is no context to the sentence being generated. -### Learning Curves +In any case, I am quite happy with the results as they are definitely some of the most natural-looking synthetic sentences I have seen so far. -The following figure outlines the learning curves for the above model. -The figure plots the NCE training and validation error for the model, which is the error output but the `NCEModule`. -Test set error isn't plotted as doing so for any epoch requires about 3 hours because test set inference uses `Linear` + `SoftMax` with `batchsize=1`. 
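
As an aside, the `temperature` used for these samples is easy to picture: before each word is drawn, the model's output distribution is sharpened (temperature below 1) or flattened (above 1). Here is a minimal sketch of what that sampling step might look like; the names and the fake distribution are invented for illustration.

```lua
require 'torch'

-- sample the next word id from log-probabilities, sharpened by a temperature;
-- in the real script, logprobs would come from the model's output layer
local function sampleword(logprobs, temperature)
   local scaled = logprobs / temperature       -- temperature < 1 sharpens the distribution
   local probs = torch.exp(scaled - scaled:max())
   probs:div(probs:sum())
   return torch.multinomial(probs, 1)[1]
end

-- usage with a fake distribution over a 10-word vocabulary
local logprobs = torch.log(torch.rand(10))
print(sampleword(logprobs, 0.7))
```
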
+## Future Work -![LSTM NCE Learning curves](images/LSTM-NCE-curve.png) +I am currently working on a language modeling dataset based on one month of [reddit.com](https://www.reddit.com/) data. +Each sequence is basically a reddit submission consisting of a title, selftext, url, score, author and a thread of comments. +These sequences are much longer (average of 205 tokens) than the sentences that make up the GBW dataset (average of 26 tokens). +Training is still underway, but to pique your interest, this is an example of generated data (indentation and line breaks added for clarity): -As you can see, most of the learning is done in the first epochs. -Nevertheless, the training and validation error are consistently reduced training progresses. +```xml + + http://www.reddit.com/u/[deleted] + 0 + + [ WP ] You take a picture of a big bang . + You discover an alien that lives in the center of the planet in an unknown way . + You can say " what the fuck is that ? " + + + + 2 + http://www.reddit.com/u/Nev2k + + I have a question . + When i was younger , my parents had a house that had a living room in it . + One that was only a small portion of an entire level . + This was a month before i got my money . + If i was living in a house with a " legacy " i would make some mistakes . + When i was a child , i did n't know how to do shit about the house . + My parents got me into my own house and i never found a place to live . + So i decide to go to college . + I was so freaked out , i didnt have the drive to see them . + I never had a job , i was n't going anywhere . + I was so happy . + I knew i was going to be there . + I gave myself a job and my parents came . + That 's when i realized that i was in the wrong . + So i started to go . + I couldnt decide how long i wanted to live in this country . + I was so excited about the future . + I had a job . + I saved my money . + I did n't have a job . + I went to a highschool in a small town . + I had a job . + A job . + I did n't know what to do . + I was terrified of losing my job . + So i borrowed my $ 1000 in an hour . + I could n't afford to pay my rent . + I was so low on money . + I had my parents and i got into a free college . + I got in touch with my parents . + All of my friends were dead . + I was still with my family for a week . + I became a good parent . + I was a good choice . + When i got on my HSS i was going to go to my parents ' house . + I started to judge my parents . + I had a minor problem . + My parents . + I was so fucking bad . + My sister had a voice that was very loud . + I 'm sure my cousins were in a place where i could just hear my voice . + I felt like i was supposed to be angry . + I was so angry . + To cope with this . + My dad and i were both on break and i felt so alone . + I got unconscious and my mum left . + When I got to college , i was back in school . + I was a good kid . + I was happy . + And I told myself I was ready . + I told my parents . + They always talked about how they were going to be a good mom , and that I was going to be ready for that . + They always wanted to help me . + I did n't know what to do . + I had to . + I tried to go back to my dad , because I knew a lot about my mom . + I loved her . + I cared about her . + We cared for our family . + The time together was my only relationship . + I loved my heart . + And I hated my mother . + I chose it . + I cried . I cried . I cried . I cried . I cried . I cried . I cried . + The tears were gone . + I cried . I cried . I cried . I cried . I cried . 
I cried . I cried . I cried . I cried . I cried . + I do n't know how to do it . + I do n't know how to deal with it . + I ca n't feel my emotions . + I ca n't get out of bed . + I ca n't sleep . + I ca n't tell my friends . + I just need to leave . + I want to leave . + I hate myself . + I hate feeling like I 'm being selfish . + I feel like I 'm not good enough anymore . + I need to find a new job . + I hate that I have to get my shit together . + I love my job . + I 'm having a hard time . + Why do I need to get a job ? + I have no job . + I have n't been feeling good lately . + I feel like I 'm going to be so much worse in the long run . + I feel so alone . + I ca n't believe I 'm so sad about going through my entire life . + + http://www.reddit.com/u/Scarbarella + + + http://www.reddit.com/r/offmychest + + I do n't know what to do anymore . + I feel like I 'm going to die and I 'm going to be sick because I have no more friends . + I do n't know what to do about my depression and I do n't know where to go from here . + I do n't know how I do because I know I 'm scared of being alone . + Any advice would be appreciated . + Love . + + +``` + +This particular sample is a little depressing, but incredibly human, which is one of the reasons I am so interested in reddit for language modeling. +But that might just be the nature of the `offmychest` subreddit. + +A less depressing sample is the following, which concerns the Destiny game. + +```xml + + http://www.reddit.com/r/DestinyTheGame + + Does anyone have a link to the Destiny Grimoire that I can use to get my Xbox 360 to play ? + + + + http://www.reddit.com/u/CursedSun + + I 'd love to have a weekly reset . + + 1 + + + 0 + + I have a few friends who are willing to help me out . + If I get to the point where I 'm not going to have to go through all the weekly raids , I 'll have to " complete " the raid . + I 'm doing the Weekly strike and then doing the Weekly ( and hopefully also the Weekly ) on Monday . + I 'm not planning to get the chest , but I am getting my first exotic that I just got done from my first Crota raid . + I 'm not sure how well it would work for the Nightfall and Weekly , but I do n't want to loose my progress . + I 'd love to get some other people to help me , and I 'm open to all suggestions . + I have a lot of experience with this stuff , so I figured it 's a good idea to know if I 'm getting the right answer . + I 'm truly sorry for the inconvenience . + + + +``` + +The particular model (a 4x1500 LSTM with dropout) only backpropagates through 50 time-steps. +What would like to see is for the comments to actually answer the question posed by the title and selftext. +This is a very difficult semantic problem which I hope the Reddit dataset will help solve. +More to follow in my next Troch blog post. 
## References diff --git a/blog/_posts/images/LSTM.png b/blog/_posts/images/LSTM.png new file mode 100644 index 0000000..80c6067 Binary files /dev/null and b/blog/_posts/images/LSTM.png differ diff --git a/blog/_posts/images/small-vs-big-lstm.png b/blog/_posts/images/small-vs-big-lstm.png new file mode 100644 index 0000000..b580afd Binary files /dev/null and b/blog/_posts/images/small-vs-big-lstm.png differ -- cgit v1.2.3 From 1ed35ad412b8cbaf55c460e27179c7440a97cfe6 Mon Sep 17 00:00:00 2001 From: nicholas-leonard Date: Thu, 21 Jul 2016 16:21:16 -0400 Subject: fix LSTM.png --- blog/_posts/2016-05-11-nce.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md index 92a46dc..97c0e9a 100644 --- a/blog/_posts/2016-05-11-nce.md +++ b/blog/_posts/2016-05-11-nce.md @@ -104,7 +104,7 @@ but they are not the only kind of model that can be used model language. There are also the more advanced Long Short Term Memory (LSTM) models [[3],[4],[5]](#nce.ref), which have special gated cells that facilitate the backpropagation of gradients through longer sequences. -![lstm](images/lstm.png) +![lstm](images/LSTM.png) The exact implementation is as follows: -- cgit v1.2.3 From d68e87389470e2c3e1a739169e95d9e62544d617 Mon Sep 17 00:00:00 2001 From: nicholas-leonard Date: Fri, 22 Jul 2016 11:22:08 -0400 Subject: fixed title --- blog/_posts/2016-05-11-nce.md | 59 ++++++++++++++++++++++++++----------------- 1 file changed, 36 insertions(+), 23 deletions(-) diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md index 97c0e9a..8492d70 100644 --- a/blog/_posts/2016-05-11-nce.md +++ b/blog/_posts/2016-05-11-nce.md @@ -1,13 +1,13 @@ --- layout: post -title: Noise Contrastive Estimation +title: Language modeling a billion words comments: True author: nicholas-leonard excerpt: Noise contrastive estimation is used to train a multi-GPU recurrent neural network language model on the Google billion words dataset. picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/output_52iFki.gif --- - + * [Word versus character level language models](#nce.char) * [Recurrent neural network language models](#nce.rnnlm) @@ -16,6 +16,7 @@ picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_po * [Training and evaluation scripts](#nce.script) * [Results](#nce.result) * [References](#nce.ref) + * [Future Word](#nce.future) In this Torch blog post, we use noise contrastive estimation (NCE) [[2]](#nce.ref) to train a multi-GPU recurrent neural network language model (RNNLM) @@ -63,11 +64,14 @@ After tokenization, the word-level model might view this sequence as containing On the other hand, the char-level will view this sequence as containing 102 tokens. This longer sequence makes the task of the char-level model harder than the word-level model, as it must take into account dependencies between more tokens over more time-steps.[[8]](#nce.ref) +Another issue with character language models is that they need to learn spelling in +addition to syntax, semantics, etc. -The main advantage of char-level over word-level language models is that they +The main advantage of character over word level language models is that they have a really small vocabulary. For example, the GBW dataset will contain approximately 800 characters compared to 800,000 words (after pruning low-frequency tokens). 
In practice this means that char-level models will require less memory and have faster inference than their word-level counterparts. +Another advantage is that they do not require tokenization as a preprocessing step. ## Recurrent neural network language models @@ -642,7 +646,7 @@ As in the original paper, we do not make use of momentum as it provides little b Training runs at about 3800 words/second. -### Learning Curves +### Learning curves The following figure outlines the learning curves for the above 4x2048 LSTM model. The figure plots the NCE training and validation error for the model, which is the error output but the `NCEModule`. @@ -660,7 +664,7 @@ The following figure compares the valiation learning curves (again, NCE error) f What I find impressive about this figure is how quickly the higher-capacity model bests the lower-capacity model. This clearly demonstrates the importance of capacity when optimizing large-scale language models. -### Generating Sentences +### Generating sentences Here are some sentences sampled independently from the 4-layer LSTM with a `temperature` or 0.7: @@ -747,10 +751,11 @@ This is hard as there is no context to the sentence being generated. In any case, I am quite happy with the results as they are definitely some of the most natural-looking synthetic sentences I have seen so far. -## Future Work + +## Future work I am currently working on a language modeling dataset based on one month of [reddit.com](https://www.reddit.com/) data. -Each sequence is basically a reddit submission consisting of a title, selftext, url, score, author and a thread of comments. +Each sequence is basically a reddit submission consisting of a `TITLE`, `SELFTEXT` (or `URL`), `SCORE`, `AUTHOR` and a thread of `COMMENTS`. These sequences are much longer (average of 205 tokens) than the sentences that make up the GBW dataset (average of 26 tokens). Training is still underway, but to pique your interest, this is an example of generated data (indentation and line breaks added for clarity): @@ -871,10 +876,12 @@ Training is still underway, but to pique your interest, this is an example of ge ``` -This particular sample is a little depressing, but incredibly human, which is one of the reasons I am so interested in reddit for language modeling. -But that might just be the nature of the `offmychest` subreddit. +This particular sample is a little depressing, but that might just be the nature of the `offmychest` subreddit... +Conditioned on the opening `` token, this generated sequence is incredibly human. +Reading through the comment, I feel like I am reading a story written by an actual person. +The ability to similuate human creativity is one of the reasons I am so interested in using reddit data for language modeling. -A less depressing sample is the following, which concerns the Destiny game. +A less depressing sample is the following, which concerns the [Destiny](https://en.wikipedia.org/wiki/Destiny_(video_game)) video game: ```xml @@ -891,25 +898,31 @@ A less depressing sample is the following, which concerns the Destiny game. 1 - 0 - - I have a few friends who are willing to help me out . - If I get to the point where I 'm not going to have to go through all the weekly raids , I 'll have to " complete " the raid . - I 'm doing the Weekly strike and then doing the Weekly ( and hopefully also the Weekly ) on Monday . - I 'm not planning to get the chest , but I am getting my first exotic that I just got done from my first Crota raid . 
- I 'm not sure how well it would work for the Nightfall and Weekly , but I do n't want to loose my progress . - I 'd love to get some other people to help me , and I 'm open to all suggestions . - I have a lot of experience with this stuff , so I figured it 's a good idea to know if I 'm getting the right answer . - I 'm truly sorry for the inconvenience . - + 0 + + I have a few friends who are willing to help me out . + If I get to the point where I 'm not going to have to go through all the weekly raids , I 'll have to " complete " the raid . + I 'm doing the Weekly strike and then doing the Weekly ( and hopefully also the Weekly ) on Monday . + I 'm not planning to get the chest , but I am getting my first exotic that I just got done from my first Crota raid . + I 'm not sure how well it would work for the Nightfall and Weekly , but I do n't want to loose my progress . + I 'd love to get some other people to help me , and I 'm open to all suggestions . + I have a lot of experience with this stuff , so I figured it 's a good idea to know if I 'm getting the right answer . + I 'm truly sorry for the inconvenience . + ``` +For those not familiar with this game, terms like +[Grimoire](http://destiny.wikia.com/wiki/Grimoire), [weekly reset](https://www.vg247.com/tag/destiny-weekly-reset/), +[raids](http://destiny.wikia.com/wiki/Raid), [Nightfall stike](http://destiny.wikia.com/wiki/Weekly_Nightfall_Strike), +[exotics](http://destiny.wikia.com/wiki/Exotic) and [Crota raid](http://destiny.wikia.com/wiki/Crota%27s_End) +may seem odd. But these are all part of the game vocabulary. + The particular model (a 4x1500 LSTM with dropout) only backpropagates through 50 time-steps. -What would like to see is for the comments to actually answer the question posed by the title and selftext. +What I would like to see is for the `COMMENTS` to actually answer the question posed by the `TITLE` and `SELFTEXT`. This is a very difficult semantic problem which I hope the Reddit dataset will help solve. -More to follow in my next Troch blog post. +More to follow in my next Torch blog post. ## References -- cgit v1.2.3 From d20bca37c2e99dc7fc4f8e906324483ebc050c8b Mon Sep 17 00:00:00 2001 From: nicholas-leonard Date: Fri, 22 Jul 2016 15:13:27 -0400 Subject: added figures --- blog/_posts/2016-05-11-nce.md | 73 ++++++++++++++++++++++++++++----------- blog/_posts/images/LM-Linear.png | Bin 0 -> 10672 bytes blog/_posts/images/LM-NCE.png | Bin 0 -> 5671 bytes blog/_posts/images/LM-params.png | Bin 0 -> 8196 bytes 4 files changed, 53 insertions(+), 20 deletions(-) create mode 100644 blog/_posts/images/LM-Linear.png create mode 100644 blog/_posts/images/LM-NCE.png create mode 100644 blog/_posts/images/LM-params.png diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md index 8492d70..12c94fc 100644 --- a/blog/_posts/2016-05-11-nce.md +++ b/blog/_posts/2016-05-11-nce.md @@ -9,7 +9,7 @@ picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_po - * [Word versus character level language models](#nce.char) + * [Word versus character language models](#nce.char) * [Recurrent neural network language models](#nce.rnnlm) * [Loading the Google billion words dataset](#nce.gbw) * [Building a multi-layer LSTM](#nce.lstm) @@ -26,7 +26,7 @@ The enormity of the dataset caused us to contribute some novel open-source Torch We also provide scripts so that you can train and evaluate your own language models. 
-## Word versus character level language models +## Word versus character language models In recent months you may have noticed increased interest in generative character-level RNNLMs like [char-rnn](https://github.com/karpathy/char-rnn) @@ -62,15 +62,16 @@ Progress isn't made by early risers. It's made by lazy men trying to find easier After tokenization, the word-level model might view this sequence as containing 22 tokens. On the other hand, the char-level will view this sequence as containing 102 tokens. -This longer sequence makes the task of the char-level model harder than the word-level model, as it -must take into account dependencies between more tokens over more time-steps.[[8]](#nce.ref) +This longer sequence makes the task of the character model harder than the word model, as it +must take into account dependencies between more tokens over more time-steps. Another issue with character language models is that they need to learn spelling in addition to syntax, semantics, etc. +In any case, word language models will typically have lower error than character models.[[8]](#nce.ref) -The main advantage of character over word level language models is that they +The main advantage of character over word language models is that they have a really small vocabulary. For example, the GBW dataset will contain approximately 800 characters -compared to 800,000 words (after pruning low-frequency tokens). In practice this means that char-level models will -require less memory and have faster inference than their word-level counterparts. +compared to 800,000 words (after pruning low-frequency tokens). In practice this means that character models will +require less memory and have faster inference than their word counterparts. Another advantage is that they do not require tokenization as a preprocessing step. @@ -125,6 +126,22 @@ The main advantage is that LSTMs can learn dependencies between words seperated It isn't as prone to the problems of vanishing gradients as the different gates can preserve the gradients during back-propagation. To create a LM, the word embeddings (`W[x->h]x[t]` in eq.1) would be fed to the LSTM and the resulting hidden state would be fed to eq. 2. +The error of language model is traditionally measured using perplexity. +Perplexity is a measure of how surprised the model is to see a sequence of text. +If you feed it in a sequence of words, and for each successive word the model is able to +predict with high likelihood what word comes next, it will have low perplexity. +If the next word in the sequence `s` of length `T` is indexed by `s[t]` and the model-inferred likelihood is `y[t]` such that +the likelihood of that word is `y[t][s[t]]`, then the perplexity of that sequence of words is: + +``` + log(y[1][s[1]) + log(y[2][s[2]) + ... + log(y[T][s[T]) +PPL(s,y) = exp( -------------------------------------------------------- ) + -T +``` + +The lower the perplexity, the better. + + ## Loading the Google billion words dataset @@ -303,7 +320,7 @@ lm:add(lookup) -- input is seqlen x batchsize A sub-class of `LookupTable`, we use the [LookupTableMaskZero](https://github.com/Element-Research/rnn#rnn.LookupTableMaskZero) to learn word embeddings. The main difference is that it supports zero-indexes, which are forwarded as zero-tensors. 
-Then we have the actual multi-layer LSTM implementation, which uses the fast [SeqLSTM](https://github.com/Element-Research/rnn#rnn.SeqLSTM) module: +Then we have the actual multi-layer LSTM implementation, which uses the [SeqLSTM](https://github.com/Element-Research/rnn#rnn.SeqLSTM) module: ```lua local inputsize = opt.inputsize @@ -318,7 +335,7 @@ for i,hiddensize in ipairs(opt.hiddensize) do end ``` -The `SeqLSTM` implemention is very fast and it benchmarked by the [rnn-benchmarks](https://github.com/glample/rnn-benchmarks#lstm) repository. +As demonstrated in the [rnn-benchmarks](https://github.com/glample/rnn-benchmarks#lstm) repository, the `SeqLSTM` implemention is very fast. Next we split the output of the SeqLSTM (which is a `seqlen x batchsize x outputsize` Tensor) into a table containing a `batchsize x outputsize` tensor for each time-step: @@ -340,17 +357,23 @@ outputlayer = nn.Sequential() However, when training with large vocabularies, like the 793471 words that makes up the GBW dataset , the output layer quickly becomes a bottleneck. -For example, if you are training your model with a `batchsize = 32` (number of sequences per batch) and a `seqlen = 100` +For example, if you are training your model with a `batchsize = 128` (number of sequences per batch) and a `seqlen = 50` (size of sequence to backpropagate through time), -the output of that layer will have shape `seqlen x batchsize x vocabsize`, or `32 x 100 x 793471`. -For a `FloatTensor` or `CudaTensor`, that single tensor will take up 10.156GB of memory! -The number can be double for gradients, and doubled again as both Linear and SoftMax store a copy for the output. -If somehow you can find a way to put >40GB on a GPU (or distribute it over many), you then run in the problem of +the output of that layer will have shape `seqlen x batchsize x vocabsize`, or `128 x 50 x 793471`. +For a `FloatTensor` or `CudaTensor`, that single tensor will take up 20GB of memory! +The number can be double for `gradInput` (i.e. gradients with respect to input), +and double again as both `Linear` and `SoftMax` store a copy for the `output`. + +![Scale of output layer buffers with Linear](images/LM-Linear.png) + +Excluding parameters and their gradients, the above figure outlines the approximate memory consumption of a 4-layer LSTM with 2048 units with a `seqlen=50`. +Even if somehow you can find a way to put 80GB on a GPU (or distribute it over many), you still run into the problem of forward/backward propagating through that `outputlayer` in a reasonable time-frame. + ### The solution: noise contrastive estimation -The output layer of the LM uses Noise Contrastive Estimation (NCE) to speed up training and reduce memory consumption: +The output layer of the LM uses NCE to speed up training and reduce memory consumption: ```lua local unigram = trainset.wordfreq:float() @@ -372,6 +395,11 @@ The [NCEModule](https://github.com/Element-Research/dpnn#nn.NCEModule) is a more nn.Sequential():add(nn.Linear(inputsize, #trainset.ivocab)):add(nn.LogSoftMax()) ``` +For evaluating perplexity, the model still implements `Linear` + `SoftMax`. +NCE is useful for reducing the memory consumption during training (compare to the figure above): + +![Scale of output layer buffers with NCE](images/LM-NCE.png) + Along with the [NCECriterion](https://github.com/Element-Research/dpnn#nn.NCECriterion), the `NCEModule` implements the algorithm is described in [[1]](#nce.ref). 
I won't go into the details of the algorithm as it involves a lot of math which is more appropriately detailed in the reference papers. @@ -509,10 +537,15 @@ The `--temperature` flag can be reduced to make the sampling more deterministic. As can be observed in the previous section, training a 2-layer LSTM with only 250 hidden units will not yield the best generated samples. The model needs much more capacity than what can fit on a 12GB GPU. -The [multigpu-nce-rnnlm.lua](https://github.com/Element-Research/rnn/blob/master/examples/multigpu-nce-rnnlm.lua) script can be used -to train a model on four GPUs. +For parameters and their gradients, a 4x2048 LSTM model requires the following: + +![LM parameter memory consumption](images/LM-params.png) + +This doesn't include all the intermediate buffers required for the different modules (outlined in [NCE section](#nce.nce)). +The solution was of course to distribution the model over more GPUs. +The [multigpu-nce-rnnlm.lua](https://github.com/Element-Research/rnn/blob/master/examples/multigpu-nce-rnnlm.lua) script is thus provided to train a language model on four GPUs. -It uses the [GPU](https://github.com/torch/nn/blob/master/doc/simple.md#nn.GPU) to decorate modules such that +It uses the [GPU](https://github.com/torch/nn/blob/master/doc/simple.md#nn.GPU) (which we contributed it to the [nn](https://github.com/torch/nn)) to decorate modules such that all their operations and memory are hosted on a specified device. The `GPU` module won't parallelize kernel execution over different GPU-devices. But it does allow us to distribute large models over devices. @@ -536,7 +569,7 @@ end Basically, the embedding space is split into two tables. For a 2048 unit embedding space, half, i.e. 1024 units, are located on each of two devices. -We use `Concat` to concatenate them back together after a `forward`. +We use [Concat](https://github.com/torch/nn/blob/master/doc/containers.md#nn.Concat) to concatenate them back together after a `forward`. For the hidden layers (i.e. `SeqLSTM`), we just distribute them on the devices used by the input layer. The hidden layers use up little memory (approximately 1GB each) so they aren't the problem. @@ -919,7 +952,7 @@ For those not familiar with this game, terms like [exotics](http://destiny.wikia.com/wiki/Exotic) and [Crota raid](http://destiny.wikia.com/wiki/Crota%27s_End) may seem odd. But these are all part of the game vocabulary. -The particular model (a 4x1500 LSTM with dropout) only backpropagates through 50 time-steps. +The particular model (a 4x1572 LSTM with dropout) only backpropagates through 50 time-steps. What I would like to see is for the `COMMENTS` to actually answer the question posed by the `TITLE` and `SELFTEXT`. This is a very difficult semantic problem which I hope the Reddit dataset will help solve. More to follow in my next Torch blog post. 
diff --git a/blog/_posts/images/LM-Linear.png b/blog/_posts/images/LM-Linear.png new file mode 100644 index 0000000..46c92d9 Binary files /dev/null and b/blog/_posts/images/LM-Linear.png differ diff --git a/blog/_posts/images/LM-NCE.png b/blog/_posts/images/LM-NCE.png new file mode 100644 index 0000000..39b6fad Binary files /dev/null and b/blog/_posts/images/LM-NCE.png differ diff --git a/blog/_posts/images/LM-params.png b/blog/_posts/images/LM-params.png new file mode 100644 index 0000000..1ae0e05 Binary files /dev/null and b/blog/_posts/images/LM-params.png differ -- cgit v1.2.3 From 7992ccb23ef065bb994ca36ad0eb973c15da273f Mon Sep 17 00:00:00 2001 From: nicholas-leonard Date: Fri, 22 Jul 2016 15:24:52 -0400 Subject: add TLDR to results --- blog/_posts/2016-05-11-nce.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md index 12c94fc..8a7a3ea 100644 --- a/blog/_posts/2016-05-11-nce.md +++ b/blog/_posts/2016-05-11-nce.md @@ -15,8 +15,8 @@ picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_po * [Building a multi-layer LSTM](#nce.lstm) * [Training and evaluation scripts](#nce.script) * [Results](#nce.result) + * [Future work](#nce.future) * [References](#nce.ref) - * [Future Word](#nce.future) In this Torch blog post, we use noise contrastive estimation (NCE) [[2]](#nce.ref) to train a multi-GPU recurrent neural network language model (RNNLM) @@ -25,6 +25,8 @@ The work presented here is the result of many months of on-and-off work. The enormity of the dataset caused us to contribute some novel open-source Torch modules, criteria and even a multi-GPU tensor. We also provide scripts so that you can train and evaluate your own language models. +If you are only interested in generated samples, perplexity and learning curves, please jump to the [results section](#nce.result). + ## Word versus character language models -- cgit v1.2.3 From 3aac64502c9083d7db4c6c0b6c1397eef5d9ba76 Mon Sep 17 00:00:00 2001 From: nicholas-leonard Date: Mon, 25 Jul 2016 15:38:48 -0400 Subject: fixed image --- blog/_posts/2016-05-11-nce.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md index 8a7a3ea..70a6bcb 100644 --- a/blog/_posts/2016-05-11-nce.md +++ b/blog/_posts/2016-05-11-nce.md @@ -4,7 +4,7 @@ title: Language modeling a billion words comments: True author: nicholas-leonard excerpt: Noise contrastive estimation is used to train a multi-GPU recurrent neural network language model on the Google billion words dataset. -picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/output_52iFki.gif +picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/rnnlm.png --- @@ -911,9 +911,9 @@ Training is still underway, but to pique your interest, this is an example of ge ``` -This particular sample is a little depressing, but that might just be the nature of the `offmychest` subreddit... -Conditioned on the opening `` token, this generated sequence is incredibly human. -Reading through the comment, I feel like I am reading a story written by an actual person. +This particular sample is a little depressing, but that might just be the nature of the `offmychest` subreddit. +Conditioned on the opening `` token, this generated sequence, although imperfect, is incredibly human. 
+Reading through the comment, I feel like I am reading a story written by an actual (somewhat schizophrenic) person.
 The ability to simulate human creativity is one of the reasons I am so interested in using reddit data for language modeling.
 
 A less depressing sample is the following, which concerns the [Destiny](https://en.wikipedia.org/wiki/Destiny_(video_game)) video game:
 
--
cgit v1.2.3