author     nicholas-leonard <nick@nikopia.org>        2016-07-21 23:19:47 +0300
committer  nicholas-leonard <nick@nikopia.org>        2016-07-21 23:19:47 +0300
commit     44bba0b99649a02798671d02713450c0bc5fb8eb (patch)
tree       8d3c080f46ccf7e08efc74324aafcb3f8d829af4
parent     dd653411b4670fe571eaee13b53a69b566cd04ec (diff)
final version?
-rw-r--r--  blog/_posts/2016-05-11-nce.md              329
-rw-r--r--  blog/_posts/images/LSTM.png                bin  0 -> 82623 bytes
-rw-r--r--  blog/_posts/images/small-vs-big-lstm.png   bin  0 -> 38671 bytes
3 files changed, 293 insertions, 36 deletions
diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md
index 310c63f..92a46dc 100644
--- a/blog/_posts/2016-05-11-nce.md
+++ b/blog/_posts/2016-05-11-nce.md
@@ -20,8 +20,8 @@ picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_po
In this Torch blog post, we use noise contrastive estimation (NCE) [[2]](#nce.ref) to train a multi-GPU recurrent neural network language model (RNNLM) on the Google billion words (GBW) dataset [[7]](#nce.ref).
-This blog post is the result of many months of work.
-The enormity of the dataset caused us to contribute some novel open-source modules, criteria and even a multi-GPU tensor.
+The work presented here is the result of many months of on-and-off work.
+The enormity of the dataset caused us to contribute some novel open-source Torch modules, criteria and even a multi-GPU tensor.
We also provide scripts so that you can train and evaluate your own language models.

<a name='nce.char'></a>
@@ -74,18 +74,52 @@ require less memory and have faster inference than their word-level counterparts
Our task is to build a language model which maximizes the likelihood of the next word given the history of previous words in the sentence.
-The following figure illustrates the a simple recurrent neural network (Simple RNN) language model:
+The following figure illustrates the workings of a simple recurrent neural network (Simple RNN) language model:

![rnnlm](images/rnnlm.png)

+The exact implementation is as follows:
+
+```lua
+h[t] = σ(W[x->h]x[t] + W[h->h]h[t−1] + b[1->h]) (1)
+y[t] = softmax(W[x->y]h[t] + b[1->y]) (2)
+```
+
For this particular example, the model should maximize "is" given "what", and then "the" given "is" and so on.
-The Simple RNN has an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the following target words.
-Simple RNNs are not the only kind of model that can be used model language.
+The Simple RNN has an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the likelihood of the remaining words in the sequence.
+Internally, the Simple RNN has parameters from input to hidden (word embeddings), hidden to hidden (recurrent connections) and hidden to output (output embeddings that feed into a softmax).
+The input to hidden parameters consist of a `LookupTable` that learns to represent each word as a vector.
+These vectors form an embedding space for words.
+The input `x[t]` to the `LookupTable` is a unique integer associated with the word `w[t]`.
+The embedding vector for that word is obtained by indexing the embedding space `W[x->h]`, which we represent by `W[x->h]x[t]`.
+The hidden to hidden parameters model the temporal dependencies of words by generating a hidden state `h[t]` given `h[t-1]` and `x[t]`.
+This is where the actual recurrence takes place, as `h[t]` is a function of `h[t-1]` (and word `x[t]`).
+The hidden to output layer does an affine transform (i.e. a `Linear` module: `W[x->y]h[t] + b[1->y]`) followed by a `softmax`.
+This is to estimate a probability distribution `y[t]` over the next word given the previous words, which are embodied by the hidden state `h[t]`.
+The criterion is to maximize the likelihood of the next word `w[t+1]` given previous words:
+`P(w[t+1]|w[1],w[2],...,w[t])`.
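To make eq. 1 and 2 concrete, here is a rough single time-step sketch built from plain `nn` modules; the sizes, variable names and module choices below are illustrative assumptions for this post, not the code used for the actual experiments:

```lua
-- minimal sketch of eq. 1 and 2 for a single time-step (sizes are made up)
require 'nn'

local vocabsize, hiddensize = 10, 4
local Wxh = nn.LookupTable(vocabsize, hiddensize) -- W[x->h] : the word embedding space
local Whh = nn.Linear(hiddensize, hiddensize)     -- W[h->h] and b[1->h]
local Why = nn.Linear(hiddensize, vocabsize)      -- W[x->y] and b[1->y]

local h = torch.zeros(hiddensize)                 -- h[0], the initial hidden state
local x = torch.LongTensor{3}                     -- x[t] : integer id of word w[t]

-- eq. 1 : h[t] = sigmoid(W[x->h]x[t] + W[h->h]h[t-1] + b[1->h])
h = nn.Sigmoid():forward(Wxh:forward(x)[1] + Whh:forward(h))
-- eq. 2 : y[t] = softmax(W[x->y]h[t] + b[1->y])
local y = nn.SoftMax():forward(Why:forward(h))

print(y:sum()) -- sums to 1 : a distribution over the next word
```

In the actual models described below, the recurrence over all time-steps is handled by the rnn package rather than by an explicit loop like this.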
+
+Simple RNNs are easy to build using the [rnn](https://github.com/Element-Research/rnn) package (see [simple RNN example](https://github.com/Element-Research/rnn/blob/master/examples/simple-recurrence-network.lua)),
+but they are not the only kind of model that can be used to model language.
There are also the more advanced Long Short Term Memory (LSTM) models [[3],[4],[5]](#nce.ref), which have special gated cells that facilitate the backpropagation of gradients through longer sequences.
-LSTMs can learn dependencies seperated between much longer time-steps.
-Like convolutions, these LSTM layers can also be stacked to form deeper models.
-In the Building
+
+![lstm](images/lstm.png)
+
+The exact implementation is as follows:
+
+```lua
+i[t] = σ(W[x->i]x[t] + W[h->i]h[t−1] + b[1->i]) (3)
+f[t] = σ(W[x->f]x[t] + W[h->f]h[t−1] + b[1->f]) (4)
+z[t] = tanh(W[x->c]x[t] + W[h->c]h[t−1] + b[1->c]) (5)
+c[t] = f[t]c[t−1] + i[t]z[t] (6)
+o[t] = σ(W[x->o]x[t] + W[h->o]h[t−1] + b[1->o]) (7)
+h[t] = o[t]tanh(c[t]) (8)
+```
+
+The main advantage is that LSTMs can learn dependencies between words separated by many more time-steps.
+They aren't as prone to the vanishing gradient problem, as the different gates can preserve the gradients during back-propagation.
+To create an LM, the word embeddings (`W[x->h]x[t]` in eq. 1) would be fed to the LSTM and the resulting hidden state would be fed to eq. 2.

<a name='nce.gbw'></a>
## Loading the Google billion words dataset
@@ -152,10 +186,10 @@ print(inputs)
[torch.DoubleTensor of size 8x2]
```
-Each column is vector containing potentially multiple sequences, i.e. a multi-sequence.
-Independent sequences are seperated by zeros. We will see later how the
+Each column is a vector containing potentially multiple sequences, i.e. a multi-sequence.
+Independent sequences are separated by zeros. In the next section, we will see how the
[rnn](https://github.com/Element-Research/rnn) package can use these zero-masked time-steps to
-efficiently forget its hidden state between independent sequences, at the granularity of columns.
+efficiently forget its hidden state between independent sequences (at the granularity of columns).
For now, notice how the original `sequences` are contained in the returned `inputs` and separated by zeros.
The `targets` are similar to the `inputs`, but use masks of 1 to separate sequences (as `ClassNLLCriterion` will otherwise complain).
@@ -250,6 +284,9 @@ Now that we feel confident in our dataset, lets look at the model.
<a name='nce.lstm'></a>
## Building a multi-layer LSTM
+In this section, we get down to the business of actually building our multi-layer LSTM.
+We will introduce NCE once we get to the output layer, starting from the input layer.
+
The input layer of the `lm` model is a lookup table:
```lua
@@ -277,7 +314,7 @@ for i,hiddensize in ipairs(opt.hiddensize) do
end
```
-The `SeqLSTM` implemention is very fast and it benchmarked by the [rnn-benchmarks](https://github.com/glample/rnn-benchmarks#lstm).
+The `SeqLSTM` implementation is very fast, as benchmarked by the [rnn-benchmarks](https://github.com/glample/rnn-benchmarks#lstm) repository.
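As a quick illustration of why `SeqLSTM` is convenient here, the toy snippet below (sizes are arbitrary assumptions chosen for this example, not values from the training script) shows the input and output shapes it expects:

```lua
-- toy shape-check for SeqLSTM from the rnn package (sizes are made up)
require 'rnn'

local seqlen, batchsize, inputsize, outputsize = 5, 3, 8, 8
local lstm = nn.SeqLSTM(inputsize, outputsize)

-- SeqLSTM consumes a whole sequence at once: a seqlen x batchsize x inputsize tensor
local input = torch.randn(seqlen, batchsize, inputsize)
local output = lstm:forward(input)

print(output:size()) -- seqlen x batchsize x outputsize, i.e. 5 x 3 x 8
```

Because the whole `seqlen x batchsize x inputsize` tensor is processed in a single call, there is no per-time-step overhead in Lua, which is where much of the speed comes from.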
Next we split the output of the SeqLSTM (which is a `seqlen x batchsize x outputsize` Tensor) into a table containing a `batchsize x outputsize` tensor for each time-step:
@@ -285,7 +322,7 @@ each time-step:
lm:add(nn.SplitTable(1))
```
-### Output layer bottleneck
+### The problem: bottleneck at the output layer
With its small vocabulary of 10000 words, the Penn Tree Bank dataset is relatively easy to use to build word-level language models.
The output layer is still computationally tractable for both training and inference, especially for GPUs.
@@ -298,16 +335,16 @@ outputlayer = nn.Sequential()
```
However, when training with large vocabularies, like the 793471 words that make up the GBW dataset,
-the output layer quickly becomes a bottle neck.
-If you are training your model with a `batchsize = 32` (number of sequences per batch) and a `seqlen = 100`
+the output layer quickly becomes a bottleneck.
+For example, if you are training your model with a `batchsize = 32` (number of sequences per batch) and a `seqlen = 100`
(size of sequence to backpropagate through time), the output of that layer will have shape `seqlen x batchsize x vocabsize`, or `100 x 32 x 793471`.
-For a `FloatTensor` or `CudaTensor`, that single tensor will take up 10.156GB of memory.
+For a `FloatTensor` or `CudaTensor`, that single tensor will take up 10.156GB of memory!
That number can be doubled for the gradients, and doubled again as both Linear and SoftMax store a copy for the output.
If somehow you can find a way to put >40GB on a GPU (or distribute it over many), you then run into the problem of forward/backward propagating through that `outputlayer` in a reasonable time-frame.
-### Noise contrastive estimation
+### The solution: noise contrastive estimation
The output layer of the LM uses Noise Contrastive Estimation (NCE) to speed up training and reduce memory consumption:
@@ -333,11 +370,39 @@ nn.Sequential():add(nn.Linear(inputsize, #trainset.ivocab)):add(nn.LogSoftMax())
Along with the [NCECriterion](https://github.com/Element-Research/dpnn#nn.NCECriterion), the `NCEModule` implements the algorithm described in [[1]](#nce.ref).
-I won't go into the details of the algorithm as it involves a lot of math.
+I won't go into the details of the algorithm as it involves a lot of math which is more appropriately detailed in the reference papers.
The way it works is that for each target word (the likelihood of which we want to maximize), `k` words are sampled from a noise distribution, which is typically the unigram distribution.
-The `unigram` above is a tensor of size 793470 where each element is the frequency of the commensurate word in the corpus.
+Remember that a softmax is basically:
+
+```lua
+        exp(x[i])
+y[i] = ---------------------------------  (9)
+       exp(x[1])+exp(x[2])+...+exp(x[n])
+```
+
+where `x[i]` is the `i`-th output of the output `Linear` layer.
+The above denominator is the cause of the bottleneck as the `Linear` needs to be computed for each output `x[i]`.
+For an `n=793471` vocabulary, this is prohibitively expensive.
+NCE gets around this problem by replacing the denominator of eq. 9 with a constant `Z` during training:
+
+```lua
+        exp(x[i])
+y[i] = ------------  (10)
+            Z
+```
+
+Now, this is not what actually happens during training, as back-propagating through the above will not produce gradients
+for the `x[j]` where `j~=i` (`j` not equal to `i`).
+Notice that backpropagating through eq. 9 will produce gradients for all outputs `x` of the `Linear` (i.e. for all `i`).
+Another problem with eq. 10 is that nothing is pushing `exp(x[1])+exp(x[2])+...+exp(x[n])` to approximate `Z`.
+What NCE does is formulate the problem such that `k` noise samples can be included in the equation to
+both make sure that some (at most `k`) negative samples (i.e. `x[j]` where `j~=i`) get gradients and that the denominator of eq. 9 approximates the denominator of eq. 10.
+The `k` noise samples are sampled from a noise distribution, i.e. the unigram distribution.
+The output layer `Linear` need only be computed for the target and noise-sampled words, which is where the efficiency is gained.
+
+The `unigram` variable above is a tensor of size 793470 where each element is the frequency of the corresponding word in the corpus.
Sampling from such a large distribution using something like [torch.multinomial](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.multinomial) can become a bottleneck during training.
So we implemented a more efficient version in [torch.AliasMultinomial](https://github.com/nicholas-leonard/torchx/blob/master/AliasMultinomial.lua).
@@ -348,22 +413,25 @@ For the Softmax, which NCE tries to approximate, the `Z` is the sum over the `ex
For NCE, the `Z` is typically fixed to `Z=1`.
Our initial experiments found that setting `Z` to `Z=N*mean(exp(x[i]))`
(where `N` is the number of words and the `mean` is approximated over a small batch of word samples `i`)
-gave much better results.
+gave much better results, but this is because we weren't appropriately initializing the output layer parameters.
-One notable aspect of NCE papers (there are many) is that they often forget to mention the importance of parameter initialization.
+One notable aspect of NCE papers (there are many) is that they often forget to mention the importance of this parameter initialization.
Setting `Z=1` is only really possible if the `NCEModule.bias` is initialized to `bias[i] = -log(N)`.
This is what the authors of [[2]](#nce.ref) use, although it isn't mentioned in the paper (I contacted one of the authors to find out).
Sampling `k` noise samples per time-step and per batch-row means that the `NCEModule` needs to internally use something like [torch.baddbmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.baddbmm) to compute the `output`.
Reference [[2]](#nce.ref) implements a faster version where the noise samples are drawn once and used for the entire batch (but still once for each time-step).
-This make the code a bit faster as the more efficient [torch.addmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.addmm) can be used.
+This makes the code a bit faster as the more efficient [torch.addmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.addmm) can be used instead of `torch.baddbmm`.
This faster NCE version described in [[2]](#nce.ref) is the default implementation of the `NCEModule`.
Sampling per batch-row can be turned on with `NCEModule.rownoise=true`.

<a name='nce.script'></a>
## Training and evaluation scripts

-The experiments presented here use three scripts: two for training and one for evaluation.
+The experiments presented here use three scripts: two for training (you only need to use one) and one for evaluation.
+The training scripts differ only in the number of GPUs they use.
+Both train a language model on the training set and do early-stopping on the validation set.
+The evaluation script is used to measure the perplexity of a trained model on the test set, or to generate sentences.
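For readers who haven't worked with LMs before, perplexity is just the exponential of the average negative log-likelihood per word. Here is a tiny sketch of the quantity the evaluation script reports; the helper name and the numbers are made up for illustration:

```lua
-- hypothetical helper: perplexity from a negative log-likelihood (in nats)
-- summed over all nwords words of the test set
local function perplexity(sumNLL, nwords)
   return math.exp(sumNLL / nwords)
end

-- e.g. an average NLL of 3.7 nats per word corresponds to a perplexity of about 40
print(perplexity(3.7 * 1000, 1000)) -- 40.447...
```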
### Single-GPU training script
@@ -492,7 +560,7 @@ method to distribute the `weight` on `gradWeight` on different devices.
This is accomplished by swapping the weight tensors for our own: [torch.MultiCudaTensor](https://github.com/nicholas-leonard/torchx/blob/master/MultiCudaTensor.lua).
Lua has no strict type-checking system, so you can fake a tensor by creating a `torch.class` table with the same methods.
To save time, the current version of `MultiCudaTensor` only supports the operations required by the NCEModule.
-The advantage of this approach is that it requires minimal changes to the NCEModule and maintains backward compatiblity without requiring redundant code or excessive refactoring.
+The advantage of this approach is that it requires minimal changes to the `NCEModule` and maintains backward compatibility without requiring redundant code or excessive refactoring.
```lua
-- output layer
@@ -569,11 +637,29 @@ On the 4-layer LSTM with 2048 hidden units, [[1]](#nce.ref) obtain 43.2 perplexi
After early-stopping on a sub-set of the validation set (at 100 epochs of training where 1 epoch is 128 sequences x 400k words/sequence), our model was able to reach *40.61* perplexity.
This model was run on 4x12GB NVIDIA Titan X GPUs.
-Training requires approximately 40GB of memory, distributed across the 4 GPU devices.
+Training requires approximately 40GB of memory distributed across the 4 GPU devices, and takes 2-3 weeks.
As in the original paper, we do not make use of momentum as it provides little benefit and requires 1/2 more memory.
Training runs at about 3800 words/second.
+### Learning Curves
+
+The following figure outlines the learning curves for the above 4x2048 LSTM model.
+The figure plots the NCE training and validation error for the model, which is the error output by the `NCEModule`.
+Test set error isn't plotted, as doing so for any epoch requires about 3 hours because test set inference uses `Linear` + `SoftMax` with `batchsize=1`.
+
+![LSTM NCE Learning curves](images/LSTM-NCE-curve.png)
+
+As you can see, most of the learning is done in the first epochs.
+Nevertheless, the training and validation error are consistently reduced as training progresses.
+
+The following figure compares the validation learning curves (again, NCE error) for a small 2x250 LSTM (no dropout) and a big 4x2048 LSTM (with dropout).
+
+![Small vs Big LSTM](images/small-vs-big-lstm.png)
+
+What I find impressive about this figure is how quickly the higher-capacity model bests the lower-capacity model.
+This clearly demonstrates the importance of capacity when optimizing large-scale language models.
+
### Generating Sentences

Here are some sentences sampled independently from the 4-layer LSTM with a `temperature` of 0.7:
@@ -633,26 +719,197 @@ Here are some sentences sampled independently from the 4-layer LSTM with a `temp
The syntax seems quite reasonable, especially when comparing it to the previous results obtained from the [single-GPU 2x250 LSTM](#nce.eval).
However, in some cases, the semantics, i.e. the meaning of the words, is not so good.
-For example, the sentence `<S> Easy , it 's said , to have a child with autism . </S>` would make more sense, to me at least, by replacing `Easy` with `Not easy`.
-But then again, this sentence as nice semantics: `<S> The government of president Michelle Bachelet has promised to maintain a " strong and systematic " military presence in key areas and to tackle any issue of violence , including kidnappings . </S>`.
+For example, the sentence +```xml +<S> Easy , it 's said , to have a child with autism . </S> +``` +would make more sense, to me at least, by replacing `Easy` with `Not easy`. + +On the other hand, sentences like this one demonstrate good semantics: + +```xml +<S> The government of president Michelle Bachelet has promised to maintain a " strong and systematic " military presence in key areas and to tackle any issue of violence , including kidnappings . </S>`. +``` + [Michelle Bachelet](https://en.wikipedia.org/wiki/Michelle_Bachelet) was actually a president of Chile. In her earlier life, she was also [kidnapped by military men](https://www.theguardian.com/world/2005/nov/22/chile.gender), so it kind of makes sense that she would be strong on the issue of kidnappings. -Here is an example of some weird semantics : `<S> Even though there are many brands that have low voice and no connection to the Internet , the iPhone is a great deal for consumers . </S>` + +Here is an example of some weird semantics : + +```xml +<S> Even though there are many brands that have low voice and no connection to the Internet , the iPhone is a great deal for consumers . </S> +``` + The first part about `load voice` doesn't mean anything to me. And I fail to see how there being `many brands that have no connection to the Internet` relates to `the iPhone is a great deal for consumers`. But of course, all these sentences are generated independently, so the LM needs to learn to generate a meaning on the fly. -This is hard as there is no context to the sentence. +This is hard as there is no context to the sentence being generated. -### Learning Curves +In any case, I am quite happy with the results as they are definitely some of the most natural-looking synthetic sentences I have seen so far. -The following figure outlines the learning curves for the above model. -The figure plots the NCE training and validation error for the model, which is the error output but the `NCEModule`. -Test set error isn't plotted as doing so for any epoch requires about 3 hours because test set inference uses `Linear` + `SoftMax` with `batchsize=1`. +## Future Work -![LSTM NCE Learning curves](images/LSTM-NCE-curve.png) +I am currently working on a language modeling dataset based on one month of [reddit.com](https://www.reddit.com/) data. +Each sequence is basically a reddit submission consisting of a title, selftext, url, score, author and a thread of comments. +These sequences are much longer (average of 205 tokens) than the sentences that make up the GBW dataset (average of 26 tokens). +Training is still underway, but to pique your interest, this is an example of generated data (indentation and line breaks added for clarity): -As you can see, most of the learning is done in the first epochs. -Nevertheless, the training and validation error are consistently reduced training progresses. +```xml +<SUBMISSION> + <AUTHOR> http://www.reddit.com/u/[deleted] </AUTHOR> + <SCORE> 0 </SCORE> + <TITLE> + [ WP ] You take a picture of a big bang . + You discover an alien that lives in the center of the planet in an unknown way . + You can say " what the fuck is that ? " + </TITLE> + <COMMENTS> + <CoMMeNT> + <ScoRE> 2 </ScoRE> + <AuTHoR> http://www.reddit.com/u/Nev2k </AuTHoR> + <BodY> + I have a question . + When i was younger , my parents had a house that had a living room in it . + One that was only a small portion of an entire level . + This was a month before i got my money . 
+ If i was living in a house with a " legacy " i would make some mistakes . + When i was a child , i did n't know how to do shit about the house . + My parents got me into my own house and i never found a place to live . + So i decide to go to college . + I was so freaked out , i didnt have the drive to see them . + I never had a job , i was n't going anywhere . + I was so happy . + I knew i was going to be there . + I gave myself a job and my parents came . + That 's when i realized that i was in the wrong . + So i started to go . + I couldnt decide how long i wanted to live in this country . + I was so excited about the future . + I had a job . + I saved my money . + I did n't have a job . + I went to a highschool in a small town . + I had a job . + A job . + I did n't know what to do . + I was terrified of losing my job . + So i borrowed my $ 1000 in an hour . + I could n't afford to pay my rent . + I was so low on money . + I had my parents and i got into a free college . + I got in touch with my parents . + All of my friends were dead . + I was still with my family for a week . + I became a good parent . + I was a good choice . + When i got on my HSS i was going to go to my parents ' house . + I started to judge my parents . + I had a minor problem . + My parents . + I was so fucking bad . + My sister had a voice that was very loud . + I 'm sure my cousins were in a place where i could just hear my voice . + I felt like i was supposed to be angry . + I was so angry . + To cope with this . + My dad and i were both on break and i felt so alone . + I got unconscious and my mum left . + When I got to college , i was back in school . + I was a good kid . + I was happy . + And I told myself I was ready . + I told my parents . + They always talked about how they were going to be a good mom , and that I was going to be ready for that . + They always wanted to help me . + I did n't know what to do . + I had to . + I tried to go back to my dad , because I knew a lot about my mom . + I loved her . + I cared about her . + We cared for our family . + The time together was my only relationship . + I loved my heart . + And I hated my mother . + I chose it . + I cried . I cried . I cried . I cried . I cried . I cried . I cried . + The tears were gone . + I cried . I cried . I cried . I cried . I cried . I cried . I cried . I cried . I cried . I cried . + I do n't know how to do it . + I do n't know how to deal with it . + I ca n't feel my emotions . + I ca n't get out of bed . + I ca n't sleep . + I ca n't tell my friends . + I just need to leave . + I want to leave . + I hate myself . + I hate feeling like I 'm being selfish . + I feel like I 'm not good enough anymore . + I need to find a new job . + I hate that I have to get my shit together . + I love my job . + I 'm having a hard time . + Why do I need to get a job ? + I have no job . + I have n't been feeling good lately . + I feel like I 'm going to be so much worse in the long run . + I feel so alone . + I ca n't believe I 'm so sad about going through my entire life . + </BodY> + <AuTHoR> http://www.reddit.com/u/Scarbarella </AuTHoR> + </CoMMeNT> + </COMMENTS> + <SUBREDDIT> http://www.reddit.com/r/offmychest </SUBREDDIT> + <SELFTEXT> + I do n't know what to do anymore . + I feel like I 'm going to die and I 'm going to be sick because I have no more friends . + I do n't know what to do about my depression and I do n't know where to go from here . + I do n't know how I do because I know I 'm scared of being alone . 
+ Any advice would be appreciated . + Love . + </SELFTEXT> +</SUBMISSION> +``` + +This particular sample is a little depressing, but incredibly human, which is one of the reasons I am so interested in reddit for language modeling. +But that might just be the nature of the `offmychest` subreddit. + +A less depressing sample is the following, which concerns the Destiny game. + +```xml +<SUBMISSION> + <SUBREDDIT> http://www.reddit.com/r/DestinyTheGame </SUBREDDIT> + <TITLE> + Does anyone have a link to the Destiny Grimoire that I can use to get my Xbox 360 to play ? + </TITLE> + <COMMENTS> + <CoMMeNT> + <AuTHoR> http://www.reddit.com/u/CursedSun </AuTHoR> + <BodY> + I 'd love to have a weekly reset . + </BodY> + <ScoRE> 1 </ScoRE> + </CoMMeNT> + </COMMENTS> + <SCORE> 0 </SCORE> + <SELFTEXT> + I have a few friends who are willing to help me out . + If I get to the point where I 'm not going to have to go through all the weekly raids , I 'll have to " complete " the raid . + I 'm doing the Weekly strike and then doing the Weekly ( and hopefully also the Weekly ) on Monday . + I 'm not planning to get the chest , but I am getting my first exotic that I just got done from my first Crota raid . + I 'm not sure how well it would work for the Nightfall and Weekly , but I do n't want to loose my progress . + I 'd love to get some other people to help me , and I 'm open to all suggestions . + I have a lot of experience with this stuff , so I figured it 's a good idea to know if I 'm getting the right answer . + I 'm truly sorry for the inconvenience . + </SELFTEXT> + <AUTHOR> <OOV> </AUTHOR> +</SUBMISSION> +``` + +The particular model (a 4x1500 LSTM with dropout) only backpropagates through 50 time-steps. +What would like to see is for the comments to actually answer the question posed by the title and selftext. +This is a very difficult semantic problem which I hope the Reddit dataset will help solve. +More to follow in my next Troch blog post. <a name='nce.ref'></a> ## References diff --git a/blog/_posts/images/LSTM.png b/blog/_posts/images/LSTM.png Binary files differnew file mode 100644 index 0000000..80c6067 --- /dev/null +++ b/blog/_posts/images/LSTM.png diff --git a/blog/_posts/images/small-vs-big-lstm.png b/blog/_posts/images/small-vs-big-lstm.png Binary files differnew file mode 100644 index 0000000..b580afd --- /dev/null +++ b/blog/_posts/images/small-vs-big-lstm.png |