author     nicholas-leonard <nick@nikopia.org>        2016-07-21 23:19:47 +0300
committer  nicholas-leonard <nick@nikopia.org>        2016-07-21 23:19:47 +0300
commit     44bba0b99649a02798671d02713450c0bc5fb8eb (patch)
tree       8d3c080f46ccf7e08efc74324aafcb3f8d829af4
parent     dd653411b4670fe571eaee13b53a69b566cd04ec (diff)
final version?
-rw-r--r--  blog/_posts/2016-05-11-nce.md              329
-rw-r--r--  blog/_posts/images/LSTM.png                bin  0 -> 82623 bytes
-rw-r--r--  blog/_posts/images/small-vs-big-lstm.png   bin  0 -> 38671 bytes
3 files changed, 293 insertions, 36 deletions
diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md
index 310c63f..92a46dc 100644
--- a/blog/_posts/2016-05-11-nce.md
+++ b/blog/_posts/2016-05-11-nce.md
@@ -20,8 +20,8 @@ picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_po
In this Torch blog post, we use noise contrastive estimation (NCE) [[2]](#nce.ref) to train a multi-GPU recurrent neural network language model (RNNLM) on the Google billion words (GBW) dataset [[7]](#nce.ref).
-This blog post is the result of many months of work.
-The enormity of the dataset caused us to contribute some novel open-source modules, criteria and even a multi-GPU tensor.
+The work presented here is the result of many months of on-and-off work.
+The enormity of the dataset caused us to contribute some novel open-source Torch modules, criteria and even a multi-GPU tensor.
We also provide scripts so that you can train and evaluate your own language models.

<a name='nce.char'></a>
@@ -74,18 +74,52 @@ require less memory and have faster inference than their word-level counterparts
Our task is to build a language model which maximizes the likelihood of the next word given the history of previous words in the sentence.
-The following figure illustrates the a simple recurrent neural network (Simple RNN) language model:
+The following figure illustrates the workings of a simple recurrent neural network (Simple RNN) language model:

![rnnlm](images/rnnlm.png)

+The exact implementation is as follows:
+
+```lua
+h[t] = σ(W[x->h]x[t] + W[h->h]h[t−1] + b[1->h]) (1)
+y[t] = softmax(W[x->y]h[t] + b[1->y]) (2)
+```
+
For this particular example, the model should maximize "is" given "what", and then "the" given "is" and so on.
-The Simple RNN has an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the following target words.
-Simple RNNs are not the only kind of model that can be used model language.
+The Simple RNN has an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the likelihood of the remaining words in the sequence.
+Internally, the Simple RNN has parameters from input to hidden (word embeddings), hidden to hidden (recurrent connections) and hidden to output (output embeddings that feed into a softmax).
+The input to hidden parameters consist of a `LookupTable` that learns to represent each word as a vector.
+These vectors form an embedding space for words.
+The input `x[t]` to the `LookupTable` is a unique integer associated with the word `w[t]`.
+The embedding vector for that word is obtained by indexing the embedding space `W[x->h]`, which we represent by `W[x->h]x[t]`.
+The hidden to hidden parameters model the temporal dependencies of words by generating a hidden state `h[t]` given `h[t-1]` and `x[t]`.
+This is where the actual recurrence takes place, as `h[t]` is a function of `h[t-1]` (and word `x[t]`).
+The hidden to output layer does an affine transform (i.e. a `Linear` module: `W[x->y]h[t] + b[1->y]`) followed by a `softmax`.
+This is to estimate a probability distribution `y[t]` over the next word given the previous words, which are embodied by the hidden state `h[t]`.
+The criterion is to maximize the likelihood of the next word `w[t+1]` given previous words:
+`P(w[t+1]|w[1],w[2],...,w[t])`.
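To make eq. 1 and 2 concrete, here is a rough single time-step sketch built from plain `nn` modules; the sizes, variable names and module choices below are illustrative assumptions for this post, not the code used for the actual experiments:

```lua
-- minimal sketch of eq. 1 and 2 for a single time-step (sizes are made up)
require 'nn'

local vocabsize, hiddensize = 10, 4
local Wxh = nn.LookupTable(vocabsize, hiddensize) -- W[x->h] : the word embedding space
local Whh = nn.Linear(hiddensize, hiddensize)     -- W[h->h] and b[1->h]
local Why = nn.Linear(hiddensize, vocabsize)      -- W[x->y] and b[1->y]

local h = torch.zeros(hiddensize)                 -- h[0], the initial hidden state
local x = torch.LongTensor{3}                     -- x[t] : integer id of word w[t]

-- eq. 1 : h[t] = sigmoid(W[x->h]x[t] + W[h->h]h[t-1] + b[1->h])
h = nn.Sigmoid():forward(Wxh:forward(x)[1] + Whh:forward(h))
-- eq. 2 : y[t] = softmax(W[x->y]h[t] + b[1->y])
local y = nn.SoftMax():forward(Why:forward(h))

print(y:sum()) -- sums to 1 : a distribution over the next word
```

In the actual models described below, the recurrence over all time-steps is handled by the rnn package rather than by an explicit loop like this.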
+
+Simple RNNs are easy to build using the [rnn](https://github.com/Element-Research/rnn) package (see [simple RNN example](https://github.com/Element-Research/rnn/blob/master/examples/simple-recurrence-network.lua)),
+but they are not the only kind of model that can be used to model language.
There are also the more advanced Long Short Term Memory (LSTM) models [[3],[4],[5]](#nce.ref), which have special gated cells that facilitate the backpropagation of gradients through longer sequences.
-LSTMs can learn dependencies seperated between much longer time-steps.
-Like convolutions, these LSTM layers can also be stacked to form deeper models.
-In the Building
+
+![lstm](images/lstm.png)
+
+The exact implementation is as follows:
+
+```lua
+i[t] = σ(W[x->i]x[t] + W[h->i]h[t−1] + b[1->i]) (3)
+f[t] = σ(W[x->f]x[t] + W[h->f]h[t−1] + b[1->f]) (4)
+z[t] = tanh(W[x->c]x[t] + W[h->c]h[t−1] + b[1->c]) (5)
+c[t] = f[t]c[t−1] + i[t]z[t] (6)
+o[t] = σ(W[x->o]x[t] + W[h->o]h[t−1] + b[1->o]) (7)
+h[t] = o[t]tanh(c[t]) (8)
+```
+
+The main advantage is that LSTMs can learn dependencies between words separated by many more time-steps.
+They aren't as prone to the vanishing gradient problem, as the different gates can preserve the gradients during back-propagation.
+To create an LM, the word embeddings (`W[x->h]x[t]` in eq. 1) would be fed to the LSTM and the resulting hidden state would be fed to eq. 2.

<a name='nce.gbw'></a>
## Loading the Google billion words dataset
@@ -152,10 +186,10 @@ print(inputs)
[torch.DoubleTensor of size 8x2]
```
-Each column is vector containing potentially multiple sequences, i.e. a multi-sequence.
-Independent sequences are seperated by zeros. We will see later how the
+Each column is a vector containing potentially multiple sequences, i.e. a multi-sequence.
+Independent sequences are separated by zeros. In the next section, we will see how the
[rnn](https://github.com/Element-Research/rnn) package can use these zero-masked time-steps to
-efficiently forget its hidden state between independent sequences, at the granularity of columns.
+efficiently forget its hidden state between independent sequences (at the granularity of columns).
For now, notice how the original `sequences` are contained in the returned `inputs` and separated by zeros.
The `targets` are similar to the `inputs`, but use masks of 1 to separate sequences (as `ClassNLLCriterion` will otherwise complain).
@@ -250,6 +284,9 @@ Now that we feel confident in our dataset, lets look at the model.
<a name='nce.lstm'></a>
## Building a multi-layer LSTM
+In this section, we get down to the business of actually building our multi-layer LSTM.
+We will introduce NCE once we get to the output layer, starting from the input layer.
+
The input layer of the `lm` model is a lookup table:
```lua
@@ -277,7 +314,7 @@ for i,hiddensize in ipairs(opt.hiddensize) do
end
```
-The `SeqLSTM` implemention is very fast and it benchmarked by the [rnn-benchmarks](https://github.com/glample/rnn-benchmarks#lstm).
+The `SeqLSTM` implementation is very fast, as benchmarked by the [rnn-benchmarks](https://github.com/glample/rnn-benchmarks#lstm) repository.
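As a quick illustration of why `SeqLSTM` is convenient here, the toy snippet below (sizes are arbitrary assumptions chosen for this example, not values from the training script) shows the input and output shapes it expects:

```lua
-- toy shape-check for SeqLSTM from the rnn package (sizes are made up)
require 'rnn'

local seqlen, batchsize, inputsize, outputsize = 5, 3, 8, 8
local lstm = nn.SeqLSTM(inputsize, outputsize)

-- SeqLSTM consumes a whole sequence at once: a seqlen x batchsize x inputsize tensor
local input = torch.randn(seqlen, batchsize, inputsize)
local output = lstm:forward(input)

print(output:size()) -- seqlen x batchsize x outputsize, i.e. 5 x 3 x 8
```

Because the whole `seqlen x batchsize x inputsize` tensor is processed in a single call, there is no per-time-step overhead in Lua, which is where much of the speed comes from.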
Next we split the output of the SeqLSTM (which is a `seqlen x batchsize x outputsize` Tensor) into a table containing a `batchsize x outputsize` tensor for each time-step:
@@ -285,7 +322,7 @@ each time-step:
lm:add(nn.SplitTable(1))
```
-### Output layer bottleneck
+### The problem: bottleneck at the output layer
With its small vocabulary of 10000 words, the Penn Tree Bank dataset is relatively easy to use to build word-level language models.
The output layer is still computationally tractable for both training and inference, especially for GPUs.
@@ -298,16 +335,16 @@ outputlayer = nn.Sequential()
```
However, when training with large vocabularies, like the 793471 words that make up the GBW dataset,
-the output layer quickly becomes a bottle neck.
-If you are training your model with a `batchsize = 32` (number of sequences per batch) and a `seqlen = 100`
+the output layer quickly becomes a bottleneck.
+For example, if you are training your model with a `batchsize = 32` (number of sequences per batch) and a `seqlen = 100`
(size of sequence to backpropagate through time), the output of that layer will have shape `seqlen x batchsize x vocabsize`, or `100 x 32 x 793471`.
-For a `FloatTensor` or `CudaTensor`, that single tensor will take up 10.156GB of memory.
+For a `FloatTensor` or `CudaTensor`, that single tensor will take up 10.156GB of memory!
That number can be doubled for the gradients, and doubled again as both Linear and SoftMax store a copy for the output.
If somehow you can find a way to put >40GB on a GPU (or distribute it over many), you then run into the problem of forward/backward propagating through that `outputlayer` in a reasonable time-frame.
-### Noise contrastive estimation
+### The solution: noise contrastive estimation
The output layer of the LM uses Noise Contrastive Estimation (NCE) to speed up training and reduce memory consumption:
@@ -333,11 +370,39 @@ nn.Sequential():add(nn.Linear(inputsize, #trainset.ivocab)):add(nn.LogSoftMax())
Along with the [NCECriterion](https://github.com/Element-Research/dpnn#nn.NCECriterion), the `NCEModule` implements the algorithm described in [[1]](#nce.ref).
-I won't go into the details of the algorithm as it involves a lot of math.
+I won't go into the details of the algorithm as it involves a lot of math which is more appropriately detailed in the reference papers.
The way it works is that for each target word (the likelihood of which we want to maximize), `k` words are sampled from a noise distribution, which is typically the unigram distribution.
-The `unigram` above is a tensor of size 793470 where each element is the frequency of the commensurate word in the corpus.
+Remember that a softmax is basically:
+
+```lua
+        exp(x[i])
+y[i] = ---------------------------------  (9)
+       exp(x[1])+exp(x[2])+...+exp(x[n])
+```
+
+where `x[i]` is the `i`-th output of the output `Linear` layer.
+The above denominator is the cause of the bottleneck as the `Linear` needs to be computed for each output `x[i]`.
+For an `n=793471` vocabulary, this is prohibitively expensive.
+NCE gets around this problem by replacing the denominator of eq. 9 with a constant `Z` during training:
+
+```lua
+        exp(x[i])
+y[i] = ------------  (10)
+            Z
+```
+
+Now, this is not what actually happens during training, as back-propagating through the above will not produce gradients
+for the `x[j]` where `j~=i` (`j` not equal to `i`).
+Notice that backpropagating through eq. 9 will produce gradients for all outputs `x` of the `Linear` (i.e. for all `i`).
+Another problem with eq. 10 is that nothing is pushing `exp(x[1])+exp(x[2])+...+exp(x[n])` to approximate `Z`.
+What NCE does is formulate the problem such that `k` noise samples can be included in the equation to
+both make sure that some (at most `k`) negative samples (i.e. `x[j]` where `j~=i`) get gradients and that the denominator of eq. 9 approximates the denominator of eq. 10.
+The `k` noise samples are sampled from a noise distribution, i.e. the unigram distribution.
+The output layer `Linear` need only be computed for the target and noise-sampled words, which is where the efficiency is gained.
+
+The `unigram` variable above is a tensor of size 793470 where each element is the frequency of the corresponding word in the corpus.
Sampling from such a large distribution using something like [torch.multinomial](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.multinomial) can become a bottleneck during training.
So we implemented a more efficient version in [torch.AliasMultinomial](https://github.com/nicholas-leonard/torchx/blob/master/AliasMultinomial.lua).
@@ -348,22 +413,25 @@ For the Softmax, which NCE tries to approximate, the `Z` is the sum over the `ex
For NCE, the `Z` is typically fixed to `Z=1`.
Our initial experiments found that setting `Z` to `Z=N*mean(exp(x[i]))`
(where `N` is the number of words and the `mean` is approximated over a small batch of word samples `i`)
-gave much better results.
+gave much better results, but this is because we weren't appropriately initializing the output layer parameters.
-One notable aspect of NCE papers (there are many) is that they often forget to mention the importance of parameter initialization.
+One notable aspect of NCE papers (there are many) is that they often forget to mention the importance of this parameter initialization.
Setting `Z=1` is only really possible if the `NCEModule.bias` is initialized to `bias[i] = -log(N)`.
This is what the authors of [[2]](#nce.ref) use, although it isn't mentioned in the paper (I contacted one of the authors to find out).
Sampling `k` noise samples per time-step and per batch-row means that the `NCEModule` needs to internally use something like [torch.baddbmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.baddbmm) to compute the `output`.
Reference [[2]](#nce.ref) implements a faster version where the noise samples are drawn once and used for the entire batch (but still once for each time-step).
-This make the code a bit faster as the more efficient [torch.addmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.addmm) can be used.
+This makes the code a bit faster as the more efficient [torch.addmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.addmm) can be used instead of `torch.baddbmm`.
This faster NCE version described in [[2]](#nce.ref) is the default implementation of the `NCEModule`.
Sampling per batch-row can be turned on with `NCEModule.rownoise=true`.

<a name='nce.script'></a>
## Training and evaluation scripts

-The experiments presented here use three scripts: two for training and one for evaluation.
+The experiments presented here use three scripts: two for training (you only need to use one) and one for evaluation.
+The training scripts differ only in the number of GPUs they use.
+Both train a language model on the training set and do early-stopping on the validation set.
+The evaluation script is used to measure the perplexity of a trained model on the test set, or to generate sentences.
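For readers who haven't worked with LMs before, perplexity is just the exponential of the average negative log-likelihood per word. Here is a tiny sketch of the quantity the evaluation script reports; the helper name and the numbers are made up for illustration:

```lua
-- hypothetical helper: perplexity from a negative log-likelihood (in nats)
-- summed over all nwords words of the test set
local function perplexity(sumNLL, nwords)
   return math.exp(sumNLL / nwords)
end

-- e.g. an average NLL of 3.7 nats per word corresponds to a perplexity of about 40
print(perplexity(3.7 * 1000, 1000)) -- 40.447...
```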
### Single-GPU training script
@@ -492,7 +560,7 @@ method to distribute the `weight` on `gradWeight` on different devices.
This is accomplished by swapping the weight tensors for our own: [torch.MultiCudaTensor](https://github.com/nicholas-leonard/torchx/blob/master/MultiCudaTensor.lua).
Lua has no strict type-checking system, so you can fake a tensor by creating a `torch.class` table with the same methods.
To save time, the current version of `MultiCudaTensor` only supports the operations required by the NCEModule.
-The advantage of this approach is that it requires minimal changes to the NCEModule and maintains backward compatiblity without requiring redundant code or excessive refactoring.
+The advantage of this approach is that it requires minimal changes to the `NCEModule` and maintains backward compatibility without requiring redundant code or excessive refactoring.
```lua
-- output layer
@@ -569,11 +637,29 @@ On the 4-layer LSTM with 2048 hidden units, [[1]](#nce.ref) obtain 43.2 perplexi
After early-stopping on a sub-set of the validation set (at 100 epochs of training where 1 epoch is 128 sequences x 400k words/sequence), our model was able to reach *40.61* perplexity.
This model was run on 4x12GB NVIDIA Titan X GPUs.
-Training requires approximately 40GB of memory, distributed across the 4 GPU devices.
+Training requires approximately 40GB of memory distributed across the 4 GPU devices, and takes 2-3 weeks.
As in the original paper, we do not make use of momentum as it provides little benefit and requires 1/2 more memory.
Training runs at about 3800 words/second.
+### Learning Curves
+
+The following figure outlines the learning curves for the above 4x2048 LSTM model.
+The figure plots the NCE training and validation error for the model, which is the error output by the `NCEModule`.
+Test set error isn't plotted, as doing so for any epoch requires about 3 hours because test set inference uses `Linear` + `SoftMax` with `batchsize=1`.
+
+![LSTM NCE Learning curves](images/LSTM-NCE-curve.png)
+
+As you can see, most of the learning is done in the first epochs.
+Nevertheless, the training and validation error are consistently reduced as training progresses.
+
+The following figure compares the validation learning curves (again, NCE error) for a small 2x250 LSTM (no dropout) and a big 4x2048 LSTM (with dropout).
+
+![Small vs Big LSTM](images/small-vs-big-lstm.png)
+
+What I find impressive about this figure is how quickly the higher-capacity model bests the lower-capacity model.
+This clearly demonstrates the importance of capacity when optimizing large-scale language models.
+
### Generating Sentences

Here are some sentences sampled independently from the 4-layer LSTM with a `temperature` of 0.7:
@@ -633,26 +719,197 @@ Here are some sentences sampled independently from the 4-layer LSTM with a `temp
The syntax seems quite reasonable, especially when comparing it to the previous results obtained from the [single-GPU 2x250 LSTM](#nce.eval).
However, in some cases, the semantics, i.e. the meaning of the words, is not so good.
-For example, the sentence `<S> Easy , it 's said , to have a child with autism . </S>` would make more sense, to me at least, by replacing `Easy` with `Not easy`.
-But then again, this sentence as nice semantics: `<S> The government of president Michelle Bachelet has promised to maintain a " strong and systematic " military presence in key areas and to tackle any issue of violence , including kidnappings . </S>`.
+For example, the sentence +```xml +<S> Easy , it 's said , to have a child with autism . </S> +``` +would make more sense, to me at least, by replacing `Easy` with `Not easy`. + +On the other hand, sentences like this one demonstrate good semantics: + +```xml +<S> The government of president Michelle Bachelet has promised to maintain a " strong and systematic " military presence in key areas and to tackle any issue of violence , including kidnappings . </S>`. +``` + [Michelle Bachelet](https://en.wikipedia.org/wiki/Michelle_Bachelet) was actually a president of Chile. In her earlier life, she was also [kidnapped by military men](https://www.theguardian.com/world/2005/nov/22/chile.gender), so it kind of makes sense that she would be strong on the issue of kidnappings. -Here is an example of some weird semantics : `<S> Even though there are many brands that have low voice and no connection to the Internet , the iPhone is a great deal for consumers . </S>` + +Here is an example of some weird semantics : + +```xml +<S> Even though there are many brands that have low voice and no connection to the Internet , the iPhone is a great deal for consumers . </S> +``` + The first part about `load voice` doesn't mean anything to me. And I fail to see how there being `many brands that have no connection to the Internet` relates to `the iPhone is a great deal for consumers`. But of course, all these sentences are generated independently, so the LM needs to learn to generate a meaning on the fly. -This is hard as there is no context to the sentence. +This is hard as there is no context to the sentence being generated. -### Learning Curves +In any case, I am quite happy with the results as they are definitely some of the most natural-looking synthetic sentences I have seen so far. -The following figure outlines the learning curves for the above model. -The figure plots the NCE training and validation error for the model, which is the error output but the `NCEModule`. -Test set error isn't plotted as doing so for any epoch requires about 3 hours because test set inference uses `Linear` + `SoftMax` with `batchsize=1`. +## Future Work -![LSTM NCE Learning curves](images/LSTM-NCE-curve.png) +I am currently working on a language modeling dataset based on one month of [reddit.com](https://www.reddit.com/) data. +Each sequence is basically a reddit submission consisting of a title, selftext, url, score, author and a thread of comments. +These sequences are much longer (average of 205 tokens) than the sentences that make up the GBW dataset (average of 26 tokens). +Training is still underway, but to pique your interest, this is an example of generated data (indentation and line breaks added for clarity): -As you can see, most of the learning is done in the first epochs. -Nevertheless, the training and validation error are consistently reduced training progresses. +```xml +<SUBMISSION> + <AUTHOR> http://www.reddit.com/u/[deleted] </AUTHOR> + <SCORE> 0 </SCORE> + <TITLE> + [ WP ] You take a picture of a big bang . + You discover an alien that lives in the center of the planet in an unknown way . + You can say " what the fuck is that ? " + </TITLE> + <COMMENTS> + <CoMMeNT> + <ScoRE> 2 </ScoRE> + <AuTHoR> http://www.reddit.com/u/Nev2k </AuTHoR> + <BodY> + I have a question . + When i was younger , my parents had a house that had a living room in it . + One that was only a small portion of an entire level . + This was a month before i got my money . 
+ If i was living in a house with a " legacy " i would make some mistakes . + When i was a child , i did n't know how to do shit about the house . + My parents got me into my own house and i never found a place to live . + So i decide to go to college . + I was so freaked out , i didnt have the drive to see them . + I never had a job , i was n't going anywhere . + I was so happy . + I knew i was going to be there . + I gave myself a job and my parents came . + That 's when i realized that i was in the wrong . + So i started to go . + I couldnt decide how long i wanted to live in this country . + I was so excited about the future . + I had a job . + I saved my money . + I did n't have a job . + I went to a highschool in a small town . + I had a job . + A job . + I did n't know what to do . + I was terrified of losing my job . + So i borrowed my $ 1000 in an hour . + I could n't afford to pay my rent . + I was so low on money . + I had my parents and i got into a free college . + I got in touch with my parents . + All of my friends were dead . + I was still with my family for a week . + I became a good parent . + I was a good choice . + When i got on my HSS i was going to go to my parents ' house . + I started to judge my parents . + I had a minor problem . + My parents . + I was so fucking bad . + My sister had a voice that was very loud . + I 'm sure my cousins were in a place where i could just hear my voice . + I felt like i was supposed to be angry . + I was so angry . + To cope with this . + My dad and i were both on break and i felt so alone . + I got unconscious and my mum left . + When I got to college , i was back in school . + I was a good kid . + I was happy . + And I told myself I was ready . + I told my parents . + They always talked about how they were going to be a good mom , and that I was going to be ready for that . + They always wanted to help me . + I did n't know what to do . + I had to . + I tried to go back to my dad , because I knew a lot about my mom . + I loved her . + I cared about her . + We cared for our family . + The time together was my only relationship . + I loved my heart . + And I hated my mother . + I chose it . + I cried . I cried . I cried . I cried . I cried . I cried . I cried . + The tears were gone . + I cried . I cried . I cried . I cried . I cried . I cried . I cried . I cried . I cried . I cried . + I do n't know how to do it . + I do n't know how to deal with it . + I ca n't feel my emotions . + I ca n't get out of bed . + I ca n't sleep . + I ca n't tell my friends . + I just need to leave . + I want to leave . + I hate myself . + I hate feeling like I 'm being selfish . + I feel like I 'm not good enough anymore . + I need to find a new job . + I hate that I have to get my shit together . + I love my job . + I 'm having a hard time . + Why do I need to get a job ? + I have no job . + I have n't been feeling good lately . + I feel like I 'm going to be so much worse in the long run . + I feel so alone . + I ca n't believe I 'm so sad about going through my entire life . + </BodY> + <AuTHoR> http://www.reddit.com/u/Scarbarella </AuTHoR> + </CoMMeNT> + </COMMENTS> + <SUBREDDIT> http://www.reddit.com/r/offmychest </SUBREDDIT> + <SELFTEXT> + I do n't know what to do anymore . + I feel like I 'm going to die and I 'm going to be sick because I have no more friends . + I do n't know what to do about my depression and I do n't know where to go from here . + I do n't know how I do because I know I 'm scared of being alone . 
+ Any advice would be appreciated . + Love . + </SELFTEXT> +</SUBMISSION> +``` + +This particular sample is a little depressing, but incredibly human, which is one of the reasons I am so interested in reddit for language modeling. +But that might just be the nature of the `offmychest` subreddit. + +A less depressing sample is the following, which concerns the Destiny game. + +```xml +<SUBMISSION> + <SUBREDDIT> http://www.reddit.com/r/DestinyTheGame </SUBREDDIT> + <TITLE> + Does anyone have a link to the Destiny Grimoire that I can use to get my Xbox 360 to play ? + </TITLE> + <COMMENTS> + <CoMMeNT> + <AuTHoR> http://www.reddit.com/u/CursedSun </AuTHoR> + <BodY> + I 'd love to have a weekly reset . + </BodY> + <ScoRE> 1 </ScoRE> + </CoMMeNT> + </COMMENTS> + <SCORE> 0 </SCORE> + <SELFTEXT> + I have a few friends who are willing to help me out . + If I get to the point where I 'm not going to have to go through all the weekly raids , I 'll have to " complete " the raid . + I 'm doing the Weekly strike and then doing the Weekly ( and hopefully also the Weekly ) on Monday . + I 'm not planning to get the chest , but I am getting my first exotic that I just got done from my first Crota raid . + I 'm not sure how well it would work for the Nightfall and Weekly , but I do n't want to loose my progress . + I 'd love to get some other people to help me , and I 'm open to all suggestions . + I have a lot of experience with this stuff , so I figured it 's a good idea to know if I 'm getting the right answer . + I 'm truly sorry for the inconvenience . + </SELFTEXT> + <AUTHOR> <OOV> </AUTHOR> +</SUBMISSION> +``` + +The particular model (a 4x1500 LSTM with dropout) only backpropagates through 50 time-steps. +What would like to see is for the comments to actually answer the question posed by the title and selftext. +This is a very difficult semantic problem which I hope the Reddit dataset will help solve. +More to follow in my next Troch blog post. <a name='nce.ref'></a> ## References diff --git a/blog/_posts/images/LSTM.png b/blog/_posts/images/LSTM.png Binary files differnew file mode 100644 index 0000000..80c6067 --- /dev/null +++ b/blog/_posts/images/LSTM.png diff --git a/blog/_posts/images/small-vs-big-lstm.png b/blog/_posts/images/small-vs-big-lstm.png Binary files differnew file mode 100644 index 0000000..b580afd --- /dev/null +++ b/blog/_posts/images/small-vs-big-lstm.png |