Repository: github.com/torch/torch.github.io.git
author     nicholas-leonard <nick@nikopia.org>  2016-07-21 23:19:47 +0300
committer  nicholas-leonard <nick@nikopia.org>  2016-07-21 23:19:47 +0300
commit     44bba0b99649a02798671d02713450c0bc5fb8eb
tree       8d3c080f46ccf7e08efc74324aafcb3f8d829af4
parent     dd653411b4670fe571eaee13b53a69b566cd04ec

    final version?
 blog/_posts/2016-05-11-nce.md            | 329
 blog/_posts/images/LSTM.png              | bin 0 -> 82623 bytes
 blog/_posts/images/small-vs-big-lstm.png | bin 0 -> 38671 bytes
 3 files changed, 293 insertions(+), 36 deletions(-)
diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md
index 310c63f..92a46dc 100644
--- a/blog/_posts/2016-05-11-nce.md
+++ b/blog/_posts/2016-05-11-nce.md
@@ -20,8 +20,8 @@ picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_po
In this Torch blog post, we use noise contrastive estimation (NCE) [[2]](#nce.ref)
to train a multi-GPU recurrent neural network language model (RNNLM)
on the Google billion words (GBW) dataset [[7]](#nce.ref).
-This blog post is the result of many months of work.
-The enormity of the dataset caused us to contribute some novel open-source modules, criteria and even a multi-GPU tensor.
+The work presented here is the result of many months of on-and-off work.
+The enormity of the dataset caused us to contribute some novel open-source Torch modules, criteria and even a multi-GPU tensor.
We also provide scripts so that you can train and evaluate your own language models.
<a name='nce.char'></a>
@@ -74,18 +74,52 @@ require less memory and have faster inference than their word-level counterparts
Our task is to build a language model which maximizes the likelihood of the
next word given the history of previous words in the sentence.
-The following figure illustrates the a simple recurrent neural network (Simple RNN) language model:
+The following figure illustrates the workings of a simple recurrent neural network (Simple RNN) language model:
![rnnlm](images/rnnlm.png)
+The exact implementation is as follows:
+
+```lua
+h[t] = σ(W[x->h]x[t] + W[h->h]h[t−1] + b[1->h]) (1)
+y[t] = softmax(W[x->y]h[t] + b[1->y]) (2)
+```
+
For this particular example, the model should maximize "is" given "what", and then "the" given "is" and so on.
-The Simple RNN has an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the following target words.
-Simple RNNs are not the only kind of model that can be used model language.
+The Simple RNN has an internal hidden state `h[t]` which summarizes the sequence fed in so far, as it relates to maximizing the likelihood of the remaining words in the sequence.
+Internally, the Simple RNN has parameters from input to hidden (word embeddings), hidden to hidden (recurrent connections) and hidden to output (output embeddings that feed into a softmax).
+The input to hidden parameters consist of a `LookupTable` that learns to represent each word as a vector.
+These vectors form an embedding space for words.
+The input `x[t]` to the `LookupTable` is a unique integer associated with the word `w[t]`.
+The embedding vector for that word is obtained by indexing the embedding space `W[x->h]` which we represent by `W[x->h]x[t]`.
+The hidden to hidden parameters model the temporal dependencies of words by generating a hidden state `h[t]` given `h[t-1]` and `x[t]`.
+This is where the actual recurrence takes place as `h[t]` is a function of `h[t-1]` (and word `x[t]`).
+The hidden to output layer does an affine transform (i.e. a `Linear` module: `W[x->y]h[t] + b[1->y]`) followed by a `softmax`.
+This estimates a probability distribution `y[t]` over the next word given the previous words, a history which is embodied by the hidden state `h[t]`.
+The criterion is to maximize the likelihood of the next word `w[t+1]` given previous words:
+`P(w[t+1]|w[1],w[2],...,w[t])`.
+
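+To make eq. 1 and 2 concrete, here is a rough sketch of how such a model could be assembled
+from the modules mentioned above (the sizes are hypothetical and this isn't the exact code used in this post):
+
+```lua
+require 'rnn'
+
+local vocabsize, hiddensize = 10000, 200 -- hypothetical sizes
+
+-- eq. 1 : the recurrent module receives {W[x->h]x[t], h[t-1]}
+local rm = nn.Sequential()
+   :add(nn.ParallelTable()
+      :add(nn.Identity())                      -- the word embedding W[x->h]x[t]
+      :add(nn.Linear(hiddensize, hiddensize))) -- W[h->h] and b[1->h]
+   :add(nn.CAddTable())                        -- sum both terms
+   :add(nn.Sigmoid())                          -- the σ of eq. 1
+
+local lm = nn.Sequential()
+   :add(nn.LookupTable(vocabsize, hiddensize)) -- W[x->h] : the word embedding space
+   :add(nn.Recurrence(rm, hiddensize, 1))      -- h[t] as a function of h[t-1] and x[t]
+   :add(nn.Linear(hiddensize, vocabsize))      -- W[x->y] and b[1->y] of eq. 2
+   :add(nn.LogSoftMax())                       -- log of the softmax in eq. 2 (pairs with ClassNLLCriterion)
+
+-- apply the per-time-step module above to a whole sequence of word indices
+lm = nn.Sequencer(lm)
+```
+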
+Simple RNNs are easy to build using the [rnn](https://github.com/Element-Research/rnn) package (see [simple RNN example](https://github.com/Element-Research/rnn/blob/master/examples/simple-recurrence-network.lua)),
+but they are not the only kind of model that can be used to model language.
There are also the more advanced Long Short Term Memory (LSTM) models [[3],[4],[5]](#nce.ref), which
have special gated cells that facilitate the backpropagation of gradients through longer sequences.
-LSTMs can learn dependencies seperated between much longer time-steps.
-Like convolutions, these LSTM layers can also be stacked to form deeper models.
-In the Building
+
+![lstm](images/lstm.png)
+
+The exact implementation is as follows:
+
+```lua
+i[t] = σ(W[x->i]x[t] + W[h->i]h[t−1] + b[1->i]) (3)
+f[t] = σ(W[x->f]x[t] + W[h->f]h[t−1] + b[1->f]) (4)
+z[t] = tanh(W[x->c]x[t] + W[h->c]h[t−1] + b[1->c]) (5)
+c[t] = f[t]c[t−1] + i[t]z[t] (6)
+o[t] = σ(W[x->o]x[t] + W[h->o]h[t−1] + b[1->o]) (7)
+h[t] = o[t]tanh(c[t]) (8)
+```
+
+The main advantage is that LSTMs can learn dependencies between words separated by many more time-steps.
+They aren't as prone to the problem of vanishing gradients, as the different gates can preserve the gradients during back-propagation.
+To create an LM, the word embeddings (`W[x->h]x[t]` in eq. 1) would be fed to the LSTM, and the resulting hidden state would be fed to eq. 2.
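+
+As a rough sketch (hypothetical sizes again; the real multi-layer model is assembled in the
+[Building a multi-layer LSTM](#nce.lstm) section below), eq. 3 to 8 are available as a single `SeqLSTM` layer,
+so an LSTM language model looks something like this:
+
+```lua
+require 'rnn'
+
+local vocabsize, hiddensize = 10000, 200 -- hypothetical sizes
+
+local lm = nn.Sequential()
+   :add(nn.LookupTable(vocabsize, hiddensize))          -- W[x->h]x[t] of eq. 1 (word embeddings)
+   :add(nn.SeqLSTM(hiddensize, hiddensize))             -- eq. 3 to 8, applied over the whole sequence
+   :add(nn.SplitTable(1))                               -- one batchsize x hiddensize tensor per time-step
+   :add(nn.Sequencer(nn.Linear(hiddensize, vocabsize))) -- W[x->y] and b[1->y] of eq. 2
+   :add(nn.Sequencer(nn.LogSoftMax()))                  -- log of the softmax in eq. 2
+```
+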
<a name='nce.gbw'></a>
## Loading the Google billion words dataset
@@ -152,10 +186,10 @@ print(inputs)
[torch.DoubleTensor of size 8x2]
```
-Each column is vector containing potentially multiple sequences, i.e. a multi-sequence.
-Independent sequences are seperated by zeros. We will see later how the
+Each column is a vector containing potentially multiple sequences, i.e. a multi-sequence.
+Independent sequences are separated by zeros. In the next section, we will see how the
[rnn](https://github.com/Element-Research/rnn) package can use these zero-masked time-steps to
-efficiently forget its hidden state between independent sequences, at the granularity of columns.
+efficiently forget its hidden state between independent sequences (at the granularity of columns).
For now, notice how the original `sequences` are contained in the returned `inputs` and separated by zeros.
The `targets` are similar to the `inputs`, but use masks of 1 to separate sequences (as `ClassNLLCriterion` will otherwise complain).
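+
+The forgetting itself happens on the model side.
+As a minimal illustration of the idea (my own sketch, not the code used later in this post),
+the [rnn](https://github.com/Element-Research/rnn) package provides zero-masking variants of its layers:
+
+```lua
+-- LookupTableMaskZero maps index 0 to a zero vector, and a SeqLSTM with
+-- zero-masking enabled resets its hidden state at masked time-steps, so the
+-- independent sequences within a column don't leak into each other.
+local lookup = nn.LookupTableMaskZero(vocabsize, hiddensize)
+local lstm = nn.SeqLSTM(hiddensize, hiddensize)
+lstm.maskzero = true -- assumption: the flag name used by SeqLSTM
+```
+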
@@ -250,6 +284,9 @@ Now that we feel confident in our dataset, lets look at the model.
<a name='nce.lstm'></a>
## Building a multi-layer LSTM
+In this section, we get down to the business of actually building our multi-layer LSTM.
+Starting from the input layer, we will work our way up to the output layer, which is where we will introduce NCE.
+
The input layer of the `lm` model is a lookup table:
```lua
@@ -277,7 +314,7 @@ for i,hiddensize in ipairs(opt.hiddensize) do
end
```
-The `SeqLSTM` implemention is very fast and it benchmarked by the [rnn-benchmarks](https://github.com/glample/rnn-benchmarks#lstm).
+The `SeqLSTM` implementation is very fast and is benchmarked by the [rnn-benchmarks](https://github.com/glample/rnn-benchmarks#lstm) repository.
Next we split the output of the SeqLSTM (which is a `seqlen x batchsize x outputsize` Tensor) into a table containing a `batchsize x outputsize` tensor for
each time-step:
@@ -285,7 +322,7 @@ each time-step:
lm:add(nn.SplitTable(1))
```
-### Output layer bottleneck
+### The problem: bottleneck at the output layer
With its small vocabulary of 10000 words, the Penn Tree Bank dataset is relatively easy to use to build word-level language models.
The output layer is still computationally tractable for both training and inference, especially for GPUs.
@@ -298,16 +335,16 @@ outputlayer = nn.Sequential()
```
However, when training with large vocabularies, like the 793471 words that make up the GBW dataset,
-the output layer quickly becomes a bottle neck.
-If you are training your model with a `batchsize = 32` (number of sequences per batch) and a `seqlen = 100`
+the output layer quickly becomes a bottleneck.
+For example, if you are training your model with a `batchsize = 32` (number of sequences per batch) and a `seqlen = 100`
(size of sequence to backpropagate through time),
the output of that layer will have shape `seqlen x batchsize x vocabsize`, or `100 x 32 x 793471`.
-For a `FloatTensor` or `CudaTensor`, that single tensor will take up 10.156GB of memory.
+For a `FloatTensor` or `CudaTensor`, that single tensor will take up 10.156GB of memory!
That number is doubled for gradients, and doubled again as both the Linear and SoftMax store a copy of their output.
If somehow you can find a way to put >40GB on a GPU (or distribute it over many), you then run into the problem of
forward/backward propagating through that `outputlayer` in a reasonable time-frame.
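+
+For the record, the 10.156GB figure is just the size of that single output tensor:
+
+```lua
+-- seqlen x batchsize x vocabsize elements, 4 bytes each for a FloatTensor/CudaTensor
+print(100 * 32 * 793471 * 4 / 1e9) -- 10.156... GB
+```
+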
-### Noise contrastive estimation
+### The solution: noise contrastive estimation
The output layer of the LM uses Noise Contrastive Estimation (NCE) to speed up training and reduce memory consumption:
@@ -333,11 +370,39 @@ nn.Sequential():add(nn.Linear(inputsize, #trainset.ivocab)):add(nn.LogSoftMax())
Along with the [NCECriterion](https://github.com/Element-Research/dpnn#nn.NCECriterion),
the `NCEModule` implements the algorithm described in [[1]](#nce.ref).
-I won't go into the details of the algorithm as it involves a lot of math.
+I won't go into the details of the algorithm as it involves a lot of math which is more appropriately detailed in the reference papers.
The way it works is that for each target word (the likelihood of which we want to maximize),
`k` words are sampled from a noise distribution, which is typically the unigram distribution.
-The `unigram` above is a tensor of size 793470 where each element is the frequency of the commensurate word in the corpus.
+Remember that a softmax is basically:
+
+```lua
+ exp(x[i])
+y[i] = --------------------------------- (9)
+ exp(x[1])+exp(x[2])+...+exp(x[n])
+```
+
+where `x[i]` is the `i`-th output of the output `Linear` layer.
+The above denominator is the cause of the bottleneck as the `Linear` needs to be computed for each output `x[i]`.
+For an `n=793470` vocabulary, this is prohibitively expensive.
+NCE gets around this problem by replacing the denominator of eq. 9 with a constant `Z` during training:
+
+```lua
+ exp(x[i])
+y[i] = ------------ (10)
+ Z
+```
+
+Now, this is not quite what happens during training, as back-propagating through eq. 10 alone would not produce gradients
+for the `x[j]` where `j~=i` (`j` not equal to `i`).
+Notice that backpropagating through eq. 9 will produce gradients for all outputs `x` of the `Linear` (i.e. for all `i`).
+Another problem with eq. 10 is that nothing is pushing `exp(x[1])+exp(x[2])+...+exp(x[n])` to approximate `Z`.
+What NCE does is formulate the problem such that `k` noise samples can be included in the equation to
+both make sure that some (at most `k`) negative samples (i.e. `x[j]` where `j~=i`) get gradients and that the denominator of eq. 9 approximates the denominator of eq. 10.
+The `k` noise samples are sampled from a noise distribution, i.e. the unigram distribution.
+The output layer `Linear` need only be computed for the target and noise-sampled words, which is where the efficiency is gained.
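+
+In code, this means the output layer and criterion are swapped for their NCE counterparts.
+A rough sketch, assuming an `NCEModule(inputsize, outputsize, k, unigram, Z)` constructor
+(the post's actual setup lives in the training script):
+
+```lua
+local k = 25 -- hypothetical number of noise samples per target word
+
+-- replaces nn.Linear(inputsize, #trainset.ivocab) followed by nn.LogSoftMax()
+local outputlayer = nn.NCEModule(inputsize, #trainset.ivocab, k, unigram, Z)
+
+-- replaces nn.ClassNLLCriterion()
+local crit = nn.NCECriterion()
+```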
+
+The `unigram` variable above is a tensor of size 793470 where each element is the frequency of the corresponding word in the corpus.
Sampling from such a large distribution using something like [torch.multinomial](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.multinomial)
can become a bottleneck during training.
So we implemented a more efficient version in [torch.AliasMultinomial](https://github.com/nicholas-leonard/torchx/blob/master/AliasMultinomial.lua).
@@ -348,22 +413,25 @@ For the Softmax, which NCE tries to approximate, the `Z` is the sum over the `ex
For NCE, the `Z` is typically fixed to `Z=1`.
Our initial experiments found that setting `Z` to `Z=N*mean(exp(x[i]))`
(where `N` is the number of words and the `mean` is approximated over a small batch of word samples `i`)
-gave much better results.
+gave much better results, but this is because we weren't appropriately initializing the output layer parameters.
-One notable aspect of NCE papers (there are many) is that they often forget to mention the importance of parameter initialization.
+One notable aspect of NCE papers (there are many) is that they often forget to mention the importance of this parameter initialization.
Setting `Z=1` is only really possible if the `NCEModule.bias` is initialized to `bias[i] = -log(N)`.
This is what the authors of [[2]](#nce.ref) use, although it isn't mentioned in the paper (I contacted one of the authors to find out).
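+
+In code, that initialization is a one-liner (a sketch, assuming `outputlayer` is the `NCEModule`
+and `N` is the vocabulary size):
+
+```lua
+-- with Z=1, initialize the NCE output bias to bias[i] = -log(N)
+local N = #trainset.ivocab
+outputlayer.bias:fill(-math.log(N))
+```
+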
Sampling `k` noise samples per time-step and per batch-row means that the `NCEModule` needs to internally use something like
[torch.baddbmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.baddbmm) to compute the `output`.
Reference [[2]](#nce.ref) implements a faster version where the noise samples are drawn once and used for the entire batch (but still once for each time-step).
-This make the code a bit faster as the more efficient [torch.addmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.addmm) can be used.
+This makes the code a bit faster as the more efficient [torch.addmm](https://github.com/torch/torch7/blob/master/doc/maths.md#torch.addmm) can be used instead of `torch.baddbmm`.
This faster NCE version described in [[2]](#nce.ref) is the default implementation of the `NCEModule`. Sampling per batch-row can be turned on with `NCEModule.rownoise=true`.
<a name='nce.script'></a>
## Training and evaluation scripts
-The experiments presented here use three scripts: two for training and one for evaluation.
+The experiments presented here use three scripts: two for training (you only need to use one) and one for evaluation.
+The training scripts differ only in the number of GPUs they use.
+Both train a language model on the training set and do early-stopping on the validation set.
+The evaluation script is used to measure the perplexity of a trained model on the test set, or to generate sentences.
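+
+As a reminder (my note, not something the scripts print in this exact form), perplexity is the
+exponential of the mean negative log-likelihood per word:
+
+```lua
+-- PPL = exp( (1/T) * sum over t of -log P(w[t] | w[1], ..., w[t-1]) )
+local function perplexity(sumNLL, nwords)
+   return math.exp(sumNLL / nwords)
+end
+```
+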
### Single-GPU training script
@@ -492,7 +560,7 @@ method to distribute the `weight` on `gradWeight` on different devices.
This is accomplished by swapping the weight tensors for our own: [torch.MultiCudaTensor](https://github.com/nicholas-leonard/torchx/blob/master/MultiCudaTensor.lua).
Lua has no strict type-checking system, so you can fake a tensor by creating a `torch.class` table with the same methods.
To save time, the current version of `MultiCudaTensor` only supports the operations required by the NCEModule.
-The advantage of this approach is that it requires minimal changes to the NCEModule and maintains backward compatiblity without requiring redundant code or excessive refactoring.
+The advantage of this approach is that it requires minimal changes to the `NCEModule` and maintains backward compatibility without requiring redundant code or excessive refactoring.
```lua
-- output layer
@@ -569,11 +637,29 @@ On the 4-layer LSTM with 2048 hidden units, [[1]](#nce.ref) obtain 43.2 perplexi
After early-stopping on a sub-set of the validation set (at 100 epochs of training where 1 epoch is 128 sequences x 400k words/sequence), our model was able to reach *40.61* perplexity.
This model was run on 4x12GB NVIDIA Titan X GPUs.
-Training requires approximately 40GB of memory, distributed across the 4 GPU devices.
+Training requires approximately 40GB of memory distributed across the 4 GPU devices, and takes about 2 to 3 weeks.
As in the original paper, we do not make use of momentum as it provides little benefit and requires 50% more memory.
Training runs at about 3800 words/second.
+### Learning Curves
+
+The following figure outlines the learning curves for the above 4x2048 LSTM model.
+The figure plots the NCE training and validation error for the model, which is the error output by the `NCEModule`.
+Test set error isn't plotted as doing so for any epoch requires about 3 hours because test set inference uses `Linear` + `SoftMax` with `batchsize=1`.
+
+![LSTM NCE Learning curves](images/LSTM-NCE-curve.png)
+
+As you can see, most of the learning is done in the first epochs.
+Nevertheless, the training and validation errors are consistently reduced as training progresses.
+
+The following figure compares the validation learning curves (again, NCE error) for a small 2x250 LSTM (no dropout) and a big 4x2048 LSTM (with dropout).
+
+![Small vs Big LSTM](images/small-vs-big-lstm.png)
+
+What I find impressive about this figure is how quickly the higher-capacity model bests the lower-capacity model.
+This clearly demonstrates the importance of capacity when optimizing large-scale language models.
+
### Generating Sentences
Here are some sentences sampled independently from the 4-layer LSTM with a `temperature` of 0.7:
@@ -633,26 +719,197 @@ Here are some sentences sampled independently from the 4-layer LSTM with a `temp
The syntax seems quite reasonable, especially when comparing it to the previous results obtained from the [single-GPU 2x250 LSTM](#nce.eval).
However, in some cases, the semantics, i.e. the meaning of the words, is not so good.
-For example, the sentence `<S> Easy , it 's said , to have a child with autism . </S>` would make more sense, to me at least, by replacing `Easy` with `Not easy`.
-But then again, this sentence as nice semantics: `<S> The government of president Michelle Bachelet has promised to maintain a " strong and systematic " military presence in key areas and to tackle any issue of violence , including kidnappings . </S>`.
+For example, the sentence
+```xml
+<S> Easy , it 's said , to have a child with autism . </S>
+```
+would make more sense, to me at least, by replacing `Easy` with `Not easy`.
+
+On the other hand, sentences like this one demonstrate good semantics:
+
+```xml
+<S> The government of president Michelle Bachelet has promised to maintain a " strong and systematic " military presence in key areas and to tackle any issue of violence , including kidnappings . </S>
+```
+
[Michelle Bachelet](https://en.wikipedia.org/wiki/Michelle_Bachelet) was actually a president of Chile.
In her earlier life, she was also [kidnapped by military men](https://www.theguardian.com/world/2005/nov/22/chile.gender), so it kind of makes sense that she would be strong on the issue of kidnappings.
-Here is an example of some weird semantics : `<S> Even though there are many brands that have low voice and no connection to the Internet , the iPhone is a great deal for consumers . </S>`
+
+Here is an example of some weird semantics :
+
+```xml
+<S> Even though there are many brands that have low voice and no connection to the Internet , the iPhone is a great deal for consumers . </S>
+```
+
+The first part about `low voice` doesn't mean anything to me.
And I fail to see how there being `many brands that have no connection to the Internet` relates to `the iPhone is a great deal for consumers`.
But of course, all these sentences are generated independently, so the LM needs to learn to generate a meaning on the fly.
-This is hard as there is no context to the sentence.
+This is hard as there is no context to the sentence being generated.
-### Learning Curves
+In any case, I am quite happy with the results as they are definitely some of the most natural-looking synthetic sentences I have seen so far.
-The following figure outlines the learning curves for the above model.
-The figure plots the NCE training and validation error for the model, which is the error output but the `NCEModule`.
-Test set error isn't plotted as doing so for any epoch requires about 3 hours because test set inference uses `Linear` + `SoftMax` with `batchsize=1`.
+## Future Work
-![LSTM NCE Learning curves](images/LSTM-NCE-curve.png)
+I am currently working on a language modeling dataset based on one month of [reddit.com](https://www.reddit.com/) data.
+Each sequence is basically a reddit submission consisting of a title, selftext, url, score, author and a thread of comments.
+These sequences are much longer (average of 205 tokens) than the sentences that make up the GBW dataset (average of 26 tokens).
+Training is still underway, but to pique your interest, this is an example of generated data (indentation and line breaks added for clarity):
-As you can see, most of the learning is done in the first epochs.
-Nevertheless, the training and validation error are consistently reduced training progresses.
+```xml
+<SUBMISSION>
+ <AUTHOR> http://www.reddit.com/u/[deleted] </AUTHOR>
+ <SCORE> 0 </SCORE>
+ <TITLE>
+ [ WP ] You take a picture of a big bang .
+ You discover an alien that lives in the center of the planet in an unknown way .
+ You can say " what the fuck is that ? "
+ </TITLE>
+ <COMMENTS>
+ <CoMMeNT>
+ <ScoRE> 2 </ScoRE>
+ <AuTHoR> http://www.reddit.com/u/Nev2k </AuTHoR>
+ <BodY>
+ I have a question .
+ When i was younger , my parents had a house that had a living room in it .
+ One that was only a small portion of an entire level .
+ This was a month before i got my money .
+ If i was living in a house with a " legacy " i would make some mistakes .
+ When i was a child , i did n't know how to do shit about the house .
+ My parents got me into my own house and i never found a place to live .
+ So i decide to go to college .
+ I was so freaked out , i didnt have the drive to see them .
+ I never had a job , i was n't going anywhere .
+ I was so happy .
+ I knew i was going to be there .
+ I gave myself a job and my parents came .
+ That 's when i realized that i was in the wrong .
+ So i started to go .
+ I couldnt decide how long i wanted to live in this country .
+ I was so excited about the future .
+ I had a job .
+ I saved my money .
+ I did n't have a job .
+ I went to a highschool in a small town .
+ I had a job .
+ A job .
+ I did n't know what to do .
+ I was terrified of losing my job .
+ So i borrowed my $ 1000 in an hour .
+ I could n't afford to pay my rent .
+ I was so low on money .
+ I had my parents and i got into a free college .
+ I got in touch with my parents .
+ All of my friends were dead .
+ I was still with my family for a week .
+ I became a good parent .
+ I was a good choice .
+ When i got on my HSS i was going to go to my parents ' house .
+ I started to judge my parents .
+ I had a minor problem .
+ My parents .
+ I was so fucking bad .
+ My sister had a voice that was very loud .
+ I 'm sure my cousins were in a place where i could just hear my voice .
+ I felt like i was supposed to be angry .
+ I was so angry .
+ To cope with this .
+ My dad and i were both on break and i felt so alone .
+ I got unconscious and my mum left .
+ When I got to college , i was back in school .
+ I was a good kid .
+ I was happy .
+ And I told myself I was ready .
+ I told my parents .
+ They always talked about how they were going to be a good mom , and that I was going to be ready for that .
+ They always wanted to help me .
+ I did n't know what to do .
+ I had to .
+ I tried to go back to my dad , because I knew a lot about my mom .
+ I loved her .
+ I cared about her .
+ We cared for our family .
+ The time together was my only relationship .
+ I loved my heart .
+ And I hated my mother .
+ I chose it .
+ I cried . I cried . I cried . I cried . I cried . I cried . I cried .
+ The tears were gone .
+ I cried . I cried . I cried . I cried . I cried . I cried . I cried . I cried . I cried . I cried .
+ I do n't know how to do it .
+ I do n't know how to deal with it .
+ I ca n't feel my emotions .
+ I ca n't get out of bed .
+ I ca n't sleep .
+ I ca n't tell my friends .
+ I just need to leave .
+ I want to leave .
+ I hate myself .
+ I hate feeling like I 'm being selfish .
+ I feel like I 'm not good enough anymore .
+ I need to find a new job .
+ I hate that I have to get my shit together .
+ I love my job .
+ I 'm having a hard time .
+ Why do I need to get a job ?
+ I have no job .
+ I have n't been feeling good lately .
+ I feel like I 'm going to be so much worse in the long run .
+ I feel so alone .
+ I ca n't believe I 'm so sad about going through my entire life .
+ </BodY>
+ <AuTHoR> http://www.reddit.com/u/Scarbarella </AuTHoR>
+ </CoMMeNT>
+ </COMMENTS>
+ <SUBREDDIT> http://www.reddit.com/r/offmychest </SUBREDDIT>
+ <SELFTEXT>
+ I do n't know what to do anymore .
+ I feel like I 'm going to die and I 'm going to be sick because I have no more friends .
+ I do n't know what to do about my depression and I do n't know where to go from here .
+ I do n't know how I do because I know I 'm scared of being alone .
+ Any advice would be appreciated .
+ Love .
+ </SELFTEXT>
+</SUBMISSION>
+```
+
+This particular sample is a little depressing, but incredibly human, which is one of the reasons I am so interested in reddit for language modeling.
+But that might just be the nature of the `offmychest` subreddit.
+
+A less depressing sample is the following, which concerns the Destiny game.
+
+```xml
+<SUBMISSION>
+ <SUBREDDIT> http://www.reddit.com/r/DestinyTheGame </SUBREDDIT>
+ <TITLE>
+ Does anyone have a link to the Destiny Grimoire that I can use to get my Xbox 360 to play ?
+ </TITLE>
+ <COMMENTS>
+ <CoMMeNT>
+ <AuTHoR> http://www.reddit.com/u/CursedSun </AuTHoR>
+ <BodY>
+ I 'd love to have a weekly reset .
+ </BodY>
+ <ScoRE> 1 </ScoRE>
+ </CoMMeNT>
+ </COMMENTS>
+ <SCORE> 0 </SCORE>
+ <SELFTEXT>
+ I have a few friends who are willing to help me out .
+ If I get to the point where I 'm not going to have to go through all the weekly raids , I 'll have to " complete " the raid .
+ I 'm doing the Weekly strike and then doing the Weekly ( and hopefully also the Weekly ) on Monday .
+ I 'm not planning to get the chest , but I am getting my first exotic that I just got done from my first Crota raid .
+ I 'm not sure how well it would work for the Nightfall and Weekly , but I do n't want to loose my progress .
+ I 'd love to get some other people to help me , and I 'm open to all suggestions .
+ I have a lot of experience with this stuff , so I figured it 's a good idea to know if I 'm getting the right answer .
+ I 'm truly sorry for the inconvenience .
+ </SELFTEXT>
+ <AUTHOR> <OOV> </AUTHOR>
+</SUBMISSION>
+```
+
+The particular model (a 4x1500 LSTM with dropout) only backpropagates through 50 time-steps.
+What I would like to see is for the comments to actually answer the question posed by the title and selftext.
+This is a very difficult semantic problem which I hope the Reddit dataset will help solve.
+More to follow in my next Torch blog post.
<a name='nce.ref'></a>
## References
diff --git a/blog/_posts/images/LSTM.png b/blog/_posts/images/LSTM.png
new file mode 100644
index 0000000..80c6067
--- /dev/null
+++ b/blog/_posts/images/LSTM.png
Binary files differ
diff --git a/blog/_posts/images/small-vs-big-lstm.png b/blog/_posts/images/small-vs-big-lstm.png
new file mode 100644
index 0000000..b580afd
--- /dev/null
+++ b/blog/_posts/images/small-vs-big-lstm.png
Binary files differ