github.com/torch/torch.github.io.git
author    nicholas-leonard <nick@nikopia.org>    2016-07-22 18:22:08 +0300
committer nicholas-leonard <nick@nikopia.org>    2016-07-22 18:22:08 +0300
commit    d68e87389470e2c3e1a739169e95d9e62544d617 (patch)
tree      82c3d617561fb67747fccf6d475a5bd61557f0fd
parent    1ed35ad412b8cbaf55c460e27179c7440a97cfe6 (diff)

    fixed title

-rw-r--r--  blog/_posts/2016-05-11-nce.md  |  59
1 file changed, 36 insertions(+), 23 deletions(-)
diff --git a/blog/_posts/2016-05-11-nce.md b/blog/_posts/2016-05-11-nce.md
index 97c0e9a..8492d70 100644
--- a/blog/_posts/2016-05-11-nce.md
+++ b/blog/_posts/2016-05-11-nce.md
@@ -1,13 +1,13 @@
---
layout: post
-title: Noise Contrastive Estimation
+title: Language modeling a billion words
comments: True
author: nicholas-leonard
excerpt: Noise contrastive estimation is used to train a multi-GPU recurrent neural network language model on the Google billion words dataset.
picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_posts/images/output_52iFki.gif
---
-<!---# Noise contrastive estimation for the Google billion words dataset -->
+<!---# Language modeling a billion words -->
* [Word versus character level language models](#nce.char)
* [Recurrent neural network language models](#nce.rnnlm)
@@ -16,6 +16,7 @@ picture: https://raw.githubusercontent.com/torch/torch.github.io/master/blog/_po
* [Training and evaluation scripts](#nce.script)
* [Results](#nce.result)
* [References](#nce.ref)
+ * [Future work](#nce.future)
In this Torch blog post, we use noise contrastive estimation (NCE) [[2]](#nce.ref)
to train a multi-GPU recurrent neural network language model (RNNLM)
@@ -63,11 +64,14 @@ After tokenization, the word-level model might view this sequence as containing
On the other hand, the char-level model will view this sequence as containing 102 tokens.
This longer sequence makes the task of the char-level model harder than the word-level model, as it
must take into account dependencies between more tokens over more time-steps.[[8]](#nce.ref)
+Another issue with character language models is that they need to learn spelling in
+addition to syntax, semantics, etc.
-The main advantage of char-level over word-level language models is that they
+The main advantage of character-level over word-level language models is that they
have a really small vocabulary. For example, the GBW dataset will contain approximately 800 characters
compared to 800,000 words (after pruning low-frequency tokens). In practice this means that char-level models will
require less memory and have faster inference than their word-level counterparts.
+Another advantage is that they do not require tokenization as a preprocessing step.
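To make the token-count comparison concrete, here is a minimal plain-Lua sketch of word-level versus character-level tokenization; the example sentence and counts below are illustrative and are not taken from the GBW data:

```lua
-- Minimal sketch: the same sentence tokenized at the word level and at the character level.
local sentence = "The cat sat on the mat ."

-- word-level: split on whitespace
local words = {}
for w in sentence:gmatch("%S+") do table.insert(words, w) end

-- char-level: every character (spaces included) becomes a token
local chars = {}
for c in sentence:gmatch(".") do table.insert(chars, c) end

print(#words) -- 7 tokens
print(#chars) -- 24 tokens
```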
<a name='nce.rnnlm'></a>
## Recurrent neural network language models
@@ -642,7 +646,7 @@ As in the original paper, we do not make use of momentum as it provides little b
Training runs at about 3800 words/second.
-### Learning Curves
+### Learning curves
The following figure outlines the learning curves for the above 4x2048 LSTM model.
The figure plots the NCE training and validation error for the model, which is the error output by the `NCEModule`.
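For readers unfamiliar with what this NCE error measures, here is a hedged sketch of the per-word NCE loss [[2]](#nce.ref): the true next word and k noise words sampled from the unigram distribution are each classified as data versus noise with a logistic loss. This is an illustrative reimplementation, not the `NCEModule` code used by the training scripts:

```lua
-- Hedged sketch of the per-word NCE loss (illustrative; not the blog's NCEModule code).
-- s_target : model score (unnormalized log-probability) of the true next word
-- s_noise  : Tensor of size k, model scores of the k sampled noise words
-- q_target : unigram (noise) probability of the true next word
-- q_noise  : Tensor of size k, unigram probabilities of the noise words
require 'torch'

local function nceLoss(s_target, s_noise, q_target, q_noise, k)
   -- log-odds that a word came from the data rather than from the noise distribution
   local d_target = s_target - math.log(k * q_target)
   local d_noise  = s_noise - torch.log(q_noise * k)
   -- the true word should be classified as data: -log(sigmoid(d_target))
   local loss = math.log(1 + math.exp(-d_target))
   -- each noise word should be classified as noise: -log(sigmoid(-d_noise))
   loss = loss + torch.log(torch.exp(d_noise):add(1)):sum()
   return loss
end

-- toy usage with k=25 noise samples for one target word
print(nceLoss(2.0, torch.randn(25), 1e-4, torch.rand(25):mul(1e-4), 25))
```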
@@ -660,7 +664,7 @@ The following figure compares the validation learning curves (again, NCE error) f
What I find impressive about this figure is how quickly the higher-capacity model bests the lower-capacity model.
This clearly demonstrates the importance of capacity when optimizing large-scale language models.
-### Generating Sentences
+### Generating sentences
Here are some sentences sampled independently from the 4-layer LSTM with a `temperature` of 0.7:
@@ -747,10 +751,11 @@ This is hard as there is no context to the sentence being generated.
In any case, I am quite happy with the results as they are definitely some of the most natural-looking synthetic sentences I have seen so far.
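Those samples were drawn with a `temperature` of 0.7: the model's output scores are divided by the temperature before the softmax, so values below 1 sharpen the distribution toward the most likely words. A minimal sketch of that sampling step follows; the function and the random scores are illustrative, not the sampling code from the training scripts:

```lua
-- Minimal sketch: draw the next word id from model scores with a temperature.
require 'torch'

local function sampleWithTemperature(scores, temperature)
   local scaled = torch.div(scores, temperature)       -- temperature < 1 sharpens, > 1 flattens
   local probs = torch.exp(scaled:add(-scaled:max()))  -- numerically stable softmax
   probs:div(probs:sum())
   return torch.multinomial(probs, 1)[1]               -- sample one word index
end

-- toy usage: random scores over a 10-word vocabulary, temperature 0.7
print(sampleWithTemperature(torch.randn(10), 0.7))
```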
-## Future Work
+<a name='nce.future'></a>
+## Future work
I am currently working on a language modeling dataset based on one month of [reddit.com](https://www.reddit.com/) data.
-Each sequence is basically a reddit submission consisting of a title, selftext, url, score, author and a thread of comments.
+Each sequence is basically a reddit submission consisting of a `TITLE`, `SELFTEXT` (or `URL`), `SCORE`, `AUTHOR` and a thread of `COMMENTS`.
These sequences are much longer (average of 205 tokens) than the sentences that make up the GBW dataset (average of 26 tokens).
Training is still underway, but to pique your interest, this is an example of generated data (indentation and line breaks added for clarity):
@@ -871,10 +876,12 @@ Training is still underway, but to pique your interest, this is an example of ge
</SUBMISSION>
```
-This particular sample is a little depressing, but incredibly human, which is one of the reasons I am so interested in reddit for language modeling.
-But that might just be the nature of the `offmychest` subreddit.
+This particular sample is a little depressing, but that might just be the nature of the `offmychest` subreddit...
+Conditioned on the opening `<SUBMISSION>` token, this generated sequence is incredibly human.
+Reading through the comment, I feel like I am reading a story written by an actual person.
+The ability to simulate human creativity is one of the reasons I am so interested in using reddit data for language modeling.
-A less depressing sample is the following, which concerns the Destiny game.
+A less depressing sample is the following, which concerns the [Destiny](https://en.wikipedia.org/wiki/Destiny_(video_game)) video game:
```xml
<SUBMISSION>
@@ -891,25 +898,31 @@ A less depressing sample is the following, which concerns the Destiny game.
<ScoRE> 1 </ScoRE>
</CoMMeNT>
</COMMENTS>
- <SCORE> 0 </SCORE>
- <SELFTEXT>
- I have a few friends who are willing to help me out .
- If I get to the point where I 'm not going to have to go through all the weekly raids , I 'll have to " complete " the raid .
- I 'm doing the Weekly strike and then doing the Weekly ( and hopefully also the Weekly ) on Monday .
- I 'm not planning to get the chest , but I am getting my first exotic that I just got done from my first Crota raid .
- I 'm not sure how well it would work for the Nightfall and Weekly , but I do n't want to loose my progress .
- I 'd love to get some other people to help me , and I 'm open to all suggestions .
- I have a lot of experience with this stuff , so I figured it 's a good idea to know if I 'm getting the right answer .
- I 'm truly sorry for the inconvenience .
- </SELFTEXT>
+ <SCORE> 0 </SCORE>
+ <SELFTEXT>
+ I have a few friends who are willing to help me out .
+ If I get to the point where I 'm not going to have to go through all the weekly raids , I 'll have to " complete " the raid .
+ I 'm doing the Weekly strike and then doing the Weekly ( and hopefully also the Weekly ) on Monday .
+ I 'm not planning to get the chest , but I am getting my first exotic that I just got done from my first Crota raid .
+ I 'm not sure how well it would work for the Nightfall and Weekly , but I do n't want to loose my progress .
+ I 'd love to get some other people to help me , and I 'm open to all suggestions .
+ I have a lot of experience with this stuff , so I figured it 's a good idea to know if I 'm getting the right answer .
+ I 'm truly sorry for the inconvenience .
+ </SELFTEXT>
<AUTHOR> <OOV> </AUTHOR>
</SUBMISSION>
```
+For those not familiar with this game, terms like
+[Grimoire](http://destiny.wikia.com/wiki/Grimoire), [weekly reset](https://www.vg247.com/tag/destiny-weekly-reset/),
+[raids](http://destiny.wikia.com/wiki/Raid), [Nightfall strike](http://destiny.wikia.com/wiki/Weekly_Nightfall_Strike),
+[exotics](http://destiny.wikia.com/wiki/Exotic) and [Crota raid](http://destiny.wikia.com/wiki/Crota%27s_End)
+may seem odd. But these are all part of the game vocabulary.
+
The particular model (a 4x1500 LSTM with dropout) only backpropagates through 50 time-steps.
-What would like to see is for the comments to actually answer the question posed by the title and selftext.
+What I would like to see is for the `COMMENTS` to actually answer the question posed by the `TITLE` and `SELFTEXT`.
This is a very difficult semantic problem which I hope the Reddit dataset will help solve.
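Backpropagating through only 50 time-steps on submissions that average 205 tokens typically means processing each sequence as consecutive chunks: the hidden state is carried across chunks while gradients stop at the chunk boundary. A rough sketch of that chunking (the names and the commented-out model calls are illustrative):

```lua
-- Rough sketch of truncated BPTT over one long token-id sequence (names are illustrative).
require 'torch'

local rho = 50                                            -- time-steps to backpropagate through
local sequence = torch.LongTensor(205):random(1, 10000)   -- one tokenized submission

for start = 1, sequence:size(1) - 1, rho do
   local stop   = math.min(start + rho - 1, sequence:size(1) - 1)
   local input  = sequence:sub(start, stop)               -- current words
   local target = sequence:sub(start + 1, stop + 1)       -- next words
   -- forward/backward on this chunk only; the LSTM hidden state is kept between
   -- chunks, but no gradient flows back across the chunk boundary, e.g.:
   -- model:forward(input) ; model:backward(input, gradOutput) ; model:updateParameters(lr)
end
```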
-More to follow in my next Troch blog post.
+More to follow in my next Torch blog post.
<a name='nce.ref'></a>
## References