author     Kenneth Heafield <github@kheafield.com>    2013-10-29 22:00:37 +0400
committer  Kenneth Heafield <github@kheafield.com>    2013-10-29 22:00:37 +0400
commit     78eecfdd7ef4cc0aef575c828c6fef747c63da19
tree       cbd1e84c871306a35e1352286f7749ccac4f60bc /README.md
parent     e4138ba17732e70bfe9ad8e806173c083a9ddd0e
Copy nplm-0.1 after removing some executable bits
Diffstat (limited to 'README.md')
-rw-r--r--  README.md | 188
1 file changed, 185 insertions, 3 deletions

2013-07-30

Prerequisites
-------------

Before compiling, you must have the following:

A C++ compiler and GNU make

Boost 1.47.0 or later
http://www.boost.org

Eigen 3.1.x
http://eigen.tuxfamily.org

Optional:

Intel MKL 11.x
http://software.intel.com/en-us/intel-mkl
Recommended for better performance.

Python 2.7.x, not 3.x
http://python.org

Cython 0.19.x
http://cython.org
Needed only for building Python bindings.

Building
--------

To compile, edit the Makefile to reflect the locations of the Boost
and Eigen include directories.

If you want to use the Intel MKL library (recommended if you have it),
uncomment the line

    MKL=/path/to/mkl

editing it to point to the MKL root directory.

By default, multithreading using OpenMP is enabled. To turn it off,
comment out the line

    OMP=1

Then run 'make install'. This creates several programs in the bin/
directory and a library lib/neuralLM.a.

Notes on particular configurations:

- Intel C++ compiler and OpenMP. With version 12, you may get a
  "pragma not found" error. This is reportedly fixed in ComposerXE
  update 9.

- Mac OS X and OpenMP. The Clang compiler (/usr/bin/c++) doesn't
  support OpenMP. If the g++ that comes with XCode doesn't work
  either, try the one installed by MacPorts (/opt/local/bin/g++ or
  /opt/local/bin/g++-mp-*).

Training a language model
-------------------------

Building a language model requires some preprocessing. In addition to
any preprocessing of your own (tokenization, lowercasing, mapping of
digits, etc.), prepareNeuralLM (run with --help for options) does the
following:

- Splits into training and validation data. The training data is used
  to actually train the model, while the validation data is used to
  check its performance.
- Creates a vocabulary of the k most frequent words, mapping all other
  words to <unk>.
- Adds start <s> and stop </s> symbols.
- Converts to numberized n-grams.

A typical invocation would be:

    prepareNeuralLM --train_text mydata.txt --ngram_size 3 \
        --n_vocab 5000 --words_file words \
        --train_file train.ngrams \
        --validation_size 500 --validation_file validation.ngrams

which would generate the files train.ngrams, validation.ngrams, and words.

These files are fed into trainNeuralNetwork (run with --help for
options). A typical invocation would be:

    trainNeuralNetwork --train_file train.ngrams \
        --validation_file validation.ngrams \
        --num_epochs 10 \
        --words_file words \
        --model_prefix model

After each pass through the data, the trainer will print the
log-likelihood of both the training data and validation data (higher
is better) and generate a series of model files called model.1,
model.2, and so on. You choose which model you want based on the
validation log-likelihood.

You can find a working example in the example/ directory. The Makefile
there generates a language model from a raw text file.

Notes:

- Vocabulary. You should set --n_vocab to something less than the
  actual vocabulary size of the training data (and will receive a
  warning if it's not). Otherwise, no probability will be learned for
  unknown words. On the other hand, there is no need to limit n_vocab
  for the sake of speed. At present, we have tested it up to 100000.

- Normalization. Most of the computational cost normally (no pun
  intended) associated with a large vocabulary has to do with
  normalization of the conditional probability distribution
  P(word | context); a brief sketch of this term appears after these
  notes. The trainer uses noise-contrastive estimation to avoid this
  cost during training (Gutmann and Hyvärinen, 2010), and, by default,
  sets the normalization factors to one to avoid this cost during
  testing (Mnih and Hinton, 2009).

  If you set --normalization 1, the trainer will try to learn the
  normalization factors, and you should accordingly turn on
  normalization when using the resulting model. The default initial
  value --normalization_init 0 should be fine; you can try setting it
  a little higher, but not lower.

- Validation. The trainer computes the log-likelihood of a validation
  data set (which should be disjoint from the training data). You use
  this to decide when to stop training, and the trainer also uses it
  to throttle the learning rate. This computation always uses exact
  normalization and is therefore much slower, per instance, than
  training. Therefore, you should make the validation data
  (--validation_size) as small as you can. (For example, Section 00 of
  the Penn Treebank has about 2000 sentences and 50,000 words.)
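
The following is a rough sketch of where that normalization cost comes
from; the score function s(w, c) is illustrative notation, not
something defined by the toolkit. A normalized model computes

    P(w | c) = exp(s(w, c)) / Z(c)
    Z(c)     = sum over all vocabulary words w' of exp(s(w', c))

Evaluating Z(c) means summing over the entire vocabulary, and that is
the cost which noise-contrastive estimation avoids during training and
which fixing the normalization factors to one avoids during testing.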

Python code
-----------

prepareNeuralLM.py performs the same function as prepareNeuralLM, but in
Python. This may be handy if you want to make modifications.

nplm.py is a pure Python module for reading and using language models
created by trainNeuralNetwork. See testNeuralLM.py for example usage.

In src/python are Python bindings (using Cython) for the C++ code. To
build them, run 'make python/nplm.so'.

Using in a decoder
------------------

To use the language model in a decoder, include neuralLM.h and link
against neuralLM.a. This provides a class nplm::neuralLM, with the
following methods:

    void set_normalization(bool normalization);

Turn normalization on or off (default: off). If normalization is off,
the probabilities output by the model will not be normalized. In
general, this means that summing over all possible words will not give
a probability of one. If normalization is on, the model computes exact
probabilities (too slow to be recommended for decoding).

    void set_map_digits(char c);

Map all digits (0-9) to the specified character. This should match
whatever mapping you used during preprocessing.

    void set_log_base(double base);

Set the base of the log-probabilities returned by lookup_ngram. The
default is e (natural log), whereas most other language modeling
toolkits use base 10.

    void read(const string &filename);

Read a model from file.

    int get_order();

Return the order of the language model.

    int lookup_word(const string &word);

Map a word to an index for use with lookup_ngram().

    double lookup_ngram(const vector<int> &ngram);
    double lookup_ngram(const int *ngram, int n);

Look up the log-probability of ngram.
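
To make the calling sequence concrete, here is a minimal sketch of
decoder-side usage. It assumes the class can be default-constructed and
that a model file (here called model.10) was written by
trainNeuralNetwork; the file name, digit character, and example words
are illustrative only:

    #include <iostream>
    #include <string>
    #include <vector>
    #include "neuralLM.h"   // link against lib/neuralLM.a

    int main() {
        nplm::neuralLM lm;
        lm.read("model.10");          // model file from trainNeuralNetwork (name illustrative)
        lm.set_log_base(10.0);        // default is natural log
        lm.set_map_digits('@');       // match whatever digit mapping was used in preprocessing
        lm.set_normalization(false);  // unnormalized scores; exact normalization is slow

        // Build an n-gram of word indices; its length should equal the model order.
        std::vector<int> ngram;
        ngram.push_back(lm.lookup_word("<s>"));
        ngram.push_back(lm.lookup_word("the"));
        ngram.push_back(lm.lookup_word("cat"));

        if ((int)ngram.size() == lm.get_order())
            std::cout << lm.lookup_ngram(ngram) << std::endl;
        return 0;
    }

When compiling such a program, you will typically need the same Boost
and Eigen include paths that were used to build lib/neuralLM.a.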