KenLM b1daeaf for clang

author: Kenneth Heafield <github@kheafield.com> 2012-05-05 08:55:46 +0400
committer: Kenneth Heafield <github@kheafield.com> 2012-05-05 08:55:46 +0400
commit: f8d88920a1e0b2487429624f4a2c07ea717ee2aa (patch)
tree: abd4327970556084a7c7777e09fa5bcef0a86caf /lm
parent: 97cb9d7aaa29a23d421a5e873cb0ca85d47f8891 (diff)
7 files changed, 42 insertions, 27 deletions
diff --git a/lm/README b/lm/README
index d9307ed05..03c2da8f5 100644
--- a/lm/README
+++ b/lm/README
@@ -1,29 +1,44 @@
-Language model inference code by Kenneth Heafield <infer at kheafield.com>
-The official website is http://kheafield.com/code/kenlm/ .  If you're a decoder developer, please download the latest version from there instead of copying from Moses.  
+Language model inference code by Kenneth Heafield <kenlm at kheafield.com>
 
-While the primary means of building kenlm for use in Moses is the Moses build system, you can also compile independently using:
-./compile.sh to compile the code
-./test.sh to compile and run tests; requires Boost
-./clean.sh to clean
+THE GIT REPOSITORY https://github.com/kpu/kenlm IS WHERE ACTIVE DEVELOPMENT HAPPENS.  IT MAY RETURN SILENTLY WRONG ANSWERS OR BE SILENTLY BINARY-INCOMPATIBLE WITH STABLE RELEASES.  
 
-The rest of the documentation is directed at decoder developers.  
+The website http://kheafield.com/code/kenlm/ has more documentation.  If you're a decoder developer, please download the latest version from there instead of copying from another decoder.  
 
-Binary format via mmap is supported.  Run ./build_binary to make one then pass the binary file name instead.  
+Two data structures are supported: probing and trie.  Probing is a probing hash table with keys that are 64-bit hashes of n-grams and floats as values.  Trie is a fairly standard trie but with bit-level packing so it uses the minimum number of bits to store word indices and pointers.  The trie node entries are sorted by word index.  Probing is the fastest and uses the most memory.  Trie uses the least memory and is a bit slower.  
 
-Currently, it assumes POSIX APIs for errno, sterror_r, open, close, mmap, munmap, ftruncate, fstat, and read.  This is tested on Linux and the non-UNIX Mac OS X.  I welcome submissions porting (via #ifdef) to other systems (e.g. Windows) but proudly have no machine on which to test it.  
+With trie, resident memory is 58% of IRST's smallest version and 21% of SRI's compact version.  Simultaneously, trie's CPU use is 81% of IRST's fastest version and 84% of SRI's fast version.  KenLM's probing hash table implementation goes even faster at the expense of using more memory.  See http://kheafield.com/code/kenlm/benchmark/.  
 
-A brief note to Mac OS X users: your gcc is too old to recognize the pack pragma.  The warning effectively means that, on 64-bit machines, the model will use 16 bytes instead of 12 bytes per n-gram of maximum order (those of lower order are already 16 bytes) in the probing and sorted models.  The trie is not impacted by this.  
+Binary format via mmap is supported.  Run ./build_binary to make one then pass the binary file name to the appropriate Model constructor.   
 
-It does not depend on Boost or ICU.  However, if you use Boost and/or ICU in the rest of your code, you should define HAVE_BOOST and/or HAVE_ICU in util/have.hh.  Defining HAVE_BOOST will let you hash StringPiece.  Defining HAVE_ICU will use ICU's StringPiece to prevent a conflict with the one provided here.  
 
-The recommend way to use this:
-Copy the code and distribute with your decoder.  
-Set HAVE_ICU and HAVE_BOOST at the top of util/have.hh as instructed above.  
-Look at compile.sh and reimplement using your build system.  
-Use either the interface in lm/model.hh or lm/virtual_interface.hh
-Interface documentation is in comments of lm/virtual_interface.hh (including for lm/model.hh).  
+PLATFORMS
+murmur_hash.cc and bit_packing.hh perform unaligned reads and writes that make the code architecture-dependent.  
+It has been successfully tested on x86_64, x86, and PPC64.  
+ARM support is reportedly working, at least on the iPhone, but I cannot test this. 
 
-I recommend copying the code and distributing it with your decoder.  However, please send improvements to me so that they can be integrated into the package.  
+Runs on Linux, OS X, Cygwin, and MinGW.  
 
-Also included:
-A wrapper to SRI with the same interface.  
+Hideo Okuma and Tomoyuki Yoshimura from NICT contributed ports to ARM and MinGW.  Hieu Hoang is working on a native Windows port.  
+
+
+DECODER DEVELOPERS
+- I recommend copying the code and distributing it with your decoder.  However, please send improvements upstream as indicated in CONTRIBUTORS.  
+
+- It does not depend on Boost or ICU.  If you use ICU, define HAVE_ICU in util/have.hh (uncomment the line) to avoid a name conflict.  Defining HAVE_BOOST will let you hash StringPiece.  
+
+- Most people have zlib.  If you don't want to depend on that, comment out #define HAVE_ZLIB in util/have.hh.  This will disable loading gzipped ARPA files.  
+
+- There are two build systems: compile.sh and Jamroot+Jamfile.  They're pretty simple and are intended to be reimplemented in your build system.  
+
+- Use either the interface in lm/model.hh or lm/virtual_interface.hh.  Interface documentation is in comments of lm/virtual_interface.hh and lm/model.hh.  
+
+- There are several possible data structures in model.hh.  Use RecognizeBinary in binary_format.hh to determine which one a user has provided.  You probably already implement feature functions as an abstract virtual base class with several children.  I suggest you co-opt this existing virtual dispatch by templatizing the language model feature implementation on the KenLM model identified by RecognizeBinary.  This is the strategy used in Moses and cdec.
+
+- See lm/config.hh for tuning options.  
+
+
+CONTRIBUTORS
+Contributions to KenLM are welcome.  Please base your contributions on https://github.com/kpu/kenlm and send pull requests (or I might give you commit access).  Downstream copies in Moses and cdec are maintained by overwriting them so do not make changes there.  
+
+
+The name was Hieu Hoang's idea, not mine.  
diff --git a/lm/bhiksha.hh b/lm/bhiksha.hh
index 5182ee2e7..9734f3abd 100644
--- a/lm/bhiksha.hh
+++ b/lm/bhiksha.hh
@@ -23,7 +23,7 @@
 
 namespace lm {
 namespace ngram {
-class Config;
+struct Config;
 
 namespace trie {
 
diff --git a/lm/left.hh b/lm/left.hh
index a07f98038..308644228 100644
--- a/lm/left.hh
+++ b/lm/left.hh
@@ -78,7 +78,7 @@ struct Left {
 };
 
 inline size_t hash_value(const Left &left) {
-  return util::MurmurHashNative(&left.length, 1, left.pointers[left.length - 1]);
+  return util::MurmurHashNative(&left.length, 1, left.length ? left.pointers[left.length - 1] : 0);
 }
 
 struct ChartState {
diff --git a/lm/quantize.hh b/lm/quantize.hh
index 6d130a577..a81fe3aa2 100644
--- a/lm/quantize.hh
+++ b/lm/quantize.hh
@@ -16,7 +16,7 @@
 namespace lm {
 namespace ngram {
 
-class Config;
+struct Config;
 
 /* Store values directly and don't quantize. */
 class DontQuantize {
diff --git a/lm/trie.hh b/lm/trie.hh
index ebe9910f0..8fcd995ec 100644
--- a/lm/trie.hh
+++ b/lm/trie.hh
@@ -10,7 +10,7 @@
 
 namespace lm {
 namespace ngram {
-class Config;
+struct Config;
 namespace trie {
 
 struct NodeRange {
diff --git a/lm/trie_sort.hh b/lm/trie_sort.hh
index 3036319df..6ef17eb9f 100644
--- a/lm/trie_sort.hh
+++ b/lm/trie_sort.hh
@@ -25,7 +25,7 @@ namespace lm {
 class PositiveProbWarn;
 namespace ngram {
 class SortedVocabulary;
-class Config;
+struct Config;
 
 namespace trie {
 
diff --git a/lm/vocab.hh b/lm/vocab.hh
index 06fdefe49..343fc98a5 100644
--- a/lm/vocab.hh
+++ b/lm/vocab.hh
@@ -13,11 +13,11 @@
 #include <vector>
 
 namespace lm {
-class ProbBackoff;
+struct ProbBackoff;
 class EnumerateVocab;
 
 namespace ngram {
-class Config;
+struct Config;
 
 namespace detail {
 uint64_t HashForVocab(const char *str, std::size_t len);
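
Note on the code changes: the diff above does two things outside the README. It switches forward declarations from `class` to `struct` so that they match the definitions (clang warns about mismatched tags, which is presumably the "for clang" in the subject line), and it guards `hash_value` in lm/left.hh so that an empty `Left` no longer reads `pointers[-1]`. The following is a minimal sketch of both patterns, not the KenLM code itself: `LeftSketch`, `ConfigSketch`, `ToyHash`, `LastPointerOrZero`, `HashValue`, and the array capacity are hypothetical stand-ins, and `ToyHash` is a seeded FNV-1a used only in place of `util::MurmurHashNative`.

```cpp
#include <cstddef>
#include <stdint.h>

// Forward declaration with the same class-key ("struct") as the definition
// further down. Declaring it as "class ConfigSketch;" would still compile,
// but clang reports the mismatch (-Wmismatched-tags).
struct ConfigSketch;
inline bool HasOrder(const ConfigSketch &config);

// Hypothetical stand-in for lm::ngram::Config; the field is illustrative only.
struct ConfigSketch {
  unsigned int order;
};

inline bool HasOrder(const ConfigSketch &config) {
  return config.order != 0;
}

// Hypothetical stand-in for lm::ngram::Left; capacity chosen arbitrarily.
struct LeftSketch {
  static const unsigned char kMaxPointers = 4;
  uint64_t pointers[kMaxPointers];
  unsigned char length;
};

// Stand-in for util::MurmurHashNative(key, len, seed): seeded FNV-1a.
// This is NOT MurmurHash; it only mimics the (key, len, seed) call shape.
inline uint64_t ToyHash(const void *key, std::size_t len, uint64_t seed) {
  const unsigned char *bytes = static_cast<const unsigned char *>(key);
  uint64_t h = 14695981039346656037ULL ^ seed;
  for (std::size_t i = 0; i < len; ++i) {
    h ^= bytes[i];
    h *= 1099511628211ULL;
  }
  return h;
}

// The guard from the patch: with length == 0, "length - 1" would index
// before the start of the array, so fall back to a seed of 0 instead.
inline uint64_t LastPointerOrZero(const LeftSketch &left) {
  return left.length ? left.pointers[left.length - 1] : 0;
}

// Same shape as the patched hash_value: hash the one-byte length,
// seeded with the last pointer (or 0 when there are no pointers).
inline std::size_t HashValue(const LeftSketch &left) {
  return static_cast<std::size_t>(
      ToyHash(&left.length, 1, LastPointerOrZero(left)));
}
```

With this guard, hashing a default-initialized `LeftSketch` is well defined and deterministic, which matters when such states are used as keys in a hash table.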