Welcome to mirror list, hosted at ThFree Co, Russian Federation.

github.com/moses-smt/mosesdecoder.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
path: root/doc
diff options
context:
space:
mode:
authorUlrich Germann <ugermann@inf.ed.ac.uk>2014-07-22 03:28:10 +0400
committerUlrich Germann <ugermann@inf.ed.ac.uk>2014-07-22 03:28:10 +0400
commitd097e31038f38e682029ff13275881597563be08 (patch)
tree2d6317452eb1b5142e3a52b06debcf3d44012c96 /doc
parentab06edda5b5ce64d6f624aca234146e6d222d407 (diff)
Added how-to for memory-mapped suffix array phrase tables.
Diffstat (limited to 'doc')
-rw-r--r--doc/Mmsapt.howto31
1 files changed, 31 insertions, 0 deletions
diff --git a/doc/Mmsapt.howto b/doc/Mmsapt.howto
new file mode 100644
index 000000000..6a48fa9c6
--- /dev/null
+++ b/doc/Mmsapt.howto
@@ -0,0 +1,31 @@
+How to use memory-mapped suffix array phrase tables in the moses decoder
+(phrase-based decoding only)
+
+1. Compile with the bjam switch --with-mm
+
+2. You need
+ - sentences aligned text files
+ - the word alignment between these files in symal output format
+
+3. Build binary files
+
+ Let
+ ${L1} be the extension of the language that you are translating from,
+ ${L2} the extension of the language that you want to translate into, and
+ ${CORPUS} the name of the word-aligned training corpus
+
+ % zcat ${CORPUS}.${L1}.gz | mtt-build -i -o /some/path/${CORPUS}.${L1}
+ % zcat ${CORPUS}.${L2}.gz | mtt-build -i -o /some/path/${CORPUS}.${L2}
+ % zcat ${CORPUS}.${L1}-${L2}.symal.gz | symal2mam /some/path/${CORPUS}.${L1}-${L2}.mam
+ % mmlex-build /some/path/${CORPUS} ${L1} ${L2} -o /some/path/${CORPUS}.${L1}-${L2}.lex -c /some/path/${CORPUS}.${L1}-${L2}.coc
+
+4. Define line in moses.ini
+
+ The best configuration of phrase table features is still under investigation.
+ For the time being, try this:
+
+ Mmsapt name=PT0 output-factor=0 num-features=9 base=/some/path/${CORPUS} L1=${L1} L2=${L2} pfwd=g pbwd=g smooth=0 sample=1000 workers=1
+
+ You can increase the number of workers for sampling (a bit faster),
+ but you'll lose replicability of the translation output.
+