author | Ulrich Germann <ugermann@inf.ed.ac.uk> | 2014-07-22 03:28:10 +0400 |
---|---|---|
committer | Ulrich Germann <ugermann@inf.ed.ac.uk> | 2014-07-22 03:28:10 +0400 |
commit | d097e31038f38e682029ff13275881597563be08 | |
tree | 2d6317452eb1b5142e3a52b06debcf3d44012c96 /doc | |
parent | ab06edda5b5ce64d6f624aca234146e6d222d407 |
Added how-to for memory-mapped suffix array phrase tables.
Diffstat (limited to 'doc')
-rw-r--r-- | doc/Mmsapt.howto | 31 |
1 files changed, 31 insertions, 0 deletions
diff --git a/doc/Mmsapt.howto b/doc/Mmsapt.howto
new file mode 100644
index 000000000..6a48fa9c6
--- /dev/null
+++ b/doc/Mmsapt.howto
@@ -0,0 +1,31 @@
+How to use memory-mapped suffix array phrase tables in the moses decoder
+(phrase-based decoding only)
+
+1. Compile with the bjam switch --with-mm
+
+2. You need
+   - sentences aligned text files
+   - the word alignment between these files in symal output format
+
+3. Build binary files
+
+   Let
+   ${L1} be the extension of the language that you are translating from,
+   ${L2} the extension of the language that you want to translate into, and
+   ${CORPUS} the name of the word-aligned training corpus
+
+   % zcat ${CORPUS}.${L1}.gz | mtt-build -i -o /some/path/${CORPUS}.${L1}
+   % zcat ${CORPUS}.${L2}.gz | mtt-build -i -o /some/path/${CORPUS}.${L2}
+   % zcat ${CORPUS}.${L1}-${L2}.symal.gz | symal2mam /some/path/${CORPUS}.${L1}-${L2}.mam
+   % mmlex-build /some/path/${CORPUS} ${L1} ${L2} -o /some/path/${CORPUS}.${L1}-${L2}.lex -c /some/path/${CORPUS}.${L1}-${L2}.coc
+
+4. Define line in moses.ini
+
+   The best configuration of phrase table features is still under investigation.
+   For the time being, try this:
+
+   Mmsapt name=PT0 output-factor=0 num-features=9 base=/some/path/${CORPUS} L1=${L1} L2=${L2} pfwd=g pbwd=g smooth=0 sample=1000 workers=1
+
+   You can increase the number of workers for sampling (a bit faster),
+   but you'll lose replicability of the translation output.
+
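
The how-to in this commit leaves ${CORPUS}, ${L1}, ${L2} and /some/path as placeholders. As a rough illustration of how they fit together, the sketch below expands steps 3 and 4 for concrete values. It only prints the commands and the moses.ini line; it does not run any Moses tool. The function names, the argument order, and the example values (europarl, fr, en) are assumptions made for this sketch and are not part of the commit. The leading % in the how-to is taken to be a shell prompt and is left out.

```shell
#!/bin/sh
# Sketch only: expands the placeholders used in doc/Mmsapt.howto.
# Function names and example values are hypothetical, not from the commit.

# Print the four build commands from step 3 (dry run, nothing is executed).
# Arguments: corpus name, source extension, target extension, output dir.
mmsapt_build_cmds() {
    corpus=$1; l1=$2; l2=$3; out=$4
    base="$out/$corpus"
    echo "zcat $corpus.$l1.gz | mtt-build -i -o $base.$l1"
    echo "zcat $corpus.$l2.gz | mtt-build -i -o $base.$l2"
    echo "zcat $corpus.$l1-$l2.symal.gz | symal2mam $base.$l1-$l2.mam"
    echo "mmlex-build $base $l1 $l2 -o $base.$l1-$l2.lex -c $base.$l1-$l2.coc"
}

# Print the moses.ini feature line from step 4 for the same base path.
# All feature settings are copied unchanged from the how-to.
mmsapt_ini_line() {
    corpus=$1; l1=$2; l2=$3; out=$4
    echo "Mmsapt name=PT0 output-factor=0 num-features=9 base=$out/$corpus L1=$l1 L2=$l2 pfwd=g pbwd=g smooth=0 sample=1000 workers=1"
}

# Example with made-up values: French to English, files under /some/path.
mmsapt_build_cmds europarl fr en /some/path
mmsapt_ini_line europarl fr en /some/path
```

Printing rather than executing keeps the sketch safe to run on a machine without the Moses binaries; once the output looks right, the printed build commands can be pasted into a shell. Note that the base= value in the moses.ini line must match the prefix passed to mtt-build, symal2mam and mmlex-build, which is why both functions derive it from the same arguments.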