How to use memory-mapped suffix array phrase tables in the Moses decoder
(phrase-based decoding only)

1. Compile with the bjam switch --with-mm

2. You need
   - sentence-aligned text files
   - the word alignment between these files in symal output format
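
   Sentence-aligned files have to line up line by line, so comparing line
   counts is a cheap sanity check before building anything. The following is
   a minimal sketch, not part of Moses: the function name same_line_count is
   made up for illustration, the inputs are assumed to be gzipped, and the
   alignment file is assumed to hold one line per sentence pair.

```shell
# Hypothetical helper (not a Moses tool): succeeds only if all given
# gzipped files contain the same number of lines, and prints that number.
same_line_count() {
    expected=""
    for f in "$@"; do
        # Count lines of the decompressed file; strip padding some wc versions add.
        n=$(gzip -dc "$f" | wc -l | tr -d ' ')
        if [ -z "$expected" ]; then
            expected=$n
        elif [ "$n" != "$expected" ]; then
            echo "line count mismatch: $f has $n lines, expected $expected" >&2
            return 1
        fi
    done
    echo "$expected"
}

# Usage (placeholder file names):
# same_line_count corpus.fr.gz corpus.en.gz corpus.fr-en.symal.gz
```

   A matching line count does not prove that the alignment is correct; it
   only catches truncated or shifted files early.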

3. Build the binary files

   Let
   ${L1} be the file extension of the language that you are translating from,
   ${L2} the file extension of the language that you want to translate into, and
   ${CORPUS} the name of the word-aligned training corpus.

   % zcat ${CORPUS}.${L1}.gz  | mtt-build -i -o /some/path/${CORPUS}.${L1}
   % zcat ${CORPUS}.${L2}.gz  | mtt-build -i -o /some/path/${CORPUS}.${L2}
   % zcat ${CORPUS}.${L1}-${L2}.symal.gz | symal2mam /some/path/${CORPUS}.${L1}-${L2}.mam
   % mmlex-build /some/path/${CORPUS} ${L1} ${L2} -o /some/path/${CORPUS}.${L1}-${L2}.lex -c /some/path/${CORPUS}.${L1}-${L2}.coc
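
   The four commands above can be collected into one shell function so that
   a failing step stops the build. This is only a sketch: the function name
   build_bitext and its fourth argument (the output directory, standing in
   for /some/path) are assumptions made for illustration, and the tool
   invocations are copied unchanged from the commands above.

```shell
# Hypothetical wrapper around the four build commands of step 3.
# Arguments: corpus base name, source extension, target extension, output dir.
build_bitext() {
    corpus=$1; l1=$2; l2=$3; out=$4
    base=$out/$(basename "$corpus")
    mkdir -p "$out" || return 1

    # Memory-mapped token tracks for both sides of the corpus.
    gzip -dc "$corpus.$l1.gz" | mtt-build -i -o "$base.$l1" || return 1
    gzip -dc "$corpus.$l2.gz" | mtt-build -i -o "$base.$l2" || return 1

    # Word alignment in symal format, converted to a .mam file.
    gzip -dc "$corpus.$l1-$l2.symal.gz" | symal2mam "$base.$l1-$l2.mam" || return 1

    # Lexicon (.lex) and the .coc file, built from the files created above.
    mmlex-build "$base" "$l1" "$l2" \
        -o "$base.$l1-$l2.lex" -c "$base.$l1-$l2.coc" || return 1
}

# Usage (placeholder names):
# build_bitext /data/train fr en /some/path
```

   Note that the exit status of each pipeline is that of the Moses tool, so
   a failing gzip is not detected here unless the shell supports and enables
   pipefail.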

4. Add the phrase table line to moses.ini

   The best configuration of phrase table features is still under investigation.
   For the time being, try this:

   PhraseDictionaryBitextSampling name=PT0 output-factor=0 num-features=9 path=/some/path/${CORPUS} L1=${L1} L2=${L2} pfwd=g pbwd=g smooth=0 sample=1000 workers=1 

   Increasing the number of workers makes sampling a bit faster,
   but the translation output will then no longer be reproducible.
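
   To avoid typing the feature line by hand, it can be generated from the
   same values used in step 3. The sketch below only prints the line shown
   above with the path, language extensions and worker count filled in. The
   function name pt_feature_line is hypothetical, and all other settings are
   copied from the suggested line rather than tuned.

```shell
# Hypothetical helper: print the PhraseDictionaryBitextSampling line.
# Arguments: path to the binary files (including corpus name),
#            source extension, target extension, optional worker count.
pt_feature_line() {
    pt_path=$1; pt_l1=$2; pt_l2=$3
    pt_workers=${4:-1}   # default of 1 keeps the output reproducible
    printf 'PhraseDictionaryBitextSampling name=PT0 output-factor=0 num-features=9 path=%s L1=%s L2=%s pfwd=g pbwd=g smooth=0 sample=1000 workers=%s\n' \
        "$pt_path" "$pt_l1" "$pt_l2" "$pt_workers"
}

# Usage (placeholder values):
# pt_feature_line /some/path/corpus fr en
```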