author    | Rico Sennrich <rico.sennrich@gmx.ch> | 2012-09-06 13:48:54 +0400
committer | Rico Sennrich <rico.sennrich@gmx.ch> | 2012-09-06 13:48:54 +0400
commit    | 4e2fc82854688e21e88a159b12d8fea26eb49f56 (patch)
tree      | 91eb606359db3e575fa83d86b225045ae46e1e3a /contrib
parent    | e9198acd41d215bf4c83407cea38d2638896e339 (diff)
new training option -write-lexical-counts
(creates additional files lex.counts.e2f and lex.counts.f2e)
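For context (not part of the commit): a minimal sketch of a training run that uses the new option. Only `-write-lexical-counts` and `-phrase-word-alignment` come from this change and the README below; the corpus paths, language pair, and LM settings are illustrative placeholders.

```sh
# Hypothetical Moses training invocation; paths and language codes are assumptions.
# -write-lexical-counts makes training step 4 write lex.counts.e2f and
# lex.counts.f2e alongside the usual lex.e2f and lex.f2e.
perl train-model.perl -root-dir . -corpus corpus -f de -e en \
    -alignment grow-diag-final-and \
    -phrase-word-alignment -write-lexical-counts \
    -lm 0:3:/path/to/corpus.lm
```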
Diffstat (limited to 'contrib')
-rw-r--r-- | contrib/tmcombine/README.md          |  2
-rwxr-xr-x | contrib/tmcombine/tmcombine.py       |  6
-rw-r--r-- | contrib/tmcombine/train_model.patch  | 24
3 files changed, 4 insertions, 28 deletions
diff --git a/contrib/tmcombine/README.md b/contrib/tmcombine/README.md
index 2d21b95c8..2cbc83299 100644
--- a/contrib/tmcombine/README.md
+++ b/contrib/tmcombine/README.md
@@ -58,7 +58,7 @@ Regression tests (check if the output files (`test/phrase-table_testN`) differ f
 FURTHER NOTES
 -------------
 
-- Different combination algorithms require different statistics. To be on the safe side, apply `train_model.patch` to `train_model.perl` and use the option `-phrase-word-alignment` when training models.
+- Different combination algorithms require different statistics. To be on the safe side, use the options `-phrase-word-alignment` and `-write-lexical-counts` when training models.
 
 - The script assumes that phrase tables are sorted (to allow incremental, more memory-friendly processing). Sort the tables with `LC_ALL=C`. Phrase tables produced by Moses are sorted correctly.
 
diff --git a/contrib/tmcombine/tmcombine.py b/contrib/tmcombine/tmcombine.py
index 3c02eaf45..24343b4bf 100755
--- a/contrib/tmcombine/tmcombine.py
+++ b/contrib/tmcombine/tmcombine.py
@@ -15,7 +15,7 @@
 
 # Some general things to note:
-# - Different combination algorithms require different statistics. To be on the safe side, apply train_model.patch to train_model.perl and use the option -phrase-word-alignment for training all models.
+# - Different combination algorithms require different statistics. To be on the safe side, use the options `-phrase-word-alignment` and `-write-lexical-counts` when training models.
 # - The script assumes that phrase tables are sorted (to allow incremental, more memory-friendly processing). sort with LC_ALL=C.
 # - Some configurations require additional statistics that are loaded in memory (lexical tables; complete list of target phrases). If memory consumption is a problem, use the option --lowmem (slightly slower and writes temporary files to disk), or consider pruning your phrase table before combining (e.g. using Johnson et al. 2007).
 # - The script can read/write gzipped files, but the Python implementation is slow. You're better off unzipping the files on the command line and working with the unzipped files.
 
@@ -1233,7 +1233,7 @@ def handle_file(filename,action,fileobj=None,mode='r'):
 
             if 'counts' in filename and os.path.exists(os.path.dirname(filename)):
                 sys.stderr.write('For a weighted counts combination, we need statistics that Moses doesn\'t write to disk by default.\n')
-                sys.stderr.write('Apply train_model.patch to train_model.perl and repeat step 4 of Moses training for all models.\n')
+                sys.stderr.write('Repeat step 4 of Moses training for all models with the option -write-lexical-counts.\n')
                 exit()
 
 
@@ -1327,7 +1327,7 @@ class Combine_TMs():
 
         output_lexical: If defined, also writes combined lexical tables. Writes to output_lexical.e2f and output_lexical.f2e, or output_lexical.counts.e2f in mode 'counts'.
 
         mode: declares the basic mixture-model algorithm. there are currently three options:
-          'counts': weighted counts (requires some statistics that Moses doesn't produce. Apply train_model.patch to train_model.perl and repeat step 4 of Moses training to obtain them.)
+          'counts': weighted counts (requires some statistics that Moses doesn't produce. Repeat step 4 of Moses training with the option -write-lexical-counts to obtain them.)
            Only the standard Moses features are recomputed from weighted counts; additional features are linearly interpolated (see number_of_features to allow more features, and i_e2f etc. if the standard features are in a non-standard position)
           'interpolate': linear interpolation
diff --git a/contrib/tmcombine/train_model.patch b/contrib/tmcombine/train_model.patch
deleted file mode 100644
index d422a1628..000000000
--- a/contrib/tmcombine/train_model.patch
+++ /dev/null
@@ -1,24 +0,0 @@
---- train-model.perl	2011-11-01 15:17:04.763230934 +0100
-+++ train-model.perl	2011-11-01 15:17:00.033229220 +0100
-@@ -1185,15 +1185,21 @@
- 
-   open(F2E,">$lexical_file.f2e") or die "ERROR: Can't write $lexical_file.f2e";
-   open(E2F,">$lexical_file.e2f") or die "ERROR: Can't write $lexical_file.e2f";
-+  open(F2E2,">$lexical_file.counts.f2e") or die "ERROR: Can't write $lexical_file.counts.f2e";
-+  open(E2F2,">$lexical_file.counts.e2f") or die "ERROR: Can't write $lexical_file.counts.e2f";
- 
-   foreach my $f (keys %WORD_TRANSLATION) {
-     foreach my $e (keys %{$WORD_TRANSLATION{$f}}) {
-       printf F2E "%s %s %.7f\n",$e,$f,$WORD_TRANSLATION{$f}{$e}/$TOTAL_FOREIGN{$f};
-       printf E2F "%s %s %.7f\n",$f,$e,$WORD_TRANSLATION{$f}{$e}/$TOTAL_ENGLISH{$e};
-+      printf F2E2 "%s %s %i %i\n",$e,$f,$WORD_TRANSLATION{$f}{$e},$TOTAL_FOREIGN{$f};
-+      printf E2F2 "%s %s %i %i\n",$f,$e,$WORD_TRANSLATION{$f}{$e},$TOTAL_ENGLISH{$e};
-     }
-   }
-   close(E2F);
-   close(F2E);
-+  close(E2F2);
-+  close(F2E2);
-   print STDERR "Saved: $lexical_file.f2e and $lexical_file.e2f\n";
- }
- 
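For orientation (not part of the commit): the `printf` formats in the deleted patch imply that each line of `lex.counts.f2e` is `<target word> <source word> <joint count> <source marginal>`, with `lex.counts.e2f` mirroring it for the target marginal. A minimal reading sketch under that assumption; the helper name is hypothetical:

```python
# Hypothetical reader for the lex.counts.* files written by -write-lexical-counts.
# Field layout inferred from the printf calls in the patch above:
#   lex.counts.f2e: <e> <f> <count(e,f)> <count(f)>
#   lex.counts.e2f: <f> <e> <count(e,f)> <count(e)>

def read_lexical_counts(path):
    """Return {(first_word, second_word): (joint_count, marginal_count)}."""
    counts = {}
    with open(path, encoding='utf-8') as fh:
        for line in fh:
            w1, w2, joint, marginal = line.split()
            counts[(w1, w2)] = (int(joint), int(marginal))
    return counts

# The probabilities that train-model.perl writes to lex.f2e (as %.7f values)
# are recoverable as joint/marginal, e.g. w(e|f) = count(e,f) / count(f),
# which is exactly what the weighted-counts mode of tmcombine.py needs.
```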