
github.com/moses-smt/mosesdecoder.git
author    Rico Sennrich <rico.sennrich@gmx.ch>  2012-09-06 13:48:54 +0400
committer Rico Sennrich <rico.sennrich@gmx.ch>  2012-09-06 13:48:54 +0400
commit    4e2fc82854688e21e88a159b12d8fea26eb49f56 (patch)
tree      91eb606359db3e575fa83d86b225045ae46e1e3a /contrib
parent    e9198acd41d215bf4c83407cea38d2638896e339 (diff)
new training option -write-lexical-counts
(creates additional files lex.counts.e2f and lex.counts.f2e)
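The lex.counts.e2f and lex.counts.f2e files use the same word-pair layout as the standard lexical tables, but store raw cooccurrence and marginal counts instead of probabilities (the `%s %s %i %i` printf format in the patched train-model.perl). As a minimal sketch, assuming that four-field format, a hypothetical parser for one such line could look like:

```python
# Hypothetical sketch: parse one line of a lex.counts.e2f / lex.counts.f2e
# file. Each line holds "word_a word_b joint_count marginal_count", matching
# the "%s %s %i %i" printf format written by -write-lexical-counts.

def parse_lex_counts_line(line):
    """Return (word_a, word_b, joint_count, marginal_count) for one line."""
    word_a, word_b, joint, total = line.split()
    return word_a, word_b, int(joint), int(total)

# Illustrative input only; the word pair and counts are made up.
pair = parse_lex_counts_line("house Haus 12 30")
```

tmcombine.py needs these raw counts because weighted-counts combination cannot be recovered from the normalized probabilities alone.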
Diffstat (limited to 'contrib')
-rw-r--r--  contrib/tmcombine/README.md          2
-rwxr-xr-x  contrib/tmcombine/tmcombine.py       6
-rw-r--r--  contrib/tmcombine/train_model.patch  24
3 files changed, 4 insertions, 28 deletions
diff --git a/contrib/tmcombine/README.md b/contrib/tmcombine/README.md
index 2d21b95c8..2cbc83299 100644
--- a/contrib/tmcombine/README.md
+++ b/contrib/tmcombine/README.md
@@ -58,7 +58,7 @@ Regression tests (check if the output files (`test/phrase-table_testN`) differ f
FURTHER NOTES
-------------
- - Different combination algorithms require different statistics. To be on the safe side, apply `train_model.patch` to `train_model.perl` and use the option `-phrase-word-alignment` when training models.
+ - Different combination algorithms require different statistics. To be on the safe side, use the options `-phrase-word-alignment` and `-write-lexical-counts` when training models.
- The script assumes that phrase tables are sorted (to allow incremental, more memory-friendly processing). Sort the tables with `LC_ALL=C`. Phrase tables produced by Moses are sorted correctly.
diff --git a/contrib/tmcombine/tmcombine.py b/contrib/tmcombine/tmcombine.py
index 3c02eaf45..24343b4bf 100755
--- a/contrib/tmcombine/tmcombine.py
+++ b/contrib/tmcombine/tmcombine.py
@@ -15,7 +15,7 @@
# Some general things to note:
-# - Different combination algorithms require different statistics. To be on the safe side, apply train_model.patch to train_model.perl and use the option -phrase-word-alignment for training all models.
+# - Different combination algorithms require different statistics. To be on the safe side, use the options `-phrase-word-alignment` and `-write-lexical-counts` when training models.
# - The script assumes that phrase tables are sorted (to allow incremental, more memory-friendly processing). sort with LC_ALL=C.
# - Some configurations require additional statistics that are loaded in memory (lexical tables; complete list of target phrases). If memory consumption is a problem, use the option --lowmem (slightly slower and writes temporary files to disk), or consider pruning your phrase table before combining (e.g. using Johnson et al. 2007).
# - The script can read/write gzipped files, but the Python implementation is slow. You're better off unzipping the files on the command line and working with the unzipped files.
@@ -1233,7 +1233,7 @@ def handle_file(filename,action,fileobj=None,mode='r'):
if 'counts' in filename and os.path.exists(os.path.dirname(filename)):
sys.stderr.write('For a weighted counts combination, we need statistics that Moses doesn\'t write to disk by default.\n')
- sys.stderr.write('Apply train_model.patch to train_model.perl and repeat step 4 of Moses training for all models.\n')
+ sys.stderr.write('Repeat step 4 of Moses training for all models with the option -write-lexical-counts.\n')
exit()
@@ -1327,7 +1327,7 @@ class Combine_TMs():
output_lexical: If defined, also writes combined lexical tables. Writes to output_lexical.e2f and output_lexical.f2e, or output_lexical.counts.e2f in mode 'counts'.
mode: declares the basic mixture-model algorithm. there are currently three options:
- 'counts': weighted counts (requires some statistics that Moses doesn't produce. Apply train_model.patch to train_model.perl and repeat step 4 of Moses training to obtain them.)
+ 'counts': weighted counts (requires some statistics that Moses doesn't produce. Repeat step 4 of Moses training with the option -write-lexical-counts to obtain them.)
Only the standard Moses features are recomputed from weighted counts; additional features are linearly interpolated
(see number_of_features to allow more features, and i_e2f etc. if the standard features are in a non-standard position)
'interpolate': linear interpolation
diff --git a/contrib/tmcombine/train_model.patch b/contrib/tmcombine/train_model.patch
deleted file mode 100644
index d422a1628..000000000
--- a/contrib/tmcombine/train_model.patch
+++ /dev/null
@@ -1,24 +0,0 @@
---- train-model.perl 2011-11-01 15:17:04.763230934 +0100
-+++ train-model.perl 2011-11-01 15:17:00.033229220 +0100
-@@ -1185,15 +1185,21 @@
-
- open(F2E,">$lexical_file.f2e") or die "ERROR: Can't write $lexical_file.f2e";
- open(E2F,">$lexical_file.e2f") or die "ERROR: Can't write $lexical_file.e2f";
-+ open(F2E2,">$lexical_file.counts.f2e") or die "ERROR: Can't write $lexical_file.counts.f2e";
-+ open(E2F2,">$lexical_file.counts.e2f") or die "ERROR: Can't write $lexical_file.counts.e2f";
-
- foreach my $f (keys %WORD_TRANSLATION) {
- foreach my $e (keys %{$WORD_TRANSLATION{$f}}) {
- printf F2E "%s %s %.7f\n",$e,$f,$WORD_TRANSLATION{$f}{$e}/$TOTAL_FOREIGN{$f};
- printf E2F "%s %s %.7f\n",$f,$e,$WORD_TRANSLATION{$f}{$e}/$TOTAL_ENGLISH{$e};
-+ printf F2E2 "%s %s %i %i\n",$e,$f,$WORD_TRANSLATION{$f}{$e},$TOTAL_FOREIGN{$f};
-+ printf E2F2 "%s %s %i %i\n",$f,$e,$WORD_TRANSLATION{$f}{$e},$TOTAL_ENGLISH{$e};
- }
- }
- close(E2F);
- close(F2E);
-+ close(E2F2);
-+ close(F2E2);
- print STDERR "Saved: $lexical_file.f2e and $lexical_file.e2f\n";
- }
-
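The regular lex.f2e/lex.e2f probabilities are simply the ratio of the two integers in the corresponding counts file: the Perl code above writes `$WORD_TRANSLATION{$f}{$e}/$TOTAL_FOREIGN{$f}` with `%.7f`. A hedged Python sketch of that relation, with illustrative counts only:

```python
# Sketch of how the counts files relate to the probability tables: the
# F2E printf above writes joint/total with "%.7f", so the probability in
# lex.f2e is joint_count / marginal_count from lex.counts.f2e.
# The word pair and counts below are made up for illustration.

joint_count = 12      # corresponds to $WORD_TRANSLATION{$f}{$e}
marginal_count = 30   # corresponds to $TOTAL_FOREIGN{$f}

prob = joint_count / marginal_count
line = "%s %s %.7f" % ("house", "Haus", prob)  # mirrors the F2E printf format
```

Keeping the integer counts alongside the normalized table is what lets tmcombine.py reweight and renormalize them when combining models.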