author    | Rico Sennrich <rico.sennrich@gmx.ch> | 2012-09-06 13:48:54 +0400
committer | Rico Sennrich <rico.sennrich@gmx.ch> | 2012-09-06 13:48:54 +0400
commit    | 4e2fc82854688e21e88a159b12d8fea26eb49f56 (patch)
tree      | 91eb606359db3e575fa83d86b225045ae46e1e3a /contrib
parent    | e9198acd41d215bf4c83407cea38d2638896e339 (diff)
new training option -write-lexical-counts
(creates additional files lex.counts.e2f and lex.counts.f2e)
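For context (not part of the commit): a minimal sketch of a training run that uses the new option. Only `-write-lexical-counts` and `-phrase-word-alignment` come from this change and the README below; the corpus paths, language pair, and LM settings are illustrative placeholders.

```sh
# Hypothetical Moses training invocation; paths and language codes are assumptions.
# -write-lexical-counts makes training step 4 write lex.counts.e2f and
# lex.counts.f2e alongside the usual lex.e2f and lex.f2e.
perl train-model.perl -root-dir . -corpus corpus -f de -e en \
    -alignment grow-diag-final-and \
    -phrase-word-alignment -write-lexical-counts \
    -lm 0:3:/path/to/corpus.lm
```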
Diffstat (limited to 'contrib')
-rw-r--r-- | contrib/tmcombine/README.md          |  2
-rwxr-xr-x | contrib/tmcombine/tmcombine.py       |  6
-rw-r--r-- | contrib/tmcombine/train_model.patch  | 24
3 files changed, 4 insertions, 28 deletions
diff --git a/contrib/tmcombine/README.md b/contrib/tmcombine/README.md
index 2d21b95c8..2cbc83299 100644
--- a/contrib/tmcombine/README.md
+++ b/contrib/tmcombine/README.md
@@ -58,7 +58,7 @@ Regression tests (check if the output files (`test/phrase-table_testN`) differ f
 FURTHER NOTES
 -------------
 
-- Different combination algorithms require different statistics. To be on the safe side, apply `train_model.patch` to `train_model.perl` and use the option `-phrase-word-alignment` when training models.
+- Different combination algorithms require different statistics. To be on the safe side, use the options `-phrase-word-alignment` and `-write-lexical-counts` when training models.
 
 - The script assumes that phrase tables are sorted (to allow incremental, more memory-friendly processing). Sort the tables with `LC_ALL=C`. Phrase tables produced by Moses are sorted correctly.
 
diff --git a/contrib/tmcombine/tmcombine.py b/contrib/tmcombine/tmcombine.py
index 3c02eaf45..24343b4bf 100755
--- a/contrib/tmcombine/tmcombine.py
+++ b/contrib/tmcombine/tmcombine.py
@@ -15,7 +15,7 @@
 
 # Some general things to note:
-# - Different combination algorithms require different statistics. To be on the safe side, apply train_model.patch to train_model.perl and use the option -phrase-word-alignment for training all models.
+# - Different combination algorithms require different statistics. To be on the safe side, use the options `-phrase-word-alignment` and `-write-lexical-counts` when training models.
 # - The script assumes that phrase tables are sorted (to allow incremental, more memory-friendly processing). sort with LC_ALL=C.
 # - Some configurations require additional statistics that are loaded in memory (lexical tables; complete list of target phrases). If memory consumption is a problem, use the option --lowmem (slightly slower and writes temporary files to disk), or consider pruning your phrase table before combining (e.g. using Johnson et al. 2007).
 # - The script can read/write gzipped files, but the Python implementation is slow. You're better off unzipping the files on the command line and working with the unzipped files.
 
@@ -1233,7 +1233,7 @@ def handle_file(filename,action,fileobj=None,mode='r'):
 
             if 'counts' in filename and os.path.exists(os.path.dirname(filename)):
                 sys.stderr.write('For a weighted counts combination, we need statistics that Moses doesn\'t write to disk by default.\n')
-                sys.stderr.write('Apply train_model.patch to train_model.perl and repeat step 4 of Moses training for all models.\n')
+                sys.stderr.write('Repeat step 4 of Moses training for all models with the option -write-lexical-counts.\n')
                 exit()
 
 
@@ -1327,7 +1327,7 @@ class Combine_TMs():
 
         output_lexical: If defined, also writes combined lexical tables. Writes to output_lexical.e2f and output_lexical.f2e, or output_lexical.counts.e2f in mode 'counts'.
 
         mode: declares the basic mixture-model algorithm. there are currently three options:
-          'counts': weighted counts (requires some statistics that Moses doesn't produce. Apply train_model.patch to train_model.perl and repeat step 4 of Moses training to obtain them.)
+          'counts': weighted counts (requires some statistics that Moses doesn't produce. Repeat step 4 of Moses training with the option -write-lexical-counts to obtain them.)
            Only the standard Moses features are recomputed from weighted counts; additional features are linearly interpolated (see number_of_features to allow more features, and i_e2f etc. if the standard features are in a non-standard position)
           'interpolate': linear interpolation
diff --git a/contrib/tmcombine/train_model.patch b/contrib/tmcombine/train_model.patch
deleted file mode 100644
index d422a1628..000000000
--- a/contrib/tmcombine/train_model.patch
+++ /dev/null
@@ -1,24 +0,0 @@
---- train-model.perl	2011-11-01 15:17:04.763230934 +0100
-+++ train-model.perl	2011-11-01 15:17:00.033229220 +0100
-@@ -1185,15 +1185,21 @@
- 
-   open(F2E,">$lexical_file.f2e") or die "ERROR: Can't write $lexical_file.f2e";
-   open(E2F,">$lexical_file.e2f") or die "ERROR: Can't write $lexical_file.e2f";
-+  open(F2E2,">$lexical_file.counts.f2e") or die "ERROR: Can't write $lexical_file.counts.f2e";
-+  open(E2F2,">$lexical_file.counts.e2f") or die "ERROR: Can't write $lexical_file.counts.e2f";
- 
-   foreach my $f (keys %WORD_TRANSLATION) {
-     foreach my $e (keys %{$WORD_TRANSLATION{$f}}) {
-       printf F2E "%s %s %.7f\n",$e,$f,$WORD_TRANSLATION{$f}{$e}/$TOTAL_FOREIGN{$f};
-       printf E2F "%s %s %.7f\n",$f,$e,$WORD_TRANSLATION{$f}{$e}/$TOTAL_ENGLISH{$e};
-+      printf F2E2 "%s %s %i %i\n",$e,$f,$WORD_TRANSLATION{$f}{$e},$TOTAL_FOREIGN{$f};
-+      printf E2F2 "%s %s %i %i\n",$f,$e,$WORD_TRANSLATION{$f}{$e},$TOTAL_ENGLISH{$e};
-     }
-   }
-   close(E2F);
-   close(F2E);
-+  close(E2F2);
-+  close(F2E2);
-   print STDERR "Saved: $lexical_file.f2e and $lexical_file.e2f\n";
- }
- 
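For orientation (not part of the commit): the `printf` formats in the deleted patch imply that each line of `lex.counts.f2e` is `<target word> <source word> <joint count> <source marginal>`, with `lex.counts.e2f` mirroring it for the target marginal. A minimal reading sketch under that assumption; the helper name is hypothetical:

```python
# Hypothetical reader for the lex.counts.* files written by -write-lexical-counts.
# Field layout inferred from the printf calls in the patch above:
#   lex.counts.f2e: <e> <f> <count(e,f)> <count(f)>
#   lex.counts.e2f: <f> <e> <count(e,f)> <count(e)>

def read_lexical_counts(path):
    """Return {(first_word, second_word): (joint_count, marginal_count)}."""
    counts = {}
    with open(path, encoding='utf-8') as fh:
        for line in fh:
            w1, w2, joint, marginal = line.split()
            counts[(w1, w2)] = (int(joint), int(marginal))
    return counts

# The probabilities that train-model.perl writes to lex.f2e (as %.7f values)
# are recoverable as joint/marginal, e.g. w(e|f) = count(e,f) / count(f),
# which is exactly what the weighted-counts mode of tmcombine.py needs.
```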