github.com/moses-smt/mosesdecoder.git
author    wlin12 <wanglin112@gmail.com>  2012-09-07 20:33:44 +0400
committer wlin12 <wanglin112@gmail.com>  2012-09-07 20:33:44 +0400
commit    6202c7cc607c6351bf9c6bd742c5a7c9c654e222 (patch)
tree      108ee72ae26fa34543e24b502840c7ac744775fb /contrib
parent    c944492e3ec7803e292a6bf59a147cc79ee2495d (diff)
adding the code for relative entropy pruning to contrib/relent-filter
Diffstat (limited to 'contrib')
-rw-r--r--  contrib/relent-filter/AUTHORS                              |    1
-rw-r--r--  contrib/relent-filter/README.txt                           |   68
-rw-r--r--  contrib/relent-filter/scripts/calcEmpiricalDistribution.pl |   53
-rwxr-xr-x  contrib/relent-filter/scripts/calcPruningScores.pl         |  351
-rw-r--r--  contrib/relent-filter/scripts/interpolateScores.pl         |   94
-rwxr-xr-x  contrib/relent-filter/scripts/prunePT.pl                   |  114
-rwxr-xr-x  contrib/relent-filter/sigtest-filter/Makefile              |   10
-rwxr-xr-x  contrib/relent-filter/sigtest-filter/README.txt            |   42
-rwxr-xr-x  contrib/relent-filter/sigtest-filter/WIN32_functions.cpp   |  231
-rwxr-xr-x  contrib/relent-filter/sigtest-filter/WIN32_functions.h     |   24
-rwxr-xr-x  contrib/relent-filter/sigtest-filter/check-install         |    5
-rwxr-xr-x  contrib/relent-filter/sigtest-filter/filter-pt             |  bin 0 -> 94567 bytes
-rwxr-xr-x  contrib/relent-filter/sigtest-filter/filter-pt.cpp         |  377
-rwxr-xr-x  contrib/relent-filter/sigtest-filter/sigtest-filter.sln    |   20
-rwxr-xr-x  contrib/relent-filter/src/IOWrapper.cpp                    |  580
-rwxr-xr-x  contrib/relent-filter/src/IOWrapper.h                      |  142
-rwxr-xr-x  contrib/relent-filter/src/Jamfile                          |    6
-rwxr-xr-x  contrib/relent-filter/src/LatticeMBR.cpp                   |  669
-rwxr-xr-x  contrib/relent-filter/src/LatticeMBR.h                     |  153
-rwxr-xr-x  contrib/relent-filter/src/LatticeMBRGrid.cpp               |  213
-rwxr-xr-x  contrib/relent-filter/src/Main.cpp                         |  282
-rwxr-xr-x  contrib/relent-filter/src/Main.h                           |   39
-rwxr-xr-x  contrib/relent-filter/src/RelativeEntropyCalc.cpp          |   83
-rwxr-xr-x  contrib/relent-filter/src/RelativeEntropyCalc.h            |   51
-rwxr-xr-x  contrib/relent-filter/src/TranslationAnalysis.cpp          |  126
-rwxr-xr-x  contrib/relent-filter/src/TranslationAnalysis.h            |   25
-rwxr-xr-x  contrib/relent-filter/src/mbr.cpp                          |  178
-rwxr-xr-x  contrib/relent-filter/src/mbr.h                            |   28
-rw-r--r--  contrib/sigtest-filter/Makefile                            |    2
29 files changed, 3966 insertions(+), 1 deletion(-)
diff --git a/contrib/relent-filter/AUTHORS b/contrib/relent-filter/AUTHORS
new file mode 100644
index 000000000..184a6dddd
--- /dev/null
+++ b/contrib/relent-filter/AUTHORS
@@ -0,0 +1 @@
+Wang Ling - lingwang at cs dot cmu dot edu
diff --git a/contrib/relent-filter/README.txt b/contrib/relent-filter/README.txt
new file mode 100644
index 000000000..02f81041d
--- /dev/null
+++ b/contrib/relent-filter/README.txt
@@ -0,0 +1,68 @@
+Implementation of the relative entropy-based phrase table filtering algorithm by Wang Ling (Ling et al., 2012).
+
+This implementation also calculates significance scores for the phrase tables based on Fisher's exact test (Johnson et al., 2007), using a slightly modified version of the "sigtest-filter" by Chris Dyer.
+
+-------BUILD INSTRUCTIONS-------
+
+1 - Build the sigtest-filter binary
+
+1.1 - Download and build SALM available at http://projectile.sv.cmu.edu/research/public/tools/salm/salm.htm
+
+1.2 - Run "make SALMDIR=<path_to_salm>" in "<path_to_moses>/contrib/relent-filter/sigtest-filter" to create the executable filter-pt
+
+2 - Build the Moses project by running "./bjam <options>"; this creates the executables for relent filtering. An example sequence follows below.
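+
+For example, assuming SALM was unpacked to /path/to/salm (the paths here are illustrative):
+
+cd <path_to_moses>/contrib/relent-filter/sigtest-filter
+make SALMDIR=/path/to/salm
+cd <path_to_moses>
+./bjam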
+
+-------USAGE INSTRUCTIONS-------
+
+Required files:
+s_train - source training file
+t_train - target training file
+moses_ini - path to the moses configuration file (after tuning)
+pruning_binaries - path to the relent pruning binaries (should be "<path_to_moses>/bin")
+pruning_scripts - path to the relent pruning scripts (should be "<path_to_moses>/contrib/relent-filter/scripts")
+sigbin - path to the sigtest filter binaries (should be "<path_to_moses>/contrib/relent-filter/sigtest-filter")
+output_dir - path to write the output
+
+1 - build suffix arrays for the source and target parallel training data
+
+1.1 - run "<path to salm>/Bin/Linux/Index/IndexSA.O32 <s_train>" (or IndexSA.O64)
+
+1.2 - run "<path to salm>/Bin/Linux/Index/IndexSA.O32 <t_train>" (or IndexSA.O64)
+
+2 - calculate phrase pair scores by running:
+
+perl <pruning_scripts>/calcPruningScores.pl -moses_ini <moses_ini> -training_s <s_train> -training_t <t_train> -prune_bin <pruning_binaries> -prune_scripts <pruning_scripts> -moses_scripts <path_to_moses>/scripts/training/ -workdir <output_dir> -dec_size 10000
+
+this will create the following files in the <output_dir>/scores/ dir:
+
+count.txt - counts for each phrase pair: N(s,t), N(s,*) and N(*,t)
+divergence.txt - negative log of the divergence of the phrase pair
+empirical.txt - empirical distribution of the phrase pairs N(s,t)/N(*,*)
+rel_ent.txt - relative entropy of the phrase pairs
+significance.txt - significance of the phrase pairs
+
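+As a reference for how these scores relate (this mirrors the final step of calcPruningScores.pl, which multiplies the empirical and divergence files together):
+
+empirical(s,t) = N(s,t) / N(*,*)
+rel_ent(s,t) = empirical(s,t) * divergence(s,t)
+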
+You can use any one of these files for pruning and also combine these scores using <pruning_scripts>/interpolateScores.pl
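+
+For instance, to average the relative entropy and significance scores (the 0.5/0.5 weights here are purely illustrative):
+
+perl <pruning_scripts>/interpolateScores.pl -files "<output_dir>/scores/rel_ent.txt <output_dir>/scores/significance.txt" -weights "0.5 0.5" -operation "+" > <output_dir>/scores/combined.txt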
+
+3 - To actually prune a phrase table, run <pruning_scripts>/prunePT.pl
+
+For instance, to prune 30% of the phrase table using rel_ent run:
+perl <pruning_scripts>/prunePT.pl -table <phrase_table_file> -scores <output_dir>/scores/rel_ent.txt -percentage 70 > <pruned_phrase_table_file>
+
+You can also prune by threshold:
+perl <pruning_scripts>/prunePT.pl -table <phrase_table_file> -scores <output_dir>/scores/rel_ent.txt -threshold 0.1 > <pruned_phrase_table_file>
+
+The same must be done for the reordering table, replacing <phrase_table_file> with <reord_table_file>:
+
+perl <pruning_scripts>/prunePT.pl -table <reord_table_file> -scores <output_dir>/scores/rel_ent.txt -percentage 70 > <pruned_reord_table_file>
+
+REFERENCES
+---------------------------------
+Ling, W., Graça, J., Trancoso, I., and Black, A. (2012). Entropy-based Pruning
+for Phrase-based Machine Translation. In Proceedings of the 2012 Joint
+Conference on Empirical Methods in Natural Language Processing and
+Computational Natural Language Learning (EMNLP-CoNLL), pp. 962-971.
+
+H. Johnson, J. Martin, G. Foster and R. Kuhn. (2007) Improving Translation
+Quality by Discarding Most of the Phrasetable. In Proceedings of the 2007
+Joint Conference on Empirical Methods in Natural Language Processing and
+Computational Natural Language Learning (EMNLP-CoNLL), pp. 967-975.
diff --git a/contrib/relent-filter/scripts/calcEmpiricalDistribution.pl b/contrib/relent-filter/scripts/calcEmpiricalDistribution.pl
new file mode 100644
index 000000000..462ec5339
--- /dev/null
+++ b/contrib/relent-filter/scripts/calcEmpiricalDistribution.pl
@@ -0,0 +1,53 @@
+#!/usr/bin/perl -w
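+#
+# Computes the empirical distribution of the phrase pairs, N(s,t)/N(*,*),
+# from a file of phrase pair counts: a first pass over the file sums all
+# counts to obtain the normalizer, a second pass prints each count divided
+# by that sum.
+#
+# Usage (file names illustrative):
+#   perl calcEmpiricalDistribution.pl count.txt > empirical.txt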
+
+# read arguments
+my $countFile = $ARGV[0];
+
+my $ZCAT = "gzip -cd";
+my $BZCAT = "bzcat";
+
+&process_count_file($countFile);
+
+sub process_count_file {
+ $file = $_[0];
+ open(COUNT_READER, &open_compressed($file)) or die "ERROR: Can't read $file";
+
+ print STDERR "reading file to calculate normalizer";
+ $normalizer=0;
+ while(<COUNT_READER>) {
+ my $line = $_;
+ chomp($line);
+ my @line_array = split(/\s+/, $line);
+ my $count = $line_array[0];
+ $normalizer+=$count;
+ }
+
+ close(COUNT_READER);
+
+ print STDERR "reading file again to print the counts";
+ open(COUNT_READER, &open_compressed($file)) or die "ERROR: Can't read $file";
+
+ while(<COUNT_READER>) {
+ my $line = $_;
+ chomp($line);
+ my @line_array = split(/\s+/, $line);
+ my $score = $line_array[0]/$normalizer;
+ print $score."\n";
+ }
+
+ close(COUNT_READER);
+}
+
+sub open_compressed {
+ my ($file) = @_;
+ print STDERR "FILE: $file\n";
+
+ # add extensions, if necessary
+ $file = $file.".bz2" if ! -e $file && -e $file.".bz2";
+ $file = $file.".gz" if ! -e $file && -e $file.".gz";
+
+ # pipe zipped, if necessary
+ return "$BZCAT $file|" if $file =~ /\.bz2$/;
+ return "$ZCAT $file|" if $file =~ /\.gz$/;
+ return $file;
+}
diff --git a/contrib/relent-filter/scripts/calcPruningScores.pl b/contrib/relent-filter/scripts/calcPruningScores.pl
new file mode 100755
index 000000000..cbfabac55
--- /dev/null
+++ b/contrib/relent-filter/scripts/calcPruningScores.pl
@@ -0,0 +1,351 @@
+#!/usr/bin/perl -w
+use Getopt::Long;
+use File::Basename;
+use POSIX;
+
+# read arguments
+my $line_start = 0;
+my $line_end = LONG_MAX;
+my $tmp_dir = "";
+my $dec_size = LONG_MAX;
+$_HELP = 1 if (@ARGV < 1 or !GetOptions ("moses_ini=s" => \$moses_ini, #moses conf file
+"start:i" => \$line_start, #fisrt phrase to process
+"end:i" => \$line_end, #last sentence to process (not including)
+"training_s=s" => \$training_s, #source training file
+"training_t=s" => \$training_t, #target training file
+"prune_bin=s" => \$prune_bin, #binary files in the pruning toolkit
+"prune_scripts=s" => \$prune_scripts, #scripts in the pruning toolkit
+"sig_bin=s" => \$sig_bin, #binary files to calculate significance
+"moses_scripts=s" => \$moses_scripts, #dir with the moses scripts
+"tmp_dir:s" => \$tmp_dir, #dir with the moses scripts
+"dec_size:i" => \$dec_size, #dir with the moses scripts
+"workdir=s" => \$workdir)); #directory to put all the output files
+
+# help message if arguments are not correct
+if ($_HELP) {
+ print "
+Usage: perl calcPruningScores.pl [PARAMS]
+Function: Calculates relative entropy for each phrase pair in a translation model.
+Authors: Wang Ling ( lingwang at cs dot cmu dot edu )
+PARAMS:
+ -moses_ini : moses configuration file with the model to prune (phrase table, reordering table, weights etc...)
+ -training_s : source training file, please run salm first
+ -training_t : target training file, please run salm first
+ -prune_bin : path to the binaries for pruning (probably <PATH_TO_MOSES>/bin)
+ -prune_scripts : path to the scripts for pruning (probably the directory where this script is)
+ -sig_bin : path to the binary for significance testing included in this toolkit
+ -moses_scripts : path to the moses training scripts (where filter-model-given-input.pl is)
+ -workdir : directory to produce the output
+ -tmp_dir : directory to store temporary files (improve performance if stored in a local disk), omit to store in workdir
+ -dec_size : number of phrase pairs to be decoded at a time, omit to decode all selected phrase pairs at once
+ -start and -end : starting and ending phrase pairs to process, to be used if you want to launch multiple processes in parallel for different parts of the phrase table. If specified the process will process the phrase pairs from <start> to <end-1>
+
+For any questions contact lingwang at cs dot cmu dot edu
+";
+ exit(1);
+}
+
+# setting up working dirs
+my $TMP_DIR = $tmp_dir;
+if ($tmp_dir eq ""){
+ $TMP_DIR = "$workdir/tmp";
+}
+my $SCORE_DIR = "$workdir/scores";
+my $FILTER_DIR = "$TMP_DIR/filter";
+
+# files for divergence module
+my $SOURCE_FILE = "$TMP_DIR/source.txt";
+my $CONSTRAINT_FILE = "$TMP_DIR/constraint.txt";
+my $DIVERGENCE_FILE = "$SCORE_DIR/divergence.txt";
+
+# files for significance module
+my $SIG_TABLE_FILE = "$TMP_DIR/source_target.txt";
+my $SIG_MOD_OUTPUT = "$TMP_DIR/sig_mod.out";
+my $SIG_FILE = "$SCORE_DIR/significance.txt";
+my $COUNT_FILE = "$SCORE_DIR/count.txt";
+my $EMP_DIST_FILE= "$SCORE_DIR/empirical.txt";
+my $REL_ENT_FILE= "$SCORE_DIR/rel_ent.txt";
+
+# setting up executables
+my $ZCAT = "gzip -cd";
+my $BZCAT = "bzcat";
+my $CP = "cp";
+my $SED = "sed";
+my $RM = "rm";
+my $SORT_EXEC = "sort";
+my $PRUNE_EXEC = "$prune_bin/calcDivergence";
+my $SIG_EXEC = "$sig_bin/filter-pt";
+my $FILTER_EXEC = "perl $moses_scripts/filter-model-given-input.pl";
+my $CALC_EMP_EXEC ="perl $prune_scripts/calcEmpiricalDistribution.pl";
+my $INT_TABLE_EXEC = "perl $prune_scripts/interpolateScores.pl";
+
+# moses ini variables
+my ($TRANSLATION_TABLE_FILE, $REORDERING_TABLE_FILE);
+
+# phrase table variables
+my ($N_PHRASES, $N_PHRASES_TO_PROCESS);
+
+# main functions
+&prepare();
+&calc_sig_and_counts();
+&calc_div();
+&clear_up();
+
+# (1) preparing data
+sub prepare {
+ print STDERR "(1) preparing data @ ".`date`;
+ safesystem("mkdir -p $workdir") or die("ERROR: could not create work dir $workdir");
+ safesystem("mkdir -p $TMP_DIR") or die("ERROR: could not create work dir $TMP_DIR");
+ safesystem("mkdir -p $SCORE_DIR") or die("ERROR: could not create work dir $SCORE_DIR");
+ &get_moses_ini_params();
+ &copy_tables_to_tmp_dir();
+ &write_data_files();
+
+ $N_PHRASES = &get_number_of_phrases();
+ $line_end = ($line_end > $N_PHRASES) ? $N_PHRASES : $line_end;
+ $N_PHRASES_TO_PROCESS = $line_end - $line_start;
+}
+
+sub write_data_files {
+ open(SOURCE_WRITER,">".$SOURCE_FILE) or die "ERROR: Can't write $SOURCE_FILE";
+ open(CONSTRAINT_WRITER,">".$CONSTRAINT_FILE) or die "ERROR: Can't write $CONSTRAINT_FILE";
+ open(TABLE_WRITER,">".$SIG_TABLE_FILE) or die "ERROR: Can't write $SIG_TABLE_FILE";
+ open(TTABLE_READER, &open_compressed($TRANSLATION_TABLE_FILE)) or die "ERROR: Can't read $TRANSLATION_TABLE_FILE";
+
+ $line_number = 0;
+ while($line_number < $line_start && !eof(TTABLE_READER)){
+ <TTABLE_READER>;
+ $line_number++;
+ }
+ while($line_number < $line_end && !eof(TTABLE_READER)) {
+ my $line = <TTABLE_READER>;
+ chomp($line);
+ my @line_array = split(/\s+\|\|\|\s+/, $line);
+ my $source = $line_array[0];
+ my $target = $line_array[1];
+ my $scores = $line_array[2];
+ print TABLE_WRITER $source." ||| ".$target." ||| ".$scores."\n";
+ print SOURCE_WRITER $source."\n";
+ print CONSTRAINT_WRITER $target."\n";
+ $line_number++;
+ }
+
+ close(SOURCE_WRITER);
+ close(CONSTRAINT_WRITER);
+ close(TABLE_WRITER);
+ close(TTABLE_READER);
+}
+
+sub copy_tables_to_tmp_dir {
+ $tmp_t_table = "$TMP_DIR/".basename($TRANSLATION_TABLE_FILE);
+ $tmp_r_table = "$TMP_DIR/".basename($REORDERING_TABLE_FILE);
+ $tmp_moses_ini = "$TMP_DIR/moses.ini";
+ $cp_t_cmd = "$CP $TRANSLATION_TABLE_FILE $TMP_DIR";
+ $cp_r_cmd = "$CP $REORDERING_TABLE_FILE $TMP_DIR";
+ safesystem("$cp_t_cmd") or die("ERROR: could not run:\n $cp_t_cmd");
+ safesystem("$cp_r_cmd") or die("ERROR: could not run:\n $cp_r_cmd");
+
+ $sed_cmd = "$SED s#$TRANSLATION_TABLE_FILE#$tmp_t_table#g $moses_ini | $SED s#$REORDERING_TABLE_FILE#$tmp_r_table#g > $tmp_moses_ini";
+ safesystem("$sed_cmd") or die("ERROR: could not run:\n $sed_cmd");
+
+ $TRANSLATION_TABLE_FILE = $tmp_t_table;
+ $REORDERING_TABLE_FILE = $tmp_r_table;
+ $moses_ini = $tmp_moses_ini;
+}
+
+# (2) calculating sig and counts
+sub calc_sig_and_counts {
+ print STDERR "(2) calculating counts and significance @ ".`date`;
+ print STDERR "(2.1) running significance module @ ".`date`;
+ &run_significance_module();
+ print STDERR "(2.2) writing counts and significance tables @ ".`date`;
+ &write_counts_and_significance_table();
+}
+
+sub write_counts_and_significance_table {
+ open(COUNT_WRITER,">".$COUNT_FILE) or die "ERROR: Can't write $COUNT_FILE";
+ open(SIG_WRITER,">".$SIG_FILE) or die "ERROR: Can't write $SIG_FILE";
+ open(SIG_MOD_READER, &open_compressed($SIG_MOD_OUTPUT)) or die "ERROR: Can't read $SIG_MOD_OUTPUT";
+
+ while(<SIG_MOD_READER>) {
+ my($line) = $_;
+ chomp($line);
+ my @line_array = split(/\s+\|\|\|\s+/, $line);
+ my $count = $line_array[0];
+ my $sig = $line_array[1];
+ print COUNT_WRITER $count."\n";
+ print SIG_WRITER $sig."\n";
+ }
+
+ close(SIG_MOD_READER);
+ close(COUNT_WRITER);
+ close(SIG_WRITER);
+}
+
+sub run_significance_module {
+ my $sig_cmd = "cat $SIG_TABLE_FILE | $SIG_EXEC -e $training_t -f $training_s -l -10000 -p -c > $SIG_MOD_OUTPUT";
+ safesystem("$sig_cmd") or die("ERROR: could not run:\n $sig_cmd");
+}
+
+# (3) calculating divergence
+sub calc_div {
+ print STDERR "(3) calculating relative entropy".`date`;
+ print STDERR "(3.1) calculating empirical distribution".`date`;
+ &calculate_empirical_distribution();
+ print STDERR "(3.2) calculating divergence (this might take a while)".`date`;
+ if($N_PHRASES_TO_PROCESS > $dec_size) {
+ &calculate_divergence_shared("$FILTER_DIR");
+ }
+ else{
+ &calculate_divergence($moses_ini);
+ }
+ print STDERR "(3.3) calculating relative entropy from empirical and divergence distributions".`date`;
+ &calculate_relative_entropy();
+}
+
+sub calculate_empirical_distribution {
+ my $emp_cmd = "$CALC_EMP_EXEC $COUNT_FILE > $EMP_DIST_FILE";
+ safesystem("$emp_cmd") or die("ERROR: could not run:\n $emp_cmd");
+}
+
+sub get_fragmented_file_name {
+ my ($name, $frag, $interval) = @_;
+ return "$name-$frag-".($frag+$interval);
+}
+
+sub calculate_divergence {
+ my $moses_ini_file = $_[0];
+ print STDERR "force decoding phrase pairs\n";
+ my $prune_cmd = "cat $SOURCE_FILE | $PRUNE_EXEC -f $moses_ini_file -constraint $CONSTRAINT_FILE -early-discarding-threshold 0 -s 100000 -ttable-limit 0 > $DIVERGENCE_FILE 2> /dev/null";
+ safesystem("$prune_cmd") or die("ERROR: could not run:\n $prune_cmd");
+}
+
+sub calculate_divergence_shared {
+ my $filter_dir = $_[0];
+
+ &split_file_into_chunks($SOURCE_FILE, $dec_size, $N_PHRASES_TO_PROCESS);
+ &split_file_into_chunks($CONSTRAINT_FILE, $dec_size, $N_PHRASES_TO_PROCESS);
+
+ for(my $i = 0; $i < $N_PHRASES_TO_PROCESS; $i = $i + $dec_size) {
+ my $filter_cmd = "$FILTER_EXEC ".&get_fragmented_file_name($FILTER_DIR, $i, $dec_size)." $moses_ini ".&get_fragmented_file_name($SOURCE_FILE, $i, $dec_size);
+ safesystem("$filter_cmd") or die("ERROR: could not run:\n $filter_cmd");
+
+ my $moses_ini_file = &get_fragmented_file_name($filter_dir, $i, $dec_size)."/moses.ini";
+ my $source_file = &get_fragmented_file_name($SOURCE_FILE, $i, $dec_size);
+ my $constraint_file = &get_fragmented_file_name($CONSTRAINT_FILE, $i, $dec_size);
+ my $prune_cmd;
+ print STDERR "force decoding phrase pairs $i to ".($i + $dec_size)."\n";
+ if($i == 0){
+ $prune_cmd = "cat $source_file | $PRUNE_EXEC -f $moses_ini_file -constraint $constraint_file -early-discarding-threshold 0 -s 100000 -ttable-limit 0 > $DIVERGENCE_FILE 2> /dev/null";
+ }
+ else{
+ $prune_cmd = "cat $source_file | $PRUNE_EXEC -f $moses_ini_file -constraint $constraint_file -early-discarding-threshold 0 -s 100000 -ttable-limit 0 >> $DIVERGENCE_FILE 2> /dev/null";
+ }
+ safesystem("$prune_cmd") or die("ERROR: could not run:\n $prune_cmd");
+
+ my $rm_cmd = "$RM -r ".&get_fragmented_file_name($FILTER_DIR, $i, $dec_size);
+ safesystem("$rm_cmd") or die("ERROR: could not run:\n $rm_cmd");
+
+ }
+}
+
+sub calculate_relative_entropy {
+ my $int_cmd = "$INT_TABLE_EXEC -files \"$EMP_DIST_FILE $DIVERGENCE_FILE\" -weights \"1 1\" -operation \"*\" > $REL_ENT_FILE";
+ safesystem("$int_cmd") or die("ERROR: could not run:\n $int_cmd");
+
+}
+
+# (4) clear up stuff that is not needed
+sub clear_up {
+ print STDERR "(4) removing tmp dir".`date`;
+ $rm_cmd = "$RM -r $TMP_DIR";
+ safesystem("$rm_cmd") or die("ERROR: could not run:\n $rm_cmd");
+}
+
+# utility functions
+
+sub safesystem {
+ print STDERR "Executing: @_\n";
+ system(@_);
+ if ($? == -1) {
+ print STDERR "ERROR: Failed to execute: @_\n $!\n";
+ exit(1);
+ }
+ elsif ($? & 127) {
+ printf STDERR "ERROR: Execution of: @_\n died with signal %d, %s coredump\n",
+ ($? & 127), ($? & 128) ? 'with' : 'without';
+ exit(1);
+ }
+ else {
+ my $exitcode = $? >> 8;
+ print STDERR "Exit code: $exitcode\n" if $exitcode;
+ return ! $exitcode;
+ }
+}
+
+sub open_compressed {
+ my ($file) = @_;
+ print STDERR "FILE: $file\n";
+
+ # add extensions, if necessary
+ $file = $file.".bz2" if ! -e $file && -e $file.".bz2";
+ $file = $file.".gz" if ! -e $file && -e $file.".gz";
+
+ # pipe zipped, if necessary
+ return "$BZCAT $file|" if $file =~ /\.bz2$/;
+ return "$ZCAT $file|" if $file =~ /\.gz$/;
+ return $file;
+}
+
+sub get_moses_ini_params {
+
+ open(MOSES_READER, $moses_ini);
+ while(<MOSES_READER>) {
+ my($line) = $_;
+ chomp($line);
+
+ if($line eq "[ttable-file]"){
+ $tableLine = <MOSES_READER>;
+ chomp($tableLine);
+ ($_,$_,$_,$_,$TRANSLATION_TABLE_FILE) = split(" ",$tableLine); # put the other parameters there if needed
+ }
+ if($line eq "[distortion-file]"){
+ $tableLine = <MOSES_READER>;
+ chomp($tableLine);
+ ($_,$_,$_,$REORDERING_TABLE_FILE) = split(" ",$tableLine); # put the other parameters there if needed
+ }
+ }
+ close(MOSES_READER);
+}
+
+sub get_number_of_phrases {
+ my $ret = 0;
+ open(TABLE_READER, &open_compressed($TRANSLATION_TABLE_FILE)) or die "ERROR: Can't read $TRANSLATION_TABLE_FILE";
+
+ while(<TABLE_READER>) {
+ $ret++;
+ }
+
+ close (TABLE_READER);
+ return $ret;
+}
+
+sub split_file_into_chunks {
+ my ($file_to_split, $chunk_size, $number_of_phrases_to_process) = @_;
+ open(SOURCE_READER, &open_compressed($file_to_split)) or die "ERROR: Can't read $file_to_split";
+ my $FRAG_SOURCE_WRITER;
+ for(my $i = 0; $i < $number_of_phrases_to_process && !eof(SOURCE_READER); $i++) {
+ if(($i % $chunk_size) == 0){ # open fragmented file to write
+ my $frag_file = &get_fragmented_file_name($file_to_split, $i, $chunk_size);
+ open(FRAG_SOURCE_WRITER, ">".$frag_file) or die "ERROR: Can't write $frag_file";
+ }
+ my $line = <SOURCE_READER>;
+ print FRAG_SOURCE_WRITER $line;
+ if(($i % $chunk_size) == $chunk_size - 1 || $i == $number_of_phrases_to_process - 1){ # close fragmented file before opening a new one
+ close(FRAG_SOURCE_WRITER);
+ }
+ }
+}
+
+
diff --git a/contrib/relent-filter/scripts/interpolateScores.pl b/contrib/relent-filter/scripts/interpolateScores.pl
new file mode 100644
index 000000000..b204e951a
--- /dev/null
+++ b/contrib/relent-filter/scripts/interpolateScores.pl
@@ -0,0 +1,94 @@
+#!/usr/bin/perl -w
+use Getopt::Long;
+use File::Basename;
+use POSIX;
+
+$operation="+";
+
+# read arguments
+$_HELP = 1 if (@ARGV < 1 or !GetOptions ("files=s" => \$files, #moses conf file
+"weights=s" => \$weights,
+"operation=s" => \$operation)); #directory to put all the output files
+
+
+# help message if arguments are not correct
+if ($_HELP) {
+ print "Relative Entropy Pruning
+Usage: perl interpolateScores.pl [PARAMS]
+Function: interpolates any number of score files according to the given weights
+Authors: Wang Ling ( lingwang at cs dot cmu dot edu )
+PARAMS:
+ -files : score files to interpolate, separated by a space (Ex \"file1 file2 file3\")
+ -weights : interpolation weights separated by a space (Ex \"0.3 0.3 0.4\")
+ -operation : +, * or min, depending on the operation used to combine the scores
+For any questions contact lingwang at cs dot cmu dot edu
+";
+ exit(1);
+}
+
+@FILES = split(/\s+/, $files);
+@WEIGHTS = split(/\s+/, $weights);
+
+my $ZCAT = "gzip -cd";
+my $BZCAT = "bzcat";
+
+&interpolate();
+
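+# Reads all score files in parallel, one line per phrase pair, and combines
+# the i-th scores s_i with weights w_i according to -operation:
+#   +   : weighted sum       sum_i  w_i * s_i
+#   *   : weighted product   prod_i s_i ** w_i
+#   min : weighted minimum   min_i  w_i * s_i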
+sub interpolate {
+ my @READERS;
+ for($i = 0; $i < @FILES; $i++){
+ local *FILE;
+ open(FILE, &open_compressed($FILES[$i])) or die "ERROR: Can't read $FILES[$i]";
+ push(@READERS, *FILE);
+ }
+ $FIRST = $READERS[0];
+ while(!eof($FIRST)) {
+ if($operation eq "+"){
+ my $score = 0;
+ for($i = 0; $i < @FILES; $i++){
+ my $READER = $READERS[$i];
+ my $line = <$READER>;
+ chomp($line);
+ $score += $line*$WEIGHTS[$i];
+ }
+ print "$score\n";
+ }
+ if($operation eq "*"){
+ my $score = 1;
+ for($i = 0; $i < @FILES; $i++){
+ my $READER = $READERS[$i];
+ my $line = <$READER>;
+ chomp($line);
+ $score *= $line ** $WEIGHTS[$i];
+ }
+ print "$score\n"
+ }
+ if($operation eq "min"){
+ my $score = 99999;
+ for($i = 0; $i < @FILES; $i++){
+ my $READER = $READERS[$i];
+ my $line = <$READER>;
+ chomp($line);
+ if ($score > $line*$WEIGHTS[$i]){
+ $score = $line*$WEIGHTS[$i];
+ }
+ }
+ print "$score\n"
+
+ }
+ }
+}
+
+sub open_compressed {
+ my ($file) = @_;
+ print STDERR "FILE: $file\n";
+
+ # add extensions, if necessary
+ $file = $file.".bz2" if ! -e $file && -e $file.".bz2";
+ $file = $file.".gz" if ! -e $file && -e $file.".gz";
+
+ # pipe zipped, if necessary
+ return "$BZCAT $file|" if $file =~ /\.bz2$/;
+ return "$ZCAT $file|" if $file =~ /\.gz$/;
+ return $file;
+}
diff --git a/contrib/relent-filter/scripts/prunePT.pl b/contrib/relent-filter/scripts/prunePT.pl
new file mode 100755
index 000000000..37dc30bad
--- /dev/null
+++ b/contrib/relent-filter/scripts/prunePT.pl
@@ -0,0 +1,114 @@
+#!/usr/bin/perl -w
+
+# read arguments
+my $tmp_dir = "";
+my $percentage = -1;
+my $threshold = -1;
+use Getopt::Long;
+$_HELP = 1 if (@ARGV < 1 or !GetOptions ("table=s" => \$table, #table to filter
+"scores=s" => \$scores_file, #scores of each phrase pair, should have same size as the table to filter
+"percentage=i" => \$percentage, # percentage of phrase table to remain
+"threshold=i" => \$threshold)); # threshold (score < threshold equals prune entry)
+
+# help message if arguments are not correct
+if ($_HELP) {
+ print "Relative Entropy Pruning
+Usage: perl prunePT.pl [PARAMS]
+Function: prunes a phrase table given a score file
+Authors: Wang Ling ( lingwang at cs dot cmu dot edu )
+PARAMS:
+ -table : table to prune
+ -percentage : percentage of the phrase table to retain (if multiple entries share the cutoff score, the exact percentage may not be attainable; the script then retains more than the given percentage)
+ -threshold : threshold to prune (entries with score < threshold are pruned); do not use this if percentage is specified
+For any questions contact lingwang at cs dot cmu dot edu
+";
+ exit(1);
+}
+
+
+my $THRESHOLD = $threshold;
+if ($percentage != -1){
+ $THRESHOLD = &get_threshold_by_percentage($percentage);
+}
+
+my $ZCAT = "gzip -cd";
+my $BZCAT = "bzcat";
+
+&prune_by_threshold($THRESHOLD);
+
+sub prune_by_threshold {
+ my $th = $_[0];
+ print STDERR "pruning using threshold $th \n";
+ open (SCORE_READER, &open_compressed($scores_file));
+ open (TABLE_READER, &open_compressed($table));
+ $number_of_phrases=0;
+ $number_of_unpruned_phrases=0;
+ while(!eof(SCORE_READER) && !eof(TABLE_READER)){
+ $score_line = <SCORE_READER>;
+ $table_line = <TABLE_READER>;
+ chomp($score_line);
+ if($score_line >= $th){
+ print $table_line;
+ $number_of_unpruned_phrases++;
+ }
+ $number_of_phrases++;
+ }
+ print STDERR "pruned ".($number_of_phrases - $number_of_unpruned_phrases)." phrase pairs out of $number_of_phrases\n";
+}
+
+sub get_threshold_by_percentage {
+ my $percentage = $_[0];
+ $ret = 0;
+
+ $number_of_phrases = &get_number_of_phrases();
+ $stop_phrase = ($percentage * $number_of_phrases) / 100;
+ $phrase_number = 0;
+
+ # sort the scores numerically (ascending) and return the score found at
+ # the cutoff position; entries scoring below it will be pruned
+ open (SCORE_READER, "cat $scores_file | LC_ALL=C sort -g |");
+ while(<SCORE_READER>) {
+ my $line = $_;
+ if($phrase_number >= $stop_phrase){
+ chomp($line);
+ $ret = $line;
+ last;
+ }
+ $phrase_number++;
+ }
+
+ close (SCORE_READER);
+ return $ret;
+}
+
+sub get_number_of_phrases {
+ $ret = 0;
+ open (SCORE_READER, $scores_file);
+
+ while(<SCORE_READER>) {
+ $ret++;
+ }
+
+ close (SCORE_READER);
+ return $ret;
+}
+
+sub open_compressed {
+ my ($file) = @_;
+ print STDERR "FILE: $file\n";
+
+ # add extensions, if necessary
+ $file = $file.".bz2" if ! -e $file && -e $file.".bz2";
+ $file = $file.".gz" if ! -e $file && -e $file.".gz";
+
+ # pipe zipped, if necessary
+ return "$BZCAT $file|" if $file =~ /\.bz2$/;
+ return "$ZCAT $file|" if $file =~ /\.gz$/;
+ return $file;
+}
diff --git a/contrib/relent-filter/sigtest-filter/Makefile b/contrib/relent-filter/sigtest-filter/Makefile
new file mode 100755
index 000000000..71de9c45f
--- /dev/null
+++ b/contrib/relent-filter/sigtest-filter/Makefile
@@ -0,0 +1,10 @@
+SALMDIR=/Users/hieuhoang/workspace/salm
+FLAVOR?=o64
+INC=-I$(SALMDIR)/Src/Shared -I$(SALMDIR)/Src/SuffixArrayApplications -I$(SALMDIR)/Src/SuffixArrayApplications/SuffixArraySearch
+OBJS=$(SALMDIR)/Distribution/Linux/Objs/Search/_SuffixArrayApplicationBase.$(FLAVOR) $(SALMDIR)/Distribution/Linux/Objs/Search/_SuffixArraySearchApplicationBase.$(FLAVOR) $(SALMDIR)/Distribution/Linux/Objs/Shared/_String.$(FLAVOR) $(SALMDIR)/Distribution/Linux/Objs/Shared/_IDVocabulary.$(FLAVOR)
+
+all: filter-pt
+
+filter-pt: filter-pt.cpp
+ ./check-install $(SALMDIR)
+ $(CXX) -O6 $(INC) $(OBJS) -o filter-pt filter-pt.cpp
diff --git a/contrib/relent-filter/sigtest-filter/README.txt b/contrib/relent-filter/sigtest-filter/README.txt
new file mode 100755
index 000000000..b21129b89
--- /dev/null
+++ b/contrib/relent-filter/sigtest-filter/README.txt
@@ -0,0 +1,42 @@
+Re-implementation of Johnson et al. (2007)'s phrasetable filtering strategy.
+
+This implementation relies on Joy Zhang's SALM Suffix Array toolkit. It is
+available here:
+
+ http://projectile.sv.cmu.edu/research/public/tools/salm/salm.htm
+
+--Chris Dyer <redpony@umd.edu>
+
+BUILD INSTRUCTIONS
+---------------------------------
+
+1. Download and build SALM.
+
+2. make SALMDIR=/path/to/SALM
+
+
+USAGE INSTRUCTIONS
+---------------------------------
+
+1. Using the SALM/Bin/Linux/Index/IndexSA.O32, create a suffix array index
+ of the source and target sides of your training bitext.
+
+2. cat phrase-table.txt | ./filter-pt -e TARG.suffix -f SOURCE.suffix \
+ -l <FILTER-VALUE>
+
+ FILTER-VALUE is the -log prob threshold described in Johnson et al.
+ (2007)'s paper. It may be either 'a+e', 'a-e', or a positive real
+ value. 'a+e' is a good setting - it filters out <1,1,1> phrase pairs.
+ I also recommend using -n 30, which filters out all but the top
+ 30 phrase pairs, sorted by P(e|f). This was used in the paper.
+
+3. Run with no options to see more use-cases.
+
+
+REFERENCES
+---------------------------------
+
+H. Johnson, J. Martin, G. Foster and R. Kuhn. (2007) Improving Translation
+ Quality by Discarding Most of the Phrasetable. In Proceedings of the 2007
+ Joint Conference on Empirical Methods in Natural Language Processing and
+ Computational Natural Language Learning (EMNLP-CoNLL), pp. 967-975.
diff --git a/contrib/relent-filter/sigtest-filter/WIN32_functions.cpp b/contrib/relent-filter/sigtest-filter/WIN32_functions.cpp
new file mode 100755
index 000000000..60ddd340c
--- /dev/null
+++ b/contrib/relent-filter/sigtest-filter/WIN32_functions.cpp
@@ -0,0 +1,231 @@
+// XGetopt.cpp Version 1.2
+//
+// Author: Hans Dietrich
+// hdietrich2@hotmail.com
+//
+// Description:
+// XGetopt.cpp implements getopt(), a function to parse command lines.
+//
+// History
+// Version 1.2 - 2003 May 17
+// - Added Unicode support
+//
+// Version 1.1 - 2002 March 10
+// - Added example to XGetopt.cpp module header
+//
+// This software is released into the public domain.
+// You are free to use it in any way you like.
+//
+// This software is provided "as is" with no expressed
+// or implied warranty. I accept no liability for any
+// damage or loss of business that this software may cause.
+//
+///////////////////////////////////////////////////////////////////////////////
+
+
+///////////////////////////////////////////////////////////////////////////////
+// if you are using precompiled headers then include this line:
+///////////////////////////////////////////////////////////////////////////////
+
+
+///////////////////////////////////////////////////////////////////////////////
+// if you are not using precompiled headers then include these lines:
+//#include <windows.h>
+//#include <stdio.h>
+//#include <tchar.h>
+///////////////////////////////////////////////////////////////////////////////
+
+
+#include <stdio.h>
+#include <string.h>
+#include <math.h>
+#include "WIN32_functions.h"
+
+
+///////////////////////////////////////////////////////////////////////////////
+//
+// X G e t o p t . c p p
+//
+//
+// NAME
+// getopt -- parse command line options
+//
+// SYNOPSIS
+// int getopt(int argc, char *argv[], char *optstring)
+//
+// extern char *optarg;
+// extern int optind;
+//
+// DESCRIPTION
+// The getopt() function parses the command line arguments. Its
+// arguments argc and argv are the argument count and array as
+// passed into the application on program invocation. In the case
+// of Visual C++ programs, argc and argv are available via the
+// variables __argc and __argv (double underscores), respectively.
+// getopt returns the next option letter in argv that matches a
+// letter in optstring. (Note: Unicode programs should use
+// __targv instead of __argv. Also, all character and string
+// literals should be enclosed in ( ) ).
+//
+// optstring is a string of recognized option letters; if a letter
+// is followed by a colon, the option is expected to have an argument
+// that may or may not be separated from it by white space. optarg
+// is set to point to the start of the option argument on return from
+// getopt.
+//
+// Option letters may be combined, e.g., "-ab" is equivalent to
+// "-a -b". Option letters are case sensitive.
+//
+// getopt places in the external variable optind the argv index
+// of the next argument to be processed. optind is initialized
+// to 0 before the first call to getopt.
+//
+// When all options have been processed (i.e., up to the first
+// non-option argument), getopt returns EOF, optarg will point
+// to the argument, and optind will be set to the argv index of
+// the argument. If there are no non-option arguments, optarg
+// will be set to NULL.
+//
+// The special option "--" may be used to delimit the end of the
+// options; EOF will be returned, and "--" (and everything after it)
+// will be skipped.
+//
+// RETURN VALUE
+// For option letters contained in the string optstring, getopt
+// will return the option letter. getopt returns a question mark (?)
+// when it encounters an option letter not included in optstring.
+// EOF is returned when processing is finished.
+//
+// BUGS
+// 1) Long options are not supported.
+// 2) The GNU double-colon extension is not supported.
+// 3) The environment variable POSIXLY_CORRECT is not supported.
+// 4) The + syntax is not supported.
+// 5) The automatic permutation of arguments is not supported.
+// 6) This implementation of getopt() returns EOF if an error is
+// encountered, instead of -1 as the latest standard requires.
+//
+// EXAMPLE
+// BOOL CMyApp::ProcessCommandLine(int argc, char *argv[])
+// {
+// int c;
+//
+// while ((c = getopt(argc, argv, ("aBn:"))) != EOF)
+// {
+// switch (c)
+// {
+// case ('a'):
+// TRACE(("option a\n"));
+// //
+// // set some flag here
+// //
+// break;
+//
+// case ('B'):
+// TRACE( ("option B\n"));
+// //
+// // set some other flag here
+// //
+// break;
+//
+// case ('n'):
+// TRACE(("option n: value=%d\n"), atoi(optarg));
+// //
+// // do something with value here
+// //
+// break;
+//
+// case ('?'):
+// TRACE(("ERROR: illegal option %s\n"), argv[optind-1]);
+// return FALSE;
+// break;
+//
+// default:
+// TRACE(("WARNING: no handler for option %c\n"), c);
+// return FALSE;
+// break;
+// }
+// }
+// //
+// // check for non-option args here
+// //
+// return TRUE;
+// }
+//
+///////////////////////////////////////////////////////////////////////////////
+
+char *optarg; // global argument pointer
+int optind = 0; // global argv index
+
+int getopt(int argc, char *argv[], char *optstring)
+{
+ static char *next = NULL;
+ if (optind == 0)
+ next = NULL;
+
+ optarg = NULL;
+
+ if (next == NULL || *next =='\0') {
+ if (optind == 0)
+ optind++;
+
+ if (optind >= argc || argv[optind][0] != ('-') || argv[optind][1] == ('\0')) {
+ optarg = NULL;
+ if (optind < argc)
+ optarg = argv[optind];
+ return EOF;
+ }
+
+ if (strcmp(argv[optind], "--") == 0) {
+ optind++;
+ optarg = NULL;
+ if (optind < argc)
+ optarg = argv[optind];
+ return EOF;
+ }
+
+ next = argv[optind];
+ next++; // skip past -
+ optind++;
+ }
+
+ char c = *next++;
+ char *cp = strchr(optstring, c);
+
+ if (cp == NULL || c == (':'))
+ return ('?');
+
+ cp++;
+ if (*cp == (':')) {
+ if (*next != ('\0')) {
+ optarg = next;
+ next = NULL;
+ } else if (optind < argc) {
+ optarg = argv[optind];
+ optind++;
+ } else {
+ return ('?');
+ }
+ }
+
+ return c;
+}
+
+// for an overview, see
+// W. Press, S. Teukolsky and W. Vetterling. (1992) Numerical Recipes in C. Chapter 6.1.
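+// This appears to be the standard 6-coefficient Lanczos approximation of
+// ln(Gamma(x)) from that chapter; for the integer arguments used by the
+// significance test it returns log((x-1)!), with lgamma(1) = lgamma(2) = 0.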
+double lgamma(int x)
+{
+ // size_t xx=(size_t)x; xx--; size_t sum=1; while (xx) { sum *= xx--; } return log((double)(sum));
+ if (x <= 2) {
+ return 0.0;
+ }
+ static double coefs[6] = {76.18009172947146, -86.50532032941677, 24.01409824083091, -1.231739572450155, 0.1208650973866179e-2, -0.5395239384953e-5};
+ double tmp=(double)x+5.5;
+ tmp -= (((double)x)+0.5)*log(tmp);
+ double y=(double)x;
+ double sum = 1.000000000190015;
+ for (size_t j=0; j<6; ++j) {
+ sum += coefs[j]/++y;
+ }
+ return -tmp+log(2.5066282746310005*sum/(double)x);
+}
\ No newline at end of file
diff --git a/contrib/relent-filter/sigtest-filter/WIN32_functions.h b/contrib/relent-filter/sigtest-filter/WIN32_functions.h
new file mode 100755
index 000000000..6a719392e
--- /dev/null
+++ b/contrib/relent-filter/sigtest-filter/WIN32_functions.h
@@ -0,0 +1,24 @@
+// XGetopt.h Version 1.2
+//
+// Author: Hans Dietrich
+// hdietrich2@hotmail.com
+//
+// This software is released into the public domain.
+// You are free to use it in any way you like.
+//
+// This software is provided "as is" with no expressed
+// or implied warranty. I accept no liability for any
+// damage or loss of business that this software may cause.
+//
+///////////////////////////////////////////////////////////////////////////////
+
+#ifndef XGETOPT_H
+#define XGETOPT_H
+
+extern int optind, opterr;
+extern char *optarg;
+
+int getopt(int argc, char *argv[], char *optstring);
+double lgamma(int x);
+
+#endif //XGETOPT_H
diff --git a/contrib/relent-filter/sigtest-filter/check-install b/contrib/relent-filter/sigtest-filter/check-install
new file mode 100755
index 000000000..ba4f431e0
--- /dev/null
+++ b/contrib/relent-filter/sigtest-filter/check-install
@@ -0,0 +1,5 @@
+#!/usr/bin/perl -w
+use strict;
+my $path = shift @ARGV;
+die "Can't find SALM installation path: $path\nPlease use:\n\n make SALMDIR=/path/to/SALM\n\n" unless (-d $path);
+exit 0;
diff --git a/contrib/relent-filter/sigtest-filter/filter-pt b/contrib/relent-filter/sigtest-filter/filter-pt
new file mode 100755
index 000000000..1dfe38fdf
--- /dev/null
+++ b/contrib/relent-filter/sigtest-filter/filter-pt
Binary files differ
diff --git a/contrib/relent-filter/sigtest-filter/filter-pt.cpp b/contrib/relent-filter/sigtest-filter/filter-pt.cpp
new file mode 100755
index 000000000..4a51953ea
--- /dev/null
+++ b/contrib/relent-filter/sigtest-filter/filter-pt.cpp
@@ -0,0 +1,377 @@
+
+#include <cstring>
+#include <cassert>
+#include <cstdio>
+#include <cstdlib>
+#include <algorithm>
+
+#include "_SuffixArraySearchApplicationBase.h"
+
+#include <vector>
+#include <iostream>
+#include <set>
+#include <map> // needed for the PhraseSetMap typedef below
+
+#ifdef WIN32
+#include "WIN32_functions.h"
+#else
+#include <unistd.h>
+#endif
+
+typedef std::set<TextLenType> SentIdSet;
+typedef std::map<std::string, SentIdSet> PhraseSetMap;
+
+#undef min
+
+// constants
+const size_t MINIMUM_SIZE_TO_KEEP = 10000; // reduce this to improve memory usage,
+// increase for speed
+const std::string SEPARATOR = " ||| ";
+
+const double ALPHA_PLUS_EPS = -1000.0; // dummy value
+const double ALPHA_MINUS_EPS = -2000.0; // dummy value
+
+// configuration params
+int pfe_filter_limit = 0; // 0 = don't filter anything based on P(f|e)
+bool print_cooc_counts = false; // add cooc counts to phrase table?
+bool print_neglog_significance = false; // add -log(p) to phrase table?
+double sig_filter_limit = 0; // keep phrase pairs with -log(sig) > sig_filter_limit
+// higher = filter-more
+bool pef_filter_only = false; // only filter based on pef
+
+// globals
+PhraseSetMap esets;
+double p_111 = 0.0; // alpha
+size_t nremoved_sigfilter = 0;
+size_t nremoved_pfefilter = 0;
+
+C_SuffixArraySearchApplicationBase e_sa;
+C_SuffixArraySearchApplicationBase f_sa;
+int num_lines;
+
+void usage()
+{
+ std::cerr << "\nFilter phrase table using significance testing as described\n"
+ << "in H. Johnson, et al. (2007) Improving Translation Quality\n"
+ << "by Discarding Most of the Phrasetable. EMNLP 2007.\n"
+ << "\nUsage:\n"
+ << "\n filter-pt -e english.suf-arr -f french.suf-arr\n"
+ << " [-c] [-p] [-l threshold] [-n num] < PHRASE-TABLE > FILTERED-PHRASE-TABLE\n\n"
+ << " [-l threshold] >0.0, a+e, or a-e: keep values that have a -log significance > this\n"
+ << " [-n num ] 0, 1...: 0=no filtering, >0 sort by P(e|f) and keep the top num elements\n"
+ << " [-c ] add the cooccurence counts to the phrase table\n"
+ << " [-p ] add -log(significance) to the phrasetable\n\n";
+ exit(1);
+}
+
+struct PTEntry {
+ PTEntry(const std::string& str, int index);
+ std::string f_phrase;
+ std::string e_phrase;
+ std::string extra;
+ std::string scores;
+ float pfe;
+ int cf;
+ int ce;
+ int cfe;
+ float nlog_pte;
+ void set_cooc_stats(int _cef, int _cf, int _ce, float nlp) {
+ cfe = _cef;
+ cf = _cf;
+ ce = _ce;
+ nlog_pte = nlp;
+ }
+
+};
+
+PTEntry::PTEntry(const std::string& str, int index) :
+ cf(0), ce(0), cfe(0), nlog_pte(0.0)
+{
+ size_t pos = 0;
+ std::string::size_type nextPos = str.find(SEPARATOR, pos);
+ this->f_phrase = str.substr(pos,nextPos);
+
+ pos = nextPos + SEPARATOR.size();
+ nextPos = str.find(SEPARATOR, pos);
+ this->e_phrase = str.substr(pos,nextPos-pos);
+
+ pos = nextPos + SEPARATOR.size();
+ nextPos = str.find(SEPARATOR, pos);
+ this->scores = str.substr(pos,nextPos-pos);
+
+ pos = nextPos + SEPARATOR.size();
+ this->extra = str.substr(pos);
+
+ int c = 0;
+ std::string::iterator i=scores.begin();
+ if (index > 0) {
+ for (; i != scores.end(); ++i) {
+ if ((*i) == ' ') {
+ c++;
+ if (c == index) break;
+ }
+ }
+ }
+ if (i != scores.end()) {
+ ++i;
+ }
+ char f[24];
+ char *fp=f;
+ while (i != scores.end() && *i != ' ') {
+ *fp++=*i++;
+ }
+ *fp++=0;
+
+ this->pfe = atof(f);
+
+ // std::cerr << "L: " << f_phrase << " ::: " << e_phrase << " ::: " << scores << " ::: " << pfe << std::endl;
+ // std::cerr << "X: " << extra << "\n";
+}
+
+struct PfeComparer {
+ bool operator()(const PTEntry* a, const PTEntry* b) const {
+ return a->pfe > b->pfe;
+ }
+};
+
+struct NlogSigThresholder {
+ NlogSigThresholder(float threshold) : t(threshold) {}
+ float t;
+ bool operator()(const PTEntry* a) const {
+ if (a->nlog_pte < t) {
+ delete a;
+ return true;
+ } else return false;
+ }
+};
+
+std::ostream& operator << (std::ostream& os, const PTEntry& pp)
+{
+ //os << pp.f_phrase << " ||| " << pp.e_phrase;
+ //os << " ||| " << pp.scores;
+ //if (pp.extra.size()>0) os << " ||| " << pp.extra;
+ if (print_cooc_counts) os << pp.cfe << " " << pp.cf << " " << pp.ce;
+ if (print_neglog_significance) os << " ||| " << pp.nlog_pte;
+ return os;
+}
+
+void print(int a, int b, int c, int d, float p)
+{
+ std::cerr << a << "\t" << b << "\t P=" << p << "\n"
+ << c << "\t" << d << "\t xf=" << (double)(b)*(double)(c)/(double)(a+1)/(double)(d+1) << "\n\n";
+}
+
+// 2x2 (one-sided) Fisher's exact test
+// see B. Moore. (2004) On Log Likelihood and the Significance of Rare Events
+double fisher_exact(int cfe, int ce, int cf)
+{
+ assert(cfe <= ce);
+ assert(cfe <= cf);
+
+ int a = cfe;
+ int b = (cf - cfe);
+ int c = (ce - cfe);
+ int d = (num_lines - ce - cf + cfe);
+ int n = a + b + c + d;
+
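+ // The counts form a 2x2 contingency table over the num_lines sentence pairs:
+ //   a = cfe (both phrases)        b = cf - cfe (source phrase only)
+ //   c = ce - cfe (target only)    d = the remaining sentence pairs
+ // cp below starts as the hypergeometric probability of this exact table,
+ //   (a+b)! (c+d)! (a+c)! (b+d)! / (n! a! b! c! d!),
+ // and the loop sums the one-sided tail by stepping to more extreme tables.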
+ double cp = exp(lgamma(1+a+c) + lgamma(1+b+d) + lgamma(1+a+b) + lgamma(1+c+d) - lgamma(1+n) - lgamma(1+a) - lgamma(1+b) - lgamma(1+c) - lgamma(1+d));
+ double total_p = 0.0;
+ int tc = std::min(b,c);
+ for (int i=0; i<=tc; i++) {
+ total_p += cp;
+// double lg = lgamma(1+a+c) + lgamma(1+b+d) + lgamma(1+a+b) + lgamma(1+c+d) - lgamma(1+n) - lgamma(1+a) - lgamma(1+b) - lgamma(1+c) - lgamma(1+d); double cp = exp(lg);
+// print(a,b,c,d,cp);
+ double coef = (double)(b)*(double)(c)/(double)(a+1)/(double)(d+1);
+ cp *= coef;
+ ++a;
+ --c;
+ ++d;
+ --b;
+ }
+ return total_p;
+}
+
+// input: unordered list of translation options for a single source phrase
+void compute_cooc_stats_and_filter(std::vector<PTEntry*>& options)
+{
+ if (pfe_filter_limit>0 && options.size() > pfe_filter_limit) {
+ nremoved_pfefilter += (options.size() - pfe_filter_limit);
+ std::nth_element(options.begin(), options.begin()+pfe_filter_limit, options.end(), PfeComparer());
+ for (std::vector<PTEntry*>::iterator i=options.begin()+pfe_filter_limit; i != options.end(); ++i)
+ delete *i;
+ options.erase(options.begin()+pfe_filter_limit,options.end());
+ }
+ if (pef_filter_only) return;
+
+ SentIdSet fset;
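+ // Significance statistics are computed at the sentence level: cf and ce
+ // count the training sentences containing the source / target phrase
+ // (looked up in the suffix arrays), and cef counts the sentences containing
+ // both; these three feed the one-sided Fisher's exact test below.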
+ vector<S_SimplePhraseLocationElement> locations;
+ //std::cerr << "Looking up f-phrase: " << options.front()->f_phrase << "\n";
+
+ locations = f_sa.locateExactPhraseInCorpus(options.front()->f_phrase.c_str());
+ if(locations.size()==0) {
+ cerr<<"No occurrences found!!\n";
+ }
+ for (vector<S_SimplePhraseLocationElement>::iterator i=locations.begin();
+ i != locations.end();
+ ++i) {
+ fset.insert(i->sentIdInCorpus);
+ }
+ size_t cf = fset.size();
+ for (std::vector<PTEntry*>::iterator i=options.begin(); i != options.end(); ++i) {
+ const std::string& e_phrase = (*i)->e_phrase;
+ size_t cef=0;
+ SentIdSet& eset = esets[(*i)->e_phrase];
+ if (eset.empty()) {
+ //std::cerr << "Looking up e-phrase: " << e_phrase << "\n";
+ vector<S_SimplePhraseLocationElement> locations = e_sa.locateExactPhraseInCorpus(e_phrase.c_str());
+ for (vector<S_SimplePhraseLocationElement>::iterator i=locations.begin(); i!= locations.end(); ++i) {
+ TextLenType curSentId = i->sentIdInCorpus;
+ eset.insert(curSentId);
+ }
+ }
+ size_t ce=eset.size();
+ if (ce < cf) {
+ for (SentIdSet::iterator i=eset.begin(); i != eset.end(); ++i) {
+ if (fset.find(*i) != fset.end()) cef++;
+ }
+ } else {
+ for (SentIdSet::iterator i=fset.begin(); i != fset.end(); ++i) {
+ if (eset.find(*i) != eset.end()) cef++;
+ }
+ }
+ double nlp = -log(fisher_exact(cef, cf, ce));
+ (*i)->set_cooc_stats(cef, cf, ce, nlp);
+ if (ce < MINIMUM_SIZE_TO_KEEP) {
+ esets.erase(e_phrase);
+ }
+ }
+ std::vector<PTEntry*>::iterator new_end =
+ std::remove_if(options.begin(), options.end(), NlogSigThresholder(sig_filter_limit));
+ nremoved_sigfilter += (options.end() - new_end);
+ options.erase(new_end,options.end());
+}
+
+int main(int argc, char * argv[])
+{
+ int c;
+ const char* efile=0;
+ const char* ffile=0;
+ int pfe_index = 2;
+ while ((c = getopt(argc, argv, "cpf:e:i:n:l:")) != -1) {
+ switch (c) {
+ case 'e':
+ efile = optarg;
+ break;
+ case 'f':
+ ffile = optarg;
+ break;
+ case 'i': // index of pfe in phrase table
+ pfe_index = atoi(optarg);
+ break;
+ case 'n': // keep only the top n entries in phrase table sorted by p(f|e) (0=all)
+ pfe_filter_limit = atoi(optarg);
+ std::cerr << "P(f|e) filter limit: " << pfe_filter_limit << std::endl;
+ break;
+ case 'c':
+ print_cooc_counts = true;
+ break;
+ case 'p':
+ print_neglog_significance = true;
+ break;
+ case 'l':
+ std::cerr << "-l = " << optarg << "\n";
+ if (strcmp(optarg,"a+e") == 0) {
+ sig_filter_limit = ALPHA_PLUS_EPS;
+ } else if (strcmp(optarg,"a-e") == 0) {
+ sig_filter_limit = ALPHA_MINUS_EPS;
+ } else {
+ char *x;
+ sig_filter_limit = strtod(optarg, &x);
+ }
+ break;
+ default:
+ usage();
+ }
+ }
+ //-----------------------------------------------------------------------------
+ if (optind != argc || ((!efile || !ffile) && !pef_filter_only)) {
+ usage();
+ }
+
+ //load the indexed corpus with vocabulary(noVoc=false) and with offset(noOffset=false)
+ if (!pef_filter_only) {
+ e_sa.loadData_forSearch(efile, false, false);
+ f_sa.loadData_forSearch(ffile, false, false);
+ size_t elines = e_sa.returnTotalSentNumber();
+ size_t flines = f_sa.returnTotalSentNumber();
+ if (elines != flines) {
+ std::cerr << "Number of lines in e-corpus != number of lines in f-corpus!\n";
+ usage();
+ } else {
+ std::cerr << "Training corpus: " << elines << " lines\n";
+ num_lines = elines;
+ }
+ p_111 = -log(fisher_exact(1,1,1));
+ std::cerr << "\\alpha = " << p_111 << "\n";
+ if (sig_filter_limit == ALPHA_MINUS_EPS) {
+ sig_filter_limit = p_111 - 0.001;
+ } else if (sig_filter_limit == ALPHA_PLUS_EPS) {
+ sig_filter_limit = p_111 + 0.001;
+ }
+ std::cerr << "Sig filter threshold is = " << sig_filter_limit << "\n";
+ } else {
+ std::cerr << "Filtering using P(e|f) only. n=" << pfe_filter_limit << std::endl;
+ }
+
+ char tmpString[10000];
+ std::string prev = "";
+ std::vector<PTEntry*> options;
+ size_t pt_lines = 0;
+ while(!cin.eof()) {
+ cin.getline(tmpString,10000,'\n');
+ if(++pt_lines%10000==0) {
+ std::cerr << ".";
+ if(pt_lines%500000==0) std::cerr << "[n:"<<pt_lines<<"]\n";
+ }
+
+ if(strlen(tmpString)>0) {
+ PTEntry* pp = new PTEntry(tmpString, pfe_index);
+ if (prev != pp->f_phrase) {
+ prev = pp->f_phrase;
+
+ if (!options.empty()) { // always true after first line
+ compute_cooc_stats_and_filter(options);
+ }
+ for (std::vector<PTEntry*>::iterator i=options.begin(); i != options.end(); ++i) {
+ std::cout << **i << std::endl;
+ delete *i;
+ }
+ options.clear();
+ options.push_back(pp);
+
+ } else {
+ options.push_back(pp);
+ }
+ // for(int i=0;i<locations.size(); i++){
+ // cout<<"SentId="<<locations[i].sentIdInCorpus<<" Pos="<<(int)locations[i].posInSentInCorpus<<endl;
+ // }
+ }
+ }
+ compute_cooc_stats_and_filter(options);
+ for (std::vector<PTEntry*>::iterator i=options.begin(); i != options.end(); ++i) {
+ std::cout << **i << std::endl;
+ delete *i;
+ }
+ float pfefper = (100.0*(float)nremoved_pfefilter)/(float)pt_lines;
+ float sigfper = (100.0*(float)nremoved_sigfilter)/(float)pt_lines;
+ std::cerr << "\n\n------------------------------------------------------\n"
+ << " unfiltered phrases pairs: " << pt_lines << "\n"
+ << "\n"
+ << " P(f|e) filter [first]: " << nremoved_pfefilter << " (" << pfefper << "%)\n"
+ << " significance filter: " << nremoved_sigfilter << " (" << sigfper << "%)\n"
+ << " TOTAL FILTERED: " << (nremoved_pfefilter + nremoved_sigfilter) << " (" << (sigfper + pfefper) << "%)\n"
+ << "\n"
+ << " FILTERED phrase pairs: " << (pt_lines - nremoved_pfefilter - nremoved_sigfilter) << " (" << (100.0-sigfper - pfefper) << "%)\n"
+ << "------------------------------------------------------\n";
+
+ return 0;
+}
diff --git a/contrib/relent-filter/sigtest-filter/sigtest-filter.sln b/contrib/relent-filter/sigtest-filter/sigtest-filter.sln
new file mode 100755
index 000000000..517b06238
--- /dev/null
+++ b/contrib/relent-filter/sigtest-filter/sigtest-filter.sln
@@ -0,0 +1,20 @@
+
+Microsoft Visual Studio Solution File, Format Version 9.00
+# Visual Studio 2005
+Project("{8BC9CEB8-8B4A-11D0-8D11-00A0C91BC942}") = "sigtest-filter", "sigtest-filter.vcproj", "{FA2910DF-FD9D-4E6D-A393-9F9F9E309E78}"
+EndProject
+Global
+ GlobalSection(SolutionConfigurationPlatforms) = preSolution
+ Debug|Win32 = Debug|Win32
+ Release|Win32 = Release|Win32
+ EndGlobalSection
+ GlobalSection(ProjectConfigurationPlatforms) = postSolution
+ {FA2910DF-FD9D-4E6D-A393-9F9F9E309E78}.Debug|Win32.ActiveCfg = Debug|Win32
+ {FA2910DF-FD9D-4E6D-A393-9F9F9E309E78}.Debug|Win32.Build.0 = Debug|Win32
+ {FA2910DF-FD9D-4E6D-A393-9F9F9E309E78}.Release|Win32.ActiveCfg = Release|Win32
+ {FA2910DF-FD9D-4E6D-A393-9F9F9E309E78}.Release|Win32.Build.0 = Release|Win32
+ EndGlobalSection
+ GlobalSection(SolutionProperties) = preSolution
+ HideSolutionNode = FALSE
+ EndGlobalSection
+EndGlobal
diff --git a/contrib/relent-filter/src/IOWrapper.cpp b/contrib/relent-filter/src/IOWrapper.cpp
new file mode 100755
index 000000000..053735c96
--- /dev/null
+++ b/contrib/relent-filter/src/IOWrapper.cpp
@@ -0,0 +1,580 @@
+// $Id$
+
+/***********************************************************************
+Moses - factored phrase-based language decoder
+Copyright (c) 2006 University of Edinburgh
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification,
+are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice,
+ this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright notice,
+ this list of conditions and the following disclaimer in the documentation
+ and/or other materials provided with the distribution.
+ * Neither the name of the University of Edinburgh nor the names of its contributors
+ may be used to endorse or promote products derived from this software
+ without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS
+BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
+IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.
+***********************************************************************/
+
+// example file on how to use moses library
+
+#include <iostream>
+#include <stack>
+#include "TypeDef.h"
+#include "Util.h"
+#include "IOWrapper.h"
+#include "Hypothesis.h"
+#include "WordsRange.h"
+#include "TrellisPathList.h"
+#include "StaticData.h"
+#include "DummyScoreProducers.h"
+#include "InputFileStream.h"
+
+using namespace std;
+using namespace Moses;
+
+namespace MosesCmd
+{
+
+IOWrapper::IOWrapper(
+ const vector<FactorType> &inputFactorOrder
+ , const vector<FactorType> &outputFactorOrder
+ , const FactorMask &inputFactorUsed
+ , size_t nBestSize
+ , const string &nBestFilePath)
+ :m_inputFactorOrder(inputFactorOrder)
+ ,m_outputFactorOrder(outputFactorOrder)
+ ,m_inputFactorUsed(inputFactorUsed)
+ ,m_inputFile(NULL)
+ ,m_inputStream(&std::cin)
+ ,m_nBestStream(NULL)
+ ,m_outputWordGraphStream(NULL)
+ ,m_outputSearchGraphStream(NULL)
+ ,m_detailedTranslationReportingStream(NULL)
+ ,m_alignmentOutputStream(NULL)
+{
+ Initialization(inputFactorOrder, outputFactorOrder
+ , inputFactorUsed
+ , nBestSize, nBestFilePath);
+}
+
+IOWrapper::IOWrapper(const std::vector<FactorType> &inputFactorOrder
+ , const std::vector<FactorType> &outputFactorOrder
+ , const FactorMask &inputFactorUsed
+ , size_t nBestSize
+ , const std::string &nBestFilePath
+ , const std::string &inputFilePath)
+ :m_inputFactorOrder(inputFactorOrder)
+ ,m_outputFactorOrder(outputFactorOrder)
+ ,m_inputFactorUsed(inputFactorUsed)
+ ,m_inputFilePath(inputFilePath)
+ ,m_inputFile(new InputFileStream(inputFilePath))
+ ,m_nBestStream(NULL)
+ ,m_outputWordGraphStream(NULL)
+ ,m_outputSearchGraphStream(NULL)
+ ,m_detailedTranslationReportingStream(NULL)
+ ,m_alignmentOutputStream(NULL)
+{
+ Initialization(inputFactorOrder, outputFactorOrder
+ , inputFactorUsed
+ , nBestSize, nBestFilePath);
+
+ m_inputStream = m_inputFile;
+}
+
+IOWrapper::~IOWrapper()
+{
+ if (m_inputFile != NULL)
+ delete m_inputFile;
+ if (m_nBestStream != NULL && !m_surpressSingleBestOutput) {
+ // outputting n-best to file, rather than stdout. need to close file and delete obj
+ delete m_nBestStream;
+ }
+ if (m_outputWordGraphStream != NULL) {
+ delete m_outputWordGraphStream;
+ }
+ if (m_outputSearchGraphStream != NULL) {
+ delete m_outputSearchGraphStream;
+ }
+ delete m_detailedTranslationReportingStream;
+ delete m_alignmentOutputStream;
+}
+
+void IOWrapper::Initialization(const std::vector<FactorType> &/*inputFactorOrder*/
+ , const std::vector<FactorType> &/*outputFactorOrder*/
+ , const FactorMask &/*inputFactorUsed*/
+ , size_t nBestSize
+ , const std::string &nBestFilePath)
+{
+ const StaticData &staticData = StaticData::Instance();
+
+ // n-best
+ m_surpressSingleBestOutput = false;
+
+ if (nBestSize > 0) {
+ if (nBestFilePath == "-" || nBestFilePath == "/dev/stdout") {
+ m_nBestStream = &std::cout;
+ m_surpressSingleBestOutput = true;
+ } else {
+ std::ofstream *file = new std::ofstream;
+ m_nBestStream = file;
+ file->open(nBestFilePath.c_str());
+ }
+ }
+
+ // wordgraph output
+ if (staticData.GetOutputWordGraph()) {
+ string fileName = staticData.GetParam("output-word-graph")[0];
+ std::ofstream *file = new std::ofstream;
+ m_outputWordGraphStream = file;
+ file->open(fileName.c_str());
+ }
+
+
+// search graph output
+ if (staticData.GetOutputSearchGraph()) {
+ string fileName;
+ if (staticData.GetOutputSearchGraphExtended())
+ fileName = staticData.GetParam("output-search-graph-extended")[0];
+ else
+ fileName = staticData.GetParam("output-search-graph")[0];
+ std::ofstream *file = new std::ofstream;
+ m_outputSearchGraphStream = file;
+ file->open(fileName.c_str());
+ }
+
+ // detailed translation reporting
+ if (staticData.IsDetailedTranslationReportingEnabled()) {
+ const std::string &path = staticData.GetDetailedTranslationReportingFilePath();
+ m_detailedTranslationReportingStream = new std::ofstream(path.c_str());
+ CHECK(m_detailedTranslationReportingStream->good());
+ }
+
+ // sentence alignment output
+ if (! staticData.GetAlignmentOutputFile().empty()) {
+ m_alignmentOutputStream = new ofstream(staticData.GetAlignmentOutputFile().c_str());
+ CHECK(m_alignmentOutputStream->good());
+ }
+
+}
+
+InputType *IOWrapper::GetInput(InputType* inputType)
+{
+ if(inputType->Read(*m_inputStream, m_inputFactorOrder)) {
+ if (long x = inputType->GetTranslationId()) {
+ if (x>=m_translationId) m_translationId = x+1;
+ } else inputType->SetTranslationId(m_translationId++);
+
+ return inputType;
+ } else {
+ delete inputType;
+ return NULL;
+ }
+}
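+
+// Note: if the input carries its own translation id (e.g. a <seg id> tag, depending
+// on input options), GetInput keeps it and moves the internal counter past it;
+// otherwise sentences are numbered consecutively from the counter.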
+
+/***
+ * print surface factor only for the given phrase
+ */
+void OutputSurface(std::ostream &out, const Hypothesis &edge, const std::vector<FactorType> &outputFactorOrder,
+ bool reportSegmentation, bool reportAllFactors)
+{
+ CHECK(outputFactorOrder.size() > 0);
+ const Phrase& phrase = edge.GetCurrTargetPhrase();
+ if (reportAllFactors == true) {
+ out << phrase;
+ } else {
+ size_t size = phrase.GetSize();
+ for (size_t pos = 0 ; pos < size ; pos++) {
+ const Factor *factor = phrase.GetFactor(pos, outputFactorOrder[0]);
+      CHECK(factor);
+      out << *factor;
+
+ for (size_t i = 1 ; i < outputFactorOrder.size() ; i++) {
+ const Factor *factor = phrase.GetFactor(pos, outputFactorOrder[i]);
+ CHECK(factor);
+
+ out << "|" << *factor;
+ }
+ out << " ";
+ }
+ }
+
+ // trace option "-t"
+ if (reportSegmentation == true && phrase.GetSize() > 0) {
+ out << "|" << edge.GetCurrSourceWordsRange().GetStartPos()
+ << "-" << edge.GetCurrSourceWordsRange().GetEndPos() << "| ";
+ }
+}
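+
+// Illustrative output (hypothetical words and factors): with outputFactorOrder
+// covering surface and POS factors, and reportSegmentation on, a phrase covering
+// source words 0-1 might print
+//   la|DET maison|NN |0-1|
+// i.e. factors joined by '|' within each word, words separated by spaces, and the
+// source span appended as a trace.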
+
+void OutputBestSurface(std::ostream &out, const Hypothesis *hypo, const std::vector<FactorType> &outputFactorOrder,
+ bool reportSegmentation, bool reportAllFactors)
+{
+ if (hypo != NULL) {
+ // recursively retrace this best path through the lattice, starting from the end of the hypothesis sentence
+ OutputBestSurface(out, hypo->GetPrevHypo(), outputFactorOrder, reportSegmentation, reportAllFactors);
+ OutputSurface(out, *hypo, outputFactorOrder, reportSegmentation, reportAllFactors);
+ }
+}
+
+void OutputAlignment(ostream &out, const AlignmentInfo &ai, size_t sourceOffset, size_t targetOffset)
+{
+ typedef std::vector< const std::pair<size_t,size_t>* > AlignVec;
+ AlignVec alignments = ai.GetSortedAlignments();
+
+ AlignVec::const_iterator it;
+ for (it = alignments.begin(); it != alignments.end(); ++it) {
+ const std::pair<size_t,size_t> &alignment = **it;
+ out << alignment.first + sourceOffset << "-" << alignment.second + targetOffset << " ";
+ }
+
+}
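+
+// Example (invented numbers): with sourceOffset=2, targetOffset=5 and phrase-internal
+// alignments {(0,0),(1,2)}, this writes "2-5 3-7 " -- absolute source-target word
+// positions in the usual "s-t" alignment format.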
+
+void OutputAlignment(ostream &out, const vector<const Hypothesis *> &edges)
+{
+ size_t targetOffset = 0;
+
+ for (int currEdge = (int)edges.size() - 1 ; currEdge >= 0 ; currEdge--) {
+ const Hypothesis &edge = *edges[currEdge];
+ const TargetPhrase &tp = edge.GetCurrTargetPhrase();
+ size_t sourceOffset = edge.GetCurrSourceWordsRange().GetStartPos();
+
+ OutputAlignment(out, tp.GetAlignmentInfo(), sourceOffset, targetOffset);
+
+ targetOffset += tp.GetSize();
+ }
+ out << std::endl;
+}
+
+void OutputAlignment(OutputCollector* collector, size_t lineNo , const vector<const Hypothesis *> &edges)
+{
+ ostringstream out;
+ OutputAlignment(out, edges);
+
+ collector->Write(lineNo,out.str());
+}
+
+void OutputAlignment(OutputCollector* collector, size_t lineNo , const Hypothesis *hypo)
+{
+ if (collector) {
+ std::vector<const Hypothesis *> edges;
+ const Hypothesis *currentHypo = hypo;
+ while (currentHypo) {
+ edges.push_back(currentHypo);
+ currentHypo = currentHypo->GetPrevHypo();
+ }
+
+ OutputAlignment(collector,lineNo, edges);
+ }
+}
+
+void OutputAlignment(OutputCollector* collector, size_t lineNo , const TrellisPath &path)
+{
+ if (collector) {
+ OutputAlignment(collector,lineNo, path.GetEdges());
+ }
+}
+
+void OutputBestHypo(const Moses::TrellisPath &path, long /*translationId*/, bool reportSegmentation, bool reportAllFactors, std::ostream &out)
+{
+ const std::vector<const Hypothesis *> &edges = path.GetEdges();
+
+ for (int currEdge = (int)edges.size() - 1 ; currEdge >= 0 ; currEdge--) {
+ const Hypothesis &edge = *edges[currEdge];
+ OutputSurface(out, edge, StaticData::Instance().GetOutputFactorOrder(), reportSegmentation, reportAllFactors);
+ }
+ out << endl;
+}
+
+void IOWrapper::Backtrack(const Hypothesis *hypo)
+{
+
+ if (hypo->GetPrevHypo() != NULL) {
+ VERBOSE(3,hypo->GetId() << " <= ");
+ Backtrack(hypo->GetPrevHypo());
+ }
+}
+
+void OutputBestHypo(const std::vector<Word>& mbrBestHypo, long /*translationId*/, bool /*reportSegmentation*/, bool /*reportAllFactors*/, ostream& out)
+{
+
+ for (size_t i = 0 ; i < mbrBestHypo.size() ; i++) {
+ const Factor *factor = mbrBestHypo[i].GetFactor(StaticData::Instance().GetOutputFactorOrder()[0]);
+ CHECK(factor);
+ if (i>0) out << " " << *factor;
+ else out << *factor;
+ }
+ out << endl;
+}
+
+
+void OutputInput(std::vector<const Phrase*>& map, const Hypothesis* hypo)
+{
+ if (hypo->GetPrevHypo()) {
+ OutputInput(map, hypo->GetPrevHypo());
+ map[hypo->GetCurrSourceWordsRange().GetStartPos()] = hypo->GetSourcePhrase();
+ }
+}
+
+void OutputInput(std::ostream& os, const Hypothesis* hypo)
+{
+ size_t len = hypo->GetInput().GetSize();
+ std::vector<const Phrase*> inp_phrases(len, 0);
+ OutputInput(inp_phrases, hypo);
+ for (size_t i=0; i<len; ++i)
+ if (inp_phrases[i]) os << *inp_phrases[i];
+}
+
+void IOWrapper::OutputBestHypo(const Hypothesis *hypo, long /*translationId*/, bool reportSegmentation, bool reportAllFactors)
+{
+ if (hypo != NULL) {
+ VERBOSE(1,"BEST TRANSLATION: " << *hypo << endl);
+ VERBOSE(3,"Best path: ");
+ Backtrack(hypo);
+ VERBOSE(3,"0" << std::endl);
+ if (!m_surpressSingleBestOutput) {
+ if (StaticData::Instance().IsPathRecoveryEnabled()) {
+ OutputInput(cout, hypo);
+ cout << "||| ";
+ }
+ OutputBestSurface(cout, hypo, m_outputFactorOrder, reportSegmentation, reportAllFactors);
+ cout << endl;
+ }
+ } else {
+ VERBOSE(1, "NO BEST TRANSLATION" << endl);
+ if (!m_surpressSingleBestOutput) {
+ cout << endl;
+ }
+ }
+}
+
+void OutputNBest(std::ostream& out, const Moses::TrellisPathList &nBestList, const std::vector<Moses::FactorType>& outputFactorOrder, const TranslationSystem* system, long translationId, bool reportSegmentation)
+{
+ const StaticData &staticData = StaticData::Instance();
+ bool labeledOutput = staticData.IsLabeledNBestList();
+ bool reportAllFactors = staticData.GetReportAllFactorsNBest();
+ bool includeAlignment = staticData.NBestIncludesAlignment();
+ bool includeWordAlignment = staticData.PrintAlignmentInfoInNbest();
+
+ TrellisPathList::const_iterator iter;
+ for (iter = nBestList.begin() ; iter != nBestList.end() ; ++iter) {
+ const TrellisPath &path = **iter;
+ const std::vector<const Hypothesis *> &edges = path.GetEdges();
+
+ // print the surface factor of the translation
+ out << translationId << " ||| ";
+ for (int currEdge = (int)edges.size() - 1 ; currEdge >= 0 ; currEdge--) {
+ const Hypothesis &edge = *edges[currEdge];
+ OutputSurface(out, edge, outputFactorOrder, reportSegmentation, reportAllFactors);
+ }
+ out << " |||";
+
+ std::string lastName = "";
+ const vector<const StatefulFeatureFunction*>& sff = system->GetStatefulFeatureFunctions();
+ for( size_t i=0; i<sff.size(); i++ ) {
+ if( labeledOutput && lastName != sff[i]->GetScoreProducerWeightShortName() ) {
+ lastName = sff[i]->GetScoreProducerWeightShortName();
+ out << " " << lastName << ":";
+ }
+ vector<float> scores = path.GetScoreBreakdown().GetScoresForProducer( sff[i] );
+ for (size_t j = 0; j<scores.size(); ++j) {
+ out << " " << scores[j];
+ }
+ }
+
+ const vector<const StatelessFeatureFunction*>& slf = system->GetStatelessFeatureFunctions();
+ for( size_t i=0; i<slf.size(); i++ ) {
+ if( labeledOutput && lastName != slf[i]->GetScoreProducerWeightShortName() ) {
+ lastName = slf[i]->GetScoreProducerWeightShortName();
+ out << " " << lastName << ":";
+ }
+ vector<float> scores = path.GetScoreBreakdown().GetScoresForProducer( slf[i] );
+ for (size_t j = 0; j<scores.size(); ++j) {
+ out << " " << scores[j];
+ }
+ }
+
+ // translation components
+ const vector<PhraseDictionaryFeature*>& pds = system->GetPhraseDictionaries();
+ if (pds.size() > 0) {
+
+ for( size_t i=0; i<pds.size(); i++ ) {
+ size_t pd_numinputscore = pds[i]->GetNumInputScores();
+ vector<float> scores = path.GetScoreBreakdown().GetScoresForProducer( pds[i] );
+ for (size_t j = 0; j<scores.size(); ++j){
+
+ if (labeledOutput && (i == 0) ){
+ if ((j == 0) || (j == pd_numinputscore)){
+ lastName = pds[i]->GetScoreProducerWeightShortName(j);
+ out << " " << lastName << ":";
+ }
+ }
+ out << " " << scores[j];
+ }
+ }
+ }
+
+ // generation
+ const vector<GenerationDictionary*>& gds = system->GetGenerationDictionaries();
+ if (gds.size() > 0) {
+
+ for( size_t i=0; i<gds.size(); i++ ) {
+ size_t pd_numinputscore = gds[i]->GetNumInputScores();
+ vector<float> scores = path.GetScoreBreakdown().GetScoresForProducer( gds[i] );
+ for (size_t j = 0; j<scores.size(); ++j){
+
+ if (labeledOutput && (i == 0) ){
+ if ((j == 0) || (j == pd_numinputscore)){
+ lastName = gds[i]->GetScoreProducerWeightShortName(j);
+ out << " " << lastName << ":";
+ }
+ }
+ out << " " << scores[j];
+ }
+ }
+ }
+
+ // total
+ out << " ||| " << path.GetTotalScore();
+
+ //phrase-to-phrase alignment
+ if (includeAlignment) {
+ out << " |||";
+ for (int currEdge = (int)edges.size() - 2 ; currEdge >= 0 ; currEdge--) {
+ const Hypothesis &edge = *edges[currEdge];
+ const WordsRange &sourceRange = edge.GetCurrSourceWordsRange();
+ WordsRange targetRange = path.GetTargetWordsRange(edge);
+ out << " " << sourceRange.GetStartPos();
+ if (sourceRange.GetStartPos() < sourceRange.GetEndPos()) {
+ out << "-" << sourceRange.GetEndPos();
+ }
+ out<< "=" << targetRange.GetStartPos();
+ if (targetRange.GetStartPos() < targetRange.GetEndPos()) {
+ out<< "-" << targetRange.GetEndPos();
+ }
+ }
+ }
+
+ if (includeWordAlignment) {
+ out << " ||| ";
+ for (int currEdge = (int)edges.size() - 2 ; currEdge >= 0 ; currEdge--) {
+ const Hypothesis &edge = *edges[currEdge];
+ const WordsRange &sourceRange = edge.GetCurrSourceWordsRange();
+ WordsRange targetRange = path.GetTargetWordsRange(edge);
+ const int sourceOffset = sourceRange.GetStartPos();
+ const int targetOffset = targetRange.GetStartPos();
+ const AlignmentInfo &ai = edge.GetCurrTargetPhrase().GetAlignmentInfo();
+
+ OutputAlignment(out, ai, sourceOffset, targetOffset);
+
+ }
+ }
+
+ if (StaticData::Instance().IsPathRecoveryEnabled()) {
+ out << "|||";
+ OutputInput(out, edges[0]);
+ }
+
+ out << endl;
+ }
+
+  out << std::flush;
+}
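+
+// Sketch of one resulting n-best line (labels and scores invented for illustration):
+//   42 ||| the house ||| d: -2.5 lm: -14.1 tm: -3.2 -1.7 w: -2 ||| -7.93
+// i.e. translation id, surface string, labeled feature scores, then the total score,
+// optionally followed by phrase alignment, word alignment and input-path fields.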
+
+void OutputLatticeMBRNBest(std::ostream& out, const vector<LatticeMBRSolution>& solutions,long translationId)
+{
+ for (vector<LatticeMBRSolution>::const_iterator si = solutions.begin(); si != solutions.end(); ++si) {
+ out << translationId;
+ out << " |||";
+ const vector<Word> mbrHypo = si->GetWords();
+ for (size_t i = 0 ; i < mbrHypo.size() ; i++) {
+ const Factor *factor = mbrHypo[i].GetFactor(StaticData::Instance().GetOutputFactorOrder()[0]);
+ if (i>0) out << " " << *factor;
+ else out << *factor;
+ }
+ out << " |||";
+ out << " map: " << si->GetMapScore();
+ out << " w: " << mbrHypo.size();
+ const vector<float>& ngramScores = si->GetNgramScores();
+ for (size_t i = 0; i < ngramScores.size(); ++i) {
+ out << " " << ngramScores[i];
+ }
+ out << " ||| " << si->GetScore();
+
+ out << endl;
+ }
+}
+
+
+void IOWrapper::OutputLatticeMBRNBestList(const vector<LatticeMBRSolution>& solutions,long translationId)
+{
+ OutputLatticeMBRNBest(*m_nBestStream, solutions,translationId);
+}
+
+bool ReadInput(IOWrapper &ioWrapper, InputTypeEnum inputType, InputType*& source)
+{
+ delete source;
+ switch(inputType) {
+ case SentenceInput:
+ source = ioWrapper.GetInput(new Sentence);
+ break;
+ case ConfusionNetworkInput:
+ source = ioWrapper.GetInput(new ConfusionNet);
+ break;
+ case WordLatticeInput:
+ source = ioWrapper.GetInput(new WordLattice);
+ break;
+ default:
+ TRACE_ERR("Unknown input type: " << inputType << "\n");
+ }
+  return source != NULL;
+}
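+
+// Typical driver loop (a sketch; see the main() in LatticeMBRGrid.cpp for a real use,
+// given an IOWrapper &io):
+//   InputType* source = NULL;
+//   while (ReadInput(io, staticData.GetInputType(), source)) { /* translate *source */ }
+// ReadInput deletes the previous sentence on entry, so source must start out NULL.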
+
+
+
+IOWrapper *GetIOWrapper(const StaticData &staticData)
+{
+ IOWrapper *ioWrapper;
+ const std::vector<FactorType> &inputFactorOrder = staticData.GetInputFactorOrder()
+ ,&outputFactorOrder = staticData.GetOutputFactorOrder();
+ FactorMask inputFactorUsed(inputFactorOrder);
+
+ // io
+ if (staticData.GetParam("input-file").size() == 1) {
+ VERBOSE(2,"IO from File" << endl);
+ string filePath = staticData.GetParam("input-file")[0];
+
+ ioWrapper = new IOWrapper(inputFactorOrder, outputFactorOrder, inputFactorUsed
+ , staticData.GetNBestSize()
+ , staticData.GetNBestFilePath()
+ , filePath);
+ } else {
+ VERBOSE(1,"IO from STDOUT/STDIN" << endl);
+ ioWrapper = new IOWrapper(inputFactorOrder, outputFactorOrder, inputFactorUsed
+ , staticData.GetNBestSize()
+ , staticData.GetNBestFilePath());
+ }
+ ioWrapper->ResetTranslationId();
+
+ IFVERBOSE(1)
+ PrintUserTime("Created input-output object");
+
+ return ioWrapper;
+}
+
+}
+
diff --git a/contrib/relent-filter/src/IOWrapper.h b/contrib/relent-filter/src/IOWrapper.h
new file mode 100755
index 000000000..e44208002
--- /dev/null
+++ b/contrib/relent-filter/src/IOWrapper.h
@@ -0,0 +1,142 @@
+// $Id$
+
+/***********************************************************************
+Moses - factored phrase-based language decoder
+Copyright (c) 2006 University of Edinburgh
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification,
+are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice,
+ this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright notice,
+ this list of conditions and the following disclaimer in the documentation
+ and/or other materials provided with the distribution.
+ * Neither the name of the University of Edinburgh nor the names of its contributors
+ may be used to endorse or promote products derived from this software
+ without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS
+BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
+IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.
+***********************************************************************/
+
+// example file on how to use moses library
+
+#ifndef moses_cmd_IOWrapper_h
+#define moses_cmd_IOWrapper_h
+
+#include <cassert>
+#include <fstream>
+#include <ostream>
+#include <vector>
+#include "util/check.hh"
+
+#include "TypeDef.h"
+#include "Sentence.h"
+#include "FactorTypeSet.h"
+#include "FactorCollection.h"
+#include "Hypothesis.h"
+#include "OutputCollector.h"
+#include "TrellisPathList.h"
+#include "InputFileStream.h"
+#include "InputType.h"
+#include "WordLattice.h"
+#include "LatticeMBR.h"
+
+namespace MosesCmd
+{
+
+/** Helper class that holds the streams and bookkeeping needed to read source
+ *  sentences and write translations and reports out to the command line.
+ */
+class IOWrapper
+{
+protected:
+ long m_translationId;
+
+ const std::vector<Moses::FactorType> &m_inputFactorOrder;
+ const std::vector<Moses::FactorType> &m_outputFactorOrder;
+ const Moses::FactorMask &m_inputFactorUsed;
+ std::string m_inputFilePath;
+ Moses::InputFileStream *m_inputFile;
+ std::istream *m_inputStream;
+ std::ostream *m_nBestStream
+ ,*m_outputWordGraphStream,*m_outputSearchGraphStream;
+ std::ostream *m_detailedTranslationReportingStream;
+ std::ofstream *m_alignmentOutputStream;
+ bool m_surpressSingleBestOutput;
+
+ void Initialization(const std::vector<Moses::FactorType> &inputFactorOrder
+ , const std::vector<Moses::FactorType> &outputFactorOrder
+ , const Moses::FactorMask &inputFactorUsed
+ , size_t nBestSize
+ , const std::string &nBestFilePath);
+
+public:
+ IOWrapper(const std::vector<Moses::FactorType> &inputFactorOrder
+ , const std::vector<Moses::FactorType> &outputFactorOrder
+ , const Moses::FactorMask &inputFactorUsed
+ , size_t nBestSize
+ , const std::string &nBestFilePath);
+
+ IOWrapper(const std::vector<Moses::FactorType> &inputFactorOrder
+ , const std::vector<Moses::FactorType> &outputFactorOrder
+ , const Moses::FactorMask &inputFactorUsed
+ , size_t nBestSize
+ , const std::string &nBestFilePath
+ , const std::string &infilePath);
+ ~IOWrapper();
+
+ Moses::InputType* GetInput(Moses::InputType *inputType);
+
+ void OutputBestHypo(const Moses::Hypothesis *hypo, long translationId, bool reportSegmentation, bool reportAllFactors);
+ void OutputLatticeMBRNBestList(const std::vector<LatticeMBRSolution>& solutions,long translationId);
+ void Backtrack(const Moses::Hypothesis *hypo);
+
+ void ResetTranslationId() {
+ m_translationId = 0;
+ }
+
+ std::ofstream *GetAlignmentOutputStream() {
+ return m_alignmentOutputStream;
+ }
+
+ std::ostream &GetOutputWordGraphStream() {
+ return *m_outputWordGraphStream;
+ }
+ std::ostream &GetOutputSearchGraphStream() {
+ return *m_outputSearchGraphStream;
+ }
+
+ std::ostream &GetDetailedTranslationReportingStream() {
+ assert (m_detailedTranslationReportingStream);
+ return *m_detailedTranslationReportingStream;
+ }
+};
+
+IOWrapper *GetIOWrapper(const Moses::StaticData &staticData);
+bool ReadInput(IOWrapper &ioWrapper, Moses::InputTypeEnum inputType, Moses::InputType*& source);
+void OutputBestSurface(std::ostream &out, const Moses::Hypothesis *hypo, const std::vector<Moses::FactorType> &outputFactorOrder, bool reportSegmentation, bool reportAllFactors);
+void OutputNBest(std::ostream& out, const Moses::TrellisPathList &nBestList, const std::vector<Moses::FactorType>&,
+ const Moses::TranslationSystem* system, long translationId, bool reportSegmentation);
+void OutputLatticeMBRNBest(std::ostream& out, const std::vector<LatticeMBRSolution>& solutions,long translationId);
+void OutputBestHypo(const std::vector<Moses::Word>& mbrBestHypo, long /*translationId*/,
+ bool reportSegmentation, bool reportAllFactors, std::ostream& out);
+void OutputBestHypo(const Moses::TrellisPath &path, long /*translationId*/,bool reportSegmentation, bool reportAllFactors, std::ostream &out);
+void OutputInput(std::ostream& os, const Moses::Hypothesis* hypo);
+void OutputAlignment(Moses::OutputCollector* collector, size_t lineNo, const Moses::Hypothesis *hypo);
+void OutputAlignment(Moses::OutputCollector* collector, size_t lineNo, const Moses::TrellisPath &path);
+
+
+}
+
+#endif
diff --git a/contrib/relent-filter/src/Jamfile b/contrib/relent-filter/src/Jamfile
new file mode 100755
index 000000000..c0aa6160d
--- /dev/null
+++ b/contrib/relent-filter/src/Jamfile
@@ -0,0 +1,6 @@
+alias deps : ../../../moses/src//moses ;
+
+exe calcDivergence : Main.cpp mbr.cpp IOWrapper.cpp TranslationAnalysis.cpp LatticeMBR.cpp RelativeEntropyCalc.cpp deps ;
+
+alias programs : calcDivergence ;
+
diff --git a/contrib/relent-filter/src/LatticeMBR.cpp b/contrib/relent-filter/src/LatticeMBR.cpp
new file mode 100755
index 000000000..2bd62747e
--- /dev/null
+++ b/contrib/relent-filter/src/LatticeMBR.cpp
@@ -0,0 +1,669 @@
+/*
+ * LatticeMBR.cpp
+ * moses-cmd
+ *
+ * Created by Abhishek Arun on 26/01/2010.
+ * Copyright 2010 __MyCompanyName__. All rights reserved.
+ *
+ */
+
+#include "LatticeMBR.h"
+#include "StaticData.h"
+#include <algorithm>
+#include <set>
+
+using namespace std;
+using namespace Moses;
+
+namespace MosesCmd
+{
+
+size_t bleu_order = 4;
+float UNKNGRAMLOGPROB = -20;
+void GetOutputWords(const TrellisPath &path, vector <Word> &translation)
+{
+ const std::vector<const Hypothesis *> &edges = path.GetEdges();
+
+  // collect the words of the translation, from sentence start to end
+ for (int currEdge = (int)edges.size() - 1 ; currEdge >= 0 ; currEdge--) {
+ const Hypothesis &edge = *edges[currEdge];
+ const Phrase &phrase = edge.GetCurrTargetPhrase();
+ size_t size = phrase.GetSize();
+ for (size_t pos = 0 ; pos < size ; pos++) {
+ translation.push_back(phrase.GetWord(pos));
+ }
+ }
+}
+
+
+void extract_ngrams(const vector<Word>& sentence, map<Phrase, int>& allngrams)
+{
+  for (int k = 0; k < (int)bleu_order; k++) {
+    for (int i = 0; i < max((int)sentence.size()-k, 0); i++) {
+      Phrase ngram(k+1);
+      for (int j = i; j <= i+k; j++) {
+        ngram.AddWord(sentence[j]);
+      }
+      ++allngrams[ngram];
+    }
+  }
+}
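+
+// Worked example: for the 3-word sentence "a b c" and bleu_order = 4, allngrams
+// receives "a", "b", "c", "a b", "b c" and "a b c", each with count 1; orders longer
+// than the sentence are skipped by the max(...) bound.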
+
+
+
+void NgramScores::addScore(const Hypothesis* node, const Phrase& ngram, float score)
+{
+ set<Phrase>::const_iterator ngramIter = m_ngrams.find(ngram);
+ if (ngramIter == m_ngrams.end()) {
+ ngramIter = m_ngrams.insert(ngram).first;
+ }
+ map<const Phrase*,float>& ngramScores = m_scores[node];
+ map<const Phrase*,float>::iterator scoreIter = ngramScores.find(&(*ngramIter));
+ if (scoreIter == ngramScores.end()) {
+ ngramScores[&(*ngramIter)] = score;
+ } else {
+ ngramScores[&(*ngramIter)] = log_sum(score,scoreIter->second);
+ }
+}
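+
+// Note: log_sum(a, b) = log(exp(a) + exp(b)), so repeated calls accumulate the
+// probability mass of an n-gram at a node while staying in log space; e.g. adding
+// log(0.25) twice for the same (node, ngram) leaves a stored score of log(0.5).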
+
+NgramScores::NodeScoreIterator NgramScores::nodeBegin(const Hypothesis* node)
+{
+ return m_scores[node].begin();
+}
+
+
+NgramScores::NodeScoreIterator NgramScores::nodeEnd(const Hypothesis* node)
+{
+ return m_scores[node].end();
+}
+
+LatticeMBRSolution::LatticeMBRSolution(const TrellisPath& path, bool isMap) :
+ m_score(0.0f)
+{
+ const std::vector<const Hypothesis *> &edges = path.GetEdges();
+
+ for (int currEdge = (int)edges.size() - 1 ; currEdge >= 0 ; currEdge--) {
+ const Hypothesis &edge = *edges[currEdge];
+ const Phrase &phrase = edge.GetCurrTargetPhrase();
+ size_t size = phrase.GetSize();
+ for (size_t pos = 0 ; pos < size ; pos++) {
+ m_words.push_back(phrase.GetWord(pos));
+ }
+ }
+ if (isMap) {
+ m_mapScore = path.GetTotalScore();
+ } else {
+ m_mapScore = 0;
+ }
+}
+
+
+void LatticeMBRSolution::CalcScore(map<Phrase, float>& finalNgramScores, const vector<float>& thetas, float mapWeight)
+{
+ m_ngramScores.assign(thetas.size()-1, -10000);
+
+ map < Phrase, int > counts;
+ extract_ngrams(m_words,counts);
+
+ //Now score this translation
+ m_score = thetas[0] * m_words.size();
+
+ //Calculate the ngramScores, working in log space at first
+ for (map < Phrase, int >::iterator ngrams = counts.begin(); ngrams != counts.end(); ++ngrams) {
+ float ngramPosterior = UNKNGRAMLOGPROB;
+ map<Phrase,float>::const_iterator ngramPosteriorIt = finalNgramScores.find(ngrams->first);
+ if (ngramPosteriorIt != finalNgramScores.end()) {
+ ngramPosterior = ngramPosteriorIt->second;
+ }
+ size_t ngramSize = ngrams->first.GetSize();
+ m_ngramScores[ngramSize-1] = log_sum(log((float)ngrams->second) + ngramPosterior,m_ngramScores[ngramSize-1]);
+ }
+
+ //convert from log to probability and create weighted sum
+ for (size_t i = 0; i < m_ngramScores.size(); ++i) {
+ m_ngramScores[i] = exp(m_ngramScores[i]);
+ m_score += thetas[i+1] * m_ngramScores[i];
+ }
+
+
+ //The map score
+ m_score += m_mapScore*mapWeight;
+}
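+
+// This is the linear (first-order) approximation to the BLEU gain from Tromble et al.
+// (2008): score(h) = theta_0*|h| + sum_n theta_n*c_n(h) + mapWeight*mapScore, where
+// c_n(h) is the posterior-weighted count of h's n-grams of order n under the lattice.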
+
+
+void pruneLatticeFB(Lattice & connectedHyp, map < const Hypothesis*, set <const Hypothesis* > > & outgoingHyps, map<const Hypothesis*, vector<Edge> >& incomingEdges,
+ const vector< float> & estimatedScores, const Hypothesis* bestHypo, size_t edgeDensity, float scale)
+{
+
+ //Need hyp 0 in connectedHyp - Find empty hypothesis
+ VERBOSE(2,"Pruning lattice to edge density " << edgeDensity << endl);
+ const Hypothesis* emptyHyp = connectedHyp.at(0);
+ while (emptyHyp->GetId() != 0) {
+ emptyHyp = emptyHyp->GetPrevHypo();
+ }
+ connectedHyp.push_back(emptyHyp); //Add it to list of hyps
+
+ //Need hyp 0's outgoing Hyps
+ for (size_t i = 0; i < connectedHyp.size(); ++i) {
+ if (connectedHyp[i]->GetId() > 0 && connectedHyp[i]->GetPrevHypo()->GetId() == 0)
+ outgoingHyps[emptyHyp].insert(connectedHyp[i]);
+ }
+
+ //sort hyps based on estimated scores - do so by copying to multimap
+ multimap<float, const Hypothesis*> sortHypsByVal;
+ for (size_t i =0; i < estimatedScores.size(); ++i) {
+ sortHypsByVal.insert(make_pair(estimatedScores[i], connectedHyp[i]));
+ }
+
+ multimap<float, const Hypothesis*>::const_iterator it = --sortHypsByVal.end();
+ float bestScore = it->first;
+ //store best score as score of hyp 0
+ sortHypsByVal.insert(make_pair(bestScore, emptyHyp));
+
+
+ IFVERBOSE(3) {
+ for (multimap<float, const Hypothesis*>::const_iterator it = --sortHypsByVal.end(); it != --sortHypsByVal.begin(); --it) {
+ const Hypothesis* currHyp = it->second;
+ cerr << "Hyp " << currHyp->GetId() << ", estimated score: " << it->first << endl;
+ }
+ }
+
+
+  set <const Hypothesis*> survivingHyps; //store hyps that survive the pruning cut
+
+ VERBOSE(2, "BEST HYPO TARGET LENGTH : " << bestHypo->GetSize() << endl)
+ size_t numEdgesTotal = edgeDensity * bestHypo->GetSize(); //as per Shankar, aim for (density * target length of MAP solution) arcs
+ size_t numEdgesCreated = 0;
+ VERBOSE(2, "Target edge count: " << numEdgesTotal << endl);
+
+ float prevScore = -999999;
+
+ //now iterate over multimap
+ for (multimap<float, const Hypothesis*>::const_iterator it = --sortHypsByVal.end(); it != --sortHypsByVal.begin(); --it) {
+ float currEstimatedScore = it->first;
+ const Hypothesis* currHyp = it->second;
+
+ if (numEdgesCreated >= numEdgesTotal && prevScore > currEstimatedScore) //if this hyp has equal estimated score to previous, include its edges too
+ break;
+
+ prevScore = currEstimatedScore;
+ VERBOSE(3, "Num edges created : "<< numEdgesCreated << ", numEdges wanted " << numEdgesTotal << endl)
+ VERBOSE(3, "Considering hyp " << currHyp->GetId() << ", estimated score: " << it->first << endl)
+
+ survivingHyps.insert(currHyp); //CurrHyp made the cut
+
+ // is its best predecessor already included ?
+ if (survivingHyps.find(currHyp->GetPrevHypo()) != survivingHyps.end()) { //yes, then add an edge
+ vector <Edge>& edges = incomingEdges[currHyp];
+ Edge winningEdge(currHyp->GetPrevHypo(),currHyp,scale*(currHyp->GetScore() - currHyp->GetPrevHypo()->GetScore()),currHyp->GetCurrTargetPhrase());
+ edges.push_back(winningEdge);
+ ++numEdgesCreated;
+ }
+
+ //let's try the arcs too
+ const ArcList *arcList = currHyp->GetArcList();
+ if (arcList != NULL) {
+ ArcList::const_iterator iterArcList;
+ for (iterArcList = arcList->begin() ; iterArcList != arcList->end() ; ++iterArcList) {
+ const Hypothesis *loserHypo = *iterArcList;
+ const Hypothesis* loserPrevHypo = loserHypo->GetPrevHypo();
+ if (survivingHyps.find(loserPrevHypo) != survivingHyps.end()) { //found it, add edge
+ double arcScore = loserHypo->GetScore() - loserPrevHypo->GetScore();
+ Edge losingEdge(loserPrevHypo, currHyp, arcScore*scale, loserHypo->GetCurrTargetPhrase());
+ vector <Edge>& edges = incomingEdges[currHyp];
+ edges.push_back(losingEdge);
+ ++numEdgesCreated;
+ }
+ }
+ }
+
+ //Now if a successor node has already been visited, add an edge connecting the two
+ map < const Hypothesis*, set < const Hypothesis* > >::const_iterator outgoingIt = outgoingHyps.find(currHyp);
+
+ if (outgoingIt != outgoingHyps.end()) {//currHyp does have successors
+ const set<const Hypothesis*> & outHyps = outgoingIt->second; //the successors
+ for (set<const Hypothesis*>::const_iterator outHypIts = outHyps.begin(); outHypIts != outHyps.end(); ++outHypIts) {
+ const Hypothesis* succHyp = *outHypIts;
+
+ if (survivingHyps.find(succHyp) == survivingHyps.end()) //Have we encountered the successor yet?
+ continue; //No, move on to next
+
+ //Curr Hyp can be : a) the best predecessor of succ b) or an arc attached to succ
+ if (succHyp->GetPrevHypo() == currHyp) { //best predecessor
+ vector <Edge>& succEdges = incomingEdges[succHyp];
+ Edge succWinningEdge(currHyp, succHyp, scale*(succHyp->GetScore() - currHyp->GetScore()), succHyp->GetCurrTargetPhrase());
+ succEdges.push_back(succWinningEdge);
+ survivingHyps.insert(succHyp);
+ ++numEdgesCreated;
+ }
+
+ //now, let's find an arc
+ const ArcList *arcList = succHyp->GetArcList();
+ if (arcList != NULL) {
+ ArcList::const_iterator iterArcList;
+ //QUESTION: What happens if there's more than one loserPrevHypo?
+ for (iterArcList = arcList->begin() ; iterArcList != arcList->end() ; ++iterArcList) {
+ const Hypothesis *loserHypo = *iterArcList;
+ const Hypothesis* loserPrevHypo = loserHypo->GetPrevHypo();
+ if (loserPrevHypo == currHyp) { //found it
+ vector <Edge>& succEdges = incomingEdges[succHyp];
+ double arcScore = loserHypo->GetScore() - currHyp->GetScore();
+ Edge losingEdge(currHyp, succHyp,scale* arcScore, loserHypo->GetCurrTargetPhrase());
+ succEdges.push_back(losingEdge);
+ ++numEdgesCreated;
+ }
+ }
+ }
+ }
+ }
+ }
+
+ connectedHyp.clear();
+ for (set <const Hypothesis*>::iterator it = survivingHyps.begin(); it != survivingHyps.end(); ++it) {
+ connectedHyp.push_back(*it);
+ }
+
+ VERBOSE(2, "Done! Num edges created : "<< numEdgesCreated << ", numEdges wanted " << numEdgesTotal << endl)
+
+ IFVERBOSE(3) {
+ cerr << "Surviving hyps: " ;
+ for (set <const Hypothesis*>::iterator it = survivingHyps.begin(); it != survivingHyps.end(); ++it) {
+ cerr << (*it)->GetId() << " ";
+ }
+ cerr << endl;
+ }
+
+
+}
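+
+// Net effect: hypotheses are visited in decreasing order of estimated score and kept
+// until roughly edgeDensity * (MAP target length) edges exist; edges (best
+// predecessors and recombination arcs) are only created between surviving hypotheses.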
+
+void calcNgramExpectations(Lattice & connectedHyp, map<const Hypothesis*, vector<Edge> >& incomingEdges,
+ map<Phrase, float>& finalNgramScores, bool posteriors)
+{
+
+ sort(connectedHyp.begin(),connectedHyp.end(),ascendingCoverageCmp); //sort by increasing source word cov
+
+ /*cerr << "Lattice:" << endl;
+ for (Lattice::const_iterator i = connectedHyp.begin(); i != connectedHyp.end(); ++i) {
+ const Hypothesis* h = *i;
+ cerr << *h << endl;
+ const vector<Edge>& edges = incomingEdges[h];
+ for (size_t e = 0; e < edges.size(); ++e) {
+ cerr << edges[e];
+ }
+ }*/
+
+ map<const Hypothesis*, float> forwardScore;
+ forwardScore[connectedHyp[0]] = 0.0f; //forward score of hyp 0 is 1 (or 0 in logprob space)
+ set< const Hypothesis *> finalHyps; //store completed hyps
+
+ NgramScores ngramScores;//ngram scores for each hyp
+
+ for (size_t i = 1; i < connectedHyp.size(); ++i) {
+ const Hypothesis* currHyp = connectedHyp[i];
+ if (currHyp->GetWordsBitmap().IsComplete()) {
+ finalHyps.insert(currHyp);
+ }
+
+ VERBOSE(3, "Processing hyp: " << currHyp->GetId() << ", num words cov= " << currHyp->GetWordsBitmap().GetNumWordsCovered() << endl)
+
+ vector <Edge> & edges = incomingEdges[currHyp];
+ for (size_t e = 0; e < edges.size(); ++e) {
+ const Edge& edge = edges[e];
+ if (forwardScore.find(currHyp) == forwardScore.end()) {
+ forwardScore[currHyp] = forwardScore[edge.GetTailNode()] + edge.GetScore();
+ VERBOSE(3, "Fwd score["<<currHyp->GetId()<<"] = fwdScore["<<edge.GetTailNode()->GetId() << "] + edge Score: " << edge.GetScore() << endl)
+ } else {
+ forwardScore[currHyp] = log_sum(forwardScore[currHyp], forwardScore[edge.GetTailNode()] + edge.GetScore());
+ VERBOSE(3, "Fwd score["<<currHyp->GetId()<<"] += fwdScore["<<edge.GetTailNode()->GetId() << "] + edge Score: " << edge.GetScore() << endl)
+ }
+ }
+
+ //Process ngrams now
+ for (size_t j =0 ; j < edges.size(); ++j) {
+ Edge& edge = edges[j];
+ const NgramHistory & incomingPhrases = edge.GetNgrams(incomingEdges);
+
+ //let's first score ngrams introduced by this edge
+ for (NgramHistory::const_iterator it = incomingPhrases.begin(); it != incomingPhrases.end(); ++it) {
+ const Phrase& ngram = it->first;
+ const PathCounts& pathCounts = it->second;
+ VERBOSE(4, "Calculating score for: " << it->first << endl)
+
+ for (PathCounts::const_iterator pathCountIt = pathCounts.begin(); pathCountIt != pathCounts.end(); ++pathCountIt) {
+ //Score of an n-gram is forward score of head node of leftmost edge + all edge scores
+ const Path& path = pathCountIt->first;
+ //cerr << "path count for " << ngram << " is " << pathCountIt->second << endl;
+ float score = forwardScore[path[0]->GetTailNode()];
+ for (size_t i = 0; i < path.size(); ++i) {
+ score += path[i]->GetScore();
+ }
+ //if we're doing expectations, then the number of times the ngram
+ //appears on the path is relevant.
+ size_t count = posteriors ? 1 : pathCountIt->second;
+ for (size_t k = 0; k < count; ++k) {
+ ngramScores.addScore(currHyp,ngram,score);
+ }
+ }
+ }
+
+ //Now score ngrams that are just being propagated from the history
+ for (NgramScores::NodeScoreIterator it = ngramScores.nodeBegin(edge.GetTailNode());
+ it != ngramScores.nodeEnd(edge.GetTailNode()); ++it) {
+ const Phrase & currNgram = *(it->first);
+ float currNgramScore = it->second;
+ VERBOSE(4, "Calculating score for: " << currNgram << endl)
+
+ // For posteriors, don't double count ngrams
+ if (!posteriors || incomingPhrases.find(currNgram) == incomingPhrases.end()) {
+ float score = edge.GetScore() + currNgramScore;
+ ngramScores.addScore(currHyp,currNgram,score);
+ }
+ }
+
+ }
+ }
+
+  float Z = 9999999; //sentinel; replaced below by the log of the total lattice score
+
+ //Done - Print out ngram posteriors for final hyps
+ for (set< const Hypothesis *>::iterator finalHyp = finalHyps.begin(); finalHyp != finalHyps.end(); ++finalHyp) {
+ const Hypothesis* hyp = *finalHyp;
+
+ for (NgramScores::NodeScoreIterator it = ngramScores.nodeBegin(hyp); it != ngramScores.nodeEnd(hyp); ++it) {
+ const Phrase& ngram = *(it->first);
+ if (finalNgramScores.find(ngram) == finalNgramScores.end()) {
+ finalNgramScores[ngram] = it->second;
+ } else {
+ finalNgramScores[ngram] = log_sum(it->second, finalNgramScores[ngram]);
+ }
+ }
+
+ if (Z == 9999999) {
+ Z = forwardScore[hyp];
+ } else {
+ Z = log_sum(Z, forwardScore[hyp]);
+ }
+ }
+
+ //Z *= scale; //scale the score
+
+ for (map<Phrase, float>::iterator finalScoresIt = finalNgramScores.begin(); finalScoresIt != finalNgramScores.end(); ++finalScoresIt) {
+ finalScoresIt->second = finalScoresIt->second - Z;
+ IFVERBOSE(2) {
+ VERBOSE(2,finalScoresIt->first << " [" << finalScoresIt->second << "]" << endl);
+ }
+ }
+
+}
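+
+// On exit, finalNgramScores holds log-normalised values: each accumulated score has Z,
+// the total mass over all complete hypotheses, subtracted in log space, so exp(score)
+// is the n-gram's expected count (its posterior when posteriors==true).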
+
+const NgramHistory& Edge::GetNgrams(map<const Hypothesis*, vector<Edge> > & incomingEdges)
+{
+
+ if (m_ngrams.size() > 0)
+ return m_ngrams;
+
+ const Phrase& currPhrase = GetWords();
+ //Extract the n-grams local to this edge
+ for (size_t start = 0; start < currPhrase.GetSize(); ++start) {
+ for (size_t end = start; end < start + bleu_order; ++end) {
+ if (end < currPhrase.GetSize()) {
+ Phrase edgeNgram(end-start+1);
+ for (size_t index = start; index <= end; ++index) {
+ edgeNgram.AddWord(currPhrase.GetWord(index));
+ }
+ //cout << "Inserting Phrase : " << edgeNgram << endl;
+ vector<const Edge*> edgeHistory;
+ edgeHistory.push_back(this);
+ storeNgramHistory(edgeNgram, edgeHistory);
+ } else {
+ break;
+ }
+ }
+ }
+
+ map<const Hypothesis*, vector<Edge> >::iterator it = incomingEdges.find(m_tailNode);
+ if (it != incomingEdges.end()) { //node has incoming edges
+ vector<Edge> & inEdges = it->second;
+
+ for (vector<Edge>::iterator edge = inEdges.begin(); edge != inEdges.end(); ++edge) {//add the ngrams straddling prev and curr edge
+ const NgramHistory & edgeIncomingNgrams = edge->GetNgrams(incomingEdges);
+ for (NgramHistory::const_iterator edgeInNgramHist = edgeIncomingNgrams.begin(); edgeInNgramHist != edgeIncomingNgrams.end(); ++edgeInNgramHist) {
+ const Phrase& edgeIncomingNgram = edgeInNgramHist->first;
+ const PathCounts & edgeIncomingNgramPaths = edgeInNgramHist->second;
+ size_t back = min(edgeIncomingNgram.GetSize(), edge->GetWordsSize());
+ const Phrase& edgeWords = edge->GetWords();
+ IFVERBOSE(3) {
+ cerr << "Edge: "<< *edge <<endl;
+ cerr << "edgeWords: " << edgeWords << endl;
+ cerr << "edgeInNgram: " << edgeIncomingNgram << endl;
+ }
+
+ Phrase edgeSuffix(ARRAY_SIZE_INCR);
+ Phrase ngramSuffix(ARRAY_SIZE_INCR);
+ GetPhraseSuffix(edgeWords,back,edgeSuffix);
+ GetPhraseSuffix(edgeIncomingNgram,back,ngramSuffix);
+
+ if (ngramSuffix == edgeSuffix) { //we've got the suffix of previous edge
+ size_t edgeInNgramSize = edgeIncomingNgram.GetSize();
+
+ for (size_t i = 0; i < GetWordsSize() && i + edgeInNgramSize < bleu_order ; ++i) {
+ Phrase newNgram(edgeIncomingNgram);
+ for (size_t j = 0; j <= i ; ++j) {
+ newNgram.AddWord(GetWords().GetWord(j));
+ }
+ VERBOSE(3, "Inserting New Phrase : " << newNgram << endl)
+
+ for (PathCounts::const_iterator pathIt = edgeIncomingNgramPaths.begin(); pathIt != edgeIncomingNgramPaths.end(); ++pathIt) {
+ Path newNgramPath = pathIt->first;
+ newNgramPath.push_back(this);
+ storeNgramHistory(newNgram, newNgramPath, pathIt->second);
+ }
+ }
+ }
+ }
+ }
+ }
+ return m_ngrams;
+}
+
+//Add the last lastN words of origPhrase to targetPhrase
+void Edge::GetPhraseSuffix(const Phrase& origPhrase, size_t lastN, Phrase& targetPhrase) const
+{
+ size_t origSize = origPhrase.GetSize();
+ size_t startIndex = origSize - lastN;
+ for (size_t index = startIndex; index < origPhrase.GetSize(); ++index) {
+ targetPhrase.AddWord(origPhrase.GetWord(index));
+ }
+}
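+
+// Worked example: origPhrase = "the green house" with lastN = 2 appends "green house"
+// to targetPhrase. Callers must guarantee lastN <= origPhrase.GetSize(); otherwise
+// startIndex underflows, since the arithmetic is on size_t.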
+
+bool Edge::operator< (const Edge& compare ) const
+{
+ if (m_headNode->GetId() < compare.m_headNode->GetId())
+ return true;
+ if (compare.m_headNode->GetId() < m_headNode->GetId())
+ return false;
+ if (m_tailNode->GetId() < compare.m_tailNode->GetId())
+ return true;
+ if (compare.m_tailNode->GetId() < m_tailNode->GetId())
+ return false;
+ return GetScore() < compare.GetScore();
+}
+
+ostream& operator<< (ostream& out, const Edge& edge)
+{
+ out << "Head: " << edge.m_headNode->GetId() << ", Tail: " << edge.m_tailNode->GetId() << ", Score: " << edge.m_score << ", Phrase: " << edge.m_targetPhrase << endl;
+ return out;
+}
+
+bool ascendingCoverageCmp(const Hypothesis* a, const Hypothesis* b)
+{
+ return a->GetWordsBitmap().GetNumWordsCovered() < b->GetWordsBitmap().GetNumWordsCovered();
+}
+
+void getLatticeMBRNBest(Manager& manager, TrellisPathList& nBestList,
+ vector<LatticeMBRSolution>& solutions, size_t n)
+{
+ const StaticData& staticData = StaticData::Instance();
+ std::map < int, bool > connected;
+ std::vector< const Hypothesis *> connectedList;
+ map<Phrase, float> ngramPosteriors;
+ std::map < const Hypothesis*, set <const Hypothesis*> > outgoingHyps;
+ map<const Hypothesis*, vector<Edge> > incomingEdges;
+ vector< float> estimatedScores;
+ manager.GetForwardBackwardSearchGraph(&connected, &connectedList, &outgoingHyps, &estimatedScores);
+ pruneLatticeFB(connectedList, outgoingHyps, incomingEdges, estimatedScores, manager.GetBestHypothesis(), staticData.GetLatticeMBRPruningFactor(),staticData.GetMBRScale());
+ calcNgramExpectations(connectedList, incomingEdges, ngramPosteriors,true);
+
+ vector<float> mbrThetas = staticData.GetLatticeMBRThetas();
+ float p = staticData.GetLatticeMBRPrecision();
+ float r = staticData.GetLatticeMBRPRatio();
+ float mapWeight = staticData.GetLatticeMBRMapWeight();
+ if (mbrThetas.size() == 0) { //thetas not specified on the command line, use p and r instead
+ mbrThetas.push_back(-1); //Theta 0
+ mbrThetas.push_back(1/(bleu_order*p));
+ for (size_t i = 2; i <= bleu_order; ++i) {
+ mbrThetas.push_back(mbrThetas[i-1] / r);
+ }
+ }
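+  // e.g. with the defaults p = 0.5, r = 0.5 this gives thetas = {-1, 0.5, 1, 2, 4},
+  // i.e. theta_n = 1/(bleu_order * p * r^(n-1)) -- the p/r parameterisation of
+  // Tromble et al. (2008).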
+ IFVERBOSE(2) {
+ VERBOSE(2,"Thetas: ");
+ for (size_t i = 0; i < mbrThetas.size(); ++i) {
+ VERBOSE(2,mbrThetas[i] << " ");
+ }
+ VERBOSE(2,endl);
+ }
+ TrellisPathList::const_iterator iter;
+ size_t ctr = 0;
+ LatticeMBRSolutionComparator comparator;
+ for (iter = nBestList.begin() ; iter != nBestList.end() ; ++iter, ++ctr) {
+ const TrellisPath &path = **iter;
+ solutions.push_back(LatticeMBRSolution(path,iter==nBestList.begin()));
+ solutions.back().CalcScore(ngramPosteriors,mbrThetas,mapWeight);
+ sort(solutions.begin(), solutions.end(), comparator);
+ while (solutions.size() > n) {
+ solutions.pop_back();
+ }
+ }
+ VERBOSE(2,"LMBR Score: " << solutions[0].GetScore() << endl);
+}
+
+vector<Word> doLatticeMBR(Manager& manager, TrellisPathList& nBestList)
+{
+
+ vector<LatticeMBRSolution> solutions;
+ getLatticeMBRNBest(manager, nBestList, solutions,1);
+ return solutions.at(0).GetWords();
+}
+
+const TrellisPath doConsensusDecoding(Manager& manager, TrellisPathList& nBestList)
+{
+ static const int BLEU_ORDER = 4;
+ static const float SMOOTH = 1;
+
+ //calculate the ngram expectations
+ const StaticData& staticData = StaticData::Instance();
+ std::map < int, bool > connected;
+ std::vector< const Hypothesis *> connectedList;
+ map<Phrase, float> ngramExpectations;
+ std::map < const Hypothesis*, set <const Hypothesis*> > outgoingHyps;
+ map<const Hypothesis*, vector<Edge> > incomingEdges;
+ vector< float> estimatedScores;
+ manager.GetForwardBackwardSearchGraph(&connected, &connectedList, &outgoingHyps, &estimatedScores);
+ pruneLatticeFB(connectedList, outgoingHyps, incomingEdges, estimatedScores, manager.GetBestHypothesis(), staticData.GetLatticeMBRPruningFactor(),staticData.GetMBRScale());
+ calcNgramExpectations(connectedList, incomingEdges, ngramExpectations,false);
+
+ //expected length is sum of expected unigram counts
+ //cerr << "Thread " << pthread_self() << " Ngram expectations size: " << ngramExpectations.size() << endl;
+ float ref_length = 0.0f;
+ for (map<Phrase,float>::const_iterator ref_iter = ngramExpectations.begin();
+ ref_iter != ngramExpectations.end(); ++ref_iter) {
+ //cerr << "Ngram: " << ref_iter->first << " score: " <<
+ // ref_iter->second << endl;
+ if (ref_iter->first.GetSize() == 1) {
+ ref_length += exp(ref_iter->second);
+ // cerr << "Expected for " << ref_iter->first << " is " << exp(ref_iter->second) << endl;
+ }
+ }
+
+ VERBOSE(2,"REF Length: " << ref_length << endl);
+
+ //use the ngram expectations to rescore the nbest list.
+ TrellisPathList::const_iterator iter;
+ TrellisPathList::const_iterator best = nBestList.end();
+ float bestScore = -100000;
+ //cerr << "nbest list size: " << nBestList.GetSize() << endl;
+ for (iter = nBestList.begin() ; iter != nBestList.end() ; ++iter) {
+ const TrellisPath &path = **iter;
+ vector<Word> words;
+ map<Phrase,int> ngrams;
+ GetOutputWords(path,words);
+ /*for (size_t i = 0; i < words.size(); ++i) {
+ cerr << words[i].GetFactor(0)->GetString() << " ";
+ }
+ cerr << endl;
+ */
+ extract_ngrams(words,ngrams);
+
+ vector<float> comps(2*BLEU_ORDER+1);
+ float logbleu = 0.0;
+ float brevity = 0.0;
+ int hyp_length = words.size();
+ for (int i = 0; i < BLEU_ORDER; ++i) {
+ comps[2*i] = 0.0;
+ comps[2*i+1] = max(hyp_length-i,0);
+ }
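+    // comps layout: comps[2n] = clipped expected matches of order n+1,
+    // comps[2n+1] = number of candidate (n+1)-grams, comps.back() = expected ref length.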
+
+ for (map<Phrase,int>::const_iterator hyp_iter = ngrams.begin();
+ hyp_iter != ngrams.end(); ++hyp_iter) {
+ map<Phrase,float>::const_iterator ref_iter = ngramExpectations.find(hyp_iter->first);
+ if (ref_iter != ngramExpectations.end()) {
+ comps[2*(hyp_iter->first.GetSize()-1)] += min(exp(ref_iter->second), (float)(hyp_iter->second));
+ }
+
+ }
+ comps[comps.size()-1] = ref_length;
+ /*for (size_t i = 0; i < comps.size(); ++i) {
+ cerr << comps[i] << " ";
+ }
+ cerr << endl;
+ */
+
+ float score = 0.0f;
+ if (comps[0] != 0) {
+ for (int i=0; i<BLEU_ORDER; i++) {
+ if ( i > 0 ) {
+ logbleu += log((float)comps[2*i]+SMOOTH)-log((float)comps[2*i+1]+SMOOTH);
+ } else {
+ logbleu += log((float)comps[2*i])-log((float)comps[2*i+1]);
+ }
+ }
+ logbleu /= BLEU_ORDER;
+ brevity = 1.0-(float)comps[comps.size()-1]/comps[1]; // comps[comps_n-1] is the ref length, comps[1] is the test length
+ if (brevity < 0.0) {
+ logbleu += brevity;
+ }
+ score = exp(logbleu);
+ }
+
+ //cerr << "score: " << score << " bestScore: " << bestScore << endl;
+ if (score > bestScore) {
+ bestScore = score;
+ best = iter;
+ VERBOSE(2,"NEW BEST: " << score << endl);
+ //for (size_t i = 0; i < comps.size(); ++i) {
+ // cerr << comps[i] << " ";
+ //}
+ //cerr << endl;
+ }
+ }
+
+ assert (best != nBestList.end());
+ return **best;
+ //vector<Word> bestWords;
+ //GetOutputWords(**best,bestWords);
+ //return bestWords;
+}
+
+}
+
+
diff --git a/contrib/relent-filter/src/LatticeMBR.h b/contrib/relent-filter/src/LatticeMBR.h
new file mode 100755
index 000000000..14a2e22da
--- /dev/null
+++ b/contrib/relent-filter/src/LatticeMBR.h
@@ -0,0 +1,153 @@
+/*
+ * LatticeMBR.h
+ * moses-cmd
+ *
+ * Created by Abhishek Arun on 26/01/2010.
+ * Copyright 2010 __MyCompanyName__. All rights reserved.
+ *
+ */
+
+#ifndef moses_cmd_LatticeMBR_h
+#define moses_cmd_LatticeMBR_h
+
+#include <map>
+#include <vector>
+#include <set>
+#include "Hypothesis.h"
+#include "Manager.h"
+#include "TrellisPathList.h"
+
+
+
+namespace MosesCmd
+{
+
+class Edge;
+
+typedef std::vector< const Moses::Hypothesis *> Lattice;
+typedef std::vector<const Edge*> Path;
+typedef std::map<Path, size_t> PathCounts;
+typedef std::map<Moses::Phrase, PathCounts > NgramHistory;
+
+class Edge
+{
+ const Moses::Hypothesis* m_tailNode;
+ const Moses::Hypothesis* m_headNode;
+ float m_score;
+ Moses::TargetPhrase m_targetPhrase;
+ NgramHistory m_ngrams;
+
+public:
+ Edge(const Moses::Hypothesis* from, const Moses::Hypothesis* to, float score, const Moses::TargetPhrase& targetPhrase) : m_tailNode(from), m_headNode(to), m_score(score), m_targetPhrase(targetPhrase) {
+ //cout << "Creating new edge from Node " << from->GetId() << ", to Node : " << to->GetId() << ", score: " << score << " phrase: " << targetPhrase << endl;
+ }
+
+ const Moses::Hypothesis* GetHeadNode() const {
+ return m_headNode;
+ }
+
+ const Moses::Hypothesis* GetTailNode() const {
+ return m_tailNode;
+ }
+
+ float GetScore() const {
+ return m_score;
+ }
+
+ size_t GetWordsSize() const {
+ return m_targetPhrase.GetSize();
+ }
+
+ const Moses::Phrase& GetWords() const {
+ return m_targetPhrase;
+ }
+
+ friend std::ostream& operator<< (std::ostream& out, const Edge& edge);
+
+ const NgramHistory& GetNgrams( std::map<const Moses::Hypothesis*, std::vector<Edge> > & incomingEdges) ;
+
+ bool operator < (const Edge & compare) const;
+
+ void GetPhraseSuffix(const Moses::Phrase& origPhrase, size_t lastN, Moses::Phrase& targetPhrase) const;
+
+ void storeNgramHistory(const Moses::Phrase& phrase, Path & path, size_t count = 1) {
+ m_ngrams[phrase][path]+= count;
+ }
+
+};
+
+/**
+* Data structure to hold the ngram scores as we traverse the lattice. Maps (hypo,ngram) to score
+*/
+class NgramScores
+{
+public:
+ NgramScores() {}
+
+ /** logsum this score to the existing score */
+ void addScore(const Moses::Hypothesis* node, const Moses::Phrase& ngram, float score);
+
+ /** Iterate through ngrams for selected node */
+ typedef std::map<const Moses::Phrase*, float>::const_iterator NodeScoreIterator;
+ NodeScoreIterator nodeBegin(const Moses::Hypothesis* node);
+ NodeScoreIterator nodeEnd(const Moses::Hypothesis* node);
+
+private:
+ std::set<Moses::Phrase> m_ngrams;
+ std::map<const Moses::Hypothesis*, std::map<const Moses::Phrase*, float> > m_scores;
+};
+
+
+/** Holds a lattice mbr solution, and its scores */
+class LatticeMBRSolution
+{
+public:
+ /** Read the words from the path */
+ LatticeMBRSolution(const Moses::TrellisPath& path, bool isMap);
+ const std::vector<float>& GetNgramScores() const {
+ return m_ngramScores;
+ }
+ const std::vector<Moses::Word>& GetWords() const {
+ return m_words;
+ }
+ float GetMapScore() const {
+ return m_mapScore;
+ }
+ float GetScore() const {
+ return m_score;
+ }
+
+ /** Initialise ngram scores */
+ void CalcScore(std::map<Moses::Phrase, float>& finalNgramScores, const std::vector<float>& thetas, float mapWeight);
+
+private:
+ std::vector<Moses::Word> m_words;
+ float m_mapScore;
+ std::vector<float> m_ngramScores;
+ float m_score;
+};
+
+struct LatticeMBRSolutionComparator {
+ bool operator()(const LatticeMBRSolution& a, const LatticeMBRSolution& b) {
+ return a.GetScore() > b.GetScore();
+ }
+};
+
+void pruneLatticeFB(Lattice & connectedHyp, std::map < const Moses::Hypothesis*, std::set <const Moses::Hypothesis* > > & outgoingHyps, std::map<const Moses::Hypothesis*, std::vector<Edge> >& incomingEdges,
+ const std::vector< float> & estimatedScores, const Moses::Hypothesis*, size_t edgeDensity,float scale);
+
+//Use the ngram scores to rerank the nbest list, return at most n solutions
+void getLatticeMBRNBest(Moses::Manager& manager, Moses::TrellisPathList& nBestList, std::vector<LatticeMBRSolution>& solutions, size_t n);
+//calculate expected ngram counts, clipping at 1 (i.e. calculating posteriors) if posteriors==true.
+void calcNgramExpectations(Lattice & connectedHyp, std::map<const Moses::Hypothesis*, std::vector<Edge> >& incomingEdges,
+                           std::map<Moses::Phrase, float>& finalNgramScores, bool posteriors);
+void GetOutputWords(const Moses::TrellisPath &path, std::vector <Moses::Word> &translation);
+void extract_ngrams(const std::vector<Moses::Word >& sentence, std::map < Moses::Phrase, int > & allngrams);
+bool ascendingCoverageCmp(const Moses::Hypothesis* a, const Moses::Hypothesis* b);
+std::vector<Moses::Word> doLatticeMBR(Moses::Manager& manager, Moses::TrellisPathList& nBestList);
+const Moses::TrellisPath doConsensusDecoding(Moses::Manager& manager, Moses::TrellisPathList& nBestList);
+//std::vector<Moses::Word> doConsensusDecoding(Moses::Manager& manager, Moses::TrellisPathList& nBestList);
+
+}
+
+#endif
diff --git a/contrib/relent-filter/src/LatticeMBRGrid.cpp b/contrib/relent-filter/src/LatticeMBRGrid.cpp
new file mode 100755
index 000000000..71c387839
--- /dev/null
+++ b/contrib/relent-filter/src/LatticeMBRGrid.cpp
@@ -0,0 +1,213 @@
+// $Id: LatticeMBRGrid.cpp 3045 2010-04-05 13:07:29Z hieuhoang1972 $
+
+/***********************************************************************
+Moses - factored phrase-based language decoder
+Copyright (c) 2010 University of Edinburgh
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification,
+are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice,
+ this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright notice,
+ this list of conditions and the following disclaimer in the documentation
+ and/or other materials provided with the distribution.
+ * Neither the name of the University of Edinburgh nor the names of its contributors
+ may be used to endorse or promote products derived from this software
+ without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS
+BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
+IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.
+***********************************************************************/
+/**
+* Lattice MBR grid search. Enables a grid search through the four parameters (p, r, scale and prune) used in lattice MBR.
+  See 'Lattice Minimum Bayes-Risk Decoding for Statistical Machine Translation' by Tromble, Kumar, Och and Macherey,
+  EMNLP 2008, for details of the parameters.
+
+  The grid search is controlled by specifying comma-separated lists for the lmbr parameters (-lmbr-p, -lmbr-r,
+  -lmbr-pruning-factor and -mbr-scale). All other parameters are passed through to moses. If any of the lattice mbr
+  parameters are missing, they are set to their default values. Output is of the form:
+    sentence-id ||| p r prune scale ||| translation-hypothesis
+**/
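+
+// A hypothetical invocation (binary name and config path are illustrative):
+//   ./lmbr-grid -lmbr-p 0.4,0.6 -mbr-scale 0.5,1 -f moses.ini < input.txt
+// searches the four (p, scale) grid points per sentence, with r and prune at defaults.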
+
+#include <cstdlib>
+#include <iostream>
+#include <map>
+#include <stdexcept>
+#include <set>
+
+#include "IOWrapper.h"
+#include "LatticeMBR.h"
+#include "Manager.h"
+#include "StaticData.h"
+
+
+using namespace std;
+using namespace Moses;
+using namespace MosesCmd;
+
+//keys
+enum gridkey {lmbr_p,lmbr_r,lmbr_prune,lmbr_scale};
+
+namespace MosesCmd
+{
+
+class Grid
+{
+public:
+ /** Add a parameter with key, command line argument, and default value */
+ void addParam(gridkey key, const string& arg, float defaultValue) {
+ m_args[arg] = key;
+ CHECK(m_grid.find(key) == m_grid.end());
+ m_grid[key].push_back(defaultValue);
+ }
+
+ /** Parse the arguments, removing those that define the grid and returning a copy of the rest */
+ void parseArgs(int& argc, char**& argv) {
+ char** newargv = new char*[argc+1]; //Space to add mbr parameter
+ int newargc = 0;
+ for (int i = 0; i < argc; ++i) {
+ bool consumed = false;
+ for (map<string,gridkey>::const_iterator argi = m_args.begin(); argi != m_args.end(); ++argi) {
+ if (!strcmp(argv[i], argi->first.c_str())) {
+ ++i;
+ if (i >= argc) {
+ cerr << "Error: missing parameter for " << argi->first << endl;
+ throw runtime_error("Missing parameter");
+ } else {
+ string value = argv[i];
+ gridkey key = argi->second;
+ if (m_grid[key].size() != 1) {
+ throw runtime_error("Duplicate grid argument");
+ }
+ m_grid[key].clear();
+ char delim = ',';
+ string::size_type lastpos = value.find_first_not_of(delim);
+ string::size_type pos = value.find_first_of(delim,lastpos);
+ while (string::npos != pos || string::npos != lastpos) {
+ float param = atof(value.substr(lastpos, pos-lastpos).c_str());
+              if (!param) { // note: atof returns 0 on failure, so an explicit grid value of 0 is also rejected
+ cerr << "Error: Illegal grid parameter for " << argi->first << endl;
+ throw runtime_error("Illegal grid parameter");
+ }
+ m_grid[key].push_back(param);
+ lastpos = value.find_first_not_of(delim,pos);
+ pos = value.find_first_of(delim,lastpos);
+ }
+ consumed = true;
+ }
+ if (consumed) break;
+ }
+ }
+ if (!consumed) {
+ newargv[newargc] = new char[strlen(argv[i]) + 1];
+ strcpy(newargv[newargc],argv[i]);
+ ++newargc;
+ }
+ }
+ argc = newargc;
+ argv = newargv;
+ }
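+
+  // Worked example: argv = {prog, "-lmbr-p", "0.1,0.2", "-f", "moses.ini"} leaves
+  // m_grid[lmbr_p] = {0.1, 0.2} and rewrites argv to {prog, "-f", "moses.ini"}, so
+  // the remaining options can be handed to Moses' own Parameter loader.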
+
+ /** Get the grid for a particular key.*/
+ const vector<float>& getGrid(gridkey key) const {
+ map<gridkey,vector<float> >::const_iterator iter = m_grid.find(key);
+ assert (iter != m_grid.end());
+ return iter->second;
+
+ }
+
+private:
+ map<gridkey,vector<float> > m_grid;
+ map<string,gridkey> m_args;
+};
+
+} // namespace
+
+int main(int argc, char* argv[])
+{
+ cerr << "Lattice MBR Grid search" << endl;
+
+ Grid grid;
+ grid.addParam(lmbr_p, "-lmbr-p", 0.5);
+ grid.addParam(lmbr_r, "-lmbr-r", 0.5);
+ grid.addParam(lmbr_prune, "-lmbr-pruning-factor",30.0);
+ grid.addParam(lmbr_scale, "-mbr-scale",1.0);
+
+ grid.parseArgs(argc,argv);
+
+ Parameter* params = new Parameter();
+ if (!params->LoadParam(argc,argv)) {
+ params->Explain();
+ exit(1);
+ }
+ if (!StaticData::LoadDataStatic(params, argv[0])) {
+ exit(1);
+ }
+
+ StaticData& staticData = const_cast<StaticData&>(StaticData::Instance());
+ staticData.SetUseLatticeMBR(true);
+ IOWrapper* ioWrapper = GetIOWrapper(staticData);
+
+ if (!ioWrapper) {
+ throw runtime_error("Failed to initialise IOWrapper");
+ }
+ size_t nBestSize = staticData.GetMBRSize();
+
+  if (nBestSize == 0) {
+    throw runtime_error("Non-positive size specified for n-best list");
+  }
+
+ size_t lineCount = 0;
+ InputType* source = NULL;
+
+ const vector<float>& pgrid = grid.getGrid(lmbr_p);
+ const vector<float>& rgrid = grid.getGrid(lmbr_r);
+ const vector<float>& prune_grid = grid.getGrid(lmbr_prune);
+ const vector<float>& scale_grid = grid.getGrid(lmbr_scale);
+
+ while(ReadInput(*ioWrapper,staticData.GetInputType(),source)) {
+ ++lineCount;
+ Sentence sentence;
+ const TranslationSystem& system = staticData.GetTranslationSystem(TranslationSystem::DEFAULT);
+ Manager manager(*source,staticData.GetSearchAlgorithm(), &system);
+ manager.ProcessSentence();
+ TrellisPathList nBestList;
+ manager.CalcNBest(nBestSize, nBestList,true);
+ //grid search
+ for (vector<float>::const_iterator pi = pgrid.begin(); pi != pgrid.end(); ++pi) {
+ float p = *pi;
+ staticData.SetLatticeMBRPrecision(p);
+ for (vector<float>::const_iterator ri = rgrid.begin(); ri != rgrid.end(); ++ri) {
+ float r = *ri;
+ staticData.SetLatticeMBRPRatio(r);
+ for (vector<float>::const_iterator prune_i = prune_grid.begin(); prune_i != prune_grid.end(); ++prune_i) {
+ size_t prune = (size_t)(*prune_i);
+ staticData.SetLatticeMBRPruningFactor(prune);
+ for (vector<float>::const_iterator scale_i = scale_grid.begin(); scale_i != scale_grid.end(); ++scale_i) {
+ float scale = *scale_i;
+ staticData.SetMBRScale(scale);
+ cout << lineCount << " ||| " << p << " " << r << " " << prune << " " << scale << " ||| ";
+ vector<Word> mbrBestHypo = doLatticeMBR(manager,nBestList);
+ OutputBestHypo(mbrBestHypo, lineCount, staticData.GetReportSegmentation(),
+ staticData.GetReportAllFactors(),cout);
+ }
+        }
+      }
+    }
+  }
+
+}
diff --git a/contrib/relent-filter/src/Main.cpp b/contrib/relent-filter/src/Main.cpp
new file mode 100755
index 000000000..07525cfc0
--- /dev/null
+++ b/contrib/relent-filter/src/Main.cpp
@@ -0,0 +1,282 @@
+/***********************************************************************
+Relative Entropy-based Phrase table Pruning
+Copyright (C) 2012 Wang Ling
+
+This library is free software; you can redistribute it and/or
+modify it under the terms of the GNU Lesser General Public
+License as published by the Free Software Foundation; either
+version 2.1 of the License, or (at your option) any later version.
+
+This library is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+Lesser General Public License for more details.
+
+You should have received a copy of the GNU Lesser General Public
+License along with this library; if not, write to the Free Software
+Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+***********************************************************************/
+
+/**
+ * Moses main, for single-threaded and multi-threaded.
+ **/
+
+#include <exception>
+#include <fstream>
+#include <sstream>
+#include <vector>
+
+#ifdef WIN32
+// Include Visual Leak Detector
+//#include <vld.h>
+#endif
+
+#include "Hypothesis.h"
+#include "Manager.h"
+#include "IOWrapper.h"
+#include "StaticData.h"
+#include "Util.h"
+#include "ThreadPool.h"
+#include "TranslationAnalysis.h"
+#include "OutputCollector.h"
+#include "RelativeEntropyCalc.h"
+#include "LexicalReordering.h"
+#include "LexicalReorderingState.h"
+
+#ifdef HAVE_PROTOBUF
+#include "hypergraph.pb.h"
+#endif
+
+using namespace std;
+using namespace Moses;
+using namespace MosesCmd;
+
+namespace MosesCmd
+{
+// output floats with three decimal places
+static const size_t PRECISION = 3;
+
+/** Set fixed-point notation with the given number of decimal places */
+void fix(std::ostream& stream, size_t size)
+{
+ stream.setf(std::ios::fixed);
+ stream.precision(size);
+}
+
+/** Translates a sentence.
+ * - calls the search (Manager)
+ * - applies the decision rule
+ * - outputs best translation and additional reporting
+ **/
+class TranslationTask : public Task
+{
+
+public:
+
+ TranslationTask(size_t lineNumber,
+ InputType* source, OutputCollector* searchGraphCollector) :
+ m_source(source), m_lineNumber(lineNumber),
+ m_searchGraphCollector(searchGraphCollector) {}
+
+  /** Translate one sentence.
+   * Called by the main function implemented at the end of this source file. */
+ void Run() {
+
+ // report thread number
+#ifdef BOOST_HAS_PTHREADS
+ TRACE_ERR("Translating line " << m_lineNumber << " in thread id " << pthread_self() << std::endl);
+#endif
+
+ // shorthand for "global data"
+ const StaticData &staticData = StaticData::Instance();
+ // input sentence
+    Sentence sentence;
+ // set translation system
+ const TranslationSystem& system = staticData.GetTranslationSystem(TranslationSystem::DEFAULT);
+
+ // execute the translation
+ // note: this executes the search, resulting in a search graph
+ // we still need to apply the decision rule (MAP, MBR, ...)
+ Manager manager(m_lineNumber, *m_source,staticData.GetSearchAlgorithm(), &system);
+ manager.ProcessSentence();
+
+ // output search graph
+ if (m_searchGraphCollector) {
+ ostringstream out;
+ fix(out,PRECISION);
+
+ vector<SearchGraphNode> searchGraph;
+ manager.GetSearchGraph(searchGraph);
+ out << RelativeEntropyCalc::CalcRelativeEntropy(m_lineNumber,searchGraph) << endl;
+ m_searchGraphCollector->Write(m_lineNumber, out.str());
+
+ }
+ manager.CalcDecoderStatistics();
+ }
+
+ ~TranslationTask() {
+ delete m_source;
+ }
+
+private:
+ InputType* m_source;
+ size_t m_lineNumber;
+ OutputCollector* m_searchGraphCollector;
+ std::ofstream *m_alignmentStream;
+
+};
+
+static void PrintFeatureWeight(const FeatureFunction* ff)
+{
+
+ size_t weightStart = StaticData::Instance().GetScoreIndexManager().GetBeginIndex(ff->GetScoreBookkeepingID());
+ size_t weightEnd = StaticData::Instance().GetScoreIndexManager().GetEndIndex(ff->GetScoreBookkeepingID());
+ for (size_t i = weightStart; i < weightEnd; ++i) {
+ cout << ff->GetScoreProducerDescription(i-weightStart) << " " << ff->GetScoreProducerWeightShortName(i-weightStart) << " "
+ << StaticData::Instance().GetAllWeights()[i] << endl;
+ }
+}
+
+
+static void ShowWeights()
+{
+ fix(cout,6);
+ const StaticData& staticData = StaticData::Instance();
+ const TranslationSystem& system = staticData.GetTranslationSystem(TranslationSystem::DEFAULT);
+ const vector<const StatelessFeatureFunction*>& slf =system.GetStatelessFeatureFunctions();
+ const vector<const StatefulFeatureFunction*>& sff = system.GetStatefulFeatureFunctions();
+ const vector<PhraseDictionaryFeature*>& pds = system.GetPhraseDictionaries();
+ const vector<GenerationDictionary*>& gds = system.GetGenerationDictionaries();
+ for (size_t i = 0; i < sff.size(); ++i) {
+ PrintFeatureWeight(sff[i]);
+ }
+ for (size_t i = 0; i < slf.size(); ++i) {
+ PrintFeatureWeight(slf[i]);
+ }
+ for (size_t i = 0; i < pds.size(); ++i) {
+ PrintFeatureWeight(pds[i]);
+ }
+ for (size_t i = 0; i < gds.size(); ++i) {
+ PrintFeatureWeight(gds[i]);
+ }
+}
+
+} //namespace
+
+/** main function of the command line version of the decoder **/
+int main(int argc, char** argv)
+{
+ try {
+
+ // echo command line, if verbose
+ IFVERBOSE(1) {
+ TRACE_ERR("command: ");
+ for(int i=0; i<argc; ++i) TRACE_ERR(argv[i]<<" ");
+ TRACE_ERR(endl);
+ }
+
+ // set number of significant decimals in output
+ fix(cout,PRECISION);
+ fix(cerr,PRECISION);
+
+ // load all the settings into the Parameter class
+ // (stores them as strings, or array of strings)
+ Parameter* params = new Parameter();
+ if (!params->LoadParam(argc,argv)) {
+ params->Explain();
+ exit(1);
+ }
+
+
+ // initialize all "global" variables, which are stored in StaticData
+ // note: this also loads models such as the language model, etc.
+ if (!StaticData::LoadDataStatic(params, argv[0])) {
+ exit(1);
+ }
+
+ // setting "-show-weights" -> just dump out weights and exit
+ if (params->isParamSpecified("show-weights")) {
+ ShowWeights();
+ exit(0);
+ }
+
+ // shorthand for accessing information in StaticData
+ const StaticData& staticData = StaticData::Instance();
+
+
+ //initialise random numbers
+ srand(time(NULL));
+
+ // set up read/writing class
+ IOWrapper* ioWrapper = GetIOWrapper(staticData);
+ if (!ioWrapper) {
+    cerr << "Error: Failed to create IO object" << endl;
+ exit(1);
+ }
+
+ // check on weights
+ vector<float> weights = staticData.GetAllWeights();
+ IFVERBOSE(2) {
+ TRACE_ERR("The score component vector looks like this:\n" << staticData.GetScoreIndexManager());
+ TRACE_ERR("The global weight vector looks like this:");
+ for (size_t j=0; j<weights.size(); j++) {
+ TRACE_ERR(" " << weights[j]);
+ }
+ TRACE_ERR("\n");
+ }
+ // every score must have a weight! check that here:
+ if(weights.size() != staticData.GetScoreIndexManager().GetTotalNumberOfScores()) {
+ TRACE_ERR("ERROR: " << staticData.GetScoreIndexManager().GetTotalNumberOfScores() << " score components, but " << weights.size() << " weights defined" << std::endl);
+ exit(1);
+ }
+
+  // set up lexicalized reordering
+ PhraseBasedReorderingState::m_useFirstBackwardScore = false;
+
+
+ auto_ptr<OutputCollector> outputCollector;
+ outputCollector.reset(new OutputCollector());
+
+#ifdef WITH_THREADS
+ ThreadPool pool(staticData.ThreadCount());
+#endif
+
+ // main loop over set of input sentences
+ InputType* source = NULL;
+ size_t lineCount = 0;
+ while(ReadInput(*ioWrapper,staticData.GetInputType(),source)) {
+ IFVERBOSE(1) {
+ ResetUserTime();
+ }
+ // set up task of translating one sentence
+ TranslationTask* task =
+ new TranslationTask(lineCount,source, outputCollector.get());
+ // execute task
+#ifdef WITH_THREADS
+ pool.Submit(task);
+#else
+ task->Run();
+ delete task;
+#endif
+
+ source = NULL; //make sure it doesn't get deleted
+ ++lineCount;
+ }
+
+ // we are done, finishing up
+#ifdef WITH_THREADS
+ pool.Stop(true); //flush remaining jobs
+#endif
+
+ } catch (const std::exception &e) {
+ std::cerr << "Exception: " << e.what() << std::endl;
+ return EXIT_FAILURE;
+ }
+
+#ifndef EXIT_RETURN
+  //This avoids calling destructors, which can take a long time
+ exit(EXIT_SUCCESS);
+#else
+ return EXIT_SUCCESS;
+#endif
+}
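
Main.cpp above follows the decoder's Task/ThreadPool pattern: one heap-allocated task per input line, submitted to a worker pool when WITH_THREADS is defined and run inline (then deleted) otherwise, while an OutputCollector keeps per-line results in input order. The following is a stripped-down sketch of just that control flow; LineTask is an illustrative stand-in for TranslationTask, and the pool call appears only as a comment.

    // Sketch: the per-line task pattern from the main loop above.
    // LineTask stands in for TranslationTask; only the control flow is shown.
    #include <cstddef>
    #include <iostream>

    class Task {
    public:
      virtual void Run() = 0;
      virtual ~Task() {}
    };

    class LineTask : public Task {
    public:
      explicit LineTask(size_t lineNumber) : m_lineNumber(lineNumber) {}
      void Run() { std::cout << "scoring line " << m_lineNumber << std::endl; }
    private:
      size_t m_lineNumber;
    };

    int main() {
      for (size_t lineCount = 0; lineCount < 3; ++lineCount) {
        Task* task = new LineTask(lineCount);
    #ifdef WITH_THREADS
        // pool.Submit(task);  // a worker runs Run(); the pool deletes the task
    #else
        task->Run();           // single-threaded fallback, as in Main.cpp
        delete task;
    #endif
      }
      return 0;
    }
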
diff --git a/contrib/relent-filter/src/Main.h b/contrib/relent-filter/src/Main.h
new file mode 100755
index 000000000..f0782144e
--- /dev/null
+++ b/contrib/relent-filter/src/Main.h
@@ -0,0 +1,39 @@
+/*********************************************************************
+Relative Entropy-based Phrase table Pruning
+Copyright (C) 2012 Wang Ling
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification,
+are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice,
+ this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright notice,
+ this list of conditions and the following disclaimer in the documentation
+ and/or other materials provided with the distribution.
+ * Neither the name of the University of Edinburgh nor the names of its contributors
+ may be used to endorse or promote products derived from this software
+ without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS
+BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
+IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.
+***********************************************************************/
+
+#ifndef moses_cmd_Main_h
+#define moses_cmd_Main_h
+
+#include "StaticData.h"
+
+class IOWrapper;
+
+int main(int argc, char* argv[]);
+#endif
diff --git a/contrib/relent-filter/src/RelativeEntropyCalc.cpp b/contrib/relent-filter/src/RelativeEntropyCalc.cpp
new file mode 100755
index 000000000..212eedf87
--- /dev/null
+++ b/contrib/relent-filter/src/RelativeEntropyCalc.cpp
@@ -0,0 +1,83 @@
+/***********************************************************************
+Relative Entropy-based Phrase table Pruning
+Copyright (C) 2012 Wang Ling
+
+This library is free software; you can redistribute it and/or
+modify it under the terms of the GNU Lesser General Public
+License as published by the Free Software Foundation; either
+version 2.1 of the License, or (at your option) any later version.
+
+This library is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+Lesser General Public License for more details.
+
+You should have received a copy of the GNU Lesser General Public
+License along with this library; if not, write to the Free Software
+Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+***********************************************************************/
+
+#include <vector>
+#include "Hypothesis.h"
+#include "StaticData.h"
+#include "RelativeEntropyCalc.h"
+#include "Manager.h"
+
+using namespace std;
+using namespace Moses;
+using namespace MosesCmd;
+
+namespace MosesCmd
+{
+ double RelativeEntropyCalc::CalcRelativeEntropy(int translationId, std::vector<SearchGraphNode>& searchGraph){
+ const StaticData &staticData = StaticData::Instance();
+    const Phrase *constraint = staticData.GetConstrainingPhrase(translationId);
+
+ double prunedScore = -numeric_limits<double>::max();
+ double unprunedScore = -numeric_limits<double>::max();
+ for (size_t i = 0; i < searchGraph.size(); ++i) {
+ const SearchGraphNode& searchNode = searchGraph[i];
+ int nodeId = searchNode.hypo->GetId();
+ if(nodeId == 0) continue; // initial hypothesis
+
+ int forwardId = searchNode.forward;
+ if(forwardId == -1){ // is final hypothesis
+ Phrase catOutput(0);
+ ConcatOutputPhraseRecursive(catOutput, searchNode.hypo);
+        if(catOutput == *constraint){ // is the output actually the same as the constraint (forced decoding does not always force the output)
+ const Hypothesis *prevHypo = searchNode.hypo->GetPrevHypo();
+ int backId = prevHypo->GetId();
+ double derivationScore = searchNode.hypo->GetScore();
+ if(backId != 0){ // derivation using smaller units
+ if(prunedScore < derivationScore){
+ prunedScore = derivationScore;
+ }
+ }
+ if(unprunedScore < derivationScore){
+ unprunedScore = derivationScore;
+ }
+ }
+ }
+ }
+
+ double neg_log_div = 0;
+ if( unprunedScore == -numeric_limits<double>::max()){
+      neg_log_div = numeric_limits<double>::max(); // constraint not reachable in the search graph: assign maximum divergence so this phrase pair doesn't get pruned
+ }
+ else{
+ neg_log_div = unprunedScore - prunedScore;
+ }
+ if (neg_log_div > 100){
+ return 100;
+ }
+ return neg_log_div;
+ }
+
+ void RelativeEntropyCalc::ConcatOutputPhraseRecursive(Phrase& phrase, const Hypothesis *hypo){
+ int nodeId = hypo->GetId();
+ if(nodeId == 0) return; // initial hypothesis
+ ConcatOutputPhraseRecursive(phrase, hypo->GetPrevHypo());
+ const Phrase &endPhrase = hypo->GetCurrTargetPhrase();
+ phrase.Append(endPhrase);
+ }
+}
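
The score returned by CalcRelativeEntropy is the gap between two Viterbi log scores obtained under forced decoding: the best derivation of the constrained output overall (unprunedScore) and the best derivation that assembles the same output from smaller units, i.e. without using the phrase pair as a whole (prunedScore). A hedged, stand-alone sketch of that arithmetic follows; negLogDivergence is our name, and the cap of 100 mirrors the code above.

    // Sketch: the divergence arithmetic from CalcRelativeEntropy, isolated.
    // negLogDivergence is an illustrative name; scores are log-domain model scores.
    #include <iostream>
    #include <limits>

    double negLogDivergence(double unprunedScore, double prunedScore,
                            bool constraintReached) {
      if (!constraintReached) {
        // forced output never produced: maximum divergence, never pruned
        return std::numeric_limits<double>::max();
      }
      double negLogDiv = unprunedScore - prunedScore;
      return negLogDiv > 100 ? 100 : negLogDiv; // same cap as the code above
    }

    int main() {
      // best derivation scores -2.1; best derivation from smaller units
      // scores -5.3, so removing the phrase pair costs 3.2 in model score
      std::cout << negLogDivergence(-2.1, -5.3, true) << std::endl; // 3.2
      return 0;
    }
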
diff --git a/contrib/relent-filter/src/RelativeEntropyCalc.h b/contrib/relent-filter/src/RelativeEntropyCalc.h
new file mode 100755
index 000000000..efe8ba495
--- /dev/null
+++ b/contrib/relent-filter/src/RelativeEntropyCalc.h
@@ -0,0 +1,51 @@
+/*********************************************************************
+Relative Entropy-based Phrase table Pruning
+Copyright (C) 2012 Wang Ling
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without modification,
+are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice,
+ this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright notice,
+ this list of conditions and the following disclaimer in the documentation
+ and/or other materials provided with the distribution.
+ * Neither the name of the University of Edinburgh nor the names of its contributors
+ may be used to endorse or promote products derived from this software
+ without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS
+BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
+IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.
+***********************************************************************/
+
+#include <vector>
+#include "Hypothesis.h"
+#include "StaticData.h"
+#include "Manager.h"
+
+using namespace std;
+using namespace Moses;
+
+namespace MosesCmd
+{
+
+class RelativeEntropyCalc
+{
+public:
+ static double CalcRelativeEntropy(int translationId, std::vector<SearchGraphNode>& searchGraph);
+
+protected:
+ static void ConcatOutputPhraseRecursive(Phrase& phrase, const Hypothesis *hypo);
+};
+
+}
diff --git a/contrib/relent-filter/src/TranslationAnalysis.cpp b/contrib/relent-filter/src/TranslationAnalysis.cpp
new file mode 100755
index 000000000..89da48301
--- /dev/null
+++ b/contrib/relent-filter/src/TranslationAnalysis.cpp
@@ -0,0 +1,126 @@
+// $Id$
+
+#include <iostream>
+#include <sstream>
+#include <algorithm>
+#include "StaticData.h"
+#include "Hypothesis.h"
+#include "TranslationAnalysis.h"
+
+using namespace Moses;
+
+namespace TranslationAnalysis
+{
+
+void PrintTranslationAnalysis(const TranslationSystem* system, std::ostream &os, const Hypothesis* hypo)
+{
+ os << std::endl << "TRANSLATION HYPOTHESIS DETAILS:" << std::endl;
+ std::vector<const Hypothesis*> translationPath;
+
+ while (hypo) {
+ translationPath.push_back(hypo);
+ hypo = hypo->GetPrevHypo();
+ }
+
+ std::reverse(translationPath.begin(), translationPath.end());
+ std::vector<std::string> droppedWords;
+ std::vector<const Hypothesis*>::iterator tpi = translationPath.begin();
+ if(tpi == translationPath.end())
+ return;
+ ++tpi; // skip initial translation state
+ std::vector<std::string> sourceMap;
+ std::vector<std::string> targetMap;
+ std::vector<unsigned int> lmAcc(0);
+ size_t lmCalls = 0;
+ bool doLMStats = ((*tpi)->GetLMStats() != 0);
+ if (doLMStats)
+ lmAcc.resize((*tpi)->GetLMStats()->size(), 0);
+ for (; tpi != translationPath.end(); ++tpi) {
+ std::ostringstream sms;
+ std::ostringstream tms;
+ std::string target = (*tpi)->GetTargetPhraseStringRep();
+ std::string source = (*tpi)->GetSourcePhraseStringRep();
+ WordsRange twr = (*tpi)->GetCurrTargetWordsRange();
+ WordsRange swr = (*tpi)->GetCurrSourceWordsRange();
+ const AlignmentInfo &alignmentInfo = (*tpi)->GetCurrTargetPhrase().GetAlignmentInfo();
+ // language model backoff stats,
+ if (doLMStats) {
+ std::vector<std::vector<unsigned int> >& lmstats = *(*tpi)->GetLMStats();
+ std::vector<std::vector<unsigned int> >::iterator i = lmstats.begin();
+ std::vector<unsigned int>::iterator acc = lmAcc.begin();
+
+ for (; i != lmstats.end(); ++i, ++acc) {
+ std::vector<unsigned int>::iterator j = i->begin();
+ lmCalls += i->size();
+ for (; j != i->end(); ++j) {
+ (*acc) += *j;
+ }
+ }
+ }
+
+ bool epsilon = false;
+ if (target == "") {
+ target="<EPSILON>";
+ epsilon = true;
+ droppedWords.push_back(source);
+ }
+ os << " SOURCE: " << swr << " " << source << std::endl
+ << " TRANSLATED AS: " << target << std::endl
+ << " WORD ALIGNED: " << alignmentInfo << std::endl;
+ size_t twr_i = twr.GetStartPos();
+ size_t swr_i = swr.GetStartPos();
+ if (!epsilon) {
+ sms << twr_i;
+ }
+ if (epsilon) {
+ tms << "del(" << swr_i << ")";
+ } else {
+ tms << swr_i;
+ }
+ swr_i++;
+ twr_i++;
+ for (; twr_i <= twr.GetEndPos() && twr.GetEndPos() != NOT_FOUND; twr_i++) {
+ sms << '-' << twr_i;
+ }
+ for (; swr_i <= swr.GetEndPos() && swr.GetEndPos() != NOT_FOUND; swr_i++) {
+ tms << '-' << swr_i;
+ }
+ if (!epsilon) targetMap.push_back(sms.str());
+ sourceMap.push_back(tms.str());
+ }
+ std::vector<std::string>::iterator si = sourceMap.begin();
+ std::vector<std::string>::iterator ti = targetMap.begin();
+ os << std::endl << "SOURCE/TARGET SPANS:";
+ os << std::endl << " SOURCE:";
+ for (; si != sourceMap.end(); ++si) {
+ os << " " << *si;
+ }
+ os << std::endl << " TARGET:";
+ for (; ti != targetMap.end(); ++ti) {
+ os << " " << *ti;
+ }
+ os << std::endl << std::endl;
+ if (doLMStats && lmCalls > 0) {
+ std::vector<unsigned int>::iterator acc = lmAcc.begin();
+ const LMList& lmlist = system->GetLanguageModels();
+ LMList::const_iterator i = lmlist.begin();
+ for (; acc != lmAcc.end(); ++acc, ++i) {
+ char buf[256];
+ sprintf(buf, "%.4f", (float)(*acc)/(float)lmCalls);
+ os << (*i)->GetScoreProducerDescription() <<", AVG N-GRAM LENGTH: " << buf << std::endl;
+ }
+ }
+
+ if (droppedWords.size() > 0) {
+ std::vector<std::string>::iterator dwi = droppedWords.begin();
+ os << std::endl << "WORDS/PHRASES DROPPED:" << std::endl;
+ for (; dwi != droppedWords.end(); ++dwi) {
+ os << "\tdropped=" << *dwi << std::endl;
+ }
+ }
+ os << std::endl << "SCORES (UNWEIGHTED/WEIGHTED): ";
+ StaticData::Instance().GetScoreIndexManager().PrintLabeledWeightedScores(os, translationPath.back()->GetScoreBreakdown(), StaticData::Instance().GetAllWeights());
+ os << std::endl;
+}
+
+}
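
The language-model block in PrintTranslationAnalysis accumulates, per LM, the matched n-gram lengths over all queries along the best path and prints their average. A sketch of that arithmetic in miniature, under the assumption that each stats entry records the n-gram length matched by one LM query:

    // Sketch: the AVG N-GRAM LENGTH arithmetic from the LM block above,
    // assuming each entry records the n-gram length one LM query matched.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
      std::vector<unsigned int> lmstats; // stats for one LM on one path
      lmstats.push_back(3);
      lmstats.push_back(2);
      lmstats.push_back(4);

      unsigned int acc = 0;
      for (size_t j = 0; j < lmstats.size(); ++j)
        acc += lmstats[j];
      std::printf("AVG N-GRAM LENGTH: %.4f\n",
                  (float)acc / (float)lmstats.size()); // 3.0000
      return 0;
    }
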
diff --git a/contrib/relent-filter/src/TranslationAnalysis.h b/contrib/relent-filter/src/TranslationAnalysis.h
new file mode 100755
index 000000000..1eb7a04fd
--- /dev/null
+++ b/contrib/relent-filter/src/TranslationAnalysis.h
@@ -0,0 +1,25 @@
+// $Id$
+
+/*
+ * also see moses/SentenceStats
+ */
+
+#ifndef moses_cmd_TranslationAnalysis_h
+#define moses_cmd_TranslationAnalysis_h
+
+#include <iostream>
+#include "Hypothesis.h"
+#include "TranslationSystem.h"
+
+namespace TranslationAnalysis
+{
+
+/***
+ * print details about the translation represented in hypothesis to
+ * os. Included information: phrase alignment, words dropped, scores
+ */
+void PrintTranslationAnalysis(const Moses::TranslationSystem* system, std::ostream &os, const Moses::Hypothesis* hypo);
+
+}
+
+#endif
diff --git a/contrib/relent-filter/src/mbr.cpp b/contrib/relent-filter/src/mbr.cpp
new file mode 100755
index 000000000..7462d3fc6
--- /dev/null
+++ b/contrib/relent-filter/src/mbr.cpp
@@ -0,0 +1,178 @@
+#include <iostream>
+#include <fstream>
+#include <sstream>
+#include <iomanip>
+#include <vector>
+#include <map>
+#include <stdlib.h>
+#include <math.h>
+#include <algorithm>
+#include <stdio.h>
+#include "TrellisPathList.h"
+#include "TrellisPath.h"
+#include "StaticData.h"
+#include "Util.h"
+#include "mbr.h"
+
+using namespace std;
+using namespace Moses;
+
+
+/* Input:
+   1. a sorted n-best list, with duplicates filtered out, in the following format
+   0 ||| amr moussa is currently on a visit to libya , tomorrow , sunday , to hold talks with regard to the in sudan . ||| 0 -4.94418 0 0 -2.16036 0 0 -81.4462 -106.593 -114.43 -105.55 -12.7873 -26.9057 -25.3715 -52.9336 7.99917 -24 ||| -4.58432
+
+   2. a weight vector
+   3. BLEU order (default = 4)
+   4. scaling factor applied to the weight vector (default = 1.0)
+
+   Output:
+   the translation that minimises the Bayes risk of the n-best list
+
+
+*/
+
+int BLEU_ORDER = 4;
+int SMOOTH = 1;
+float min_interval = 1e-4;
+void extract_ngrams(const vector<const Factor* >& sentence, map < vector < const Factor* >, int > & allngrams)
+{
+ vector< const Factor* > ngram;
+ for (int k = 0; k < BLEU_ORDER; k++) {
+ for(int i =0; i < max((int)sentence.size()-k,0); i++) {
+ for ( int j = i; j<= i+k; j++) {
+ ngram.push_back(sentence[j]);
+ }
+ ++allngrams[ngram];
+ ngram.clear();
+ }
+ }
+}
+
+float calculate_score(const vector< vector<const Factor*> > & sents, int ref, int hyp, vector < map < vector < const Factor *>, int > > & ngram_stats )
+{
+ int comps_n = 2*BLEU_ORDER+1;
+ vector<int> comps(comps_n);
+ float logbleu = 0.0, brevity;
+
+ int hyp_length = sents[hyp].size();
+
+ for (int i =0; i<BLEU_ORDER; i++) {
+ comps[2*i] = 0;
+ comps[2*i+1] = max(hyp_length-i,0);
+ }
+
+ map< vector < const Factor * > ,int > & hyp_ngrams = ngram_stats[hyp] ;
+ map< vector < const Factor * >, int > & ref_ngrams = ngram_stats[ref] ;
+
+ for (map< vector< const Factor * >, int >::iterator it = hyp_ngrams.begin();
+ it != hyp_ngrams.end(); it++) {
+ map< vector< const Factor * >, int >::iterator ref_it = ref_ngrams.find(it->first);
+ if(ref_it != ref_ngrams.end()) {
+ comps[2* (it->first.size()-1)] += min(ref_it->second,it->second);
+ }
+ }
+ comps[comps_n-1] = sents[ref].size();
+
+ for (int i=0; i<BLEU_ORDER; i++) {
+ if (comps[0] == 0)
+ return 0.0;
+ if ( i > 0 )
+ logbleu += log((float)comps[2*i]+SMOOTH)-log((float)comps[2*i+1]+SMOOTH);
+ else
+ logbleu += log((float)comps[2*i])-log((float)comps[2*i+1]);
+ }
+ logbleu /= BLEU_ORDER;
+ brevity = 1.0-(float)comps[comps_n-1]/comps[1]; // comps[comps_n-1] is the ref length, comps[1] is the test length
+ if (brevity < 0.0)
+ logbleu += brevity;
+ return exp(logbleu);
+}
+
+const TrellisPath doMBR(const TrellisPathList& nBestList)
+{
+ float marginal = 0;
+
+ vector<float> joint_prob_vec;
+ vector< vector<const Factor*> > translations;
+ float joint_prob;
+ vector< map < vector <const Factor *>, int > > ngram_stats;
+
+ TrellisPathList::const_iterator iter;
+
+ // get max score to prevent underflow
+ float maxScore = -1e20;
+ for (iter = nBestList.begin() ; iter != nBestList.end() ; ++iter) {
+ const TrellisPath &path = **iter;
+ float score = StaticData::Instance().GetMBRScale()
+ * path.GetScoreBreakdown().InnerProduct(StaticData::Instance().GetAllWeights());
+ if (maxScore < score) maxScore = score;
+ }
+
+ for (iter = nBestList.begin() ; iter != nBestList.end() ; ++iter) {
+ const TrellisPath &path = **iter;
+ joint_prob = UntransformScore(StaticData::Instance().GetMBRScale() * path.GetScoreBreakdown().InnerProduct(StaticData::Instance().GetAllWeights()) - maxScore);
+ marginal += joint_prob;
+ joint_prob_vec.push_back(joint_prob);
+
+ // get words in translation
+ vector<const Factor*> translation;
+ GetOutputFactors(path, translation);
+
+ // collect n-gram counts
+ map < vector < const Factor *>, int > counts;
+ extract_ngrams(translation,counts);
+
+ ngram_stats.push_back(counts);
+ translations.push_back(translation);
+ }
+
+ vector<float> mbr_loss;
+ float bleu, weightedLoss;
+ float weightedLossCumul = 0;
+ float minMBRLoss = 1000000;
+ int minMBRLossIdx = -1;
+
+ /* Main MBR computation done here */
+ iter = nBestList.begin();
+ for (unsigned int i = 0; i < nBestList.GetSize(); i++) {
+ weightedLossCumul = 0;
+ for (unsigned int j = 0; j < nBestList.GetSize(); j++) {
+ if ( i != j) {
+ bleu = calculate_score(translations, j, i,ngram_stats );
+ weightedLoss = ( 1 - bleu) * ( joint_prob_vec[j]/marginal);
+ weightedLossCumul += weightedLoss;
+ if (weightedLossCumul > minMBRLoss)
+ break;
+ }
+ }
+ if (weightedLossCumul < minMBRLoss) {
+ minMBRLoss = weightedLossCumul;
+ minMBRLossIdx = i;
+ }
+ iter++;
+ }
+  /* return the sentence that minimises Bayes risk under 1 - BLEU loss */
+ return nBestList.at(minMBRLossIdx);
+ //return translations[minMBRLossIdx];
+}
+
+void GetOutputFactors(const TrellisPath &path, vector <const Factor*> &translation)
+{
+ const std::vector<const Hypothesis *> &edges = path.GetEdges();
+ const std::vector<FactorType>& outputFactorOrder = StaticData::Instance().GetOutputFactorOrder();
+ assert (outputFactorOrder.size() == 1);
+
+ // print the surface factor of the translation
+ for (int currEdge = (int)edges.size() - 1 ; currEdge >= 0 ; currEdge--) {
+ const Hypothesis &edge = *edges[currEdge];
+ const Phrase &phrase = edge.GetCurrTargetPhrase();
+ size_t size = phrase.GetSize();
+ for (size_t pos = 0 ; pos < size ; pos++) {
+
+ const Factor *factor = phrase.GetFactor(pos, outputFactorOrder[0]);
+ translation.push_back(factor);
+ }
+ }
+}
+
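doMBR above implements the standard MBR decision rule: the scaled model scores are renormalised into a posterior over the n-best list, each hypothesis is charged its expected 1 - BLEU loss against all the others (the early break merely skips hypotheses that already exceed the current minimum), and the minimiser is returned; the gain is the smoothed sentence-level BLEU from calculate_score. A compact sketch of the decision rule with a pluggable gain; expectedLoss and toyGain are illustrative names only.

    // Sketch: MBR as expected-loss minimisation, as in doMBR above.
    // expectedLoss/toyGain are illustrative; the real gain is the smoothed
    // sentence BLEU computed by calculate_score.
    #include <cstddef>
    #include <iostream>
    #include <vector>

    double expectedLoss(size_t i, const std::vector<double>& posterior,
                        double (*gain)(size_t, size_t)) {
      double loss = 0.0;
      for (size_t j = 0; j < posterior.size(); ++j)
        if (j != i) loss += (1.0 - gain(j, i)) * posterior[j]; // risk under 1 - gain
      return loss;
    }

    double toyGain(size_t j, size_t i) { return j == i ? 1.0 : 0.5; } // placeholder similarity

    int main() {
      std::vector<double> posterior; // renormalised, scaled model scores
      posterior.push_back(0.5);
      posterior.push_back(0.3);
      posterior.push_back(0.2);

      size_t best = 0;
      double bestLoss = expectedLoss(0, posterior, toyGain);
      for (size_t i = 1; i < posterior.size(); ++i) {
        double l = expectedLoss(i, posterior, toyGain);
        if (l < bestLoss) { bestLoss = l; best = i; }
      }
      std::cout << "MBR pick: hypothesis " << best << std::endl; // hypothesis 0
      return 0;
    }
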
diff --git a/contrib/relent-filter/src/mbr.h b/contrib/relent-filter/src/mbr.h
new file mode 100755
index 000000000..d08b11a98
--- /dev/null
+++ b/contrib/relent-filter/src/mbr.h
@@ -0,0 +1,28 @@
+// $Id$
+
+/***********************************************************************
+Moses - factored phrase-based language decoder
+Copyright (C) 2006 University of Edinburgh
+
+This library is free software; you can redistribute it and/or
+modify it under the terms of the GNU Lesser General Public
+License as published by the Free Software Foundation; either
+version 2.1 of the License, or (at your option) any later version.
+
+This library is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+Lesser General Public License for more details.
+
+You should have received a copy of the GNU Lesser General Public
+License along with this library; if not, write to the Free Software
+Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+***********************************************************************/
+
+#ifndef moses_cmd_mbr_h
+#define moses_cmd_mbr_h
+
+const Moses::TrellisPath doMBR(const Moses::TrellisPathList& nBestList);
+void GetOutputFactors(const Moses::TrellisPath &path, std::vector <const Moses::Factor*> &translation);
+float calculate_score(const std::vector< std::vector<const Moses::Factor*> > & sents, int ref, int hyp, std::vector < std::map < std::vector < const Moses::Factor *>, int > > & ngram_stats );
+#endif
diff --git a/contrib/sigtest-filter/Makefile b/contrib/sigtest-filter/Makefile
index ddefc907b..71de9c45f 100644
--- a/contrib/sigtest-filter/Makefile
+++ b/contrib/sigtest-filter/Makefile
@@ -1,5 +1,5 @@
SALMDIR=/Users/hieuhoang/workspace/salm
-FLAVOR?=o32
+FLAVOR?=o64
INC=-I$(SALMDIR)/Src/Shared -I$(SALMDIR)/Src/SuffixArrayApplications -I$(SALMDIR)/Src/SuffixArrayApplications/SuffixArraySearch
OBJS=$(SALMDIR)/Distribution/Linux/Objs/Search/_SuffixArrayApplicationBase.$(FLAVOR) $(SALMDIR)/Distribution/Linux/Objs/Search/_SuffixArraySearchApplicationBase.$(FLAVOR) $(SALMDIR)/Distribution/Linux/Objs/Shared/_String.$(FLAVOR) $(SALMDIR)/Distribution/Linux/Objs/Shared/_IDVocabulary.$(FLAVOR)