Attempt to stop people from publishing non-comparable BLEU scores, as discussed in statmt meeting

author: Kenneth Heafield <github@kheafield.com> 2017-10-20 00:57:36 +0300
committer: Kenneth Heafield <github@kheafield.com> 2017-10-20 00:57:36 +0300
commit: 545eee7e75487aeaf45a8b077c57e189e50b2c2e (patch)
tree: 6c1436f6192bbf35ded19d9d3df1efe4e9653825
parent: eced95d694cb0297ebaba3a66cd4ee3f4d97f3c6 (diff)
1 files changed, 3 insertions, 0 deletions
diff --git a/scripts/generic/multi-bleu.perl b/scripts/generic/multi-bleu.perl
index a25e347bb..15e26ff4a 100755
--- a/scripts/generic/multi-bleu.perl
+++ b/scripts/generic/multi-bleu.perl
@@ -168,6 +168,9 @@ printf "BLEU = %.2f, %.1f/%.1f/%.1f/%.1f (BP=%.3f, ratio=%.3f, hyp_len=%d, ref_l
     $length_translation,
     $length_reference;
 
+
+print STDERR "Do not publish scores from multi-bleu.perl.  The scores depend on your tokenizer, which is unlikely to be reproducible from your paper or consistent across research groups.  Instead you should detokenize then use mteval-v14.pl, which has a standard tokenization.  Scores from multi-bleu.perl can still be used for internal purposes when you have a consistent tokenizer.\n";
+
 sub my_log {
   return -9999999999 unless $_[0];
   return log($_[0]);