#!/usr/bin/env bash
# score-0.85
# copyright 2010, João L. A. C. Rosas
# licenced under the GPL licence, version 3
# date: 02/09/2010
# Special thanks to Hilário Leal Fontes and Maria José Machado who made research about this script, sent me experimental results, helped to test it and made very helpful suggestions
# ***Purpose***: This script processes all the Moses translation files present in the $mosesdir/translation_files_for_tmx directory, if you want to prepare a translation to be used with a translation memory, or in the $mosesdir/translation_output directory, if you want a plain translation. For each Moses translation found there, it extracts from the file name the abbreviations of the source and target languages and the scorebasename (which must not include the "." character). With this information, it reconstructs the full names of the source file and of the reference translation file. For each set of source file, Moses translation file and reference (human-made) translation file, this script creates a report presenting, depending on the parameters set by the user, either 1) a score of the whole Moses translation or 2) a score of each segment of the Moses translation. In the latter case, each line of the report consists of a) the BLEU score and b) the NIST score of the Moses translation ***of that segment***, c) the number of the segment in the source document, and d) the source, e) reference and f) Moses translation segments, in that order. These 6 fields are separated by the "|" character. The lines are sorted in ascending order of BLEU score.
###########################################################################################################################################################
#THIS SCRIPT ASSUMES THAT AN IRSTLM AND RANDLM ENABLED MOSES HAS ALREADY BEEN INSTALLED WITH THE create script IN $mosesdir (BY DEFAULT $HOME/moses-irstlm-randlm), THAT A CORPUS HAS BEEN TRAINED WITH THE train script AND THAT A TRANSLATION HAS ALREADY BEEN MADE WITH THE translate script.
# IT ALSO ASSUMES THAT THE PACKAGES UPON WHICH IT DEPENDS, INDICATED IN THE create script, HAVE BEEN INSTALLED
###########################################################################################################################################################
##########################################################################################################################################################
# The values of the variables that follow should be filled according to your needs: #
##########################################################################################################################################################
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! THIS SCRIPT SHOULD NOT BE USED WITH DOCUMENTS TRANSLATED WITH THE translate script WITH ITS $translate_for_tmx PARAMETER SET TO 1 ***UNLESS*** the $othercleanings, $improvesegmentation and $removeduplicates parameters of that script were all set to 0 and $minseglen was set to -1 (this processing changes the order of the segments and can also make the source document have a number of segments that is different from the number of segments of the reference translation, namely because it can delete some segments and/or add some new ones) !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# !!! The names of the source and target reference translation files used for scoring should not include spaces !!!
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# The source file name and the reference translation file MUST observe the following conventions:
# Source file : <basename>.<abbreviation of source language> (ex: 100.en)
# Reference translation file: <basename>.<abbreviation of target language>.ref (ex: 100.pt.ref)
#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
#Base directory of your Moses installation (made with the create script)
mosesdir=$HOME/moses-irstlm-randlm
#Scores documents prepared for TMX translation memories. If this parameter is set to 1, the script will look for the documents $s and $m in the $mosesdir/translation_files_for_tmx directory; if not set to 1, it will look for the $s document in the $mosesdir/translation_input directory and for the $m document in $mosesdir/translation_output; in both cases, it will look for the $r document in $mosesdir/translation_reference
scoreTMXdocuments=0
#This is an arbitrary note that you can use to record something (a parameter used, for instance) in the name of the scorefile. That way, you may not have to open several files to find the one you are looking for (if you score the same document several times with different translation parameters); this is most useful while you are searching for the right combination of parameters for your specific situation. !!! Remember, however, that most Linux file systems limit file names to 255 characters; if the name of the document to translate is already long, you might exceed that limit !!! Example of a note: "12-07-2010" (date of the batch score)
batch_user_note="12-07-2010"
#Create a report where each segment gets its own score; 0 = score the whole document; 1 = score each segment
score_line_by_line=0
#Remove moses translation segments that are equal to reference translation segments and whose BLEU score is zero (!!! Only active if score_line_by_line=1 !!!)
remove_equal=1
#Tokenize the source document and the reference and the Moses translation
tokenize=1
#Lowercase the source document and the reference and the Moses translation
lowercase=1
##########################################################################################################################################################
# DO NOT CHANGE THE LINES THAT FOLLOW ... unless you know what you are doing! #
##########################################################################################################################################################
#Directory where Moses translation tools are located
toolsdir=$mosesdir/tools
if [ "$scoreTMXdocuments" = "1" ]; then
sourcelanguagedir=$mosesdir/translation_files_for_tmx
mosestranslationdir=$mosesdir/translation_files_for_tmx
else
sourcelanguagedir=$mosesdir/translation_input
mosestranslationdir=$mosesdir/translation_output
fi
reftranslationdir=$mosesdir/translation_reference
#Directory where the output of the present script, the translation scoring document, will be created
scoredir=$mosesdir/translation_scoring
# Create the input and output directories, if they do not yet exist; later steps check that the input files themselves exist (this saves the user the trouble of creating these directories by hand)
mkdir -p "$sourcelanguagedir" "$reftranslationdir" "$mosestranslationdir" "$scoredir"
# Define functions
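# Illustrative sketch (not called anywhere in this script): one possible way to
# list the segments of a line-by-line report whose BLEU score falls below a
# threshold, relying only on the "|"-separated layout described in the Purpose
# note at the top. The function name low_bleu_segments and its two arguments
# (report file, threshold) are assumptions made for this example only.
low_bleu_segments() {
	# Only lines whose first field is a plain number are data lines; the
	# report header and separator lines are skipped. Prints the third field,
	# which holds the segment number in the form <n>.
	awk -F'|' -v max="$2" '$1 ~ /^[0-9]*\.?[0-9]+$/ && $1 + 0 < max + 0 { print $3 }' "$1"
}
# Example: low_bleu_segments "$scoredir/<name of a line-by-line report>" 0.3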
remove_garbage() {
if [ -f $scoredir/$s ]; then
rm $scoredir/$s
fi
if [ -f $scoredir/$r ]; then
rm $scoredir/$r
fi
if [ -f $scoredir/$m ]; then
rm $scoredir/$m
fi
if [ -f $scoredir/$scorebasename-src.$lang1.sgm ]; then
rm $scoredir/$scorebasename-src.$lang1.sgm
fi
if [ -f $scoredir/$scorebasename-ref.$lang2.sgm ]; then
rm $scoredir/$scorebasename-ref.$lang2.sgm
fi
if [ -f $scoredir/$scorebasename.moses.sgm ]; then
rm $scoredir/$scorebasename.moses.sgm
fi
}
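# Illustrative sketch (not called anywhere in this script): how the NIST and
# BLEU values are cut out of the scorer's summary text by the parameter
# expansions used further below. The function name extract_scores is an
# assumption, and the sample line in the usage comment only imitates the
# "NIST score = ... BLEU score = ... for system ..." pattern that those
# expansions rely on; it is not real scorer output.
extract_scores() {
	local score=$1
	local scoretemp=${score%% for system *}       # keep the text before " for system "
	local scoretemp1=${scoretemp#*NIST score = }  # drop everything up to the NIST value
	local nist=${scoretemp1%% *}                  # the NIST value ends at the first space
	local bleutemp=${scoretemp1#*BLEU score = }   # drop everything up to the BLEU value
	local bleu=${bleutemp%% *}                    # the BLEU value ends at the first space, if any
	echo "$bleu|$nist"
}
# Example: extract_scores 'NIST score = 5.1234  BLEU score = 0.2345 for system "moses"'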
log_wrong_file() {
if [ ! -f $scoredir/$tmp ]; then
echo "LIST OF NOT SCORED FILES (in the $mosestranslationdir directory):" > $scoredir/$tmp
echo "==============================================================================================" >> $scoredir/$tmp
echo "" >> $scoredir/$tmp
echo "==============================================================================================" >> $scoredir/$tmp
fi
echo -e "***$filename*** file:" >> $scoredir/$tmp
echo "----------------------------------------------------------------------------------------------" >> $scoredir/$tmp
echo -e "\t$error_msg" >> $scoredir/$tmp
echo "==============================================================================================" >> $scoredir/$tmp
}
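# Illustrative sketch (not called anywhere in this script): how a Moses
# translation file name is split into scorebasename, source language and target
# language by the parameter expansions used in the main loop below. The function
# name parse_moses_filename and the sample name 100.en.pt.moses are assumptions
# made for this example only.
parse_moses_filename() {
	local filename=$1
	local temp=${filename%.*}   # drop the ".moses" extension      -> 100.en.pt
	local temp1=${temp%.*}      # drop the target language         -> 100.en
	local base=${temp1%.*}      # drop the source language         -> 100
	local l1=${temp1##*.}       # source language abbreviation     -> en
	local l2=${temp##*.}        # target language abbreviation     -> pt
	echo "$base $l1 $l2"
}
# Example: parse_moses_filename 100.en.pt.moses
# The matching source file would then be 100.en and the reference file 100.pt.ref.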
#-----------------------------------------------------------------------------------------------------------------------------------------
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
tmp="!!!SCORES-NOT-DONE!!!"
if [ -f $scoredir/$tmp ]; then
rm $scoredir/$tmp
fi
i=0
for filetoscore in $mosestranslationdir/*; do
if [ ! -d $filetoscore ]; then
error_msg=""
filename=${filetoscore##*/}
tempbasename=${filename%.*}
tempbasename1=${tempbasename%.*}
scorebasename=${tempbasename1%.*}
temp=${filename%.*}
temp1=${temp%.*}
lang1=${temp1##*.}
lang2=${temp##*.}
s=$scorebasename.$lang1
m=$filename
r=$scorebasename.$lang2.ref
#-----------------------------------------------------------------------------------------------------------------------------------------
#Define report name
if [ "$lang1" = "$filename" -a "$lang2" = "$filename" ]; then
lang1t=""
lang2t=""
else
lang1t=$lang1
lang2t=$lang2
fi
if [ "$score_line_by_line" = "1" ]; then
scorefile=$scorebasename.$batch_user_note.$lang1t-$lang2t.F-$scoreTMXdocuments-R-$remove_equal-T-$tokenize.L-$lowercase.line-by-line
else
scorefile=$scorebasename-$batch_user_note-$lang1t-$lang2t.F-$scoreTMXdocuments-R-$remove_equal-T-$tokenize.L-$lowercase.whole-doc
fi
#-----------------------------------------------------------------------------------------------------------------------------------------
scorefile_name_len=${#scorefile}
if [ "${filetoscore##*.}" = "moses" ]; then
echo "--------------------------------------------------------------------"
echo "MOSES TRANSLATION: $filename (in the $mosestranslationdir directory)"
let i=$i+1
if [ "$scorefile_name_len" -gt "229" -a "$score_line_by_line" != "1" ]; then
echo "==============================================================================================" >> $scoredir/$tmp
error_msg="The translated file name and/or the \$batch_user_note parameter would result in a scorefile name that exceeds the maximal limit of 255 characters. Please try to use translation files and user notes that do not lead to file names exceeding the maximal allowable length."
echo -e "$error_msg Analysing now next Moses translation."
log_wrong_file
scorefile=$(echo $scorefile | cut -c1-229)
continue
fi
if [ "$scorefile_name_len" -gt "242" -a "$score_line_by_line" = "1" ]; then
error_msg="The translated file name and/or the \$batch_user_note parameter would result in a scorefile name that exceeds the maximal limit of 255 characters. Please try to use translation files and user notes that do not lead to files with names exceeding their maximal allowable length."
echo -e "$error_msg Analysing now next Moses translation."
log_wrong_file
scorefile=$(echo $scorefile | cut -c1-242)
continue
fi
#-----------------------------------------------------------------------------------------------------------------------------------------
if [ "$lang1" = "$lang2" ]; then
error_msg="You did not respect the Moses for Mere Mortals conventions for naming the source and/or the reference files.\n\tSource file\t\t\t: <scorebasename>.<source language abbreviation> (ex: 100.en)\n\tReference translation file\t: <scorebasename>.<target language abbreviation>.ref (ex: 100.pt.ref)\nPlease correct the names of the files and then run this script again."
echo -e "$error_msg Analysing now next Moses translation."
log_wrong_file
continue
fi
#-----------------------------------------------------------------------------------------------------------------------------------------
#Get number of segments for each input file (source, reference and Moses translation)
#avoid wc error messages when the file does not exist
exec 3> /dev/stderr 2> /dev/null
lines_s=`wc -l "$sourcelanguagedir/$s" | awk '{print $1}'`
if [ "$lines_s" ]; then
echo "Source file : $lines_s lines"
else
echo "Source file : doesn't exist"
fi
lines=`wc -l "$mosestranslationdir/$m" | awk '{print $1}'`
if [ "$lines" ]; then
echo "Moses translation: $lines lines"
else
echo "Moses translation: doesn't exist"
fi
lines_r=`wc -l "$reftranslationdir/$r" | awk '{print $1}'`
if [ "$lines_r" ]; then
echo "Reference file : $lines_r lines"
else
echo "Reference file : doesn't exist"
fi
exec 2>&3
#Check that source, reference and Moses translation files have the same number of segments
if [ "$lines_s" != "$lines_r" ]; then
if [ "$lines_s" = "" ]; then
lines_s=0
fi
if [ "$lines_r" = "" ]; then
lines_r=0
fi
error_msg="Source and reference files do not have the same number of lines (source = $lines_s and reference = $lines_r lines) or one or both of them might not exist. If you verify manually that they do have the same number of segments, then wc (a Linux command) is interpreting at least one of the characters of one of the files as something it isn't. If that is the case, you will have to isolate the line(s) that is (are) causing problems and to substitute the character in question by some other character."
echo "$error_msg Analysing now next Moses translation."
log_wrong_file
remove_garbage
continue
fi
if [ "$lines" != "$lines_r" ]; then
if [ "$lines" = "" ]; then
lines=0
fi
if [ "$lines_r" = "" ]; then
lines_r=0
fi
error_msg="Reference and moses translation files do not have the same number of lines (reference = $lines_r lines and moses translation = $lines) or one or both of them might not exist. If you verify manually that they do have the same number of segments, then wc (a Linux command) is interpreting at least one of the characters of one of the files as something it isn't. If that is the case, you will have to isolate the line(s) that is (are) causing problems and to substitute the character in question by some other character."
echo "$error_msg Analysing now next Moses translation."
log_wrong_file
remove_garbage
continue
fi
#-----------------------------------------------------------------------------------------------------------------------------------------
#Check that $s, $r and $m exist
if [ ! -f $sourcelanguagedir/$s ] ; then
error_msg="The expected source language file ($sourcelanguagedir/$s) needed for scoring the Moses translation ($mosestranslationdir/$m) does not exist. Did you respect the file naming conventions described at the top of the score-0.85 script or did you use the wrong language pair for translating?"
echo "$error_msg Analysing now next Moses translation."
log_wrong_file
continue
else
cp $sourcelanguagedir/$s $scoredir
if [ "$tokenize" = "1" -a "$lowercase" = "1" ]; then
$toolsdir/scripts/tokenizer.perl -l $lang1 < $scoredir/$s > $scoredir/$s.tok
$toolsdir/scripts/lowercase.perl < $scoredir/$s.tok > $scoredir/$s
rm -f $scoredir/$s.tok
elif [ "$tokenize" = "1" ]; then
$toolsdir/scripts/tokenizer.perl -l $lang1 < $scoredir/$s > $scoredir/$s.tok
mv -f $scoredir/$s.tok $scoredir/$s
elif [ "$lowercase" = "1" ]; then
$toolsdir/scripts/lowercase.perl < $scoredir/$s > $scoredir/$s.lower
mv -f $scoredir/$s.lower $scoredir/$s
fi
sed 's/\\$/\\ /g' < $scoredir/$s > $scoredir/$s.clean
mv -f $scoredir/$s.clean $scoredir/$s
fi
if [ ! -f $reftranslationdir/$r ] ; then
error_msg="The expected reference (human-made) file ($reftranslationdir/$r) needed for scoring the Moses translation ($mosestranslationdir/$m) does not exist. Did you respect the file naming conventions described at the top of the score-0.85 script or did you use the wrong language pair for translating?"
echo "$error_msg Analysing now next Moses translation."
log_wrong_file
continue
else
cp $reftranslationdir/$r $scoredir
if [ "$tokenize" = "1" -a "$lowercase" = "1" ]; then
$toolsdir/scripts/tokenizer.perl -l $lang2 < $scoredir/$r > $scoredir/$r.tok
$toolsdir/scripts/lowercase.perl < $scoredir/$r.tok > $scoredir/$r
rm -f $scoredir/$r.tok
elif [ "$tokenize" = "1" ]; then
$toolsdir/scripts/tokenizer.perl -l $lang2 < $scoredir/$r > $scoredir/$r.tok
mv -f $scoredir/$r.tok $scoredir/$r
elif [ "$lowercase" = "1" ]; then
$toolsdir/scripts/lowercase.perl < $scoredir/$r > $scoredir/$r.lower
mv -f $scoredir/$r.lower $scoredir/$r
fi
sed 's/\\$/\\ /g' < $scoredir/$r > $scoredir/$r.clean
mv -f $scoredir/$r.clean $scoredir/$r
fi
if [ ! -f $mosestranslationdir/$m ] ; then
error_msg="The Moses translation file ($mosestranslationdir/$m) does not exist. Did you respect the file naming conventions described at the top of the score-0.85 script?"
echo "$error_msg Analysing now next Moses translation."
log_wrong_file
continue
else
cp $mosestranslationdir/$m $scoredir
if [ "$tokenize" = "1" -a "$lowercase" = "1" ]; then
$toolsdir/scripts/tokenizer.perl -l $lang2 < $scoredir/$m > $scoredir/$m.tok
$toolsdir/scripts/lowercase.perl < $scoredir/$m.tok > $scoredir/$m
rm -f $scoredir/$m.tok
elif [ "$tokenize" = "1" ]; then
$toolsdir/scripts/tokenizer.perl -l $lang2 < $scoredir/$m > $scoredir/$m.tok
mv -f $scoredir/$m.tok $scoredir/$m
elif [ "$lowercase" = "1" ]; then
$toolsdir/scripts/lowercase.perl < $scoredir/$m > $scoredir/$m.lower
mv -f $scoredir/$m.lower $scoredir/$m
fi
sed 's/\\$/\\ /g' < $scoredir/$m > $scoredir/$m.clean
mv -f $scoredir/$m.clean $scoredir/$m
fi
echo "===================================================================================" > $scoredir/temp
echo "*** Script version ***: score-0.85" >> $scoredir/temp
echo "===================================================================================" >> $scoredir/temp
echo "===================================================================================" >> $scoredir/temp
echo "Extracted file names and other data (extracted automatically; errors are possible):" >> $scoredir/temp
echo "===================================================================================" >> $scoredir/temp
echo "source language : $lang1" >> $scoredir/temp
echo "target language : $lang2" >> $scoredir/temp
echo "-----------------------------------------------------------------------------------" >> $scoredir/temp
echo "source file : $sourcelanguagedir/$s" >> $scoredir/temp
echo "moses translation : $mosestranslationdir/$m" >> $scoredir/temp
echo "reference file : $reftranslationdir/$r" >> $scoredir/temp
echo "-----------------------------------------------------------------------------------" >> $scoredir/temp
echo "batch_user_note : $batch_user_note" >> $scoredir/temp
echo "===================================================================================" >> $scoredir/temp
echo "score_line_by_line : $score_line_by_line" >> $scoredir/temp
if [ "$score_line_by_line" = "1" ]; then
echo "tokenize : $tokenize" >> $scoredir/temp
echo "lowercase : $lowercase" >> $scoredir/temp
echo "remove_equal : $remove_equal" >> $scoredir/temp
fi
echo "===================================================================================" >> $scoredir/temp
#=========================================================================================================================================================
#1. SCORE LINE BY LINE
#=========================================================================================================================================================
if [ "$score_line_by_line" = "1" ]; then
if [ -f $scoredir/$scorefile ]; then
rm -f $scoredir/$scorefile
fi
echo "************************** Score line by line"
counter=0
echo "BLEU|NIST|<segnum>|source seg|ref seg|Moses seg" >> $scoredir/temp
echo "" >> $scoredir/temp
sed -e 's#\& #\&amp\; #g' -e 's#<#\&lt\;#g' $scoredir/$s > $scoredir/$s.tmp
mv $scoredir/$s.tmp $scoredir/$s
sed -e 's#\& #\&amp\; #g' -e 's#<#\&lt\;#g' $scoredir/$r > $scoredir/$r.tmp
mv $scoredir/$r.tmp $scoredir/$r
sed -e 's#\& #\&amp\; #g' -e 's#<#\&lt\;#g' $scoredir/$m > $scoredir/$m.tmp
mv $scoredir/$m.tmp $scoredir/$m
echo "***** Score each segment:"
while [ "$counter" -lt "$lines" ]; do
let "counter += 1"
echo "Segment $counter"
source_sentence=`awk "NR==$counter{print;exit}" $scoredir/$s`
ref_sentence=`awk "NR==$counter{print;exit}" $scoredir/$r`
moses_sentence=`awk "NR==$counter{print;exit}" $scoredir/$m`
#-----------------------------------------------------------------------------------------------------------------------------------------
# ******** wrap source file
if [ "$source_sentence" != "" ]; then
echo '<srcset setid="'$scorebasename'" srclang="'$lang1'">' > $scoredir/$scorebasename-src.$lang1.sgm
echo '<DOC docid="'$scorebasename'">' >> $scoredir/$scorebasename-src.$lang1.sgm
echo "<seg id=$counter>"$source_sentence"</seg>" >> $scoredir/$scorebasename-src.$lang1.sgm
echo "</DOC>" >> $scoredir/$scorebasename-src.$lang1.sgm
echo "</srcset>" >> $scoredir/$scorebasename-src.$lang1.sgm
fi
#-----------------------------------------------------------------------------------------------------------------------------------------
# ******** wrap reference (human-made) translation
if [ "$ref_sentence" != "" ]; then
echo '<refset setid="'$scorebasename'" srclang="'$lang1'" trglang="'$lang2'">' > $scoredir/$scorebasename-ref.$lang2.sgm
echo '<DOC docid="'$scorebasename'" sysid="ref">' >> $scoredir/$scorebasename-ref.$lang2.sgm
echo "<seg id=$counter>"$ref_sentence"</seg>" >> $scoredir/$scorebasename-ref.$lang2.sgm
echo "</DOC>" >> $scoredir/$scorebasename-ref.$lang2.sgm
echo "</refset>" >> $scoredir/$scorebasename-ref.$lang2.sgm
fi
#-----------------------------------------------------------------------------------------------------------------------------------------
# ******** wrap Moses translation
if [ "$moses_sentence" != "" ]; then
echo '<tstset setid="'$scorebasename'" srclang="'$lang1'" trglang="'$lang2'">' > $scoredir/$scorebasename.moses.sgm
echo '<DOC docid="'$scorebasename'" sysid="moses">' >> $scoredir/$scorebasename.moses.sgm
echo "<seg id=$counter>"$moses_sentence"</seg>" >> $scoredir/$scorebasename.moses.sgm
echo "</DOC>" >> $scoredir/$scorebasename.moses.sgm
echo "</tstset>" >> $scoredir/$scorebasename.moses.sgm
fi
#-----------------------------------------------------------------------------------------------------------------------------------------
sed -e 's/\x1E/\-/g' $scoredir/$scorebasename-src.$lang1.sgm > $scoredir/temp2
mv $scoredir/temp2 $scoredir/$scorebasename-src.$lang1.sgm
sed -e 's/\x1E/\-/g' $scoredir/$scorebasename-ref.$lang2.sgm > $scoredir/temp2
mv $scoredir/temp2 $scoredir/$scorebasename-ref.$lang2.sgm
sed -e 's/\x1E/\-/g' $scoredir/$scorebasename.moses.sgm > $scoredir/temp2
mv $scoredir/temp2 $scoredir/$scorebasename.moses.sgm
# ******** get segment score
#in our experience, the mteval-v13a and the mteval-v12 (more recent scorers) stopped with errors (and no score) with strings like " & " and U+001E
score=`$toolsdir/mteval-v11b.pl -s $scoredir/$scorebasename-src.$lang1.sgm -r $scoredir/$scorebasename-ref.$lang2.sgm -t $scoredir/$scorebasename.moses.sgm -c`
scoretemp=${score%% for system *}
scoretemp1=${scoretemp#*NIST score = }
NIST=${scoretemp1%% *}
BLEUtemp=${scoretemp1#*BLEU score = }
BLEU=${BLEUtemp%% *}
set -f
BLEUcorr=$(echo "scale=0; $BLEU*10000" | bc)
set +f
if [ "$remove_equal" = "1" ]; then
if [ "$ref_sentence" != "$moses_sentence" ]; then
echo "$BLEU|$NIST|<$counter>|<seg>$source_sentence</seg>|<seg>$ref_sentence</seg>|<seg>$moses_sentence</seg>" >> $scoredir/$scorefile
elif [ "$BLEUcorr" = "0" ]; then
: #do nothing
else
echo "$BLEU|$NIST|<$counter>|<seg>$source_sentence</seg>|<seg>$ref_sentence</seg>|<seg>$moses_sentence</seg>" >> $scoredir/$scorefile
fi
else
echo "$BLEU|$NIST|<$counter>|<seg>$source_sentence</seg>|<seg>$ref_sentence</seg>|<seg>$moses_sentence</seg>" >> $scoredir/$scorefile
fi
done
#-----------------------------------------------------------------------------------------------------------------------------------------
#Sort the output file by score
sort -g $scoredir/$scorefile -o $scoredir/$scorefile
echo "===========================================================================" >> $scoredir/temp
cat $scoredir/$scorefile >> $scoredir/temp
mv $scoredir/temp $scoredir/$scorefile
remove_garbage
else
#=========================================================================================================================================================
#2. SCORE WHOLE DOCUMENT
#=========================================================================================================================================================
if [ -f $scoredir/$scorefile ]; then
rm -f $scoredir/$scorefile
fi
echo "************************** Score whole document"
sed -e 's#\& #\&amp\; #g' -e 's#<#\&lt\;#g' $scoredir/$s > $scoredir/$s.tmp
mv $scoredir/$s.tmp $scoredir/$s
sed -e 's#\& #\&amp\; #g' -e 's#<#\&lt\;#g' $scoredir/$r > $scoredir/$r.tmp
mv $scoredir/$r.tmp $scoredir/$r
sed -e 's#\& #\&amp\; #g' -e 's#<#\&lt\;#g' $scoredir/$m > $scoredir/$m.tmp
mv $scoredir/$m.tmp $scoredir/$m
echo "***************** wrap test result in SGM"
echo "******** wrap source file"
exec<$scoredir/$s
echo '<srcset setid="'$scorebasename'" srclang="'$lang1'">' > $scoredir/$scorebasename-src.$lang1.sgm
echo '<DOC docid="'$scorebasename'">' >> $scoredir/$scorebasename-src.$lang1.sgm
numseg=0
while read -r line
do
numseg=$(($numseg+1))
echo "<seg id=$numseg>"$line"</seg>" >> $scoredir/$scorebasename-src.$lang1.sgm
done
echo "</DOC>" >> $scoredir/$scorebasename-src.$lang1.sgm
echo "</srcset>" >> $scoredir/$scorebasename-src.$lang1.sgm
#-----------------------------------------------------------------------------------------------------------------------------------------
echo "******** wrap reference (human-made) translation"
exec<$scoredir/$r
echo '<refset setid="'$scorebasename'" srclang="'$lang1'" trglang="'$lang2'">' > $scoredir/$scorebasename-ref.$lang2.sgm
echo '<DOC docid="'$scorebasename'" sysid="ref">' >> $scoredir/$scorebasename-ref.$lang2.sgm
numseg=0
while read -r line
do
numseg=$(($numseg+1))
echo "<seg id=$numseg>"$line"</seg>" >> $scoredir/$scorebasename-ref.$lang2.sgm
done
echo "</DOC>" >> $scoredir/$scorebasename-ref.$lang2.sgm
echo "</refset>" >> $scoredir/$scorebasename-ref.$lang2.sgm
#-----------------------------------------------------------------------------------------------------------------------------------------
echo "******** wrap Moses translation"
exec<$scoredir/$m
echo '<tstset setid="'$scorebasename'" srclang="'$lang1'" trglang="'$lang2'">' > $scoredir/$scorebasename.moses.sgm
echo '<DOC docid="'$scorebasename'" sysid="moses">' >> $scoredir/$scorebasename.moses.sgm
numseg=0
while read -r line
do
numseg=$(($numseg+1))
echo "<seg id=$numseg>"$line"</seg>" >> $scoredir/$scorebasename.moses.sgm
done
echo "</DOC>" >> $scoredir/$scorebasename.moses.sgm
echo "</tstset>" >> $scoredir/$scorebasename.moses.sgm
sed -e 's/\x1E/\-/g' $scoredir/$scorebasename-src.$lang1.sgm > $scoredir/temp2
mv $scoredir/temp2 $scoredir/$scorebasename-src.$lang1.sgm
sed -e 's/\x1E/\-/g' $scoredir/$scorebasename-ref.$lang2.sgm > $scoredir/temp2
mv $scoredir/temp2 $scoredir/$scorebasename-ref.$lang2.sgm
sed -e 's/\x1E/\-/g' $scoredir/$scorebasename.moses.sgm > $scoredir/temp2
mv $scoredir/temp2 $scoredir/$scorebasename.moses.sgm
if [ ! -f $scoredir/$scorebasename-src.$lang1.sgm -o ! -f $scoredir/$scorebasename-ref.$lang2.sgm -o ! -f $scoredir/$scorebasename.moses.sgm ]; then
echo "There was a problem creating the files used by the scorer. Exiting..."
IFS=$SAVEIFS
exit 1
else
#-----------------------------------------------------------------------------------------------------------------------------------------
echo "***************** scoring"
startscoringdate=`date +day:%d/%m/%y-time:%H:%M:%S`
#in our experience, the mteval-v13a and the mteval-v12 (more recent scorers) stopped with errors (and no score) with strings like " & " and U+001E
score=`$toolsdir/mteval-v11b.pl -s $scoredir/$scorebasename-src.$lang1.sgm -r $scoredir/$scorebasename-ref.$lang2.sgm -t $scoredir/$scorebasename.moses.sgm -c`
scoretemp=${score%% for system *}
scoretemp1=${scoretemp#*NIST score = }
NIST=${scoretemp1%% *}
BLEUtemp=${scoretemp1#*BLEU score = }
BLEU=${BLEUtemp%% *}
echo $score
scoretemp2=${score#*NIST score =}
echo "NIST score = $scoretemp2" > $scoredir/$scorefile
newscorefile=$scorebasename-BLEU-$BLEU-NIST-$NIST-$batch_user_note-$lang1-$lang2.F-$scoreTMXdocuments-R-$remove_equal-T-$tokenize.L-$lowercase.whole-doc
echo "===================================================================================" >> $scoredir/$scorefile
mv -f $scoredir/$scorefile $scoredir/$newscorefile
#-----------------------------------------------------------------------------------------------------------------------------------------
fi
cat $scoredir/$newscorefile >> $scoredir/temp
mv $scoredir/temp $scoredir/$newscorefile
remove_garbage
fi
else
filename=${filetoscore##*/}
if [ "$filename" != "*" ]; then
let i=$i+1
echo "--------------------------------------------------------------------"
echo -e "$filename file (in the $mosestranslationdir directory):\n\tName of moses translation file is illegal (doesn't end in '.moses' or includes spaces)."
error_msg="Name of moses translation file is illegal (doesn't end in '.moses' or includes spaces)."
log_wrong_file
continue
fi
fi
fi
done
IFS=$SAVEIFS
echo "--------------------------------------------------------------------"
echo -e "Score finished.\n$i files treated.\nResults directory:\n\t$scoredir"
#=================================================================================================================================================
# Changes in version 0.85
#=================================================================================================================================================
# Allows batch processing of the whole $mosesdir/translation_output directory
# Extracts automatically the source language and target language, the names of the source file, moses translation file and reference translation file and the batch_user_note
# Checks for more file naming errors and informs about them
# More informative report, even in case of error
# Creation of a new file that lists the translations that could not be scored and the reason why
# Corrects a bug that made it fail when the scorer files included the word "for" in their name
# Maintains SGM scorer because newer scorers have caused us more problems with characters that crashed them (ex: " & " and U+001E)
#=================================================================================================================================================