github.com/moses-smt/mosesdecoder.git
author Ozan Caglayan <ozancag@gmail.com> 2018-11-07 12:59:54 +0300
committer Ozan Caglayan <ozancag@gmail.com> 2018-11-07 12:59:54 +0300
commit 9fc964da7fbe91b1fb3da69ed192cf9be217d256
tree 02da4942d736d2719d05ba6f2819b276e455571a
parent d2b558728f0872a41badbe4c8e8e61481e2117f9
tokenizer.perl: split final dots unconditionally
Allow tokenization of non-breaking prefixes at end of sentences.
This should be a fair compromise in many cases to construct a cleaner vocabulary.

EN-old: So am I.
EN-new: So am I .

DE-old: ... schwer wie ein iPhone 5.
DE-new: ... schwer wie ein iPhone 5 .

FR-old: Des gens admirent une œuvre d&apos; art.
FR-new: Des gens admirent une œuvre d&apos; art .

CS-old: Dvě děti, které běží bez bot.
CS-new: Dvě děti, které běží bez bot .
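
The behavior change described above can be sketched outside the tokenizer. The following is a minimal Python re-expression of the period-handling branch, not the Perl code itself. The prefix table is a hypothetical stand-in for the per-language non-breaking prefix files, the function name and the `split_last` switch are invented for illustration, and the final `else` branch (split the period off) is assumed, since the hunk ends before that branch.

```python
import re

# Hypothetical stand-in for the per-language non-breaking prefix table:
# 1 = keep the trailing period attached, 2 = keep it only before a number.
NONBREAKING_PREFIX = {"Mr": 1, "Dr": 1, "I": 1, "No": 2}


def split_final_periods(words, split_last=True):
    """Sketch of the period-handling branch for a list of whitespace tokens.

    split_last=True mimics the patched behavior (the last word always has
    its period split off); split_last=False mimics the previous behavior.
    """
    out = []
    last = len(words) - 1
    for i, word in enumerate(words):
        m = re.fullmatch(r"(\S+)\.", word)
        if m:
            pre = m.group(1)
            nxt = words[i + 1] if i < last else ""
            if split_last and i == last:
                # New in this commit: sentence-final period is always split.
                word = pre + " ."
            elif (("." in pre and any(c.isalpha() for c in pre))
                  or NONBREAKING_PREFIX.get(pre) == 1
                  or (i < last and nxt[:1].islower())):
                pass  # no change: acronym, known prefix, or lowercase follows
            elif (NONBREAKING_PREFIX.get(pre) == 2
                  and i < last and re.match(r"[0-9]+", nxt)):
                pass  # no change: numeric-only prefix followed by a number
            else:
                # Assumed fallthrough (not visible in the hunk): split the period.
                word = pre + " ."
        out.append(word)
    return " ".join(out)


print(split_final_periods("So am I.".split(), split_last=False))
print(split_final_periods("So am I.".split(), split_last=True))
```

With the stand-in table, the first call leaves `I.` intact because `I` is listed as a prefix, while the second splits it, which matches the EN-old and EN-new lines in the commit message. Prefixes that are not sentence-final, such as `Mr.` in `Mr. Smith left.`, are handled the same way in both modes.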
-rwxr-xr-x scripts/tokenizer/tokenizer.perl 8
1 files changed, 6 insertions, 2 deletions
diff --git a/scripts/tokenizer/tokenizer.perl b/scripts/tokenizer/tokenizer.perl
index f9b5cd60b..b84b9eb31 100755
--- a/scripts/tokenizer/tokenizer.perl
+++ b/scripts/tokenizer/tokenizer.perl
@@ -346,10 +346,14 @@ sub tokenize
if ( $word =~ /^(\S+)\.$/)
{
my $pre = $1;
- if (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) || ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) || ($i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/)))
+ if ($i == scalar(@words)-1) {
+ # split last words independently as they are unlikely to be non-breaking prefixes
+ $word = $pre." .";
+ }
+ elsif (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) || ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) || ($i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/)))
{
#no change
- }
+ }
elsif (($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==2) && ($i<scalar(@words)-1 && ($words[$i+1] =~ /^[0-9]+/)))
{
#no change