author | Ozan Caglayan <ozancag@gmail.com> | 2018-11-07 12:59:54 +0300 |
---|---|---|
committer | Ozan Caglayan <ozancag@gmail.com> | 2018-11-07 12:59:54 +0300 |
commit | 9fc964da7fbe91b1fb3da69ed192cf9be217d256 | |
tree | 02da4942d736d2719d05ba6f2819b276e455571a | |
parent | d2b558728f0872a41badbe4c8e8e61481e2117f9 |
tokenizer.perl: split final dots unconditionally
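Always split the trailing period from the last word of the input, even when
that word is listed as a non-breaking prefix. A sentence-final word is unlikely
to be an abbreviation, so this should be a fair compromise that yields a
cleaner vocabulary in many cases.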
EN-old: So am I.
EN-new: So am I .
DE-old: ... schwer wie ein iPhone 5.
DE-new: ... schwer wie ein iPhone 5 .
FR-old: Des gens admirent une œuvre d' art.
FR-new: Des gens admirent une œuvre d' art .
CS-old: Dvě děti, které běží bez bot.
CS-new: Dvě děti, které běží bez bot .
-rwxr-xr-x | scripts/tokenizer/tokenizer.perl | 8 |
1 files changed, 6 insertions, 2 deletions
diff --git a/scripts/tokenizer/tokenizer.perl b/scripts/tokenizer/tokenizer.perl
index f9b5cd60b..b84b9eb31 100755
--- a/scripts/tokenizer/tokenizer.perl
+++ b/scripts/tokenizer/tokenizer.perl
@@ -346,10 +346,14 @@ sub tokenize
         if ( $word =~ /^(\S+)\.$/)
         {
             my $pre = $1;
-            if (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) || ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) || ($i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/)))
+            if ($i == scalar(@words)-1) {
+                # split last words independently as they are unlikely to be non-breaking prefixes
+                $word = $pre." .";
+            }
+            elsif (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) || ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) || ($i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/)))
             {
                 #no change
-            }
+            }
             elsif (($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==2) && ($i<scalar(@words)-1 && ($words[$i+1] =~ /^[0-9]+/)))
             {
                 #no change
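The effect of the new first branch is easiest to see by running the decision chain on a few inputs. The following is a minimal Python sketch of that chain, not the Perl script itself. It rests on several assumptions: the function name `split_periods`, the `split_final` flag (added only to compare behaviour before and after the commit), and the three entries in `NONBREAKING_PREFIX` are all hypothetical, and the final `else` branch, which is not visible in the hunk, is assumed to split the period.

```python
import re

# Hypothetical stand-in for the script's %NONBREAKING_PREFIX table.
# 1 = keep the period attached; 2 = keep it only when a number follows.
NONBREAKING_PREFIX = {"I": 1, "Mr": 1, "No": 2}


def split_periods(words, split_final=True):
    """Join words, separating trailing periods according to the hunk's rules.

    split_final=True models the code after the commit, False the code before.
    """
    out = []
    last = len(words) - 1
    for i, word in enumerate(words):
        m = re.fullmatch(r"(\S+)\.", word)
        if m:
            pre = m.group(1)
            nxt = words[i + 1] if i < last else ""
            has_alpha = any(ch.isalpha() for ch in pre)
            if split_final and i == last:
                # New branch: the last word is always split.
                word = pre + " ."
            elif (("." in pre and has_alpha)
                  or NONBREAKING_PREFIX.get(pre) == 1
                  or nxt[:1].islower()):
                pass  # no change
            elif NONBREAKING_PREFIX.get(pre) == 2 and re.match(r"[0-9]", nxt):
                pass  # no change
            else:
                # Assumed fallback (outside the hunk): split the period.
                word = pre + " ."
        out.append(word)
    return " ".join(out)


print(split_periods("So am I.".split(), split_final=False))  # So am I.
print(split_periods("So am I.".split()))                     # So am I .
print(split_periods("Mr. Smith left.".split()))              # Mr. Smith left .
```

Because the new check comes first, it also overrides the rule for words that contain an inner period: in this sketch a final `e.g.` becomes `e.g .`, whereas the earlier ordering left it untouched.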