github.com/moses-smt/mosesdecoder.git
author    Linas Vepstas <linasvepstas@gmail.com>  2017-01-08 19:08:53 +0300
committer Linas Vepstas <linasvepstas@gmail.com>  2017-01-08 19:08:53 +0300
commit    2e48f83ab4cbf93b4f39eb8a8f91d1662cc9f5e0
tree      630a5330e9a52edb2f524dcbab739a0f7e437b3a /scripts/ems/support
parent    6fb2c9702963422a4d4f3dec0eeb390fd77eeab3
Handle punctuation+CJK combinations.
Diffstat (limited to 'scripts/ems/support')
-rwxr-xr-x scripts/ems/support/split-sentences.perl 10
1 files changed, 8 insertions, 2 deletions
diff --git a/scripts/ems/support/split-sentences.perl b/scripts/ems/support/split-sentences.perl
index 160c5d548..c8ff87dde 100755
--- a/scripts/ems/support/split-sentences.perl
+++ b/scripts/ems/support/split-sentences.perl
@@ -128,13 +128,19 @@ sub preprocess {
# A normal full-stop or other Western sentence enders followed
# by an ideograph is an end-of-sentence, always.
- $text =~ s/([\.?!]) *(\p{InCJK})/$1\n$2/g;
+ $text =~ s/([\.?!]) *(\p{CJK})/$1\n$2/g;
+
+ # Split close-paren-then-comma into two.
+ $text =~ s/(\p{Punctuation}) *(\p{Punctuation})/ $1 $2 /g;
# Chinese does not use any sort of white-space between ideographs.
# Nominally, each single ideograph corresponds to one word. Add
# spaces here, so that later processing stages can tokenize readily.
# Note that this handles mixed latinate+CJK.
- $text =~ s/(\p{InCJK})/ $1 /g;
+ # TODO: perhaps also CJKExtA CJKExtB etc ??? CJK_Radicals_Sup ?
+ $text =~ s/(\p{Punctuation}) *(\p{CJK})/ $1 $2/g;
+ $text =~ s/(\p{CJK}) *(\p{Punctuation})/$1 $2 /g;
+ $text =~ s/([\p{CJK}\p{CJKSymbols}])/ $1 /g;
$text =~ s/ +/ /g;
# Special punctuation cases are covered. Check all remaining periods.
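
To illustrate what the new substitutions do to mixed punctuation+CJK input, here is a minimal standalone sketch. It applies the same regexes in the same order as the patched hunk. The helper name `space_cjk` and the sample strings are assumptions made for illustration; they are not part of split-sentences.perl, and the real `preprocess` routine does more work before and after this step.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                               # the sample literals below are UTF-8
binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper (not in split-sentences.perl): runs the substitutions
# from the patched hunk, in order, on a single string.
sub space_cjk {
    my ($text) = @_;

    # A Western sentence ender followed by an ideograph ends the sentence.
    $text =~ s/([\.?!]) *(\p{CJK})/$1\n$2/g;

    # Two adjacent punctuation marks are split apart.
    $text =~ s/(\p{Punctuation}) *(\p{Punctuation})/ $1 $2 /g;

    # Punctuation touching an ideograph is separated, in either order.
    $text =~ s/(\p{Punctuation}) *(\p{CJK})/ $1 $2/g;
    $text =~ s/(\p{CJK}) *(\p{Punctuation})/$1 $2 /g;

    # Every ideograph or CJK symbol gets spaces around it; runs of
    # spaces are then squeezed to one.
    $text =~ s/([\p{CJK}\p{CJKSymbols}])/ $1 /g;
    $text =~ s/ +/ /g;

    return $text;
}

for my $sample ("你好,世界", "Hi.你好") {
    print "[", space_cjk($sample), "]\n";
}
# prints:
# [ 你 好 , 世 界 ]
# [Hi.
#  you 好 ]   <- with the ideograph 你 in place of "you"; see note below
```

The last comment line should read ` 你 好 ]`: the second sample is broken after the full stop, and each ideograph on the new line is surrounded by single spaces. Note that `\p{CJK}` here names the CJK Unified Ideographs block only, so characters in the extension blocks mentioned in the TODO are left untouched by this sketch.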