github.com/marian-nmt/marian-examples.git
author Marcin Junczys-Dowmunt <marcinjd@microsoft.com> 2018-11-26 05:32:57 +0300
committer GitHub <noreply@github.com> 2018-11-26 05:32:57 +0300
commit 44696a0d30be11cca1fd9e2e3207b6b052ecc9cf
tree ba62901ba242a8b0a3223073d69d38e3b396cd43
parent 350a5af93d8696a4818ca2d61cad8eb6c55b6f1b
Update README.md
-rw-r--r-- training-basics-spm/README.md | 8
1 files changed, 6 insertions, 2 deletions
diff --git a/training-basics-spm/README.md b/training-basics-spm/README.md
index 2563775..b96bf82 100644
--- a/training-basics-spm/README.md
+++ b/training-basics-spm/README.md
@@ -95,6 +95,8 @@ We assume you are running these commands from the examples directory of the main
### Preparing the test and validation sets
+We can use SacreBLEU to produce the original WMT16 development and test sets for Romanian-English. We first clone the SacreBLEU repository from our fork and then generate the test files.
+
```
# get our fork of sacrebleu
git clone https://github.com/marian-nmt/sacreBLEU.git sacreBLEU
@@ -110,6 +112,8 @@ sacreBLEU/sacrebleu.py -t wmt16 -l ro-en --echo ref > data/newstest2016.en
### Downloading the training files
+Similarly, we download the training files from different sources and concatenate them into two training files. Note, there is no preprocessing whatsoever. Downloading may take a while, the servers are not particularly fast.
+
```
# change into data directory
cd data
@@ -139,9 +143,9 @@ cd ..
### Normalization of Romanian diacritics with SentencePiece
It seems that the training data is quite noisy and multiple similar characters are used in place of the one correct character.
-Barry Haddow from Edinburgh who created the original Python scripts noticed that removing diacritics on the Romanian side leads to a significant improvment in translation quality. And indeed we saw gains of up to 2 BLEU points due to normalization versus unnormalized text.
+Barry Haddow from Edinburgh who created the original normalization Python scripts noticed that removing diacritics on the Romanian side leads to a significant improvment in translation quality. And indeed we saw gains of up to 2 BLEU points due to normalization versus unnormalized text. The original scripts are located in the old Romanian-English example folder in `marian/examples/training-basics/scripts`. We do not need to use them here.
-SentencePiece allows to specify normalization or replacement tables for character sequences. These replacements are applied before tokenization/segmentation and included in the SentencePiece model. Based on these preprocessing scripts from `test`, we manually create a tab-separated normalization rule file `data/norm_romanian.tsv` looking like this (see the [SentencePiece documentation on normalization](https://) for details):
+SentencePiece allows to specify normalization or replacement tables for character sequences. These replacements are applied before tokenization/segmentation and included in the SentencePiece model. Based on the mentioned preprocessing scripts, we manually create a tab-separated normalization rule file `data/norm_romanian.tsv` like this (see the [SentencePiece documentation on normalization](https://github.com/google/sentencepiece/blob/master/doc/normalization.md) for details):
```
015E 53 # Ş => S