author | Marcin Junczys-Dowmunt <marcinjd@microsoft.com> | 2018-11-26 05:32:57 +0300 |
---|---|---|
committer | GitHub <noreply@github.com> | 2018-11-26 05:32:57 +0300 |
commit | 44696a0d30be11cca1fd9e2e3207b6b052ecc9cf (patch) | |
tree | ba62901ba242a8b0a3223073d69d38e3b396cd43 | |
parent | 350a5af93d8696a4818ca2d61cad8eb6c55b6f1b (diff) |
Update README.md
-rw-r--r-- | training-basics-spm/README.md | 8 |
1 files changed, 6 insertions, 2 deletions
diff --git a/training-basics-spm/README.md b/training-basics-spm/README.md
index 2563775..b96bf82 100644
--- a/training-basics-spm/README.md
+++ b/training-basics-spm/README.md
@@ -95,6 +95,8 @@ We assume you are running these commands from the examples directory of the main
 
 ### Preparing the test and validation sets
 
+We can use SacreBLEU to produce the original WMT16 development and test sets for Romanian-English. We first clone the SacreBLEU repository from our fork and then generate the test files.
+
 ```
 # get our fork of sacrebleu
 git clone https://github.com/marian-nmt/sacreBLEU.git sacreBLEU
@@ -110,6 +112,8 @@ sacreBLEU/sacrebleu.py -t wmt16 -l ro-en --echo ref > data/newstest2016.en
 
 ### Downloading the training files
 
+Similarly, we download the training files from different sources and concatenate them into two training files. Note, there is no preprocessing whatsoever. Downloading may take a while, the servers are not particularly fast.
+
 ```
 # change into data directory
 cd data
@@ -139,9 +143,9 @@ cd ..
 ### Normalization of Romanian diacritics with SentencePiece
 
 It seems that the training data is quite noisy and multiple similar characters are used in place of the one correct character.
-Barry Haddow from Edinburgh who created the original Python scripts noticed that removing diacritics on the Romanian side leads to a significant improvment in translation quality. And indeed we saw gains of up to 2 BLEU points due to normalization versus unnormalized text.
+Barry Haddow from Edinburgh who created the original normalization Python scripts noticed that removing diacritics on the Romanian side leads to a significant improvment in translation quality. And indeed we saw gains of up to 2 BLEU points due to normalization versus unnormalized text. The original scripts are located in the old Romanian-English example folder in `marian/examples/training-basics/scripts`. We do not need to use them here.
 
-SentencePiece allows to specify normalization or replacement tables for character sequences. These replacements are applied before tokenization/segmentation and included in the SentencePiece model. Based on these preprocessing scripts from `test`, we manually create a tab-separated normalization rule file `data/norm_romanian.tsv` looking like this (see the [SentencePiece documentation on normalization](https://) for details):
+SentencePiece allows to specify normalization or replacement tables for character sequences. These replacements are applied before tokenization/segmentation and included in the SentencePiece model. Based on the mentioned preprocessing scripts, we manually create a tab-separated normalization rule file `data/norm_romanian.tsv` like this (see the [SentencePiece documentation on normalization](https://github.com/google/sentencepiece/blob/master/doc/normalization.md) for details):
 
 ```
 015E 53 # Ş => S