Update README.md

author: Marcin Junczys-Dowmunt <marcinjd@microsoft.com> 2018-11-26 05:32:57 +0300
committer: GitHub <noreply@github.com> 2018-11-26 05:32:57 +0300
commit: 44696a0d30be11cca1fd9e2e3207b6b052ecc9cf (patch)
tree: ba62901ba242a8b0a3223073d69d38e3b396cd43
parent: 350a5af93d8696a4818ca2d61cad8eb6c55b6f1b (diff)
1 files changed, 6 insertions, 2 deletions
diff --git a/training-basics-spm/README.md b/training-basics-spm/README.md
index 2563775..b96bf82 100644
--- a/training-basics-spm/README.md
+++ b/training-basics-spm/README.md
@@ -95,6 +95,8 @@ We assume you are running these commands from the examples directory of the main
 
 ### Preparing the test and validation sets
 
+We can use SacreBLEU to produce the original WMT16 development and test sets for Romanian-English. We first clone the SacreBLEU repository from our fork and then generate the test files. 
+
 ```
 # get our fork of sacrebleu
 git clone https://github.com/marian-nmt/sacreBLEU.git sacreBLEU
@@ -110,6 +112,8 @@ sacreBLEU/sacrebleu.py -t wmt16 -l ro-en --echo ref > data/newstest2016.en
 
 ### Downloading the training files
 
+Similarly, we download the training files from different sources and concatenate them into two training files. Note that no preprocessing is applied at all. Downloading may take a while because the servers are not particularly fast.
+
 ```
 # change into data directory
 cd data
@@ -139,9 +143,9 @@ cd ..
 ### Normalization of Romanian diacritics with SentencePiece
 
 It seems that the training data is quite noisy and multiple similar characters are used in place of the one correct character.
-Barry Haddow from Edinburgh who created the original Python scripts noticed that removing diacritics on the Romanian side leads to a significant improvment in translation quality. And indeed we saw gains of up to 2 BLEU points due to normalization versus unnormalized text. 
+Barry Haddow from Edinburgh, who created the original normalization Python scripts, noticed that removing diacritics on the Romanian side leads to a significant improvement in translation quality. And indeed we saw gains of up to 2 BLEU points due to normalization versus unnormalized text. The original scripts are located in the old Romanian-English example folder in `marian/examples/training-basics/scripts`. We do not need to use them here.
 
-SentencePiece allows to specify normalization or replacement tables for character sequences. These replacements are applied before tokenization/segmentation and included in the SentencePiece model. Based on these preprocessing scripts from `test`, we manually create a tab-separated normalization rule file `data/norm_romanian.tsv` looking like this (see the [SentencePiece documentation on normalization](https://) for details):
+SentencePiece allows specifying normalization or replacement tables for character sequences. These replacements are applied before tokenization/segmentation and are included in the SentencePiece model. Based on the preprocessing scripts mentioned above, we manually create a tab-separated normalization rule file `data/norm_romanian.tsv` like this (see the [SentencePiece documentation on normalization](https://github.com/google/sentencepiece/blob/master/doc/normalization.md) for details):
 
 ```
 015E    53 # Ş => S