From e23416707a81fc40d82d96550e24d807e0ab8824 Mon Sep 17 00:00:00 2001
From: Marcin Junczys-Dowmunt
Date: Sun, 25 Nov 2018 23:14:46 -0800
Subject: update readme

---
 training-basics-sentencepiece/README.md | 62 ++++++++++++++++++++++++---------
 1 file changed, 46 insertions(+), 16 deletions(-)

diff --git a/training-basics-sentencepiece/README.md b/training-basics-sentencepiece/README.md
index 1e145de..643eb89 100644
--- a/training-basics-sentencepiece/README.md
+++ b/training-basics-sentencepiece/README.md
@@ -1,8 +1,10 @@
# Marian with Built-in SentencePiece

-In this example, we modify the Romanian-English example from `examples/training-basics` to use Taku Kudo's
-[SentencePiece](https://github.com/google/sentencepiece) instead of a complicated pre/prost-processing pipeline.
-We also replace the evaluation scripts with Matt Post's [SacreBLEU](https://github.com/mjpost/sacreBLEU). Both tools greatly simplify the training and evaluation process by providing ways to have reversible hidden preprocessing and repeatable evaluation.
+In this example, we modify the Romanian-English example from `examples/training-basics` to use Taku Kudo's
+[SentencePiece](https://github.com/google/sentencepiece) instead of a complicated pre/post-processing pipeline.
+We also replace the evaluation scripts with Matt Post's [SacreBLEU](https://github.com/mjpost/sacreBLEU).
+Both tools greatly simplify the training and evaluation process by providing ways to have reversible hidden
+preprocessing and repeatable evaluation.

## Building Marian with SentencePiece Support

@@ -76,7 +78,7 @@ Assuming you have one GPU, to execute the complete example type:
```
./run-me.sh
```

-which downloads the Romanian-English training files and concatenates them into training files.
+which downloads the Romanian-English training files and concatenates them into two training files.
No preprocessing is required as the Marian command will train a SentencePiece vocabulary from the raw text.
Next the translation model will be trained and after convergence, the dev and test sets are translated and evaluated with sacreBLEU.

@@ -89,13 +91,18 @@ To use a different GPU than device 0 or more GPUs (here 0 1 2 3) use the c

## Step-by-step Walkthrough

-In this section we repeat the content from the above `run-me.sh` script with explanations. You should be able to copy and paste the commands and follow through all the steps.
+In this section we repeat the content from the above `run-me.sh` script with explanations. You should be
+able to copy and paste the commands and follow through all the steps.

-We assume you are running these commands from the examples directory of the main Marian directory tree `marian/examples/training-basics-sentencepiece` and that the Marian binaries have been compiled in `marian/build`. The localization of the Marian binary relative to the current directory is therefore `../../build/marian`.
+We assume you are running these commands from the examples directory of the main Marian directory tree
+`marian/examples/training-basics-sentencepiece` and that the Marian binaries have been compiled in
+`marian/build`. The location of the Marian binary relative to the current directory is
+therefore `../../build/marian`.

### Preparing the test and validation sets

-We can use SacreBLEU to produce the original WMT16 development and test sets for Romanian-English. We first clone the SacreBLEU repository from our fork and then generate the test files.
+We can use SacreBLEU to produce the original WMT16 development and test sets for Romanian-English. We first
+clone our fork of the SacreBLEU repository and then generate the test files.

```
# get our fork of sacrebleu
@@ -112,7 +119,8 @@ sacreBLEU/sacrebleu.py -t wmt16 -l ro-en --echo ref > data/newstest2016.en
```

### Downloading the training files

-Similarly, we download the training files from different sources and concatenate them into two training files. Note, there is no preprocessing whatsoever. Downloading may take a while, the servers are not particularly fast.
+Similarly, we download the training files from different sources and concatenate them into two training files.
+Note that there is no preprocessing whatsoever. Downloading may take a while; the servers are not particularly fast.

```
# change into data directory
@@ -142,10 +150,19 @@ cd ..

### Normalization of Romanian diacritics with SentencePiece

-It seems that the training data is quite noisy and multiple similar characters are used in place of the one correct character.
-Barry Haddow from Edinburgh who created the original normalization Python scripts noticed that removing diacritics on the Romanian side leads to a significant improvment in translation quality. And indeed we saw gains of up to 2 BLEU points due to normalization versus unnormalized text. The original scripts are located in the old Romanian-English example folder in `marian/examples/training-basics/scripts`. We do not need to use them here.
+It seems that the training data is quite noisy and multiple similar characters are used in place of
+the one correct character. Barry Haddow from Edinburgh, who created the original normalization Python
+scripts, noticed that removing diacritics on the Romanian side leads to a significant improvement in
+translation quality. Indeed, we saw gains of up to 2 BLEU points for normalized versus
+unnormalized text. The original scripts are located in the old Romanian-English example folder
+in `marian/examples/training-basics/scripts`. We do not need to use them here.

-SentencePiece allows to specify normalization or replacement tables for character sequences. These replacements are applied before tokenization/segmentation and included in the SentencePiece model. Based on the mentioned preprocessing scripts, we manually create a tab-separated normalization rule file `data/norm_romanian.tsv` like this (see the [SentencePiece documentation on normalization](https://github.com/google/sentencepiece/blob/master/doc/normalization.md) for details):
+SentencePiece allows specifying normalization or replacement tables for character sequences. These
+replacements are applied before tokenization/segmentation and included in the SentencePiece model.
+Based on the mentioned preprocessing scripts, we manually create a tab-separated normalization
+rule file `data/norm_romanian.tsv` like this (see the
+[SentencePiece documentation on normalization](https://github.com/google/sentencepiece/blob/master/doc/normalization.md)
+for details):

```
015E 53 # Ş => S
```

@@ -166,9 +183,9 @@

### Training the NMT model

-Next, we execute a training run with `marian`. Note how the training command is called passing the
-raw training and validation data into Marian. A single joint SentencePiece model will be saved to
-`model/vocab.roen.spm`. The `*.spm` suffix is required and tells Marian to train a SentencePiece
+Next, we execute a training run with `marian`. Note how the training command is called, passing the
+raw training and validation data into Marian. A single joint SentencePiece model will be saved to
+`model/vocab.roen.spm`. The `*.spm` suffix is required and tells Marian to train a SentencePiece
vocabulary. When the same vocabulary file is specified multiple times - like in this example - a single vocabulary is built for the union of the corresponding training files. This also enables us to use tied embeddings (`--tied-embeddings-all`).

@@ -178,7 +195,7 @@ argument. The values of this option are passed on to the SentencePiece trainer,
quotes around the SentencePiece options: `--sentencepiece-options '--normalization_rule_tsv=data/norm_romanian.tsv'`.
Another new feature is the `bleu-detok` validation metric. When used with SentencePiece this should
-give you in-training BLEU scores that are very close to sacreBLEU's scores. Differences may appear
+give you in-training BLEU scores that are very close to sacreBLEU's scores. Differences may appear
if unexpected SentencePiece normalization rules are used. You should still report only official
sacreBLEU scores for publications.

@@ -235,10 +252,23 @@ after which BLEU scores for the dev and test set are reported.
sacreBLEU/sacrebleu.py -t wmt16/dev -l ro-en < data/newsdev2016.ro.output
sacreBLEU/sacrebleu.py -t wmt16 -l ro-en < data/newstest2016.ro.output
```

-You should see results somewhere in the area of 36.5 BLEU for the dev set and 35.1 BLEU for the test set. This is actually a bit better than for the BPE version from `marian/examples/training-basics` with the complex preprocessing.
+You should see results somewhere in the area of 36.5 BLEU for the dev set and 35.1 BLEU for the test set.
+This is actually a bit better than for the BPE version from `marian/examples/training-basics` with the
+complex preprocessing.

```
BLEU+case.mixed+lang.ro-en+numrefs.1+smooth.exp+test.wmt16/dev+tok.13a+version.1.2.12 = 36.5 67.9/42.7/29.4/20.9 (BP = 1.000 ratio = 1.006 hyp_len = 49816 ref_len = 49526)
BLEU+case.mixed+lang.ro-en+numrefs.1+smooth.exp+test.wmt16+tok.13a+version.1.2.12 = 35.1 66.6/41.3/28.0/19.6 (BP = 1.000 ratio = 1.005 hyp_len = 47804 ref_len = 47562)
```

+## Is normalization actually required?
+
+We also quickly tested whether the normalization of Romanian characters is actually necessary and
+whether there are other methods of dealing with the noise. Here's the table:
+
+|            | dev  | test |
+|------------|------|------|
+| raw        |      |      |
+| normalized | 36.5 | 35.1 |
+| sampling   |      |      |
+
That's all folks.
--
cgit v1.2.3
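[Editorial illustration of the normalization-rule format discussed in the patch above. Per the linked SentencePiece normalization documentation, each rule line maps a sequence of hex Unicode codepoints to a replacement sequence, with the two columns separated by a tab and `#` starting a comment. The Python sketch below mimics that mapping as plain string replacement; it is not Marian or SentencePiece code, and the helper names `parse_rules` and `normalize` are made up for illustration.]

```python
# Hypothetical sketch (not Marian/SentencePiece code): parse SentencePiece-style
# normalization rules such as "015E<TAB>53  # Ş => S" and apply them as plain
# string replacements, to show what a norm_romanian.tsv line means.

def parse_rules(tsv_text):
    """Map each source character sequence to its normalized replacement."""
    rules = {}
    for line in tsv_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        src_hex, dst_hex = line.split("\t")  # columns are tab-separated
        # each column is a space-separated list of hex codepoints
        src = "".join(chr(int(cp, 16)) for cp in src_hex.split())
        dst = "".join(chr(int(cp, 16)) for cp in dst_hex.split())
        rules[src] = dst
    return rules

def normalize(text, rules):
    """Apply every replacement rule to the text."""
    for src, dst in rules.items():
        text = text.replace(src, dst)
    return text

rules = parse_rules("015E\t53  # Ş => S")
sample = "\u015Etefan"  # "Ştefan" with S-cedilla (U+015E)
print(normalize(sample, rules))  # -> Stefan
```

The point of the built-in mechanism is that SentencePiece compiles such rules into the model itself, so the same normalization is applied at training and inference time without any external preprocessing step.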