From e23416707a81fc40d82d96550e24d807e0ab8824 Mon Sep 17 00:00:00 2001
From: Marcin Junczys-Dowmunt
Date: Sun, 25 Nov 2018 23:14:46 -0800
Subject: update readme

---
 training-basics-sentencepiece/README.md | 62 ++++++++++++++++++++++++---------
 1 file changed, 46 insertions(+), 16 deletions(-)

diff --git a/training-basics-sentencepiece/README.md b/training-basics-sentencepiece/README.md
index 1e145de..643eb89 100644
--- a/training-basics-sentencepiece/README.md
+++ b/training-basics-sentencepiece/README.md
@@ -1,8 +1,10 @@
# Marian with Built-in SentencePiece

-In this example, we modify the Romanian-English example from `examples/training-basics` to use Taku Kudo's
-[SentencePiece](https://github.com/google/sentencepiece) instead of a complicated pre/prost-processing pipeline.
-We also replace the evaluation scripts with Matt Post's [SacreBLEU](https://github.com/mjpost/sacreBLEU). Both tools greatly simplify the training and evaluation process by providing ways to have reversible hidden preprocessing and repeatable evaluation.
+In this example, we modify the Romanian-English example from `examples/training-basics` to use Taku Kudo's
+[SentencePiece](https://github.com/google/sentencepiece) instead of a complicated pre/post-processing pipeline.
+We also replace the evaluation scripts with Matt Post's [SacreBLEU](https://github.com/mjpost/sacreBLEU).
+Both tools greatly simplify the training and evaluation process by providing ways to have reversible hidden
+preprocessing and repeatable evaluation.

## Building Marian with SentencePiece Support

@@ -76,7 +78,7 @@ Assuming you have one GPU, to execute the complete example type:
```
./run-me.sh
```

-which downloads the Romanian-English training files and concatenates them into training files.
+which downloads the Romanian-English training files and concatenates them into two training files.
No preprocessing is required as the Marian command will train a SentencePiece vocabulary from the raw text.
Next the translation model will be trained and after convergence, the dev and test sets are translated and evaluated with sacreBLEU.

@@ -89,13 +91,18 @@ To use a different GPU than device 0 or more GPUs (here 0 1 2 3) use the c

## Step-by-step Walkthrough

-In this section we repeat the content from the above `run-me.sh` script with explanations. You should be able to copy and paste the commands and follow through all the steps.
+In this section we repeat the content from the above `run-me.sh` script with explanations. You should be
+able to copy and paste the commands and follow through all the steps.

-We assume you are running these commands from the examples directory of the main Marian directory tree `marian/examples/training-basics-sentencepiece` and that the Marian binaries have been compiled in `marian/build`. The localization of the Marian binary relative to the current directory is therefore `../../build/marian`.
+We assume you are running these commands from the examples directory of the main Marian directory tree
+`marian/examples/training-basics-sentencepiece` and that the Marian binaries have been compiled in
+`marian/build`. The location of the Marian binary relative to the current directory is
+therefore `../../build/marian`.

### Preparing the test and validation sets

-We can use SacreBLEU to produce the original WMT16 development and test sets for Romanian-English. We first clone the SacreBLEU repository from our fork and then generate the test files.
+We can use SacreBLEU to produce the original WMT16 development and test sets for Romanian-English. We first
+clone our fork of the SacreBLEU repository and then generate the test files.

```
# get our fork of sacrebleu
@@ -112,7 +119,8 @@ sacreBLEU/sacrebleu.py -t wmt16 -l ro-en --echo ref > data/newstest2016.en
```

### Downloading the training files

-Similarly, we download the training files from different sources and concatenate them into two training files. Note, there is no preprocessing whatsoever. Downloading may take a while, the servers are not particularly fast.
+Similarly, we download the training files from different sources and concatenate them into two training files.
+Note that there is no preprocessing whatsoever. Downloading may take a while; the servers are not particularly fast.

```
# change into data directory
@@ -142,10 +150,19 @@ cd ..

### Normalization of Romanian diacritics with SentencePiece

-It seems that the training data is quite noisy and multiple similar characters are used in place of the one correct character.
-Barry Haddow from Edinburgh who created the original normalization Python scripts noticed that removing diacritics on the Romanian side leads to a significant improvment in translation quality. And indeed we saw gains of up to 2 BLEU points due to normalization versus unnormalized text. The original scripts are located in the old Romanian-English example folder in `marian/examples/training-basics/scripts`. We do not need to use them here.
+It seems that the training data is quite noisy and multiple similar characters are used in place of
+the one correct character. Barry Haddow from Edinburgh, who created the original normalization Python
+scripts, noticed that removing diacritics on the Romanian side leads to a significant improvement in
+translation quality. Indeed, we saw gains of up to 2 BLEU points for normalized versus
+unnormalized text. The original scripts are located in the old Romanian-English example folder
+in `marian/examples/training-basics/scripts`. We do not need to use them here.

-SentencePiece allows to specify normalization or replacement tables for character sequences. These replacements are applied before tokenization/segmentation and included in the SentencePiece model. Based on the mentioned preprocessing scripts, we manually create a tab-separated normalization rule file `data/norm_romanian.tsv` like this (see the [SentencePiece documentation on normalization](https://github.com/google/sentencepiece/blob/master/doc/normalization.md) for details):
+SentencePiece allows specifying normalization or replacement tables for character sequences. These
+replacements are applied before tokenization/segmentation and included in the SentencePiece model.
+Based on the mentioned preprocessing scripts, we manually create a tab-separated normalization
+rule file `data/norm_romanian.tsv` like this (see the
+[SentencePiece documentation on normalization](https://github.com/google/sentencepiece/blob/master/doc/normalization.md)
+for details):

```
015E 53 # Ş => S
```

@@ -166,9 +183,9 @@

### Training the NMT model

-Next, we execute a training run with `marian`. Note how the training command is called passing the
-raw training and validation data into Marian. A single joint SentencePiece model will be saved to
-`model/vocab.roen.spm`. The `*.spm` suffix is required and tells Marian to train a SentencePiece
+Next, we execute a training run with `marian`. Note how the training command is called, passing the
+raw training and validation data into Marian. A single joint SentencePiece model will be saved to
+`model/vocab.roen.spm`. The `*.spm` suffix is required and tells Marian to train a SentencePiece
vocabulary. When the same vocabulary file is specified multiple times - like in this example - a single vocabulary is built for the union of the corresponding training files. This also enables us to use tied embeddings (`--tied-embeddings-all`).

@@ -178,7 +195,7 @@ argument. The values of this option are passed on to the SentencePiece trainer,
quotes around the SentencePiece options: `--sentencepiece-options '--normalization_rule_tsv=data/norm_romanian.tsv'`.
Another new feature is the `bleu-detok` validation metric. When used with SentencePiece this should
-give you in-training BLEU scores that are very close to sacreBLEU's scores. Differences may appear
+give you in-training BLEU scores that are very close to sacreBLEU's scores. Differences may appear
if unexpected SentencePiece normalization rules are used. You should still report only official
sacreBLEU scores for publications.

@@ -235,10 +252,23 @@ after which BLEU scores for the dev and test set are reported.
sacreBLEU/sacrebleu.py -t wmt16/dev -l ro-en < data/newsdev2016.ro.output
sacreBLEU/sacrebleu.py -t wmt16 -l ro-en < data/newstest2016.ro.output
```

-You should see results somewhere in the area of 36.5 BLEU for the dev set and 35.1 BLEU for the test set. This is actually a bit better than for the BPE version from `marian/examples/training-basics` with the complex preprocessing.
+You should see results somewhere in the area of 36.5 BLEU for the dev set and 35.1 BLEU for the test set.
+This is actually a bit better than for the BPE version from `marian/examples/training-basics` with the
+complex preprocessing.

```
BLEU+case.mixed+lang.ro-en+numrefs.1+smooth.exp+test.wmt16/dev+tok.13a+version.1.2.12 = 36.5 67.9/42.7/29.4/20.9 (BP = 1.000 ratio = 1.006 hyp_len = 49816 ref_len = 49526)
BLEU+case.mixed+lang.ro-en+numrefs.1+smooth.exp+test.wmt16+tok.13a+version.1.2.12 = 35.1 66.6/41.3/28.0/19.6 (BP = 1.000 ratio = 1.005 hyp_len = 47804 ref_len = 47562)
```

+## Is normalization actually required?
+
+We also quickly tested whether the normalization of Romanian characters is actually necessary and
+whether there are other methods of dealing with the noise. Here's the table:
+
+|            | dev  | test |
+|------------|------|------|
+| raw        |      |      |
+| normalized | 36.5 | 35.1 |
+| sampling   |      |      |
+
That's all folks.
--
cgit v1.2.3
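[Editorial illustration of the normalization-rule format discussed in the patch above. Per the linked SentencePiece normalization documentation, each rule line maps a sequence of hex Unicode codepoints to a replacement sequence, with the two columns separated by a tab and `#` starting a comment. The Python sketch below mimics that mapping as plain string replacement; it is not Marian or SentencePiece code, and the helper names `parse_rules` and `normalize` are made up for illustration.]

```python
# Hypothetical sketch (not Marian/SentencePiece code): parse SentencePiece-style
# normalization rules such as "015E<TAB>53  # Ş => S" and apply them as plain
# string replacements, to show what a norm_romanian.tsv line means.

def parse_rules(tsv_text):
    """Map each source character sequence to its normalized replacement."""
    rules = {}
    for line in tsv_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        src_hex, dst_hex = line.split("\t")  # columns are tab-separated
        # each column is a space-separated list of hex codepoints
        src = "".join(chr(int(cp, 16)) for cp in src_hex.split())
        dst = "".join(chr(int(cp, 16)) for cp in dst_hex.split())
        rules[src] = dst
    return rules

def normalize(text, rules):
    """Apply every replacement rule to the text."""
    for src, dst in rules.items():
        text = text.replace(src, dst)
    return text

rules = parse_rules("015E\t53  # Ş => S")
sample = "\u015Etefan"  # "Ştefan" with S-cedilla (U+015E)
print(normalize(sample, rules))  # -> Stefan
```

The point of the built-in mechanism is that SentencePiece compiles such rules into the model itself, so the same normalization is applied at training and inference time without any external preprocessing step.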