
github.com/marian-nmt/marian-examples.git
author    Marcin Junczys-Dowmunt <marcinjd@microsoft.com>  2018-11-26 10:14:46 +0300
committer Marcin Junczys-Dowmunt <marcinjd@microsoft.com>  2018-11-26 10:14:46 +0300
commit    e23416707a81fc40d82d96550e24d807e0ab8824 (patch)
tree      3ce4dfe514f309ba3234507dc9fafd074b3aa8d6
parent    7dd5b07ec4aa9a8fb46bb04226d7c760c746aa8d (diff)
update readme
 training-basics-sentencepiece/README.md | 62
 1 file changed, 46 insertions(+), 16 deletions(-)
diff --git a/training-basics-sentencepiece/README.md b/training-basics-sentencepiece/README.md
index 1e145de..643eb89 100644
--- a/training-basics-sentencepiece/README.md
+++ b/training-basics-sentencepiece/README.md
@@ -1,8 +1,10 @@
# Marian with Built-in SentencePiece
-In this example, we modify the Romanian-English example from `examples/training-basics` to use Taku Kudo's
-[SentencePiece](https://github.com/google/sentencepiece) instead of a complicated pre/prost-processing pipeline.
-We also replace the evaluation scripts with Matt Post's [SacreBLEU](https://github.com/mjpost/sacreBLEU). Both tools greatly simplify the training and evaluation process by providing ways to have reversible hidden preprocessing and repeatable evaluation.
+In this example, we modify the Romanian-English example from `examples/training-basics` to use Taku Kudo's
+[SentencePiece](https://github.com/google/sentencepiece) instead of a complicated pre/post-processing pipeline.
+We also replace the evaluation scripts with Matt Post's [SacreBLEU](https://github.com/mjpost/sacreBLEU).
+Both tools greatly simplify the training and evaluation process by providing ways to have reversible hidden
+preprocessing and repeatable evaluation.
## Building Marian with SentencePiece Support
@@ -76,7 +78,7 @@ Assuming you have one GPU, to execute the complete example, type:
./run-me.sh
```
-which downloads the Romanian-English training files and concatenates them into training files.
+which downloads the Romanian-English training data and concatenates it into two training files.
No preprocessing is required as the Marian command will train a SentencePiece vocabulary from
the raw text. Next the translation model will be trained and after convergence, the dev and test
sets are translated and evaluated with sacreBLEU.
@@ -89,13 +91,18 @@ To use a different GPU than device 0, or multiple GPUs (here 0 1 2 3), use the c
## Step-by-step Walkthrough
-In this section we repeat the content from the above `run-me.sh` script with explanations. You should be able to copy and paste the commands and follow through all the steps.
+In this section we repeat the content from the above `run-me.sh` script with explanations. You should be
+able to copy and paste the commands and follow through all the steps.
-We assume you are running these commands from the examples directory of the main Marian directory tree `marian/examples/training-basics-sentencepiece` and that the Marian binaries have been compiled in `marian/build`. The localization of the Marian binary relative to the current directory is therefore `../../build/marian`.
+We assume you are running these commands from the examples directory of the main Marian directory tree
+ `marian/examples/training-basics-sentencepiece` and that the Marian binaries have been compiled in
+ `marian/build`. The location of the Marian binary relative to the current directory is
+ therefore `../../build/marian`.
### Preparing the test and validation sets
-We can use SacreBLEU to produce the original WMT16 development and test sets for Romanian-English. We first clone the SacreBLEU repository from our fork and then generate the test files.
+We can use SacreBLEU to produce the original WMT16 development and test sets for Romanian-English. We first
+clone the SacreBLEU repository from our fork and then generate the test files.
```
# get our fork of sacrebleu
@@ -112,7 +119,8 @@ sacreBLEU/sacrebleu.py -t wmt16 -l ro-en --echo ref > data/newstest2016.en
### Downloading the training files
-Similarly, we download the training files from different sources and concatenate them into two training files. Note, there is no preprocessing whatsoever. Downloading may take a while, the servers are not particularly fast.
+Similarly, we download the training files from different sources and concatenate them into two training files.
+Note that there is no preprocessing whatsoever. Downloading may take a while; the servers are not particularly fast.
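The concatenation step itself is plain `cat`. As a toy sketch with hypothetical stand-in file names (the real script does the same with the downloaded SETIMES and Europarl corpora):

```shell
# toy illustration with stand-in file names: per-corpus files are
# simply concatenated into a single training file per language
printf 'propozitie unu\n' > part1.ro
printf 'propozitie doi\n' > part2.ro
cat part1.ro part2.ro > corpus.ro
cat corpus.ro
```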
```
# change into data directory
@@ -142,10 +150,19 @@ cd ..
### Normalization of Romanian diacritics with SentencePiece
-It seems that the training data is quite noisy and multiple similar characters are used in place of the one correct character.
-Barry Haddow from Edinburgh who created the original normalization Python scripts noticed that removing diacritics on the Romanian side leads to a significant improvment in translation quality. And indeed we saw gains of up to 2 BLEU points due to normalization versus unnormalized text. The original scripts are located in the old Romanian-English example folder in `marian/examples/training-basics/scripts`. We do not need to use them here.
+It seems that the training data is quite noisy and multiple similar characters are used in place of
+the one correct character. Barry Haddow from Edinburgh, who created the original normalization Python
+scripts, noticed that removing diacritics on the Romanian side leads to a significant improvement in
+translation quality. And indeed we saw gains of up to 2 BLEU points due to normalization versus
+unnormalized text. The original scripts are located in the old Romanian-English example folder
+in `marian/examples/training-basics/scripts`. We do not need to use them here.
-SentencePiece allows to specify normalization or replacement tables for character sequences. These replacements are applied before tokenization/segmentation and included in the SentencePiece model. Based on the mentioned preprocessing scripts, we manually create a tab-separated normalization rule file `data/norm_romanian.tsv` like this (see the [SentencePiece documentation on normalization](https://github.com/google/sentencepiece/blob/master/doc/normalization.md) for details):
+SentencePiece allows specifying normalization or replacement tables for character sequences. These
+replacements are applied before tokenization/segmentation and included in the SentencePiece model.
+Based on the mentioned preprocessing scripts, we manually create a tab-separated normalization
+rule file `data/norm_romanian.tsv` like this (see the
+[SentencePiece documentation on normalization](https://github.com/google/sentencepiece/blob/master/doc/normalization.md)
+for details):
```
015E 53 # Ş => S
@@ -166,9 +183,9 @@ SentencePiece allows to specify normalization or replacement tables for characte
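To see what such a rule does in isolation, here is an illustration with `sed` rather than SentencePiece itself (the real replacements happen inside the SentencePiece model once the TSV is compiled in; `ţ => t` is assumed to be one of the rules in the full file):

```shell
# illustration only: emulate 'Ş => S'-style replacements with sed;
# SentencePiece applies the TSV rules internally before segmentation
printf 'Ştiinţa\n' | sed -e 's/Ş/S/g' -e 's/ţ/t/g'
# prints: Stiinta
```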
### Training the NMT model
-Next, we execute a training run with `marian`. Note how the training command is called passing the
-raw training and validation data into Marian. A single joint SentencePiece model will be saved to
-`model/vocab.roen.spm`. The `*.spm` suffix is required and tells Marian to train a SentencePiece
+Next, we execute a training run with `marian`. Note how the training command is called passing the
+raw training and validation data into Marian. A single joint SentencePiece model will be saved to
+`model/vocab.roen.spm`. The `*.spm` suffix is required and tells Marian to train a SentencePiece
vocabulary. When the same vocabulary file is specified multiple times - like in this example - a single
vocabulary is built for the union of the corresponding training files. This also enables us to use
tied embeddings (`--tied-embeddings-all`).
@@ -178,7 +195,7 @@ argument. The values of this option are passed on to the SentencePiece trainer,
quotes around the SentencePiece options: `--sentencepiece-options '--normalization_rule_tsv=data/norm_romanian.tsv'`.
Another new feature is the `bleu-detok` validation metric. When used with SentencePiece this should
-give you in-training BLEU scores that are very close to sacreBLEU's scores. Differences may appear
+give you in-training BLEU scores that are very close to sacreBLEU's scores. Differences may appear
if unexpected SentencePiece normalization rules are used. You should still report only official
sacreBLEU scores for publications.
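Putting the pieces from this section together, the training call has roughly this shape. This is a hedged sketch, not the verbatim command from `run-me.sh`; the `data/corpus.*` file names are stand-ins, and the remaining model and optimizer options are omitted:

```
../../build/marian \
    --devices 0 \
    --model model/model.npz \
    --train-sets data/corpus.ro data/corpus.en \
    --vocabs model/vocab.roen.spm model/vocab.roen.spm \
    --sentencepiece-options '--normalization_rule_tsv=data/norm_romanian.tsv' \
    --tied-embeddings-all \
    --valid-sets data/newsdev2016.ro data/newsdev2016.en \
    --valid-metrics bleu-detok
```

Note how the same `model/vocab.roen.spm` is given for both languages, which is what triggers the single joint vocabulary described above.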
@@ -235,10 +252,23 @@ after which BLEU scores for the dev and test set are reported.
sacreBLEU/sacrebleu.py -t wmt16/dev -l ro-en < data/newsdev2016.ro.output
sacreBLEU/sacrebleu.py -t wmt16 -l ro-en < data/newstest2016.ro.output
```
-You should see results somewhere in the area of 36.5 BLEU for the dev set and 35.1 BLEU for the test set. This is actually a bit better than for the BPE version from `marian/examples/training-basics` with the complex preprocessing.
+You should see results around 36.5 BLEU for the dev set and 35.1 BLEU for the test set.
+This is actually a bit better than for the BPE version from `marian/examples/training-basics` with the
+complex preprocessing.
```
BLEU+case.mixed+lang.ro-en+numrefs.1+smooth.exp+test.wmt16/dev+tok.13a+version.1.2.12 = 36.5 67.9/42.7/29.4/20.9 (BP = 1.000 ratio = 1.006 hyp_len = 49816 ref_len = 49526)
BLEU+case.mixed+lang.ro-en+numrefs.1+smooth.exp+test.wmt16+tok.13a+version.1.2.12 = 35.1 66.6/41.3/28.0/19.6 (BP = 1.000 ratio = 1.005 hyp_len = 47804 ref_len = 47562)
```
+## Is normalization actually required?
+
+We also quickly tested whether the normalization of Romanian characters is actually necessary and whether there
+are other ways of dealing with the noise. Here's the table:
+
+| | dev | test |
+|------------|------|------|
+| raw | | |
+| normalized | 36.5 | 35.1 |
+| sampling | | |
+
That's all folks.