author    Roman Grundkiewicz <rgrundki@exseed.ed.ac.uk>  2018-11-27 11:48:40 +0300
committer Roman Grundkiewicz <rgrundki@exseed.ed.ac.uk>  2018-11-27 11:48:40 +0300
commit    87ce4e0ed7c497578c1cffb7cd27df3072a981ab (patch)
tree      4d2d1cf755c132fc28e6a7af1bb2d157e8beeb3d
parent    864aafd0aaa1929e3fb53862cf0a028da4b6982f (diff)
parent    bac54ff3b6de74c73acaf27f45e4b7271b67a9ee (diff)
Merge branch 'master' of https://github.com/marian-nmt/marian-examples
-rw-r--r--  training-basics-sentencepiece/README.md | 64
1 file changed, 41 insertions(+), 23 deletions(-)
diff --git a/training-basics-sentencepiece/README.md b/training-basics-sentencepiece/README.md
index 3d9a99f..ce14f28 100644
--- a/training-basics-sentencepiece/README.md
+++ b/training-basics-sentencepiece/README.md
@@ -185,7 +185,11 @@ for details):
 00EE 69 # î => i
 ```
 
-<!-- @TODO: add example for ../../build/spm_normalize --normalization_rule_tsv=data/romanian.tsv -->
+The effect of normalization can be inspected via the following command:
+```
+cat data/newsdev2016.ro | ../../build/spm_normalize --normalization_rule_tsv=data/norm_romanian.tsv | less
+```
+Notice how all diacritics are gone.
 
 ### Training the NMT model
 
@@ -196,7 +200,7 @@ vocabulary. When the same vocabulary file is specified multiple times - like in
 vocabulary is built for the union of the corresponding training files. This also enables
 us to use tied embeddings (`--tied-embeddings-all`).
 The SentencePiece training process takes a couple of minutes depending on the input data
 size. The same `*.spm` can be later reused for other experiments
-with the same language pair and training is then of course omitted.
+with the same language pair and training is then of course omitted.
 We can pass the Romanian-specific normalization rules via the `--sentencepiece-options`
 command line argument. The values of this option are passed on to the SentencePiece
 trainer, note the required single
@@ -237,9 +241,15 @@ mkdir model
 ```
 
 The training should stop if cross-entropy on the validation set
-stops improving. Depending on the number of and generation of GPUs you are using that may take a while.
+stops improving. Depending on the number and generation of GPUs you are using that
+may take a while.
 
-<!-- @TODO: add example for ../../build/spm_encode/spm_decode --model=model/vocab.roen.spm -->
+To inspect the created SentencePiece model `model/vocab.roen.spm`, you can now segment any
+Romanian or English text with the following command:
+```
+cat data/newsdev2016.ro | ../../build/spm_encode --model=model/vocab.roen.spm | less
+```
+Notice how the text is not only split, but also normalized with regard to diacritics.
 
 ### Translating the test and validation sets with evaluation
 
@@ -276,15 +286,19 @@ BLEU+case.mixed+lang.ro-en+numrefs.1+smooth.exp+test.wmt16+tok.13a+version.1.2.1
 
 ## Is Normalization Actually Required?
 
-We also quickly tested if the normalization of Romanian characters is actually neccessary and if there are other methods
-of dealing with the noise. SentencePiece supports a method called subword-regularization ([Kudo 2018](https://arxiv.org/abs/1804.10959)) that samples different
-subword splits at training time; ideally resulting in a more robust translation at inference time. You can enable sampling for the source language by replacing
-this line `--sentencepiece-options '--normalization_rule_tsv=data/norm_romanian.tsv'` with `--sentencepiece-alphas 0.2 0`; the sampling rate was recommended
-by [Kudo 2018](https://arxiv.org/abs/1804.10959).
+We also quickly tested if the normalization of Romanian characters is actually necessary
+and if there are other methods of dealing with the noise. SentencePiece supports a method
+called subword regularization ([Kudo 2018](https://arxiv.org/abs/1804.10959)) that samples
+different subword splits at training time; ideally resulting in a more robust translation
+at inference time. You can enable sampling for the source language by replacing
+the line `--sentencepiece-options '--normalization_rule_tsv=data/norm_romanian.tsv'` with
+`--sentencepiece-alphas 0.2 0`; the sampling rate was recommended by [Kudo 2018](https://arxiv.org/abs/1804.10959).
 
-We compare against the University of Edinburgh's WMT16 submission (UEdin WMT16 - this is a Nematus ensemble with BPE and normalization),
-and against our own old example from `marian/examples/training-basics` (old-prepro - single Marian model with complex preprocessing pipeline,
-including tokenization, normalization, BPE). Raw training data should be identical for all models.
+We compare against the University of Edinburgh's WMT16 submission (UEdin WMT16 - this is
+a Nematus ensemble with BPE and normalization), and against our own old example from
+`marian/examples/training-basics` (old-prepro - single Marian model with complex preprocessing
+pipeline, including tokenization, normalization, BPE). Raw training data should be identical
+for all models.
 
 Here's the table:
 
@@ -292,19 +306,23 @@ Here's the table:
 |------------------|------|------|
 | UEdin WMT16      | 35.3 | 33.9 |
 | old-prepro       | 35.9 | 34.5 |
-| SPM-raw          | 35.5 |      |
-| SPM-raw+sampling | 35.5 |      |
+| SPM-raw          | 35.6 | 33.0 |
+| SPM-raw+sampling | 35.7 | 33.0 |
 | SPM-normalized   | 36.5 | 35.1 |
 
-The SentencePiece models are all better than the original Edinburgh systems (an emsemble!), but normalization is important.
-We see that keeping the noise untouched (SPM-raw) results indeed in the worst of the three system, normalization (SPM-normalized) is best.
-Surprisingly there is no gain from sampled subwords splits (SPM-raw+sampling) over deterministic splits.
-
-This is an interesting result: I would expected subword-sampling to help at least a little bit, but no. It seems we need to stick with
-normalization which is unfortunate for the following reasons: it is not trivial to discover the normalization problem in the first place and
-creating a normalization table is another added difficulty; on top of that normalization breaks reversibility. The reversiblity problem is a
-little less annoying if we only normalize the source and more-or-less keep the target (as in this case). For translation into Romanian we would
-probably need to keep the diacritics.
+The SentencePiece models are all better than the original Edinburgh systems (an ensemble!) on
+the dev set, but not necessarily on the test set. And indeed, normalization seems to be important.
+We see that keeping the noise untouched (SPM-raw) results in the worst of the three systems;
+normalization (SPM-normalized) is best. Surprisingly, there is no gain from sampled subword
+splits (SPM-raw+sampling) over deterministic splits.
+
+This is an interesting (and disappointing) result: I would have expected subword sampling to
+help a good deal more. It seems we need to stick to normalization, which is unfortunate for the
+following reasons: it is not trivial to discover the normalization problem in the first place,
+creating a normalization table is another added difficulty, and on top of that normalization
+breaks reversibility. The reversibility problem is a little less annoying if we only normalize
+the source and more-or-less keep the target (as in this case). For translation into Romanian we
+would probably need to keep the diacritics.
 
 That's all folks. More to come soon.
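A side note on the normalization rules that the patch above exercises: each rule line maps a hexadecimal source codepoint to a target codepoint, as in `00EE 69 # î => i`. The sketch below mimics that single-codepoint case in Python to make the mapping concrete; the helper names are hypothetical and not part of Marian or SentencePiece, where the actual work is done by `spm_normalize`.

```python
# Illustration only: apply single-codepoint normalization rules of the form
# "<src-hex> <tgt-hex>  # comment", like the "00EE 69 # î => i" rule above.
def parse_rule(line):
    """Parse one rule line into a (source_char, target_char) pair."""
    body = line.split("#", 1)[0].strip()    # drop the trailing comment
    src_hex, tgt_hex = body.split()
    return chr(int(src_hex, 16)), chr(int(tgt_hex, 16))

def normalize(text, rules):
    """Apply each single-codepoint replacement in order."""
    for src, tgt in rules:
        text = text.replace(src, tgt)
    return text

rules = [parse_rule(r) for r in ["00EE 69 # î => i", "0163 74 # ţ => t"]]
print(normalize("înţeles", rules))  # -> inteles
```

Real rule tables are tab-separated and may map whole codepoint sequences, so treat this strictly as an illustration of the per-character case shown here.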