author | Marcin Junczys-Dowmunt <marcinjd@microsoft.com> | 2018-11-26 22:17:46 +0300 |
---|---|---|
committer | Marcin Junczys-Dowmunt <marcinjd@microsoft.com> | 2018-11-26 22:17:46 +0300 |
commit | 47db5b55d39d7d616b33ab70d0e2c44a91687b97 (patch) | |
tree | a7090eafc0e5b644c5205173dfee7a8b11e28cf2 | |
parent | 7f8d6b435a8e45c7c22c9a321b16a9391eba6b82 (diff) |
added comment on sampling rate
-rw-r--r-- | training-basics-sentencepiece/README.md | 4 |
1 files changed, 3 insertions, 1 deletions
diff --git a/training-basics-sentencepiece/README.md b/training-basics-sentencepiece/README.md
index 1e445c1..03f34e5 100644
--- a/training-basics-sentencepiece/README.md
+++ b/training-basics-sentencepiece/README.md
@@ -274,7 +274,9 @@ BLEU+case.mixed+lang.ro-en+numrefs.1+smooth.exp+test.wmt16+tok.13a+version.1.2.1
 We also quickly tested if the normalization of Romanian characters is actually neccessary and if there are other methods of dealing with the noise.
 SentencePiece supports a method called subword-regularization ([Kudo 2018](https://arxiv.org/abs/1804.10959)) that samples different
-subword splits at training time; ideally resulting in a more robust translation at inference time.
+subword splits at training time; ideally resulting in a more robust translation at inference time. You can enable sampling for the source language by replacing
+this line `--sentencepiece-options '--normalization_rule_tsv=data/norm_romanian.tsv'` with `--sentencepiece-alphas 0.2 0`; the sampling rate was recommended
+by [Kudo 2018](https://arxiv.org/abs/1804.10959).
 We compare against the University of Edinburgh's WMT16 submission (UEdin WMT16 - this is a Nematus ensemble with BPE and normalization),
 and against our own old example from `marian/examples/training-basics` (old-prepro - single Marian model with complex preprocessing pipeline,
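For readers unfamiliar with the sampling rate mentioned in the added README text, here is a minimal sketch of the smoothing described in Kudo 2018, where a segmentation is drawn with probability proportional to its model probability raised to the power alpha. It is a toy illustration only and not how Marian or SentencePiece implement it: the candidate segmentations, their probabilities, and the function names `smoothed_distribution` and `sample_segmentation` are all hypothetical.

```python
import math
import random


def smoothed_distribution(candidates, alpha):
    """Return P(seg)^alpha / sum_j P(seg_j)^alpha for each candidate.

    `candidates` is a list of (segmentation, log_probability) pairs.
    Works in log space and subtracts the maximum for numerical stability.
    """
    scaled = [alpha * log_prob for _, log_prob in candidates]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    return [weight / total for weight in weights]


def sample_segmentation(candidates, alpha, rng):
    """Draw one segmentation from the alpha-smoothed distribution."""
    probabilities = smoothed_distribution(candidates, alpha)
    segmentations = [segmentation for segmentation, _ in candidates]
    return rng.choices(segmentations, weights=probabilities, k=1)[0]


# Hypothetical candidate splits of one word with made-up probabilities.
CANDIDATES = [
    (["trans", "lation"], math.log(0.6)),
    (["tran", "s", "lation"], math.log(0.3)),
    (["t", "rans", "lation"], math.log(0.1)),
]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs draw the same sample
    for alpha in (0.2, 1.0):
        rounded = [round(p, 3) for p in smoothed_distribution(CANDIDATES, alpha)]
        print(f"alpha={alpha}: {rounded}")
    print(sample_segmentation(CANDIDATES, 0.2, rng))
```

In this formula a smaller alpha flattens the distribution, so less likely splits are sampled more often, while alpha of 1 reproduces the model's own probabilities. How Marian interprets each of the two values passed to `--sentencepiece-alphas` is not shown by this sketch.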