github.com/marian-nmt/marian-examples.git
author Marcin Junczys-Dowmunt <marcinjd@microsoft.com> 2018-11-26 22:17:46 +0300
committer Marcin Junczys-Dowmunt <marcinjd@microsoft.com> 2018-11-26 22:17:46 +0300
commit 47db5b55d39d7d616b33ab70d0e2c44a91687b97
tree a7090eafc0e5b644c5205173dfee7a8b11e28cf2
parent 7f8d6b435a8e45c7c22c9a321b16a9391eba6b82
added comment on sampling rate
-rw-r--r-- training-basics-sentencepiece/README.md | 4
1 files changed, 3 insertions, 1 deletions
diff --git a/training-basics-sentencepiece/README.md b/training-basics-sentencepiece/README.md
index 1e445c1..03f34e5 100644
--- a/training-basics-sentencepiece/README.md
+++ b/training-basics-sentencepiece/README.md
@@ -274,7 +274,9 @@ BLEU+case.mixed+lang.ro-en+numrefs.1+smooth.exp+test.wmt16+tok.13a+version.1.2.1
We also quickly tested if the normalization of Romanian characters is actually neccessary and if there are other methods
of dealing with the noise. SentencePiece supports a method called subword-regularization ([Kudo 2018](https://arxiv.org/abs/1804.10959)) that samples different
-subword splits at training time; ideally resulting in a more robust translation at inference time.
+subword splits at training time; ideally resulting in a more robust translation at inference time. You can enable sampling for the source language by replacing
+this line `--sentencepiece-options '--normalization_rule_tsv=data/norm_romanian.tsv'` with `--sentencepiece-alphas 0.2 0`; the sampling rate was recommended
+by [Kudo 2018](https://arxiv.org/abs/1804.10959).
We compare against the University of Edinburgh's WMT16 submission (UEdin WMT16 - this is a Nematus ensemble with BPE and normalization),
and against our own old example from `marian/examples/training-basics` (old-prepro - single Marian model with complex preprocessing pipeline,
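
The lines added in this commit describe sampling different subword splits at training time with a sampling rate of 0.2. As a rough illustration of that idea, here is a minimal Python sketch that samples a segmentation with probability proportional to P(split) ** alpha. It is not Marian's or SentencePiece's implementation: the vocabulary, its log-probabilities, and the function names `segmentations`, `sample_segmentation`, and `best_segmentation` are all hypothetical, and the sketch enumerates every split exhaustively, which only works for short strings.

```python
import math
import random

# Hypothetical toy unigram vocabulary with made-up log-probabilities.
VOCAB = {
    "t": -4.0, "r": -4.0, "a": -4.0, "n": -4.0, "s": -4.0,
    "tr": -3.0, "an": -2.5, "ans": -3.0, "trans": -2.0,
}


def segmentations(word):
    """Yield every split of `word` into pieces that appear in VOCAB."""
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in VOCAB:
            for rest in segmentations(word[end:]):
                yield [piece] + rest


def best_segmentation(word):
    """Return the single most probable split (no sampling)."""
    return max(segmentations(word), key=lambda seg: sum(VOCAB[p] for p in seg))


def sample_segmentation(word, alpha, rng):
    """Sample one split with probability proportional to P(split) ** alpha."""
    candidates = list(segmentations(word))
    # alpha * log P(split) is the log of the smoothed, unnormalized weight.
    scores = [alpha * sum(VOCAB[p] for p in seg) for seg in candidates]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print("best:", best_segmentation("trans"))
    for _ in range(5):
        print("sampled:", sample_segmentation("trans", alpha=0.2, rng=rng))
```

In this toy setup, smaller `alpha` values flatten the distribution so that unlikely splits are drawn more often, while larger values concentrate the samples on the most probable split. How Marian itself interprets the values passed to `--sentencepiece-alphas` is not covered by this sketch.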