added comment on sampling rate

author: Marcin Junczys-Dowmunt <marcinjd@microsoft.com> 2018-11-26 22:17:46 +0300
committer: Marcin Junczys-Dowmunt <marcinjd@microsoft.com> 2018-11-26 22:17:46 +0300
commit: 47db5b55d39d7d616b33ab70d0e2c44a91687b97
tree: a7090eafc0e5b644c5205173dfee7a8b11e28cf2
parent: 7f8d6b435a8e45c7c22c9a321b16a9391eba6b82
1 files changed, 3 insertions, 1 deletions
diff --git a/training-basics-sentencepiece/README.md b/training-basics-sentencepiece/README.md
index 1e445c1..03f34e5 100644
--- a/training-basics-sentencepiece/README.md
+++ b/training-basics-sentencepiece/README.md
@@ -274,7 +274,9 @@ BLEU+case.mixed+lang.ro-en+numrefs.1+smooth.exp+test.wmt16+tok.13a+version.1.2.1
 
 We also quickly tested if the normalization of Romanian characters is actually neccessary and if there are other methods
 of dealing with the noise. SentencePiece supports a method called subword-regularization ([Kudo 2018](https://arxiv.org/abs/1804.10959)) that samples different
-subword splits at training time; ideally resulting in a more robust translation at inference time.
+subword splits at training time; ideally resulting in a more robust translation at inference time. You can enable sampling for the source language by replacing
+this line `--sentencepiece-options '--normalization_rule_tsv=data/norm_romanian.tsv'` with `--sentencepiece-alphas 0.2 0`; the sampling rate was recommended
+by [Kudo 2018](https://arxiv.org/abs/1804.10959).
 
 We compare against the University of Edinburgh's WMT16 submission (UEdin WMT16 - this is a Nematus ensemble with BPE and normalization),
 and against our own old example from `marian/examples/training-basics` (old-prepro - single Marian model with complex preprocessing pipeline,