| author | Marcin Junczys-Dowmunt &lt;marcinjd@microsoft.com&gt; | 2018-11-26 10:43:13 +0300 |
|---|---|---|
| committer | Marcin Junczys-Dowmunt &lt;marcinjd@microsoft.com&gt; | 2018-11-26 10:43:13 +0300 |
| commit | 626b5bde373844b35f806c82d31e12016d30e1b2 | |
| tree | 46cf84cc58519c8b6bc0d154ff14c0f46f4f07d3 | |
| parent | 1b41c24aff90e121ee992f43804495a333f3bb94 | |
| parent | f3d292740fc7ea19e0a4bc970d66e49ef982a698 | |
Merge branch 'master' of https://github.com/marian-nmt/marian-examples
| -rw-r--r-- | training-basics-sentencepiece/README.md | 6 |
|---|---|---|

1 file changed, 3 insertions(+), 3 deletions(-)
```diff
diff --git a/training-basics-sentencepiece/README.md b/training-basics-sentencepiece/README.md
index 4facab0..b15d996 100644
--- a/training-basics-sentencepiece/README.md
+++ b/training-basics-sentencepiece/README.md
@@ -267,7 +267,7 @@ BLEU+case.mixed+lang.ro-en+numrefs.1+smooth.exp+test.wmt16+tok.13a+version.1.2.1
 ## Is Normalization Actually Required?
 
 We also quickly tested if the normalization of Romanian characters is actually neccessary and if there are other methods
-of dealing with the noise. SentencePiece supports a method called subword-regularization ((Kudo 2018)[]) that samples different
+of dealing with the noise. SentencePiece supports a method called subword-regularization ([Kudo 2018](https://arxiv.org/abs/1804.10959)) that samples different
 subword splits at training time; ideally resulting in a more robust translation at inference time. Here's the table:
 
@@ -279,8 +279,8 @@ Here's the table:
 | raw+sampling | | |
 
 We see that keeping the noise untouched (raw) results indeed in the worst of the three system, normalization (normalized) is best,
-closely followed by sampled subwords splits (raw+sampling). This is an interesting result: although normalization is generally better
-it is not trivial to discover the problem in the first place. Creating a normalization table is another added difficulty and on top of
+closely followed by sampled subwords splits (raw+sampling). This is an interesting result: although normalization is generally better,
+it is not trivial to discover the problem in the first place. Creating a normalization table is another added difficulty - and on top of
 that normalization breaks reversibility. Subword sampling seems to be a viable alternative when dealing with character-level noise
 with no added complexity compared to raw text. It does however take longer to converge, being a regularization method.
```
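The subword-regularization method referenced in the patched README (Kudo 2018) works by exposing the model to different subword splits of the same text during training. As a toy sketch of the idea only (this is not SentencePiece's actual unigram-LM sampler; the `segmentations` helper and the tiny vocabulary below are invented for illustration), one can enumerate all valid splits of a word over a piece vocabulary and sample one each time the example is seen:

```python
import random

def segmentations(word, vocab):
    """Enumerate every way to split `word` into pieces drawn from `vocab`."""
    if not word:
        return [[]]
    splits = []
    for i in range(1, len(word) + 1):
        piece = word[:i]
        if piece in vocab:
            # Recurse on the remainder and prepend the chosen piece.
            for rest in segmentations(word[i:], vocab):
                splits.append([piece] + rest)
    return splits

vocab = {"l", "o", "w", "lo", "ow", "low"}
splits = segmentations("low", vocab)
# Four candidate splits: ['low'], ['lo','w'], ['l','ow'], ['l','o','w']

# Deterministic tokenization always picks one split; sampling a different
# split per epoch is what makes training robust to segmentation noise.
sampled = random.choice(splits)
```

SentencePiece itself samples from the n-best segmentations under a trained unigram language model rather than uniformly, which is why noisy characters still receive plausible (if longer) splits instead of breaking tokenization outright.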