author     Marcin Junczys-Dowmunt <marcinjd@microsoft.com>   2018-11-26 21:12:19 +0300
committer  Marcin Junczys-Dowmunt <marcinjd@microsoft.com>   2018-11-26 21:12:19 +0300
commit     39d1fa288891c17f0b5f4b02cf2a84b6ea942ef9 (patch)
tree       a64a20bd3ce42b952cc1ec7da11527c6decdc549
parent     29346583cfaca11b18ebcf748c293f03e3a82975 (diff)
update readme
-rw-r--r--  training-basics-sentencepiece/README.md | 32
1 file changed, 21 insertions, 11 deletions
diff --git a/training-basics-sentencepiece/README.md b/training-basics-sentencepiece/README.md
index 9b122d7..8e97851 100644
--- a/training-basics-sentencepiece/README.md
+++ b/training-basics-sentencepiece/README.md
@@ -270,19 +270,29 @@ We also quickly tested if the normalization of Romanian characters is actually
 necessary and if there are other methods of dealing with the noise. SentencePiece supports a method called
 subword-regularization ([Kudo 2018](https://arxiv.org/abs/1804.10959)) that samples different subword splits at
 training time; ideally resulting in a more robust translation at inference time.
+We compare against the University of Edinburgh's WMT16 submission (UEdin WMT16 - a Nematus ensemble with BPE and
+normalization), and against our own older example from `marian/examples/training-basics` (old-prepro - a single Marian
+model with a complex preprocessing pipeline including tokenization, normalization, and BPE). The raw training data
+should be identical for all models.
+
 Here's the table:
 
-|              | dev  | test |
-|--------------|------|------|
-| raw text     |      |      |
-| normalized   | 36.5 | 35.1 |
-| raw+sampling |      |      |
-
-We see that keeping the noise untouched (raw) results indeed in the worst of the three system, normalization (normalized) is best,
-closely followed by sampled subwords splits (raw+sampling). This is an interesting result: although normalization is generally better,
-it is not trivial to discover the problem in the first place. Creating a normalization table is another added difficulty - and on top of
-that normalization breaks reversibility. Subword sampling seems to be a viable alternative when dealing with character-level noise with
-no added complexity compared to raw text. It does however take longer to converge, being a regularization method.
+| system           | dev  | test |
+|------------------|------|------|
+| UEdin WMT16      | 35.3 | 33.9 |
+| old-prepro       | 35.9 | 34.5 |
+| SPM-raw          | 35.5 |      |
+| SPM-raw+sampling | 35.5 |      |
+| SPM-normalized   | 36.5 | 35.1 |
+
+The SentencePiece models are all better than the original Edinburgh systems (an ensemble!), but normalization is important.
+We see that keeping the noise untouched (SPM-raw) indeed results in the worst of the three systems; normalization
+(SPM-normalized) is best. Surprisingly, there is no gain from sampled subword splits (SPM-raw+sampling) over
+deterministic splits.
+
+This is an interesting result: I would have expected subword-sampling to help at least a little bit, but no. It seems
+we need to stick with normalization, which is unfortunate for the following reasons: it is not trivial to discover the
+normalization problem in the first place, and creating a normalization table is another added difficulty; on top of
+that, normalization breaks reversibility. The reversibility problem is a little less annoying if we only normalize the
+source and more-or-less keep the target untouched (as in this case). For translation into Romanian we would probably
+need to keep the diacritics.
 
 That's all folks. More to come soon.
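
For readers who want to see what the subword-regularization from the updated text looks like in practice, here is a
minimal sketch using the SentencePiece Python API. The model file name `ro.model` is a placeholder for whatever
SentencePiece model you have trained, not a file shipped with this example.

```python
# Minimal sketch of SentencePiece subword-regularization (Kudo 2018).
# Assumes a trained (unigram) SentencePiece model; "ro.model" is a
# placeholder name, not a file from this repository.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="ro.model")

text = "Acesta este un exemplu simplu."

# Deterministic segmentation: the same split every time
# (this is what the SPM-raw system uses at training and test time).
print(sp.encode(text, out_type=str))

# Sampled segmentation: each call may return a different subword split.
# nbest_size=-1 samples from all candidate segmentations; alpha controls
# the smoothing of the sampling distribution.
for _ in range(3):
    print(sp.encode(text, out_type=str,
                    enable_sampling=True, nbest_size=-1, alpha=0.1))
```

In the Marian setup described by this README, the same effect is obtained during training through Marian's built-in
SentencePiece support rather than through Python preprocessing; see the Marian documentation for the corresponding
training option.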