From 1d5df93e22a2dfb58330a98f182cfd4f32825aba Mon Sep 17 00:00:00 2001
From: Marcin Junczys-Dowmunt
Date: Sun, 25 Nov 2018 23:34:22 -0800
Subject: update readme

---
 training-basics-sentencepiece/README.md | 32 +++++++++++++++++++++++---------
 1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/training-basics-sentencepiece/README.md b/training-basics-sentencepiece/README.md
index 643eb89..0941189 100644
--- a/training-basics-sentencepiece/README.md
+++ b/training-basics-sentencepiece/README.md
@@ -1,4 +1,4 @@
-# Marian with Built-in SentencePiece
+# Tutorial: Marian with Built-in SentencePiece
 
 In this example, we modify the Romanian-English example from `examples/training-basics` to use
 Taku Kudo's [SentencePiece](https://github.com/google/sentencepiece) instead of a complicated pre-/post-processing pipeline.
@@ -6,6 +6,10 @@ We also replace the evaluation scripts with Matt Post's [SacreBLEU](https://gith
 Both tools greatly simplify the training and evaluation process by providing ways to have reversible
 hidden preprocessing and repeatable evaluation.
 
+The model we build here is a simple Nematus-style shallow RNN model, similar to the one in the older
+`marian/examples/training-basics` folder. We will soon update our WMT Transformer examples to use
+SentencePiece.
+
 ## Building Marian with SentencePiece Support
 
 Since version 1.7.0, Marian has built-in support for SentencePiece,
@@ -260,15 +264,25 @@ BLEU+case.mixed+lang.ro-en+numrefs.1+smooth.exp+test.wmt16/dev+tok.13a+version.1
 BLEU+case.mixed+lang.ro-en+numrefs.1+smooth.exp+test.wmt16+tok.13a+version.1.2.12 = 35.1 66.6/41.3/28.0/19.6 (BP = 1.000 ratio = 1.005 hyp_len = 47804 ref_len = 47562)
 ```
 
-## Is normalization actually required?
+## Is Normalization Actually Required?
 
 We also quickly tested whether the normalization of Romanian characters is actually necessary and whether there are other methods
-of dealing with the noise. Here's the table:
+of dealing with the noise. SentencePiece supports a method called subword regularization ([Kudo, 2018](https://arxiv.org/abs/1804.10959)) that samples different
+subword splits at training time, ideally resulting in more robust translations at inference time.
+
+Here's the table:
+
+|              | dev  | test |
+|--------------|------|------|
+| raw text     |      |      |
+| normalized   | 36.5 | 35.1 |
+| raw+sampling |      |      |
+
+We see that keeping the noise untouched (raw text) indeed results in the worst of the three systems; normalization (normalized) is best,
+closely followed by sampled subword splits (raw+sampling). This is an interesting result: although normalization is generally better,
+it is not trivial to discover the problem in the first place. Creating a normalization table is an added difficulty, and on top of
+that, normalization breaks reversibility. Subword sampling seems to be a viable alternative for dealing with character-level noise,
+adding no complexity compared to raw text. Being a regularization method, it does however take longer to converge.
 
-| | dev | test |
-|------------|------|------|
-| raw | | |
-| normalized | 36.5 | 35.1 |
-| sampling | | |
+That's all folks. More to come soon.
 
-That's all folks.
--
cgit v1.2.3
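
To follow the README this patch updates, Marian first has to be compiled with SentencePiece support. A minimal sketch of that build (not part of the patch above), assuming the `-DUSE_SENTENCEPIECE=on` CMake switch that Marian uses for this feature; directory layout and job count are illustrative:

```
# Build Marian (>= 1.7.0) with built-in SentencePiece support.
# -DUSE_SENTENCEPIECE=on compiles SentencePiece into Marian itself.
git clone https://github.com/marian-nmt/marian
mkdir -p marian/build
cd marian/build
cmake .. -DUSE_SENTENCEPIECE=on -DCMAKE_BUILD_TYPE=Release
make -j8
```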
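
The two BLEU signatures quoted in the last hunk are SacreBLEU output for the WMT16 Ro-En dev and test sets. A sketch of how such scores are obtained, assuming detokenized translations in `output.dev.en` and `output.test.en` (hypothetical file names):

```
# SacreBLEU fetches the reference sets itself; the signature string it
# prints (BLEU+case.mixed+...) makes the score reproducible.
cat output.dev.en  | sacrebleu -t wmt16/dev -l ro-en   # newsdev2016 (dev)
cat output.test.en | sacrebleu -t wmt16     -l ro-en   # newstest2016 (test)
```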
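
The raw+sampling row in the last hunk refers to SentencePiece's subword regularization. Independent of Marian's integration (which, as an assumption here, exposes sampling via a `--sentencepiece-alphas` training option; check `marian --help`), the effect can be previewed with the SentencePiece command-line tools. A sketch, with a hypothetical model file `vocab.roen.spm` and an arbitrary alpha:

```
# Deterministic segmentation, as used at inference time:
echo "Aceasta este o propoziție." | spm_encode --model=vocab.roen.spm

# Sampled segmentation (subword regularization): each run can yield a
# different split; --nbest_size=-1 samples from all candidates and
# --alpha controls the smoothness of the sampling distribution.
echo "Aceasta este o propoziție." \
  | spm_encode --model=vocab.roen.spm --output_format=sample_piece --nbest_size=-1 --alpha=0.2
```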