From dc285935e12a560a82033873d79445402e2e7ec0 Mon Sep 17 00:00:00 2001
From: Marcin Junczys-Dowmunt
Date: Sun, 25 Nov 2018 23:36:58 -0800
Subject: Update README.md

---
 training-basics-sentencepiece/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/training-basics-sentencepiece/README.md b/training-basics-sentencepiece/README.md
index 0941189..0fb6a85 100644
--- a/training-basics-sentencepiece/README.md
+++ b/training-basics-sentencepiece/README.md
@@ -279,8 +279,8 @@ Here's the table:
 | raw+sampling | | |
 
 We see that keeping the noise untouched (raw) results indeed in the worst of the three system, normalization (normalized) is best,
-closely followed by sampled subwords splits (raw+sampling). This is an interesting result: although normalization is generally better
-it is not trivial to discover the problem in the first place. Creating a normalization table is another added difficulty and on top of
+closely followed by sampled subwords splits (raw+sampling). This is an interesting result: although normalization is generally better,
+it is not trivial to discover the problem in the first place. Creating a normalization table is another added difficulty - and on top of
 that normalization breaks reversibility. Subword sampling seems to be a viable alternative when dealing with character-level noise
 with no added complexity compared to raw text. It does however take longer to converge, being a regularization method.
 
--
cgit v1.2.3
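For readers of the patch above: the "raw+sampling" system in the changed paragraph relies on SentencePiece's subword sampling, where the same sentence is split differently each time it is encoded. Below is a minimal sketch of that behavior using the SentencePiece Python bindings; it is illustrative only, not part of the commit, and the model filename `enfi.spm` is a placeholder for whatever model the example trains.

```python
import sentencepiece as spm

# Load a trained SentencePiece model; "enfi.spm" is a placeholder path.
sp = spm.SentencePieceProcessor()
sp.load("enfi.spm")

sentence = "fluffy kittens"

# Deterministic (best) segmentation -- what plain encoding without sampling gives.
print(sp.encode_as_pieces(sentence))

# Sampled segmentations: nbest_size=-1 samples over all candidate splits,
# alpha controls the smoothness of the sampling distribution.
for _ in range(3):
    print(sp.sample_encode_as_pieces(sentence, -1, 0.1))
```

Because each pass over the data sees different splits of the same sentence, sampling acts as a regularizer, which is why the paragraph notes that the raw+sampling system takes longer to converge.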