github.com/marian-nmt/sentencepiece.git
author     Taku Kudo <taku910@users.noreply.github.com>  2018-05-01 15:42:24 +0300
committer  GitHub <noreply@github.com>  2018-05-01 15:42:24 +0300
commit     8b7ed4a1d6a5a39b07cccf1abac712cdfaae2dac (patch)
tree       86ff38f625757aff3bc502174a97f73fa2292804 /README.md
parent     eb9c6095d389866055e66d951eee9691212331a4 (diff)
Update README.md
Diffstat (limited to 'README.md')
-rw-r--r--  README.md  8
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/README.md b/README.md
index 509a58c..a8562ac 100644
--- a/README.md
+++ b/README.md
@@ -39,8 +39,8 @@ Note that BPE algorithm used in WordPiece is slightly different from the original
## Overview
### What is SentencePiece?
-SentencePiece is an unsupervised text tokenizer and detokenizer designed mainly for Neural Network-based text generation, for example Neural Network Machine Translation. SentencePiece is a re-implementation of **sub-word units** (e.g., **byte-pair-encoding (BPE)** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)]) and
-**unigram language model** [[Kudo.](http://acl2018.org/conference/accepted-papers/)]). Unlike previous sub-word approaches that train tokenizers from pretokenized sentences, SentencePiece directly trains the tokenizer and detokenizer from raw sentences. SentencePiece might seem like a sort of unsupervised word segmentation, but there are several differences and constraints in SentencePiece.
+SentencePiece is a re-implementation of **sub-word units**, an effective way to alleviate the open vocabulary
+ problems in neural machine translation. SentencePiece supports two segmentation algorithms: **byte-pair-encoding (BPE)** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and the **unigram language model** [[Kudo.](http://acl2018.org/conference/accepted-papers/)]. Here are the high-level differences from other implementations.
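(Not part of the commit itself, but as a minimal sketch of the two segmentation algorithms named above, assuming the `sentencepiece` Python bindings are installed; `corpus.txt` and the `m_*` model prefixes are hypothetical file names.)

```python
import sentencepiece as spm

# Train two models on the same raw corpus, differing only in the
# segmentation algorithm (hypothetical files: corpus.txt, m_bpe, m_unigram).
spm.SentencePieceTrainer.Train(
    '--input=corpus.txt --model_prefix=m_bpe --vocab_size=8000 --model_type=bpe')
spm.SentencePieceTrainer.Train(
    '--input=corpus.txt --model_prefix=m_unigram --vocab_size=8000 --model_type=unigram')

sp = spm.SentencePieceProcessor()
sp.Load('m_unigram.model')
print(sp.EncodeAsPieces('Hello world.'))  # e.g. ['▁Hello', '▁world', '.']
```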
#### The number of unique tokens is predetermined
Neural Machine Translation models typically operate with a fixed
@@ -52,6 +52,10 @@ Note that SentencePiece specifies the final vocabulary size for training, which
[subword-nmt](https://github.com/rsennrich/subword-nmt) that uses the number of merge operations.
The number of merge operations is a BPE-specific parameter and not applicable to other segmentation algorithms, including unigram, word and character.
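(A small sketch of this point, assuming the Python bindings and the hypothetical model trained in the earlier example: the value passed as `--vocab_size` is the final vocabulary size itself, whereas subword-nmt's merge-operation count only indirectly determines it.)

```python
import sentencepiece as spm

# --vocab_size fixes the size of the final vocabulary directly
# (hypothetical model file from the earlier training sketch).
sp = spm.SentencePieceProcessor()
sp.Load('m_unigram.model')
assert sp.GetPieceSize() == 8000  # the value requested at training time
```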
+#### Trains from raw sentences
+Previous sub-word implementations assume that the input sentences are pre-tokenized. This constraint was required for efficient training, but it makes preprocessing complicated because we have to run language-dependent tokenizers in advance.
+The implementation of SentencePiece is fast enough to train the model from raw sentences. This is useful for training the tokenizer and detokenizer for Chinese, Japanese, and Korean, where no explicit spaces exist between words.
+
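(As a rough illustration of training directly from raw text with the Python bindings; the Japanese corpus file `ja_raw.txt` and the `m_ja` model prefix are hypothetical. No external tokenizer is run beforehand, and `DecodePieces` restores the original sentence.)

```python
import sentencepiece as spm

# Train directly on raw, untokenized text (hypothetical file ja_raw.txt);
# no language-dependent pre-tokenizer such as MeCab is required.
spm.SentencePieceTrainer.Train(
    '--input=ja_raw.txt --model_prefix=m_ja --vocab_size=8000')

sp = spm.SentencePieceProcessor()
sp.Load('m_ja.model')
pieces = sp.EncodeAsPieces('吾輩は猫である。')
print(pieces)                   # segmentation learned from raw sentences
print(sp.DecodePieces(pieces))  # reversible: reconstructs the input string
```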
#### Whitespace is treated as a basic symbol
The first step of Natural Language processing is text tokenization. For
example, a standard English tokenizer would segment the text "Hello world." into the