author | Matthew Mistele &lt;mmistele@stanford.edu&gt; | 2018-07-16 20:28:42 +0300
---|---|---
committer | GitHub &lt;noreply@github.com&gt; | 2018-07-16 20:28:42 +0300
commit | 59be29a7271109cf2f1f16d3e7cac1ad89ff0a73 (patch) |
tree | 78b110b97c847586338765778a212991118de2cc /README.md |
parent | d1151a04ccf6b866f26ebb261ae2131296087743 (diff) |
Typo fix in README
Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 2 |
1 file changed, 1 insertion(+), 1 deletion(-)
@@ -18,7 +18,7 @@ with the extension of direct training from raw sentences. SentencePiece allows u
 
 ## Technical highlights
 - **Purely data driven**: SentencePiece trains tokenization and detokenization
-  models from from sentences. Pre-tokenization ([Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl)/[MeCab](http://taku910.github.io/mecab/)/[KyTea](http://www.phontron.com/kytea/)) is not always required.
+  models from sentences. Pre-tokenization ([Moses tokenizer](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl)/[MeCab](http://taku910.github.io/mecab/)/[KyTea](http://www.phontron.com/kytea/)) is not always required.
 - **Language independent**: SentencePiece treats the sentences just as sequences of Unicode characters. There is no language-dependent logic.
 - **Multiple subword algorithms**: **BPE** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)] and **unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)] are supported.
 - **Subword regularization**: SentencePiece implements subword sampling for [subword regularization](https://arxiv.org/abs/1804.10959) which helps to improve the robustness and accuracy of NMT models.
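For context on the "Subword regularization" bullet touched by this hunk: it refers to sampling a different segmentation of the same sentence on each encoding pass. A minimal sketch of what that looks like with the `sentencepiece` Python package (current API, which postdates this 2018 commit; the model file name `m.model` and the input sentence are placeholders):

```python
import sentencepiece as spm

# Load a previously trained model ("m.model" is a placeholder path).
sp = spm.SentencePieceProcessor(model_file="m.model")

# Deterministic encoding: always returns the single best segmentation.
print(sp.encode("New York is cold in winter.", out_type=str))

# Subword regularization: sample a segmentation from the unigram LM.
# nbest_size=-1 samples over all hypotheses; alpha smooths the
# sampling distribution (smaller alpha -> more uniform sampling).
for _ in range(3):
    print(sp.encode("New York is cold in winter.", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
```

Because each epoch can see a different tokenization of the same training sentence, this acts as a data-augmentation-style regularizer for NMT models, which is the robustness benefit the README bullet cites.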