github.com/marian-nmt/sentencepiece.git
author    Taku Kudo <taku910@users.noreply.github.com>  2020-10-02 18:00:38 +0300
committer GitHub <noreply@github.com>  2020-10-02 18:00:38 +0300
commit    4e32230a06164a1349a091f0951c710b06d59254 (patch)
tree      6d5ccae92e892eed9317198dc940198a45b27f71
parent    3b70b1a07344768df3b0236d8c94af682ea9e847 (diff)
parent    ac31887d465a9253b2f4e68943eccf90ba11b401 (diff)
Merge pull request #551 from stephantul/master
Add options list for training to documentation
-rw-r--r--  README.md       12
-rw-r--r--  doc/options.md  51
2 files changed, 57 insertions(+), 6 deletions(-)
diff --git a/README.md b/README.md
index 0c0ef9f..5543d79 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@
SentencePiece is an unsupervised text tokenizer and detokenizer mainly for
Neural Network-based text generation systems where the vocabulary size
is predetermined prior to the neural model training. SentencePiece implements
-**subword units** (e.g., **byte-pair-encoding (BPE)** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)]) and
+**subword units** (e.g., **byte-pair-encoding (BPE)** [[Sennrich et al.](http://www.aclweb.org/anthology/P16-1162)]) and
**unigram language model** [[Kudo.](https://arxiv.org/abs/1804.10959)]
with the extension of direct training from raw sentences. SentencePiece allows us to make a purely end-to-end system that does not depend on language-specific pre/postprocessing.
@@ -100,7 +100,7 @@ special symbol. Tokenized sequences do not preserve the necessary information to
Subword regularization [[Kudo.](https://arxiv.org/abs/1804.10959)] and BPE-dropout [[Provilkov et al.](https://arxiv.org/abs/1910.13267)] are simple regularization methods
that virtually augment training data with on-the-fly subword sampling, which helps to improve the accuracy as well as robustness of NMT models.
-To enable subword regularization, you need to integrate the SentencePiece library
+To enable subword regularization, you need to integrate the SentencePiece library
([C++](doc/api.md#sampling-subword-regularization)/[Python](python/README.md)) into the NMT system to sample one segmentation for each parameter update, which is different from the standard off-line data preparation. Here is an example with the [Python library](python/README.md). You can see that 'New York' is segmented differently on each ``SampleEncode (C++)`` or ``encode with enable_sampling=True (Python)`` call. The details of the sampling parameters are found in [sentencepiece_processor.h](src/sentencepiece_processor.h).
```
@@ -108,7 +108,7 @@ To enable subword regularization, you would like to integrate SentencePiece libr
>>> s = spm.SentencePieceProcessor(model_file='spm.model')
>>> for n in range(5):
... s.encode('New York', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
-...
+...
['▁', 'N', 'e', 'w', '▁York']
['▁', 'New', '▁York']
['▁', 'New', '▁Y', 'o', 'r', 'k']
@@ -179,7 +179,7 @@ See [tensorflow/README.md](tensorflow/README.md)
* `--character_coverage`: amount of characters covered by the model. Good defaults are `0.9995` for languages with a rich character set like Japanese or Chinese, and `1.0` for other languages with a small character set.
* `--model_type`: model type. Choose from `unigram` (default), `bpe`, `char`, or `word`. The input sentence must be pretokenized when using `word` type.
-Use the `--help` flag to display all parameters for training.
+Use the `--help` flag to display all parameters for training, or see [here](doc/options.md) for an overview.
### Encode raw text into sentence pieces/ids
```
@@ -239,9 +239,9 @@ You can find that the original input sentence is restored from the vocabulary id
### Redefine special meta tokens
By default, SentencePiece uses Unknown (&lt;unk&gt;), BOS (&lt;s&gt;) and EOS (&lt;/s&gt;) tokens which have the ids of 0, 1, and 2 respectively. We can redefine this mapping in the training phase as follows.
-
+
```
-% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...
+% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=... --character_coverage=...
```
When an id is set to -1, e.g., ```bos_id=-1```, the corresponding special token is disabled. Note that the unknown id cannot be disabled. We can define an id for padding (&lt;pad&gt;) as ```--pad_id=3```.  
diff --git a/doc/options.md b/doc/options.md
new file mode 100644
index 0000000..7861fdc
--- /dev/null
+++ b/doc/options.md
@@ -0,0 +1,51 @@
+# Training options
+
+The training options for `spm_train` can be listed with `spm_train --help`. Since the standard `pip install` of sentencepiece does not necessarily install `spm_train`, the options are also listed here.
+
+```
+--help (show help) type: bool default: false
+--version (show version) type: bool default: false
+--minloglevel (Messages logged at a lower level than this don't actually get logged anywhere) type: int default: 0
+--input (comma separated list of input sentences) type: std::string default: ""
+--input_format (Input format. Supported format is `text` or `tsv`.) type: std::string default: ""
+--model_prefix (output model prefix) type: std::string default: ""
+--model_type (model algorithm: unigram, bpe, word or char) type: std::string default: "unigram"
+--vocab_size (vocabulary size) type: int32 default: 8000
+--accept_language (comma-separated list of languages this model can accept) type: std::string default: ""
+--self_test_sample_size (the size of self test samples) type: int32 default: 0
+--character_coverage (character coverage to determine the minimum symbols) type: double default: 0.9995
+--input_sentence_size (maximum size of sentences the trainer loads) type: int32 default: 0
+--shuffle_input_sentence (Randomly sample input sentences in advance. Valid when --input_sentence_size > 0) type: bool default: true
+--seed_sentencepiece_size (the size of seed sentencepieces) type: int32 default: 1000000
+--shrinking_factor (Keeps top shrinking_factor pieces with respect to the loss) type: double default: 0.75
+--num_threads (number of threads for training) type: int32 default: 16
+--num_sub_iterations (number of EM sub-iterations) type: int32 default: 2
+--max_sentencepiece_length (maximum length of sentence piece) type: int32 default: 16
+--max_sentence_length (maximum length of sentence in byte) type: int32 default: 4192
+--split_by_unicode_script (use Unicode script to split sentence pieces) type: bool default: true
+--split_by_number (split tokens by numbers (0-9)) type: bool default: true
+--split_by_whitespace (use a white space to split sentence pieces) type: bool default: true
+--split_digits (split all digits (0-9) into separate pieces) type: bool default: false
+--treat_whitespace_as_suffix (treat whitespace marker as suffix instead of prefix.) type: bool default: false
+--control_symbols (comma separated list of control symbols) type: std::string default: ""
+--user_defined_symbols (comma separated list of user defined symbols) type: std::string default: ""
+--required_chars (UTF8 characters in this flag are always used in the character set regardless of --character_coverage) type: std::string default: ""
+--byte_fallback (decompose unknown pieces into UTF-8 byte pieces) type: bool default: false
+--vocabulary_output_piece_score (Define score in vocab file) type: bool default: true
+--normalization_rule_name (Normalization rule name. Choose from nfkc or identity) type: std::string default: "nmt_nfkc"
+--normalization_rule_tsv (Normalization rule TSV file.) type: std::string default: ""
+--denormalization_rule_tsv (Denormalization rule TSV file.) type: std::string default: ""
+--add_dummy_prefix (Add dummy whitespace at the beginning of text) type: bool default: true
+--remove_extra_whitespaces (Removes leading, trailing, and duplicate internal whitespace) type: bool default: true
+--hard_vocab_limit (If set to false, --vocab_size is considered as a soft limit.) type: bool default: true
+--use_all_vocab (If set to true, use all tokens as vocab. Valid for word/char models.) type: bool default: false
+--unk_id (Override UNK (<unk>) id.) type: int32 default: 0
+--bos_id (Override BOS (<s>) id. Set -1 to disable BOS.) type: int32 default: 1
+--eos_id (Override EOS (</s>) id. Set -1 to disable EOS.) type: int32 default: 2
+--pad_id (Override PAD (<pad>) id. Set -1 to disable PAD.) type: int32 default: -1
+--unk_piece (Override UNK (<unk>) piece.) type: std::string default: "<unk>"
+--bos_piece (Override BOS (<s>) piece.) type: std::string default: "<s>"
+--eos_piece (Override EOS (</s>) piece.) type: std::string default: "</s>"
+--pad_piece (Override PAD (<pad>) piece.) type: std::string default: "<pad>"
+--unk_surface (Dummy surface string for <unk>. In decoding <unk> is decoded to `unk_surface`.) type: std::string default: " ⁇ "
+--train_extremely_large_corpus (Increase bit depth for unigram tokenization.) type: bool default: false
+```