github.com/marian-nmt/sentencepiece.git
author	Taku Kudo <taku910@users.noreply.github.com>	2018-06-08 18:26:48 +0300
committer	GitHub <noreply@github.com>	2018-06-08 18:26:48 +0300
commit	53e4ae415e641d1e9211a38654ef11cc5817b699 (patch)
tree	cfdb19d0595e4e84e7ced91a3e58e8c0137cc100 /README.md
parent	b35f05dda37b7395545d7c2a098a26a2d069d690 (diff)
Update README.md
Diffstat (limited to 'README.md')
-rw-r--r--	README.md	27
1 file changed, 24 insertions(+), 3 deletions(-)
diff --git a/README.md b/README.md
index 3aa8b77..cf56c85 100644
--- a/README.md
+++ b/README.md
@@ -247,15 +247,36 @@ You can find that the original input sentence is restored from the vocabulary id
```<output file>``` stores a list of vocabulary and emission log probabilities. The vocabulary id corresponds to the line number in this file.
### Redefine special meta tokens
- By default, SentencePiece uses Unknown (&lt;unk&gt;), BOS (&lt;s&gt;) and EOS (&lt;/s&gt;) tokens which have the ids of 0, 1, and 2 respectively. We can redefine this mapping in training phase as follows.
+ By default, SentencePiece uses Unknown (&lt;unk&gt;), BOS (&lt;s&gt;) and EOS (&lt;/s&gt;) tokens which have the ids of 0, 1, and 2 respectively. We can redefine this mapping in the training phase as follows.
```
-% spm_train --bos_id=0 --eos_id=1 --unk_id=2 --input=... --model_prefix=...
+% spm_train --bos_id=0 --eos_id=1 --unk_id=5 --input=... --model_prefix=...
```
-When setting -1 id e.g., ```bos_id=-1```, this special token is disabled. Note that the unknow id cannot be removed. In addition, these ids must start with 0 and be continous. We can define an id for padding (&lt;pad&gt;). Padding id is disabled by default. You can assign an id as ```--pad_id=3```.  
+When an id is set to -1, e.g., ```--bos_id=-1```, that special token is disabled. Note that the unknown id cannot be disabled. We can define an id for padding (&lt;pad&gt;) as ```--pad_id=3```.  
If you want to assign other special tokens, please see [Use custom symbols](doc/special_symbols.md).
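
As an illustration of the id-assignment rules above, here is a small pure-Python sketch (a hypothetical helper, not part of the SentencePiece library) that mimics how the special-token ids behave: each token gets a user-chosen id, -1 disables a token, and &lt;unk&gt; can never be disabled.

```python
# Illustrative sketch of the special-token id rules described above.
# Hypothetical helper, not the real SentencePiece implementation.

def assign_special_ids(unk_id=0, bos_id=1, eos_id=2, pad_id=-1):
    if unk_id < 0:
        raise ValueError("the unknown id cannot be disabled")
    ids = {"<unk>": unk_id, "<s>": bos_id, "</s>": eos_id, "<pad>": pad_id}
    # Drop disabled tokens (id == -1) from the mapping.
    return {tok: i for tok, i in ids.items() if i >= 0}

# Default mapping: <unk>=0, <s>=1, </s>=2, <pad> disabled.
print(assign_special_ids())
# Redefined mapping, matching the spm_train example above.
print(assign_special_ids(unk_id=5, bos_id=0, eos_id=1, pad_id=3))
```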
+### Vocabulary restriction
+```spm_encode``` accepts ```--vocabulary``` and ```--vocabulary_threshold``` options so that ```spm_encode``` will only produce symbols which also appear in the vocabulary (with at least some frequency). The background of this feature is described in the [subword-nmt page](https://github.com/rsennrich/subword-nmt#best-practice-advice-for-byte-pair-encoding-in-nmt).
+
+The usage is basically the same as that of ```subword-nmt```. Assuming that L1 and L2 are the two languages (source/target languages), train the shared spm model and obtain the resulting vocabulary for each:
+
+```
+% cat {train_file}.L1 {train_file}.L2 | shuffle > train
+% spm_train --input=train --model_prefix=spm --vocab_size=8000
+% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L1 > {vocab_file}.L1
+% spm_encode --model=spm.model --generate_vocabulary < {train_file}.L2 > {vocab_file}.L2
+```
+
+The ```shuffle``` command is used just in case, because ```spm_train``` loads only the first 10M lines of the corpus by default.
+
+
+Then segment the train/test corpus with the ```--vocabulary``` option:
+```
+% spm_encode --model=spm.model --vocabulary={vocab_file}.L1 --vocabulary_threshold=50 < {test_file}.L1 > {test_file}.seg.L1
+% spm_encode --model=spm.model --vocabulary={vocab_file}.L2 --vocabulary_threshold=50 < {test_file}.L2 > {test_file}.seg.L2
+```
+
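
The core of the vocabulary-restriction idea can be sketched in plain Python: count the symbols that segmentation produced on the training data, then keep only symbols whose frequency meets the threshold. The helpers below are hypothetical, stdlib-only illustrations of that filtering step, not the real ```spm_encode``` tool.

```python
# Pure-Python sketch of vocabulary restriction: build a symbol-frequency
# table from segmented training data, then filter it by a threshold,
# which is the essence of --vocabulary / --vocabulary_threshold.
from collections import Counter

def build_vocab(segmented_lines):
    """Count symbol frequencies in whitespace-segmented output."""
    counts = Counter()
    for line in segmented_lines:
        counts.update(line.split())
    return counts

def restrict_vocab(counts, threshold):
    """Keep only symbols seen at least `threshold` times."""
    return {sym for sym, c in counts.items() if c >= threshold}

# Toy segmented corpus standing in for spm_encode output on {train_file}.L1.
train_l1 = ["▁he llo ▁world", "▁he llo ▁there"]
vocab = build_vocab(train_l1)
print(restrict_vocab(vocab, 2))  # only symbols appearing twice or more survive
```

Symbols that fall below the threshold are dropped from the vocabulary, so at encoding time they would be segmented further into smaller in-vocabulary pieces.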
## Advanced topics
* [SentencePiece Experiments](doc/experiments.md)