author     Taku Kudo <taku910@users.noreply.github.com>   2020-05-21 06:12:53 +0300
committer  GitHub <noreply@github.com>                    2020-05-21 06:12:53 +0300
commit     a32d7dc6ce6f383a65ad6e1cbe1983f94ab11932 (patch)
tree       37060c5b85180ac02142c871df35a1b60cac59d0
parent     c1fbda8995514172141c02356cdc9518a53aac94 (diff)
Update README.md
-rw-r--r--  python/README.md | 25 +
1 file changed, 25 insertions(+), 0 deletions(-)
diff --git a/python/README.md b/python/README.md
index 8ed7d9e..b683082 100644
--- a/python/README.md
+++ b/python/README.md
@@ -92,6 +92,31 @@ trainer_interface.cc(619) LOG(INFO) Saving vocabs: m.vocab
 >>>
 ```
+### Training without local filesystem
+The SentencePiece trainer accepts any iterable object as the source of training sentences. You can also pass a file-like object (any instance with a write() method) to emit the output model to any destination. These features are useful when running SentencePiece in environments with limited access to the local file system (e.g., Google Colab).
+
+```
+import urllib.request
+import io
+import sentencepiece as spm
+
+# Load the training corpus from a URL as an iterator and store the model in a BytesIO object.
+model = io.BytesIO()
+with urllib.request.urlopen(
+    'https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt'
+) as response:
+  spm.SentencePieceTrainer.train(
+      sentence_iterator=response, model_writer=model, vocab_size=1000)
+
+# Serialize the model to a file.
+# with open('out.model', 'wb') as f:
+#   f.write(model.getvalue())
+
+# Directly load the model from the serialized bytes.
+sp = spm.SentencePieceProcessor(model_proto=model.getvalue())
+print(sp.encode('this is test'))
+```
+
 ### Segmentation (old interface)
 ```