github.com/marian-nmt/sentencepiece.git
author Taku Kudo <taku910@users.noreply.github.com> 2020-05-21 06:12:53 +0300
committer GitHub <noreply@github.com> 2020-05-21 06:12:53 +0300
commit a32d7dc6ce6f383a65ad6e1cbe1983f94ab11932
tree 37060c5b85180ac02142c871df35a1b60cac59d0
parent c1fbda8995514172141c02356cdc9518a53aac94
Update README.md
-rw-r--r-- python/README.md 25
1 files changed, 25 insertions, 0 deletions
diff --git a/python/README.md b/python/README.md
index 8ed7d9e..b683082 100644
--- a/python/README.md
+++ b/python/README.md
@@ -92,6 +92,31 @@ trainer_interface.cc(619) LOG(INFO) Saving vocabs: m.vocab
>>>
```
+### Training without local filesystem
+The SentencePiece trainer accepts any iterable object as the source of training sentences. You can also pass a file-like object (any instance with a write() method) to receive the output model, so the model can be written to any destination. These features are useful when running SentencePiece in environments that have limited access to the local file system (e.g., Google Colab).
+
+```
+import urllib.request
+import io
+import sentencepiece as spm
+
+# Loads the training corpus from a URL as an iterator and stores the trained model in a BytesIO.
+model = io.BytesIO()
+with urllib.request.urlopen(
+    'https://raw.githubusercontent.com/google/sentencepiece/master/data/botchan.txt'
+) as response:
+  spm.SentencePieceTrainer.train(
+      sentence_iterator=response, model_writer=model, vocab_size=1000)
+
+# Serialize the model to a file.
+# with open('out.model', 'wb') as f:
+#   f.write(model.getvalue())
+
+# Load the model directly from the serialized bytes.
+sp = spm.SentencePieceProcessor(model_proto=model.getvalue())
+print(sp.encode('this is test'))
+```
+
### Segmentation (old interface)
```