author | Taku Kudo <taku910@users.noreply.github.com> | 2020-05-12 21:06:12 +0300 |
---|---|---|
committer | GitHub <noreply@github.com> | 2020-05-12 21:06:12 +0300 |
commit | 6f584362d0c319b0e38260ec2d99ed8e6d485670 (patch) | |
tree | 5498a3ae4bc99e09def4251b748e34994277fd3f | |
parent | 9907536ef68e0d6b15dc6092d51ad0e6f236eafe (diff) | |
Update README.md
-rw-r--r-- | python/README.md | 79 |
1 file changed, 72 insertions(+), 7 deletions(-)
````diff
diff --git a/python/README.md b/python/README.md
index 81eb850..8ed7d9e 100644
--- a/python/README.md
+++ b/python/README.md
@@ -1,10 +1,6 @@
 # SentencePiece Python Wrapper
 
-Python wrapper for SentencePiece with SWIG. This module wraps sentencepiece::SentencePieceProcessor class with the following modifications:
-* Encode and Decode methods are re-defined as EncodeAsIds, EncodeAsPieces, DecodeIds and DecodePieces respectively.
-* Support model training with SentencePieceTrainer.Train method.
-* SentencePieceText proto is not supported.
-* Added __len__ and __getitem__ methods. len(obj) and obj[key] returns vocab size and vocab id respectively.
+Python wrapper for SentencePiece. This API will offer the encoding, decoding and training of Sentencepiece.
 
 ## Build and Install SentencePiece
 For Linux (x64/i686), macOS, and Windows(win32/x64) environment, you can simply use pip command to install SentencePiece python module.
@@ -26,12 +22,81 @@ If you don’t have write permission to the global site-packages directory or do
 ## Usage
 
-See [this google colab page](https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb) to run sentencepiece interactively.
+See [this google colab page](https://github.com/google/sentencepiece/blob/master/python/sentencepiece_python_module_example.ipynb) to run sentencepiece interactively. (Note: this sample is written in old interface.)
 
 ### Segmentation
 
 ```
 % python
 >>> import sentencepiece as spm
+>>> sp = spm.SentencePieceProcessor(model_file='test/test_model.model')
+>>> sp.encode('This is a test')
+[284, 47, 11, 4, 15, 400]
+>>> sp.encode(['This is a test', 'Hello world'], out_type=int)
+[[284, 47, 11, 4, 15, 400], [151, 88, 21, 887]]
+>>> sp.encode('This is a test', out_type=str)
+['▁This', '▁is', '▁a', '▁', 't', 'est']
+>>> sp.encode(['This is a test', 'Hello world'], out_type=str)
+[['▁This', '▁is', '▁a', '▁', 't', 'est'], ['▁He', 'll', 'o', '▁world']]
+>>> for _ in range(10):
+...     sp.encode('This is a test', out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
+...
+['▁', 'This', '▁', 'is', '▁a', '▁', 't', 'e', 'st']
+['▁T', 'h', 'i', 's', '▁is', '▁a', '▁', 'te', 's', 't']
+['▁T', 'h', 'is', '▁', 'is', '▁', 'a', '▁', 't', 'est']
+['▁', 'This', '▁is', '▁', 'a', '▁', 't', 'e', 'st']
+['▁', 'This', '▁', 'is', '▁', 'a', '▁', 't', 'e', 's', 't']
+['▁This', '▁is', '▁a', '▁', 'te', 's', 't']
+['▁This', '▁is', '▁', 'a', '▁', 't', 'e', 'st']
+['▁', 'T', 'h', 'is', '▁', 'is', '▁', 'a', '▁', 'te', 'st']
+['▁', 'This', '▁', 'i', 's', '▁a', '▁', 't', 'e', 'st']
+['▁This', '▁', 'is', '▁a', '▁', 't', 'est']
+>>> sp.decode([284, 47, 11, 4, 15, 400])
+'This is a test'
+>>> sp.decode([[284, 47, 11, 4, 15, 400], [151, 88, 21, 887]])
+['This is a test', 'Hello world']
+>>> sp.decode(['▁', 'This', '▁', 'is', '▁a', '▁', 't', 'e', 'st'])
+'This is a test'
+>>> sp.decode([['▁This', '▁is', '▁a', '▁', 't', 'est'], ['▁He', 'll', 'o', '▁world']])
+['This is a test', 'Hello world']
+>>> sp.get_piece_size()
+1000
+>>> sp.id_to_piece(2)
+'</s>'
+>>> sp.id_to_piece([2, 3, 4])
+['</s>', '\r', '▁']
+>>> sp.piece_to_id('<s>')
+1
+>>> sp.piece_to_id(['</s>', '\r', '▁'])
+[2, 3, 4]
+>>> len(sp)
+1000
+>>> sp['</s>']
+2
+```
+
+### Model Training
+Training is performed by passing parameters of [spm_train](https://github.com/google/sentencepiece#train-sentencepiece-model) to SentencePieceTrainer.train() function.
+
+```
+>>> import sentencepiece as spm
+>>> spm.SentencePieceTrainer.train(input='test/botchan.txt', model_prefix='m', vocab_size=1000, user_defined_symbols=['foo', 'bar'])
+sentencepiece_trainer.cc(73) LOG(INFO) Starts training with :
+trainer_spec {
+  input: test/botchan.txt
+  ..snip
+unigram_model_trainer.cc(500) LOG(INFO) EM sub_iter=1 size=1188 obj=10.2839 num_tokens=32182 num_tokens/piece=27.0892
+unigram_model_trainer.cc(500) LOG(INFO) EM sub_iter=0 size=1100 obj=10.4269 num_tokens=33001 num_tokens/piece=30.0009
+unigram_model_trainer.cc(500) LOG(INFO) EM sub_iter=1 size=1100 obj=10.4069 num_tokens=33002 num_tokens/piece=30.0018
+trainer_interface.cc(595) LOG(INFO) Saving model: m.model
+trainer_interface.cc(619) LOG(INFO) Saving vocabs: m.vocab
+>>>
+```
+
+
+### Segmentation (old interface)
+```
+% python
+>>> import sentencepiece as spm
 >>> sp = spm.SentencePieceProcessor()
 >>> sp.Load("test/test_model.model")
 True
@@ -70,7 +135,7 @@ True
 2
 ```
 
-### Model Training
+### Model Training (old interface)
 Training is performed by passing parameters of [spm_train](https://github.com/google/sentencepiece#train-sentencepiece-model) to SentencePieceTrainer.Train() function.
 
 ```
````