From 6ab33f71542e48d0c47628281a3ff6776dacd1f0 Mon Sep 17 00:00:00 2001
From: Marcin Junczys-Dowmunt
Date: Mon, 26 Nov 2018 11:09:02 -0800
Subject: add comment on training time for sentencepiece models

---
 training-basics-sentencepiece/README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/training-basics-sentencepiece/README.md b/training-basics-sentencepiece/README.md
index 176ef1b..5841db7 100644
--- a/training-basics-sentencepiece/README.md
+++ b/training-basics-sentencepiece/README.md
@@ -192,7 +192,9 @@ raw training and validation data into Marian. A single joint SentencePiece model
 `model/vocab.roen.spm`. The `*.spm` suffix is required and tells Marian to train a SentencePiece
 vocabulary. When the same vocabulary file is specified multiple times - like in this example - a single
 vocabulary is built for the union of the corresponding training files. This also enables us to use
-tied embeddings (`--tied-embeddings-all`).
+tied embeddings (`--tied-embeddings-all`). The SentencePiece training process takes a couple of
+minutes depending on the input data size. The same `*.spm` can be later reused for other experiments
+with the same language pair and training is then of course omitted.
 
 We can pass the Romanian-specific normalizaton rules via the `--sentencepiece-options` command line
 argument. The values of this option are passed on to the SentencePiece trainer, note the required single
-- 
cgit v1.2.3