# Example for training with Marian and SentencePiece

In this example, we modify the Romanian-English example from `examples/training-basics` to use Taku Kudo's 
[SentencePiece](https://github.com/google/sentencepiece) instead of a complicated pre/post-processing pipeline. 
We also replace the evaluation scripts with Matt Post's [SacreBLEU](https://github.com/mjpost/sacreBLEU). 

## Building Marian with SentencePiece support

Since version 1.7.0, Marian has built-in support for SentencePiece,
but this needs to be enabled at compile time. We decided to make the compilation of SentencePiece
optional, as SentencePiece has a number of dependencies - especially Google's Protobuf - that
are potentially non-trivial to install.

Following the SentencePiece README, we list the packages you need to
install for a couple of Ubuntu versions:

On Ubuntu 14.04 LTS (Trusty Tahr):

```
sudo apt-get install libprotobuf8 protobuf-compiler libprotobuf-dev
```

On Ubuntu 16.04 LTS (Xenial Xerus):

```
sudo apt-get install libprotobuf9v5 protobuf-compiler libprotobuf-dev
```

On Ubuntu 17.10 (Artful Aardvark) and later:

```
sudo apt-get install libprotobuf10 protobuf-compiler libprotobuf-dev
```

For more details see the documentation in the SentencePiece repo:
https://github.com/marian-nmt/sentencepiece#c-from-source

With these dependencies met, you can compile Marian as follows:

```
git clone https://github.com/marian-nmt/marian
cd marian
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_SENTENCEPIECE=ON
make -j 8
```

To test whether `marian` has been compiled with SentencePiece support, run

```
./marian --help |& grep sentencepiece
```

which should display the following new options:

```
  --sentencepiece-alphas VECTOR ...     Sampling factors for SentencePieceVocab; i-th factor corresponds to i-th vocabulary
  --sentencepiece-options TEXT          Pass-through command-line options to SentencePiece trainer
  --sentencepiece-max-lines UINT=10000000
```

## Walkthrough

Files and scripts in this folder have been adapted from the Romanian-English
sample from https://github.com/rsennrich/wmt16-scripts. We also add the
back-translated data from
http://data.statmt.org/rsennrich/wmt16_backtranslations/ as described in
http://www.aclweb.org/anthology/W16-2323. The resulting system should be
competitive with, or even slightly better than, the system reported in the
Edinburgh WMT2016 paper.

Assuming you have one GPU, execute the complete example by typing:

```
./run-me.sh
```

which downloads the Romanian-English data and concatenates it into training files.
No preprocessing is required, as the Marian command trains a SentencePiece vocabulary
directly from the raw text.
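
A minimal sketch of what this download-and-concatenate step does is shown below; the URL and file names are illustrative placeholders only, the real ones are handled inside `run-me.sh`:

```
# illustrative sketch only -- run-me.sh performs the actual downloads
mkdir -p data

# download and unpack the parallel and back-translated data
# (the URL and archive name are placeholders, not the real locations)
wget -qO- http://example.org/ro-en-data.tgz | tar xzf - -C data

# concatenate all Romanian sides and all English sides into single raw training files
cat data/*.ro > data/corpus.ro
cat data/*.en > data/corpus.en
```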

To use a GPU other than device 0, or multiple GPUs (here 0 1 2 3), use the command below:

```
./run-me.sh 0 1 2 3
```

Next, the script executes a training run with `marian`. Note how the training command passes the 
raw training and validation data directly into Marian. A single joint SentencePiece model will be saved to 
`model/vocab.roen.spm`. 

```
$MARIAN/build/marian \
    --devices $GPUS \
    --type s2s \
    --model model/model.npz \
    --train-sets data/corpus.ro data/corpus.en \
    --vocabs model/vocab.roen.spm model/vocab.roen.spm \
    --sentencepiece-options '--normalization_rule_tsv=data/norm_romanian.tsv' \
    --dim-vocabs 32000 32000 \
    --mini-batch-fit -w 5000 \
    --layer-normalization --tied-embeddings-all \
    --dropout-rnn 0.2 --dropout-src 0.1 --dropout-trg 0.1 \
    --early-stopping 5 --max-length 100 \
    --valid-freq 10000 --save-freq 10000 --disp-freq 1000 \
    --cost-type ce-mean-words --valid-metrics ce-mean-words bleu-detok \
    --valid-sets data/newsdev2016.ro data/newsdev2016.en \
    --log model/train.log --valid-log model/valid.log --tempdir model \
    --overwrite --keep-best \
    --seed 1111 --exponential-smoothing \
    --normalize=0.6 --beam-size=6 --quiet-translation
```
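
If you have the standalone SentencePiece command-line tools installed (they are built from the SentencePiece repository and are not part of the Marian binary itself), you can inspect how the trained joint model segments raw text; the sample sentence below is just an illustration:

```
# encode a sample sentence with the trained joint SentencePiece model
echo "Aceasta este o propoziție de test." | spm_encode --model=model/vocab.roen.spm

# decoding the pieces reconstructs the original raw text
echo "Aceasta este o propoziție de test." \
    | spm_encode --model=model/vocab.roen.spm \
    | spm_decode --model=model/vocab.roen.spm
```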

After training (training stops once cross-entropy on the validation set
stops improving), the model with the highest validation BLEU score is used
to translate the WMT2016 dev set and test set with `marian-decoder`:

```
# translate dev set
cat data/newsdev2016.ro \
    | $MARIAN/build/marian-decoder -c model/model.npz.best-bleu-detok.npz.decoder.yml -d $GPUS -b 6 -n0.6 \
      --mini-batch 64 --maxi-batch 100 --maxi-batch-sort src > data/newsdev2016.ro.output

# translate test set
cat data/newstest2016.ro \
    | $MARIAN/build/marian-decoder -c model/model.npz.best-bleu-detok.npz.decoder.yml -d $GPUS -b 6 -n0.6 \
      --mini-batch 64 --maxi-batch 100 --maxi-batch-sort src > data/newstest2016.ro.output
```
after which BLEU scores for the dev and test sets are computed with SacreBLEU:

```
# calculate bleu scores on dev and test set
sacreBLEU/sacrebleu.py -t wmt16/dev -l ro-en < data/newsdev2016.ro.output
sacreBLEU/sacrebleu.py -t wmt16 -l ro-en < data/newstest2016.ro.output
```
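
Alternatively, if SacreBLEU is installed as a Python package (for example via `pip install sacrebleu`, which is an assumption here rather than part of this example), the same scores can be obtained with the `sacrebleu` console command:

```
# assumes a pip-installed sacrebleu available on the PATH
sacrebleu -t wmt16/dev -l ro-en < data/newsdev2016.ro.output
sacrebleu -t wmt16 -l ro-en < data/newstest2016.ro.output
```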