# Example: Beating Edinburgh's WMT2017 system for en-de with Marian's Transformer model

Files and scripts in this folder show how to train a complete system that is better (!) than WMT-grade,
based on Google's Transformer model ([Vaswani et al., 2017](https://arxiv.org/abs/1706.03762))
and [Edinburgh's WMT2017 submission description](http://www.aclweb.org/anthology/W17-4739) for en-de.

This example is a combination of [Reproducing Edinburgh's WMT2017 system for en-de with Marian](../wmt2017-uedin/)
and the example for [Transformer training](../transformer).

The example script does the following:

* Downloads WMT2017 bilingual data for en-de
* Downloads a small subset of WMT2017 monolingual news data
* Preprocesses the above files to produce BPE-segmented training data
* Trains a shallow RNN de-en model for back-translation
* Translates 10M lines from de to en
* Trains 4 default transformer models for 8 epochs on the original training data augmented with the back-translated data
* Trains 4 default transformer models for 8 epochs on the same data with right-to-left orientation
* Produces n-best lists for the validation set (newstest-2016) and the 2014, 2015 and 2017 test sets using the ensemble of the 4 left-to-right models
* Rescores the n-best lists with the 4 right-to-left models
* Produces the final rescored and re-sorted outputs and scores them with [sacreBLEU](https://github.com/mjpost/sacreBLEU); see the sketch after this list
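
As an illustration of the last step, a detokenized system output can be scored against the WMT17 reference with sacreBLEU roughly as follows; the output file name is a placeholder, the actual paths are produced by the scripts:

```
# score a detokenized en-de output against the WMT17 reference
# (data/test2017.out is a placeholder, not the exact path written by run-me.sh)
cat data/test2017.out | sacrebleu -t wmt17 -l en-de
```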

Assuming four GPUs are available (here 0 1 2 3), execute the command below
to run the complete example:

```
./run-me.sh 0 1 2 3
```
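
The GPU ids given on the command line are forwarded to the `--devices` option of the Marian commands (cf. `--devices $GPUS` in the training command below), so the example should also run on fewer GPUs, only more slowly, e.g.:

```
./run-me.sh 0 1
```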

We assume GPUs with at least 12GB of RAM. For cards with less RAM, change the `WORKSPACE` setting in the script,
but be aware that this affects the effective batch size and might lead to slightly reduced quality.
The final system should be on par with or slightly better than the Edinburgh system due to better-tuned hyper-parameters.
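
For instance, on cards with 8GB of RAM one would lower the workspace value set near the top of `run-me.sh`; the number below is only an illustration, the actual default is defined in the script itself:

```
# GPU memory (in MB) reserved as Marian's workspace per device;
# lower this for cards with less than 12GB of RAM (7000 is just an example)
WORKSPACE=7000
```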

The model architecture matches the one described in Google's Transformer paper, while the training procedure follows the Edinburgh submission.
The model is configured as follows:

```
$MARIAN/build/marian \
    --model model/ens$i/model.npz --type transformer --pretrained-model mono/model.npz \
    --train-sets data/all.bpe.en data/all.bpe.de \
    --max-length 100 \
    --vocabs model/vocab.ende.yml model/vocab.ende.yml \
    --mini-batch-fit -w $WORKSPACE --mini-batch 1000 --maxi-batch 1000 \
    --valid-freq 5000 --save-freq 5000 --disp-freq 500 \
    --valid-metrics ce-mean-words perplexity translation \
    --valid-sets data/valid.bpe.en data/valid.bpe.de \
    --valid-script-path ./scripts/validate.sh \
    --valid-translation-output data/valid.bpe.en.output --quiet-translation \
    --beam-size 12 --normalize=1 \
    --valid-mini-batch 64 \
    --overwrite --keep-best \
    --early-stopping 5 --after-epochs $EPOCHS --cost-type=ce-mean-words \
    --log model/ens$i/train.log --valid-log model/ens$i/valid.log \
    --enc-depth 6 --dec-depth 6 \
    --tied-embeddings-all \
    --transformer-dropout 0.1 --label-smoothing 0.1 \
    --learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
    --optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
    --devices $GPUS --sync-sgd --seed $i$i$i$i  \
    --exponential-smoothing
```
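
To produce the n-best lists mentioned above, the 4 left-to-right models are passed together to `marian-decoder` as an ensemble. A minimal sketch, assuming the model and data layout of this example; the exact options and file names are set in `run-me.sh` and the helper scripts:

```
# sketch: decode a BPE-segmented test set with the ensemble of 4 models
# and emit an n-best list for later rescoring (paths are assumptions)
$MARIAN/build/marian-decoder \
    --models model/ens1/model.npz model/ens2/model.npz \
             model/ens3/model.npz model/ens4/model.npz \
    --vocabs model/vocab.ende.yml model/vocab.ende.yml \
    --input data/test2017.bpe.en \
    --beam-size 12 --normalize 1 --n-best \
    --devices $GPUS --mini-batch 16 --maxi-batch 100 \
    > data/test2017.bpe.en.nbest
```

The right-to-left models then rescore these n-best lists before the final re-sorting and evaluation.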

## Results

Running the complete script from start to end should result in numbers similar to the following:

| System | test2014 | test2015 | test2016 (valid) | test2017 |
|-----------------|----------|----------|------------------|----------|
| Edinburgh WMT17 | -- | -- | 36.20 | 28.30 |
| Example | 29.08 | 31.04 | 36.80 | 29.50 |

This improves on Edinburgh's WMT2017 submission by 1.2 BLEU on test2017. Training all components for more than 8 epochs is likely to improve results further. So could increasing the model dimensions, but that would require careful hyper-parameter tuning, especially of dropout regularization.