# Intro to Transformers

This tutorial is designed to help you train your first machine translation model. To follow along, you'll need a Linux-based system and an NVIDIA GPU. In this example we will use Marian to create an English-German translation system.

We'll follow a simple pipeline: data acquisition, some basic corpus cleaning, vocabulary generation with [SentencePiece], training of a transformer model, and evaluation with [sacreBLEU] and (optionally) [Comet].

We'll be using a subset of data from the WMT21 [news task] to train our model. For the validation and test sets, we'll use the test sets from WMT19 and WMT20, respectively.

Let's get started by installing our dependencies!

## Install requirements

If you haven't installed the common tools for `marian-examples`, you can do so by going to the `tools/` folder in the root of the repository and running `make`.

```shell
cd ../tools
make all
cd -
```

In this example, we'll be using some [scripts](https://github.com/marian-nmt/moses-scripts) from [Moses]. We'll also use [sacreBLEU] and [Comet] from Python pip. To install these in a virtual environment, execute:

```shell
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

You can skip the first two of these commands if you don't want to use a virtual environment. Next we'll install Marian!

## Getting Marian

The development version of Marian can be obtained with

```shell
git clone https://github.com/marian-nmt/marian-dev
cd marian-dev
```

### Compile

To compile Marian we need to ensure we have the required packages. The list of requirements can be found in the [documentation][install_marian]. Since we're using SentencePiece, we also need to make sure we satisfy its [requirements][install_sentencepiece]. Then we can compile with

```shell
mkdir build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_SENTENCEPIECE=ON
cmake --build .
```

To speed up compilation we can use `cmake --build . -j 8` to run 8 tasks simultaneously. You may need to reduce this based on your system's CPU count and available memory.

If compilation succeeded, running

```shell
./marian --version
```

will return the version you've compiled. To verify that SentencePiece support was enabled, running

```shell
./marian --help |& grep sentencepiece
```

will display the SentencePiece-specific options:

```
--sentencepiece-alphas VECTOR ...       Sampling factors for SentencePiece vocabulary; i-th factor corresponds to i-th vocabulary
--sentencepiece-options TEXT            Pass-through command-line options to SentencePiece trainer
--sentencepiece-max-lines UINT=2000000  Maximum lines to train SentencePiece vocabulary, selected with sampling from all data.
                                        When set to 0 all lines are going to be used.
```

## Running the Example

The entire example can be run end-to-end by executing

```shell
./run-me.sh
```

This will acquire the data and then apply cleaning. It uses the resulting corpus to train a transformer model, which is evaluated with sacreBLEU. By default, `run-me.sh` will run on a single GPU (device 0). To use a different set of GPUs, pass their IDs as arguments, e.g. to train using 4 GPUs:

```shell
./run-me.sh 0 1 2 3
```

You can also run the commands from `run-me.sh` manually yourself. We'll walk through the different commands in the sections below. These commands assume that Marian is compiled and accessible at `../../build/marian`, and that the `data/`, `scripts/` and `model/` directories live at the same level as this README file.
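The commands below use `$MARIAN` and `$MOSES_SCRIPTS` as shorthand for the Marian build directory and the Moses helper scripts. A minimal sketch of setting these up is shown here; the Moses path assumes the scripts were checked out into `../tools/moses-scripts` by the `make` step above, so adjust both paths to match your own layout.

```shell
# Shorthand paths used throughout the walkthrough (adjust to your setup).
# MARIAN points at the directory containing the compiled binaries;
# MOSES_SCRIPTS at the scripts/ folder of the Moses scripts checkout.
export MARIAN=../../build
export MOSES_SCRIPTS=../tools/moses-scripts/scripts

# Create the working directories the later commands write into,
# in case they don't exist yet.
mkdir -p data model evaluation
```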
## Acquire data

We'll acquire a subset of the data from the WMT21 [news task]. In particular we'll make use of the following English-German parallel corpora:

| Dataset              |     Sentences |
|----------------------|--------------:|
| Europarl v10         |     1,828,521 |
| News Commentary v16  |       398,981 |
| Common Crawl corpus  |     2,399,123 |
| **Total**            | **4,626,625** |

### Download

We'll store our data inside the `data/` directory. First let's change directory to that location:

```shell
cd data
```

To download the datasets above, we can use the commands:

```shell
# Get en-de training data for WMT21
wget -nc https://www.statmt.org/europarl/v10/training/europarl-v10.de-en.tsv.gz 2> /dev/null
wget -nc https://data.statmt.org/news-commentary/v16/training/news-commentary-v16.de-en.tsv.gz 2> /dev/null
wget -nc https://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz 2> /dev/null
```

This may take a little time to download the data from the server.

The dev and test sets can be obtained directly from sacreBLEU via the command line. We echo the source and reference texts to file.

```shell
# Dev sets
sacrebleu -t wmt19 -l en-de --echo src > valid.en
sacrebleu -t wmt19 -l en-de --echo ref > valid.de

# Test sets
sacrebleu -t wmt20 -l en-de --echo src > test.en
sacrebleu -t wmt20 -l en-de --echo ref > test.de
```

This is relatively fast as these sets typically contain only 1000-2000 lines.

### Combine

Now we want to combine our data sources into a single corpus. We start by decompressing the Europarl and News Commentary TSV files.

```shell
for compressed in europarl-v10.de-en.tsv news-commentary-v16.de-en.tsv; do
    if [ ! -e $compressed ]; then
        gzip --keep -q -d $compressed.gz
    fi
done
```

This leaves two TSV files:
- `europarl-v10.de-en.tsv`
- `news-commentary-v16.de-en.tsv`

where the first field contains German text, and the second field contains English text.

We can untar the Common Crawl archive with

```shell
tar xf training-parallel-commoncrawl.tgz
```

This contains a collection of parallel text files across multiple languages, but we're only interested in those covering `en-de`:
- `commoncrawl.de-en.de`
- `commoncrawl.de-en.en`

From these we can construct a parallel corpus. We concatenate the two TSV files and extract the first field to populate the German combined corpus, then the second field to populate the English combined corpus. We then append the Common Crawl data to the relevant file.

```shell
# Corpus
if [ ! -e corpus.de ] || [ ! -e corpus.en ]; then
    # TSVs
    cat europarl-v10.de-en.tsv news-commentary-v16.de-en.tsv | cut -f 1 > corpus.de
    cat europarl-v10.de-en.tsv news-commentary-v16.de-en.tsv | cut -f 2 > corpus.en

    # Plain text
    cat commoncrawl.de-en.de >> corpus.de
    cat commoncrawl.de-en.en >> corpus.en
fi
```

## Prepare data

With our combined corpus we now apply some basic pre-processing. Firstly, we remove any non-printing characters using a script from [Moses].

```shell
for lang in en de; do
    # Remove non-printing characters
    cat corpus.$lang \
        | perl $MOSES_SCRIPTS/tokenizer/remove-non-printing-char.perl \
        > .corpus.norm.$lang
done
```

This modifies the content separately for each language, but **does not** adjust the ordering. The parallel sentence pairs are associated by line, so it is crucial that any pre-processing preserves that.
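A quick way to check that a pre-processing step hasn't broken the alignment is to compare line counts on both sides. This isn't part of the original pipeline, just an optional sanity check:

```shell
# Both sides of the parallel corpus must have the same number of lines,
# otherwise the sentence pairs are no longer aligned.
wc -l .corpus.norm.en .corpus.norm.de

# Or fail loudly if the counts ever drift apart:
if [ "$(wc -l < .corpus.norm.en)" != "$(wc -l < .corpus.norm.de)" ]; then
    echo "Line counts differ - corpus is misaligned!" >&2
fi
```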
Then we constrain the sentences to be between 1 and 100 words with

```shell
# Constrain length to between 1 and 100 words
perl $MOSES_SCRIPTS/training/clean-corpus-n.perl .corpus.norm en de .corpus.trim 1 100
```

This removes sentence pairs where either side does not meet the length requirements.

To remove any duplicates, we build a TSV file, sort it and retain only unique lines.

```shell
# Deduplicate
paste <(cat .corpus.trim.en) <(cat .corpus.trim.de) \
    | LC_ALL=C sort -S 50% | uniq \
    > .corpus.uniq.ende.tsv
```

The clean corpus is then obtained by splitting the TSV file back into parallel text files.

```shell
cat .corpus.uniq.ende.tsv | cut -f 1 > corpus.clean.en
cat .corpus.uniq.ende.tsv | cut -f 2 > corpus.clean.de
```

The cleaned corpus has 4,552,319 parallel sentences, having discarded around 1.6% of the total sentences.

## Training

To train a transformer model, we make use of Marian's presets. The `--task transformer-base` preset gives a good baseline set of hyperparameters for a transformer model. We'll put our configuration inside a YAML file, `transformer-model.yml`. We can output the configuration for this preset using the `--dump-config expand` option:

```shell
$MARIAN/marian --task transformer-base --dump-config expand > transformer-model.yml
```

We have shortened `../../build/marian` to `$MARIAN/marian` for brevity. You can inspect this file to see exactly which options have been set.

We'll modify this file by adding options that make training output a little more verbose:

```
disp-freq: 1000
disp-first: 10
save-freq: 2ku
```

We also add a line that will halt training after 10 consecutive validations without an improvement on the validation set:

```
early-stopping: 10
```

We will also validate with additional metrics, keep the best model per metric and validate more often. This is achieved via:

```
keep-best: true
valid-freq: 2ku
valid-metrics:
  - ce-mean-words
  - bleu
  - perplexity
```

Note that the early-stopping criterion applies to `ce-mean-words`, the first metric in the list.

### SentencePiece (Optional)

To generate a SentencePiece vocabulary model you can run the `spm_train` command built alongside Marian. An example invocation would look something like:

```shell
$MARIAN/spm_train \
    --accept_language en,de \
    --input data/corpus.clean.en,data/corpus.clean.de \
    --model_prefix model/vocab.ende \
    --vocab_size 32000
mv model/vocab.ende.{model,spm}
```

As a last step, we rename `.model` to `.spm` (SentencePiece Model) so that Marian recognises it as a SentencePiece model. This step is listed as optional because, in the absence of a vocabulary file, Marian will build one itself. This produces a combined vocabulary of 32000 tokens.

### Training Command

To begin training, we call the `marian` command with the following arguments:

```shell
$MARIAN/marian -c transformer-model.yml \
    -d 0 1 2 3 --workspace 9000 \
    --seed 1111 \
    --after 10e \
    --model model/model.npz \
    --train-sets data/corpus.clean.{en,de} \
    --vocabs model/vocab.ende.spm model/vocab.ende.spm \
    --dim-vocabs 32000 32000 \
    --valid-sets data/valid.{en,de} \
    --log model/train.log --valid-log model/valid.log
```

The flag `-d` sets the devices to run on, which you'll have to update for your setup. Additionally `-w`, the workspace, depends on how much memory your GPUs have. The example was tested on a pair of NVIDIA RTX 2080s with 11 GB of memory, using a workspace of 9000 MiB. You should reduce this if you have less available memory. For reproducibility, the seed is set to `1111`. As a reference, this run took around 8 hours. The models will be stored at `model/model.npz`.
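The command above writes progress to `model/train.log` and `model/valid.log` (via `--log` and `--valid-log`). If you want to follow a run as it happens, a small optional convenience (not part of `run-me.sh`) is to tail both logs from a second terminal:

```shell
# Follow training and validation progress while marian is running.
# train.log records the statistics displayed every disp-freq updates,
# valid.log records the validation metrics computed every valid-freq updates.
tail -f model/train.log model/valid.log
```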
Returning to the training command: the training and validation sets are specified, as well as the vocabulary files and their dimensions. Logs for the training and validation output are also retained. Finally, for this example we only train for a maximum of 10 epochs (`--after 10e`).

The `save-freq` of `2ku` (2000 updates) that we specified will result in the model state being saved at regular intervals of 2000 updates:
- `model/model.iter2000.npz`
- `model/model.iter4000.npz`
- ...

The current model is always `model/model.npz`. Additionally, the `keep-best` option produces an additional model file for every validator:
- `model/model.npz.best-bleu.npz`
- `model/model.npz.best-ce-mean-words.npz`
- `model/model.npz.best-perplexity.npz`

The training progress is tracked in `model/model.npz.progress.yml`, with the full model configuration at `model/model.npz.yml`.

In addition, Marian automatically generates a decoding config for each of these models:
- `model/model.npz.decoder.yml`
- `model/model.npz.best-*.npz.decoder.yml`

These conveniently refer to the model and vocabulary files. They also include default settings for beam search and normalization, which can be overridden on the command line.

## Translation

To translate we use the `marian-decoder` command:

```shell
cat data/test.en \
    | $MARIAN/marian-decoder \
        -c model/model.npz.best-bleu.npz.decoder.yml \
        -d 0 1 2 3 \
    | tee evaluation/testset_output.txt \
    | sacrebleu data/test.de --metrics bleu chrf -b -w 3 -f text
```

where we're using the model that produced the best BLEU score on the validation set. This snippet passes the source text to Marian over a pipe to `stdin`, and the translation is written to `stdout`. We capture this output to file with `tee`, and pass it on to sacreBLEU for evaluation. We provide sacreBLEU with our reference text, and ask it to compute both BLEU and chrF. The remaining sacreBLEU options return only the score, with 3 decimal places of precision, in text format. You can experiment with changing `--beam-size` and `--normalization` to see how they affect the scores.

Additionally, if you want to compute the Comet score, there's a helper script:

```
./scripts/comet-score.sh hyp.txt src.txt ref.txt
```

This returns the Comet score for `hyp.txt`, the translation output, based on `src.txt`, the source input, and `ref.txt`, the reference translation.

### Results

Here we tabulate the BLEU, chrF2 and Comet scores for our model. For each of the metrics, a larger score is better. You should achieve similar results with your own run! These are the results from decoding with the best-BLEU model:

| Test   | BLEU   | chrF2  | Comet  |
|--------|--------|--------|--------|
| WMT20  | 24.573 | 52.368 | 0.1795 |
| WMT19^ | 37.185 | 62.628 | 0.3312 |
| WMT18  | 40.140 | 65.281 | 0.5363 |
| WMT17  | 26.832 | 56.096 | 0.4061 |
| WMT16  | 33.245 | 60.534 | 0.4552 |

**^** Note that WMT19 was used as the validation set!

## Going Further

If you want to improve on these results, you can continue training for longer, or incorporate other datasets from the WMT21 task. Take a look at the other examples and think about implementing some data augmentation through back-translation; a rough sketch is given at the end of this README. Good luck!

[sacrebleu]: https://github.com/mjpost/sacrebleu
[comet]: https://github.com/Unbabel/COMET
[moses]: https://github.com/moses-smt/mosesdecoder
[news task]: https://www.statmt.org/wmt21/translation-task.html
[sentencepiece]: https://github.com/google/sentencepiece
[install_marian]: https://marian-nmt.github.io/docs/#installation
[install_sentencepiece]: https://marian-nmt.github.io/docs/#sentencepiece
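As a starting point for the back-translation idea above, here is a rough, hypothetical sketch rather than a tested recipe. It assumes you have also trained a reverse de-en model with the same steps (the `model-de-en/` path is made up) and that you supply a German monolingual corpus `data/mono.de`.

```shell
# Rough sketch of back-translation (not part of run-me.sh).
# Assumes a reverse de->en model (hypothetical path model-de-en/) and a
# German monolingual corpus data/mono.de that you provide.

# 1. Translate the German monolingual text into synthetic English.
cat data/mono.de \
    | $MARIAN/marian-decoder \
        -c model-de-en/model.npz.best-bleu.npz.decoder.yml \
        -d 0 1 2 3 \
    > data/mono.synthetic.en

# 2. Append the synthetic pairs to the cleaned training corpus,
#    keeping the authentic German on the target side.
cat data/mono.synthetic.en >> data/corpus.clean.en
cat data/mono.de           >> data/corpus.clean.de

# 3. Re-train (or continue training) the en->de model on the augmented corpus.
```

The key point is that the authentic German stays on the target side, so the en-de model still learns to produce fluent German even though the synthetic English source is noisy.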