github.com/marian-nmt/marian-examples.git
Diffstat (limited to 'transformer-intro/README.md')
-rw-r--r--  transformer-intro/README.md | 398
1 file changed, 398 insertions(+), 0 deletions(-)
diff --git a/transformer-intro/README.md b/transformer-intro/README.md
new file mode 100644
index 0000000..4c8e0d6
--- /dev/null
+++ b/transformer-intro/README.md
@@ -0,0 +1,398 @@
+# Intro to Transformers
+
+This tutorial is designed to help you train your first machine translation
+model. To follow along, you'll need a Linux-based system and an NVIDIA GPU.
+
+In this example we will use Marian to create an English-German translation
+system. We'll follow a very simple pipeline with data acquisition, some basic
+corpus cleaning, generation of vocabulary with [SentencePiece], training of a
+transformer model, and evaluation with [sacreBLEU] and (optionally) [Comet].
+
+We'll be using a subset of data from the WMT21 [news task] to train our model.
+For the validation and test sets, we'll use the test sets from WMT19 and WMT20,
+respectively.
+
+Let's get started by installing our dependencies!
+
+
+## Install requirements
+If you haven't installed the common tools for `marian-examples`, you can do so
+by going to the `tools/` folder in the root of the repository and running `make`.
+```shell
+cd ../tools
+make all
+cd -
+```
+In this example, we'll be using some
+[scripts](https://github.com/marian-nmt/moses-scripts) from [Moses].
+
+We'll also use [sacreBLEU] and [Comet], installed with pip. To install these in a
+virtual environment, execute:
+```shell
+python -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+```
+You can skip the first two of these commands if you don't want to use a virtual
+environment.
+
+Next we'll install Marian!
+
+
+## Getting Marian
+The development version of Marian can be obtained with
+```shell
+git clone https://github.com/marian-nmt/marian-dev
+cd marian-dev
+```
+
+### Compile
+To compile Marian we need to ensure we have the required packages. The list of
+requirements can be found in the [documentation][install_marian]. Since we're
+using SentencePiece, we also need to make sure we satisfy its
+[requirements][install_sentencepiece] too.
+
+Then we can compile with
+```shell
+mkdir build
+cd build
+cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_SENTENCEPIECE=ON
+cmake --build .
+```
+
+To speed up compilation we can use ```cmake --build . -j 8``` to run 8 tasks
+simultaneously. You may need to reduce this based on your system's CPU count and
+available memory.
+
+If it succeeded, running
+```shell
+./marian --version
+```
+will return the version you've compiled. To verify that SentencePiece support was enabled, running
+```shell
+./marian --help |& grep sentencepiece
+```
+will display the SentencePiece specific options:
+```
+--sentencepiece-alphas VECTOR ... Sampling factors for SentencePiece vocabulary; i-th factor corresponds to i-th vocabulary
+--sentencepiece-options TEXT Pass-through command-line options to SentencePiece trainer
+--sentencepiece-max-lines UINT=2000000
+ Maximum lines to train SentencePiece vocabulary, selected with sampling from all data. When set to 0 all lines are going to be used.
+```
+
+## Running the Example
+The entire example can be run end-to-end by executing
+```shell
+./run-me.sh
+```
+This will acquire the data and then apply cleaning. It uses the resulting corpus
+to train a transformer model, which is evaluated with sacreBLEU.
+
+By default, `run-me.sh` will run on a single GPU (`device 0`). To use a
+different set of GPUs, pass their IDs as arguments, e.g. to train using 4
+GPUs:
+```shell
+./run-me.sh 0 1 2 3
+```
+
+You can also run the commands from `run-me.sh` manually yourself. We'll walk through
+the different commands in the sections below. These commands assume that Marian
+is compiled and accessible at `../../build/marian`. The `data/`, `scripts/` and
+`model/` directories are located at the same level as this README file.
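+
+For convenience, the snippets below refer to the Marian build directory through a
+`$MARIAN` variable and to the Moses scripts through `$MOSES_SCRIPTS`. A minimal
+setup sketch (the exact paths are assumptions; adjust them to your checkout):
+```shell
+# Absolute paths so the variables survive the `cd data` below
+export MARIAN=$(cd ../../build && pwd)
+export MOSES_SCRIPTS=$(cd ../../tools/moses-scripts/scripts && pwd)
+# Working directories used throughout this example
+mkdir -p data model evaluation
+```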
+
+## Acquire data
+We'll acquire a subset of the data from the WMT21 [news task].
+
+In particular we'll make use of the following English-German parallel corpora:
+
+| Dataset | Sentences |
+|---------------------|--------------:|
+| Europarl v10 | 1,828,521 |
+| News Commentary v16 | 398,981 |
+| Common Crawl corpus | 2,399,123 |
+| **Total** | **4,626,625** |
+
+### Download
+We'll store our data inside the `data/` directory. First, let's change directory
+to that location:
+```shell
+cd data
+```
+
+To download the datasets above, we can use the following commands:
+```shell
+# Get en-de for training WMT21
+wget -nc https://www.statmt.org/europarl/v10/training/europarl-v10.de-en.tsv.gz 2> /dev/null
+wget -nc https://data.statmt.org/news-commentary/v16/training/news-commentary-v16.de-en.tsv.gz 2> /dev/null
+wget -nc https://www.statmt.org/wmt13/training-parallel-commoncrawl.tgz 2> /dev/null
+```
+It may take a little time to download the data from the server.
+
+The dev set and test set can be obtained directly from sacrebleu via the command line. We echo the source and reference texts to file.
+```shell
+# Dev Sets
+sacrebleu -t wmt19 -l en-de --echo src > valid.en
+sacrebleu -t wmt19 -l en-de --echo ref > valid.de
+
+# Test Sets
+sacrebleu -t wmt20 -l en-de --echo src > test.en
+sacrebleu -t wmt20 -l en-de --echo ref > test.de
+```
+This is relatively fast as these are typically only 1000-2000 lines.
+
+
+### Combine
+Now we want to combine our data sources into a single corpus. We start by
+decompressing the Europarl and news-commentary TSV files.
+```shell
+for compressed in europarl-v10.de-en.tsv news-commentary-v16.de-en.tsv; do
+ if [ ! -e $compressed ]; then
+ gzip --keep -q -d $compressed.gz
+ fi
+done
+```
+This leaves two TSV files:
+ - `europarl-v10.de-en.tsv`
+ - `news-commentary-v16.de-en.tsv`
+
+where the first field contains German text, and the second field contains
+English text.
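+
+If you want to double-check the field order, you can peek at the first record of
+one of the TSVs (a quick, optional sanity check):
+```shell
+# Field 1 should be German, field 2 English
+head -n 1 news-commentary-v16.de-en.tsv | cut -f 1
+head -n 1 news-commentary-v16.de-en.tsv | cut -f 2
+```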
+
+We can untar the common crawl archive.
+```shell
+tar xf training-parallel-commoncrawl.tgz
+```
+This contains a collection of parallel text files across multiple languages, but
+we're only interested in those covering `en-de`:
+ - `commoncrawl.de-en.de`
+ - `commoncrawl.de-en.en`
+
+From these we can construct a parallel corpus. We concatenate the two TSV files
+and extract the first field to populate the combined German corpus, then the
+second field to populate the combined English corpus. We then append the
+Common Crawl data to the relevant file.
+```shell
+# Corpus
+if [ ! -e corpus.de ] || [ ! -e corpus.en ]; then
+ # TSVs
+ cat europarl-v10.de-en.tsv news-commentary-v16.de-en.tsv | cut -f 1 > corpus.de
+ cat europarl-v10.de-en.tsv news-commentary-v16.de-en.tsv | cut -f 2 > corpus.en
+
+ # Plain text
+ cat commoncrawl.de-en.de >> corpus.de
+ cat commoncrawl.de-en.en >> corpus.en
+fi
+```
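+
+Both sides of the corpus should now contain the same number of lines, matching the
+total from the table above. A quick check:
+```shell
+# Both files should report the same count (4,626,625 sentences)
+wc -l corpus.en corpus.de
+```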
+
+## Prepare data
+With our combined corpus we now apply some basic pre-processing.
+
+Firstly, we remove any non-printing characters using a script from [Moses].
+```shell
+for lang in en de; do
+ # Remove non-printing characters
+ cat corpus.$lang \
+ | perl $MOSES_SCRIPTS/tokenizer/remove-non-printing-char.perl \
+ > .corpus.norm.$lang
+done
+```
+This modifies the content separately for each language, but **does not** adjust
+the ordering. The parallel sentence pairs are associated by line, so it is
+crucial that any pre-processing preserves that alignment.
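+
+A simple way to confirm that the alignment survived is to compare line counts
+before and after the step (a minimal sketch of such a check):
+```shell
+# Each normalised file must have exactly as many lines as the original
+for lang in en de; do
+    echo "$lang: $(wc -l < corpus.$lang) -> $(wc -l < .corpus.norm.$lang)"
+done
+```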
+
+Then we constrain the sentences to be between 1 and 100 words with
+```shell
+# Constrain length to between 1 and 100 words
+perl $MOSES_SCRIPTS/training/clean-corpus-n.perl .corpus.norm en de .corpus.trim 1 100
+```
+This removes sentence pairs where either one does not meet the length
+requirements.
+
+To remove any duplicates we build a TSV file, sort it and retain only unique
+lines.
+```shell
+# Deduplicate
+paste <(cat .corpus.trim.en) <(cat .corpus.trim.de) \
+ | LC_ALL=C sort -S 50% | uniq \
+ > .corpus.uniq.ende.tsv
+```
+
+The clean corpus is then obtained by splitting our TSV file back into parallel
+text files.
+```shell
+cat .corpus.uniq.ende.tsv | cut -f 1 > corpus.clean.en
+cat .corpus.uniq.ende.tsv | cut -f 2 > corpus.clean.de
+```
+
+The cleaned corpus has 4,552,319 parallel sentences, having discarded around
+1.6% of the total sentences.
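+
+You can verify these figures yourself; for example, a small shell calculation of
+the discarded fraction (using the counts quoted above):
+```shell
+# Count the cleaned sentences and compute the share that was discarded
+wc -l corpus.clean.en corpus.clean.de
+awk 'BEGIN { printf "discarded: %.1f%%\n", 100 * (4626625 - 4552319) / 4626625 }'
+```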
+
+## Training
+To train a transformer model, we make use of Marian's presets. The `--task
+transformer-base` preset gives a good baseline of hyperparameters for a
+transformer model.
+
+We'll put our configuration inside a YAML file, `transformer-model.yml`. We can
+output the configuration for this preset using the `--dump-config expand`
+option:
+```shell
+$MARIAN/marian --task transformer-base --dump-config expand > transformer-model.yml
+```
+We have shortened `../../build/marian` to `$MARIAN/marian` for brevity.
+
+You can inspect this file to see exactly which options have been set.
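+
+For example, to pull out a few of the architecture hyperparameters chosen by the
+preset, you could grep the dumped config (the option names below are the usual
+transformer settings, so treat this as a sketch):
+```shell
+# Show some of the architecture options set by --task transformer-base
+grep -E "^(enc-depth|dec-depth|dim-emb|transformer-heads|transformer-dim-ffn):" transformer-model.yml
+```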
+
+We'll modify this file by adding options that make training a little more verbose.
+```
+disp-freq: 1000
+disp-first: 10
+save-freq: 2ku
+```
+
+We also add a line that will halt training after 10 consecutive validations
+without an improvement on the validation set.
+```
+early-stopping: 10
+```
+
+We will also validate with additional metrics, keep the best model per metric
+and validate more often. This is achieved via:
+```
+keep-best: true
+valid-freq: 2ku
+valid-metrics:
+ - ce-mean-words
+ - bleu
+ - perplexity
+```
+Note that the early-stopping criterion applies to the first metric listed, `ce-mean-words`.
+
+### SentencePiece (Optional)
+To generate a SentencePiece vocabulary model you can run the `spm_train` command
+built alongside Marian. An example invocation would look something like:
+```shell
+$MARIAN/spm_train \
+ --accept_language en,de \
+ --input data/corpus.clean.en,data/corpus.clean.de \
+ --model_prefix model/vocab.ende \
+ --vocab_size 32000
+mv model/vocab.ende.{model,spm}
+```
+As a last step, we rename `.model` to `.spm` (SentencePiece Model) so that
+Marian recognises it as a SentencePiece vocabulary. This step is listed as
+optional because, in the absence of a vocabulary file, Marian will build one
+itself.
+
+This produces a combined vocabulary of 32000 tokens.
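+
+To see the vocabulary in action, you can segment a sample sentence with
+`spm_encode`, which is built alongside `spm_train` (a quick, optional check):
+```shell
+# Segment a sample sentence into SentencePiece subword tokens
+echo "This is a test sentence." | $MARIAN/spm_encode --model model/vocab.ende.spm
+```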
+
+### Training Command
+To begin training, we call the `marian` command with the following arguments:
+```shell
+$MARIAN/marian -c transformer-model.yml \
+ -d 0 1 2 3 --workspace 9000 \
+ --seed 1111 \
+ --after 10e \
+ --model model/model.npz \
+ --train-sets data/corpus.clean.{en,de} \
+ --vocabs model/vocab.ende.spm model/vocab.ende.spm \
+ --dim-vocabs 32000 32000 \
+ --valid-sets data/valid.{en,de} \
+ --log model/train.log --valid-log model/valid.log
+```
+The flag `-d` sets the devices to run on, which you'll have to update for
+your setup. Additionally `-w`, the workspace, depends on how much memory your
+GPUs have. The example was tested on a pair of NVIDIA RTX 2080 GPUs with 11GB,
+using a workspace of 9000 MiB. You should reduce this if you have less available
+memory. For reproducibility, the seed is set to `1111`. For reference, training
+took around 8 hours.
+
+The models will be stored at `model/model.npz`. The training and validation sets
+are specified, as well as the vocabulary files and their dimensions. Logs for the
+training and validation output are also retained. Finally, for this example we
+only train for a maximum of 10 epochs (`--after 10e`).
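+
+While training runs, you can follow its progress through the logs we specified,
+for example:
+```shell
+# Watch the training and validation output as it is written
+tail -f model/train.log model/valid.log
+```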
+
+The `save-freq` of `2ku` (2000 updates) that we specified will result in the model
+state being saved at regular intervals:
+ - `model/model.iter2000.npz`
+ - `model/model.iter4000.npz`
+ - ...
+
+The current model is always `model/model.npz`. In addition, the `keep-best`
+option produces a separate model file for every validation metric:
+ - `model/model.npz.best-bleu.npz`
+ - `model/model.npz.best-ce-mean-words.npz`
+ - `model/model.npz.best-perplexity.npz`
+
+The training progress is tracked in `model/model.npz.progress.yml` with the full
+model configuration at `model/model.npz.yml`. In addition, Marian automatically
+generates a decoding config for each of these models:
+ - `model/model.npz.decoder.yml`
+ - `model/model.npz.best-*.npz.decoder.yml`
+
+These conveniently refer to the model and vocabulary files. They also include
+default settings for beam search and normalization, which can be overridden from
+the command line.
+
+## Translation
+To translate we use the `marian-decoder` command:
+```shell
+cat data/test.en \
+ | $MARIAN/marian-decoder \
+ -c model/model.npz.best-bleu.npz.decoder.yml \
+ -d 0 1 2 3 \
+ | tee evaluation/testset_output.txt \
+ | sacrebleu data/test.de --metrics bleu chrf -b -w 3 -f text
+```
+where we're using the model that produced the best BLEU score on the validation
+set. This snippet passes the source text to Marian over a pipe to `stdin`, and
+the translation is written to `stdout`. We capture this output to a file with
+`tee`, and pass it on to sacreBLEU for evaluation. We provide sacreBLEU with our
+reference text, and ask it to compute both BLEU and chrF. The remaining
+sacreBLEU options return only the score, with 3 decimal places of precision, in
+text format.
+
+You can experiment with changing `--beam-size` and `--normalize` to see how
+they change the scores.
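+
+For example, a variant of the command above that widens the beam and enables
+length normalization (the values are arbitrary, chosen only to illustrate the
+overrides):
+```shell
+cat data/test.en \
+    | $MARIAN/marian-decoder \
+        -c model/model.npz.best-bleu.npz.decoder.yml \
+        -d 0 1 2 3 \
+        --beam-size 12 --normalize 1.0 \
+    | sacrebleu data/test.de --metrics bleu chrf -b -w 3 -f text
+```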
+
+
+Additionally, if you want to compute the Comet score, there's a helper script:
+```shell
+./scripts/comet-score.sh hyp.txt src.txt ref.txt
+```
+This returns the Comet score for `hyp.txt`, the translation output, based on
+`src.txt` the source input, and `ref.txt` the reference translation.
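+
+Using the files produced in the previous step, the invocation might look like this
+(the paths are assumptions based on the commands above):
+```shell
+./scripts/comet-score.sh evaluation/testset_output.txt data/test.en data/test.de
+```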
+
+### Results
+Here we tabulate the BLEU, chrF2 and Comet scores for our model. For each of
+the metrics, a larger score is better. You should achieve similar results with
+your own run!
+
+These are the results from decoding with the best-BLEU model:
+
+| Test | BLEU | chrF2 | Comet |
+|--------|--------|--------|--------|
+| WMT20 | 24.573 | 52.368 | 0.1795 |
+| WMT19^ | 37.185 | 62.628 | 0.3312 |
+| WMT18 | 40.140 | 65.281 | 0.5363 |
+| WMT17 | 26.832 | 56.096 | 0.4061 |
+| WMT16 | 33.245 | 60.534 | 0.4552 |
+
+**^** Note that WMT19 was used as the validation set!
+
+## Going Further
+If you want to improve on these results, you can continue training for longer,
+or incorporate other datasets from the WMT21 task. Take a look at the other
+examples and think about implementing some data augmentation through
+back-translation.
+
+Good luck!
+
+<!-- Links -->
+[sacrebleu]: https://github.com/mjpost/sacrebleu
+[comet]: https://github.com/Unbabel/COMET
+[moses]: https://github.com/moses-smt/mosesdecoder
+
+[news task]: https://www.statmt.org/wmt21/translation-task.html
+
+[sentencepiece]: https://github.com/google/sentencepiece
+[install_marian]: https://marian-nmt.github.io/docs/#installation
+[install_sentencepiece]: https://marian-nmt.github.io/docs/#sentencepiece