# Layers

In a typical deep neural network, the highest-level building blocks, which perform different kinds of
transformations on their inputs, are called layers. A layer wraps a group of nodes and performs a
specific mathematical computation, offering a shortcut for building more complex neural networks.

In Marian, for example, the `mlp::dense` layer represents a fully connected layer, which implements
the operation `output = activation(input * weight + bias)`.  A dense layer in the graph can be
constructed with the following code:
```cpp
// add input node x
auto x = graph->constant({120,5}, inits::fromVector(inputData));
// construct a dense layer in the graph
auto layer1 = mlp::dense()
      ("prefix", "layer1")                  // prefix name is layer1
      ("dim", 5)                            // output dimension is 5
      ("activation", (int)mlp::act::tanh)   // activation function is tanh
      .construct(graph)->apply(x);          // construct this layer in graph
                                            // and link node x as the input
```
The options are passed to the layer as `(key, value)` pairs, where `key` is a predefined option name
and `value` is its value. `construct()` is then called to create a layer instance in the graph, and
`apply()` links the input node to this layer.

Alternatively, the same layer can be created by defining nodes and operations directly:
```cpp
// construct a dense layer using nodes
auto W1 = graph->param("W1", {120, 5}, inits::glorotUniform());
auto b1 = graph->param("b1", {1, 5}, inits::zeros());
auto h = tanh(affine(x, W1, b1));
```
There are four categories of layers implemented in Marian, described in the sections below.

## Convolution layer

To use a `convolution` layer, you first need to install [NVIDIA cuDNN](https://developer.nvidia.com/cudnn).
The convolution layer supported by Marian is a 2D
[convolution layer](https://en.wikipedia.org/wiki/Convolutional_neural_network#Convolutional_layers).
This layer creates a convolution kernel that is convolved with the input. The options that
can be passed to a `convolution` layer are the following:

| Option Name   | Definition     | Value Type    | Default Value  |
| ------------- |----------------|---------------|---------------|
| prefix        | Prefix name (used to form the parameter names) | `std::string` | `None` |
| kernel-dims   | The height and width of the kernel | `std::pair<int, int>` | `None`|
| kernel-num    | The number of kernels | `int` | `None`       |
| paddings      | The height and width of the padding | `std::pair<int, int>` | `(0,0)`|
| strides       | The height and width of the strides | `std::pair<int, int>` | `(1,1)` |

Example:
```cpp
// construct a convolution layer
auto conv_1 = convolution(graph)              // pass graph pointer to the layer
      ("prefix", "conv_1")                    // prefix name is conv_1
      ("kernel-dims", std::make_pair(3,3))    // kernel is 3*3
      ("kernel-num", 32)                      // kernel no. is 32
      .apply(x);                              // link node x as the input
```
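The remaining options from the table can be set in the same way. Below is a sketch that also passes
`paddings` and `strides`; the kernel size, number of kernels, padding and stride values are chosen
purely for illustration:
```cpp
// construct a convolution layer with explicit paddings and strides
auto conv_2 = convolution(graph)              // pass graph pointer to the layer
      ("prefix", "conv_2")                    // prefix name is conv_2
      ("kernel-dims", std::make_pair(5,5))    // kernel is 5*5
      ("kernel-num", 16)                      // number of kernels is 16
      ("paddings", std::make_pair(2,2))       // pad 2 in height and 2 in width
      ("strides", std::make_pair(2,2))        // move the kernel by 2 in height and width
      .apply(x);                              // link node x as the input
```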

## MLP layers

Marian offers `mlp::mlp`, which creates a
[multilayer perceptron (MLP)](https://en.wikipedia.org/wiki/Multilayer_perceptron) network.
It is a container that can stack multiple layers using the `push_back()` function. There are two
types of MLP layers provided by Marian: `mlp::dense` and `mlp::output`.

The `mlp::dense` layer, as introduced before, is a fully connected layer, and it accepts the
following options:

| Option Name   | Definition     | Value Type    | Default Value  |
| ------------- |----------------|---------------|---------------|
| prefix        | Prefix name (used to form the parameter names) | `std::string` | `None` |
| dim           | Output dimension | `int` | `None` |
| layer-normalization | Whether to normalise the layer output or not | `bool` | `false` |
| nematus-normalization | Whether to use Nematus layer normalisation or not | `bool` | `false` |
| activation | Activation function | `int` | `mlp::act::linear` |

The available activation functions for MLP layers are `mlp::act::linear`, `mlp::act::tanh`,
`mlp::act::sigmoid`, `mlp::act::ReLU`, `mlp::act::LeakyReLU`, `mlp::act::PReLU`, and
`mlp::act::swish`.

Example:
```cpp
// construct a mlp::dense layer
auto dense_layer = mlp::dense()
      ("prefix", "dense_layer")                 // prefix name is dense_layer
      ("dim", 3)                                // output dimension is 3
      ("activation", (int)mlp::act::sigmoid)    // activation function is sigmoid
      .construct(graph)->apply(x);              // construct this layer in graph and link node x as the input
```

The `mlp::output` layer is used, as the name suggests, to construct an output layer. You can tie
embedding layers to the `mlp::output` layer using `tieTransposed()`, or set shortlisted words using
`setShortlist()`. The general options of the `mlp::output` layer are listed below:

| Option Name   | Definition     | Value Type    | Default Value  |
| ------------- |----------------|---------------|---------------|
| prefix        | Prefix name (used to form the parameter names) | `std::string` | `None` |
| dim           | Output dimension | `int` | `None` |
| vocab         | File path to the factored vocabulary | `std::string` | `None` |
| output-omit-bias | Whether this layer has a bias parameter | `bool` | `true` |
| lemma-dim-emb | Re-embedding dimension of lemma in factors, must be used with `vocab` option | `int` | `0` |
| output-approx-knn | Parameters for LSH-based output approximation, i.e., `k` (the first element) and `nbit` (the second element) | `std::vector<int>` | `None` |

Example:
```cpp
// construct a mlp::output layer
auto last = mlp::output()
      ("prefix", "last")    // prefix name is dense_layer
      ("dim", 5);           // output dimension is 5
```
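As mentioned above, the output weights can be tied to an embedding matrix with `tieTransposed()`.
A minimal sketch, assuming a parameter named `Wemb` already exists in the graph:
```cpp
// construct a mlp::output layer and tie its weight matrix to the
// transposed embedding parameter "Wemb" (an assumed parameter name)
auto tied_output = mlp::output()
      ("prefix", "tied_output")      // prefix name is tied_output
      ("dim", 5);                    // output dimension is 5
tied_output.tieTransposed("Wemb");   // reuse the transposed embedding matrix as output weights
```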
Finally, an example showing how to create a `mlp::mlp` network containing multiple layers:
```cpp
// construct a mlp::mlp network
auto mlp_networks = mlp::mlp()                                       // construct an mlp container
                     .push_back(mlp::dense()                         // construct a dense layer
                                 ("prefix", "dense")                 // prefix name is dense
                                 ("dim", 5)                          // dimension is 5
                                 ("activation", (int)mlp::act::tanh))// activation function is tanh
                     .push_back(mlp::output()                        // construct an output layer
                                 ("dim", 5))                         // dimension is 5
                     ("prefix", "mlp_network")                       // prefix name is mlp_network
                     .construct(graph);                              // construct the mlp network in the graph
```
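Once constructed, the whole network can be applied to an input node just like a single layer. A
minimal sketch, assuming `x` is the input node defined earlier:
```cpp
// run the input through all stacked layers of the mlp::mlp network
auto mlp_output = mlp_networks->apply(x);
```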

## RNN layers
Marian offers `rnn::rnn` for creating a [recurrent neural network
(RNN)](https://en.wikipedia.org/wiki/Recurrent_neural_network). Just like `mlp::mlp`, `rnn::rnn` is a
container that can stack multiple layers using the `push_back()` function. Unlike MLP layers, Marian
only provides cell-level APIs to construct RNNs: an RNN cell processes a single timestep instead of a
whole batch of input sequences. There are two types of RNN layers provided by Marian: `rnn::cell` and
`rnn::stacked_cell`.

The `rnn::cell` is the base component of an RNN, and `rnn::stacked_cell` is a stack of `rnn::cell`
layers. The options of the `rnn::cell` layer are listed below:

| Option Name   | Definition     | Value Type    | Default Value  |
| ------------- |----------------|---------------|---------------|
| type          | Type of RNN cell  | `std::string` | `None` |

There are nine types of RNN cells provided by Marian: `gru`, `gru-nematus`, `lstm`, `mlstm`, `mgru`,
`tanh`, `relu`, `sru`, `ssru`. The general options for all RNN cells are the following:

| Option Name   | Definition     | Value Type    | Default Value  |
| ------------- |----------------|---------------|---------------|
| dimInput      | Input dimension  | `int` | `None` |
| dimState      | Dimension of hidden state  | `int` | `None` |
| prefix        | Prefix name (used to form the parameter names) | `std::string` | `None` |
| layer-normalization | Whether to normalise the layer output or not | `bool` | `false` |
| dropout       | Dropout probability | `float` | `0` |
| transition    | Whether it is a transition layer | `bool` | `false` |
| final         | Whether it is an RNN final layer or hidden layer | `bool` | `false` |

```{note}
Not all the options listed above are available for all the cells. For example, the `final` option is
only used for `gru` and `gru-nematus` cells.
```

Example for `rnn::cell`:
```cpp
// construct a rnn cell
auto rnn_cell = rnn::cell()
         ("type", "gru")              // type of rnn cell is gru
         ("prefix", "gru_cell")       // prefix name is gru_cell
         ("final", false);            // this cell is the final layer
```
Example for `rnn::stacked_cell`:
```cpp
// construct a stack of rnn cells
auto highCell = rnn::stacked_cell();
// for loop to add rnn cells into the stack
for(size_t j = 1; j <= 512; j++) {
    auto paramPrefix = "cell" + std::to_string(j);
    highCell.push_back(rnn::cell()("prefix", paramPrefix));
}
```
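The resulting stack behaves like a single cell and can itself be added to an `rnn::rnn` container.
A minimal sketch, with illustrative dimensions:
```cpp
// use the stacked cell inside an rnn::rnn container
auto deep_rnn = rnn::rnn(
          "type", "gru",             // type of rnn cell is gru
          "prefix", "deep_rnn",      // prefix name is deep_rnn
          "dimInput", 10,            // input dimension is 10
          "dimState", 5)             // dimension of hidden state is 5
          .push_back(highCell)       // add the stack of cells defined above
          .construct(graph);         // construct this rnn container in graph
```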

The list of available options for `rnn::rnn` layers:

| Option Name   | Definition     | Value Type    | Default Value  |
| ------------- |----------------|---------------|---------------|
| type          | Type of RNN layer | `std::string` | `gru` |
| direction     | RNN direction  | `int` | `rnn::dir::forward` |
| dimInput      | Input dimension | `int` | `None` |
| dimState      | Dimension of hidden state | `int` | `None` |
| prefix        | Prefix name (used to form the parameter names) | `std::string` | `None` |
| layer-normalization | Whether to normalise the layer output or not | `bool` | `false` |
| nematus-normalization | Whether to use Nematus layer normalisation or not | `bool` | `false` |
| dropout       | Dropout probability | `float` | `0` |
| skip          | Whether to use skip connections | `bool` | `false` |
| skipFirst     | Whether to use skip connections for the layer(s) with `index > 0` | `bool` | `false` |

Examples for `rnn::rnn()`:
```cpp
// construct a `rnn::rnn()` container
auto rnn_container = rnn::rnn(
               "type", "gru",                  // type of rnn cell is gru
               "prefix", "rnn_layers",         // prefix name is rnn_layers
               "dimInput", 10,                 // input dimension is 10
               "dimState", 5,                  // dimension of hidden state is 5
               "dropout", 0,                   // dropout probability is 0
               "layer-normalization", false)   // do not normalise the layer output
               .push_back(rnn::cell())         // add a rnn::cell in this rnn container
               .construct(graph);              // construct this rnn container in graph
```
Marian provides four RNN directions in the `rnn::dir` enumerator: `rnn::dir::forward`,
`rnn::dir::backward`, `rnn::dir::alternating_forward` and `rnn::dir::alternating_backward`.
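
For example, a backward RNN can be requested through the `direction` option. A minimal sketch with
illustrative dimensions:
```cpp
// construct a rnn::rnn container that processes the sequence right-to-left
auto rnn_backward = rnn::rnn(
               "type", "gru",                          // type of rnn cell is gru
               "prefix", "rnn_backward",               // prefix name is rnn_backward
               "direction", (int)rnn::dir::backward,   // run the RNN backwards over the input
               "dimInput", 10,                         // input dimension is 10
               "dimState", 5)                          // dimension of hidden state is 5
               .push_back(rnn::cell())                 // add a rnn::cell in this rnn container
               .construct(graph);                      // construct this rnn container in graph
```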
For `rnn::rnn()`, you can use `transduce()` to map an input state to an output state.

An example for `transduce()`:
```cpp
auto output = rnn.construct(graph)->transduce(input);
```

## Embedding layer
Marian provides `embedding`, a shortcut to construct a regular embedding layer for word embeddings.
For `embedding` layers, the following options are available:

| Option Name   | Definition     | Value Type    | Default Value  |
| ------------- |----------------|---------------|---------------|
| dimVocab      | Size of vocabulary| `int` | `None` |
| dimEmb        | Size of embedding vector | `int` | `None` |
| dropout       | Dropout probability | `float` | `0` |
| inference     | Whether it is used for inference | `bool` | `false` |
| prefix        | Prefix name (used to form the parameter names) | `std::string` | `None` |
| fixed         | Whether this layer is fixed (not trainable) | `bool` | `false` |
| dimFactorEmb  | Size of factored embedding vector | `int` | `None` |
| factorsCombine | Strategy used to combine the factor embeddings; it can be `"concat"` | `std::string` | `None` |
| vocab         | File path to the factored vocabulary | `std::string` | `None` |
| embFile       | Paths to the factored embedding vectors | `std::string` | `None` |
| normalization | Whether to normalise the layer output or not | `bool` | `false` |

Example to construct an embedding layer:
```cpp
// construct an embedding layer
auto embedding_layer = embedding()
        ("prefix", "embedding")       // prefix name is embedding
        ("dimVocab", 1024)            // vocabulary size is 1024
        ("dimEmb", 512)               // size of embedding vector is 512
        .construct(graph);            // construct this embedding layer in graph
```
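The constructed layer can then be used to look up embeddings for word indices. A minimal sketch,
assuming the layer exposes `applyIndices()` (as in recent Marian versions) and using made-up word
ids and an illustrative shape:
```cpp
// look up embeddings for three word ids; the requested output shape is
// {number of words, batch size, embedding size}
std::vector<WordIndex> ids = {3, 7, 42};
auto word_embs = embedding_layer->applyIndices(ids, {3, 1, 512});
```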