# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).


## [Unreleased]

### Added

### Fixed

### Changed

## [1.11.0] - 2022-02-08

### Added
- Parallelized data reading with e.g. `--data-threads 8`
- Top-k sampling during decoding with e.g. `--output-sampling topk 10` (see the sketch after this list)
- Improved mixed precision training with `--fp16`
- Set FFN width in decoder independently from encoder with e.g. `--transformer-dim-ffn 4096 --transformer-decoder-dim-ffn 2048`
- Add option `--add-lsh` to `marian-conv`, which allows the LSH to be memory-mapped.
- Early stopping based on first, all, or any validation metrics via `--early-stopping-on`
- Compute capability 8.6 support when using CUDA >= 11.1
- Support for RMSNorm as a drop-in replacement for LayerNorm, from Biao Zhang and Rico Sennrich (2019), "Root Mean Square Layer Normalization". Enabled in the Transformer model via `--transformer-postprocess dar` instead of `dan`.
- Extend suppression of unwanted output symbols, specifically "\n" from the default vocabulary if generated by SentencePiece with byte-fallback. Deactivate with `--allow-special`.
- Allow fine-grained CPU intrinsics overrides when `BUILD_ARCH != native`, e.g. `-DBUILD_ARCH=x86-64 -DCOMPILE_AVX512=off`
- Add a custom bias epilogue kernel.
- Add support for fusing ReLU and bias addition into GEMMs when using CUDA 11.
- Better suppression of unwanted output symbols, specifically "\n" from SentencePiece with byte-fallback. Can be deactivated with `--allow-special`.
- Display decoder time statistics with `marian-decoder --stat-freq 10 ...`
- Support for MS-internal binary shortlist
- Local/global sharding with MPI training via `--sharding local`
- fp16 support for factors.
- Correct training with fp16 via `--fp16`.
- Dynamic cost-scaling with `--cost-scaling`.
- Dynamic gradient-scaling with `--dynamic-gradient-scaling`.
- Add unit tests for binary files.
- Fix compilation with OMP
- Added `--model-mmap` option to enable mmap loading for CPU-based translation
- Compute aligned memory sizes using exact sizing
- Support for loading lexical shortlist from a binary blob
- Integrate a shortlist converter (which can convert a text lexical shortlist to a binary shortlist) into `marian-conv` via the `--shortlist` option
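
A minimal sketch combining several of the new decoding options above. The file names `model.npz`, `vocab.spm`, and `input.txt` are placeholders, not part of the release:

```bash
# Hypothetical 1.11 decoding run: parallel data reading, top-k sampling,
# and periodic decoder timing statistics.
marian-decoder \
    --models model.npz --vocabs vocab.spm vocab.spm \
    --data-threads 8 \
    --output-sampling topk 10 \
    --stat-freq 10 \
    < input.txt > output.txt
```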

### Fixed
- Fix AVX2 and AVX512 detection on macOS
- Add GCC11 support into FBGEMM
- Added pragma to ignore unused-private-field error on `elementType_` on macOS
- Do not set guided alignments for case augmented data if vocab is not factored
- Various fixes to enable LSH in Quicksand
- Added support to `MPIWrapper::bcast` (and similar) for counts of type `size_t`
- Add new validation metrics when training is restarted and `--reset-valid-stalled` is used
- Missing depth-scaling in transformer FFN
- Fixed an issue when loading intgemm16 models from unaligned memory.
- Fix building marian with gcc 9.3+ and FBGEMM
- Find MKL installed under Ubuntu 20.04 via apt-get
- Support for CUDA 11.
- General improvements and fixes for MPI handling, which was essentially non-functional before (syncing, random seeds, deadlocks during saving, validation, etc.)
- Allow compiling with `-DUSE_MPI=on` together with `-DUSE_STATIC_LIBS=on`, although MPI still gets linked dynamically since it has so many dependencies
- Fix building server with Boost 1.75
- Missing implementation for cos/tan expression operator
- Fixed loading binary models on architectures where `size_t` != `uint64_t`.
- Missing float template specialisation for elem::Plus
- Broken links to MNIST data sets
- Enforce validation for the task alias in training mode.

### Changed
- On macOS, Marian uses the Apple Accelerate framework by default instead of OpenBLAS/MKL.
- Optimize LSH for speed by treating it as a shortlist generator. No option changes in decoder
- Set `REQUIRED_BIAS_ALIGNMENT = 16` in `tensors/gpu/prod.cpp` to avoid memory misalignment on certain Ampere GPUs.
- For `BUILD_ARCH != native` enable all intrinsics types by default; they can be disabled individually, e.g. `-DCOMPILE_AVX512=off`
- Moved FBGEMM pointer to commit c258054 for gcc 9.3+ fix
- Changed compile options from the `-DCOMPILE_CUDA_SM35` style to `-DCOMPILE_KEPLER`, `-DCOMPILE_MAXWELL`, `-DCOMPILE_PASCAL`, `-DCOMPILE_VOLTA`, `-DCOMPILE_TURING`, and `-DCOMPILE_AMPERE` (see the configure sketch after this list)
- Disable `-DCOMPILE_KEPLER` and `-DCOMPILE_MAXWELL` by default.
- Dropped support for legacy graph groups.
- Developer documentation framework based on Sphinx+Doxygen+Breathe+Exhale
- Expression graph documentation (#788)
- Graph operators documentation (#801)
- Remove unused variable from expression graph
- Factor groups and concatenation: doc/factors.md
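
A hedged configure sketch using the renamed per-architecture switches and the intrinsics override above; the chosen set of architectures is only an example:

```bash
# Hypothetical CMake configuration: compile GPU kernels only for
# Pascal through Ampere (Kepler/Maxwell are now off by default) and
# build generic x86-64 CPU code without AVX512 intrinsics.
cmake .. \
    -DCOMPILE_PASCAL=on -DCOMPILE_VOLTA=on \
    -DCOMPILE_TURING=on -DCOMPILE_AMPERE=on \
    -DBUILD_ARCH=x86-64 -DCOMPILE_AVX512=off
```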

## [1.10.0] - 2021-02-06

### Added
- Added `intgemm8(ssse3|avx|avx512)?` and `intgemm16(sse2|avx|avx512)?` types to `marian-conv`, which use the intgemm backend. Types `intgemm8` and `intgemm16` are hardware-agnostic; the other ones are hardware-specific (see the sketch after this list).
- Shortlist size is now always a multiple of eight.
- Added an architecture-agnostic binary format for intgemm 8/16-bit integers.
- Add `--train-embedder-rank` for fine-tuning any encoder(-decoder) model for multi-lingual similarity via softmax-margin loss
- Add `--logical-epoch`, which allows redefining the displayed epoch counter as a multiple of n data epochs, updates, or labels. A second argument defines the width of the fractional part.
- Add `--metrics chrf` for computing ChrF according to https://www.aclweb.org/anthology/W15-3049/ and the SacreBLEU reference implementation
- Add `--after` option, which is meant to replace `--after-batches` and `--after-epochs` and can take label-based criteria
- Add `--transformer-postprocess-top` option to enable correctly normalized prenorm behavior
- Add `--task transformer-base-prenorm` and `--task transformer-big-prenorm`
- Turing and Ampere GPU optimisation support, if the CUDA version supports it.
- Printing word-level scores in marian-scorer
- Optimize LayerNormalization on CPU by 6x through vectorization (`-ffast-math`), fixing a performance regression introduced with strides in commit 77a420
- Decoding multi-source models in marian-server with `--tsv`
- GitHub workflows on Ubuntu, Windows, and macOS
- LSH indexing to replace the shortlist
- ONNX support for transformer models (very experimental)
- Add topk operator like PyTorch's topk
- Use `cblas_sgemm_batch` instead of a for-loop over `cblas_sgemm` on CPU as the `batched_gemm` implementation
- Supporting relative paths in shortlist and sqlite options
- Training and scoring from STDIN
- Support for reading from TSV files from STDIN and other sources during training
  and translation with options `--tsv` and `--tsv-fields n`
- Internal optional parameter in n-best list generation that skips empty hypotheses.
- Quantized training (fixed-point or log-based quantization) with the `--quantize-bits N` option
- Support for using Apple Accelerate as the BLAS library
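
A sketch of two of the new workflows, with hypothetical file names (`model.npz`, `corpus.tsv`, `vocab.spm`); the `--gemm-type` spelling for selecting the intgemm type and `--train-sets stdin` are assumptions, not confirmed by this changelog:

```bash
# Convert a model to the hardware-agnostic 8-bit intgemm format.
marian-conv --from model.npz --to model.intgemm8.bin --gemm-type intgemm8

# Train from tab-separated source/target pairs read from stdin.
cat corpus.tsv | marian \
    --train-sets stdin --tsv --tsv-fields 2 \
    --model model.npz --vocabs vocab.spm vocab.spm
```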

### Fixed
- Segfault of `spm_train` when compiled with `-DUSE_STATIC_LIBS=ON` seems to have gone away with the update to a newer SentencePiece version.
- Fix bug causing certain reductions into scalars to be 0 on the GPU backend. Removed unnecessary warp shuffle instructions.
- Do not apply dropout in embedding layers during inference with `--dropout-src`/`--dropout-trg`
- Print the "server is listening on port" message only after the server is accepting connections
- Fix compilation without BLAS installed
- Providing a single value to vector-like options using the equals sign, e.g. `--models=model.npz`
- Fix quiet-translation in marian-server
- CMake-based compilation on Windows
- Fix minor issues with compilation on MacOS
- Fix warnings in Windows MSVC builds using CMake
- Fix building server with Boost 1.72
- Make mini-batch scaling depend on `mini-batch-words` and not on `mini-batch-words-ref`
- In concatenation make sure that we do not multiply 0 with NaN (which results in NaN)
- Change `Approx.epsilon(0.01)` to `Approx.margin(0.001)` in unit tests. Tolerance is now
  absolute and not relative; we had incorrectly assumed that epsilon is an absolute tolerance.
- Fixed bug in finding `.git/logs/HEAD` when Marian is a submodule in another project.
- Properly record CMake variables in the CMake build directory instead of the source tree.
- Added default "none" for the option `shuffle` in `BatchGenerator`, so that it works in executables where shuffle is not an option.
- Added a few missing header files in `shortlist.h` and `beam_search.h`.
- Improved handling of SIGTERM during training. By default, SIGTERM triggers 'save (now) and exit'. Prior to this fix, batch pre-fetching did not check for this signal, potentially delaying exit considerably; it now pays attention to it. The default save-and-exit behaviour can be disabled on the command line with `--sigterm exit-immediately`.
- Fix runtime failures of FastOpt on 32-bit builds (WASM just happens to be 32-bit), caused by hashing with an inconsistent mix of `uint64_t` and `size_t`.
- Fix `beam_search` abort `ABORT_IF(beamHypIdx >= beam.size(), "Out of bounds beamHypIdx??")` when OpenMP is enabled and `OMP_NUM_THREADS > 1`

### Changed
- Remove `--clip-gemm` which is obsolete and was never used anyway
- Removed the `--optimize` switch; the compute type is now determined based on the binary model.
- Updated SentencePiece repository to version 8336bbd0c1cfba02a879afe625bf1ddaf7cd93c5 from https://github.com/google/sentencepiece.
- Enabled compilation of SentencePiece by default since there is no dependency on protobuf anymore.
- Changed the default value of `--sentencepiece-max-lines` from 10000000 to 2000000, since the new version apparently no longer samples automatically (it is not quite clear how this affects vocabulary quality).
- Change mini-batch-fit search stopping criterion to stop at ideal binary search threshold.
- `--metric bleu` now always detokenizes SacreBLEU-style if the vocabulary knows how to; use `bleu-segmented` to compute BLEU on word ids. `bleu-detok` is now a synonym for `bleu`.
- Move label-smoothing computation into Cross-entropy node
- Move Simple-WebSocket-Server to submodule
- Python scripts start with `#!/usr/bin/env python3` instead of `python`
- Changed compile flags `-Ofast` to `-O3` and removed `-ffinite-math`
- Moved old graph groups to a deprecated folder
- Make cuBLAS and cuSPARSE handle initialization lazy to save memory when unused
- Replaced the exception-based implementation for type determination in `FastOpt::makeScalar`

## [1.9.0] - 2020-03-10

### Added
- An option to print cached variables from CMake
- Add support for compiling on macOS (and with Clang)
- An option for resetting stalled validation metrics
- Add CMake options to disable compilation for specific GPU SM types
- An option to print word-level translation scores (see the sketch after this list)
- An option to turn off automatic detokenization from SentencePiece
- Separate quantization types for 8-bit FBGEMM for AVX2 and AVX512
- Sequence-level unlikelihood training
- Allow templated file names for `valid-translation-output` files
- Support for lexical shortlists in marian-server
- Support for 8-bit matrix multiplication with FBGEMM
- CMakeLists.txt now looks for SSE 4.2
- Purging of finished hypotheses during beam search, which is a lot faster for large batches.
- Faster option look-up, up to 20-30% faster translation
- Added `--cite` and `--authors` flags
- Added optional support for ccache
- A switch to change aborts into exceptions, only to be used in library mode
- Support for 16-bit packed models with FBGEMM
- Multiple separated parameter types in ExpressionGraph, currently inference-only
- Safe handling of sigterm signal
- Automatic vectorization of elementwise operations on CPU for tensor dims that
  are divisible by 4 (AVX) and 8 (AVX2)
- Replacing `std::shared_ptr<T>` with a custom `IntrusivePtr<T>` for small objects like
  Tensors, Hypotheses, and Expressions
- Fp16 inference working for translation
- Gradient-checkpointing
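
A brief sketch of the new user-facing flags; `--cite` is named in this release, while the `--word-scores` spelling for word-level scores and the file names are assumptions:

```bash
# Print the citation for Marian (new --cite flag).
marian-decoder --cite

# Hypothetical decoding run that also prints word-level translation
# scores; the flag spelling --word-scores is assumed.
marian-decoder --models model.npz --vocabs vocab.spm vocab.spm \
    --word-scores < input.txt > output.txt
```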

### Fixed
- Replace value for `INVALID_PATH_SCORE` with `std::numeric_limits<float>::lowest()`
  to avoid overflow with long sequences
- Break up potential circular references for `GraphGroup*`
- Fix empty source batch entries with batch purging
- Clear RNN cache in transformer model, add correct hash functions to nodes
- Gather-operation for all index sizes
- Fix word weighting with max length cropping
- Fixed compilation on CPUs without support for AVX
- FastOpt now reads "n" and "y" values as strings, not as boolean values
- Fixed multiple reduction kernels on GPU
- Fixed guided-alignment training with cross-entropy
- Replace `IntrusivePtr` with `std::unique_ptr` in FastOpt; fixes random segfaults
  due to non-thread-safe reference counting
- Make sure that items are 256-byte aligned during saving
- Make explicit matmul functions respect the setting of `cublasMathMode`
- Fix memory mapping for mixed-parameter models
- Removed naked pointer and potential memory leak from `file_stream.{cpp,h}`
- Compilation for GCC >= 7 due to exception thrown in destructor
- Sort parameters in lexicographical order during allocation to ensure a consistent
  memory layout during allocation, loading, and saving
- Output empty line when input is empty line. Previous behavior might result in
  hallucinated outputs.
- Compilation with CUDA 10.1

### Changed
- Combine two for-loops in nth_element.cpp on CPU
- Revert LayerNorm eps to old position, i.e. `sigma' = sqrt(sigma^2 + eps)`
- Downgrade NCCL to 2.3.7 as 2.4.2 is buggy (hangs with larger models)
- Return error signal on SIGTERM
- Dropped support for CUDA 8.0, CUDA 9.0 is now minimal requirement
- Removed autotuner for now, will be switched back on later
- Boost dependency is now optional and only required for marian-server
- Dropped support for g++-4.9
- Simplified file stream and temporary file handling
- Unified node initializers, same function API.
- Remove overstuff/understuff code

## [1.8.0] - 2019-09-04

### Added
- Alias options and new `--task` option
- Automatic detection of CPU intrinsics when building with `-arch=native`
- First version of BERT-training and BERT-classifier, currently not compatible with TF models
- New reduction operators
- Use CMake's ExternalProject to build NCCL and potentially other external libs
- Code for Factored Vocabulary, currently not yet usable without outside tools

### Fixed
- Issue with relative paths in automatically generated decoder config files
- Bug with overlapping CXX flags and building the `spm_train` executable
- Compilation with gcc 8
- Overwriting and unsetting vector options
- Windows build with recent changes
- Bug with read-ahead buffer
- Handling of "dump-config: false" in YAML config
- Errors due to warnings
- Issue concerning failed saving with single-GPU training and the `--sync-sgd` option
- NaN problem when training with Tensor Cores on Volta GPUs
- Fix pipe-handling
- Fix compilation with GCC 9.1
- Fix CMake build types

### Changed
- Error message when using left-to-right and right-to-left models together in ensembles
- Regression tests included as a submodule
- Update NCCL to 2.4.2
- Add zlib source to Marian's source tree; it now builds as an object lib
- `-DUSE_STATIC_LIBS=on` now also looks for static versions of CUDA libraries
- Include NCCL build from github.com/marian-nmt/nccl and compile within source tree
- Set nearly all warnings as errors for Marian's own targets; disable warnings for 3rd-party code
- Refactored beam search

## [1.7.0] - 2018-11-27

### Added
- Word alignment generation in scorer
- Attention output generation in decoder and scorer with `--alignment soft`
- Support for SentencePiece vocabularies and run-time segmentation/desegmentation
- Support for SentencePiece vocabulary training during model training
- Group training files by filename when creating joint vocabularies
- Updated examples
- Synchronous multi-node training (early version)

### Fixed
- Delayed output in line-by-line translation

### Changed
- Generated word alignments include alignments for target EOS tokens
- Boost::program_options has been replaced by another CLI library
- Replace `boost::filesystem` with Pathie
- Expansion of unambiguous command-line arguments is no longer supported

## [1.6.0] - 2018-08-08

### Added
- Faster training (20-30%) by optimizing gradient propagation of biases
- Returning Moses-style hard alignments during decoding of single models,
  ensembles, and n-best lists
- Hard alignment extraction strategy taking source words that have an
  attention value greater than the threshold
- Refactored sync SGD for easier communication and integration with NCCL
- Smaller memory overhead for sync SGD
- NCCL integration (version 2.2.13)
- New binary format for saving/loading of models; can be used with the _*.bin_
  extension (can be memory-mapped)
- Memory-mapping of graphs for inferece with `ExpressionGraph::mmap(const void*
  ptr)` function. (assumes _*.bin_ model is mapped or in buffer)
- Added SRU (`--dec-cell sru`) and ReLU (`--dec-cell relu`) cells to the inventory of
  RNN cells
- RNN auto-regression layers in transformer (`--transformer-decoder-autoreg
  rnn`); work with gru, lstm, tanh, relu, sru cells (see the sketch after this list)
- Recurrently stacked layers in transformer (`--transformer-tied-layers 1 1 1 2
  2 2` means 6 layers with 1-3 and 4-6 tied parameters, two groups of
  parameters)
- Seamless training continuation with exponential smoothing
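
A hedged training sketch combining the new decoder cells and tied-layer options above; file and vocabulary names are placeholders:

```bash
# Hypothetical transformer training run: SRU-based auto-regression layers
# in the decoder and six layers tied in two groups (1-3 and 4-6).
marian --type transformer \
    --transformer-decoder-autoreg rnn --dec-cell sru \
    --transformer-tied-layers 1 1 1 2 2 2 \
    --train-sets corpus.src corpus.trg \
    --model model.npz --vocabs vocab.src.yml vocab.trg.yml
```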

### Fixed
- A couple of bugs in "selection" (transpose, shift, cols, rows) operators
  during back-prop for a very specific case: one of the operators is the first
  operator after a branch, in which case gradient propagation might be
  interrupted. This did not affect any of the existing models, as such a case
  was not present, but might have caused future models to not train properly
- Bug in mini-batch-fit: tied embeddings would result in identical embeddings
  in the fake source and target batches, causing under-estimation of memory usage
  and re-allocation

## [1.5.0] - 2018-06-17

### Added
- Average Attention Networks for Transformer model
- 16-bit matrix multiplication on CPU
- Memoization for constant nodes for decoding
- Autotuning for decoding

### Fixed
- GPU decoding optimizations, about 2x faster decoding of transformer models
- Multi-node MPI-based training on GPUs

## [1.4.0] - 2018-03-13

### Added
- Data weighting with `--data-weighting` at sentence or word level
- Persistent SQLite3 corpus storage with `--sqlite file.db` (see the sketch after this list)
- Experimental multi-node asynchronous training
- Restoring optimizer and training parameters such as learning rate, validation
  results, etc.
- Experimental multi-CPU training/translation/scoring with `--cpu-threads=N`
- Restoring corpus iteration after training is restarted
- N-best-list scoring in marian-scorer
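
A sketch of the new corpus-storage and weighting options, under the assumption that `--data-weighting` takes a file of per-sentence weights; all file names are placeholders:

```bash
# Hypothetical training run: keep the shuffled corpus in a persistent
# SQLite3 database and weight training sentences individually.
marian --train-sets corpus.src corpus.trg \
    --model model.npz --vocabs vocab.src.yml vocab.trg.yml \
    --sqlite corpus.db \
    --data-weighting weights.txt
```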

### Fixed
- Deterministic data shuffling with specific seed for SQLite3 corpus storage
- Mini-batch fitting with binary search for faster fitting
- Better batch packing due to sorting


## [1.3.1] - 2018-02-04

### Fixed
- Missing final validation when done with training
- Differing summaries for marian-scorer when used with multiple GPUs

## [1.3.0] - 2018-01-24

### Added
- SQLite3 based corpus storage for on-disk shuffling etc. with `--sqlite`
- Asynchronous maxi-batch preloading
- Using transpose in SGEMM to tie embeddings in output layer

## [1.2.1] - 2018-01-19

### Fixed
- Use `valid-mini-batch` size during validation with "translation" instead of
  `mini-batch`
- Normalize gradients with multi-GPU synchronous SGD
- Fix divergence between saved models and validated models in asynchronous SGD

## [1.2.0] - 2018-01-13

### Added
- Option `--pretrained-model` to be used for network weights initialization
  with a pretrained model
- Version number saved in the model file
- CMake option `-DCOMPILE_SERVER=ON`
- Right-to-left training, scoring, decoding with `--right-left`

### Fixed
- Fixed marian-server compilation with Boost 1.66
- Fixed compilation on g++-4.8.4
- Fixed compilation without marian-server if openssl is not available

## [1.1.3] - 2017-12-06

### Added
- Added back gradient-dropping

### Fixed
- Fixed parameters initialization for `--tied-embeddings` during translation

## [1.1.2] - 2017-12-05

### Fixed
- Fixed ensembling with language model and batched decoding
- Fixed attention reduction kernel with large matrices (added missing
  `__syncthreads()`), which should fix stability with large batches and beam sizes
  during batched decoding

## [1.1.1] - 2017-11-30

### Added
- Option `--max-length-crop` to be used together with `--max-length N` to crop
  sentences to length N rather than omitting them.
- Experimental model with convolution over input characters

### Fixed
- Fixed a number of bugs for vocabulary and directory handling

## [1.1.0] - 2017-11-21

### Added
- Batched translation for all model types, significant translation speed-up
- Batched translation during validation with translation
- `--maxi-batch-sort` option for `marian-decoder`
- Support for `CUBLAS_TENSOR_OP_MATH` mode for cuBLAS in CUDA 9.0
- The `marian-vocab` tool to create vocabularies

## [1.0.0] - 2017-11-13

### Added
- Multi-GPU validation, scorer, and in-training translation
- Summary mode for scorer
- New "transformer" model based on [Attention is all you
  need](https://arxiv.org/abs/1706.03762)
- Options specific for the transformer model
- Linear learning rate warmup with and without initial value
- Cyclic learning rate warmup
- More options for learning rate decay, including: optimizer history reset,
  repeated warmup
- Continuous inverted square root decay of learning rate (`--lr-decay-inv-sqrt`)
  based on number of updates
- Exposed optimizer parameters (e.g. momentum etc. for Adam)
- Version of deep RNN-based models compatible with Nematus (`--type nematus`)
- Synchronous SGD training for multi-GPU (enable with `--sync-sgd`)
- Dynamic construction of complex models with different encoders and decoders,
  currently only available through the C++ API
- Option `--quiet` to suppress output to stderr
- Option to choose different variants of optimization criterion: mean
  cross-entropy, perplexity, cross-entropy sum
- In-process translation for validation, uses the same memory as training
- Label Smoothing
- CHANGELOG.md
- CONTRIBUTING.md
- Swish activation function default for Transformer
  (https://arxiv.org/pdf/1710.05941.pdf)

### Changed
- Changed shape organization to follow numpy.
- Changed option `--moving-average` to `--exponential-smoothing` and inverted the
  formula to `s_t = (1 - \alpha) * s_{t-1} + \alpha * x_t`; `\alpha` is now
  `1e-4` by default
- Got rid of thrust for compile-time mathematical expressions
- Changed boolean option `--normalize` to `--normalize [arg=1] (=0)`. New
  behaviour is backwards-compatible and can also be specified as
  `--normalize=0.6`
- Renamed "s2s" binary to "marian-decoder"
- Renamed "rescorer" binary to "marian-scorer"
- Renamed "server" binary to "marian-server"
- Renamed option name `--dynamic-batching` to `--mini-batch-fit`
- Unified cross-entropy-based validation, supports now perplexity and other CE
- Changed `--normalize (bool)` to `--normalize (float)arg`, allowing the length-normalization
  weight to be changed as `score / pow(length, arg)`, as illustrated in the worked example below
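
A worked instance of the float-valued normalization, under the assumption of a hypothesis of length 10 with log-probability score -6 and `--normalize 0.6`:

$$\frac{\text{score}}{\text{length}^{\text{arg}}} = \frac{-6}{10^{0.6}} \approx \frac{-6}{3.98} \approx -1.51$$

With `arg = 1` this reduces to plain length normalization, `score / length`.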

### Removed
- Temporarily removed gradient dropping (`--drop-rate X`) until refactoring.