author    Kenneth Heafield <kpu@users.noreply.github.com>  2021-09-08 16:02:21 +0300
committer GitHub <noreply@github.com>  2021-09-08 16:02:21 +0300
commit    4dd30b5065efba61fc044e9dc4303205c9d2ac53 (patch)
tree      59d43288e309801d0d63331bfabb48bdea534607
parent    8d0a3c0c2749f234acf60a6c33fa93d5918f8fe7 (diff)
Factor concatenation improvements and documentation (#748)
* Concatenation combining option added when embedding using factors
* crossMask not used by default
* Added an option to better clarify when choosing factor predictor options
* Fixed bug when choosing re-embedding option and not setting embedding size
* Avoid unnecessary string copy
* Check in factors documentation
* Fix duplication in merge
* Self-referential repository
* Change --factors-predictor to --lemma-dependency. Default behaviour changed.
* Factor-related options are now stored with the model
* Update doc/factors.md
* Add backward compatibility for the target factors
* Move backward compatibility checks for factors to happen after the model.npz config is loaded
* Add explicit error msg if using concat on target
* Update func comments. Fix spaces
* Add Marian version requirement
* Delete experimental code

Co-authored-by: Pedro Coelho <pedrodiascoelho97@gmail.com>
Co-authored-by: Pedro Coelho <pedro.coelho@unbabel.com>
Co-authored-by: Roman Grundkiewicz <rgrundkiewicz@gmail.com>
-rw-r--r--  CHANGELOG.md                       1
-rw-r--r--  doc/factors.md                   218
-rw-r--r--  src/common/config.cpp             15
-rw-r--r--  src/common/config_parser.cpp       7
-rw-r--r--  src/data/corpus_base.cpp          49
-rw-r--r--  src/data/corpus_base.h             3
-rw-r--r--  src/data/factored_vocab.cpp       38
-rw-r--r--  src/data/factored_vocab.h         14
-rw-r--r--  src/layers/embedding.cpp          74
-rw-r--r--  src/layers/embedding.h             2
-rw-r--r--  src/layers/output.cpp             44
-rw-r--r--  src/models/encoder_classifier.h    3
-rw-r--r--  src/models/encoder_decoder.cpp     3
-rw-r--r--  src/models/encoder_pooler.h        3
-rw-r--r--  src/models/s2s.h                   4
-rw-r--r--  src/models/transformer.h          10
16 files changed, 377 insertions, 111 deletions
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 05658fe1..e0b85314 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -62,6 +62,7 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.
- Expresion graph documentation (#788)
- Graph operators documentation (#801)
- Remove unused variable from expression graph
+- Factor groups and concatenation: doc/factors.md
## [1.10.0] - 2021-02-06
diff --git a/doc/factors.md b/doc/factors.md
new file mode 100644
index 00000000..59e14b68
--- /dev/null
+++ b/doc/factors.md
@@ -0,0 +1,218 @@
+# Using marian with factors
+
+Following this guide should allow the user to train a model with source and/or target side factors. To train with factors, the data must be formatted in a certain way. A special vocabulary file format is also required; its extension must be `.fsv`, since providing a source and/or target vocabulary file with this extension is what triggers the use of source and/or target factors. See details below.
+
+## Requirements
+
+In order to use factors in Marian, you need at least Marian 1.9.0. The command line options `--factors-combine`, `--factors-dim-emb` and `--lemma-dependency` were only introduced after Marian 1.10.20, so setting any of them to a non-default value requires a more recent version.
+
+## Define factors
+
+Factors should be organized in "groups," where each group represents a different feature. For example, there could be a group denoting capitalization and another denoting subword divisions.
+
+Factors within a single group should start with the same string.
+
+For example, for a capitalization factor group, the individual factors could be:
+
+* `c0`: all lowercase
+
+* `c1`: first character capitalized, rest lowercase
+
+* `c2`: all uppercase
+
+If there were a second factor group for subword divisions, the individual factors could be:
+
+* `s0`: end of word, whitespace should follow
+
+* `s1`: join token with next subword
+
+There is no hard limit on the number of factor groups, apart from practical limitations in how the vocabulary is stored by `marian`. If that limit is exceeded, `marian` will throw an error.
+
+Factor group zero is always the actual words in the text, referred to as *lemmas*.
+
+## Data preparation
+
+Factors are appended to the *lemmas* with a pipe `|`. The pipe also separates factors of multiple groups.
+
+Example sentence:
+
+```
+Trump tested positive for COVID-19.
+```
+
+Preprocessed sentence:
+```
+trump test@@ ed positive for c@@ o@@ v@@ i@@ d - 19 .
+```
+
+Apply factors:
+```
+trump|c1|s0 test|c0|s1 ed|c0|s0 positive|c0|s0 for|c0|s0 c|c2|s1 o|c2|s1 v|c2|s1 i|c2|s1 d|c2|s0 -|c0|s0 19|c0|s0 .|c0|s0
+```
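+
+How these annotations are produced is up to the preprocessing pipeline; any script that emits the format above will do. As a minimal sketch only (not part of marian, and assuming the `c`/`s` factor groups and the `@@` BPE continuation marker of this example), the factors could be applied like this:
+
+```
+#include <cctype>
+#include <iostream>
+#include <sstream>
+#include <string>
+
+// Map a token's surface form to the capitalization factors of the example above.
+static std::string caseFactor(const std::string& tok) {
+  bool anyUpper = false, allUpper = true;
+  for (unsigned char c : tok)
+    if (std::isalpha(c)) {
+      if (std::isupper(c)) anyUpper = true;
+      else allUpper = false;
+    }
+  if (!anyUpper) return "c0";  // all lowercase (or no letters at all)
+  if (allUpper)  return "c2";  // all uppercase
+  return "c1";                 // first character capitalized (approximation)
+}
+
+int main() {
+  std::string line, tok;
+  // Input: one cased, BPE-split sentence per line, e.g. "Trump test@@ ed positive ..."
+  while (std::getline(std::cin, line)) {
+    std::istringstream in(line);
+    std::string out;
+    while (in >> tok) {
+      bool join = tok.size() > 2 && tok.compare(tok.size() - 2, 2, "@@") == 0;
+      if (join) tok.erase(tok.size() - 2);                  // drop the "@@" marker
+      std::string lemma = tok;
+      for (auto& c : lemma) c = (char)std::tolower((unsigned char)c);
+      out += lemma + "|" + caseFactor(tok) + "|" + (join ? "s1" : "s0") + " ";
+    }
+    std::cout << out << "\n";
+  }
+  return 0;
+}
+```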
+
+
+## Create the factored vocabulary
+
+Factored vocabularies must have the extension `.fsv`. How to structure the vocabulary file is described below. If using factors only on the source or target side, the vocabulary of the other side can be a normal `json`, `yaml`, etc.
+
+The `.fsv` vocabulary must have two sections:
+
+1. **Factors**
+
+ The factor groups are defined with an underscore prepended. The colon indicates which factor group each factor inherits from. `_has_c` is used in the definition of the words in the vocabulary (see #2 below) to indicate that that word has that factor group. The `_lemma` factor is used for the words/tokens themselves; this must be present.
+
+ ```
+ _lemma
+
+ _c
+ c0 : _c
+ c1 : _c
+ c2 : _c
+ _has_c
+
+ _s
+ s0 : _s
+ s1 : _s
+ _has_s
+ ```
+
+2. **Lemmas**
+
+ These are the vocabulary entries themselves. They have the format of `LEMMA : _lemma [_has_c] [_has_s]`. The `_has_X` should only apply to lemmas that can have an `X` factor anywhere in the data (which will likely be all of the tokens except `</s>` and `<unk>`).
+
+ Examples:
+ ```
+ </s> : _lemma
+ <unk> : _lemma
+ , : _lemma _has_c _has_s
+ . : _lemma _has_c _has_s
+ the : _lemma _has_c _has_s
+ for : _lemma _has_c _has_s
+ ```
+
+
+#### Other Requirements
+
+Certain characters are reserved by the `.fsv` vocabulary format and must be escaped or replaced in the data: `#:_\|`
+
+The tokens in the factor vocabularies (`c0`, `c1`, `s0`, etc.) cannot be present in any of the *lemmas*.
+
+### Full `.fsv` file
+
+Putting everything together, the final `.fsv` file should look like this. It can contain comments (lines starting with `#`).
+
+ ```
+ # factors
+
+_lemma
+
+_c
+c0 : _c
+c1 : _c
+c2 : _c
+_has_c
+
+_s
+s0 : _s
+s1 : _s
+_has_s
+
+ # lemmas
+
+</s> : _lemma
+<unk> : _lemma
+, : _lemma _has_c _has_s
+. : _lemma _has_c _has_s
+the : _lemma _has_c _has_s
+for : _lemma _has_c _has_s
+ ```
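+
+For a handful of lemmas the `.fsv` file can be written by hand; for a real vocabulary it is easier to derive it from the factored training data. A rough sketch of the latter (not part of marian; it assumes single-character group prefixes such as `c` and `s` as in this example, ignores the escaping requirement above, and always adds `</s>` and `<unk>` without factor groups):
+
+```
+#include <fstream>
+#include <iostream>
+#include <map>
+#include <set>
+#include <sstream>
+#include <string>
+
+int main(int argc, char** argv) {
+  if (argc < 2) { std::cerr << "usage: make_fsv FACTORED_CORPUS\n"; return 1; }
+  std::ifstream corpus(argv[1]);
+  std::map<std::string, std::set<char>> lemmaGroups;  // lemma -> factor groups it occurs with
+  std::map<char, std::set<std::string>> factors;      // group prefix -> factor values seen
+  std::string line, tok;
+  while (std::getline(corpus, line)) {
+    std::istringstream in(line);
+    while (in >> tok) {
+      std::istringstream parts(tok);
+      std::string field, lemma;
+      bool first = true;
+      while (std::getline(parts, field, '|')) {
+        if (first) { lemma = field; lemmaGroups[lemma]; first = false; }
+        else if (!field.empty()) {
+          factors[field[0]].insert(field);
+          lemmaGroups[lemma].insert(field[0]);
+        }
+      }
+    }
+  }
+  // factor section
+  std::cout << "# factors\n\n_lemma\n";
+  for (auto& g : factors) {
+    std::cout << "\n_" << g.first << "\n";
+    for (auto& f : g.second) std::cout << f << " : _" << g.first << "\n";
+    std::cout << "_has_" << g.first << "\n";
+  }
+  // lemma section: </s> and <unk> first, without any factor group
+  std::cout << "\n# lemmas\n\n</s> : _lemma\n<unk> : _lemma\n";
+  for (auto& l : lemmaGroups) {
+    std::cout << l.first << " : _lemma";
+    for (char g : l.second) std::cout << " _has_" << g;
+    std::cout << "\n";
+  }
+  return 0;
+}
+```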
+
+## Training options
+
+There are two choices for how factor embeddings are combined with *lemma* embeddings: summation and concatenation.
+
+```
+--factors-combine TEXT=sum How to combine the factors and lemma embeddings.
+ Options available: sum, concat
+```
+
+The dimension of the factor embeddings must be specified if using combine option `concat`. If using `sum`, the factor embedding dimension matches that of the lemmas.
+
+```
+--factors-dim-emb INT Embedding dimension of the factors. Only used if concat is selected as factors combining form
+```
+
+Note: at the moment, `concat` is only implemented for use on the source side.
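+
+The width of the embedding the encoder sees follows from the shape arithmetic in `src/layers/embedding.cpp` (see the diff below): with `sum` it stays at `--dim-emb`, while with `concat` one `--factors-dim-emb`-wide vector (the sum of the active factor embeddings over all groups) is appended to the lemma embedding. As a small illustration:
+
+```
+#include <string>
+
+// Width of the embedding fed to the encoder, e.g. --dim-emb 512 and
+// --factors-dim-emb 8 give 512 with "sum" and 520 with "concat".
+int embeddingWidth(const std::string& factorsCombine, int dimEmb, int dimFactorEmb) {
+  return factorsCombine == "concat" ? dimEmb + dimFactorEmb : dimEmb;
+}
+```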
+
+### Prediction
+
+If using factors on the target side, there are several options for how factor predictions are generated, differing in how the factor prediction is conditioned on the predicted lemma. If `--lemma-dependency` is not set, the default behaviour is to predict the factors with no lemma dependency. The available dependency types are listed below, together with a note on how older `--lemma-dim-emb` settings map onto them.
+
+```
+--lemma-dependency TEXT Lemma dependency method to use when predicting target factors.
+ Options: soft-transformer-layer, hard-transformer-layer, lemma-dependent-bias, re-embedding
+
+--lemma-dim-emb INT=0 Re-embedding dimension of lemma in factors
+```
+
+* `soft-transformer-layer`: Uses an additional transformer layer to predict the factors using the previously predicted lemma
+* `hard-transformer-layer`: Like `soft-transformer-layer` but with hard-max
+* `lemma-dependent-bias`: Adds a learned bias term based on the predicted lemma to the logits of the factors. There is no additional transformer layer introduced with this option
+* `re-embedding`: After predicting a lemma, re-embed the lemma and add this new vector before predicting the factors
+* `lemma-dim-emb`: Controls the dimension of the re-embedded lemma when using the option `re-embedding`
+
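+When `--lemma-dependency` is left empty, older configurations that used the legacy `--lemma-dim-emb` sentinel values are still honoured; the mapping is applied in `Config::initialize()` (`src/common/config.cpp`, see the diff below) and amounts to the following logic:
+
+```
+#include <string>
+
+// Mapping applied when --lemma-dependency is empty (mirrors src/common/config.cpp).
+std::string lemmaDependencyFromLegacy(int lemmaDimEmb) {
+  if (lemmaDimEmb > 0)   return "re-embedding";           // re-embed the lemma, dim = lemmaDimEmb
+  if (lemmaDimEmb == -1) return "lemma-dependent-bias";
+  if (lemmaDimEmb == -2) return "soft-transformer-layer";
+  if (lemmaDimEmb == -3) return "hard-transformer-layer";
+  return "";                                              // no lemma dependency
+}
+```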
+
+### Weight tying
+
+If you use factors on both the source and target side, and the factors are the same on both sides, you can tie the embeddings exactly as you would for non-factored models.
+
+If factors are used only on one side (either source or target) with a joint vocabulary, there are two options for tying source and target embedding weights:
+
+1. Use combine option `concat` (if using factors only on the source side).
+2. Use combine option `sum`, and create "dummy" factors on the non-factorized side. This entails creating a factored vocabulary where the same number of factors are present as are on the side with meaningful factors. In the previous example, if we have the capitalization and subword factors on the source side, the target side would have five different dummy factors (they can all be in the same group). In the *lemma* section of the `.fsv` file we would just not put `_has_X` for any lemma.
+
+ ```
+ # factors
+
+ _lemma
+
+ _d
+ d0 : _d
+ d1 : _d
+ d2 : _d
+ d3 : _d
+ d4 : _d
+ _has_d
+
+ # lemmas
+
+ </s> : _lemma
+ <unk> : _lemma
+ , : _lemma
+ . : _lemma
+ le : _lemma
+ pour : _lemma
+ ```
+
+## Examples
+Some examples of possible commands to train factored models in marian:
+* Using factors on both source and target. Using `sum` to combine lemma and factor embeddings. No tied embeddings and no lemma dependency when predicting the factors:
+```
+path_to/build/marian -t corpus.fact.{src,trg} \
+ -v vocab.{src,trg}.fsv
+```
+* Using factors only on the source side. Using `concat` to combine lemma and factor embeddings. Source, target and output embedding matrices tied:
+```
+path_to/build/marian -t corpus.fact.src corpus.trg \
+ -v vocab.src.fsv vocab.trg.yml \
+ --factors-combine concat \
+ --factors-dim-emb 8 \
+ --tied-embeddings-all
+```
+* Using factors only on the target side. Using `sum` to combine lemma and factor embeddings. Target and output embedding matrices tied. Predicting factors with `soft-transformer-layer` lemma dependency:
+```
+path_to/build/marian -t corpus.src corpus.fact.trg \
+ -v vocab.src.yml vocab.trg.fsv \
+ --tied-embeddings \
+ --lemma-dependency soft-transformer-layer
+```
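+* Using factors only on the source side with `sum`, plus the dummy-factor target vocabulary described under *Weight tying*, so that all embedding matrices can be tied (file names are illustrative):
+```
+path_to/build/marian -t corpus.fact.src corpus.trg \
+ -v vocab.src.fsv vocab.trg.fsv \
+ --tied-embeddings-all
+```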
diff --git a/src/common/config.cpp b/src/common/config.cpp
index a1c4ed5a..9878c70b 100644
--- a/src/common/config.cpp
+++ b/src/common/config.cpp
@@ -116,6 +116,21 @@ void Config::initialize(ConfigParser const& cp) {
config_["tsv-fields"] = tsvFields;
}
+ // ensures factors backward compatibility whilst keeping the more user friendly CLI
+ if(get<std::string>("lemma-dependency").empty()) {
+ YAML::Node config;
+ int lemmaDimEmb = get<int>("lemma-dim-emb");
+ if(lemmaDimEmb > 0) {
+ config_["lemma-dependency"] = "re-embedding";
+ } else if(lemmaDimEmb == -1) {
+ config_["lemma-dependency"] = "lemma-dependent-bias";
+ } else if(lemmaDimEmb == -2) {
+ config_["lemma-dependency"] = "soft-transformer-layer";
+ } else if(lemmaDimEmb == -3) {
+ config_["lemma-dependency"] = "hard-transformer-layer";
+ }
+ }
+
// echo full configuration
log();
diff --git a/src/common/config_parser.cpp b/src/common/config_parser.cpp
index d7818afb..30d77e36 100644
--- a/src/common/config_parser.cpp
+++ b/src/common/config_parser.cpp
@@ -195,6 +195,13 @@ void ConfigParser::addOptionsModel(cli::CLIWrapper& cli) {
cli.add<int>("--dim-emb",
"Size of embedding vector",
512);
+ cli.add<int>("--factors-dim-emb",
+ "Embedding dimension of the factors. Only used if concat is selected as factors combining form");
+ cli.add<std::string>("--factors-combine",
+ "How to combine the factors and lemma embeddings. Options available: sum, concat",
+ "sum");
+ cli.add<std::string>("--lemma-dependency",
+ "Lemma dependency method to use when predicting target factors. Options: soft-transformer-layer, hard-transformer-layer, lemma-dependent-bias, re-embedding");
cli.add<int>("--lemma-dim-emb",
"Re-embedding dimension of lemma in factors",
0);
diff --git a/src/data/corpus_base.cpp b/src/data/corpus_base.cpp
index 5f9a9ee3..9d95a121 100644
--- a/src/data/corpus_base.cpp
+++ b/src/data/corpus_base.cpp
@@ -566,54 +566,5 @@ void SentenceTuple::setWeights(const std::vector<float>& weights) {
weights_ = weights;
}
-// experimental: hide inline-fix source tokens from cross attention
-std::vector<float> SubBatch::crossMaskWithInlineFixSourceSuppressed() const
-{
- const auto& srcVocab = *vocab();
-
- auto factoredVocab = vocab()->tryAs<FactoredVocab>();
- size_t inlineFixGroupIndex = 0, inlineFixSrc = 0;
- auto hasInlineFixFactors = factoredVocab && factoredVocab->tryGetFactor(FactoredVocab_INLINE_FIX_WHAT_serialized, /*out*/ inlineFixGroupIndex, /*out*/ inlineFixSrc);
-
- auto fixSrcId = srcVocab[FactoredVocab_FIX_SRC_ID_TAG];
- auto fixTgtId = srcVocab[FactoredVocab_FIX_TGT_ID_TAG];
- auto fixEndId = srcVocab[FactoredVocab_FIX_END_ID_TAG];
- auto unkId = srcVocab.getUnkId();
- auto hasInlineFixTags = fixSrcId != unkId && fixTgtId != unkId && fixEndId != unkId;
-
- auto m = mask(); // default return value, which we will modify in-place below in case we need to
- if (hasInlineFixFactors || hasInlineFixTags) {
- LOG_ONCE(info, "[data] Suppressing cross-attention into inline-fix source tokens");
-
- // example: force French translation of name "frank" to always be "franck"
- // - hasInlineFixFactors: "frank|is franck|it", "frank|is" cannot be cross-attended to
- // - hasInlineFixTags: "<IOPEN> frank <IDELIM> franck <ICLOSE>", "frank" and all tags cannot be cross-attended to
- auto dimBatch = batchSize(); // number of sentences in the batch
- auto dimWidth = batchWidth(); // number of words in the longest sentence in the batch
- const auto& d = data();
- size_t numWords = 0;
- for (size_t b = 0; b < dimBatch; b++) { // loop over batch entries
- bool inside = false;
- for (size_t s = 0; s < dimWidth; s++) { // loop over source positions
- auto i = locate(/*batchIdx=*/b, /*wordPos=*/s);
- if (!m[i])
- break;
- numWords++;
- // keep track of entering/exiting the inline-fix source tags
- auto w = d[i];
- if (w == fixSrcId)
- inside = true;
- else if (w == fixTgtId)
- inside = false;
- bool wHasSrcIdFactor = hasInlineFixFactors && factoredVocab->getFactor(w, inlineFixGroupIndex) == inlineFixSrc;
- if (inside || w == fixSrcId || w == fixTgtId || w == fixEndId || wHasSrcIdFactor)
- m[i] = 0.0f; // decoder must not look at embedded source, nor the markup tokens
- }
- }
- ABORT_IF(batchWords() != 0/*n/a*/ && numWords != batchWords(), "batchWords() inconsistency??");
- }
- return m;
-}
-
} // namespace data
} // namespace marian
diff --git a/src/data/corpus_base.h b/src/data/corpus_base.h
index 251df5bc..63a6fb99 100644
--- a/src/data/corpus_base.h
+++ b/src/data/corpus_base.h
@@ -236,9 +236,6 @@ public:
}
void setWords(size_t words) { words_ = words; }
-
- // experimental: hide inline-fix source tokens from cross attention
- std::vector<float> crossMaskWithInlineFixSourceSuppressed() const;
};
/**
diff --git a/src/data/factored_vocab.cpp b/src/data/factored_vocab.cpp
index cc715993..e05f3122 100644
--- a/src/data/factored_vocab.cpp
+++ b/src/data/factored_vocab.cpp
@@ -663,6 +663,44 @@ std::string FactoredVocab::surfaceForm(const Words& sentence) const /*override f
return res;
}
+/**
+ * Auxiliary function that returns the total number of factors (excluding lemmas) in a factored vocabulary.
+ * @return number of factors
+ */
+size_t FactoredVocab::getTotalFactorCount() const {
+ return factorVocabSize() - groupRanges_[0].second;
+}
+
+/**
+ * Decodes the lemma and factor indices of each word and returns them separately.
+ * Given the word indices of a batch, it fills one vector with the lemma index of
+ * each word and another with a multi-hot encoding of the factors used by each word.
+ * @param[in] words vector of words
+ * @param[out] lemmaIndices lemma index for each word
+ * @param[out] factorIndices factor usage information for each word (1 if the factor is used, 0 if not)
+ */
+void FactoredVocab::lemmaAndFactorsIndexes(const Words& words, std::vector<IndexType>& lemmaIndices, std::vector<float>& factorIndices) const {
+ lemmaIndices.reserve(words.size());
+ factorIndices.reserve(words.size() * getTotalFactorCount());
+
+ auto numGroups = getNumGroups();
+ std::vector<size_t> lemmaAndFactorIndices;
+
+ for (auto &word : words) {
+ if (vocab_.contains(word.toWordIndex())) { // skip invalid combinations in the space (can only happen during initialization) --@TODO: add a check?
+ word2factors(word, lemmaAndFactorIndices);
+ lemmaIndices.push_back((IndexType) lemmaAndFactorIndices[0]); // save the lemma vocabulary index
+ for (size_t g = 1; g < numGroups; g++) { // loop over the different factors group
+ auto factorIndex = lemmaAndFactorIndices[g]; // get the vocabulary index of the factor of group g
+ ABORT_IF(factorIndex == FACTOR_NOT_SPECIFIED, "Attempted to embed a word with a factor not specified");
+ for (int i = 0; i < factorShape_[g] - 1; i++) { // loop over all factors in group g
+ factorIndices.push_back((float) (factorIndex == i)); // fill the factor indexes array with '0' if the factor is not used in a given word, '1' if it is
+ }
+ }
+ }
+ }
+}
+
// create a CSR matrix M[V,U] from words[] with M[v,u] = 1 if factor u is a factor of word v
// This is used to form the embedding of a multi-factor token.
// That embedding is a sum of the embeddings of the individual factors.
diff --git a/src/data/factored_vocab.h b/src/data/factored_vocab.h
index 6b96d8cd..b644ce4c 100644
--- a/src/data/factored_vocab.h
+++ b/src/data/factored_vocab.h
@@ -49,12 +49,13 @@ public:
virtual size_t lemmaSize() const override;
CSRData csr_rows(const Words& words) const; // sparse matrix for summing up factors from the concatenated embedding matrix for each word
-
+ void lemmaAndFactorsIndexes(const Words& words, std::vector<IndexType>& lemmaIndices, std::vector<float>& factorIndices) const;
#ifdef FACTOR_FULL_EXPANSION
const CSRData& getGlobalFactorMatrix() const { return globalFactorMatrix_; } // [v,u] (sparse) -> =1 if u is factor of v --only used in getLogits()
#endif
size_t getNumGroups() const { return groupRanges_.size(); }
- std::pair<size_t, size_t> getGroupRange(size_t g) const { return groupRanges_[g]; } // [g] -> (u_begin, u_end)
+ std::pair<size_t, size_t> getGroupRange(size_t g) const { return groupRanges_[g]; } // [g] -> (u_begin, u_end)
+ size_t getTotalFactorCount() const;
#ifdef FACTOR_FULL_EXPANSION
const std::vector<float>& getGapLogMask() const { return gapLogMask_; } // [v] -inf if v is a gap entry, else 0
#endif
@@ -80,15 +81,6 @@ public:
Word string2word(const std::string& w) const;
bool tryGetFactor(const std::string& factorGroupName, size_t& groupIndex, size_t& factorIndex) const; // note: factorGroupName given without separator
- // some hard-coded constants from FactoredSegmenter
- // The naming mimics the names in FactoredSegmenter.cs, and therefore intentionally does not follow Marian conventions.
- // @TODO: We have more hard-coded constants throughout the code. Move them all here.
- // @TODO: figure out how to do this with static const*/constexpr
-#define FactoredVocab_INLINE_FIX_WHAT_serialized "is"
-#define FactoredVocab_FIX_SRC_ID_TAG "<IOPEN>"
-#define FactoredVocab_FIX_TGT_ID_TAG "<IDELIM>"
-#define FactoredVocab_FIX_END_ID_TAG "<ICLOSE>"
-
private:
void constructGroupInfoFromFactorVocab();
void constructFactorIndexConversion();
diff --git a/src/layers/embedding.cpp b/src/layers/embedding.cpp
index 92c4ad6d..26d6b7fe 100644
--- a/src/layers/embedding.cpp
+++ b/src/layers/embedding.cpp
@@ -8,19 +8,31 @@ Embedding::Embedding(Ptr<ExpressionGraph> graph, Ptr<Options> options)
std::string name = opt<std::string>("prefix");
int dimVoc = opt<int>("dimVocab");
int dimEmb = opt<int>("dimEmb");
+ int dimFactorEmb = opt<int>("dimFactorEmb");
bool fixed = opt<bool>("fixed", false);
+ // Embedding layer initialization should depend only on embedding size, hence fanIn=false
+ auto initFunc = inits::glorotUniform(
+ /*fanIn=*/false, /*fanOut=*/true); // -> embedding vectors have roughly unit length
+
factoredVocab_ = FactoredVocab::tryCreateAndLoad(options_->get<std::string>("vocab", ""));
if(factoredVocab_) {
dimVoc = (int)factoredVocab_->factorVocabSize();
LOG_ONCE(info, "[embedding] Factored embeddings enabled");
+ if(opt<std::string>("factorsCombine") == "concat") {
+ ABORT_IF(dimFactorEmb == 0,
+ "Embedding: If concatenation is chosen to combine the factor embeddings, a factor "
+ "embedding size must be specified.");
+ int numberOfFactors = (int)factoredVocab_->getTotalFactorCount();
+ dimVoc -= numberOfFactors;
+ FactorEmbMatrix_
+ = graph_->param("factor_" + name, {numberOfFactors, dimFactorEmb}, initFunc, fixed);
+ LOG_ONCE(info,
+ "[embedding] Combining lemma and factors embeddings with concatenation enabled");
+ }
}
- // Embedding layer initialization should depend only on embedding size, hence fanIn=false
- auto initFunc = inits::glorotUniform(
- /*fanIn=*/false, /*fanOut=*/true); // -> embedding vectors have roughly unit length
-
if(options_->has("embFile")) {
std::string file = opt<std::string>("embFile");
if(!file.empty()) {
@@ -32,6 +44,26 @@ Embedding::Embedding(Ptr<ExpressionGraph> graph, Ptr<Options> options)
E_ = graph_->param(name, {dimVoc, dimEmb}, initFunc, fixed);
}
+/**
+ * Embeds a sequence of words (given as indices) that carry factor information; the lemma and factor embeddings are concatenated.
+ * @param words vector of words
+ * @returns Expression that is the concatenation of the lemma and factor embeddings
+ */
+/*private*/ Expr Embedding::embedWithConcat(const Words& data) const {
+ auto graph = E_->graph();
+ std::vector<IndexType> lemmaIndices;
+ std::vector<float> factorIndices;
+ factoredVocab_->lemmaAndFactorsIndexes(data, lemmaIndices, factorIndices);
+ auto lemmaEmbs = rows(E_, lemmaIndices);
+ int dimFactors = FactorEmbMatrix_->shape()[0];
+ auto factEmbs
+ = dot(graph->constant(
+ {(int)data.size(), dimFactors}, inits::fromVector(factorIndices), Type::float32),
+ FactorEmbMatrix_);
+
+ return concatenate({lemmaEmbs, factEmbs}, -1);
+}
+
// helper to embed a sequence of words (given as indices) via factored embeddings
Expr Embedding::multiRows(const Words& data, float dropProb) const {
auto graph = E_->graph();
@@ -61,7 +93,9 @@ std::tuple<Expr /*embeddings*/, Expr /*mask*/> Embedding::apply(Ptr<data::SubBat
/*override final*/ {
auto graph = E_->graph();
int dimBatch = (int)subBatch->batchSize();
- int dimEmb = E_->shape()[-1];
+ int dimEmb = (factoredVocab_ && opt<std::string>("factorsCombine") == "concat")
+ ? E_->shape()[-1] + FactorEmbMatrix_->shape()[-1]
+ : E_->shape()[-1];
int dimWidth = (int)subBatch->batchWidth();
// factored embeddings:
@@ -96,14 +130,8 @@ std::tuple<Expr /*embeddings*/, Expr /*mask*/> Embedding::apply(Ptr<data::SubBat
// more slowly
auto batchEmbeddings = apply(subBatch->data(), {dimWidth, dimBatch, dimEmb});
-#if 1
+
auto batchMask = graph->constant({dimWidth, dimBatch, 1}, inits::fromVector(subBatch->mask()));
-#else // @TODO: this is dead code now, get rid of it
- // experimental: hide inline-fix source tokens from cross attention
- auto batchMask
- = graph->constant({dimWidth, dimBatch, 1},
- inits::fromVector(subBatch->crossMaskWithInlineFixSourceSuppressed()));
-#endif
// give the graph inputs readable names for debugging and ONNX
batchMask->set_name("data_" + std::to_string(/*batchIndex_=*/0) + "_mask");
@@ -112,8 +140,12 @@ std::tuple<Expr /*embeddings*/, Expr /*mask*/> Embedding::apply(Ptr<data::SubBat
Expr Embedding::apply(const Words& words, const Shape& shape) const /*override final*/ {
if(factoredVocab_) {
- Expr selectedEmbs = multiRows(words, options_->get<float>("dropout", 0.0f)); // [(B*W) x E]
- selectedEmbs = reshape(selectedEmbs, shape); // [W, B, E]
+ Expr selectedEmbs;
+ if(opt<std::string>("factorsCombine") == "concat")
+ selectedEmbs = embedWithConcat(words); // [(B*W) x E]
+ else
+ selectedEmbs = multiRows(words, options_->get<float>("dropout", 0.0f)); // [(B*W) x E]
+ selectedEmbs = reshape(selectedEmbs, shape); // [W, B, E]
// selectedEmbs = dropout(selectedEmbs, options_->get<float>("dropout", 0.0f), {
// selectedEmbs->shape()[-3], 1, 1 }); // @TODO: replace with factor dropout
return selectedEmbs;
@@ -141,13 +173,15 @@ Expr Embedding::applyIndices(const std::vector<WordIndex>& embIdx, const Shape&
/*private*/ Ptr<IEmbeddingLayer> EncoderDecoderLayerBase::createEmbeddingLayer() const {
// clang-format off
auto options = New<Options>(
- "dimVocab", opt<std::vector<int>>("dim-vocabs")[batchIndex_],
- "dimEmb", opt<int>("dim-emb"),
- "dropout", dropoutEmbeddings_,
- "inference", inference_,
- "prefix", (opt<bool>("tied-embeddings-src") || opt<bool>("tied-embeddings-all")) ? "Wemb"
+ "dimVocab", opt<std::vector<int>>("dim-vocabs")[batchIndex_],
+ "dimEmb", opt<int>("dim-emb"),
+ "dropout", dropoutEmbeddings_,
+ "inference", inference_,
+ "prefix", (opt<bool>("tied-embeddings-src") || opt<bool>("tied-embeddings-all")) ? "Wemb"
: prefix_ + "_Wemb",
- "fixed", embeddingFix_,
+ "fixed", embeddingFix_,
+ "dimFactorEmb", opt<int>("factors-dim-emb"), // for factored embeddings
+ "factorsCombine", opt<std::string>("factors-combine"), // for factored embeddings
"vocab", opt<std::vector<std::string>>("vocabs")[batchIndex_]); // for factored embeddings
// clang-format on
if(options_->hasAndNotEmpty("embedding-vectors")) {
diff --git a/src/layers/embedding.h b/src/layers/embedding.h
index 2fa7b78d..d34c7ffb 100644
--- a/src/layers/embedding.h
+++ b/src/layers/embedding.h
@@ -12,8 +12,10 @@ class FactoredVocab;
// EncoderDecoderLayerBase, which knows to pass on all required parameters from options.
class Embedding : public LayerBase, public IEmbeddingLayer {
Expr E_;
+ Expr FactorEmbMatrix_; // Factors embedding matrix if combining lemma and factors embeddings with concatenation
Ptr<FactoredVocab> factoredVocab_;
Expr multiRows(const Words& data, float dropProb) const;
+ Expr embedWithConcat(const Words& data) const;
bool inference_{false};
public:
diff --git a/src/layers/output.cpp b/src/layers/output.cpp
index 92cccdfb..4d6e488a 100644
--- a/src/layers/output.cpp
+++ b/src/layers/output.cpp
@@ -36,12 +36,12 @@ namespace mlp {
b_ = graph_->param(name + "_b", {1, numOutputClasses}, inits::zeros());
/*const*/ int lemmaDimEmb = options_->get<int>("lemma-dim-emb", 0);
+ std::string lemmaDependency = options_->get<std::string>("lemma-dependency", "");
ABORT_IF(lemmaDimEmb && !factoredVocab_, "--lemma-dim-emb requires a factored vocabulary");
- if(lemmaDimEmb > 0) { // > 0 means to embed the (expected) word with a different embedding matrix
-#define HARDMAX_HACK
-#ifdef HARDMAX_HACK
- lemmaDimEmb = lemmaDimEmb & 0xfffffffe; // hack to select hard-max: use an odd number
-#endif
+ if(lemmaDependency == "re-embedding") { // embed the (expected) word with a different embedding matrix
+ ABORT_IF(
+ lemmaDimEmb <= 0,
+ "In order to predict factors by re-embedding them, a lemma-dim-emb must be specified.");
auto range = factoredVocab_->getGroupRange(0);
auto lemmaVocabDim = (int)(range.second - range.first);
auto initFunc = inits::glorotUniform(
@@ -109,8 +109,12 @@ Logits Output::applyAsLogits(Expr input) /*override final*/ {
std::vector<Ptr<RationalLoss>> allLogits(numGroups,
nullptr); // (note: null entries for absent factors)
Expr input1 = input; // [B... x D]
- Expr Plemma = nullptr; // used for lemmaDimEmb=-1
- Expr inputLemma = nullptr; // used for lemmaDimEmb=-2, -3
+ Expr Plemma = nullptr; // used for lemmaDependency = lemma-dependent-bias
+ Expr inputLemma = nullptr; // used for lemmaDependency = hard-transformer-layer and soft-transformer-layer
+
+ std::string factorsCombine = options_->get<std::string>("factors-combine", "");
+ ABORT_IF(factorsCombine == "concat", "Combining lemma and factors embeddings with concatenation on the target side is currently not supported");
+
for(size_t g = 0; g < numGroups; g++) {
auto range = factoredVocab_->getGroupRange(g);
if(g > 0 && range.first == range.second) // empty entry
@@ -130,9 +134,8 @@ Logits Output::applyAsLogits(Expr input) /*override final*/ {
factorB = slice(b_, -1, Slice((int)range.first, (int)range.second));
}
/*const*/ int lemmaDimEmb = options_->get<int>("lemma-dim-emb", 0);
- if((lemmaDimEmb == -2 || lemmaDimEmb == -3)
- && g > 0) { // -2/-3 means a gated transformer-like structure (-3 = hard-max)
- LOG_ONCE(info, "[embedding] using lemma conditioning with gate");
+ std::string lemmaDependency = options_->get<std::string>("lemma-dependency", "");
+ if((lemmaDependency == "soft-transformer-layer" || lemmaDependency == "hard-transformer-layer") && g > 0) {
// this mimics one transformer layer
// - attention over two inputs:
// - e = current lemma. We use the original embedding vector; specifically, expectation
@@ -229,7 +232,7 @@ Logits Output::applyAsLogits(Expr input) /*override final*/ {
allLogits[g] = New<RationalLoss>(factorLogits, nullptr);
// optionally add a soft embedding of lemma back to create some lemma dependency
// @TODO: if this works, move it into lazyConstruct
- if(lemmaDimEmb == -2 && g == 0) { // -2 means a gated transformer-like structure
+ if(lemmaDependency == "soft-transformer-layer" && g == 0) {
LOG_ONCE(info, "[embedding] using lemma conditioning with gate, soft-max version");
// get expected lemma embedding vector
auto factorLogSoftmax = logsoftmax(
@@ -239,7 +242,7 @@ Logits Output::applyAsLogits(Expr input) /*override final*/ {
factorWt,
false,
/*transB=*/isLegacyUntransposedW ? true : false); // [B... x D]
- } else if(lemmaDimEmb == -3 && g == 0) { // same as -2 except with hard max
+ } else if(lemmaDependency == "hard-transformer-layer" && g == 0) {
LOG_ONCE(info, "[embedding] using lemma conditioning with gate, hard-max version");
// get max-lemma embedding vector
auto maxVal = max(factorLogits,
@@ -249,29 +252,22 @@ Logits Output::applyAsLogits(Expr input) /*override final*/ {
factorWt,
false,
/*transB=*/isLegacyUntransposedW ? true : false); // [B... x D]
- } else if(lemmaDimEmb == -1 && g == 0) { // -1 means learn a lemma-dependent bias
+ } else if(lemmaDependency == "lemma-dependent-bias" && g == 0) {
ABORT_IF(shortlist_, "Lemma-dependent bias with short list is not yet implemented");
LOG_ONCE(info, "[embedding] using lemma-dependent bias");
auto factorLogSoftmax
= logsoftmax(factorLogits); // (we do that again later, CSE will kick in)
auto z = /*stopGradient*/ (factorLogSoftmax);
Plemma = exp(z); // [B... x U]
- } else if(lemmaDimEmb > 0 && g == 0) { // > 0 means learn a re-embedding matrix
+ } else if(lemmaDependency == "re-embedding" && g == 0) {
+ ABORT_IF(
+ lemmaDimEmb <= 0,
+ "In order to predict factors by re-embedding them, a lemma-dim-emb must be specified.");
LOG_ONCE(info, "[embedding] enabled re-embedding of lemma, at dim {}", lemmaDimEmb);
// compute softmax. We compute logsoftmax() separately because this way, computation will be
// reused later via CSE
auto factorLogSoftmax = logsoftmax(factorLogits);
auto factorSoftmax = exp(factorLogSoftmax);
-#ifdef HARDMAX_HACK
- bool hardmax = (lemmaDimEmb & 1)
- != 0; // odd value triggers hardmax for now (for quick experimentation)
- if(hardmax) {
- lemmaDimEmb = lemmaDimEmb & 0xfffffffe;
- LOG_ONCE(info, "[embedding] HARDMAX_HACK enabled. Actual dim is {}", lemmaDimEmb);
- auto maxVal = max(factorSoftmax, -1);
- factorSoftmax = eq(factorSoftmax, maxVal);
- }
-#endif
// re-embedding lookup, soft-indexed by softmax
Expr e;
if(shortlist_) { // short-listed version of re-embedding matrix
diff --git a/src/models/encoder_classifier.h b/src/models/encoder_classifier.h
index 4cfc54f1..5c8ddb5a 100644
--- a/src/models/encoder_classifier.h
+++ b/src/models/encoder_classifier.h
@@ -139,6 +139,9 @@ public:
modelFeatures_.insert("ulr-trainable-transformation");
modelFeatures_.insert("ulr-dim-emb");
modelFeatures_.insert("lemma-dim-emb");
+ modelFeatures_.insert("lemma-dependency");
+ modelFeatures_.insert("factors-combine");
+ modelFeatures_.insert("factors-dim-emb");
}
virtual Ptr<Options> getOptions() override { return options_; }
diff --git a/src/models/encoder_decoder.cpp b/src/models/encoder_decoder.cpp
index 8fc9321a..66ff16ce 100644
--- a/src/models/encoder_decoder.cpp
+++ b/src/models/encoder_decoder.cpp
@@ -62,6 +62,9 @@ EncoderDecoder::EncoderDecoder(Ptr<ExpressionGraph> graph, Ptr<Options> options)
modelFeatures_.insert("ulr-dim-emb");
modelFeatures_.insert("lemma-dim-emb");
modelFeatures_.insert("output-omit-bias");
+ modelFeatures_.insert("lemma-dependency");
+ modelFeatures_.insert("factors-combine");
+ modelFeatures_.insert("factors-dim-emb");
}
std::vector<Ptr<EncoderBase>>& EncoderDecoder::getEncoders() {
diff --git a/src/models/encoder_pooler.h b/src/models/encoder_pooler.h
index 1baa8560..8a212343 100644
--- a/src/models/encoder_pooler.h
+++ b/src/models/encoder_pooler.h
@@ -149,6 +149,9 @@ public:
modelFeatures_.insert("ulr-trainable-transformation");
modelFeatures_.insert("ulr-dim-emb");
modelFeatures_.insert("lemma-dim-emb");
+ modelFeatures_.insert("lemma-dependency");
+ modelFeatures_.insert("factors-combine");
+ modelFeatures_.insert("factors-dim-emb");
}
virtual Ptr<Options> getOptions() override { return options_; }
diff --git a/src/models/s2s.h b/src/models/s2s.h
index 7009fad5..104f946c 100644
--- a/src/models/s2s.h
+++ b/src/models/s2s.h
@@ -318,7 +318,9 @@ public:
}
last("vocab", opt<std::vector<std::string>>("vocabs")[batchIndex_]); // for factored outputs
last("lemma-dim-emb", opt<int>("lemma-dim-emb", 0)); // for factored outputs
-
+ last("lemma-dependency", opt<std::string>("lemma-dependency", "")); // for factored outputs
+ last("factors-combine", opt<std::string>("factors-combine", "")); // for factored outputs
+
last("output-omit-bias", opt<bool>("output-omit-bias", false));
// assemble layers into MLP and apply to embeddings, decoder context and
diff --git a/src/models/transformer.h b/src/models/transformer.h
index a792de8b..7ec40dc5 100644
--- a/src/models/transformer.h
+++ b/src/models/transformer.h
@@ -295,7 +295,8 @@ public:
kh = cache_[prefix + "_keys"]; // then return cached tensor
}
else {
- auto Wk = graph_->param(prefix + "_Wk", {dimModel, dimModel}, inits::glorotUniform(true, true, depthScaling_ ? 1.f / sqrtf((float)depth_) : 1.f));
+ int dimKeys = keys->shape()[-1]; // different than dimModel when using lemma and factors combined with concatenation
+ auto Wk = graph_->param(prefix + "_Wk", {dimKeys, dimModel}, inits::glorotUniform(true, true, depthScaling_ ? 1.f / sqrtf((float)depth_) : 1.f));
auto bk = graph_->param(prefix + "_bk", {1, dimModel}, inits::zeros());
kh = affine(keys, Wk, bk); // [-4: beam depth, -3: batch size, -2: max length, -1: vector dim]
@@ -309,7 +310,8 @@ public:
&& cache_[prefix + "_values"]->shape().elements() == values->shape().elements()) {
vh = cache_[prefix + "_values"];
} else {
- auto Wv = graph_->param(prefix + "_Wv", {dimModel, dimModel}, inits::glorotUniform(true, true, depthScaling_ ? 1.f / sqrtf((float)depth_) : 1.f));
+ int dimValues = values->shape()[-1]; // different than dimModel when using lemma and factors combined with concatenation
+ auto Wv = graph_->param(prefix + "_Wv", {dimValues, dimModel}, inits::glorotUniform(true, true, depthScaling_ ? 1.f / sqrtf((float)depth_) : 1.f));
auto bv = graph_->param(prefix + "_bv", {1, dimModel}, inits::zeros());
vh = affine(values, Wv, bv); // [-4: batch size, -3: num heads, -2: max length, -1: split vector dim]
@@ -661,7 +663,9 @@ private:
"vocab", opt<std::vector<std::string>>("vocabs")[batchIndex_], // for factored outputs
"output-omit-bias", opt<bool>("output-omit-bias", false),
"output-approx-knn", opt<std::vector<int>>("output-approx-knn", {}),
- "lemma-dim-emb", opt<int>("lemma-dim-emb", 0)); // for factored outputs
+ "lemma-dim-emb", opt<int>("lemma-dim-emb", 0), // for factored outputs
+ "lemma-dependency", opt<std::string>("lemma-dependency", ""), // for factored outputs
+ "factors-combine", opt<std::string>("factors-combine", "")); // for factored outputs
if(opt<bool>("tied-embeddings") || opt<bool>("tied-embeddings-all"))
outputFactory.tieTransposed(opt<bool>("tied-embeddings-all") || opt<bool>("tied-embeddings-src") ? "Wemb" : prefix_ + "_Wemb");