github.com/moses-smt/mosesdecoder.git
path: root/report
author    mfederico <mfederico@1f5c12ca-751b-0410-a591-d2e778427230>  2006-10-25 18:29:41 +0400
committer mfederico <mfederico@1f5c12ca-751b-0410-a591-d2e778427230>  2006-10-25 18:29:41 +0400
commit    3c9e76d1ded5c5a2a8ee4b93a026f0055f22c3ce (patch)
tree      0172e79addc33f23577b9fc130b591221476793e /report
parent    0c49f2ccbaaa4554e4a582f0d4d78bedd03a4156 (diff)
Added DRAFT of confusion network decoding
and efficient LM handling. git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@928 1f5c12ca-751b-0410-a591-d2e778427230
Diffstat (limited to 'report')
-rwxr-xr-x  report/report.tex  385
1 file changed, 383 insertions(+), 2 deletions(-)
diff --git a/report/report.tex b/report/report.tex
index ccbdcbaf3..8703bce55 100755
--- a/report/report.tex
+++ b/report/report.tex
@@ -342,7 +342,295 @@ the other parameters used in decoding.
\chapter{Confusion Network Decoding}
-{\sc Marcello Federico and Richard Zens: cut and paste from your journal paper?}
+%{\sc Marcello Federico and Richard Zens: cut and paste from your journal paper?}
+
+% Definitions.
+% --------------------
+\def\x{{\mathbf x}}
+\def\L{{\cal L}}
+\def\o{{\mathbf o}}   % bold o for the acoustic observation vector
+\newcommand{\devfour}{{\tt dev4}}
+\newcommand{\eval}{{\tt eval}}
+\newcommand{\E}{{\bf e}}
+\newcommand{\F}{{\bf f}}
+\newcommand{\bm}[1]{{\mbox{\boldmath${#1}$}}}
+\newcommand{\tf}{\tilde{f}}
+\newcommand{\te}{\tilde{e}}
+\newcommand{\teb}{{\bf \te}}
+
+
+Machine translation input currently takes the form of simple sequences of words.
+However, there is an increasing demand to integrate machine translation technology
+into larger information processing systems with upstream NLP/speech processing tools
+(such as named entity recognizers, speech recognizers, morphological analyzers, etc.).
+These upstream processes tend to generate multiple, erroneous hypotheses with
+varying confidence. Current MT systems are designed to process only one input
+hypothesis, making them vulnerable to errors in the input.
+
+\noindent
+Our focus in the workshop was on the speech translation case, where the input is
+generated by a speech recognizer. More specifically, our goal was to improve performance
+of spoken language translation by better integrating speech recognition and
+machine translation models. Translation from speech input is considered more difficult than
+translation from text for several reasons. Spoken language has many styles and
+genres, such as formal read speech, unplanned speeches, interviews, and spontaneous
+conversations; it is less controlled, presenting more relaxed syntax
+and spontaneous speech phenomena. Finally, translation of spoken language is prone
+to speech recognition errors, which can corrupt the syntax and the meaning of
+the input.
+
+The impact of speech recognition errors on translation performance
+is illustrated by the plot in Figure~\ref{fig:wer-bleu}, which shows BLEU scores of
+translations together with the WER of the corresponding source sentences. There is
+also empirical evidence that better translations can sometimes be obtained from
+recognizer transcriptions that were actually assigned lower recognition scores. This
+suggests that improvements can be achieved by applying machine translation to
+a large set of transcription hypotheses generated by the speech recognizer and by
+combining the scores of acoustic, language, and translation models.
+
+
+
+Recently, approaches have been
+proposed for improving translation quality through the processing of
+multiple input hypotheses. In particular, better translation performance
+has been reported by exploiting $N$-best lists \cite{zhang04,quan05}, word
+lattices \cite{matusov05,mathias06}, and confusion networks \cite{bertoldi05a}.
+
+\noindent
+In the workshop we concentrated on confusion network decoding as discussed in
+\cite{bertoldi05a}, and developed a simpler translation model and a more efficient
+implementation of the search algorithm. Remarkably, the confusion network decoder
+turned out to be a natural extension of the standard text decoder.
+
+\section{Spoken language translation}
+\label{sec:slt}
+\noindent
+
+\begin{figure}[t]
+\begin{center}
+\includegraphics[width=10cm]{marcello-bleu-wer}
+\caption{Relationship between the BLEU score of target strings and the word error rate (WER) of source strings.
+Source strings are transcriptions of parliamentary speeches produced by a speech recognition system (to be checked).}
+\label{fig:wer-bleu}
+\end{center}
+\end{figure}
+
+
+
+From a statistical perspective, spoken language translation (SLT) can be approached as follows.
+Given the vector $\o$ representing the acoustic observations of the
+input utterance, let ${\cal F}(\o)$ be a set of
+transcription hypotheses computed by a speech recognizer and
+represented as a word-graph. The best translation $\E^*$ is
+searched for among all strings in the target language ${\cal E}$ through
+the following criterion:
+%\begin{eqnarray}
+\begin{equation}
+%\e^* &=& \arg\max_{\E} \sum_{\f \in {\cal F}(\o)} \Pr(\E, \f \mid \o)
+\E^* = \arg\max_{\E} \sum_{\F \in {\cal F}(\o)} \Pr(\E, \F \mid \o)
+\label{eq:slt-max}
+%\end{eqnarray}
+\end{equation}
+where the source language sentence $\F$ is a hidden variable representing any speech transcription
+hypothesis. According to the well-established log-linear framework,
+the conditional distribution $\Pr(\E,\F \mid \o)$ can be determined through suitable
+real-valued feature functions $h_r(\E,\F,\o)$ and real-valued parameters
+$\lambda_r$, $r=1\ldots R$, and takes the parametric form:
+\begin{equation}
+p_{\tiny \bm{ \lambda}}(\E ,\F \mid \o) = \frac{1}{{\cal Z}(\o)} \exp\left\{ \sum_{r=1}^R \lambda_r h_r(\E,\F,\o)\right\}
+\label{eq:loglin}
+\end{equation}
+where ${\cal Z}(\o)$ is a normalization term.
+
+\noindent
+The main advantage of the log-linear model defined in (\ref{eq:loglin}) is
+the possibility of using any kind of feature regarded as important for
+translation. Currently, better performance is achieved by defining
+features in terms of {\em phrases} $\te$ \cite{zens02,koehn03,federico05a} instead of single words,
+and by searching for the best translation $\teb^*$ among all strings of phrases
+over a defined vocabulary of phrases.
+
+\noindent
+The kind of representation used for the set of hypotheses ${\cal F}(\o)$ clearly
+impacts the implementation of the search algorithm. Here, we assume that
+all hypotheses are represented as a confusion network.
+
+\section{Confusion Networks}
+\label{sec:cn}
+\noindent
+A Confusion Network (CN) ${\cal G}$ is a weighted directed graph with
+a start node, an end node, and word labels over its edges. The CN has
+the peculiarity that each path from the start node to the end node
+goes through all the other nodes. As shown in
+Figure~\ref{fig:CN-matrix}, a CN can be represented as a matrix of words
+whose columns have different depths. Each word $w_{j,k}$ in ${\cal
+G}$ is identified by its column $j$ and its position $k$ in the
+column; word $w_{j,k}$ is associated with the weight $p_{j,k}$
+corresponding to the posterior probability $\Pr(f=w_{j,k} \mid \o, j)$
+of having $f=w_{j,k}$ at position $j$ given $\o$. A realization
+$\F=f_1, \ldots, f_m$ of ${\cal G}$ is associated with
+the probability $\Pr(\F\mid \o)$, which is factorized as follows:
+\begin{equation}
+\Pr (\F \mid \o)= \prod_{j=1}^m \Pr(f_j\mid \o, j)
+\label{eq:CN-posterior}
+\end{equation}
+
+\begin{figure}
+\begin{center}
+\begin{tabular}{c|c|c|c|c}
+se$_{.97}$ & presenta$_{.40}$ & $\epsilon_{.78}$& esas$_{.86}$& elecciones$_{.97}$\\
+he$_{.03}$ & present\'o$_{.22}$ & a$_{.08}$ & $\epsilon_{.10}$& selecciones$_{.03}$\\
+ & presentan$_{.06}$ & e$_{.07}$ & esa$_{.04}$& \\
+ & $\ldots$ & en$_{.06}$ & & \\
+ & & $\ldots$ & &\\
+\end{tabular}
+\end{center}
+\caption{Example of confusion network.}
+\label{fig:CN-matrix}
+\end{figure}
+
+
+\noindent
+The generation of a CN from an ASR word-graph \cite{mangu00} can
+also produce special empty-words $\epsilon$ in some columns. These
+empty-words make it possible to generate source sentences of different lengths and
+are treated differently from regular words only at the level of the feature functions.
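+
+\noindent
+As an illustration, the following Python sketch (hypothetical names, not part of the decoder)
+stores a CN as a list of columns and computes the posterior of one realization according to
+formula (\ref{eq:CN-posterior}); the values are taken from the example in Figure~\ref{fig:CN-matrix},
+and the empty-word is treated here as an ordinary token.
+
+\begin{verbatim}
+from math import prod
+
+EPS = "*EPS*"  # marker for the empty-word (epsilon); a hypothetical convention
+
+# Confusion network: one dict per column, word -> posterior Pr(f = w | o, j).
+# Values abridged from the example CN of this chapter (accents dropped).
+cn = [
+    {"se": 0.97, "he": 0.03},
+    {"presenta": 0.40, "presento": 0.22, "presentan": 0.06},
+    {EPS: 0.78, "a": 0.08, "e": 0.07, "en": 0.06},
+    {"esas": 0.86, EPS: 0.10, "esa": 0.04},
+    {"elecciones": 0.97, "selecciones": 0.03},
+]
+
+def realization_posterior(cn, path):
+    """Posterior of a realization F = f_1 ... f_m, factorized column-wise."""
+    assert len(path) == len(cn)
+    return prod(column[word] for column, word in zip(cn, path))
+
+# The top row of the CN:
+print(realization_posterior(cn, ["se", "presenta", EPS, "esas", "elecciones"]))
+# 0.97 * 0.40 * 0.78 * 0.86 * 0.97 = 0.2525 (approximately)
+\end{verbatim}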
+
+\subsection{Generative translation process}
+\label{sec:CN-gen}
+\noindent
+The following process describes how to incrementally generate a translation
+from a confusion network ${\cal G}$:
+\bigskip
+
+\noindent
+\begin{tt}
+While there are uncovered source columns\\
+\indent A span of uncovered columns is covered\\
+\indent A path inside the span is chosen\\
+\indent A translation of the path is appended to the target\\
+\end{tt}
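+
+\noindent
+A minimal sketch of this generative process is the following (Python; the
+{\tt pick\_*} callbacks are hypothetical stand-ins for the decisions scored by the
+log-linear model during search):
+
+\begin{verbatim}
+def generate_translation(cn, pick_span, pick_path, pick_translation):
+    """Sketch of the generative process: repeatedly cover a span of
+    uncovered columns, choose a path (one word per column) inside it,
+    and append a translation of that path to the target string."""
+    uncovered = set(range(len(cn)))   # column indices not yet covered
+    target = []                       # growing target string (list of words)
+    while uncovered:
+        span = pick_span(uncovered)            # contiguous uncovered columns
+        path = pick_path(cn, span)             # one word per column of the span
+        target.extend(pick_translation(path))  # target phrase for that path
+        uncovered -= set(span)
+    return " ".join(target)
+\end{verbatim}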
+
+
+\subsection{CN-based log-linear model}
+\label{sec:cn-model}
+\noindent
+The log-linear model adopted for the CN decoder includes the following feature
+functions:
+\begin{enumerate}
+\item A word-based $n$-gram target LM.
+\item A reordering model defined in terms of the distance between the first column covered
+by the current span and the last column of the previous span. (In the current implementation, we
+did not distinguish between regular and empty words.)
+\item Four phrase-based lexicon models exploiting statistics at word- and phrase-level. These models
+remove any empty-word in the source side.
+\item Phrase and word penalty models, i.e. counts of the number of phrases and words in the target string.
+\item The CN posterior probability, see formula (\ref{eq:CN-posterior}).
+\end{enumerate}
+
+\noindent
+Notice that the above features can be grouped into two categories: those which are expansion-dependent, because their computation requires some knowledge about the previous step (i, ii), and those which are not (iii, iv, v).
+
+\subsection{Decoding algorithm}
+\label{sec:decoding}
+\noindent
+According to the {\em dynamic programming} paradigm, the optimal
+solution can be computed through expansions and recombinations of
+previously computed partial theories. With respect to translating a
+single input hypothesis, translating from a CN requires,
+in principle, exploring all possible input paths inside the graph.
+A key insight is that, due to the linear structure of CNs, CN
+decoding is very similar to text decoding. During
+decoding, we have to look up the translation options of spans, i.e.,
+contiguous sequences of source positions. The main
+difference between CN and text decoding is that in text
+decoding there is exactly one source phrase per span, whereas in
+confusion network decoding there can be multiple source phrases per
+span. In fact, in a CN the number of source phrases
+per span is exponential in the span length, assuming its minimum depth is larger than one.
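+For instance, a span covering four columns, each of depth three, already contains
+$3^4=81$ distinct paths, and hence up to $81$ candidate source phrases.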
+
+\noindent
+The decoding algorithm can be made much more efficient by
+pre-fetching translations for all the spans and by applying early recombination.
+
+\subsection{Early recombination}
+\label{sec:early-rec}
+\noindent
+At each expansion step a span covering a given number of consecutive columns is generated.
+Due to the presence of empty-words, different paths within the span can
+generate the same source phrase, hence the same translations. The scores of such
+paths only impact the CN posterior feature (v). Additionally, it might happen that two
+different source phrases of the same span have a common translation. In this case, not only the CN
+posterior feature but also the phrase translation features (iii) differ.
+This suggests that efficiency can be gained by pre-computing all possible alternative
+translations for all possible spans, together with their expansion-independent scores,
+and by recombining these translations in advance.
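+
+\noindent
+The following Python sketch illustrates the idea ({\tt enumerate\_paths},
+{\tt phrase\_table} and {\tt expansion\_independent\_score} are hypothetical stand-ins
+for the actual data structures and feature functions); entries with the same target
+phrase are recombined, here by keeping only the highest-scoring one:
+
+\begin{verbatim}
+def precompute_options(cn, span, enumerate_paths, phrase_table,
+                       expansion_independent_score):
+    """Pre-compute recombined translation options for one span.
+    enumerate_paths(cn, span) yields (source_phrase, path_posterior) pairs,
+    with empty-words already removed from source_phrase;
+    phrase_table[source_phrase] is a list of candidate target phrases;
+    expansion_independent_score(...) combines features (iii), (iv), (v)."""
+    best = {}  # target phrase -> (score, source phrase)
+    for source, posterior in enumerate_paths(cn, span):
+        for target in phrase_table.get(source, []):
+            score = expansion_independent_score(source, target, posterior)
+            # early recombination: keep only the best-scoring entry per
+            # distinct target phrase (one possible recombination rule)
+            if target not in best or score > best[target][0]:
+                best[target] = (score, source)
+    return best
+\end{verbatim}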
+
+\subsection{Pre-fetching of translation options}
+\label{sec:pre-fetching}
+\noindent
+Concerning the pre-fetching of translations from the phrase-table, an
+efficient implementation can be achieved if we use a prefix
+tree representation for the source phrases in the phrase table and
+generate the translation options incrementally over the span length.
+So, when looking up a span $(j_1,j_2)$, we can exploit our knowledge
+about the span $(j_1,j_2-1)$: we only have to check, for the
+known prefixes of $(j_1,j_2-1)$, whether there exists a successor prefix
+extended by a word in column $j_2$ of the CN. If all the word
+sequences in the CN also occur in the phrase table,
+this approach still enumerates an exponential number of phrases, so
+the worst-case complexity is still exponential in the span length.
+Nevertheless, this is unlikely to happen in practice: in our
+experiments we did not observe exponential behavior, but rather
+a constant overhead compared to text input.
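+
+\noindent
+A minimal sketch of this incremental pre-fetching step, assuming (hypothetically) that the
+source side of the phrase table is stored as a prefix tree of nested Python dictionaries:
+
+\begin{verbatim}
+def extend_prefixes(prefixes, column):
+    """Extend the known prefixes of span (j1, j2-1) by one column j2.
+    'prefixes' maps a source word sequence (a tuple) to its node in the
+    prefix tree; a node is a dict with one child per possible next word.
+    Only successors matching a word in column j2 are kept, so the phrase
+    table is never scanned beyond what the CN licenses.
+    (In the real decoder the empty-word is skipped rather than matched;
+    that detail is omitted here.)"""
+    extended = {}
+    for prefix, node in prefixes.items():
+        for word in column:              # words listed in column j2 of the CN
+            child = node.get(word)       # follow the prefix-tree edge, if any
+            if child is not None:
+                extended[prefix + (word,)] = child
+    return extended
+
+# Usage sketch: start from the tree root and grow the span column by column.
+# prefixes = {(): phrase_table_prefix_tree}
+# for j in range(j1, j2 + 1):
+#     prefixes = extend_prefixes(prefixes, cn[j])
+\end{verbatim}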
+
+
+
+\section{$N$-best decoder}
+
+A widely used reference for spoken language translation is
+the so-called $N$-best translation approach, which we used as a baseline
+for comparison in our experiments.
+
+\bigskip
+\label{sec:1best-smt}
+\noindent
+An alternative way to define the set ${\cal F}(\o)$ is to take the $N$
+most probable hypotheses computed by the ASR system, i.e. ${\cal
+F}(\o)=\{\F_1, \ldots,\F_N\}$. By taking a maximum approximation over
+${\cal F}(\o)$, and assuming that $\Pr(\teb, \F \mid \o) = \Pr(\F\mid
+\o) \Pr(\teb \mid \F)$, we get the search criterion:
+\begin{eqnarray}
+\label{eq:nbest-max-1}
+\teb^* &\approx& \arg\max_{n=1,..,N} \Pr(\F_n \mid \o) \max_{\teb} \Pr(\teb \mid \F_n)
+\label{eq:nbest-max}
+\end{eqnarray}
+
+\noindent
+In the equation above we can isolate $N$ independent translation tasks
+(rightmost maximization), and the recombination of their results
+(leftmost maximization). Hence, the search criterion can be restated
+as:
+\begin{eqnarray}
+\teb_n^* &=& \arg\max_{\teb} \Pr(\teb \mid \F_n) \qquad n=1,\ldots, N \\
+\teb^* &\approx& \arg\max_{n=1,..,N} \Pr(\F_n \mid \o) \Pr(\teb_n^* \mid \F_n)
+\label{eq:nbest-max-alg}
+\end{eqnarray}
+
+\noindent
+In plain words: first, the best translation $\teb_n^*$ of each
+transcription hypothesis $\F_n$ is computed; then, the best
+translation $\teb^*$ is selected among $\{\teb_1^*, \ldots,
+\teb_N^*\}$ according to its score weighted by the ASR posterior
+probability $\Pr(\F_n \mid \o)$.
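+
+\noindent
+The two-step procedure above can be sketched as follows (Python;
+{\tt translate\_best} and {\tt asr\_posterior} are hypothetical stand-ins for the
+Moses decoder and for the ASR scores):
+
+\begin{verbatim}
+def nbest_translation(hypotheses, translate_best, asr_posterior):
+    """N-best decoding sketch: translate each ASR hypothesis F_n
+    independently, then pick the translation whose score, weighted by
+    the ASR posterior Pr(F_n | o), is highest.
+    translate_best(F) -> (best target string, Pr(e | F));
+    asr_posterior(F)  -> Pr(F | o)."""
+    best_e, best_score = None, float("-inf")
+    for f_n in hypotheses:
+        e_n, p_e_given_f = translate_best(f_n)    # inner maximization
+        score = asr_posterior(f_n) * p_e_given_f  # outer recombination
+        if score > best_score:
+            best_e, best_score = e_n, score
+    return best_e
+\end{verbatim}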
+
+\noindent
+In the experimental chapters we will compare the
+performance of the CN decoder against the above $N$-best approach, which,
+besides the standard text translation model of Moses, requires a second re-scoring pass.
+The resulting log-linear model for the $N$-best decoder
+is very similar to that of the CN decoder. Specifically, feature (v) is replaced %%% h_cn is never used before
+with two features corresponding to the acoustic model and language model log-probabilities provided by the ASR system.
+
+
\chapter{Open Source Toolkit}
\section{Overall design}
@@ -853,8 +1141,101 @@ The {\tt Scorer} employed in the workshop computes BLEU score (version xxx).
The time spent for each iteration of the outer and inner loops is basically proportional to the size of the input and the amount of translation hypotheses, respectively.
\section{Efficient Language Model Handling}
-{\sc Marcello Federico}
+%{\sc Marcello Federico}
+
+In this section we review the main concepts related to the handling of language models (LMs)
+at run time, that is, during the decoding phase. Concerning the estimation of language
+models, the reader can refer to the documentation of publicly available tools, such as
+the SRI LM Toolkit and the CMU-Cambridge SLM toolkit. These toolkits, in general, provide many
+methods for estimating n-gram probabilities and are capable of generating a static
+representation of an n-gram LM, as a text or binary file, that can be used within other programs.
+
+The motivation for developing software to handle LMs at run time is that efficiency, both
+in time and space, can be gained by exploiting peculiarities of the way they are used by the
+hosting program, namely the decoder.
+
+The need for efficient LM handling comes from the intrinsic data-sparseness of language corpora.
+Empirically, for a given n-gram order, the set of observed
+n-grams grows almost linearly with the size of the training
+data. Hence, the trend of using larger and larger training corpora calls for careful memory
+usage.
+
+In general, to efficiently store observations and probabilities in
+computer memory, the following approaches can be taken: pruning rare or unreliable observations, designing compact data structures, and applying data compression. While observation pruning is typically embedded in the tools that estimate the LMs, we focused
+on the other two aspects, which are more closely related to the way the LM is used by a specific program.
+
+In the following we discuss some of the features of our implementation: the representation of
+n-gram LMs, the quantization of probabilities, and the use of cache memories.
+
+\subsection{LM representation}
+In our implementation, n-grams are stored in a data structure which privileges memory
+savings over access time. In particular, the single components of each n-gram are accessed
+via binary search and stored in 3 bytes, while probabilities and back-off weights are stored
+in 4 bytes, namely as floating point numbers. Further memory savings are obtained by
+quantizing both back-off weights and probabilities.
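+
+As a rough illustration of this trade-off (a sketch, not the actual implementation), the code
+below packs the successor list of one context as a sorted byte array of 3-byte word indices and
+4-byte floating point scores, with look-up by binary search; the field sizes follow the
+description above, everything else is hypothetical.
+
+\begin{verbatim}
+import struct
+
+WORD_BYTES = 3
+REC_BYTES = WORD_BYTES + 8     # 3-byte word index + two 4-byte floats
+
+class SuccessorList:
+    """Successor words of one (n-1)-gram context, stored as a packed,
+    sorted byte array: 3-byte word index, 4-byte probability and
+    4-byte back-off weight per record (a sketch, not the real layout)."""
+
+    def __init__(self, entries):
+        # entries: iterable of (word_id, prob, backoff), word_id < 2**24
+        self.data = b"".join(
+            w.to_bytes(WORD_BYTES, "little") + struct.pack("<ff", p, b)
+            for w, p, b in sorted(entries))
+
+    def _word_at(self, i):
+        off = i * REC_BYTES
+        return int.from_bytes(self.data[off:off + WORD_BYTES], "little")
+
+    def lookup(self, word_id):
+        """Binary search on the 3-byte word indices."""
+        lo, hi = 0, len(self.data) // REC_BYTES
+        while lo < hi:
+            mid = (lo + hi) // 2
+            w = self._word_at(mid)
+            if w < word_id:
+                lo = mid + 1
+            elif w > word_id:
+                hi = mid
+            else:
+                off = mid * REC_BYTES + WORD_BYTES
+                return struct.unpack("<ff", self.data[off:off + 8])
+        return None                      # (prob, back-off) or None
+\end{verbatim}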
+
+\begin{figure}
+\begin{center}
+\includegraphics[width=\columnwidth]{marcello-lmstruct}
+\caption{Data structure for storing n-gram language models.}
+\label{fig:LM-struct}
+\end{center}
+\end{figure}
+
+\subsection{Probability quantization}
+
+Quantization provides an effective way of reducing the number of bits
+needed to store floating point variables. The quantization process
+consists of partitioning the real space into a finite set of $k$ {\em
+quantization levels} and identifying a center $c_i$ for each level,
+$i=1,\ldots, k$. A function $q(x)$ maps any real-valued point $x$ onto
+its unique center $c_i$. The cost of quantization is the approximation
+error between $x$ and $c_i$.
+
+Previous experiments (cite ) suggested applying the so-called
+binning method. The binning method partitions data points into uniformly
+populated intervals or {\em bins}. The center of each bin corresponds
+to the mean value of all points falling into it. If $N_i$ is the
+number of points in the $i$-th bin, and $x_i$ the smallest point in
+the $i$-th bin, a partition $[x_i,x_{i+1})$ results such that $N_i$ is
+constant for each $i=1,\ldots, k$, where $x_{k+1}=1$ by default.
+The following map is thus defined:
+$$q(x)=c_i \mbox{ if } x_i \leq x < x_{i+1}.$$
+
+In particular, our implementation uses the following {\em greedy} strategy: bins are
+built by uniformly partitioning all distinct points of the data set. Finally, quantization is
+applied separately at each n-gram level and for each kind of score, that is, probabilities
+and back-off weights. The quantization level is set to 8 bits, which was experimentally shown
+to cause no measurable loss in performance.
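+
+A minimal Python sketch of this binning strategy (a simplification of the actual
+implementation, shown with $k=256$ levels, i.e. 8-bit codes):
+
+\begin{verbatim}
+def bin_quantize(values, k=256):
+    """Binning quantization sketch: partition the distinct values into k
+    (nearly) uniformly populated bins and use the mean of each bin as its
+    center.  Returns the codebook of centers and a value -> code map."""
+    points = sorted(set(values))
+    n = len(points)
+    k = min(k, n)
+    centers, code_of = [], {}
+    for i in range(k):
+        lo, hi = i * n // k, (i + 1) * n // k   # uniformly populated bins
+        bin_points = points[lo:hi]
+        centers.append(sum(bin_points) / len(bin_points))  # center = mean
+        for x in bin_points:
+            code_of[x] = i
+    return centers, code_of
+
+# Usage sketch, e.g. on all 3-gram scores of an LM:
+# centers, code_of = bin_quantize(three_gram_scores)
+# q = lambda x: centers[code_of[x]]        # the map q(x) = c_i
+\end{verbatim}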
+
+Quantization can be applied to any LM represented in the ARPA format. Quantized LMs
+can be converted into a binary format that can be efficiently loaded at decoding time.
+
+
+\subsection{Caching of probabilities}
+
+In order to overcome the limitations of access time, caching is applied. The potential advantage
+of caching during MT decoding is made evident by the plot in Figure~\ref{fig:LM-calls}.
+It shows all the 3-gram probability calls made by the search algorithm during the decoding of
+the following German sentence:
+
+{\em ich bin kein christdemokrat und glaube daher nicht an wunder . doch ich m\"ochte dem europ\"aischen parlament , so wie es gegenw\"urtig beschaffen ist , f\"ur seinen grossen beitrag zu diesen arbeiten danken.}
+
+During decoding of the sentence, about 120,000 different 3-grams are called, for a total
+of about 1.7 million calls. The fact that a relatively small subset of 3-grams is frequently
+accessed within each sentence suggests storing all of them in a cache. Each time a new
+sentence is decoded, the cache is reset and n-gram probabilities are added as soon as they are
+needed. Additional caches are also used to store LM states and all partial n-grams
+searched in the data, in order to limit the number of binary searches performed.
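+
+A minimal sketch of the per-sentence probability cache (a hypothetical interface,
+wrapping an underlying LM with a {\tt prob} method):
+
+\begin{verbatim}
+class CachedLM:
+    """Per-sentence cache of n-gram probabilities: an entry is added the
+    first time an n-gram is requested, and the cache is cleared before
+    each new input sentence (a sketch of the idea, not the Moses API)."""
+
+    def __init__(self, lm):
+        self.lm = lm          # underlying LM, assumed to expose prob(ngram)
+        self.cache = {}
+
+    def reset(self):
+        """Called once per input sentence."""
+        self.cache.clear()
+
+    def prob(self, ngram):
+        ngram = tuple(ngram)
+        p = self.cache.get(ngram)
+        if p is None:                     # first request: do the expensive
+            p = self.lm.prob(ngram)       # look-up (binary searches) once...
+            self.cache[ngram] = p         # ...then serve it from the cache
+        return p
+\end{verbatim}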
+
+\begin{figure}
+\begin{center}
+\includegraphics[width=\columnwidth]{marcello-lmcalls}
+\caption{LM calls during decoding of a sentence by Moses.}
+\label{fig:LM-calls}
+\end{center}
+\end{figure}
\section{Lexicalized Reordering Models}
{\sc Christine Moran}