github.com/moses-smt/mosesdecoder.git
path: root/report
author    mfederico <mfederico@1f5c12ca-751b-0410-a591-d2e778427230>  2006-10-25 18:29:41 +0400
committer mfederico <mfederico@1f5c12ca-751b-0410-a591-d2e778427230>  2006-10-25 18:29:41 +0400
commit    3c9e76d1ded5c5a2a8ee4b93a026f0055f22c3ce (patch)
tree      0172e79addc33f23577b9fc130b591221476793e /report
parent    0c49f2ccbaaa4554e4a582f0d4d78bedd03a4156 (diff)
Added DRAFT of confusion network decoding
and efficient LM handling. git-svn-id: https://mosesdecoder.svn.sourceforge.net/svnroot/mosesdecoder/trunk@928 1f5c12ca-751b-0410-a591-d2e778427230
Diffstat (limited to 'report')
-rwxr-xr-x  report/report.tex  385
1 file changed, 383 insertions(+), 2 deletions(-)
diff --git a/report/report.tex b/report/report.tex
index ccbdcbaf3..8703bce55 100755
--- a/report/report.tex
+++ b/report/report.tex
@@ -342,7 +342,295 @@ the other parameters used in decoding.
\chapter{Confusion Network Decoding}
-{\sc Marcello Federico and Richard Zens: cut and paste from your journal paper?}
+%{\sc Marcello Federico and Richard Zens: cut and paste from your journal paper?}
+
+% Definitions.
+% --------------------
+\def\x{{\mathbf x}}
+\def\L{{\cal L}}
+\def\o{{\mathbf o}}   % bold o for the acoustic observation vector
+\newcommand{\devfour}{{\tt dev4}}
+\newcommand{\eval}{{\tt eval}}
+\newcommand{\E}{{\bf e}}
+\newcommand{\F}{{\bf f}}
+\newcommand{\bm}[1]{{\mbox{\boldmath${#1}$}}}
+\newcommand{\tf}{\tilde{f}}
+\newcommand{\te}{\tilde{e}}
+\newcommand{\teb}{{\bf \te}}
+
+
+Machine translation input currently takes the form of simple sequences of words.
+However, there is an increasing demand to integrate machine translation technology
+into larger information processing systems with upstream NLP/speech processing tools
+(such as named entity recognizers, speech recognizers, morphological analyzers, etc.).
+These upstream processes tend to generate multiple, erroneous hypotheses with
+varying confidence. Current MT systems are designed to process only one input
+hypothesis, making them vulnerable to errors in the input.
+
+\noindent
+Our focus in the workshop was on the speech translation case, where the input is
+generated by a speech recognizer. More specifically, our goal was to improve performance
+of spoken language translation by better integrating speech recognition and
+machine translation models. Translation from speech input is considered more difficult than
+translation from text for several reasons. Spoken language has many styles and
+genres, such as formal read speech, unplanned speeches, interviews, and spontaneous
+conversations; it is less controlled, presenting more relaxed syntax
+and spontaneous speech phenomena. Finally, translation of spoken language is prone
+to speech recognition errors, which can corrupt the syntax and the meaning of
+the input.
+
+The impact of speech recognition errors on translation performance
+is illustrated by the plot in Figure~\ref{fig:wer-bleu}, which shows BLEU scores of
+translations together with the WER of the corresponding source sentences. There is
+also empirical evidence that better translations can sometimes be obtained from
+recognizer transcriptions that were actually assigned lower recognition scores. This
+suggests that improvements can be achieved by applying machine translation to
+a large set of transcription hypotheses generated by the speech recognizer and by
+combining the scores of acoustic, language, and translation models.
+
+
+
+Recently, approaches have been
+proposed for improving translation quality through the processing of
+multiple input hypotheses. In particular, better translation performance
+has been reported by exploiting $N$-best lists \cite{zhang04,quan05}, word
+lattices \cite{matusov05,mathias06}, and confusion networks \cite{bertoldi05a}.
+
+\noindent
+In the workshop we concentrated on confusion network decoding as discussed in
+\cite{bertoldi05a}, and developed a simpler translation model and a more efficient
+implementation of the search algorithm. Remarkably, the confusion network decoder
+turned out to be a natural extension of the standard text decoder.
+
+\section{Spoken language translation}
+\label{sec:slt}
+\noindent
+
+\begin{figure}[t]
+\begin{center}
+\includegraphics[width=10cm]{marcello-bleu-wer}
+\caption{Relationship between the BLEU score of target strings and the word error rate (WER) of source strings.
+Source strings are transcriptions of parliamentary speeches produced by a speech recognition system (to be checked).}
+\label{fig:wer-bleu}
+\end{center}
+\end{figure}
+
+
+
+From a statistical perspective, spoken language translation (SLT) can be approached as follows.
+Given the vector $\o$ representing the acoustic observations of the
+input utterance, let ${\cal F}(\o)$ be a set of
+transcription hypotheses computed by a speech recognizer and
+represented as a word-graph. The best translation $\E^*$ is
+searched for among all strings in the target language ${\cal E}$ through
+the following criterion:
+%\begin{eqnarray}
+\begin{equation}
+%\e^* &=& \arg\max_{\E} \sum_{\f \in {\cal F}(\o)} \Pr(\E, \f \mid \o)
+\E^* = \arg\max_{\E} \sum_{\F \in {\cal F}(\o)} \Pr(\E, \F \mid \o)
+\label{eq:slt-max}
+%\end{eqnarray}
+\end{equation}
+where the source language sentence $\F$ is a hidden variable representing any speech transcription
+hypothesis. According to the well-established log-linear framework,
+the conditional distribution $\Pr(\E,\F \mid \o)$ can be determined through suitable
+real-valued feature functions $h_r(\E,\F,\o)$ and real-valued parameters
+$\lambda_r$, $r=1\ldots R$, and takes the parametric form:
+\begin{equation}
+p_{\tiny \bm{ \lambda}}(\E ,\F \mid \o) = \frac{1}{{\cal Z}(\o)} \exp\left\{ \sum_{r=1}^R \lambda_r h_r(\E,\F,\o)\right\}
+\label{eq:loglin}
+\end{equation}
+where ${\cal Z}(\o)$ is a normalization term.
+
+\noindent
+The main advantage of the log-linear model defined in (\ref{eq:loglin}) is
+the possibility of using any kind of feature regarded as important for
+translation. Currently, better performance is achieved by defining
+features in terms of {\em phrases} $\te$ \cite{zens02,koehn03,federico05a} instead of single words,
+and by searching for the best translation $\teb^*$ among all strings of phrases
+over a defined vocabulary of phrases.
+
+\noindent
+The kind of representation used for the set of hypotheses ${\cal F}(\o)$ clearly
+impacts the implementation of the search algorithm. Here, we assume that
+all hypotheses are represented as a confusion network.
+
+\section{Confusion Networks}
+\label{sec:cn}
+\noindent
+A Confusion Network (CN) ${\cal G}$ is a weighted directed graph with
+a start node, an end node, and word labels over its edges. The CN has
+the peculiarity that each path from the start node to the end node
+goes through all the other nodes. As shown in
+Figure~\ref{fig:CN-matrix}, a CN can be represented as a matrix of words
+whose columns have different depths. Each word $w_{j,k}$ in ${\cal
+G}$ is identified by its column $j$ and its position $k$ in the
+column; word $w_{j,k}$ is associated with the weight $p_{j,k}$
+corresponding to the posterior probability $\Pr(f=w_{j,k} \mid \o, j)$
+of having $f=w_{j,k}$ at position $j$ given $\o$. A realization
+$\F=f_1, \ldots, f_m$ of ${\cal G}$ is associated with
+the probability $\Pr(\F\mid \o)$, which is factorized as follows:
+\begin{equation}
+\Pr (\F \mid \o)= \prod_{j=1}^m \Pr(f_j\mid \o, j)
+\label{eq:CN-posterior}
+\end{equation}
+
+\begin{figure}
+\begin{center}
+\begin{tabular}{c|c|c|c|c}
+se$_{.97}$ & presenta$_{.40}$ & $\epsilon_{.78}$& esas$_{.86}$& elecciones$_{.97}$\\
+he$_{.03}$ & present\'o$_{.22}$ & a$_{.08}$ & $\epsilon_{.10}$& selecciones$_{.03}$\\
+ & presentan$_{.06}$ & e$_{.07}$ & esa$_{.04}$& \\
+ & $\ldots$ & en$_{.06}$ & & \\
+ & & $\ldots$ & &\\
+\end{tabular}
+\end{center}
+\caption{Example of confusion network.}
+\label{fig:CN-matrix}
+\end{figure}
+
+
+\noindent
+The generation of a CN from an ASR word-graph \cite{mangu00} can
+also produce special empty-words $\epsilon$ in some columns. These
+empty-words make it possible to generate source sentences of different lengths and
+are treated differently from regular words only at the level of the feature functions.
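+
+\noindent
+As an illustration, the following Python sketch (hypothetical names, not part of the decoder)
+stores a CN as a list of columns and computes the posterior of one realization according to
+formula (\ref{eq:CN-posterior}); the values are taken from the example in Figure~\ref{fig:CN-matrix},
+and the empty-word is treated here as an ordinary token.
+
+\begin{verbatim}
+from math import prod
+
+EPS = "*EPS*"  # marker for the empty-word (epsilon); a hypothetical convention
+
+# Confusion network: one dict per column, word -> posterior Pr(f = w | o, j).
+# Values abridged from the example CN of this chapter (accents dropped).
+cn = [
+    {"se": 0.97, "he": 0.03},
+    {"presenta": 0.40, "presento": 0.22, "presentan": 0.06},
+    {EPS: 0.78, "a": 0.08, "e": 0.07, "en": 0.06},
+    {"esas": 0.86, EPS: 0.10, "esa": 0.04},
+    {"elecciones": 0.97, "selecciones": 0.03},
+]
+
+def realization_posterior(cn, path):
+    """Posterior of a realization F = f_1 ... f_m, factorized column-wise."""
+    assert len(path) == len(cn)
+    return prod(column[word] for column, word in zip(cn, path))
+
+# The top row of the CN:
+print(realization_posterior(cn, ["se", "presenta", EPS, "esas", "elecciones"]))
+# 0.97 * 0.40 * 0.78 * 0.86 * 0.97 = 0.2525 (approximately)
+\end{verbatim}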
+
+\subsection{Generative translation process}
+\label{sec:CN-gen}
+\noindent
+The following process describes how to incrementally generate a translation
+from a confusion network ${\cal G}$:
+\bigskip
+
+\noindent
+\begin{tt}
+While there are uncovered source columns\\
+\indent A span of uncovered columns is covered\\
+\indent A path inside the span is chosen\\
+\indent A translation of the path is appended to the target\\
+\end{tt}
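+
+\noindent
+A minimal sketch of this generative process is the following (Python; the
+{\tt pick\_*} callbacks are hypothetical stand-ins for the decisions scored by the
+log-linear model during search):
+
+\begin{verbatim}
+def generate_translation(cn, pick_span, pick_path, pick_translation):
+    """Sketch of the generative process: repeatedly cover a span of
+    uncovered columns, choose a path (one word per column) inside it,
+    and append a translation of that path to the target string."""
+    uncovered = set(range(len(cn)))   # column indices not yet covered
+    target = []                       # growing target string (list of words)
+    while uncovered:
+        span = pick_span(uncovered)            # contiguous uncovered columns
+        path = pick_path(cn, span)             # one word per column of the span
+        target.extend(pick_translation(path))  # target phrase for that path
+        uncovered -= set(span)
+    return " ".join(target)
+\end{verbatim}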
+
+
+\subsection{CN-based log-linear model}
+\label{sec:cn-model}
+\noindent
+The log-linear model adopted for the CN decoder includes the following feature
+functions:
+\begin{enumerate}
+\item A word-based $n$-gram target LM.
+\item A reordering model defined in terms of the distance between the first column covered
+by the current span and the last column of the previous span. (In the current implementation, we
+did not distinguish between regular and empty words.)
+\item Four phrase-based lexicon models exploiting statistics at word- and phrase-level. These models
+remove any empty-word in the source side.
+\item Phrase and word penalty models, i.e. counts of the number of phrases and words in the target string.
+\item The CN posterior probability, see formula (\ref{eq:CN-posterior}).
+\end{enumerate}
+
+\noindent
+Notice that the above features can be grouped into two categories: those which are expansion-dependent, because their computation requires some knowledge about the previous step (i, ii), and those which are not (iii, iv, v).
+
+\subsection{Decoding algorithm}
+\label{sec:decoding}
+\noindent
+According to the {\em dynamic programming} paradigm, the optimal
+solution can be computed through expansions and recombinations of
+previously computed partial theories. With respect to translating a
+single input hypothesis, translating from a CN requires,
+in principle, exploring all possible input paths inside the graph.
+A key insight is that, due to the linear structure of CNs, CN
+decoding is very similar to text decoding. During
+decoding, we have to look up the translation options of spans, i.e.,
+contiguous sequences of source positions. The main
+difference between CN and text decoding is that in text
+decoding there is exactly one source phrase per span, whereas in
+confusion network decoding there can be multiple source phrases per
+span. In fact, in a CN the number of source phrases
+per span is exponential in the span length, assuming its minimum depth is larger than one.
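+For instance, a span covering four columns, each of depth three, already contains
+$3^4=81$ distinct paths, and hence up to $81$ candidate source phrases.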
+
+\noindent
+The decoding algorithm can be made much more efficient by
+pre-fetching translations for all the spans and by applying early recombination.
+
+\subsection{Early recombination}
+\label{sec:early-rec}
+\noindent
+At each expansion step a span covering a given number of consecutive columns is generated.
+Due to the presence of empty-words, different paths within the span can
+generate the same source phrase, hence the same translations. The scores of such
+paths only impact the CN posterior feature (v). Additionally, it might happen that two
+different source phrases of the same span have a common translation. In this case, not only the CN
+posterior feature but also the phrase translation features (iii) differ.
+This suggests that efficiency can be gained by pre-computing all possible alternative
+translations for all possible spans, together with their expansion-independent scores,
+and by recombining these translations in advance.
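+
+\noindent
+The following Python sketch illustrates the idea ({\tt enumerate\_paths},
+{\tt phrase\_table} and {\tt expansion\_independent\_score} are hypothetical stand-ins
+for the actual data structures and feature functions); entries with the same target
+phrase are recombined, here by keeping only the highest-scoring one:
+
+\begin{verbatim}
+def precompute_options(cn, span, enumerate_paths, phrase_table,
+                       expansion_independent_score):
+    """Pre-compute recombined translation options for one span.
+    enumerate_paths(cn, span) yields (source_phrase, path_posterior) pairs,
+    with empty-words already removed from source_phrase;
+    phrase_table[source_phrase] is a list of candidate target phrases;
+    expansion_independent_score(...) combines features (iii), (iv), (v)."""
+    best = {}  # target phrase -> (score, source phrase)
+    for source, posterior in enumerate_paths(cn, span):
+        for target in phrase_table.get(source, []):
+            score = expansion_independent_score(source, target, posterior)
+            # early recombination: keep only the best-scoring entry per
+            # distinct target phrase (one possible recombination rule)
+            if target not in best or score > best[target][0]:
+                best[target] = (score, source)
+    return best
+\end{verbatim}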
+
+\subsection{Pre-fetching of translation options}
+\label{sec:pre-fetching}
+\noindent
+Concerning the pre-fetching of translations from the phrase-table, an
+efficient implementation can be achieved if we use a prefix
+tree representation for the source phrases in the phrase table and
+generate the translation options incrementally over the span length.
+So, when looking up a span $(j_1,j_2)$, we can exploit our knowledge
+about the span $(j_1,j_2-1)$: we only have to check, for the
+known prefixes of $(j_1,j_2-1)$, whether there exists a successor prefix
+extended by a word in column $j_2$ of the CN. If all the word
+sequences in the CN also occur in the phrase table,
+this approach still enumerates an exponential number of phrases, so
+the worst-case complexity is still exponential in the span length.
+Nevertheless, this is unlikely to happen in practice: in our
+experiments we did not observe exponential behavior, but rather
+a constant overhead compared to text input.
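+
+\noindent
+A minimal sketch of this incremental pre-fetching step, assuming (hypothetically) that the
+source side of the phrase table is stored as a prefix tree of nested Python dictionaries:
+
+\begin{verbatim}
+def extend_prefixes(prefixes, column):
+    """Extend the known prefixes of span (j1, j2-1) by one column j2.
+    'prefixes' maps a source word sequence (a tuple) to its node in the
+    prefix tree; a node is a dict with one child per possible next word.
+    Only successors matching a word in column j2 are kept, so the phrase
+    table is never scanned beyond what the CN licenses.
+    (In the real decoder the empty-word is skipped rather than matched;
+    that detail is omitted here.)"""
+    extended = {}
+    for prefix, node in prefixes.items():
+        for word in column:              # words listed in column j2 of the CN
+            child = node.get(word)       # follow the prefix-tree edge, if any
+            if child is not None:
+                extended[prefix + (word,)] = child
+    return extended
+
+# Usage sketch: start from the tree root and grow the span column by column.
+# prefixes = {(): phrase_table_prefix_tree}
+# for j in range(j1, j2 + 1):
+#     prefixes = extend_prefixes(prefixes, cn[j])
+\end{verbatim}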
+
+
+
+\section{$N$-best decoder}
+
+A widely used reference for spoken language translation is
+the so-called $N$-best translation approach, which we used as a baseline
+for comparison in our experiments.
+
+\bigskip
+\label{sec:1best-smt}
+\noindent
+An alternative way to define the set ${\cal F}(\o)$ is to take the $N$
+most probable hypotheses computed by the ASR system, i.e. ${\cal
+F}(\o)=\{\F_1, \ldots,\F_N\}$. By taking a maximum approximation over
+${\cal F}(\o)$, and assuming that $\Pr(\teb, \F \mid \o) = \Pr(\F\mid
+\o) \Pr(\teb \mid \F)$, we get the search criterion:
+\begin{eqnarray}
+\label{eq:nbest-max-1}
+\teb^* &\approx& \arg\max_{n=1,..,N} \Pr(\F_n \mid \o) \max_{\teb} \Pr(\teb \mid \F_n)
+\label{eq:nbest-max}
+\end{eqnarray}
+
+\noindent
+In the equation above we can isolate $N$ independent translation tasks
+(rightmost maximization), and the recombination of their results
+(leftmost maximization). Hence, the search criterion can be restated
+as:
+\begin{eqnarray}
+\teb_n^* &=& \arg\max_{\teb} \Pr(\teb \mid \F_n) \qquad n=1,\ldots, N \\
+\teb^* &\approx& \arg\max_{n=1,..,N} \Pr(\F_n \mid \o) \Pr(\teb_n^* \mid \F_n)
+\label{eq:nbest-max-alg}
+\end{eqnarray}
+
+\noindent
+In plain words: first, the best translation $\teb_n^*$ of each
+transcription hypothesis $\F_n$ is computed; then, the best
+translation $\teb^*$ is selected among $\{\teb_1^*, \ldots,
+\teb_N^*\}$ according to its score weighted by the ASR posterior
+probability $\Pr(\F_n \mid \o)$.
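+
+\noindent
+The two-step procedure above can be sketched as follows (Python;
+{\tt translate\_best} and {\tt asr\_posterior} are hypothetical stand-ins for the
+Moses decoder and for the ASR scores):
+
+\begin{verbatim}
+def nbest_translation(hypotheses, translate_best, asr_posterior):
+    """N-best decoding sketch: translate each ASR hypothesis F_n
+    independently, then pick the translation whose score, weighted by
+    the ASR posterior Pr(F_n | o), is highest.
+    translate_best(F) -> (best target string, Pr(e | F));
+    asr_posterior(F)  -> Pr(F | o)."""
+    best_e, best_score = None, float("-inf")
+    for f_n in hypotheses:
+        e_n, p_e_given_f = translate_best(f_n)    # inner maximization
+        score = asr_posterior(f_n) * p_e_given_f  # outer recombination
+        if score > best_score:
+            best_e, best_score = e_n, score
+    return best_e
+\end{verbatim}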
+
+\noindent
+In the experimental chapters we will compare the
+performance of the CN decoder against the above $N$-best approach, which,
+besides the standard text translation model of Moses, requires a second re-scoring pass.
+The resulting log-linear model for the $N$-best decoder
+is very similar to that of the CN decoder. Specifically, feature (v) is replaced %%% h_cn is never used before
+with two features corresponding to the acoustic model and language model log-probabilities provided by the ASR system.
+
+
\chapter{Open Source Toolkit}
\section{Overall design}
@@ -853,8 +1141,101 @@ The {\tt Scorer} employed in the workshop computes BLEU score (version xxx).
The time spent for each iteration of the outer and inner loops is basically proportional to the size of the input and the amount of translation hypotheses, respectively.
\section{Efficient Language Model Handling}
-{\sc Marcello Federico}
+%{\sc Marcello Federico}
+
+In this section we review the main concepts related to the handling of language models (LMs)
+at run time, that is, during the decoding phase. Concerning the estimation of language
+models, the reader can refer to the documentation of publicly available tools, such as
+the SRI LM Toolkit and the CMU-Cambridge SLM toolkit. These toolkits, in general, provide many
+methods for estimating n-gram probabilities and are capable of generating a static
+representation of an n-gram LM, as a text or binary file, that can be used within other programs.
+
+The motivation for developing software to handle LMs at run time is that efficiency, both
+in time and space, can be gained by exploiting peculiarities of the way they are used by the
+hosting program, namely the decoder.
+
+The need for efficient LM handling comes from the intrinsic data-sparseness of language corpora.
+Empirically, for a given n-gram order, the set of observed
+n-grams grows almost linearly with the size of the training
+data. Hence, the trend of using larger and larger training corpora calls for careful memory
+usage.
+
+In general, to efficiently store observations and probabilities in
+computer memory, the following approaches can be taken: pruning rare or unreliable observations, designing compact data structures, and applying data compression. While observation pruning is typically embedded in the tools that estimate the LMs, we focused
+on the other two aspects, which are more closely related to the way the LM is used by a specific program.
+
+In the following we discuss some of the features of our implementation: the representation of
+n-gram LMs, the quantization of probabilities, and the use of cache memories.
+
+\subsection{LM representation}
+In our implementation, n-grams are stored in a data structure which privileges memory
+savings over access time. In particular, the single components of each n-gram are accessed
+via binary search and stored in 3 bytes, while probabilities and back-off weights are stored
+in 4 bytes, namely as floating point numbers. Further memory savings are obtained by
+quantizing both back-off weights and probabilities.
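+
+As a rough illustration of this trade-off (a sketch, not the actual implementation), the code
+below packs the successor list of one context as a sorted byte array of 3-byte word indices and
+4-byte floating point scores, with look-up by binary search; the field sizes follow the
+description above, everything else is hypothetical.
+
+\begin{verbatim}
+import struct
+
+WORD_BYTES = 3
+REC_BYTES = WORD_BYTES + 8     # 3-byte word index + two 4-byte floats
+
+class SuccessorList:
+    """Successor words of one (n-1)-gram context, stored as a packed,
+    sorted byte array: 3-byte word index, 4-byte probability and
+    4-byte back-off weight per record (a sketch, not the real layout)."""
+
+    def __init__(self, entries):
+        # entries: iterable of (word_id, prob, backoff), word_id < 2**24
+        self.data = b"".join(
+            w.to_bytes(WORD_BYTES, "little") + struct.pack("<ff", p, b)
+            for w, p, b in sorted(entries))
+
+    def _word_at(self, i):
+        off = i * REC_BYTES
+        return int.from_bytes(self.data[off:off + WORD_BYTES], "little")
+
+    def lookup(self, word_id):
+        """Binary search on the 3-byte word indices."""
+        lo, hi = 0, len(self.data) // REC_BYTES
+        while lo < hi:
+            mid = (lo + hi) // 2
+            w = self._word_at(mid)
+            if w < word_id:
+                lo = mid + 1
+            elif w > word_id:
+                hi = mid
+            else:
+                off = mid * REC_BYTES + WORD_BYTES
+                return struct.unpack("<ff", self.data[off:off + 8])
+        return None                      # (prob, back-off) or None
+\end{verbatim}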
+
+\begin{figure}
+\begin{center}
+\includegraphics[width=\columnwidth]{marcello-lmstruct}
+\caption{Data structure for storing n-gram language models.}
+\label{fig:LM-struct}
+\end{center}
+\end{figure}
+
+\subsection{Probability quantization}
+
+Quantization provides an effective way of reducing the number of bits
+needed to store floating point variables. The quantization process
+consists of partitioning the real space into a finite set of $k$ {\em
+quantization levels} and identifying a center $c_i$ for each level,
+$i=1,\ldots, k$. A function $q(x)$ maps any real-valued point $x$ onto
+its unique center $c_i$. The cost of quantization is the approximation
+error between $x$ and $c_i$.
+
+Previous experiments (cite ) suggested applying the so-called
+binning method. The binning method partitions data points into uniformly
+populated intervals or {\em bins}. The center of each bin corresponds
+to the mean value of all points falling into it. If $N_i$ is the
+number of points in the $i$-th bin, and $x_i$ the smallest point in
+the $i$-th bin, a partition $[x_i,x_{i+1})$ results such that $N_i$ is
+constant for each $i=1,\ldots, k$, where $x_{k+1}=1$ by default.
+The following map is thus defined:
+$$q(x)=c_i \mbox{ if } x_i \leq x < x_{i+1}.$$
+
+In particular, our implementation uses the following {\em greedy} strategy: bins are
+built by uniformly partitioning all distinct points of the data set. Finally, quantization is
+applied separately at each n-gram level and for each kind of score, that is, probabilities
+and back-off weights. The quantization level is set to 8 bits, which was experimentally shown
+to cause no measurable loss in performance.
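+
+A minimal Python sketch of this binning strategy (a simplification of the actual
+implementation, shown with $k=256$ levels, i.e. 8-bit codes):
+
+\begin{verbatim}
+def bin_quantize(values, k=256):
+    """Binning quantization sketch: partition the distinct values into k
+    (nearly) uniformly populated bins and use the mean of each bin as its
+    center.  Returns the codebook of centers and a value -> code map."""
+    points = sorted(set(values))
+    n = len(points)
+    k = min(k, n)
+    centers, code_of = [], {}
+    for i in range(k):
+        lo, hi = i * n // k, (i + 1) * n // k   # uniformly populated bins
+        bin_points = points[lo:hi]
+        centers.append(sum(bin_points) / len(bin_points))  # center = mean
+        for x in bin_points:
+            code_of[x] = i
+    return centers, code_of
+
+# Usage sketch, e.g. on all 3-gram scores of an LM:
+# centers, code_of = bin_quantize(three_gram_scores)
+# q = lambda x: centers[code_of[x]]        # the map q(x) = c_i
+\end{verbatim}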
+
+Quantization can be applied to any LM represented in the ARPA format. Quantized LMs
+can be converted into a binary format that can be efficiently loaded at decoding time.
+
+
+\subsection{Caching of probabilities}
+
+In order to overcome the limitations of access time, caching is applied. The potential advantage
+of caching during MT decoding is made evident by the plot in Figure~\ref{fig:LM-calls}.
+It shows all the 3-gram probability calls made by the search algorithm during the decoding of
+the following German sentence:
+
+{\em ich bin kein christdemokrat und glaube daher nicht an wunder . doch ich m\"ochte dem europ\"aischen parlament , so wie es gegenw\"urtig beschaffen ist , f\"ur seinen grossen beitrag zu diesen arbeiten danken.}
+
+During decoding of the sentence, about 120,000 different 3-grams are called, for a total
+of about 1.7 million calls. The fact that a relatively small subset of 3-grams is frequently
+accessed within each sentence suggests storing all of them in a cache. Each time a new
+sentence is decoded, the cache is reset and n-gram probabilities are added as soon as they are
+needed. Additional caches are also used to store LM states and all partial n-grams
+searched in the data, in order to limit the number of binary searches performed.
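+
+A minimal sketch of the per-sentence probability cache (a hypothetical interface,
+wrapping an underlying LM with a {\tt prob} method):
+
+\begin{verbatim}
+class CachedLM:
+    """Per-sentence cache of n-gram probabilities: an entry is added the
+    first time an n-gram is requested, and the cache is cleared before
+    each new input sentence (a sketch of the idea, not the Moses API)."""
+
+    def __init__(self, lm):
+        self.lm = lm          # underlying LM, assumed to expose prob(ngram)
+        self.cache = {}
+
+    def reset(self):
+        """Called once per input sentence."""
+        self.cache.clear()
+
+    def prob(self, ngram):
+        ngram = tuple(ngram)
+        p = self.cache.get(ngram)
+        if p is None:                     # first request: do the expensive
+            p = self.lm.prob(ngram)       # look-up (binary searches) once...
+            self.cache[ngram] = p         # ...then serve it from the cache
+        return p
+\end{verbatim}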
+
+\begin{figure}
+\begin{center}
+\includegraphics[width=\columnwidth]{marcello-lmcalls}
+\caption{LM calls during decoding of a sentence by Moses.}
+\label{fig:LM-calls}
+\end{center}
+\end{figure}
\section{Lexicalized Reordering Models}
{\sc Christine Moran}