========================================================================
GIZA++ is an extension of the program GIZA. It is a program for
learning statistical translation models from bitext. It is an
implementation of the models described in (Brown et al., 1993),
(Vogel et al., 1996), (Och et al., 2000a), and (Och et al., 2000b).
========================================================================

CONTENTS of this README file:

Part 0:   What is GIZA++
Part I:   GIZA++ Package Contents
Part II:  How To Compile GIZA++
Part III: How To Run GIZA++
Part IV:  Input File Formats
          A. Vocabulary Files
          B. Bitext Files
          C. Dictionary File (optional)
Part V:   Output File Formats
          A. Probability Tables
             1. T Table (translation table)
             2. N Table (fertility table)
             3. P0 Table
             4. A Table
             5. D3 Table
             6. D4 Table
             7. D5 Table
             8. HMM Table
          B. Alignment File
          C. Perplexity File
          D. Revised Vocabulary Files
          E. Final Parameter File
Part VI:  Literature
Part VII: New Features

HISTORY of this README file:

GIZA++:
edited: 11 Jan. 2000, Franz Josef Och

GIZA:
edited: 16 Aug. 1999, Dan Melamed
edited: 13 Aug. 1999, Yaser Al-Onaizan
edited: 20 July 1999, Yaser Al-Onaizan
edited: 15 July 1999, Yaser Al-Onaizan
edited: 13 July 1999, Noah Smith

========================================================================
Part 0: What is GIZA++

GIZA++ is an extension of the program GIZA, part of the SMT toolkit
EGYPT ( http://www.clsp.jhu.edu/ws99/projects/mt/toolkit/ ), which was
developed by the Statistical Machine Translation team during the 1999
summer workshop at the Center for Language and Speech Processing,
Johns Hopkins University (CLSP/JHU). GIZA++ includes many additional
features. The extensions of GIZA++ were designed and written by
Franz Josef Och.

Features of GIZA++ not in GIZA:

- Implements the full IBM-4 alignment model with a dependence on word
  classes, as described in (Brown et al., 1993)
- Implements IBM-5: dependence on word classes, smoothing, ...
- Implements the HMM alignment model: Baum-Welch training,
  forward-backward algorithm, empty word, dependence on word classes,
  transfer to the fertility models, ...
- Implements a variant of the IBM-3 and IBM-4 models
  (-deficientDistortionModel 1) which allows the p0 parameter to be
  trained
- Smoothing of the fertility and distortion/alignment parameters
- Significantly more efficient training of the fertility models
- Correct implementation of pegging as described in (Brown et al.,
  1993), together with a series of heuristics that make pegging
  sufficiently efficient
- Completely new parameter mechanism that makes it easy to add
  further parameters
- Improved perplexity calculation for the IBM-1, IBM-2, and HMM
  models (the parameter of the Poisson distribution over sentence
  lengths is computed automatically from the training corpus)

========================================================================
Part I: GIZA++ Package Contents

GIZA++:         GIZA++ itself
plain2snt.out:  simple tool to transform plain text into the GIZA
                text format
snt2plain.out:  simple tool to transform the GIZA text format into
                plain text
trainGIZA++.sh: shell script to perform standard training, given a
                corpus in the GIZA text format

========================================================================
Part II: How To Compile GIZA++

In order to compile GIZA++ you may need:
- a recent version of the GNU compiler (2.95 or higher)
- a recent assembler and linker without restrictions on the length of
  symbol names

There is a make file in the src directory that will take care of the
compilation.
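For example, assuming GNU make and g++ are available (the target
names are described next), an optimized build could look like:

    cd src
    make depend
    make GIZA++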
The most important targets are:

GIZA++:      generates an optimized version
GIZA++.dbg:  generates the debug version
depend:      generates the "dependencies" file (run this whenever you
             add source or header files to the package)

========================================================================
Part III: How To Run GIZA++

It is simple:

GIZA++ [config-file] [options]

Every option that expects an argument can also be set in the config
file. For example, the command line

GIZA++ -S S.vcb -T T.vcb -C ST.snt

corresponds to the config file:

S: S.vcb
T: T.vcb
C: ST.snt

If you call GIZA++ without arguments, you get a list of all options.
The option names from GIZA are normally still valid. The default
parameter values are tuned to the corpora I use and typically give
good results; nevertheless, the parameters should be re-optimized for
every new task.

========================================================================
Part IV: Input File Formats

A. VOCABULARY FILES

Each entry is stored on one line as follows:

uniq_id1 string1 no_occurrences1
uniq_id2 string2 no_occurrences2
uniq_id3 string3 no_occurrences3
....

Here is a sample from an English vocabulary file:

627 abandon 10
628 abandoned 17
629 abandoning 2
630 abandonment 12
631 abatement 8
632 abbotsford 2

uniq_ids are sequential positive integers; 0 is reserved for the
special token NULL.

B. BITEXT FILES

Each sentence pair is stored in three lines. The first line is the
number of times this sentence pair occurred. The second line is the
source sentence, where each token is replaced by its unique integer
id from the vocabulary file, and the third line is the target
sentence in the same format. Here is a sample of three sentence pairs
from an English/French corpus:

1
1 1 226 5008 621 6492 226 6377 6813 226 9505 5100 6824 226 5100 5222 0 614 10243 613 2769 155 7989 585
1 578 6503 585 8242 578 8142 8541 578 12328 6595 8550 578 6595 6710 1 1
1
1 226 6260 11856 11806 1293 11
1 1 11 155 14888 2649 11447 9457 8488 4168
1
1 1 226 7652 1 226 5337 226 6940 12089 5582 8076 12050 1
1 155 4140 6812 153 1 154 155 14668 15616 10524 9954 1392

C. DICTIONARY FILE (OPTIONAL)

The dictionary file is a list of word-id pairs, one per line:

F E

where F is the integer id of a target token and E is the integer id
of a source token. F may be listed with several different Es, and
vice versa. Important: the dictionary must be sorted by the F
integers!

If you provide a dictionary and list it in the configuration file,
GIZA++ changes the cooccurrence counting in the first iteration of
Model 1 to honor the so-called "dictionary constraint": in the
parallel sentences "e1 ... en" and "f1 ... fm", the pair (ei, fj) is
counted as a cooccurrence if one of two conditions is met:

1.) ei and fj occur together as an entry in the dictionary, or
2.) ei does not occur in the dictionary with any fk (1 <= k <= m),
    and fj does not occur in the dictionary with any ek (1 <= k <= n).

========================================================================
Part V: Output File Formats

For the file names below, we use the prefix "prob_table". This prefix
can be changed with the -o switch; the default is a combination of
user id and time stamp.

A. PROBABILITY TABLES

Normally, Model 1 is trained first, and the result is used to start
Model 2 training. Then Model 2 is transferred to Model 3, and Model 3
Viterbi training follows.
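For illustration, here is a hedged sketch of a config file that
requests five Model 1 iterations, five HMM iterations, and three
iterations each of Models 3 and 4. The iteration-count option names
m1, mh, m3, and m4 are assumptions based on common GIZA++ usage;
calling GIZA++ without arguments prints the authoritative list:

    S: S.vcb
    T: T.vcb
    C: ST.snt
    m1: 5
    mh: 5
    m3: 3
    m4: 3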
This sequence can be adjusted by such options, either on the command
line or in a config file.

1. T TABLE ( *.t3.* ) (translation table)

prob_table.t1.n  = t table after n iterations of Model 1 training
prob_table.t2.n  = t table after n iterations of Model 2 training
prob_table.t2to3 = t table after transferring Model 2 to Model 3
prob_table.t3.n  = t table after n iterations of Model 3 training
prob_table.t4.n  = t table after n iterations of Model 4 training

Each line is of the following format:

s_id t_id P(t_id | s_id)

where:

s_id:           the unique id of the source token
t_id:           the unique id of the target token
P(t_id | s_id): the probability of translating s_id as t_id

Sample part of a file:

3599 5697 0.0628115
2056 10686 0.000259988
8227 3738 3.57132e-13
5141 13720 5.52332e-12
10798 4102 6.53047e-06
8227 3750 6.97502e-14
7712 14080 6.0365e-20
7712 14082 2.68323e-17
7713 1083 3.94464e-15
7712 14084 2.98768e-15

Similar files with the prefix "prob_table.actual.xxx" will be
generated; they contain the actual tokens instead of their unique
ids. This also holds for the fertility tables. In addition, the
inverse probability table is generated for the final table; it
carries the infix "ti". (A sketch for reading this format appears at
the end of this subsection.)

2. N TABLE ( *.n3.* ) (fertility table)

prob_table.n2to3 = n table estimated during the transfer from Model 2
                   to Model 3
prob_table.n3.X  = n table after X iterations of Model 3

Each line in this file is of the following format:

source_token_id p0 p1 p2 .... pn

where p0 is the probability that the source token has fertility zero,
p1 fertility one, ..., and n is the maximum possible fertility as
defined in the program.

Sample:

1 0.475861 0.282418 0.133455 0.0653083 0.0329326 0.00844979 0.0014008
10 0.249747 0.000107778 0.307767 0.192208 0.0641439 0.15016 0.0358886
11 0.397111 0.390421 0.19925 0.013382 2.21286e-05 0 0
12 0.0163432 0.560621 0.374745 0.00231588 0 0 0
13 1.78045e-07 0.545694 0.299573 0.132127 0.0230494 9.00322e-05 0
14 1.41918e-18 0.332721 0.300773 0.0334969 0 0 0
15 0 5.98626e-10 0.47729 0.0230955 0 0 0
17 0 1.66346e-07 0.895883 0.103948 0 0 0

3. P0 TABLE ( *.p0* )

This file contains a single line with one real number: the value of
P0, the probability of not inserting a NULL token. (1 - P0 is the
probability of inserting a NULL after a source word.)

4. A TABLE ( *.a[23].* )

The file names follow the naming conventions above. Each line has the
format:

i j l m p(i | j, l, m)

where i, j, l, m are all integers and

i = position in the source sentence
j = position in the target sentence
l = length of the source sentence
m = length of the target sentence

and p(i | j, l, m) is the probability that a source word in position
i is moved to position j in a sentence pair of lengths l and m.

Sample:

15 14 15 14 0.630798
15 14 15 15 0.414137
15 14 15 16 0.268919
15 14 15 17 0.23171
15 14 15 18 0.117311
15 14 15 19 0.119202
15 14 15 20 0.111369
15 14 15 21 0.0358169

5. D3 TABLE ( *.d3.* ) (distortion table)

The format is similar to that of the A table, with one difference:
the positions of i and j are switched:

j i l m p(j | i, l, m)

Sample:

15 14 14 15 0.286397
15 14 14 16 0.138898
15 14 14 17 0.109712
15 14 14 18 0.0868322
15 14 14 19 0.0535823

6. D4 TABLE ( *.d4.* )

Distortion table for IBM-4.

7. D5 TABLE ( *.d5.* )

Distortion table for IBM-5.

8. HMM TABLE ( *.hhmm.* )

Alignment probability table for the HMM alignment model.
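As a minimal sketch of how these tables can be consumed (the file
names prob_table.t3.final, S.vcb, and T.vcb are assumptions for this
example), the following Python code reads a t table together with the
two vocabulary files and prints the most probable translation of each
source word:

    from collections import defaultdict

    def read_vcb(path):
        # Vocabulary file: "uniq_id string no_occurrences" per line.
        words = {0: "NULL"}  # id 0 is reserved for the NULL token
        with open(path) as f:
            for line in f:
                uid, word, _count = line.split()
                words[int(uid)] = word
        return words

    def read_ttable(path):
        # T table: "s_id t_id P(t_id | s_id)" per line.
        table = defaultdict(dict)
        with open(path) as f:
            for line in f:
                s_id, t_id, prob = line.split()
                table[int(s_id)][int(t_id)] = float(prob)
        return table

    src = read_vcb("S.vcb")
    trg = read_vcb("T.vcb")
    ttable = read_ttable("prob_table.t3.final")

    # For each source word, print its most probable translation.
    for s_id, row in sorted(ttable.items()):
        t_id, p = max(row.items(), key=lambda kv: kv[1])
        print(src.get(s_id, "?"), trg.get(t_id, "?"), p)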
B. ALIGNMENT FILE ( *.A3.* )

In each iteration of the training, and for each sentence pair in the
training set, the best alignment (Viterbi alignment) is written to
the alignment file (if the dump parameters are set accordingly). The
alignment file is named prob_table.An.i, where n is the model number
(1, 2, 2to3, 3, or 4) and i is the iteration number. The format of
the alignment file is illustrated in the following sample:

# Sentence pair (1)
il s' agit de la même société qui a changé de propriétaires
NULL ({ }) UNK ({ }) UNK ({ }) ( ({ }) this ({ 4 11 }) is ({ }) the ({ }) same ({ 6 }) agency ({ }) which ({ 8 }) has ({ }) undergone ({ 1 2 3 7 9 10 12 }) a ({ }) change ({ 5 }) of ({ }) UNK ({ })

# Sentence pair (2)
UNK UNK , le propriétaire , dit que cela s' est produit si rapidement qu' il n' en connaît pas la cause exacte
NULL ({ 4 }) UNK ({ 1 2 }) UNK ({ }) , ({ 3 }) the ({ }) owner ({ 5 22 23 }) , ({ 6 }) says ({ 7 8 }) it ({ }) happened ({ 10 11 12 }) so ({ 13 }) fast ({ 14 19 }) he ({ 16 }) is ({ }) not ({ 20 }) sure ({ 15 17 }) what ({ }) went ({ 18 21 }) wrong ({ 9 })

Each sentence pair is represented by three lines. The first line is a
label that can be used, e.g., as a caption in alignment visualization
tools; it contains the sequential number of the sentence pair in the
training corpus, the sentence lengths, and the alignment probability.
The second line is the target sentence and the third line is the
source sentence. Each token in the source sentence is followed by a
set of zero or more numbers; these numbers are the positions of the
target words to which this source word is connected, according to the
alignment. (A parsing sketch appears at the end of this part.)

C. PERPLEXITY FILE ( *.perp )

This file is generated at the end of training. It summarizes the
perplexity value of each training iteration; the format is the same
for cross entropy. If no test corpus was provided, the test values
are set to "N/A". Here is a sample perplexity file that illustrates
the format:

# train-size test-size iter. model train-perplexity test-perplexity final(y/n) train-viterbi-perp test-viterbi-perp
447136 9625 0 1 187067 186722 n 3.34328e+06 3.35352e+06
447136 9625 1 1 192.88 248.763 n 909.879 1203.13
447136 9625 2 1 99.45 139.214 n 316.363 459.745
447136 9625 3 1 83.4746 126.046 n 214.612 341.27
447136 9625 4 1 78.6939 124.914 n 179.218 303.169
447136 9625 5 2 76.6848 125.986 n 161.874 286.226
447136 9625 6 2 50.7452 86.2273 n 84.7227 151.701
447136 9625 7 2 42.9178 74.5574 n 63.6644 116.034
447136 9625 8 2 40.0651 70.7444 n 56.3186 104.274
447136 9625 9 2 38.8471 69.4105 n 53.1277 99.6044
447136 9625 10 2to3 38.2561 68.9576 n 51.4856 97.4414
447136 9625 11 3 129.993 248.885 n 86.6675 165.012
447136 9625 12 3 79.2212 169.902 n 86.4842 171.367
447136 9625 13 3 75.0746 164.488 n 84.9647 172.639
447136 9625 14 3 73.412 162.765 n 83.5762 172.797
447136 9625 15 3 72.6107 162.254 y 82.4575 172.688

D. REVISED VOCABULARY FILES ( *.src.vcb, *.trg.vcb )

The revised vocabulary files have the same format as the original
vocabulary files. The only exception is that the frequency of each
token is calculated from the given corpus (i.e., it is exact), which
is not required in the input.

E. FINAL PARAMETER FILE ( *.gizacfg )

This file includes all the parameter settings that were used to
perform this training. Starting GIZA++ with this parameter file
should therefore reproduce the same training.
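As a minimal sketch (the file name prob_table.A3.final is an
assumption for this example), the following Python code parses an
alignment file into per-sentence alignment links; source position 0
is the NULL word, and target positions are 1-based:

    import re

    # One aligned source token looks like: word ({ 3 7 })
    TOKEN_RE = re.compile(r'(\S+) \(\{([\d ]*)\}\)')

    def read_alignments(path):
        # Yield (label, target_tokens, links) per sentence pair, where
        # links maps each (source position, word) to the list of
        # target positions aligned to it.
        with open(path) as f:
            while True:
                label = f.readline()
                if not label:
                    break  # end of file
                target = f.readline().split()
                source = f.readline()
                links = {}
                for pos, (word, ids) in enumerate(TOKEN_RE.findall(source)):
                    links[(pos, word)] = [int(i) for i in ids.split()]
                yield label.strip(), target, links

    for label, target, links in read_alignments("prob_table.A3.final"):
        print(label)
        for (pos, word), trg_positions in links.items():
            aligned = [target[j - 1] for j in trg_positions]
            print("  %d %s -> %s" % (pos, word, " ".join(aligned)))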
========================================================================
Part VI: Literature

The following two articles include a comparison of the alignment
models implemented in GIZA++:

@INPROCEEDINGS{och00:isa,
  AUTHOR    = {F.~J.~Och and H.~Ney},
  TITLE     = {Improved Statistical Alignment Models},
  BOOKTITLE = ACL00,
  PAGES     = {440--447},
  ADDRESS   = {Hong Kong, China},
  MONTH     = {October},
  YEAR      = 2000}

@INPROCEEDINGS{och00:aco,
  AUTHOR    = {F.~J.~Och and H.~Ney},
  TITLE     = {A Comparison of Alignment Models for Statistical
               Machine Translation},
  BOOKTITLE = COLING00,
  PAGES     = {1086--1090},
  ADDRESS   = {Saarbr\"ucken, Germany},
  MONTH     = {August},
  YEAR      = 2000}

The following report describes the statistical machine translation
toolkit EGYPT:

@MISC{alonaizan99:smt,
  AUTHOR  = {Y. Al-Onaizan and J. Curin and M. Jahr and K. Knight and
             J. Lafferty and I. D. Melamed and F. J. Och and D. Purdy
             and N. A. Smith and D. Yarowsky},
  TITLE   = {Statistical Machine Translation, Final Report,
             {JHU} Workshop},
  YEAR    = {1999},
  ADDRESS = {Baltimore, MD},
  NOTE    = {{\tt http://www.clsp.jhu.edu/ws99/projects/mt/final\_report/mt-final-report.ps}}}

The alignment models IBM-1 through IBM-5 and the HMM alignment model
were originally described in:

@ARTICLE{brown93:tmo,
  AUTHOR  = {Brown, P. F. and Della Pietra, S. A. and
             Della Pietra, V. J. and Mercer, R. L.},
  TITLE   = {The Mathematics of Statistical Machine Translation:
             Parameter Estimation},
  JOURNAL = {Computational Linguistics},
  YEAR    = 1993,
  VOLUME  = 19,
  NUMBER  = 2,
  PAGES   = {263--311}}

@INPROCEEDINGS{vogel96:hbw,
  AUTHOR    = {Vogel, S. and Ney, H. and Tillmann, C.},
  TITLE     = {{HMM}-Based Word Alignment in Statistical Translation},
  BOOKTITLE = COLING96,
  PAGES     = {836--841},
  MONTH     = {August},
  ADDRESS   = {Copenhagen, Denmark},
  YEAR      = 1996}

========================================================================
Part VII: New Features

2003-06-09:

- New parameter "-nbestalignments N": prints an N-best list of
  alignments into a file *.NBEST.

- If the program is compiled with "-DBINARY_SEARCH_FOR_TTABLE", it
  uses a more memory-efficient data structure for the t table (a
  vector with binary search instead of a hash table). The program
  then expects a parameter "-CoocurrenceFile FILE", which specifies a
  file listing all lexical cooccurrences in the training corpus. This
  file can be produced with the snt2cooc.out tool, as sketched below.
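As a hedged sketch of that workflow (the argument order of
snt2cooc.out is an assumption here; check the tool's usage message),
producing the cooccurrence file and using it might look like:

    snt2cooc.out S.vcb T.vcb ST.snt > ST.cooc
    GIZA++ -S S.vcb -T T.vcb -C ST.snt -CoocurrenceFile ST.cooc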