
github.com/moses-smt/mosesdecoder.git
author     Rico Sennrich <rico.sennrich@gmx.ch>  2012-01-31 13:50:20 +0400
committer  Rico Sennrich <rico.sennrich@gmx.ch>  2012-01-31 13:50:20 +0400
commit     8ee3ed6d64a564d846b92f05103dcd4eda37c21d (patch)
tree       8c3b4bee16ce3999845a9064ff8eeb6222a1727f /contrib
parent     037af96a6e3eef8fcdae15adc469a6e6e5047977 (diff)
tmcombine (translation model combination)
Diffstat (limited to 'contrib')
-rw-r--r--  contrib/tmcombine/README.md                        |   90
-rw-r--r--  contrib/tmcombine/argparse.py                      | 2382
-rw-r--r--  contrib/tmcombine/test/extract                     |   22
-rw-r--r--  contrib/tmcombine/test/model1/model/lex.counts.e2f |    8
-rw-r--r--  contrib/tmcombine/test/model1/model/lex.counts.f2e |    8
-rw-r--r--  contrib/tmcombine/test/model1/model/lex.e2f        |    8
-rw-r--r--  contrib/tmcombine/test/model1/model/lex.f2e        |    8
-rw-r--r--  contrib/tmcombine/test/model1/model/phrase-table   |    8
-rw-r--r--  contrib/tmcombine/test/model2/model/lex.counts.e2f |    8
-rw-r--r--  contrib/tmcombine/test/model2/model/lex.counts.f2e |    8
-rw-r--r--  contrib/tmcombine/test/model2/model/lex.e2f        |    8
-rw-r--r--  contrib/tmcombine/test/model2/model/lex.f2e        |    8
-rw-r--r--  contrib/tmcombine/test/model2/model/phrase-table   |    5
-rw-r--r--  contrib/tmcombine/test/phrase-table_test1          |    8
-rw-r--r--  contrib/tmcombine/test/phrase-table_test2          |    9
-rw-r--r--  contrib/tmcombine/test/phrase-table_test3          |    9
-rw-r--r--  contrib/tmcombine/test/phrase-table_test4          |    8
-rw-r--r--  contrib/tmcombine/test/phrase-table_test5          |    9
-rw-r--r--  contrib/tmcombine/test/phrase-table_test6          |    4
-rw-r--r--  contrib/tmcombine/test/phrase-table_test7          |    1
-rw-r--r--  contrib/tmcombine/test/phrase-table_test8          |    9
-rwxr-xr-x  contrib/tmcombine/tmcombine.py                     | 1848
-rw-r--r--  contrib/tmcombine/train_model.patch                |   24
23 files changed, 4500 insertions, 0 deletions
diff --git a/contrib/tmcombine/README.md b/contrib/tmcombine/README.md
new file mode 100644
index 000000000..ee4d0cfcf
--- /dev/null
+++ b/contrib/tmcombine/README.md
@@ -0,0 +1,90 @@
+tmcombine - a tool for Moses translation model combination
+
+Author: Rico Sennrich <sennrich [AT] cl.uzh.ch>
+
+ABOUT
+-----
+
+This program handles the combination of Moses phrase tables, either through
+linear interpolation of the phrase translation probabilities/lexical weights,
+or through a recomputation based on the (weighted) combined counts.
+
+It also supports an automatic search for weights that minimize the cross-entropy
+between the model and a tuning set of word/phrase alignments.
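+
+To illustrate the linear interpolation mode: each phrase translation probability
+in the combined table is a weighted sum of the individual models' probabilities.
+A minimal sketch (the phrase pair, probabilities and weights are made up for
+illustration and are not taken from tmcombine.py):
+
+    # p(t|s) from two models, interpolated with weights 0.1 and 0.9
+    p_model1 = {('Haus', 'house'): 0.7}
+    p_model2 = {('Haus', 'house'): 0.4}
+    weights = [0.1, 0.9]
+    pair = ('Haus', 'house')
+    p_combined = weights[0] * p_model1[pair] + weights[1] * p_model2[pair]
+    print(p_combined)  # 0.1*0.7 + 0.9*0.4 = 0.43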
+
+
+REQUIREMENTS
+------------
+
+The script requires Python >= 2.6.
+SciPy is recommended for the weight optimization. If it is missing, an ad-hoc hill-climbing optimizer is used instead; this may be slower, but is actually recommended when running under PyPy and/or when combining a high number of models.
+On Debian-based systems, you can install SciPy from the repository:
+ sudo apt-get install python-scipy
+
+
+USAGE
+-----
+
+For usage information, run
+ ./tmcombine.py -h
+
+Two basic command line examples:
+
+Linearly interpolate two translation models with fixed weights:
+ ./tmcombine.py combine_given_weights test/model1 test/model2 -w "0.1,0.9;0.1,1;0.2,0.8;0.5,0.5" -o test/phrase-table_test2
+
+Do a count-based combination of two translation models, with weights that minimize perplexity on a set of reference phrase pairs:
+ ./tmcombine.py combine_given_tuning_set test/model1 test/model2 -o test/phrase-table_test5 -m counts -r test/extract
+
+You typically have to specify one of the following actions:
+
+ - `combine_given_weights`: write a new phrase table with defined weights
+
+ - `combine_given_tuning_set`: write a new phrase table, using the weights that minimize cross-entropy on a tuning set
+
+ - `compare_cross_entropies`: print cross-entropies for each model/feature, using the intersection of phrase pairs.
+
+ - `compute_cross_entropy`: return cross-entropy for a tuning set, a set of models and a set of weights.
+
+ - `return_best_cross_entropy`: return the set of weights and cross-entropy that is optimal for a tuning set and a set of models.
+
+You can check the docstrings of `Combine_TMs()` for more information and find some example commands in the function `test()`.
+Some configuration options (e.g. normalization of the linear interpolation) are not accessible from the command line.
+You can gain a bit more flexibility by writing or modifying Python code that initializes `Combine_TMs()` with your desired arguments, or by adjusting the default values in the script.
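+
+A hypothetical sketch of such a script (the constructor arguments shown here are
+assumptions for illustration only; check the docstrings of `Combine_TMs()` and
+the function `test()` for the actual interface):
+
+    from tmcombine import Combine_TMs
+
+    # combine two models with fixed weights and write a new phrase table
+    combiner = Combine_TMs(models=['test/model1', 'test/model2'],
+                           weights=[0.1, 0.9],
+                           output_file='test/phrase-table_new',
+                           mode='interpolate')
+    combiner.combine_given_weights()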
+
+Regression tests (check whether the output files `test/phrase-table_testN` differ from the files in the repository):
+ ./tmcombine.py test
+
+FURTHER NOTES
+-------------
+
+ - Different combination algorithms require different statistics. To be on the safe side, apply `train_model.patch` to `train_model.perl` and use the option `-phrase-word-alignment` when training models.
+
+ - The script assumes that phrase tables are sorted (to allow incremental, more memory-friendly processing). Sort the tables with `LC_ALL=C`. Phrase tables produced by Moses are sorted correctly.
+
+ - Some configurations require additional statistics that are loaded in memory (lexical tables; complete list of target phrases).
+ If memory consumption is a problem, use the option `--lowmem` (slightly slower; writes temporary files to disk), or consider pruning your phrase table before combining (e.g. using the method of Johnson et al. 2007).
+
+ - The script assumes that all files are encoded in UTF-8. If this is not the case, fix it or change the `handle_file()` function.
+
+ - The script can read/write gzipped files, but the Python implementation is slow. You're better off unzipping the files on the command line and working with the unzipped files. The script will automatically search for the unzipped file first, and for the gzipped file if the former doesn't exist.
+
+ - The cross-entropy estimation assumes that phrase tables contain true probability distributions (i.e. a probability mass of 1 for each conditional probability distribution). If this is not true, the results may be skewed. A simplified sketch of the computation follows this list.
+
+ - Unknown phrase pairs are not considered for the cross-entropy estimation. A comparison of models with different vocabularies may be misleading.
+
+ - Don't directly compare cross-entropies obtained from a combination with different modes. Depending on how some corner cases are treated, linear interpolation does not distribute the full probability mass and thus shows higher (i.e. worse) cross-entropies.
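+
+As a simplified illustration of the cross-entropy computation referred to in the
+notes above (this is not the exact implementation in tmcombine.py; the tuning
+items, counts and probabilities are invented):
+
+    from math import log
+
+    # hypothetical tuning items: (source phrase, target phrase, count)
+    tuning_pairs = [('Haus', 'house', 3), ('Hund', 'dog', 1)]
+    p_model = {('Haus', 'house'): 0.5, ('Hund', 'dog'): 0.25}
+
+    # cross-entropy in bits: -(1/N) * sum of count * log2 p(t|s)
+    total = sum(c for _, _, c in tuning_pairs)
+    cross_entropy = -sum(c * log(p_model[(s, t)], 2)
+                         for s, t, c in tuning_pairs) / total
+    print(cross_entropy)  # 1.25 bits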
+
+
+REFERENCES
+----------
+
+The algorithms are described in
+
+Sennrich, Rico (2012). Perplexity Minimization for Translation Model Domain Adaptation in Statistical Machine Translation. In: Proceedings of EACL 2012.
+
+The evaluated algorithms are:
+
+ - linear interpolation (naive): default
+ - linear interpolation (modified): use options `--normalized` and `--recompute_lexweights`
+ - weighted counts: use option `-m counts`
diff --git a/contrib/tmcombine/argparse.py b/contrib/tmcombine/argparse.py
new file mode 100644
index 000000000..87d0cef35
--- /dev/null
+++ b/contrib/tmcombine/argparse.py
@@ -0,0 +1,2382 @@
+# Author: Steven J. Bethard <steven.bethard@gmail.com>.
+
+"""Command-line parsing library
+
+This module is an optparse-inspired command-line parsing library that:
+
+ - handles both optional and positional arguments
+ - produces highly informative usage messages
+ - supports parsers that dispatch to sub-parsers
+
+The following is a simple usage example that sums integers from the
+command-line and writes the result to a file::
+
+ parser = argparse.ArgumentParser(
+ description='sum the integers at the command line')
+ parser.add_argument(
+ 'integers', metavar='int', nargs='+', type=int,
+ help='an integer to be summed')
+ parser.add_argument(
+ '--log', default=sys.stdout, type=argparse.FileType('w'),
+ help='the file where the sum should be written')
+ args = parser.parse_args()
+ args.log.write('%s' % sum(args.integers))
+ args.log.close()
+
+The module contains the following public classes:
+
+ - ArgumentParser -- The main entry point for command-line parsing. As the
+ example above shows, the add_argument() method is used to populate
+ the parser with actions for optional and positional arguments. Then
+ the parse_args() method is invoked to convert the args at the
+ command-line into an object with attributes.
+
+ - ArgumentError -- The exception raised by ArgumentParser objects when
+ there are errors with the parser's actions. Errors raised while
+ parsing the command-line are caught by ArgumentParser and emitted
+ as command-line messages.
+
+ - FileType -- A factory for defining types of files to be created. As the
+ example above shows, instances of FileType are typically passed as
+ the type= argument of add_argument() calls.
+
+ - Action -- The base class for parser actions. Typically actions are
+ selected by passing strings like 'store_true' or 'append_const' to
+ the action= argument of add_argument(). However, for greater
+ customization of ArgumentParser actions, subclasses of Action may
+ be defined and passed as the action= argument.
+
+ - HelpFormatter, RawDescriptionHelpFormatter, RawTextHelpFormatter,
+ ArgumentDefaultsHelpFormatter -- Formatter classes which
+ may be passed as the formatter_class= argument to the
+ ArgumentParser constructor. HelpFormatter is the default,
+ RawDescriptionHelpFormatter and RawTextHelpFormatter tell the parser
+ not to change the formatting for help text, and
+ ArgumentDefaultsHelpFormatter adds information about argument defaults
+ to the help.
+
+All other classes in this module are considered implementation details.
+(Also note that HelpFormatter and RawDescriptionHelpFormatter are only
+considered public as object names -- the API of the formatter objects is
+still considered an implementation detail.)
+"""
+
+__version__ = '1.1'
+__all__ = [
+ 'ArgumentParser',
+ 'ArgumentError',
+ 'ArgumentTypeError',
+ 'FileType',
+ 'HelpFormatter',
+ 'ArgumentDefaultsHelpFormatter',
+ 'RawDescriptionHelpFormatter',
+ 'RawTextHelpFormatter',
+ 'MetavarTypeHelpFormatter',
+ 'Namespace',
+ 'Action',
+ 'ONE_OR_MORE',
+ 'OPTIONAL',
+ 'PARSER',
+ 'REMAINDER',
+ 'SUPPRESS',
+ 'ZERO_OR_MORE',
+]
+
+
+import collections as _collections
+import copy as _copy
+import os as _os
+import re as _re
+import sys as _sys
+import textwrap as _textwrap
+
+from gettext import gettext as _, ngettext
+
+
+SUPPRESS = '==SUPPRESS=='
+
+OPTIONAL = '?'
+ZERO_OR_MORE = '*'
+ONE_OR_MORE = '+'
+PARSER = 'A...'
+REMAINDER = '...'
+_UNRECOGNIZED_ARGS_ATTR = '_unrecognized_args'
+
+# =============================
+# Utility functions and classes
+# =============================
+
+class _AttributeHolder(object):
+ """Abstract base class that provides __repr__.
+
+ The __repr__ method returns a string in the format::
+ ClassName(attr=name, attr=name, ...)
+ The attributes are determined either by a class-level attribute,
+ '_kwarg_names', or by inspecting the instance __dict__.
+ """
+
+ def __repr__(self):
+ type_name = type(self).__name__
+ arg_strings = []
+ for arg in self._get_args():
+ arg_strings.append(repr(arg))
+ for name, value in self._get_kwargs():
+ arg_strings.append('%s=%r' % (name, value))
+ return '%s(%s)' % (type_name, ', '.join(arg_strings))
+
+ def _get_kwargs(self):
+ return sorted(self.__dict__.items())
+
+ def _get_args(self):
+ return []
+
+
+def _ensure_value(namespace, name, value):
+ if getattr(namespace, name, None) is None:
+ setattr(namespace, name, value)
+ return getattr(namespace, name)
+
+
+# ===============
+# Formatting Help
+# ===============
+
+class HelpFormatter(object):
+ """Formatter for generating usage messages and argument help strings.
+
+ Only the name of this class is considered a public API. All the methods
+ provided by the class are considered an implementation detail.
+ """
+
+ def __init__(self,
+ prog,
+ indent_increment=2,
+ max_help_position=24,
+ width=None):
+
+ # default setting for width
+ if width is None:
+ try:
+ width = int(_os.environ['COLUMNS'])
+ except (KeyError, ValueError):
+ width = 80
+ width -= 2
+
+ self._prog = prog
+ self._indent_increment = indent_increment
+ self._max_help_position = max_help_position
+ self._width = width
+
+ self._current_indent = 0
+ self._level = 0
+ self._action_max_length = 0
+
+ self._root_section = self._Section(self, None)
+ self._current_section = self._root_section
+
+ self._whitespace_matcher = _re.compile(r'\s+')
+ self._long_break_matcher = _re.compile(r'\n\n\n+')
+
+ # ===============================
+ # Section and indentation methods
+ # ===============================
+ def _indent(self):
+ self._current_indent += self._indent_increment
+ self._level += 1
+
+ def _dedent(self):
+ self._current_indent -= self._indent_increment
+ assert self._current_indent >= 0, 'Indent decreased below 0.'
+ self._level -= 1
+
+ class _Section(object):
+
+ def __init__(self, formatter, parent, heading=None):
+ self.formatter = formatter
+ self.parent = parent
+ self.heading = heading
+ self.items = []
+
+ def format_help(self):
+ # format the indented section
+ if self.parent is not None:
+ self.formatter._indent()
+ join = self.formatter._join_parts
+ for func, args in self.items:
+ func(*args)
+ item_help = join([func(*args) for func, args in self.items])
+ if self.parent is not None:
+ self.formatter._dedent()
+
+ # return nothing if the section was empty
+ if not item_help:
+ return ''
+
+ # add the heading if the section was non-empty
+ if self.heading is not SUPPRESS and self.heading is not None:
+ current_indent = self.formatter._current_indent
+ heading = '%*s%s:\n' % (current_indent, '', self.heading)
+ else:
+ heading = ''
+
+ # join the section-initial newline, the heading and the help
+ return join(['\n', heading, item_help, '\n'])
+
+ def _add_item(self, func, args):
+ self._current_section.items.append((func, args))
+
+ # ========================
+ # Message building methods
+ # ========================
+ def start_section(self, heading):
+ self._indent()
+ section = self._Section(self, self._current_section, heading)
+ self._add_item(section.format_help, [])
+ self._current_section = section
+
+ def end_section(self):
+ self._current_section = self._current_section.parent
+ self._dedent()
+
+ def add_text(self, text):
+ if text is not SUPPRESS and text is not None:
+ self._add_item(self._format_text, [text])
+
+ def add_usage(self, usage, actions, groups, prefix=None):
+ if usage is not SUPPRESS:
+ args = usage, actions, groups, prefix
+ self._add_item(self._format_usage, args)
+
+ def add_argument(self, action):
+ if action.help is not SUPPRESS:
+
+ # find all invocations
+ get_invocation = self._format_action_invocation
+ invocations = [get_invocation(action)]
+ for subaction in self._iter_indented_subactions(action):
+ invocations.append(get_invocation(subaction))
+
+ # update the maximum item length
+ invocation_length = max([len(s) for s in invocations])
+ action_length = invocation_length + self._current_indent
+ self._action_max_length = max(self._action_max_length,
+ action_length)
+
+ # add the item to the list
+ self._add_item(self._format_action, [action])
+
+ def add_arguments(self, actions):
+ for action in actions:
+ self.add_argument(action)
+
+ # =======================
+ # Help-formatting methods
+ # =======================
+ def format_help(self):
+ help = self._root_section.format_help()
+ if help:
+ help = self._long_break_matcher.sub('\n\n', help)
+ help = help.strip('\n') + '\n'
+ return help
+
+ def _join_parts(self, part_strings):
+ return ''.join([part
+ for part in part_strings
+ if part and part is not SUPPRESS])
+
+ def _format_usage(self, usage, actions, groups, prefix):
+ if prefix is None:
+ prefix = _('usage: ')
+
+ # if usage is specified, use that
+ if usage is not None:
+ usage = usage % dict(prog=self._prog)
+
+ # if no optionals or positionals are available, usage is just prog
+ elif usage is None and not actions:
+ usage = '%(prog)s' % dict(prog=self._prog)
+
+ # if optionals and positionals are available, calculate usage
+ elif usage is None:
+ prog = '%(prog)s' % dict(prog=self._prog)
+
+ # split optionals from positionals
+ optionals = []
+ positionals = []
+ for action in actions:
+ if action.option_strings:
+ optionals.append(action)
+ else:
+ positionals.append(action)
+
+ # build full usage string
+ format = self._format_actions_usage
+ action_usage = format(optionals + positionals, groups)
+ usage = ' '.join([s for s in [prog, action_usage] if s])
+
+ # wrap the usage parts if it's too long
+ text_width = self._width - self._current_indent
+ if len(prefix) + len(usage) > text_width:
+
+ # break usage into wrappable parts
+ part_regexp = r'\(.*?\)+|\[.*?\]+|\S+'
+ opt_usage = format(optionals, groups)
+ pos_usage = format(positionals, groups)
+ opt_parts = _re.findall(part_regexp, opt_usage)
+ pos_parts = _re.findall(part_regexp, pos_usage)
+ assert ' '.join(opt_parts) == opt_usage
+ assert ' '.join(pos_parts) == pos_usage
+
+ # helper for wrapping lines
+ def get_lines(parts, indent, prefix=None):
+ lines = []
+ line = []
+ if prefix is not None:
+ line_len = len(prefix) - 1
+ else:
+ line_len = len(indent) - 1
+ for part in parts:
+ if line_len + 1 + len(part) > text_width:
+ lines.append(indent + ' '.join(line))
+ line = []
+ line_len = len(indent) - 1
+ line.append(part)
+ line_len += len(part) + 1
+ if line:
+ lines.append(indent + ' '.join(line))
+ if prefix is not None:
+ lines[0] = lines[0][len(indent):]
+ return lines
+
+ # if prog is short, follow it with optionals or positionals
+ if len(prefix) + len(prog) <= 0.75 * text_width:
+ indent = ' ' * (len(prefix) + len(prog) + 1)
+ if opt_parts:
+ lines = get_lines([prog] + opt_parts, indent, prefix)
+ lines.extend(get_lines(pos_parts, indent))
+ elif pos_parts:
+ lines = get_lines([prog] + pos_parts, indent, prefix)
+ else:
+ lines = [prog]
+
+ # if prog is long, put it on its own line
+ else:
+ indent = ' ' * len(prefix)
+ parts = opt_parts + pos_parts
+ lines = get_lines(parts, indent)
+ if len(lines) > 1:
+ lines = []
+ lines.extend(get_lines(opt_parts, indent))
+ lines.extend(get_lines(pos_parts, indent))
+ lines = [prog] + lines
+
+ # join lines into usage
+ usage = '\n'.join(lines)
+
+ # prefix with 'usage:'
+ return '%s%s\n\n' % (prefix, usage)
+
+ def _format_actions_usage(self, actions, groups):
+ # find group indices and identify actions in groups
+ group_actions = set()
+ inserts = {}
+ for group in groups:
+ try:
+ start = actions.index(group._group_actions[0])
+ except ValueError:
+ continue
+ else:
+ end = start + len(group._group_actions)
+ if actions[start:end] == group._group_actions:
+ for action in group._group_actions:
+ group_actions.add(action)
+ if not group.required:
+ if start in inserts:
+ inserts[start] += ' ['
+ else:
+ inserts[start] = '['
+ inserts[end] = ']'
+ else:
+ if start in inserts:
+ inserts[start] += ' ('
+ else:
+ inserts[start] = '('
+ inserts[end] = ')'
+ for i in range(start + 1, end):
+ inserts[i] = '|'
+
+ # collect all actions format strings
+ parts = []
+ for i, action in enumerate(actions):
+
+ # suppressed arguments are marked with None
+ # remove | separators for suppressed arguments
+ if action.help is SUPPRESS:
+ parts.append(None)
+ if inserts.get(i) == '|':
+ inserts.pop(i)
+ elif inserts.get(i + 1) == '|':
+ inserts.pop(i + 1)
+
+ # produce all arg strings
+ elif not action.option_strings:
+ default = self._get_default_metavar_for_positional(action)
+ part = self._format_args(action, default)
+
+ # if it's in a group, strip the outer []
+ if action in group_actions:
+ if part[0] == '[' and part[-1] == ']':
+ part = part[1:-1]
+
+ # add the action string to the list
+ parts.append(part)
+
+ # produce the first way to invoke the option in brackets
+ else:
+ option_string = action.option_strings[0]
+
+ # if the Optional doesn't take a value, format is:
+ # -s or --long
+ if action.nargs == 0:
+ part = '%s' % option_string
+
+ # if the Optional takes a value, format is:
+ # -s ARGS or --long ARGS
+ else:
+ default = self._get_default_metavar_for_optional(action)
+ args_string = self._format_args(action, default)
+ part = '%s %s' % (option_string, args_string)
+
+ # make it look optional if it's not required or in a group
+ if not action.required and action not in group_actions:
+ part = '[%s]' % part
+
+ # add the action string to the list
+ parts.append(part)
+
+ # insert things at the necessary indices
+ for i in sorted(inserts, reverse=True):
+ parts[i:i] = [inserts[i]]
+
+ # join all the action items with spaces
+ text = ' '.join([item for item in parts if item is not None])
+
+ # clean up separators for mutually exclusive groups
+ open = r'[\[(]'
+ close = r'[\])]'
+ text = _re.sub(r'(%s) ' % open, r'\1', text)
+ text = _re.sub(r' (%s)' % close, r'\1', text)
+ text = _re.sub(r'%s *%s' % (open, close), r'', text)
+ text = _re.sub(r'\(([^|]*)\)', r'\1', text)
+ text = text.strip()
+
+ # return the text
+ return text
+
+ def _format_text(self, text):
+ if '%(prog)' in text:
+ text = text % dict(prog=self._prog)
+ text_width = self._width - self._current_indent
+ indent = ' ' * self._current_indent
+ return self._fill_text(text, text_width, indent) + '\n\n'
+
+ def _format_action(self, action):
+ # determine the required width and the entry label
+ help_position = min(self._action_max_length + 2,
+ self._max_help_position)
+ help_width = self._width - help_position
+ action_width = help_position - self._current_indent - 2
+ action_header = self._format_action_invocation(action)
+
+ # no help; start on same line and add a final newline
+ if not action.help:
+ tup = self._current_indent, '', action_header
+ action_header = '%*s%s\n' % tup
+
+ # short action name; start on the same line and pad two spaces
+ elif len(action_header) <= action_width:
+ tup = self._current_indent, '', action_width, action_header
+ action_header = '%*s%-*s ' % tup
+ indent_first = 0
+
+ # long action name; start on the next line
+ else:
+ tup = self._current_indent, '', action_header
+ action_header = '%*s%s\n' % tup
+ indent_first = help_position
+
+ # collect the pieces of the action help
+ parts = [action_header]
+
+ # if there was help for the action, add lines of help text
+ if action.help:
+ help_text = self._expand_help(action)
+ help_lines = self._split_lines(help_text, help_width)
+ parts.append('%*s%s\n' % (indent_first, '', help_lines[0]))
+ for line in help_lines[1:]:
+ parts.append('%*s%s\n' % (help_position, '', line))
+
+ # or add a newline if the description doesn't end with one
+ elif not action_header.endswith('\n'):
+ parts.append('\n')
+
+ # if there are any sub-actions, add their help as well
+ for subaction in self._iter_indented_subactions(action):
+ parts.append(self._format_action(subaction))
+
+ # return a single string
+ return self._join_parts(parts)
+
+ def _format_action_invocation(self, action):
+ if not action.option_strings:
+ default = self._get_default_metavar_for_positional(action)
+ metavar, = self._metavar_formatter(action, default)(1)
+ return metavar
+
+ else:
+ parts = []
+
+ # if the Optional doesn't take a value, format is:
+ # -s, --long
+ if action.nargs == 0:
+ parts.extend(action.option_strings)
+
+ # if the Optional takes a value, format is:
+ # -s ARGS, --long ARGS
+ else:
+ default = self._get_default_metavar_for_optional(action)
+ args_string = self._format_args(action, default)
+ for option_string in action.option_strings:
+ parts.append('%s %s' % (option_string, args_string))
+
+ return ', '.join(parts)
+
+ def _metavar_formatter(self, action, default_metavar):
+ if action.metavar is not None:
+ result = action.metavar
+ elif action.choices is not None:
+ choice_strs = [str(choice) for choice in action.choices]
+ result = '{%s}' % ','.join(choice_strs)
+ else:
+ result = default_metavar
+
+ def format(tuple_size):
+ if isinstance(result, tuple):
+ return result
+ else:
+ return (result, ) * tuple_size
+ return format
+
+ def _format_args(self, action, default_metavar):
+ get_metavar = self._metavar_formatter(action, default_metavar)
+ if action.nargs is None:
+ result = '%s' % get_metavar(1)
+ elif action.nargs == OPTIONAL:
+ result = '[%s]' % get_metavar(1)
+ elif action.nargs == ZERO_OR_MORE:
+ result = '[%s [%s ...]]' % get_metavar(2)
+ elif action.nargs == ONE_OR_MORE:
+ result = '%s [%s ...]' % get_metavar(2)
+ elif action.nargs == REMAINDER:
+ result = '...'
+ elif action.nargs == PARSER:
+ result = '%s ...' % get_metavar(1)
+ else:
+ formats = ['%s' for _ in range(action.nargs)]
+ result = ' '.join(formats) % get_metavar(action.nargs)
+ return result
+
+ def _expand_help(self, action):
+ params = dict(vars(action), prog=self._prog)
+ for name in list(params):
+ if params[name] is SUPPRESS:
+ del params[name]
+ for name in list(params):
+ if hasattr(params[name], '__name__'):
+ params[name] = params[name].__name__
+ if params.get('choices') is not None:
+ choices_str = ', '.join([str(c) for c in params['choices']])
+ params['choices'] = choices_str
+ return self._get_help_string(action) % params
+
+ def _iter_indented_subactions(self, action):
+ try:
+ get_subactions = action._get_subactions
+ except AttributeError:
+ pass
+ else:
+ self._indent()
+ for subaction in get_subactions():
+ yield subaction
+ self._dedent()
+
+ def _split_lines(self, text, width):
+ text = self._whitespace_matcher.sub(' ', text).strip()
+ return _textwrap.wrap(text, width)
+
+ def _fill_text(self, text, width, indent):
+ text = self._whitespace_matcher.sub(' ', text).strip()
+ return _textwrap.fill(text, width, initial_indent=indent,
+ subsequent_indent=indent)
+
+ def _get_help_string(self, action):
+ return action.help
+
+ def _get_default_metavar_for_optional(self, action):
+ return action.dest.upper()
+
+ def _get_default_metavar_for_positional(self, action):
+ return action.dest
+
+
+class RawDescriptionHelpFormatter(HelpFormatter):
+ """Help message formatter which retains any formatting in descriptions.
+
+ Only the name of this class is considered a public API. All the methods
+ provided by the class are considered an implementation detail.
+ """
+
+ def _fill_text(self, text, width, indent):
+ return ''.join(indent + line for line in text.splitlines(keepends=True))
+
+
+class RawTextHelpFormatter(RawDescriptionHelpFormatter):
+ """Help message formatter which retains formatting of all help text.
+
+ Only the name of this class is considered a public API. All the methods
+ provided by the class are considered an implementation detail.
+ """
+
+ def _split_lines(self, text, width):
+ return text.splitlines()
+
+
+class ArgumentDefaultsHelpFormatter(HelpFormatter):
+ """Help message formatter which adds default values to argument help.
+
+ Only the name of this class is considered a public API. All the methods
+ provided by the class are considered an implementation detail.
+ """
+
+ def _get_help_string(self, action):
+ help = action.help
+ if '%(default)' not in action.help:
+ if action.default is not SUPPRESS:
+ defaulting_nargs = [OPTIONAL, ZERO_OR_MORE]
+ if action.option_strings or action.nargs in defaulting_nargs:
+ help += ' (default: %(default)s)'
+ return help
+
+
+class MetavarTypeHelpFormatter(HelpFormatter):
+ """Help message formatter which uses the argument 'type' as the default
+ metavar value (instead of the argument 'dest')
+
+ Only the name of this class is considered a public API. All the methods
+ provided by the class are considered an implementation detail.
+ """
+
+ def _get_default_metavar_for_optional(self, action):
+ return action.type.__name__
+
+ def _get_default_metavar_for_positional(self, action):
+ return action.type.__name__
+
+
+
+# =====================
+# Options and Arguments
+# =====================
+
+def _get_action_name(argument):
+ if argument is None:
+ return None
+ elif argument.option_strings:
+ return '/'.join(argument.option_strings)
+ elif argument.metavar not in (None, SUPPRESS):
+ return argument.metavar
+ elif argument.dest not in (None, SUPPRESS):
+ return argument.dest
+ else:
+ return None
+
+
+class ArgumentError(Exception):
+ """An error from creating or using an argument (optional or positional).
+
+ The string value of this exception is the message, augmented with
+ information about the argument that caused it.
+ """
+
+ def __init__(self, argument, message):
+ self.argument_name = _get_action_name(argument)
+ self.message = message
+
+ def __str__(self):
+ if self.argument_name is None:
+ format = '%(message)s'
+ else:
+ format = 'argument %(argument_name)s: %(message)s'
+ return format % dict(message=self.message,
+ argument_name=self.argument_name)
+
+
+class ArgumentTypeError(Exception):
+ """An error from trying to convert a command line string to a type."""
+ pass
+
+
+# ==============
+# Action classes
+# ==============
+
+class Action(_AttributeHolder):
+ """Information about how to convert command line strings to Python objects.
+
+ Action objects are used by an ArgumentParser to represent the information
+ needed to parse a single argument from one or more strings from the
+ command line. The keyword arguments to the Action constructor are also
+ all attributes of Action instances.
+
+ Keyword Arguments:
+
+ - option_strings -- A list of command-line option strings which
+ should be associated with this action.
+
+ - dest -- The name of the attribute to hold the created object(s)
+
+ - nargs -- The number of command-line arguments that should be
+ consumed. By default, one argument will be consumed and a single
+ value will be produced. Other values include:
+ - N (an integer) consumes N arguments (and produces a list)
+ - '?' consumes zero or one arguments
+ - '*' consumes zero or more arguments (and produces a list)
+ - '+' consumes one or more arguments (and produces a list)
+ Note that the difference between the default and nargs=1 is that
+ with the default, a single value will be produced, while with
+ nargs=1, a list containing a single value will be produced.
+
+ - const -- The value to be produced if the option is specified and the
+ option uses an action that takes no values.
+
+ - default -- The value to be produced if the option is not specified.
+
+ - type -- The type which the command-line arguments should be converted
+ to, should be one of 'string', 'int', 'float', 'complex' or a
+ callable object that accepts a single string argument. If None,
+ 'string' is assumed.
+
+ - choices -- A container of values that should be allowed. If not None,
+ after a command-line argument has been converted to the appropriate
+ type, an exception will be raised if it is not a member of this
+ collection.
+
+ - required -- True if the action must always be specified at the
+ command line. This is only meaningful for optional command-line
+ arguments.
+
+ - help -- The help string describing the argument.
+
+ - metavar -- The name to be used for the option's argument with the
+ help string. If None, the 'dest' value will be used as the name.
+ """
+
+ def __init__(self,
+ option_strings,
+ dest,
+ nargs=None,
+ const=None,
+ default=None,
+ type=None,
+ choices=None,
+ required=False,
+ help=None,
+ metavar=None):
+ self.option_strings = option_strings
+ self.dest = dest
+ self.nargs = nargs
+ self.const = const
+ self.default = default
+ self.type = type
+ self.choices = choices
+ self.required = required
+ self.help = help
+ self.metavar = metavar
+
+ def _get_kwargs(self):
+ names = [
+ 'option_strings',
+ 'dest',
+ 'nargs',
+ 'const',
+ 'default',
+ 'type',
+ 'choices',
+ 'help',
+ 'metavar',
+ ]
+ return [(name, getattr(self, name)) for name in names]
+
+ def __call__(self, parser, namespace, values, option_string=None):
+ raise NotImplementedError(_('.__call__() not defined'))
+
+
+class _StoreAction(Action):
+
+ def __init__(self,
+ option_strings,
+ dest,
+ nargs=None,
+ const=None,
+ default=None,
+ type=None,
+ choices=None,
+ required=False,
+ help=None,
+ metavar=None):
+ if nargs == 0:
+ raise ValueError('nargs for store actions must be > 0; if you '
+ 'have nothing to store, actions such as store '
+ 'true or store const may be more appropriate')
+ if const is not None and nargs != OPTIONAL:
+ raise ValueError('nargs must be %r to supply const' % OPTIONAL)
+ super(_StoreAction, self).__init__(
+ option_strings=option_strings,
+ dest=dest,
+ nargs=nargs,
+ const=const,
+ default=default,
+ type=type,
+ choices=choices,
+ required=required,
+ help=help,
+ metavar=metavar)
+
+ def __call__(self, parser, namespace, values, option_string=None):
+ setattr(namespace, self.dest, values)
+
+
+class _StoreConstAction(Action):
+
+ def __init__(self,
+ option_strings,
+ dest,
+ const,
+ default=None,
+ required=False,
+ help=None,
+ metavar=None):
+ super(_StoreConstAction, self).__init__(
+ option_strings=option_strings,
+ dest=dest,
+ nargs=0,
+ const=const,
+ default=default,
+ required=required,
+ help=help)
+
+ def __call__(self, parser, namespace, values, option_string=None):
+ setattr(namespace, self.dest, self.const)
+
+
+class _StoreTrueAction(_StoreConstAction):
+
+ def __init__(self,
+ option_strings,
+ dest,
+ default=False,
+ required=False,
+ help=None):
+ super(_StoreTrueAction, self).__init__(
+ option_strings=option_strings,
+ dest=dest,
+ const=True,
+ default=default,
+ required=required,
+ help=help)
+
+
+class _StoreFalseAction(_StoreConstAction):
+
+ def __init__(self,
+ option_strings,
+ dest,
+ default=True,
+ required=False,
+ help=None):
+ super(_StoreFalseAction, self).__init__(
+ option_strings=option_strings,
+ dest=dest,
+ const=False,
+ default=default,
+ required=required,
+ help=help)
+
+
+class _AppendAction(Action):
+
+ def __init__(self,
+ option_strings,
+ dest,
+ nargs=None,
+ const=None,
+ default=None,
+ type=None,
+ choices=None,
+ required=False,
+ help=None,
+ metavar=None):
+ if nargs == 0:
+ raise ValueError('nargs for append actions must be > 0; if arg '
+ 'strings are not supplying the value to append, '
+ 'the append const action may be more appropriate')
+ if const is not None and nargs != OPTIONAL:
+ raise ValueError('nargs must be %r to supply const' % OPTIONAL)
+ super(_AppendAction, self).__init__(
+ option_strings=option_strings,
+ dest=dest,
+ nargs=nargs,
+ const=const,
+ default=default,
+ type=type,
+ choices=choices,
+ required=required,
+ help=help,
+ metavar=metavar)
+
+ def __call__(self, parser, namespace, values, option_string=None):
+ items = _copy.copy(_ensure_value(namespace, self.dest, []))
+ items.append(values)
+ setattr(namespace, self.dest, items)
+
+
+class _AppendConstAction(Action):
+
+ def __init__(self,
+ option_strings,
+ dest,
+ const,
+ default=None,
+ required=False,
+ help=None,
+ metavar=None):
+ super(_AppendConstAction, self).__init__(
+ option_strings=option_strings,
+ dest=dest,
+ nargs=0,
+ const=const,
+ default=default,
+ required=required,
+ help=help,
+ metavar=metavar)
+
+ def __call__(self, parser, namespace, values, option_string=None):
+ items = _copy.copy(_ensure_value(namespace, self.dest, []))
+ items.append(self.const)
+ setattr(namespace, self.dest, items)
+
+
+class _CountAction(Action):
+
+ def __init__(self,
+ option_strings,
+ dest,
+ default=None,
+ required=False,
+ help=None):
+ super(_CountAction, self).__init__(
+ option_strings=option_strings,
+ dest=dest,
+ nargs=0,
+ default=default,
+ required=required,
+ help=help)
+
+ def __call__(self, parser, namespace, values, option_string=None):
+ new_count = _ensure_value(namespace, self.dest, 0) + 1
+ setattr(namespace, self.dest, new_count)
+
+
+class _HelpAction(Action):
+
+ def __init__(self,
+ option_strings,
+ dest=SUPPRESS,
+ default=SUPPRESS,
+ help=None):
+ super(_HelpAction, self).__init__(
+ option_strings=option_strings,
+ dest=dest,
+ default=default,
+ nargs=0,
+ help=help)
+
+ def __call__(self, parser, namespace, values, option_string=None):
+ parser.print_help()
+ parser.exit()
+
+
+class _VersionAction(Action):
+
+ def __init__(self,
+ option_strings,
+ version=None,
+ dest=SUPPRESS,
+ default=SUPPRESS,
+ help="show program's version number and exit"):
+ super(_VersionAction, self).__init__(
+ option_strings=option_strings,
+ dest=dest,
+ default=default,
+ nargs=0,
+ help=help)
+ self.version = version
+
+ def __call__(self, parser, namespace, values, option_string=None):
+ version = self.version
+ if version is None:
+ version = parser.version
+ formatter = parser._get_formatter()
+ formatter.add_text(version)
+ parser.exit(message=formatter.format_help())
+
+
+class _SubParsersAction(Action):
+
+ class _ChoicesPseudoAction(Action):
+
+ def __init__(self, name, aliases, help):
+ metavar = dest = name
+ if aliases:
+ metavar += ' (%s)' % ', '.join(aliases)
+ sup = super(_SubParsersAction._ChoicesPseudoAction, self)
+ sup.__init__(option_strings=[], dest=dest, help=help,
+ metavar=metavar)
+
+ def __init__(self,
+ option_strings,
+ prog,
+ parser_class,
+ dest=SUPPRESS,
+ help=None,
+ metavar=None):
+
+ self._prog_prefix = prog
+ self._parser_class = parser_class
+ self._name_parser_map = _collections.OrderedDict()
+ self._choices_actions = []
+
+ super(_SubParsersAction, self).__init__(
+ option_strings=option_strings,
+ dest=dest,
+ nargs=PARSER,
+ choices=self._name_parser_map,
+ help=help,
+ metavar=metavar)
+
+ def add_parser(self, name, **kwargs):
+ # set prog from the existing prefix
+ if kwargs.get('prog') is None:
+ kwargs['prog'] = '%s %s' % (self._prog_prefix, name)
+
+ aliases = kwargs.pop('aliases', ())
+
+ # create a pseudo-action to hold the choice help
+ if 'help' in kwargs:
+ help = kwargs.pop('help')
+ choice_action = self._ChoicesPseudoAction(name, aliases, help)
+ self._choices_actions.append(choice_action)
+
+ # create the parser and add it to the map
+ parser = self._parser_class(**kwargs)
+ self._name_parser_map[name] = parser
+
+ # make parser available under aliases also
+ for alias in aliases:
+ self._name_parser_map[alias] = parser
+
+ return parser
+
+ def _get_subactions(self):
+ return self._choices_actions
+
+ def __call__(self, parser, namespace, values, option_string=None):
+ parser_name = values[0]
+ arg_strings = values[1:]
+
+ # set the parser name if requested
+ if self.dest is not SUPPRESS:
+ setattr(namespace, self.dest, parser_name)
+
+ # select the parser
+ try:
+ parser = self._name_parser_map[parser_name]
+ except KeyError:
+ args = {'parser_name': parser_name,
+ 'choices': ', '.join(self._name_parser_map)}
+ msg = _('unknown parser %(parser_name)r (choices: %(choices)s)') % args
+ raise ArgumentError(self, msg)
+
+ # parse all the remaining options into the namespace
+ # store any unrecognized options on the object, so that the top
+ # level parser can decide what to do with them
+ namespace, arg_strings = parser.parse_known_args(arg_strings, namespace)
+ if arg_strings:
+ vars(namespace).setdefault(_UNRECOGNIZED_ARGS_ATTR, [])
+ getattr(namespace, _UNRECOGNIZED_ARGS_ATTR).extend(arg_strings)
+
+
+# ==============
+# Type classes
+# ==============
+
+class FileType(object):
+ """Factory for creating file object types
+
+ Instances of FileType are typically passed as type= arguments to the
+ ArgumentParser add_argument() method.
+
+ Keyword Arguments:
+ - mode -- A string indicating how the file is to be opened. Accepts the
+ same values as the builtin open() function.
+ - bufsize -- The file's desired buffer size. Accepts the same values as
+ the builtin open() function.
+ """
+
+ def __init__(self, mode='r', bufsize=-1):
+ self._mode = mode
+ self._bufsize = bufsize
+
+ def __call__(self, string):
+ # the special argument "-" means sys.std{in,out}
+ if string == '-':
+ if 'r' in self._mode:
+ return _sys.stdin
+ elif 'w' in self._mode:
+ return _sys.stdout
+ else:
+ msg = _('argument "-" with mode %r') % self._mode
+ raise ValueError(msg)
+
+ # all other arguments are used as file names
+ try:
+ return open(string, self._mode, self._bufsize)
+ except IOError as e:
+ message = _("can't open '%s': %s")
+ raise ArgumentTypeError(message % (string, e))
+
+ def __repr__(self):
+ args = self._mode, self._bufsize
+ args_str = ', '.join(repr(arg) for arg in args if arg != -1)
+ return '%s(%s)' % (type(self).__name__, args_str)
+
+# ===========================
+# Optional and Positional Parsing
+# ===========================
+
+class Namespace(_AttributeHolder):
+ """Simple object for storing attributes.
+
+ Implements equality by attribute names and values, and provides a simple
+ string representation.
+ """
+
+ def __init__(self, **kwargs):
+ for name in kwargs:
+ setattr(self, name, kwargs[name])
+
+ def __eq__(self, other):
+ return vars(self) == vars(other)
+
+ def __ne__(self, other):
+ return not (self == other)
+
+ def __contains__(self, key):
+ return key in self.__dict__
+
+
+class _ActionsContainer(object):
+
+ def __init__(self,
+ description,
+ prefix_chars,
+ argument_default,
+ conflict_handler):
+ super(_ActionsContainer, self).__init__()
+
+ self.description = description
+ self.argument_default = argument_default
+ self.prefix_chars = prefix_chars
+ self.conflict_handler = conflict_handler
+
+ # set up registries
+ self._registries = {}
+
+ # register actions
+ self.register('action', None, _StoreAction)
+ self.register('action', 'store', _StoreAction)
+ self.register('action', 'store_const', _StoreConstAction)
+ self.register('action', 'store_true', _StoreTrueAction)
+ self.register('action', 'store_false', _StoreFalseAction)
+ self.register('action', 'append', _AppendAction)
+ self.register('action', 'append_const', _AppendConstAction)
+ self.register('action', 'count', _CountAction)
+ self.register('action', 'help', _HelpAction)
+ self.register('action', 'version', _VersionAction)
+ self.register('action', 'parsers', _SubParsersAction)
+
+ # raise an exception if the conflict handler is invalid
+ self._get_handler()
+
+ # action storage
+ self._actions = []
+ self._option_string_actions = {}
+
+ # groups
+ self._action_groups = []
+ self._mutually_exclusive_groups = []
+
+ # defaults storage
+ self._defaults = {}
+
+ # determines whether an "option" looks like a negative number
+ self._negative_number_matcher = _re.compile(r'^-\d+$|^-\d*\.\d+$')
+
+ # whether or not there are any optionals that look like negative
+ # numbers -- uses a list so it can be shared and edited
+ self._has_negative_number_optionals = []
+
+ # ====================
+ # Registration methods
+ # ====================
+ def register(self, registry_name, value, object):
+ registry = self._registries.setdefault(registry_name, {})
+ registry[value] = object
+
+ def _registry_get(self, registry_name, value, default=None):
+ return self._registries[registry_name].get(value, default)
+
+ # ==================================
+ # Namespace default accessor methods
+ # ==================================
+ def set_defaults(self, **kwargs):
+ self._defaults.update(kwargs)
+
+ # if these defaults match any existing arguments, replace
+ # the previous default on the object with the new one
+ for action in self._actions:
+ if action.dest in kwargs:
+ action.default = kwargs[action.dest]
+
+ def get_default(self, dest):
+ for action in self._actions:
+ if action.dest == dest and action.default is not None:
+ return action.default
+ return self._defaults.get(dest, None)
+
+
+ # =======================
+ # Adding argument actions
+ # =======================
+ def add_argument(self, *args, **kwargs):
+ """
+ add_argument(dest, ..., name=value, ...)
+ add_argument(option_string, option_string, ..., name=value, ...)
+ """
+
+ # if no positional args are supplied or only one is supplied and
+ # it doesn't look like an option string, parse a positional
+ # argument
+ chars = self.prefix_chars
+ if not args or len(args) == 1 and args[0][0] not in chars:
+ if args and 'dest' in kwargs:
+ raise ValueError('dest supplied twice for positional argument')
+ kwargs = self._get_positional_kwargs(*args, **kwargs)
+
+ # otherwise, we're adding an optional argument
+ else:
+ kwargs = self._get_optional_kwargs(*args, **kwargs)
+
+ # if no default was supplied, use the parser-level default
+ if 'default' not in kwargs:
+ dest = kwargs['dest']
+ if dest in self._defaults:
+ kwargs['default'] = self._defaults[dest]
+ elif self.argument_default is not None:
+ kwargs['default'] = self.argument_default
+
+ # create the action object, and add it to the parser
+ action_class = self._pop_action_class(kwargs)
+ if not callable(action_class):
+ raise ValueError('unknown action "%s"' % (action_class,))
+ action = action_class(**kwargs)
+
+ # raise an error if the action type is not callable
+ type_func = self._registry_get('type', action.type, action.type)
+ if not callable(type_func):
+ raise ValueError('%r is not callable' % (type_func,))
+
+ # raise an error if the metavar does not match the type
+ if hasattr(self, "_get_formatter"):
+ try:
+ self._get_formatter()._format_args(action, None)
+ except TypeError:
+ raise ValueError("length of metavar tuple does not match nargs")
+
+ return self._add_action(action)
+
+ def add_argument_group(self, *args, **kwargs):
+ group = _ArgumentGroup(self, *args, **kwargs)
+ self._action_groups.append(group)
+ return group
+
+ def add_mutually_exclusive_group(self, **kwargs):
+ group = _MutuallyExclusiveGroup(self, **kwargs)
+ self._mutually_exclusive_groups.append(group)
+ return group
+
+ def _add_action(self, action):
+ # resolve any conflicts
+ self._check_conflict(action)
+
+ # add to actions list
+ self._actions.append(action)
+ action.container = self
+
+ # index the action by any option strings it has
+ for option_string in action.option_strings:
+ self._option_string_actions[option_string] = action
+
+ # set the flag if any option strings look like negative numbers
+ for option_string in action.option_strings:
+ if self._negative_number_matcher.match(option_string):
+ if not self._has_negative_number_optionals:
+ self._has_negative_number_optionals.append(True)
+
+ # return the created action
+ return action
+
+ def _remove_action(self, action):
+ self._actions.remove(action)
+
+ def _add_container_actions(self, container):
+ # collect groups by titles
+ title_group_map = {}
+ for group in self._action_groups:
+ if group.title in title_group_map:
+ msg = _('cannot merge actions - two groups are named %r')
+ raise ValueError(msg % (group.title))
+ title_group_map[group.title] = group
+
+ # map each action to its group
+ group_map = {}
+ for group in container._action_groups:
+
+ # if a group with the title exists, use that, otherwise
+ # create a new group matching the container's group
+ if group.title not in title_group_map:
+ title_group_map[group.title] = self.add_argument_group(
+ title=group.title,
+ description=group.description,
+ conflict_handler=group.conflict_handler)
+
+ # map the actions to their new group
+ for action in group._group_actions:
+ group_map[action] = title_group_map[group.title]
+
+ # add container's mutually exclusive groups
+ # NOTE: if add_mutually_exclusive_group ever gains title= and
+ # description= then this code will need to be expanded as above
+ for group in container._mutually_exclusive_groups:
+ mutex_group = self.add_mutually_exclusive_group(
+ required=group.required)
+
+ # map the actions to their new mutex group
+ for action in group._group_actions:
+ group_map[action] = mutex_group
+
+ # add all actions to this container or their group
+ for action in container._actions:
+ group_map.get(action, self)._add_action(action)
+
+ def _get_positional_kwargs(self, dest, **kwargs):
+ # make sure required is not specified
+ if 'required' in kwargs:
+ msg = _("'required' is an invalid argument for positionals")
+ raise TypeError(msg)
+
+ # mark positional arguments as required if at least one is
+ # always required
+ if kwargs.get('nargs') not in [OPTIONAL, ZERO_OR_MORE]:
+ kwargs['required'] = True
+ if kwargs.get('nargs') == ZERO_OR_MORE and 'default' not in kwargs:
+ kwargs['required'] = True
+
+ # return the keyword arguments with no option strings
+ return dict(kwargs, dest=dest, option_strings=[])
+
+ def _get_optional_kwargs(self, *args, **kwargs):
+ # determine short and long option strings
+ option_strings = []
+ long_option_strings = []
+ for option_string in args:
+ # error on strings that don't start with an appropriate prefix
+ if not option_string[0] in self.prefix_chars:
+ args = {'option': option_string,
+ 'prefix_chars': self.prefix_chars}
+ msg = _('invalid option string %(option)r: '
+ 'must start with a character %(prefix_chars)r')
+ raise ValueError(msg % args)
+
+ # strings starting with two prefix characters are long options
+ option_strings.append(option_string)
+ if option_string[0] in self.prefix_chars:
+ if len(option_string) > 1:
+ if option_string[1] in self.prefix_chars:
+ long_option_strings.append(option_string)
+
+ # infer destination, '--foo-bar' -> 'foo_bar' and '-x' -> 'x'
+ dest = kwargs.pop('dest', None)
+ if dest is None:
+ if long_option_strings:
+ dest_option_string = long_option_strings[0]
+ else:
+ dest_option_string = option_strings[0]
+ dest = dest_option_string.lstrip(self.prefix_chars)
+ if not dest:
+ msg = _('dest= is required for options like %r')
+ raise ValueError(msg % option_string)
+ dest = dest.replace('-', '_')
+
+ # return the updated keyword arguments
+ return dict(kwargs, dest=dest, option_strings=option_strings)
+
+ def _pop_action_class(self, kwargs, default=None):
+ action = kwargs.pop('action', default)
+ return self._registry_get('action', action, action)
+
+ def _get_handler(self):
+ # determine function from conflict handler string
+ handler_func_name = '_handle_conflict_%s' % self.conflict_handler
+ try:
+ return getattr(self, handler_func_name)
+ except AttributeError:
+ msg = _('invalid conflict_resolution value: %r')
+ raise ValueError(msg % self.conflict_handler)
+
+ def _check_conflict(self, action):
+
+ # find all options that conflict with this option
+ confl_optionals = []
+ for option_string in action.option_strings:
+ if option_string in self._option_string_actions:
+ confl_optional = self._option_string_actions[option_string]
+ confl_optionals.append((option_string, confl_optional))
+
+ # resolve any conflicts
+ if confl_optionals:
+ conflict_handler = self._get_handler()
+ conflict_handler(action, confl_optionals)
+
+ def _handle_conflict_error(self, action, conflicting_actions):
+ message = ngettext('conflicting option string: %s',
+ 'conflicting option strings: %s',
+ len(conflicting_actions))
+ conflict_string = ', '.join([option_string
+ for option_string, action
+ in conflicting_actions])
+ raise ArgumentError(action, message % conflict_string)
+
+ def _handle_conflict_resolve(self, action, conflicting_actions):
+
+ # remove all conflicting options
+ for option_string, action in conflicting_actions:
+
+ # remove the conflicting option
+ action.option_strings.remove(option_string)
+ self._option_string_actions.pop(option_string, None)
+
+ # if the option now has no option string, remove it from the
+ # container holding it
+ if not action.option_strings:
+ action.container._remove_action(action)
+
+
+class _ArgumentGroup(_ActionsContainer):
+
+ def __init__(self, container, title=None, description=None, **kwargs):
+ # add any missing keyword arguments by checking the container
+ update = kwargs.setdefault
+ update('conflict_handler', container.conflict_handler)
+ update('prefix_chars', container.prefix_chars)
+ update('argument_default', container.argument_default)
+ super_init = super(_ArgumentGroup, self).__init__
+ super_init(description=description, **kwargs)
+
+ # group attributes
+ self.title = title
+ self._group_actions = []
+
+ # share most attributes with the container
+ self._registries = container._registries
+ self._actions = container._actions
+ self._option_string_actions = container._option_string_actions
+ self._defaults = container._defaults
+ self._has_negative_number_optionals = \
+ container._has_negative_number_optionals
+ self._mutually_exclusive_groups = container._mutually_exclusive_groups
+
+ def _add_action(self, action):
+ action = super(_ArgumentGroup, self)._add_action(action)
+ self._group_actions.append(action)
+ return action
+
+ def _remove_action(self, action):
+ super(_ArgumentGroup, self)._remove_action(action)
+ self._group_actions.remove(action)
+
+
+class _MutuallyExclusiveGroup(_ArgumentGroup):
+
+ def __init__(self, container, required=False):
+ super(_MutuallyExclusiveGroup, self).__init__(container)
+ self.required = required
+ self._container = container
+
+ def _add_action(self, action):
+ if action.required:
+ msg = _('mutually exclusive arguments must be optional')
+ raise ValueError(msg)
+ action = self._container._add_action(action)
+ self._group_actions.append(action)
+ return action
+
+ def _remove_action(self, action):
+ self._container._remove_action(action)
+ self._group_actions.remove(action)
+
+
+class ArgumentParser(_AttributeHolder, _ActionsContainer):
+ """Object for parsing command line strings into Python objects.
+
+ Keyword Arguments:
+ - prog -- The name of the program (default: sys.argv[0])
+ - usage -- A usage message (default: auto-generated from arguments)
+ - description -- A description of what the program does
+ - epilog -- Text following the argument descriptions
+ - parents -- Parsers whose arguments should be copied into this one
+ - formatter_class -- HelpFormatter class for printing help messages
+ - prefix_chars -- Characters that prefix optional arguments
+ - fromfile_prefix_chars -- Characters that prefix files containing
+ additional arguments
+ - argument_default -- The default value for all arguments
+ - conflict_handler -- String indicating how to handle conflicts
+ - add_help -- Add a -h/--help option
+ """
+
+ def __init__(self,
+ prog=None,
+ usage=None,
+ description=None,
+ epilog=None,
+ version=None,
+ parents=[],
+ formatter_class=HelpFormatter,
+ prefix_chars='-',
+ fromfile_prefix_chars=None,
+ argument_default=None,
+ conflict_handler='error',
+ add_help=True):
+
+ if version is not None:
+ import warnings
+ warnings.warn(
+ """The "version" argument to ArgumentParser is deprecated. """
+ """Please use """
+ """"add_argument(..., action='version', version="N", ...)" """
+ """instead""", DeprecationWarning)
+
+ superinit = super(ArgumentParser, self).__init__
+ superinit(description=description,
+ prefix_chars=prefix_chars,
+ argument_default=argument_default,
+ conflict_handler=conflict_handler)
+
+ # default setting for prog
+ if prog is None:
+ prog = _os.path.basename(_sys.argv[0])
+
+ self.prog = prog
+ self.usage = usage
+ self.epilog = epilog
+ self.version = version
+ self.formatter_class = formatter_class
+ self.fromfile_prefix_chars = fromfile_prefix_chars
+ self.add_help = add_help
+
+ add_group = self.add_argument_group
+ self._positionals = add_group(_('positional arguments'))
+ self._optionals = add_group(_('optional arguments'))
+ self._subparsers = None
+
+ # register types
+ def identity(string):
+ return string
+ self.register('type', None, identity)
+
+ # add help and version arguments if necessary
+ # (using explicit default to override global argument_default)
+ default_prefix = '-' if '-' in prefix_chars else prefix_chars[0]
+ if self.add_help:
+ self.add_argument(
+ default_prefix+'h', default_prefix*2+'help',
+ action='help', default=SUPPRESS,
+ help=_('show this help message and exit'))
+ if self.version:
+ self.add_argument(
+ default_prefix+'v', default_prefix*2+'version',
+ action='version', default=SUPPRESS,
+ version=self.version,
+ help=_("show program's version number and exit"))
+
+ # add parent arguments and defaults
+ for parent in parents:
+ self._add_container_actions(parent)
+ try:
+ defaults = parent._defaults
+ except AttributeError:
+ pass
+ else:
+ self._defaults.update(defaults)
+
+ # =======================
+ # Pretty __repr__ methods
+ # =======================
+ def _get_kwargs(self):
+ names = [
+ 'prog',
+ 'usage',
+ 'description',
+ 'version',
+ 'formatter_class',
+ 'conflict_handler',
+ 'add_help',
+ ]
+ return [(name, getattr(self, name)) for name in names]
+
+ # ==================================
+ # Optional/Positional adding methods
+ # ==================================
+ def add_subparsers(self, **kwargs):
+ if self._subparsers is not None:
+ self.error(_('cannot have multiple subparser arguments'))
+
+ # add the parser class to the arguments if it's not present
+ kwargs.setdefault('parser_class', type(self))
+
+ if 'title' in kwargs or 'description' in kwargs:
+ title = _(kwargs.pop('title', 'subcommands'))
+ description = _(kwargs.pop('description', None))
+ self._subparsers = self.add_argument_group(title, description)
+ else:
+ self._subparsers = self._positionals
+
+ # prog defaults to the usage message of this parser, skipping
+ # optional arguments and with no "usage:" prefix
+ if kwargs.get('prog') is None:
+ formatter = self._get_formatter()
+ positionals = self._get_positional_actions()
+ groups = self._mutually_exclusive_groups
+ formatter.add_usage(self.usage, positionals, groups, '')
+ kwargs['prog'] = formatter.format_help().strip()
+
+ # create the parsers action and add it to the positionals list
+ parsers_class = self._pop_action_class(kwargs, 'parsers')
+ action = parsers_class(option_strings=[], **kwargs)
+ self._subparsers._add_action(action)
+
+ # return the created parsers action
+ return action
+
+ def _add_action(self, action):
+ if action.option_strings:
+ self._optionals._add_action(action)
+ else:
+ self._positionals._add_action(action)
+ return action
+
+ def _get_optional_actions(self):
+ return [action
+ for action in self._actions
+ if action.option_strings]
+
+ def _get_positional_actions(self):
+ return [action
+ for action in self._actions
+ if not action.option_strings]
+
+ # =====================================
+ # Command line argument parsing methods
+ # =====================================
+ def parse_args(self, args=None, namespace=None):
+ args, argv = self.parse_known_args(args, namespace)
+ if argv:
+ msg = _('unrecognized arguments: %s')
+ self.error(msg % ' '.join(argv))
+ return args
+
+ def parse_known_args(self, args=None, namespace=None):
+ # args default to the system args
+ if args is None:
+ args = _sys.argv[1:]
+
+ # default Namespace built from parser defaults
+ if namespace is None:
+ namespace = Namespace()
+
+ # add any action defaults that aren't present
+ for action in self._actions:
+ if action.dest is not SUPPRESS:
+ if not hasattr(namespace, action.dest):
+ if action.default is not SUPPRESS:
+ default = action.default
+ if isinstance(action.default, str):
+ default = self._get_value(action, default)
+ setattr(namespace, action.dest, default)
+
+ # add any parser defaults that aren't present
+ for dest in self._defaults:
+ if not hasattr(namespace, dest):
+ setattr(namespace, dest, self._defaults[dest])
+
+ # parse the arguments and exit if there are any errors
+ try:
+ namespace, args = self._parse_known_args(args, namespace)
+ if hasattr(namespace, _UNRECOGNIZED_ARGS_ATTR):
+ args.extend(getattr(namespace, _UNRECOGNIZED_ARGS_ATTR))
+ delattr(namespace, _UNRECOGNIZED_ARGS_ATTR)
+ return namespace, args
+ except ArgumentError:
+ err = _sys.exc_info()[1]
+ self.error(str(err))
+
+ def _parse_known_args(self, arg_strings, namespace):
+ # replace arg strings that are file references
+ if self.fromfile_prefix_chars is not None:
+ arg_strings = self._read_args_from_files(arg_strings)
+
+ # map all mutually exclusive arguments to the other arguments
+ # they can't occur with
+ action_conflicts = {}
+ for mutex_group in self._mutually_exclusive_groups:
+ group_actions = mutex_group._group_actions
+ for i, mutex_action in enumerate(mutex_group._group_actions):
+ conflicts = action_conflicts.setdefault(mutex_action, [])
+ conflicts.extend(group_actions[:i])
+ conflicts.extend(group_actions[i + 1:])
+
+ # find all option indices, and determine the arg_string_pattern
+ # which has an 'O' if there is an option at an index,
+ # an 'A' if there is an argument, or a '-' if there is a '--'
+ option_string_indices = {}
+ arg_string_pattern_parts = []
+ arg_strings_iter = iter(arg_strings)
+ for i, arg_string in enumerate(arg_strings_iter):
+
+ # all args after -- are non-options
+ if arg_string == '--':
+ arg_string_pattern_parts.append('-')
+ for arg_string in arg_strings_iter:
+ arg_string_pattern_parts.append('A')
+
+ # otherwise, add the arg to the arg strings
+ # and note the index if it was an option
+ else:
+ option_tuple = self._parse_optional(arg_string)
+ if option_tuple is None:
+ pattern = 'A'
+ else:
+ option_string_indices[i] = option_tuple
+ pattern = 'O'
+ arg_string_pattern_parts.append(pattern)
+
+ # join the pieces together to form the pattern
+ arg_strings_pattern = ''.join(arg_string_pattern_parts)
+
+        # converts arg strings to the appropriate type and then takes the action
+ seen_actions = set()
+ seen_non_default_actions = set()
+
+ def take_action(action, argument_strings, option_string=None):
+ seen_actions.add(action)
+ argument_values = self._get_values(action, argument_strings)
+
+ # error if this argument is not allowed with other previously
+ # seen arguments, assuming that actions that use the default
+ # value don't really count as "present"
+ if argument_values is not action.default:
+ seen_non_default_actions.add(action)
+ for conflict_action in action_conflicts.get(action, []):
+ if conflict_action in seen_non_default_actions:
+ msg = _('not allowed with argument %s')
+ action_name = _get_action_name(conflict_action)
+ raise ArgumentError(action, msg % action_name)
+
+ # take the action if we didn't receive a SUPPRESS value
+ # (e.g. from a default)
+ if argument_values is not SUPPRESS:
+ action(self, namespace, argument_values, option_string)
+
+ # function to convert arg_strings into an optional action
+ def consume_optional(start_index):
+
+ # get the optional identified at this index
+ option_tuple = option_string_indices[start_index]
+ action, option_string, explicit_arg = option_tuple
+
+ # identify additional optionals in the same arg string
+ # (e.g. -xyz is the same as -x -y -z if no args are required)
+ match_argument = self._match_argument
+ action_tuples = []
+ while True:
+
+ # if we found no optional action, skip it
+ if action is None:
+ extras.append(arg_strings[start_index])
+ return start_index + 1
+
+ # if there is an explicit argument, try to match the
+ # optional's string arguments to only this
+ if explicit_arg is not None:
+ arg_count = match_argument(action, 'A')
+
+ # if the action is a single-dash option and takes no
+ # arguments, try to parse more single-dash options out
+ # of the tail of the option string
+ chars = self.prefix_chars
+ if arg_count == 0 and option_string[1] not in chars:
+ action_tuples.append((action, [], option_string))
+ char = option_string[0]
+ option_string = char + explicit_arg[0]
+ new_explicit_arg = explicit_arg[1:] or None
+ optionals_map = self._option_string_actions
+ if option_string in optionals_map:
+ action = optionals_map[option_string]
+ explicit_arg = new_explicit_arg
+ else:
+ msg = _('ignored explicit argument %r')
+ raise ArgumentError(action, msg % explicit_arg)
+
+                # if the action expects exactly one argument, we've
+ # successfully matched the option; exit the loop
+ elif arg_count == 1:
+ stop = start_index + 1
+ args = [explicit_arg]
+ action_tuples.append((action, args, option_string))
+ break
+
+ # error if a double-dash option did not use the
+ # explicit argument
+ else:
+ msg = _('ignored explicit argument %r')
+ raise ArgumentError(action, msg % explicit_arg)
+
+ # if there is no explicit argument, try to match the
+ # optional's string arguments with the following strings
+ # if successful, exit the loop
+ else:
+ start = start_index + 1
+ selected_patterns = arg_strings_pattern[start:]
+ arg_count = match_argument(action, selected_patterns)
+ stop = start + arg_count
+ args = arg_strings[start:stop]
+ action_tuples.append((action, args, option_string))
+ break
+
+ # add the Optional to the list and return the index at which
+ # the Optional's string args stopped
+ assert action_tuples
+ for action, args, option_string in action_tuples:
+ take_action(action, args, option_string)
+ return stop
+
+ # the list of Positionals left to be parsed; this is modified
+ # by consume_positionals()
+ positionals = self._get_positional_actions()
+
+ # function to convert arg_strings into positional actions
+ def consume_positionals(start_index):
+ # match as many Positionals as possible
+ match_partial = self._match_arguments_partial
+ selected_pattern = arg_strings_pattern[start_index:]
+ arg_counts = match_partial(positionals, selected_pattern)
+
+ # slice off the appropriate arg strings for each Positional
+ # and add the Positional and its args to the list
+ for action, arg_count in zip(positionals, arg_counts):
+ args = arg_strings[start_index: start_index + arg_count]
+ start_index += arg_count
+ take_action(action, args)
+
+ # slice off the Positionals that we just parsed and return the
+ # index at which the Positionals' string args stopped
+ positionals[:] = positionals[len(arg_counts):]
+ return start_index
+
+ # consume Positionals and Optionals alternately, until we have
+ # passed the last option string
+ extras = []
+ start_index = 0
+ if option_string_indices:
+ max_option_string_index = max(option_string_indices)
+ else:
+ max_option_string_index = -1
+ while start_index <= max_option_string_index:
+
+ # consume any Positionals preceding the next option
+ next_option_string_index = min([
+ index
+ for index in option_string_indices
+ if index >= start_index])
+ if start_index != next_option_string_index:
+ positionals_end_index = consume_positionals(start_index)
+
+ # only try to parse the next optional if we didn't consume
+ # the option string during the positionals parsing
+ if positionals_end_index > start_index:
+ start_index = positionals_end_index
+ continue
+ else:
+ start_index = positionals_end_index
+
+ # if we consumed all the positionals we could and we're not
+ # at the index of an option string, there were extra arguments
+ if start_index not in option_string_indices:
+ strings = arg_strings[start_index:next_option_string_index]
+ extras.extend(strings)
+ start_index = next_option_string_index
+
+ # consume the next optional and any arguments for it
+ start_index = consume_optional(start_index)
+
+ # consume any positionals following the last Optional
+ stop_index = consume_positionals(start_index)
+
+ # if we didn't consume all the argument strings, there were extras
+ extras.extend(arg_strings[stop_index:])
+
+ # make sure all required actions were present
+ required_actions = [_get_action_name(action) for action in self._actions
+ if action.required and action not in seen_actions]
+ if required_actions:
+ self.error(_('the following arguments are required: %s') %
+ ', '.join(required_actions))
+
+ # make sure all required groups had one option present
+ for group in self._mutually_exclusive_groups:
+ if group.required:
+ for action in group._group_actions:
+ if action in seen_non_default_actions:
+ break
+
+ # if no actions were used, report the error
+ else:
+ names = [_get_action_name(action)
+ for action in group._group_actions
+ if action.help is not SUPPRESS]
+ msg = _('one of the arguments %s is required')
+ self.error(msg % ' '.join(names))
+
+ # return the updated namespace and the extra arguments
+ return namespace, extras
+
+ def _read_args_from_files(self, arg_strings):
+ # expand arguments referencing files
+ new_arg_strings = []
+ for arg_string in arg_strings:
+
+ # for regular arguments, just add them back into the list
+ if arg_string[0] not in self.fromfile_prefix_chars:
+ new_arg_strings.append(arg_string)
+
+ # replace arguments referencing files with the file content
+ else:
+ try:
+ args_file = open(arg_string[1:])
+ try:
+ arg_strings = []
+ for arg_line in args_file.read().splitlines():
+ for arg in self.convert_arg_line_to_args(arg_line):
+ arg_strings.append(arg)
+ arg_strings = self._read_args_from_files(arg_strings)
+ new_arg_strings.extend(arg_strings)
+ finally:
+ args_file.close()
+ except IOError:
+ err = _sys.exc_info()[1]
+ self.error(str(err))
+
+ # return the modified argument list
+ return new_arg_strings
+
+ def convert_arg_line_to_args(self, arg_line):
+ return [arg_line]
+
+ def _match_argument(self, action, arg_strings_pattern):
+ # match the pattern for this action to the arg strings
+ nargs_pattern = self._get_nargs_pattern(action)
+ match = _re.match(nargs_pattern, arg_strings_pattern)
+
+ # raise an exception if we weren't able to find a match
+ if match is None:
+ nargs_errors = {
+ None: _('expected one argument'),
+ OPTIONAL: _('expected at most one argument'),
+ ONE_OR_MORE: _('expected at least one argument'),
+ }
+ default = ngettext('expected %s argument',
+ 'expected %s arguments',
+ action.nargs) % action.nargs
+ msg = nargs_errors.get(action.nargs, default)
+ raise ArgumentError(action, msg)
+
+ # return the number of arguments matched
+ return len(match.group(1))
+
+ def _match_arguments_partial(self, actions, arg_strings_pattern):
+ # progressively shorten the actions list by slicing off the
+ # final actions until we find a match
+ result = []
+ for i in range(len(actions), 0, -1):
+ actions_slice = actions[:i]
+ pattern = ''.join([self._get_nargs_pattern(action)
+ for action in actions_slice])
+ match = _re.match(pattern, arg_strings_pattern)
+ if match is not None:
+ result.extend([len(string) for string in match.groups()])
+ break
+
+ # return the list of arg string counts
+ return result
+
+ def _parse_optional(self, arg_string):
+ # if it's an empty string, it was meant to be a positional
+ if not arg_string:
+ return None
+
+ # if it doesn't start with a prefix, it was meant to be positional
+ if not arg_string[0] in self.prefix_chars:
+ return None
+
+ # if the option string is present in the parser, return the action
+ if arg_string in self._option_string_actions:
+ action = self._option_string_actions[arg_string]
+ return action, arg_string, None
+
+ # if it's just a single character, it was meant to be positional
+ if len(arg_string) == 1:
+ return None
+
+ # if the option string before the "=" is present, return the action
+ if '=' in arg_string:
+ option_string, explicit_arg = arg_string.split('=', 1)
+ if option_string in self._option_string_actions:
+ action = self._option_string_actions[option_string]
+ return action, option_string, explicit_arg
+
+ # search through all possible prefixes of the option string
+ # and all actions in the parser for possible interpretations
+ option_tuples = self._get_option_tuples(arg_string)
+
+ # if multiple actions match, the option string was ambiguous
+ if len(option_tuples) > 1:
+ options = ', '.join([option_string
+ for action, option_string, explicit_arg in option_tuples])
+ args = {'option': arg_string, 'matches': options}
+ msg = _('ambiguous option: %(option)s could match %(matches)s')
+ self.error(msg % args)
+
+ # if exactly one action matched, this segmentation is good,
+ # so return the parsed action
+ elif len(option_tuples) == 1:
+ option_tuple, = option_tuples
+ return option_tuple
+
+ # if it was not found as an option, but it looks like a negative
+ # number, it was meant to be positional
+ # unless there are negative-number-like options
+ if self._negative_number_matcher.match(arg_string):
+ if not self._has_negative_number_optionals:
+ return None
+
+ # if it contains a space, it was meant to be a positional
+ if ' ' in arg_string:
+ return None
+
+ # it was meant to be an optional but there is no such option
+ # in this parser (though it might be a valid option in a subparser)
+ return None, arg_string, None
+
+ def _get_option_tuples(self, option_string):
+ result = []
+
+ # option strings starting with two prefix characters are only
+ # split at the '='
+ chars = self.prefix_chars
+ if option_string[0] in chars and option_string[1] in chars:
+ if '=' in option_string:
+ option_prefix, explicit_arg = option_string.split('=', 1)
+ else:
+ option_prefix = option_string
+ explicit_arg = None
+ for option_string in self._option_string_actions:
+ if option_string.startswith(option_prefix):
+ action = self._option_string_actions[option_string]
+ tup = action, option_string, explicit_arg
+ result.append(tup)
+
+ # single character options can be concatenated with their arguments
+ # but multiple character options always have to have their argument
+ # separate
+ elif option_string[0] in chars and option_string[1] not in chars:
+ option_prefix = option_string
+ explicit_arg = None
+ short_option_prefix = option_string[:2]
+ short_explicit_arg = option_string[2:]
+
+ for option_string in self._option_string_actions:
+ if option_string == short_option_prefix:
+ action = self._option_string_actions[option_string]
+ tup = action, option_string, short_explicit_arg
+ result.append(tup)
+ elif option_string.startswith(option_prefix):
+ action = self._option_string_actions[option_string]
+ tup = action, option_string, explicit_arg
+ result.append(tup)
+
+ # shouldn't ever get here
+ else:
+ self.error(_('unexpected option string: %s') % option_string)
+
+ # return the collected option tuples
+ return result
+
+ def _get_nargs_pattern(self, action):
+ # in all examples below, we have to allow for '--' args
+ # which are represented as '-' in the pattern
+ nargs = action.nargs
+
+ # the default (None) is assumed to be a single argument
+ if nargs is None:
+ nargs_pattern = '(-*A-*)'
+
+ # allow zero or one arguments
+ elif nargs == OPTIONAL:
+ nargs_pattern = '(-*A?-*)'
+
+ # allow zero or more arguments
+ elif nargs == ZERO_OR_MORE:
+ nargs_pattern = '(-*[A-]*)'
+
+ # allow one or more arguments
+ elif nargs == ONE_OR_MORE:
+ nargs_pattern = '(-*A[A-]*)'
+
+ # allow any number of options or arguments
+ elif nargs == REMAINDER:
+ nargs_pattern = '([-AO]*)'
+
+ # allow one argument followed by any number of options or arguments
+ elif nargs == PARSER:
+ nargs_pattern = '(-*A[-AO]*)'
+
+ # all others should be integers
+ else:
+ nargs_pattern = '(-*%s-*)' % '-*'.join('A' * nargs)
+
+ # if this is an optional action, -- is not allowed
+ if action.option_strings:
+ nargs_pattern = nargs_pattern.replace('-*', '')
+ nargs_pattern = nargs_pattern.replace('-', '')
+
+ # return the pattern
+ return nargs_pattern
+
+ # ========================
+ # Value conversion methods
+ # ========================
+ def _get_values(self, action, arg_strings):
+ # for everything but PARSER args, strip out '--'
+ if action.nargs not in [PARSER, REMAINDER]:
+ arg_strings = [s for s in arg_strings if s != '--']
+
+ # optional argument produces a default when not present
+ if not arg_strings and action.nargs == OPTIONAL:
+ if action.option_strings:
+ value = action.const
+ else:
+ value = action.default
+ if isinstance(value, str):
+ value = self._get_value(action, value)
+ self._check_value(action, value)
+
+ # when nargs='*' on a positional, if there were no command-line
+ # args, use the default if it is anything other than None
+ elif (not arg_strings and action.nargs == ZERO_OR_MORE and
+ not action.option_strings):
+ if action.default is not None:
+ value = action.default
+ else:
+ value = arg_strings
+ self._check_value(action, value)
+
+ # single argument or optional argument produces a single value
+ elif len(arg_strings) == 1 and action.nargs in [None, OPTIONAL]:
+ arg_string, = arg_strings
+ value = self._get_value(action, arg_string)
+ self._check_value(action, value)
+
+ # REMAINDER arguments convert all values, checking none
+ elif action.nargs == REMAINDER:
+ value = [self._get_value(action, v) for v in arg_strings]
+
+ # PARSER arguments convert all values, but check only the first
+ elif action.nargs == PARSER:
+ value = [self._get_value(action, v) for v in arg_strings]
+ self._check_value(action, value[0])
+
+ # all other types of nargs produce a list
+ else:
+ value = [self._get_value(action, v) for v in arg_strings]
+ for v in value:
+ self._check_value(action, v)
+
+ # return the converted value
+ return value
+
+ def _get_value(self, action, arg_string):
+ type_func = self._registry_get('type', action.type, action.type)
+ if not callable(type_func):
+ msg = _('%r is not callable')
+ raise ArgumentError(action, msg % type_func)
+
+ # convert the value to the appropriate type
+ try:
+ result = type_func(arg_string)
+
+ # ArgumentTypeErrors indicate errors
+ except ArgumentTypeError:
+ name = getattr(action.type, '__name__', repr(action.type))
+ msg = str(_sys.exc_info()[1])
+ raise ArgumentError(action, msg)
+
+ # TypeErrors or ValueErrors also indicate errors
+ except (TypeError, ValueError):
+ name = getattr(action.type, '__name__', repr(action.type))
+ args = {'type': name, 'value': arg_string}
+ msg = _('invalid %(type)s value: %(value)r')
+ raise ArgumentError(action, msg % args)
+
+ # return the converted value
+ return result
+
+ def _check_value(self, action, value):
+ # converted value must be one of the choices (if specified)
+ if action.choices is not None and value not in action.choices:
+ args = {'value': value,
+ 'choices': ', '.join(map(repr, action.choices))}
+ msg = _('invalid choice: %(value)r (choose from %(choices)s)')
+ raise ArgumentError(action, msg % args)
+
+ # =======================
+ # Help-formatting methods
+ # =======================
+ def format_usage(self):
+ formatter = self._get_formatter()
+ formatter.add_usage(self.usage, self._actions,
+ self._mutually_exclusive_groups)
+ return formatter.format_help()
+
+ def format_help(self):
+ formatter = self._get_formatter()
+
+ # usage
+ formatter.add_usage(self.usage, self._actions,
+ self._mutually_exclusive_groups)
+
+ # description
+ formatter.add_text(self.description)
+
+ # positionals, optionals and user-defined groups
+ for action_group in self._action_groups:
+ formatter.start_section(action_group.title)
+ formatter.add_text(action_group.description)
+ formatter.add_arguments(action_group._group_actions)
+ formatter.end_section()
+
+ # epilog
+ formatter.add_text(self.epilog)
+
+ # determine help from format above
+ return formatter.format_help()
+
+ def format_version(self):
+ import warnings
+ warnings.warn(
+ 'The format_version method is deprecated -- the "version" '
+ 'argument to ArgumentParser is no longer supported.',
+ DeprecationWarning)
+ formatter = self._get_formatter()
+ formatter.add_text(self.version)
+ return formatter.format_help()
+
+ def _get_formatter(self):
+ return self.formatter_class(prog=self.prog)
+
+ # =====================
+ # Help-printing methods
+ # =====================
+ def print_usage(self, file=None):
+ if file is None:
+ file = _sys.stdout
+ self._print_message(self.format_usage(), file)
+
+ def print_help(self, file=None):
+ if file is None:
+ file = _sys.stdout
+ self._print_message(self.format_help(), file)
+
+ def print_version(self, file=None):
+ import warnings
+ warnings.warn(
+ 'The print_version method is deprecated -- the "version" '
+ 'argument to ArgumentParser is no longer supported.',
+ DeprecationWarning)
+ self._print_message(self.format_version(), file)
+
+ def _print_message(self, message, file=None):
+ if message:
+ if file is None:
+ file = _sys.stderr
+ file.write(message)
+
+ # ===============
+ # Exiting methods
+ # ===============
+ def exit(self, status=0, message=None):
+ if message:
+ self._print_message(message, _sys.stderr)
+ _sys.exit(status)
+
+ def error(self, message):
+ """error(message: string)
+
+ Prints a usage message incorporating the message to stderr and
+ exits.
+
+ If you override this in a subclass, it should not return -- it
+ should either exit or raise an exception.
+ """
+ self.print_usage(_sys.stderr)
+ args = {'prog': self.prog, 'message': message}
+ self.exit(2, _('%(prog)s: error: %(message)s\n') % args)
diff --git a/contrib/tmcombine/test/extract b/contrib/tmcombine/test/extract
new file mode 100644
index 000000000..e137eba72
--- /dev/null
+++ b/contrib/tmcombine/test/extract
@@ -0,0 +1,22 @@
+jngspitz-nordostwand direkt ||| ngspitz : face nordest directe ||| 0-0 1-0 0-1 1-2 1-3 1-4
+michel ||| michel ||| 0-0
+michel piola ||| michel piola ||| 0-0 1-1
+michel piola , ||| michel piola , ||| 0-0 1-1 2-2
+michel piola , vernier ||| michel piola , vernier ||| 0-0 1-1 2-2 3-3
+piola ||| piola ||| 0-0
+piola , ||| piola , ||| 0-0 1-1
+piola , vernier ||| piola , vernier ||| 0-0 1-1 2-2
+, ||| , ||| 0-0
+, vernier ||| , vernier ||| 0-0 1-1
+vernier ||| vernier ||| 0-0
+die ||| la ||| 0-0
+nordostwand ||| face nordest ||| 0-0 0-1
+hohe nordostwand ||| face nordest ||| 1-0 1-1
+nordostwand des ||| face nordest de ||| 0-0 0-1 1-2
+hohe nordostwand des ||| face nordest de ||| 1-0 1-1 2-2
+nordostwand des kingspitz ||| face nordest de la kingspitz ||| 0-0 0-1 1-2 2-3 2-4
+hohe nordostwand des kingspitz ||| face nordest de la kingspitz ||| 1-0 1-1 2-2 3-3 3-4
+nordostwand des kingspitz ||| face nordest de la kingspitz , ||| 0-0 0-1 1-2 2-3 2-4
+hohe nordostwand des kingspitz ||| face nordest de la kingspitz , ||| 1-0 1-1 2-2 3-3 3-4
+pass ||| col ||| 0-0
+sitzung ||| séance ||| 0-0 \ No newline at end of file
diff --git a/contrib/tmcombine/test/model1/model/lex.counts.e2f b/contrib/tmcombine/test/model1/model/lex.counts.e2f
new file mode 100644
index 000000000..ed05c0b7d
--- /dev/null
+++ b/contrib/tmcombine/test/model1/model/lex.counts.e2f
@@ -0,0 +1,8 @@
+ad af 500 1000
+bd bf 5 10
+der le 20285 102586
+der NULL 12926 704917
+gipfel sommet 3485 7322
+pass col 419 2911
+pass passeport 7 28
+sitzung séance 14 59 \ No newline at end of file
diff --git a/contrib/tmcombine/test/model1/model/lex.counts.f2e b/contrib/tmcombine/test/model1/model/lex.counts.f2e
new file mode 100644
index 000000000..ea31f690d
--- /dev/null
+++ b/contrib/tmcombine/test/model1/model/lex.counts.f2e
@@ -0,0 +1,8 @@
+af ad 500 1000
+bf bd 5 10
+col pass 419 615
+le der 20285 113635
+passeport pass 7 615
+retrouvé NULL 34 1016136
+séance sitzung 14 33
+sommet gipfel 3485 5700 \ No newline at end of file
diff --git a/contrib/tmcombine/test/model1/model/lex.e2f b/contrib/tmcombine/test/model1/model/lex.e2f
new file mode 100644
index 000000000..f9263ffe5
--- /dev/null
+++ b/contrib/tmcombine/test/model1/model/lex.e2f
@@ -0,0 +1,8 @@
+ad af 0.5
+bd bf 0.5
+der le 0.1977365
+der NULL 0.0183369
+gipfel sommet 0.4759629
+pass col 0.1439368
+pass passeport 0.2500000
+sitzung séance 0.2372881 \ No newline at end of file
diff --git a/contrib/tmcombine/test/model1/model/lex.f2e b/contrib/tmcombine/test/model1/model/lex.f2e
new file mode 100644
index 000000000..2bba51f01
--- /dev/null
+++ b/contrib/tmcombine/test/model1/model/lex.f2e
@@ -0,0 +1,8 @@
+af ad 0.5
+bf bd 0.5
+col pass 0.6813008
+le der 0.1785101
+passeport pass 0.0113821
+retrouvé NULL 0.0000335
+séance sitzung 0.4242424
+sommet gipfel 0.6114035 \ No newline at end of file
diff --git a/contrib/tmcombine/test/model1/model/phrase-table b/contrib/tmcombine/test/model1/model/phrase-table
new file mode 100644
index 000000000..b46a04490
--- /dev/null
+++ b/contrib/tmcombine/test/model1/model/phrase-table
@@ -0,0 +1,8 @@
+ad ||| af ||| 0.5 0.5 0.5 0.5 2.718 ||| 0-0 ||| 1000 1000
+bd ||| bf ||| 0.5 0.5 0.5 0.5 2.718 ||| 0-0 ||| 10 10
+der gipfel ||| sommet ||| 0.00327135 0.00872768 0.0366795 0.611403 2.718 ||| 1-0 ||| 5808 518
+der pass ||| le col ||| 0.0173565 0.0284616 0.288889 0.121619 2.718 ||| 0-0 1-1 ||| 749 45
+pass ||| col ||| 0.1952 0.143937 0.628866 0.681301 2.718 ||| 0-0 ||| 1875 582
+pass ||| passeport retrouvé ||| 0.5 0.25 0.00171821 3.813e-07 2.718 ||| 0-0 ||| 2 582
+pass ||| passeport ||| 0.266667 0.25 0.00687285 0.0113821 2.718 ||| 0-0 ||| 15 582
+sitzung ||| séance ||| 0.272727 0.237288 0.352941 0.424242 2.718 ||| 0-0 ||| 22 17 \ No newline at end of file
diff --git a/contrib/tmcombine/test/model2/model/lex.counts.e2f b/contrib/tmcombine/test/model2/model/lex.counts.e2f
new file mode 100644
index 000000000..8475fcdf9
--- /dev/null
+++ b/contrib/tmcombine/test/model2/model/lex.counts.e2f
@@ -0,0 +1,8 @@
+ad af 100 1000
+bd bf 1 10
+der le 150181 944391
+der NULL 54483 3595140
+gipfel sommet 3421 9342
+pass col 2 70
+pass passeport 73 379
+sitzung séance 3441 5753 \ No newline at end of file
diff --git a/contrib/tmcombine/test/model2/model/lex.counts.f2e b/contrib/tmcombine/test/model2/model/lex.counts.f2e
new file mode 100644
index 000000000..b0913088a
--- /dev/null
+++ b/contrib/tmcombine/test/model2/model/lex.counts.f2e
@@ -0,0 +1,8 @@
+af ad 100 1000
+bf bd 1 10
+col pass 2 108
+le der 150181 1356104
+passeport pass 73 108
+retrouvé NULL 43 6276240
+séance sitzung 3441 6142
+sommet gipfel 3421 4908 \ No newline at end of file
diff --git a/contrib/tmcombine/test/model2/model/lex.e2f b/contrib/tmcombine/test/model2/model/lex.e2f
new file mode 100644
index 000000000..b1ce3a613
--- /dev/null
+++ b/contrib/tmcombine/test/model2/model/lex.e2f
@@ -0,0 +1,8 @@
+ad af 0.1
+bd bf 0.1
+der le 0.1590242
+der NULL 0.0151546
+gipfel sommet 0.366195
+pass col 0.0285714
+pass passeport 0.1926121
+sitzung séance 0.5981227 \ No newline at end of file
diff --git a/contrib/tmcombine/test/model2/model/lex.f2e b/contrib/tmcombine/test/model2/model/lex.f2e
new file mode 100644
index 000000000..d931dcb72
--- /dev/null
+++ b/contrib/tmcombine/test/model2/model/lex.f2e
@@ -0,0 +1,8 @@
+af ad 0.1
+bf bd 0.1
+col pass 0.0185185
+le der 0.1107445
+passeport pass 0.6759259
+retrouvé NULL 0.0000069
+séance sitzung 0.5602410
+sommet gipfel 0.6970253 \ No newline at end of file
diff --git a/contrib/tmcombine/test/model2/model/phrase-table b/contrib/tmcombine/test/model2/model/phrase-table
new file mode 100644
index 000000000..97b51a0d5
--- /dev/null
+++ b/contrib/tmcombine/test/model2/model/phrase-table
@@ -0,0 +1,5 @@
+ad ||| af ||| 0.1 0.1 0.1 0.1 2.718 ||| 0-0 ||| 1000 1000
+bd ||| bf ||| 0.1 0.1 0.1 0.1 2.718 ||| 0-0 ||| 10 10
+der pass ||| le passeport ||| 0.16 0.03063 0.4 0.0748551 2.718 ||| 0-0 1-1 ||| 25 10
+pass ||| passeport ||| 0.28022 0.192612 0.607143 0.675926 2.718 ||| 0-0 ||| 182 84
+sitzung ||| séance ||| 0.784521 0.598123 0.516654 0.560241 2.718 ||| 0-0 ||| 4251 6455 \ No newline at end of file
diff --git a/contrib/tmcombine/test/phrase-table_test1 b/contrib/tmcombine/test/phrase-table_test1
new file mode 100644
index 000000000..1d1d5a238
--- /dev/null
+++ b/contrib/tmcombine/test/phrase-table_test1
@@ -0,0 +1,8 @@
+ad ||| af ||| 0.3 0.3 0.3 0.3 2.718 ||| 0-0 ||| 1000 1000
+bd ||| bf ||| 0.3 0.3 0.3 0.3 2.718 ||| 0-0 ||| 10 10
+der gipfel ||| sommet ||| 0.00163568 0.00436384 0.0183397 0.305702 2.718 ||| 1-0 ||| 5808 518
+der pass ||| le col ||| 0.00867825 0.0142308 0.144445 0.0608095 2.718 ||| 0-0 1-1 ||| 749 45
+pass ||| col ||| 0.0976 0.0719685 0.314433 0.340651 2.718 ||| 0-0 ||| 1875 582
+pass ||| passeport retrouvé ||| 0.25 0.125 0.000859105 1.9065e-07 2.718 ||| 0-0 ||| 2 582
+pass ||| passeport ||| 0.273444 0.221306 0.307008 0.343654 2.718 ||| 0-0 ||| 182 84
+sitzung ||| séance ||| 0.528624 0.417705 0.434797 0.492241 2.718 ||| 0-0 ||| 4251 6455
diff --git a/contrib/tmcombine/test/phrase-table_test2 b/contrib/tmcombine/test/phrase-table_test2
new file mode 100644
index 000000000..9d3f28816
--- /dev/null
+++ b/contrib/tmcombine/test/phrase-table_test2
@@ -0,0 +1,9 @@
+ad ||| af ||| 0.14 0.136364 0.18 0.3 2.718 ||| 0-0 ||| 1000 1000
+bd ||| bf ||| 0.14 0.136364 0.18 0.3 2.718 ||| 0-0 ||| 10 10
+der gipfel ||| sommet ||| 0.000327135 0.000793425 0.0073359 0.305702 2.718 ||| 1-0 ||| 5808 518
+der pass ||| le col ||| 0.00173565 0.00258742 0.0577778 0.0608095 2.718 ||| 0-0 1-1 ||| 749 45
+der pass ||| le passeport ||| 0.144 0.0278455 0.32 0.0374275 2.718 ||| 0-0 1-1 ||| 25 10
+pass ||| col ||| 0.01952 0.0130852 0.125773 0.340651 2.718 ||| 0-0 ||| 1875 582
+pass ||| passeport retrouvé ||| 0.05 0.0227273 0.000343642 1.9065e-07 2.718 ||| 0-0 ||| 2 582
+pass ||| passeport ||| 0.278865 0.197829 0.487089 0.343654 2.718 ||| 0-0 ||| 182 84
+sitzung ||| séance ||| 0.733342 0.56532 0.483911 0.492241 2.718 ||| 0-0 ||| 4251 6455
diff --git a/contrib/tmcombine/test/phrase-table_test3 b/contrib/tmcombine/test/phrase-table_test3
new file mode 100644
index 000000000..8dfed73b8
--- /dev/null
+++ b/contrib/tmcombine/test/phrase-table_test3
@@ -0,0 +1,9 @@
+ad ||| af ||| 0.14 0.136364 0.18 0.3 2.718 ||| 0-0 ||| 10000.0 5000.0
+bd ||| bf ||| 0.14 0.136364 0.18 0.3 2.718 ||| 0-0 ||| 100.0 50.0
+der gipfel ||| sommet ||| 0.00327135 0.00569336 0.0366795 0.651018 2.718 ||| 1-0 ||| 5808.0 518.0
+der pass ||| le col ||| 0.0173565 0.0193836 0.152941 0.0675369 2.718 ||| 0-0 1-1 ||| 749.0 85.0
+der pass ||| le passeport ||| 0.16 0.0307772 0.188235 0.0128336 2.718 ||| 0-0 1-1 ||| 225.0 85.0
+pass ||| col ||| 0.1952 0.121573 0.398693 0.582296 2.718 ||| 0-0 ||| 1875.0 918.0
+pass ||| passeport retrouvé ||| 0.5 0.193033 0.00108932 1.16835e-06 2.718 ||| 0-0 ||| 2.0 918.0
+pass ||| passeport ||| 0.280097 0.193033 0.22658 0.11065 2.718 ||| 0-0 ||| 1653.0 918.0
+sitzung ||| séance ||| 0.784227 0.597753 0.516546 0.559514 2.718 ||| 0-0 ||| 38281.0 25837.0
diff --git a/contrib/tmcombine/test/phrase-table_test4 b/contrib/tmcombine/test/phrase-table_test4
new file mode 100644
index 000000000..7485c728f
--- /dev/null
+++ b/contrib/tmcombine/test/phrase-table_test4
@@ -0,0 +1,8 @@
+ad ||| af ||| 0.5 0.5 0.5 0.5 2.718 ||| 0-0 ||| 1000.0 1000.0
+bd ||| bf ||| 0.5 0.5 0.5 0.5 2.718 ||| 0-0 ||| 10.0 10.0
+der gipfel ||| sommet ||| 0.00327135 0.00872769 0.0366795 0.611404 2.718 ||| 1-0 ||| 5808.0 518.0
+der pass ||| le col ||| 0.0173565 0.0284616 0.288889 0.121619 2.718 ||| 0-0 1-1 ||| 749.0 45.0
+pass ||| col ||| 0.1952 0.143937 0.628866 0.681301 2.718 ||| 0-0 ||| 1875.0 582.0
+pass ||| passeport retrouvé ||| 0.5 0.25 0.00171821 3.80847e-07 2.718 ||| 0-0 ||| 2.0 582.0
+pass ||| passeport ||| 0.266667 0.25 0.00687285 0.0113821 2.718 ||| 0-0 ||| 15.0 582.0
+sitzung ||| séance ||| 0.272727 0.237288 0.352941 0.424242 2.718 ||| 0-0 ||| 22.0 17.0
diff --git a/contrib/tmcombine/test/phrase-table_test5 b/contrib/tmcombine/test/phrase-table_test5
new file mode 100644
index 000000000..45f15163d
--- /dev/null
+++ b/contrib/tmcombine/test/phrase-table_test5
@@ -0,0 +1,9 @@
+ad ||| af ||| 0.11579 0.35574 0.472359 0.469238 2.718 ||| 0-0 ||| 25332.4712297 1074.23173673
+bd ||| bf ||| 0.11579 0.35574 0.472359 0.469238 2.718 ||| 0-0 ||| 253.324712297 10.7423173673
+der gipfel ||| sommet ||| 0.00327135 0.00686984 0.0366795 0.617135 2.718 ||| 1-0 ||| 5808.0 518.0
+der pass ||| le col ||| 0.0173565 0.023534 0.284201 0.0972183 2.718 ||| 0-0 1-1 ||| 749.0 45.7423173673
+der pass ||| le passeport ||| 0.16 0.0329324 0.0064913 0.00303408 2.718 ||| 0-0 1-1 ||| 608.311780741 45.7423173673
+pass ||| col ||| 0.1952 0.142393 0.6222 0.671744 2.718 ||| 0-0 ||| 1875.0 588.235465885
+pass ||| passeport retrouvé ||| 0.5 0.199258 0.0017 5.11945e-07 2.718 ||| 0-0 ||| 2.0 588.235465885
+pass ||| passeport ||| 0.280174 0.199258 0.0132359 0.0209644 2.718 ||| 0-0 ||| 4443.5097638 588.235465885
+sitzung ||| séance ||| 0.784412 0.59168 0.511045 0.552002 2.718 ||| 0-0 ||| 103459.335197 496.165860589
diff --git a/contrib/tmcombine/test/phrase-table_test6 b/contrib/tmcombine/test/phrase-table_test6
new file mode 100644
index 000000000..38daf4512
--- /dev/null
+++ b/contrib/tmcombine/test/phrase-table_test6
@@ -0,0 +1,4 @@
+ad ||| af ||| 0.117462 0.117462 0.117462 0.117462 2.718 ||| 0-0 ||| 1000 1000
+bd ||| bf ||| 0.117462 0.117462 0.117462 0.117462 2.718 ||| 0-0 ||| 10 10
+pass ||| passeport ||| 0.278834 0.197701 0.387861 0.449295 2.718 ||| 0-0 ||| 182 84
+sitzung ||| séance ||| 0.705857 0.545304 0.497336 0.544877 2.718 ||| 0-0 ||| 4251 6455
diff --git a/contrib/tmcombine/test/phrase-table_test7 b/contrib/tmcombine/test/phrase-table_test7
new file mode 100644
index 000000000..01ea2a076
--- /dev/null
+++ b/contrib/tmcombine/test/phrase-table_test7
@@ -0,0 +1 @@
+([(1.8744705606119034, 2.0752881273042374, 1.5025010618768841, 1.2370391973008494, 0, 0, 1, 1, 22), (0.35011602922315899, 0.74148657814725749, 0.95272965495298623, 0.83588062023889353, 1, 0, 0, 1, 22)], (1, 22, 20)) \ No newline at end of file
diff --git a/contrib/tmcombine/test/phrase-table_test8 b/contrib/tmcombine/test/phrase-table_test8
new file mode 100644
index 000000000..74eb27c0c
--- /dev/null
+++ b/contrib/tmcombine/test/phrase-table_test8
@@ -0,0 +1,9 @@
+ad ||| af ||| 0.5 0.398085 0.5 0.482814 2.718 ||| 0-0 ||| 1000.000001 1000.000001
+bd ||| bf ||| 0.5 0.111367 0.5 0.172867 2.718 ||| 0-0 ||| 10.00000001 10.00000001
+der gipfel ||| sommet ||| 0.00327135 0.00863717 0.0366795 0.612073 2.718 ||| 1-0 ||| 5808.0 518.0
+der pass ||| le col ||| 0.0173565 0.0260469 0.288889 0.113553 2.718 ||| 0-0 1-1 ||| 749.0 45.00000001
+der pass ||| le passeport ||| 0.0064 0.0389201 8.88889e-12 0.0101009 2.718 ||| 0-0 1-1 ||| 2.5e-08 45.00000001
+pass ||| col ||| 0.1952 0.131811 0.628866 0.63621 2.718 ||| 0-0 ||| 1875.0 582.000000084
+pass ||| passeport retrouvé ||| 0.5 0.196956 0.00171821 1.89355e-06 2.718 ||| 0-0 ||| 2.0 582.000000084
+pass ||| passeport ||| 0.266667 0.196956 0.00687285 0.0565932 2.718 ||| 0-0 ||| 15.000000182 582.000000084
+sitzung ||| séance ||| 0.272727 0.545019 0.352941 0.502625 2.718 ||| 0-0 ||| 22.000004251 17.000006455
diff --git a/contrib/tmcombine/tmcombine.py b/contrib/tmcombine/tmcombine.py
new file mode 100755
index 000000000..333023929
--- /dev/null
+++ b/contrib/tmcombine/tmcombine.py
@@ -0,0 +1,1848 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+# Author: Rico Sennrich <sennrich [AT] cl.uzh.ch>
+
+# This program handles the combination of Moses phrase tables, either through
+# linear interpolation of the phrase translation probabilities/lexical weights,
+# or through a recomputation based on the (weighted) combined counts.
+#
+# It also supports an automatic search for weights that minimize the cross-entropy
+# between the model and a tuning set of word/phrase alignments.
+
+# for usage information, run
+# python tmcombine.py -h
+# you can also check the docstrings of Combine_TMs() for more information and find some example commands in the function test()
+
+
+# Some general things to note:
+# - Different combination algorithms require different statistics. To be on the safe side, apply train_model.patch to train_model.perl and use the option -phrase-word-alignment for training all models.
+# - The script assumes that phrase tables are sorted (to allow incremental, more memory-friendly processing). sort with LC_ALL=C.
+# - Some configurations require additional statistics that are loaded in memory (lexical tables; complete list of target phrases). If memory consumption is a problem, use the option --lowmem (slightly slower and writes temporary files to disk), or consider pruning your phrase table before combining (e.g. using Johnson et al. 2007).
+# - The script assumes that all files are encoded in UTF-8. If this is not the case, fix it or change the handle_file() function.
+# - The script can read/write gzipped files, but the Python implementation is slow. You're better off unzipping the files on the command line and working with the unzipped files.
+# - The cross-entropy estimation assumes that phrase tables contain true probability distributions (i.e. a probability mass of 1 for each conditional probability distribution). If this is not true, the results are skewed.
+# - Unknown phrase pairs are not considered for the cross-entropy estimation. A comparison of models with different vocabularies may be misleading.
+# - Don't directly compare cross-entropies obtained from a combination with different modes. Depending on how some corner cases are treated, linear interpolation does not distribute full probability mass and thus shows higher (i.e. worse) cross-entropies.
+
+
+from __future__ import division, unicode_literals
+import sys
+import os
+import gzip
+import argparse
+import copy
+import re
+from math import log, exp
+from collections import defaultdict
+from operator import mul
+from tempfile import NamedTemporaryFile
+from subprocess import Popen
+from itertools import izip
+
+try:
+ from lxml import etree as ET
+except:
+ import xml.etree.cElementTree as ET
+
+try:
+ from scipy.optimize.lbfgsb import fmin_l_bfgs_b
+ optimizer = 'l-bfgs'
+except:
+ optimizer = 'hillclimb'
+
+class Moses():
+ """Moses interface for loading/writing models
+ to support other phrase table formats, subclass this and overwrite the relevant functions
+ """
+
+ def __init__(self,models,number_of_features):
+
+ self.number_of_features = number_of_features
+ self.models = models
+
+ #example item (assuming mode=='counts' and one feature): phrase_pairs['the house']['das haus'] = [[[10,100]],['0-0 1-1']]
+ self.phrase_pairs = defaultdict(lambda: defaultdict(lambda: [[[0]*len(self.models) for i in range(self.number_of_features)],[]]))
+ self.phrase_source = defaultdict(lambda: [0]*len(self.models))
+ self.phrase_target = defaultdict(lambda: [0]*len(self.models))
+
+ self.reordering_pairs = defaultdict(lambda: defaultdict(lambda: [[0]*len(self.models) for i in range(self.number_of_features)]))
+
+ self.word_pairs_e2f = defaultdict(lambda: defaultdict(lambda: [0]*len(self.models)))
+ self.word_pairs_f2e = defaultdict(lambda: defaultdict(lambda: [0]*len(self.models)))
+ self.word_source = defaultdict(lambda: [0]*len(self.models))
+ self.word_target = defaultdict(lambda: [0]*len(self.models))
+
+ self.require_alignment = False
+
+
+ def open_table(self,model,table,mode='r'):
+ """define which paths to open for lexical tables and phrase tables.
+ we assume canonical Moses structure, but feel free to overwrite this
+ """
+
+ if table == 'reordering-table':
+ table = 'reordering-table.wbe-msd-bidirectional-fe'
+
+ filename = os.path.join(model,'model',table)
+ fileobj = handle_file(filename,'open',mode)
+ return fileobj
+
+
+ def load_phrase_counts(self,line,priority,i,store='all',filter_by=None,filter_by_src=None,filter_by_target=None,inverted=False):
+ """take single phrase table line and store counts in internal data structure"""
+
+ src = line[0]
+ target = line[1]
+
+ if inverted:
+ src,target = target,src
+
+ target_count = 0
+ src_count = 0
+
+ if (store == 'all' or store == 'pairs') and not (filter_by and not (src in filter_by and target in filter_by[src])):
+
+ if priority < 10 or (src in self.phrase_pairs and target in self.phrase_pairs[src]):
+
+ try:
+ target_count,src_count = map(float,line[-1].split())
+ except:
+ sys.stderr.write(str(line)+'\n')
+ sys.stderr.write('Counts are missing. Maybe you have an older Moses version without counts?\n')
+ return
+
+ scores = line[2].split()
+ pst = float(scores[0])
+ pts = float(scores[2])
+
+ if priority == 2: #MAP
+ self.phrase_pairs[src][target][0][0][i] = pst
+ self.phrase_pairs[src][target][0][1][i] = pts
+ else:
+ self.phrase_pairs[src][target][0][0][i] = pst*target_count
+ self.phrase_pairs[src][target][0][1][i] = pts*src_count
+
+ self.store_info(src,target,line)
+
+ if (store == 'all' or store == 'source') and not (filter_by_src and not filter_by_src.get(src)):
+
+ if not src_count:
+ try:
+ if priority == 2: #MAP
+ src_count = 1
+ else:
+ src_count = float(line[-1].split()[1])
+ except:
+ sys.stderr.write(str(line)+'\n')
+ sys.stderr.write('Counts are missing. Maybe you have an older Moses version without counts?\n')
+ return
+
+ self.phrase_source[src][i] = src_count
+
+ if (store == 'all' or store == 'target') and not (filter_by_target and not filter_by_target.get(target)):
+
+ if not target_count:
+ try:
+ if priority == 2: #MAP
+ target_count = 1
+ else:
+ target_count = float(line[-1].split()[0])
+ except:
+ sys.stderr.write(str(line)+'\n')
+ sys.stderr.write('Counts are missing. Maybe you have an older Moses version without counts?\n')
+ return
+
+ self.phrase_target[target][i] = target_count
+
+
+ def load_phrase_probabilities(self,line,priority,i,store='pairs',filter_by=None,filter_by_src=None,filter_by_target=None,inverted=False):
+        """take single phrase table line and store probabilities in internal data structure"""
+
+ src = line[0]
+ target = line[1]
+
+ if inverted:
+ src,target = target,src
+
+ if (store == 'all' or store == 'pairs') and (priority < 10 or (src in self.phrase_pairs and target in self.phrase_pairs[src])) and not (filter_by and not (src in filter_by and target in filter_by[src])):
+
+ self.store_info(src,target,line)
+
+ model_probabilities = map(float,line[2].split()[:-1])
+ phrase_probabilities = self.phrase_pairs[src][target][0]
+
+ for j,p in enumerate(model_probabilities):
+ phrase_probabilities[j][i] = p
+
+ # mark that the src/target phrase has been seen.
+ # needed for re-normalization during linear interpolation
+ if (store == 'all' or store == 'source') and not (filter_by_src and not src in filter_by_src):
+ self.phrase_source[src][i] = 1
+ if (store == 'all' or store == 'target') and not (filter_by_target and not target in filter_by_target):
+ self.phrase_target[target][i] = 1
+
+
+ def load_reordering_probabilities(self,line,priority,i,store='pairs'):
+        """take single reordering table line and store probabilities in internal data structure"""
+
+ src = line[0]
+ target = line[1]
+
+ model_probabilities = map(float,line[2].split())
+ reordering_probabilities = self.reordering_pairs[src][target]
+
+ for j,p in enumerate(model_probabilities):
+ reordering_probabilities[j][i] = p
+
+
+
+ def traverse_incrementally(self,table,models,load_lines,store_flag,inverted=False):
+        """hack-ish way to find common phrase pairs in multiple models in one traversal without storing everything in memory.
+ relies on alphabetical sorting of phrase table.
+ """
+
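+        # rough idea of the traversal: 'increment' holds the source phrase currently being
+        # collected across all models; 'stack' buffers, per model, the first line that belongs
+        # to a later source phrase. After each yield, the smallest buffered source phrase
+        # becomes the new 'increment'. The very first iteration yields empty data structures,
+        # since nothing has been read yet.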
+ increment = -1
+ stack = ['']*len(self.models)
+
+ while increment:
+
+ self.phrase_pairs = defaultdict(lambda: defaultdict(lambda: [[[0]*len(self.models) for i in range(self.number_of_features)],[]]))
+ self.reordering_pairs = defaultdict(lambda: defaultdict(lambda: [[0]*len(self.models) for i in range(self.number_of_features)]))
+ self.phrase_source = defaultdict(lambda: [0]*len(self.models))
+ self.phrase_target = defaultdict(lambda: [0]*len(self.models))
+
+ for model,priority,i in models:
+
+ if stack[i]:
+ if increment != stack[i][0]:
+ continue
+ else:
+ load_lines(stack[i],priority,i,store=store_flag,inverted=inverted)
+ stack[i] = ''
+
+ for line in model:
+
+ line = line.rstrip().split(b' ||| ')
+
+ if increment != line[0]:
+ stack[i] = line
+ break
+
+ load_lines(line,priority,i,store=store_flag,inverted=inverted)
+
+ yield 1
+
+ #calculate which source phrase to process next
+ lines = [line[0] + b' |' for line in stack if line]
+ if lines:
+ increment = min(lines)[:-2]
+ else:
+ increment = None
+
+
+ def load_word_probabilities(self,line,side,i,priority,e2f_filter=None,f2e_filter=None):
+ """process single line of lexical table"""
+
+ a, b, prob = line.split(b' ')
+
+        if side == 'e2f' and (not e2f_filter or (a in e2f_filter and b in e2f_filter[a])):
+
+ self.word_pairs_e2f[a][b][i] = float(prob)
+
+        elif side == 'f2e' and (not f2e_filter or (a in f2e_filter and b in f2e_filter[a])):
+
+ self.word_pairs_f2e[a][b][i] = float(prob)
+
+
+ def load_word_counts(self,line,side,i,priority,e2f_filter=None,f2e_filter=None):
+ """process single line of lexical table"""
+
+ a, b, ab_count, b_count = line.split(b' ')
+
+ if side == 'e2f':
+
+ if priority == 2: #MAP
+ if not e2f_filter or a in e2f_filter:
+ if not e2f_filter or b in e2f_filter[a]:
+ self.word_pairs_e2f[a][b][i] = float(ab_count)/float(b_count)
+ self.word_target[b][i] = 1
+ else:
+ if not e2f_filter or a in e2f_filter:
+ if not e2f_filter or b in e2f_filter[a]:
+ self.word_pairs_e2f[a][b][i] = float(ab_count)
+ self.word_target[b][i] = float(b_count)
+
+ elif side == 'f2e':
+
+ if priority == 2: #MAP
+ if not f2e_filter or a in f2e_filter and b in f2e_filter[a]:
+ if not f2e_filter or b in f2e_filter[a]:
+ self.word_pairs_f2e[a][b][i] = float(ab_count)/float(b_count)
+ self.word_source[b][i] = 1
+ else:
+ if not f2e_filter or a in f2e_filter and b in f2e_filter[a]:
+ if not f2e_filter or b in f2e_filter[a]:
+ self.word_pairs_f2e[a][b][i] = float(ab_count)
+ self.word_source[b][i] = float(b_count)
+
+
+ def load_lexical_tables(self,models,mode,e2f_filter=None,f2e_filter=None):
+ """open and load lexical tables into data structure"""
+
+ if mode == 'counts':
+ files = ['lex.counts.e2f','lex.counts.f2e']
+ load_lines = self.load_word_counts
+
+ else:
+ files = ['lex.e2f','lex.f2e']
+ load_lines = self.load_word_probabilities
+
+ j = 0
+
+ for f in files:
+ models_prioritized = [(self.open_table(model,f),priority,i) for (model,priority,i) in priority_sort_models(models)]
+
+ for model,priority,i in models_prioritized:
+ for line in model:
+ if not j % 100000:
+ sys.stderr.write('.')
+ j += 1
+ load_lines(line,f[-3:],i,priority,e2f_filter=e2f_filter,f2e_filter=f2e_filter)
+
+
+ def store_info(self,src,target,line):
+ """store alignment info and comment section for re-use in output"""
+
+ if len(line) == 5:
+ self.phrase_pairs[src][target][1] = line[3:5]
+
+ # assuming that alignment is empty
+ elif len(line) == 4:
+ if self.require_alignment:
+ sys.stderr.write('Error: unexpected phrase table format. Your current configuration requires alignment information. Make sure you trained your model with -phrase-word-alignment\n')
+ exit()
+
+ self.phrase_pairs[src][target][1] = ['',line[3].lstrip(b'| ')]
+
+ else:
+ sys.stderr.write('Error: unexpected phrase table format. Are you using a very old/new version of Moses with different formatting?\n')
+ exit()
+
+
+ def get_word_alignments(self,src,target,cache=False,mycache={}):
+ """from the Moses phrase table alignment info in the form "0-0 1-0",
+ get the aligned word pairs / NULL alignments
+ """
+
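+        # illustrative example (hypothetical phrase pair, not taken from the test data):
+        # src=b'das haus', target=b'the house' with alignment b'0-0 1-1' yields
+        #   textual_e2f = [[b'das', (b'the',)], [b'haus', (b'house',)]]
+        #   textual_f2e = [[b'the', (b'das',)], [b'house', (b'haus',)]]
+        # words without an alignment point are paired with ('NULL',).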
+ if cache:
+ if (src,target) in mycache:
+ return mycache[(src,target)]
+
+ try:
+ alignment = self.phrase_pairs[src][target][1][0]
+ except:
+ return None,None
+
+ src_list = src.split(b' ')
+ target_list = target.split(b' ')
+
+ textual_e2f = [[s,[]] for s in src_list]
+ textual_f2e = [[t,[]] for t in target_list]
+
+ for pair in alignment.split(b' '):
+ s,t = pair.split('-')
+ s,t = int(s),int(t)
+
+ textual_e2f[s][1].append(target_list[t])
+ textual_f2e[t][1].append(src_list[s])
+
+ for s,t in textual_e2f:
+ if not t:
+ t.append('NULL')
+
+ for s,t in textual_f2e:
+ if not t:
+ t.append('NULL')
+
+        #tuplize so we can use the values as dictionary keys
+ for i in range(len(textual_e2f)):
+ textual_e2f[i][1] = tuple(textual_e2f[i][1])
+
+ for i in range(len(textual_f2e)):
+ textual_f2e[i][1] = tuple(textual_f2e[i][1])
+
+ if cache:
+ mycache[(src,target)] = textual_e2f,textual_f2e
+
+ return textual_e2f,textual_f2e
+
+
+ def write_phrase_table(self,src,target,weights,features,mode,flags):
+ """convert data to string in Moses phrase table format"""
+
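+        # output line layout (cf. the format string at the end of this method and the
+        # test phrase tables): src ||| target ||| feature values 2.718 [origin features]
+        # ||| alignment ||| target_count source_count (or the original comment section
+        # when not in 'counts' mode)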
+ # if one feature value is 0 (either because of loglinear interpolation or rounding to 0), don't write it to phrasetable
+ # (phrase pair will end up with probability zero in log-linear model anyway)
+ if 0 in features:
+ return ''
+
+ # information specific to Moses model: alignment info and comment section with target and source counts
+ alignment,comments = self.phrase_pairs[src][target][1]
+ if mode == 'counts':
+ srccount = dot_product(self.phrase_source[src],weights[2])
+ targetcount = dot_product(self.phrase_target[target],weights[0])
+ comments = b"%s %s" %(targetcount,srccount)
+
+ features = b' '.join([b'%.6g' %(f) for f in features])
+
+ if flags['add_origin_features']:
+ origin_features = map(lambda x: 2.718**bool(x),self.phrase_pairs[src][target][0][0]) # 1 if phrase pair doesn't occur in model, 2.718 if it does
+ origin_features = b' '.join([b'%.4f' %(f) for f in origin_features]) + ' '
+ else:
+ origin_features = b''
+
+ line = b"%s ||| %s ||| %s 2.718 %s||| %s ||| %s\n" %(src,target,features,origin_features,alignment,comments)
+ return line
+
+
+
+ def write_lexical_file(self,direction, path, weights,mode):
+
+ if mode == 'counts':
+ bridge = '.counts'
+ else:
+ bridge = ''
+
+ fobj = handle_file("{0}{1}.{2}".format(path,bridge,direction),'open',mode='w')
+ sys.stderr.write('Writing {0}{1}.{2}\n'.format(path,bridge,direction))
+
+ if direction == 'e2f':
+ word_pairs = self.word_pairs_e2f
+ marginal = self.word_target
+
+ elif direction == 'f2e':
+ word_pairs = self.word_pairs_f2e
+ marginal = self.word_source
+
+ for x in sorted(word_pairs):
+ for y in sorted(word_pairs[x]):
+ xy = dot_product(word_pairs[x][y],weights)
+ fobj.write(b"%s %s %s" %(x,y,xy))
+
+ if mode == 'counts':
+ fobj.write(b" %s\n" %(dot_product(marginal[y],weights)))
+ else:
+ fobj.write(b'\n')
+
+ handle_file("{0}{1}.{2}".format(path,bridge,direction),'close',fobj,mode='w')
+
+
+
+ def write_reordering_table(self,src,target,features):
+ """convert data to string in Moses reordering table format"""
+
+ # if one feature value is 0 (either because of loglinear interpolation or rounding to 0), don't write it to reordering table
+ # (phrase pair will end up with probability zero in log-linear model anyway)
+ if 0 in features:
+ return ''
+
+        features = b' '.join([b'%.6g' %(f) for f in features])
+
+ line = b"%s ||| %s ||| %s\n" %(src,target,features)
+ return line
+
+
+ def create_inverse(self,fobj):
+ """swap source and target phrase in the phrase table, and then sort (by target phrase)"""
+
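+        # e.g. b'der pass ||| le col ||| ...' becomes b'le col ||| der pass ||| ...';
+        # only the two phrases are swapped, the scores and other fields stay in place.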
+ inverse = NamedTemporaryFile(prefix='inv_unsorted',delete=False)
+ swap = re.compile(b'(.+?) \|\|\| (.+?) \|\|\|')
+
+ # just swap source and target phrase, and leave order of scores etc. intact.
+ # For better compatibility with existing codebase, we swap the order of the phrases back for processing
+ for line in fobj:
+ inverse.write(swap.sub(b'\\2 ||| \\1 |||',line,1))
+ inverse.close()
+
+ inverse_sorted = sort_file(inverse.name)
+ os.remove(inverse.name)
+
+ return inverse_sorted
+
+
+ def merge(self,pt_normal, pt_inverse, pt_out, mode='interpolate'):
+ """merge two phrasetables (the latter having been inverted to calculate p(s|t) and lex(s|t) in sorted order)"""
+
+ for line,line2 in izip(pt_normal,pt_inverse):
+
+ line = line.split(b' ||| ')
+ line2 = line2.split(b' ||| ')
+
+ #scores
+ scores1 = line[2].split()
+ scores2 = line2[2].split()
+ line[2] = b' '.join(scores2[:2]+scores1[2:])
+
+ # marginal counts
+ if mode == 'counts':
+ src_count = line[-1].split()[1]
+ target_count = line2[-1].split()[0]
+ line[-1] = b' '.join([target_count,src_count]) + b'\n'
+
+ pt_out.write(b' ||| '.join(line))
+
+ pt_normal.close()
+ pt_inverse.close()
+ pt_out.close()
+
+
+
+class TigerXML():
+    """interface to load reference word alignments from a TigerXML corpus.
+ Tested on SMULTRON (http://kitt.cl.uzh.ch/kitt/smultron/)
+ """
+
+ def __init__(self,alignment_xml):
+ """only argument is TigerXML file
+ """
+
+ self.treebanks = self._open_treebanks(alignment_xml)
+ self.word_pairs = defaultdict(lambda: defaultdict(int))
+ self.word_source = defaultdict(int)
+ self.word_target = defaultdict(int)
+
+
+ def load_word_pairs(self,src,target):
+        """load word pairs. src and target are the identifiers of the source and target language in the XML"""
+
+ if not src or not target:
+ sys.stderr.write('Error: Source and/or target language not specified. Required for TigerXML extraction.\n')
+ exit()
+
+ alignments = self._get_aligned_ids(src,target)
+ self._textualize_alignments(src,target,alignments)
+
+
+ def _open_treebanks(self,alignment_xml):
+ """Parallel XML format references monolingual files. Open all."""
+
+ alignment_path = os.path.dirname(alignment_xml)
+ align_xml = ET.parse(alignment_xml)
+
+ treebanks = {}
+ treebanks['aligned'] = align_xml
+
+ for treebank in align_xml.findall('//treebank'):
+ treebank_id = treebank.get('id')
+ filename = treebank.get('filename')
+
+ if not os.path.isabs(filename):
+ filename = os.path.join(alignment_path,filename)
+
+ treebanks[treebank_id] = ET.parse(filename)
+
+ return treebanks
+
+
+ def _get_aligned_ids(self,src,target):
+ """first step: find which nodes are aligned."""
+
+
+ alignments = []
+ ids = defaultdict(dict)
+
+ for alignment in self.treebanks['aligned'].findall('//align'):
+
+ newpair = {}
+
+ if len(alignment) != 2:
+ sys.stderr.write('Error: alignment with ' + str(len(alignment)) + ' children. Expected 2. Skipping.\n')
+ continue
+
+ for node in alignment:
+ lang = node.get('treebank_id')
+ node_id = node.get('node_id')
+ newpair[lang] = node_id
+
+ if not (src in newpair and target in newpair):
+ sys.stderr.write('Error: source and target languages don\'t match. Skipping.\n')
+ continue
+
+ # every token may only appear in one alignment pair;
+ # if it occurs in multiple, we interpret them as one 1-to-many or many-to-1 alignment
+ if newpair[src] in ids[src]:
+ idx = ids[src][newpair[src]]
+ alignments[idx][1].append(newpair[target])
+
+ elif newpair[target] in ids[target]:
+ idx = ids[target][newpair[target]]
+ alignments[idx][0].append(newpair[src])
+
+ else:
+ idx = len(alignments)
+ alignments.append(([newpair[src]],[newpair[target]]))
+ ids[src][newpair[src]] = idx
+ ids[target][newpair[target]] = idx
+
+ alignments = self._discard_discontinuous(alignments)
+
+ return alignments
+
+
+ def _discard_discontinuous(self,alignments):
+ """discard discontinuous word sequences (which we can't use for phrase-based SMT systems)
+ and make sure that sequence is in correct order.
+ """
+
+ new_alignments = []
+
+ for alignment in alignments:
+ new_pair = []
+
+ for sequence in alignment:
+
+ sequence_split = [t_id.split('_') for t_id in sequence]
+
+ #check if all words come from the same sentence
+ sentences = [item[0] for item in sequence_split]
+ if not len(set(sentences)) == 1:
+ #sys.stderr.write('Warning. Word sequence crossing sentence boundary. Discarding.\n')
+ #sys.stderr.write(str(sequence)+'\n')
+ continue
+
+
+ #sort words and check for discontinuities.
+ try:
+ tokens = sorted([int(item[1]) for item in sequence_split])
+ except ValueError:
+ #sys.stderr.write('Warning. Not valid word IDs. Discarding.\n')
+ #sys.stderr.write(str(sequence)+'\n')
+ continue
+
+ if not tokens[-1]-tokens[0] == len(tokens)-1:
+ #sys.stderr.write('Warning. Discontinuous word sequence(?). Discarding.\n')
+ #sys.stderr.write(str(sequence)+'\n')
+ continue
+
+ out_sequence = [sentences[0]+'_'+str(token) for token in tokens]
+ new_pair.append(out_sequence)
+
+ if len(new_pair) == 2:
+ new_alignments.append(new_pair)
+
+ return new_alignments
+
+
+ def _textualize_alignments(self,src,target,alignments):
+ """Knowing which nodes are aligned, get actual words that are aligned."""
+
+ words = defaultdict(dict)
+
+ for text in [text for text in self.treebanks if not text == 'aligned']:
+
+ #TODO: Make lowercasing optional
+ for terminal in self.treebanks[text].findall('//t'):
+ words[text][terminal.get('id')] = terminal.get('word').lower()
+
+
+ for (src_ids, target_ids) in alignments:
+
+ try:
+ src_text = ' '.join((words[src][src_id] for src_id in src_ids))
+ except KeyError:
+ #sys.stderr.write('Warning. ID not found: '+ str(src_ids) +'\n')
+ continue
+
+ try:
+ target_text = ' '.join((words[target][target_id] for target_id in target_ids))
+ except KeyError:
+ #sys.stderr.write('Warning. ID not found: '+ str(target_ids) +'\n')
+ continue
+
+ self.word_pairs[src_text][target_text] += 1
+ self.word_source[src_text] += 1
+ self.word_target[target_text] += 1
+
+
+
+class Moses_Alignment():
+ """interface to load reference phrase alignment from corpus aligend with Giza++
+ and with extraction heuristics as applied by the Moses toolkit.
+
+ """
+
+ def __init__(self,alignment_file):
+
+ self.alignment_file = alignment_file
+ self.word_pairs = defaultdict(lambda: defaultdict(int))
+ self.word_source = defaultdict(int)
+ self.word_target = defaultdict(int)
+
+
+ def load_word_pairs(self,src_lang,target_lang):
+ """main function. overwrite this to import data in different format."""
+
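+ # Descriptive note (derived from the code below): each line of the alignment
+ # file is expected to contain at least two b' ||| '-separated fields; the
+ # first is used as the source phrase, the second as the target phrase, and
+ # any further fields (e.g. word alignment points) are ignored here.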
+ fileobj = handle_file(self.alignment_file,'open','r')
+
+ for line in fileobj:
+
+ line = line.split(b' ||| ')
+
+ src = line[0]
+ target = line[1]
+
+ self.word_pairs[src][target] += 1
+ self.word_source[src] += 1
+ self.word_target[target] += 1
+
+
+def dot_product(a,b):
+ """calculate dot product from two lists"""
+
+ # optimized for PyPy (much faster than enumerate/map)
+ s = 0
+ i = 0
+ for x in a:
+ s += x * b[i]
+ i += 1
+
+ return s
+
+
+def priority_sort_models(models):
+ """primary models should have priority before supplementary models.
+ zipped with index to know which weight model belongs to
+ """
+
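+ # Illustration (assumed example values, not from the original code): the input
+ # [('supp', 10), ('prim', 1)] is returned as [('prim', 1, 1), ('supp', 10, 0)],
+ # i.e. sorted by priority value, with each model's original index kept as the
+ # last tuple element.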
+ return [(model,priority,i) for (i,(model,priority)) in sorted(zip(range(len(models)),models),key=lambda x: x[1][1])]
+
+
+def cross_entropy(model_interface,reference_interface,weights,score,mode,flags):
+ """calculate cross entropy given all necessary information.
+ don't call this directly, but use one of the Combine_TMs methods.
+ """
+
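+ # The cross-entropies computed below are count-weighted average negative
+ # log-probabilities, e.g. for the direct phrase probability roughly:
+ #   H(ref, model) = -(1/n) * sum over reference pairs (s,t) of c(s,t) * log2 p(t|s)
+ # where n is the total count of reference pairs for which the model provides a score.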
+ weights = normalize_weights(weights,mode)
+
+ if 'compare_cross-entropies' in flags and flags['compare_cross-entropies']:
+ num_results = len(model_interface.models)
+ else:
+ num_results = 1
+
+ cross_entropy_ts = [0]*num_results
+ cross_entropy_st = [0]*num_results
+ cross_entropy_lex_ts = [0]*num_results
+ cross_entropy_lex_st = [0]*num_results
+ oov = [0]*num_results
+ oov2 = 0
+ other_translations = [0]*num_results
+ ignored = [0]*num_results
+ n = [0]*num_results
+ total_pairs = 0
+
+ for src in reference_interface.word_pairs:
+ for target in reference_interface.word_pairs[src]:
+
+ c = reference_interface.word_pairs[src][target]
+
+ for i in range(num_results):
+ if src in model_interface.phrase_pairs and target in model_interface.phrase_pairs[src]:
+
+ if ('compare_cross-entropies' in flags and flags['compare_cross-entropies']) or ('intersected_cross-entropies' in flags and flags['intersected_cross-entropies']):
+
+ if 0 in model_interface.phrase_pairs[src][target][0][0]: #only use intersection of models for comparability
+
+ # update unknown words statistics
+ if model_interface.phrase_pairs[src][target][0][0][i]:
+ ignored[i] += c
+ elif src in model_interface.phrase_source and model_interface.phrase_source[src][i]:
+ other_translations[i] += c
+ else:
+ oov[i] += c
+
+ continue
+
+ if ('compare_cross-entropies' in flags and flags['compare_cross-entropies']):
+ tmp_weights = [[0]*i+[1]+[0]*(num_results-i-1)]*model_interface.number_of_features
+ elif ('intersected_cross-entropies' in flags and flags['intersected_cross-entropies']):
+ tmp_weights = weights
+
+ features = score(tmp_weights,src,target,model_interface,flags)
+
+ else:
+ features = score(weights,src,target,model_interface,flags)
+
+ #if weight is so low that feature gets probability zero
+ if 0 in features:
+ #sys.stderr.write('Warning: 0 probability in model {0}: source phrase: {1!r}; target phrase: {2!r}\n'.format(i,src,target))
+ #sys.stderr.write('Possible reasons: 0 probability in phrase table; very low (or 0) weight; recompute lexweight and different alignments\n')
+ #sys.stderr.write('Phrase pair is ignored for cross_entropy calculation\n\n')
+ continue
+
+ n[i] += c
+ cross_entropy_ts[i] += -log(features[2],2)*c
+ cross_entropy_st[i] += -log(features[0],2)*c
+
+ cross_entropy_lex_ts[i] += -log(features[3],2)*c
+ cross_entropy_lex_st[i] += -log(features[1],2)*c
+
+ elif src in model_interface.phrase_source and not ('compare_cross-entropies' in flags and flags['compare_cross-entropies']):
+ other_translations[i] += c
+
+ else:
+ oov2 += c
+
+ total_pairs += c
+
+
+ oov2 = int(oov2/num_results)
+
+ for i in range(num_results):
+ try:
+ cross_entropy_ts[i] /= n[i]
+ cross_entropy_st[i] /= n[i]
+ cross_entropy_lex_ts[i] /= n[i]
+ cross_entropy_lex_st[i] /= n[i]
+ except ZeroDivisionError:
+ sys.stderr.write('Warning: no matching phrase pairs between reference set and model\n')
+ cross_entropy_ts[i] = 0
+ cross_entropy_st[i] = 0
+ cross_entropy_lex_ts[i] = 0
+ cross_entropy_lex_st[i] = 0
+
+
+ if 'compare_cross-entropies' in flags and flags['compare_cross-entropies']:
+ return [(cross_entropy_st[i],cross_entropy_lex_st[i],cross_entropy_ts[i],cross_entropy_lex_ts[i],other_translations[i],oov[i],ignored[i],n[i],total_pairs) for i in range(num_results)], (n[0],total_pairs,oov2)
+ else:
+ return cross_entropy_st[0],cross_entropy_lex_st[0],cross_entropy_ts[0],cross_entropy_lex_ts[0],other_translations[0],oov2,total_pairs
+
+
+def cross_entropy_light(model_interface,reference_interface,weights,score,mode,flags,cache):
+ """calculate cross entropy given all necessary information.
+ don't call this directly, but use one of the Combine_TMs methods.
+ Same as cross_entropy, but optimized for speed: it doesn't generate all of the statistics,
+ doesn't normalize, and uses caching.
+ """
+ weights = normalize_weights(weights,mode)
+ cross_entropy_ts = 0
+ cross_entropy_st = 0
+ cross_entropy_lex_ts = 0
+ cross_entropy_lex_st = 0
+
+ for (src,target,c) in cache:
+ features = score(weights,src,target,model_interface,flags,cache=True)
+
+ cross_entropy_ts -= log(features[2],2)*c
+ cross_entropy_st -= log(features[0],2)*c
+
+ cross_entropy_lex_ts -= log(features[3],2)*c
+ cross_entropy_lex_st -= log(features[1],2)*c
+
+ return cross_entropy_st,cross_entropy_lex_st,cross_entropy_ts,cross_entropy_lex_ts
+
+
+def _get_reference_cache(reference_interface,model_interface):
+ """creates a data structure that allows for a quick access
+ to all relevant reference set phrase/word pairs and their frequencies.
+ """
+ cache = []
+ n = 0
+
+ for src in reference_interface.word_pairs:
+ for target in reference_interface.word_pairs[src]:
+ if src in model_interface.phrase_pairs and target in model_interface.phrase_pairs[src]:
+ c = reference_interface.word_pairs[src][target]
+ cache.append((src,target,c))
+ n += c
+
+ return cache,n
+
+
+def _get_lexical_filter(reference_interface,model_interface):
+ """returns dictionaries that store the words and word pairs needed
+ for perplexity optimization. These dicts let us load less data into memory during optimization."""
+
+ e2f_filter = defaultdict(set)
+ f2e_filter = defaultdict(set)
+
+ for src in reference_interface.word_pairs:
+ for target in reference_interface.word_pairs[src]:
+ if src in model_interface.phrase_pairs and target in model_interface.phrase_pairs[src]:
+ e2f_alignment,f2e_alignment = model_interface.get_word_alignments(src,target)
+
+ for s,t_list in e2f_alignment:
+ for t in t_list:
+ e2f_filter[s].add(t)
+
+ for t,s_list in f2e_alignment:
+ for s in s_list:
+ f2e_filter[t].add(s)
+
+ return e2f_filter,f2e_filter
+
+
+def _hillclimb_move(weights,stepsize,mode):
+ """Move function for hillclimb algorithm. Updates each weight by stepsize."""
+
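+ # Illustration (assumed values, not from the original code): with weights
+ # [0.5, 0.5], stepsize 0.1 and mode 'interpolate', this yields the renormalized
+ # neighbours of [0.6, 0.5], [0.5, 0.6], [0.4, 0.5] and [0.5, 0.4].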
+ for i,w in enumerate(weights):
+ yield normalize_weights(weights[:i]+[w+stepsize]+weights[i+1:],mode)
+
+ for i,w in enumerate(weights):
+ new = w-stepsize
+ if new >= 1e-10:
+ yield normalize_weights(weights[:i]+[new]+weights[i+1:],mode)
+
+def _hillclimb(scores,best_weights,objective,model_interface,reference_interface,score_function,mode,flags,precision,cache,n):
+ """first (deprecated) implementation of iterative weight optimization."""
+
+ best = objective(best_weights)
+
+ i = 0 # counts iterations at the current stepsize; after every 10 iterations, the stepsize is doubled
+ stepsize = 512 # initial stepsize
+ move = 1 # whether we found a better set of weights in the current iteration; if not, the stepsize is halved
+ sys.stderr.write('Hillclimb: step size: ' + str(stepsize))
+ while stepsize > 0.0078:
+
+ if not move:
+ stepsize /= 2
+ sys.stderr.write(' ' + str(stepsize))
+ i = 0
+ move = 1
+ continue
+
+ move = 0
+
+ for w in _hillclimb_move(list(best_weights),stepsize,mode):
+ weights_tuple = tuple(w)
+
+ if weights_tuple in scores:
+ continue
+
+ scores[weights_tuple] = cross_entropy_light(model_interface,reference_interface,[w for m in range(model_interface.number_of_features)],score_function,mode,flags,cache)
+
+ if objective(weights_tuple)+precision < best:
+ best = objective(weights_tuple)
+ best_weights = weights_tuple
+ move = 1
+
+ if i and not i % 10:
+ sys.stderr.write('\nIteration '+ str(i) + ' with stepsize ' + str(stepsize) + '. Current cross-entropy: ' + str(best) + ' - weights: ' + str(best_weights) + ' ')
+ stepsize *= 2
+ sys.stderr.write('\nIncreasing stepsize: '+ str(stepsize))
+ i = 0
+
+ i += 1
+
+ return best_weights
+
+
+def optimize_cross_entropy_hillclimb(model_interface,reference_interface,initial_weights,score_function,mode,flags,precision=0.000001):
+ """find weights that minimize cross-entropy on a tuning set
+ deprecated (default is now L-BFGS (optimize_cross_entropy)), but left in for people without SciPy
+ """
+
+ scores = {}
+
+ best_weights = tuple(initial_weights[0])
+
+ cache,n = _get_reference_cache(reference_interface,model_interface)
+
+ # each objective is a triple: which score to minimize from cross_entropy(), which weights to update accordingly, and a comment that is printed
+ objectives = [(lambda x: scores[x][0]/n,[0],'minimize cross-entropy for p(s|t)'), #optimize cross-entropy for p(s|t)
+ (lambda x: scores[x][1]/n,[1],'minimize cross-entropy for lex(s|t)'),
+ (lambda x: scores[x][2]/n,[2],'minimize cross-entropy for p(t|s)'), #optimize for p(t|s)
+ (lambda x: scores[x][3]/n,[3],'minimize cross-entropy for lex(t|s)')]
+
+
+ scores[best_weights] = cross_entropy_light(model_interface,reference_interface,initial_weights,score_function,mode,flags,cache)
+ final_weights = initial_weights[:]
+ final_cross_entropy = [0]*model_interface.number_of_features
+
+ for objective, features, comment in objectives:
+ best_weights = min(scores,key=objective)
+ sys.stderr.write('Optimizing objective "' + comment +'"\n')
+ best_weights = _hillclimb(scores,best_weights,objective,model_interface,reference_interface,score_function,mode,flags,precision,cache,n)
+
+ sys.stderr.write('\nCross-entropy:' + str(objective(best_weights)) + '- weights: ' + str(best_weights)+'\n\n')
+
+ for j in features:
+ final_weights[j] = list(best_weights)
+ final_cross_entropy[j] = objective(best_weights)
+
+ return final_weights,final_cross_entropy
+
+
+def optimize_cross_entropy(model_interface,reference_interface,initial_weights,score_function,mode,flags):
+ """find weights that minimize cross-entropy on a tuning set
+ Uses L-BFGS optimization and requires SciPy
+ """
+
+ if not optimizer == 'l-bfgs':
+ sys.stderr.write('SciPy is not installed. Falling back to naive hillclimb optimization (instead of L-BFGS)\n')
+ return optimize_cross_entropy_hillclimb(model_interface,reference_interface,initial_weights,score_function,mode,flags)
+
+ cache,n = _get_reference_cache(reference_interface,model_interface)
+
+ # each objective is a triple: which score to minimize from cross_entropy(), which weights to update accordingly, and a comment that is printed
+ objectives = [(lambda w: cross_entropy_light(model_interface,reference_interface,[[1]+list(w) for m in range(model_interface.number_of_features)],score_function,mode,flags,cache)[0],[0],'minimize cross-entropy for p(s|t)'), #optimize cross-entropy for p(s|t)
+ (lambda w: cross_entropy_light(model_interface,reference_interface,[[1]+list(w) for m in range(model_interface.number_of_features)],score_function,mode,flags,cache)[1],[1],'minimize cross-entropy for lex(s|t)'),
+ (lambda w: cross_entropy_light(model_interface,reference_interface,[[1]+list(w) for m in range(model_interface.number_of_features)],score_function,mode,flags,cache)[2],[2],'minimize cross-entropy for p(t|s)'), #optimize for p(t|s)
+ (lambda w: cross_entropy_light(model_interface,reference_interface,[[1]+list(w) for m in range(model_interface.number_of_features)],score_function,mode,flags,cache)[3],[3],'minimize cross-entropy for lex(t|s)')]
+
+ final_weights = initial_weights[:]
+ final_cross_entropy = [0]*model_interface.number_of_features
+
+ for i,(objective, features, comment) in enumerate(objectives):
+ sys.stderr.write('Optimizing objective "' + comment +'"\n')
+ initial_values = [1]*(len(model_interface.models)-1) # we leave value of first model at 1 and optimize all others (normalized of course)
+ best_weights, best_point, data = fmin_l_bfgs_b(objective,initial_values,approx_grad=True,bounds=[(0.000000001,None)]*len(initial_values))
+ best_weights = normalize_weights([1]+list(best_weights),mode)
+ sys.stderr.write('Cross-entropy after L-BFGS optimization: ' + str(best_point/n) + ' - weights: ' + str(best_weights)+'\n')
+
+ for j in features:
+ final_weights[j] = list(best_weights)
+ final_cross_entropy[j] = best_point/n
+
+ return final_weights,final_cross_entropy
+
+
+def redistribute_probability_mass(weights,src,target,interface,flags,mode='interpolate'):
+ """the conditional probability p(x|y) is undefined for cases where p(y) = 0
+ this function redistributes the probability mass to only consider models for which p(y) > 0
+ """
+
+ new_weights = weights[:]
+
+ if flags['normalize_s_given_t'] == 's':
+
+ # set weight to 0 for all models where the source phrase is unseen (p(s|t))
+ new_weights[0] = map(mul,interface.phrase_source[src],weights[0])
+ if flags['normalize-lexical_weights']:
+ new_weights[1] = map(mul,interface.phrase_source[src],weights[1])
+
+ elif flags['normalize_s_given_t'] == 't':
+
+ # set weight to 0 for all models where the target phrase is unseen (p(s|t))
+ new_weights[0] = map(mul,interface.phrase_target[target],weights[0])
+ if flags['normalize-lexical_weights']:
+ new_weights[1] = map(mul,interface.phrase_target[target],weights[1])
+
+ # set weight to 0 for all models where the source phrase is unseen (p(t|s))
+ new_weights[2] = map(mul,interface.phrase_source[src],weights[2])
+ if flags['normalize-lexical_weights']:
+ new_weights[3] = map(mul,interface.phrase_source[src],weights[3])
+
+
+ return normalize_weights(new_weights,mode)
+
+
+def score_interpolate(weights,src,target,interface,flags,cache=False):
+ """linear interpolation of probabilites (and other feature values)
+ if normalized is True, the probability mass for p(x|y) is redistributed to models with p(y) > 0
+ """
+
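+ # As computed below, each feature k is interpolated as
+ #   score[k] = sum over models i of w[k][i] * p_i[k](src, target),
+ # unless lexical weights are recomputed from interpolated word translation probabilities.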
+ model_values = interface.phrase_pairs[src][target][0]
+
+ scores = [0]*len(model_values)
+
+ if 'normalized' in flags and flags['normalized']:
+ normalized_weights = redistribute_probability_mass(weights,src,target,interface,flags)
+ else:
+ normalized_weights = weights
+
+ if 'recompute_lexweights' in flags and flags['recompute_lexweights']:
+ e2f_alignment,f2e_alignment = interface.get_word_alignments(src,target,cache=cache)
+
+ if not e2f_alignment or not f2e_alignment:
+ sys.stderr.write('Error: no word alignments found, but necessary for lexical weight computation.\n')
+ lst = 0
+ lts = 0
+
+ else:
+ scores[1] = compute_lexicalweight(normalized_weights[1],e2f_alignment,interface.word_pairs_e2f,None,mode='interpolate')
+ scores[3] = compute_lexicalweight(normalized_weights[3],f2e_alignment,interface.word_pairs_f2e,None,mode='interpolate')
+
+
+ for idx,prob in enumerate(model_values):
+ if not ('recompute_lexweights' in flags and flags['recompute_lexweights'] and (idx == 1 or idx == 3)):
+ scores[idx] = dot_product(prob,normalized_weights[idx])
+
+ return scores
+
+
+def score_loglinear(weights,src,target,interface,flags,cache=False):
+ """loglinear interpolation of probabilites
+ warning: if phrase pair does not occur in all models, resulting probability is 0
+ this is usually not what you want - loglinear scoring is only included for completeness' sake
+ """
+
+ scores = []
+ model_values = interface.phrase_pairs[src][target][0]
+
+ for idx,prob in enumerate(model_values):
+ try:
+ scores.append(exp(dot_product(map(log,prob),weights[idx])))
+ except ValueError:
+ scores.append(0)
+
+ return scores
+
+
+def score_counts(weights,src,target,interface,flags,cache=False):
+ """count-based re-estimation of probabilites and lexical weights
+ each count is multiplied by its weight; trivial case is weight 1 for each model, which corresponds to a concatentation
+ """
+
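+ # Weighted-count estimate, e.g. p(s|t) = sum_i w_i*c_i(s,t) / sum_i w_i*c_i(t)
+ # (and analogously for p(t|s)); with all weights set to 1 this is equivalent
+ # to estimating the probabilities from the concatenated training data.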
+ try:
+ joined_count = dot_product(interface.phrase_pairs[src][target][0][0],weights[0])
+ target_count = dot_product(interface.phrase_target[target],weights[0])
+ pst = joined_count / target_count
+ except ZeroDivisionError:
+ pst = 0
+
+ try:
+ joined_count = dot_product(interface.phrase_pairs[src][target][0][1],weights[2])
+ source_count = dot_product(interface.phrase_source[src],weights[2])
+ pts = joined_count / source_count
+ except ZeroDivisionError:
+ pts = 0
+
+ e2f_alignment,f2e_alignment = interface.get_word_alignments(src,target,cache=cache)
+
+ if not e2f_alignment or not f2e_alignment:
+ sys.stderr.write('Error: no word alignments found, but necessary for lexical weight computation.\n')
+ lst = 0
+ lts = 0
+
+ else:
+ lst = compute_lexicalweight(weights[1],e2f_alignment,interface.word_pairs_e2f,interface.word_target,mode='counts',cache=cache)
+ lts = compute_lexicalweight(weights[3],f2e_alignment,interface.word_pairs_f2e,interface.word_source,mode='counts',cache=cache)
+
+ return [pst,lst,pts,lts]
+
+
+def score_interpolate_reordering(weights,src,target,interface):
+ """linear interpolation of reordering model probabilities
+ also normalizes the scores so that each half of the feature vector sums to 1 (assumes a bidirectional Moses reordering configuration; see comment below)
+ """
+
+ model_values = interface.reordering_pairs[src][target]
+
+ scores = [0]*len(model_values)
+
+ for idx,prob in enumerate(model_values):
+ scores[idx] = dot_product(prob,weights[idx])
+
+ #normalizes first half and last half probabilities (so that each half sums to one).
+ #only makes sense for bidirectional configuration in Moses. Remove/change this if you want a different (or no) normalization
+ scores = normalize_weights(scores[:int(interface.number_of_features/2)],'interpolate') + normalize_weights(scores[int(interface.number_of_features/2):],'interpolate')
+
+ return scores
+
+
+def compute_lexicalweight(weights,alignment,word_pairs,marginal,mode='counts',cache=False,mycache=[0,defaultdict(dict)]):
+ """compute the lexical weights as implemented in Moses toolkit"""
+
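+ # Following the Moses definition, the lexical weight is a product over source
+ # positions x of the average translation probability of x over its aligned
+ # target words y, roughly:
+ #   lex = prod_x (1/|T(x)|) * sum over y in T(x) of w(x|y)
+ # where w(x|y) is computed from weighted counts in mode 'counts' and from
+ # interpolated probabilities in mode 'interpolate'.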
+ lex = 1
+
+ # new weights: empty cache
+ if cache and mycache[0] != weights:
+ mycache[0] = weights
+ mycache[1] = defaultdict(dict)
+
+ for x,translations in alignment:
+
+ if cache and translations in mycache[1][x]:
+ lex_step = mycache[1][x][translations]
+
+ else:
+ lex_step = 0
+ for y in translations:
+
+ if mode == 'counts':
+ lex_step += dot_product(word_pairs[x][y],weights) / dot_product(marginal[y],weights)
+ elif mode == 'interpolate':
+ lex_step += dot_product(word_pairs[x][y],weights)
+
+ lex_step /= len(translations)
+
+ if cache:
+ mycache[1][x][translations] = lex_step
+
+ lex *= lex_step
+
+ return lex
+
+
+def normalize_weights(weights,mode):
+ """make sure that probability mass in linear interpolation is 1
+ for weighted counts, weight of first model is set to 1
+ """
+
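+ # Examples (illustrative): normalize_weights([2.0, 2.0], 'interpolate') returns [0.5, 0.5];
+ # normalize_weights([2.0, 4.0], 'counts') returns [1.0, 2.0] (first model fixed to weight 1).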
+ if mode == 'interpolate' or mode == 'loglinear':
+
+ if type(weights[0]) == list:
+
+ new_weights = []
+
+ for weight_list in weights:
+ total = sum(weight_list)
+
+ try:
+ weight_list = [weight/total for weight in weight_list]
+ except ZeroDivisionError:
+ sys.stderr.write('Error: Zero division in weight normalization. Are some of your weights zero? This might lead to undefined behaviour if a phrase pair is only seen in a model with weight 0\n')
+
+ new_weights.append(weight_list)
+
+ else:
+ total = sum(weights)
+
+ try:
+ new_weights = [weight/total for weight in weights]
+ except ZeroDivisionError:
+ new_weights = weights
+ sys.stderr.write('Error: Zero division in weight normalization. Are some of your weights zero? This might lead to undefined behaviour if a phrase pair is only seen in a model with weight 0\n')
+
+ elif mode == 'counts':
+
+ if type(weights[0]) == list:
+
+ new_weights = []
+
+ for weight_list in weights:
+ ratio = 1/weight_list[0]
+ new_weights.append([weight * ratio for weight in weight_list])
+
+ else:
+ ratio = 1/weights[0]
+ new_weights = [weight * ratio for weight in weights]
+
+ return new_weights
+
+
+def handle_file(filename,action,fileobj=None,mode='r'):
+ """handle some ugly encoding issues, different python versions, and writing either to file, stdout or gzipped file format"""
+
+ if action == 'open':
+
+ if mode == 'r':
+ mode = 'rb'
+
+ if mode == 'rb' and not filename == '-' and not os.path.exists(filename):
+ if os.path.exists(filename+'.gz'):
+ filename = filename+'.gz'
+ else:
+ sys.stderr.write('Error: unable to open file. ' + filename + ' - aborting.\n')
+
+ if 'counts' in filename and os.path.exists(os.path.dirname(filename)):
+ sys.stderr.write('For a weighted counts combination, we need statistics that Moses doesn\'t write to disk by default.\n')
+ sys.stderr.write('Apply train_model.patch to train_model.perl and repeat step 4 of Moses training for all models.\n')
+
+ exit()
+
+ if filename.endswith('.gz'):
+ fileobj = gzip.open(filename,mode)
+
+ elif filename == '-' and mode == 'w':
+ fileobj = sys.stdout
+
+ else:
+ fileobj = open(filename,mode)
+
+ return fileobj
+
+ elif action == 'close' and filename != '-':
+ fileobj.close()
+
+
+def sort_file(filename):
+ """Sort a file and return temporary file"""
+
+ cmd = ['sort', filename]
+ env = {}
+ env['LC_ALL'] = 'C'
+
+ outfile = NamedTemporaryFile(delete=False)
+ sys.stderr.write('LC_ALL=C ' + ' '.join(cmd) + ' > ' + outfile.name + '\n')
+ p = Popen(cmd,env=env,stdout=outfile.file)
+ p.wait()
+
+ outfile.seek(0)
+
+ return outfile
+
+
+class Combine_TMs():
+
+ """This class handles the various options, checks them for sanity and has methods that define what models to load and what functions to call for the different tasks.
+ Typically, you only need to interact with this class and its attributes.
+
+ """
+
+ #some flags that change the behaviour during scoring. See init docstring for more info
+ flags = {'normalized':False,
+ 'recompute_lexweights':False,
+ 'intersected_cross-entropies':False,
+ 'normalize_s_given_t':None,
+ 'normalize-lexical_weights':True,
+ 'add_origin_features':False,
+ 'lowmem': False
+ }
+
+ # each model needs a priority. See init docstring for more info
+ _priorities = {'primary':1,
+ 'map':2,
+ 'supplementary':10}
+
+ def __init__(self,models,weights=None,output_file=None,mode='interpolate',number_of_features=4,model_interface=Moses,reference_interface=Moses_Alignment,reference_file=None,lang_src=None,lang_target=None,output_lexical=None,**flags):
+ """The whole configuration of the task is done during intialization. Afterwards, you only need to call your intended method(s).
+ You can change some of the class attributes afterwards (such as the weights, or the output file), but you should never change the models or mode after initialization.
+ See unit_test function for example configurations
+
+ models: list of tuples (path,priority) that defines which models to process. Path is usually the top directory of a Moses model. There are three priorities:
+ 'primary': phrase pairs with this priority will always be included in output model. For most purposes, you'll want to define all models as primary.
+ 'map': for maximum a-posteriori combination (Bacchiani et al. 2004; Foster et al. 2010). for use with mode 'counts'. stores c(t) = 1 and c(s,t) = p(s|t)
+ 'supplementary': phrase pairs are considered for probability computation, but not included in output model (unless they also occur in at least one primary model)
+ useful for rescoring a model without changing its vocabulary.
+
+ weights: accept two types of weight declarations: one weight per model, and one weight per model and feature
+ type one is internally converted to type two. For 2 models with four features, this looks like: [0.1,0.9] -> [[0.1,0.9],[0.1,0.9],[0.1,0.9],[0.1,0.9]]
+ default: uniform weights (None)
+
+ output_file: filepath of output phrase table. If it ends with .gz, file is automatically zipped.
+
+ output_lexical: If defined, also writes combined lexical tables. Writes to output_lexical.e2f and output_lexical.f2e, or output_lexical.counts.e2f in mode 'counts'.
+
+ mode: declares the basic mixture-model algorithm. there are currently three options:
+ 'counts': weighted counts (requires some statistics that Moses doesn't produce. Apply train_model.patch to train_model.perl and repeat step 4 of Moses training to obtain them.)
+ 'interpolate': linear interpolation
+ 'loglinear': loglinear interpolation (careful: this creates the intersection of phrase tables and is often of little use)
+
+ number_of_features: could be used to interpolate models with non-default Moses features. 4 features is currently still hardcoded in various places
+ (e.g. cross_entropy calculations, mode 'counts')
+
+ model_interface: class that handles reading phrase tables and lexical tables, and writing phrase tables. Currently only Moses is implemented.
+ default: Moses
+
+ reference_interface: class that deals with reading in reference phrase pairs for cross-entropy computation
+ Moses_Alignment: Word/phrase pairs as computed by Giza++ and extracted through Moses heuristics. This corresponds to the file model/extract.gz if you train a Moses model on your tuning set.
+ TigerXML: TigerXML data format
+
+ default: Moses_Alignment
+
+ reference_file: path to reference file. Required for every operation except combination of models with given weights.
+
+ lang_src: source language. Only required if reference_interface is TigerXML. Identifies which language in XML file we should treat as source language.
+
+ lang_target: target language. Only required if reference_interface is TigerXML. Identifies which language in XML file we should treat as target language.
+
+ intersected_cross-entropies: compute cross-entropies of intersection of phrase pairs, ignoring phrase pairs that do not occur in all models.
+ If False, algorithm operates on union of phrase pairs
+ default: False
+
+ add_origin_features: For each model that is being combined, add a binary feature to the final phrase table, with values of 1 (phrase pair doesn't occur in model) and 2.718 (it does).
+ This indicates which model(s) a phrase pair comes from and can be used during MERT to additionally reward/penalize translation models
+
+ lowmem: low memory mode: instead of loading target phrase counts / probability (when required), process the original table and its inversion (source and target swapped) incrementally, then merge the two halves.
+
+ there are a number of further configuration options that you can define, which modify the algorithm for linear interpolation. They have no effect in mode 'counts'
+
+ recompute_lexweights: don't directly interpolate lexical weights, but interpolate word translation probabilities instead and recompute the lexical weights.
+ default: False
+
+ normalized: for interpolation of p(x|y): if True, models with p(y)=0 will be ignored, and probability mass will be distributed among models with p(y)>0.
+ If False, missing entries (x,y) are always interpreted as p(x|y)=0.
+ default: False
+
+ normalize_s_given_t: How do we normalize p(s|t) if 'normalized' is True? Three options:
+ None: don't normalize p(s|t) and lex(s|t) (only p(t|s) and lex(t|s))
+ t: check if p(t)==0 : advantage: theoretically sound; disadvantage: slower (we need to know if t occurs in the model); favours rare target phrases (relative to default choice)
+ s: check if p(s)==0 : advantage: relevant for task; disadvantage: no true probability distributions
+
+ default: None
+
+ normalize-lexical_weights: also normalize lex(s|t) and lex(t|s) if 'normalized' is True.
+ reason why you might want to disable this: lexical weights suffer less from data sparseness than the phrase translation probabilities.
+ default: True
+
+ """
+
+
+ self.mode = mode
+ self.output_file = output_file
+ self.lang_src = lang_src
+ self.lang_target = lang_target
+ self.loaded = defaultdict(int)
+ self.output_lexical = output_lexical
+
+ if reference_interface:
+ self.reference_interface = reference_interface(reference_file)
+
+ if mode not in ['interpolate','loglinear','counts']:
+ sys.stderr.write('Error: mode must be either "interpolate", "loglinear" or "counts"\n')
+ sys.exit()
+
+ models,number_of_features,weights = self._sanity_checks(models,number_of_features,weights)
+
+ self.weights = weights
+ self.models = models
+
+ self.model_interface = model_interface(models,number_of_features)
+
+ if mode == 'interpolate':
+ self.score = score_interpolate
+ self.load_lines = self.model_interface.load_phrase_probabilities
+ elif mode == 'loglinear':
+ self.score = score_loglinear
+ self.load_lines = self.model_interface.load_phrase_probabilities
+ elif mode == 'counts':
+ self.score = score_counts
+ self.load_lines = self.model_interface.load_phrase_counts
+
+ self.flags = copy.copy(self.flags)
+ self.flags.update(flags)
+
+
+ def _sanity_checks(self,models,number_of_features,weights):
+ """check if input arguments make sense (correct number of weights, valid model priorities etc.)
+ is only called on initialization. If you change weights afterwards, better know what you're doing.
+ """
+
+ for (model,priority) in models:
+ assert(priority in self._priorities)
+ models = [(model,self._priorities[p]) for (model,p) in models]
+
+
+ # accept two types of weight declarations: one weight per model, and one weight per model and feature
+ # type one is internally converted to type two: [0.1,0.9] -> [[0.1,0.9],[0.1,0.9],[0.1,0.9],[0.1,0.9]]
+ if weights:
+ if type(weights[0]) == list:
+ assert(len(weights)==number_of_features)
+ for sublist in weights:
+ assert(len(sublist)==len(models))
+
+ else:
+ assert(len(models) == len(weights))
+ weights = [weights for i in range(number_of_features)]
+
+ else:
+ if self.mode == 'loglinear' or self.mode == 'interpolate':
+ weights = [[1/len(models)]*len(models) for i in range(number_of_features)]
+ elif self.mode == 'counts':
+ weights = [[1]*len(models) for i in range(number_of_features)]
+ sys.stderr.write('Warning: No weights defined: initializing with uniform weights\n')
+
+
+ new_weights = normalize_weights(weights,self.mode)
+ if weights != new_weights:
+ if self.mode == 'interpolate' or self.mode == 'loglinear':
+ sys.stderr.write('Warning: weights should sum to 1 - ')
+ elif self.mode == 'counts':
+ sys.stderr.write('Warning: normalizing weights so that first model has weight 1 - ')
+ sys.stderr.write('normalizing to: '+ str(new_weights) +'\n')
+ weights = new_weights
+
+ return models,number_of_features,weights
+
+
+ def _ensure_loaded(self,data):
+ """load data (lexical tables; reference alignment; phrase table), if it isn't already in memory"""
+
+ if 'lexical' in data:
+ self.model_interface.require_alignment = True
+
+ if 'reference' in data and not self.loaded['reference']:
+
+ sys.stderr.write('Loading word pairs from reference set...')
+ self.reference_interface.load_word_pairs(self.lang_src,self.lang_target)
+ sys.stderr.write('done\n')
+ self.loaded['reference'] = 1
+
+ if 'lexical' in data and not self.loaded['lexical']:
+
+ sys.stderr.write('Loading lexical tables...')
+ self.model_interface.load_lexical_tables(self.models,self.mode)
+ sys.stderr.write('done\n')
+ self.loaded['lexical'] = 1
+
+ if 'pt-filtered' in data and not self.loaded['pt-filtered']:
+
+ models_prioritized = [(self.model_interface.open_table(model,'phrase-table'),priority,i) for (model,priority,i) in priority_sort_models(self.models)]
+
+ for model,priority,i in models_prioritized:
+ sys.stderr.write('Loading phrase table ' + str(i) + ' (only data relevant for reference set)')
+ j = 0
+ for line in model:
+ if not j % 1000000:
+ sys.stderr.write('...'+str(j))
+ j += 1
+ line = line.rstrip().split(b' ||| ')
+ self.load_lines(line,priority,i,store='all',filter_by=self.reference_interface.word_pairs,filter_by_src=self.reference_interface.word_source,filter_by_target=self.reference_interface.word_target)
+ sys.stderr.write(' done\n')
+
+ self.loaded['pt-filtered'] = 1
+
+ if 'lexical-filtered' in data and not self.loaded['lexical-filtered']:
+ e2f_filter, f2e_filter = _get_lexical_filter(self.reference_interface,self.model_interface)
+
+ sys.stderr.write('Loading lexical tables (only data relevant for reference set)...')
+ self.model_interface.load_lexical_tables(self.models,self.mode,e2f_filter=e2f_filter,f2e_filter=f2e_filter)
+ sys.stderr.write('done\n')
+ self.loaded['lexical-filtered'] = 1
+
+ if 'pt-target' in data and not self.loaded['pt-target']:
+
+ models_prioritized = [(self.model_interface.open_table(model,'phrase-table'),priority,i) for (model,priority,i) in priority_sort_models(self.models)]
+
+ for model,priority,i in models_prioritized:
+ sys.stderr.write('Loading target information from phrase table ' + str(i))
+ j = 0
+ for line in model:
+ if not j % 1000000:
+ sys.stderr.write('...'+str(j))
+ j += 1
+ line = line.rstrip().split(b' ||| ')
+ self.load_lines(line,priority,i,store='target')
+ sys.stderr.write(' done\n')
+
+ self.loaded['pt-target'] = 1
+
+
+ def _inverse_wrapper(self,weights):
+ """if we want to invert the phrase table to better calcualte p(s|t) and lex(s|t), manage creation, sorting and merging of inverted phrase tables"""
+
+ sys.stderr.write('Processing first table half\n')
+ models = [(self.model_interface.open_table(model,'phrase-table'),priority,i) for (model,priority,i) in priority_sort_models(self.model_interface.models)]
+ pt_half1 = NamedTemporaryFile(prefix='half1',delete=False)
+ self._write_phrasetable(models,pt_half1,weights)
+ pt_half1.seek(0)
+
+ sys.stderr.write('Inverting tables\n')
+ models = [(self.model_interface.create_inverse(self.model_interface.open_table(model,'phrase-table')),priority,i) for (model,priority,i) in priority_sort_models(self.model_interface.models)]
+ sys.stderr.write('Processing second table half\n')
+ pt_half2_inverted = NamedTemporaryFile(prefix='half2',delete=False)
+ self._write_phrasetable(models,pt_half2_inverted,weights,inverted=True)
+ pt_half2_inverted.close()
+ for model,priority,i in models:
+ model.close()
+ os.remove(model.name)
+ pt_half2 = sort_file(pt_half2_inverted.name)
+ os.remove(pt_half2_inverted.name)
+
+ sys.stderr.write('Merging tables: first half: {0} ; second half: {1} ; final table: {2}\n'.format(pt_half1.name,pt_half2.name,self.output_file))
+ output_object = handle_file(self.output_file,'open',mode='w')
+ self.model_interface.merge(pt_half1,pt_half2,output_object,self.mode)
+ os.remove(pt_half1.name)
+ os.remove(pt_half2.name)
+
+ handle_file(self.output_file,'close',output_object,mode='w')
+
+
+ def _write_phrasetable(self,models,output_object,weights,inverted=False):
+ """Incrementally load phrase tables, calculate score for increment and write it to output_object"""
+
+ # define which information we need to store from the phrase table
+ # possible flags: 'all', 'target', 'source' and 'pairs'
+ # interpolated models without re-normalization only need 'pairs', otherwise 'all' is the correct choice
+ store_flag = 'all'
+ if self.mode == 'interpolate' and not self.flags['normalized']:
+ store_flag = 'pairs'
+
+ i = 0
+ sys.stderr.write('Incrementally loading and processing phrase tables...')
+ for block in self.model_interface.traverse_incrementally('phrase-table',models,self.load_lines,store_flag,inverted=inverted):
+
+ for src in sorted(self.model_interface.phrase_pairs, key = lambda x: x + b' |'):
+ for target in sorted(self.model_interface.phrase_pairs[src], key = lambda x: x + b' |'):
+
+ if not i % 1000000:
+ sys.stderr.write(str(i) + '...')
+ i += 1
+
+ features = self.score(weights,src,target,self.model_interface,self.flags)
+ outline = self.model_interface.write_phrase_table(src,target,weights,features,self.mode, self.flags)
+ output_object.write(outline)
+ sys.stderr.write('done\n')
+
+
+ def combine_given_weights(self,weights=None):
+ """write a new phrase table, based on existing weights"""
+
+ if not weights:
+ weights = self.weights
+
+ data = []
+
+ if self.mode == 'counts':
+ data.append('lexical')
+ if not self.flags['lowmem']:
+ data.append('pt-target')
+
+ elif self.mode == 'interpolate':
+ if self.flags['recompute_lexweights']:
+ data.append('lexical')
+ if self.flags['normalized'] and self.flags['normalize_s_given_t'] == 't' and not self.flags['lowmem']:
+ data.append('pt-target')
+
+ self._ensure_loaded(data)
+
+ if self.flags['lowmem'] and (self.mode == 'counts' or self.flags['normalized'] and self.flags['normalize_s_given_t'] == 't'):
+ self._inverse_wrapper(weights)
+ else:
+ models = [(self.model_interface.open_table(model,'phrase-table'),priority,i) for (model,priority,i) in priority_sort_models(self.model_interface.models)]
+ output_object = handle_file(self.output_file,'open',mode='w')
+ self._write_phrasetable(models,output_object,weights)
+ handle_file(self.output_file,'close',output_object,mode='w')
+
+ if self.output_lexical:
+ sys.stderr.write('Writing lexical tables\n')
+ self._ensure_loaded(['lexical'])
+ self.model_interface.write_lexical_file('e2f',self.output_lexical,weights[1],self.mode)
+ self.model_interface.write_lexical_file('f2e',self.output_lexical,weights[3],self.mode)
+
+
+ def combine_given_tuning_set(self):
+ """write a new phrase table, using the weights that minimize cross-entropy on a tuning set"""
+
+ data = ['reference','pt-filtered']
+
+ if self.mode == 'counts' or (self.mode == 'interpolate' and self.flags['recompute_lexweights']):
+ data.append('lexical-filtered')
+
+ self._ensure_loaded(data)
+
+ best_weights,best_cross_entropy = optimize_cross_entropy(self.model_interface,self.reference_interface,self.weights,self.score,self.mode,self.flags)
+ sys.stderr.write('Best weights: ' + str(best_weights) + '\n')
+ sys.stderr.write('Cross entropies: ' + str(best_cross_entropy) + '\n')
+
+ self.loaded['pt-filtered'] = False # phrase table will be overwritten
+ self.combine_given_weights(weights=best_weights)
+
+
+
+ def combine_reordering_tables(self,weights=None):
+ """write a new reordering table, based on existing weights."""
+
+ if not weights:
+ weights = self.weights
+
+ data = []
+
+ if self.mode != 'interpolate':
+ sys.stderr.write('Error: only linear interpolation is supported for reordering model combination\n')
+
+ output_object = handle_file(self.output_file,'open',mode='w')
+ models = [(self.model_interface.open_table(model,'reordering-table'),priority,i) for (model,priority,i) in priority_sort_models(self.models)]
+
+ i = 0
+
+ sys.stderr.write('Incrementally loading and processing reordering tables...')
+ for block in self.model_interface.traverse_incrementally('reordering-table',models,self.model_interface.load_reordering_probabilities,'pairs'):
+
+ for src in sorted(self.model_interface.reordering_pairs):
+ for target in sorted(self.model_interface.reordering_pairs[src]):
+ if not i % 1000000:
+ sys.stderr.write(str(i) + '...')
+ i += 1
+
+ features = score_interpolate_reordering(weights,src,target,self.model_interface)
+ outline = self.model_interface.write_reordering_table(src,target,features)
+ output_object.write(outline)
+ sys.stderr.write('done\n')
+
+
+ handle_file(self.output_file,'close',output_object,mode='w')
+
+
+ def compare_cross_entropies(self):
+ """print cross-entropies for each model/feature, using the intersection of phrase pairs.
+ analysis tool.
+ """
+
+ self.flags['compare_cross-entropies'] = True
+
+ data = ['reference','pt-filtered']
+
+ if self.mode == 'counts' or (self.mode == 'interpolate' and self.flags['recompute_lexweights']):
+ data.append('lexical-filtered')
+
+ self._ensure_loaded(data)
+
+ results, (intersection,total_pairs,oov2) = cross_entropy(self.model_interface,self.reference_interface,self.weights,self.score,self.mode,self.flags)
+
+ padding = 90
+
+ print('\nResults of model comparison\n')
+ print('{0:<{padding}}: {1}'.format('phrase pairs in reference (tokens)',total_pairs, padding=padding))
+ print('{0:<{padding}}: {1}'.format('phrase pairs in model intersection (tokens)',intersection, padding=padding))
+ print('{0:<{padding}}: {1}\n'.format('phrase pairs in model union (tokens)',total_pairs-oov2, padding=padding))
+
+ for i,(cross_entropy_st,cross_entropy_lex_st,cross_entropy_ts,cross_entropy_lex_ts,other_translations,oov,ignored,n,total_pairs) in enumerate(results):
+ print('model ' +str(i))
+ print('{0:<{padding}}: {1}'.format('cross-entropy p(s|t)', cross_entropy_st, padding=padding))
+ print('{0:<{padding}}: {1}'.format('cross-entropy lex(s|t)', cross_entropy_lex_st, padding=padding))
+ print('{0:<{padding}}: {1}'.format('cross-entropy p(t|s)', cross_entropy_ts, padding=padding))
+ print('{0:<{padding}}: {1}'.format('cross-entropy lex(t|s)', cross_entropy_lex_ts, padding=padding))
+ print('{0:<{padding}}: {1}'.format('phrase pairs in model (tokens)', n+ignored, padding=padding))
+ print('{0:<{padding}}: {1}'.format('phrase pairs in model, but not in intersection (tokens)', ignored, padding=padding))
+ print('{0:<{padding}}: {1}'.format('phrase pairs in union, but not in model (but source phrase is) (tokens)', other_translations, padding=padding))
+ print('{0:<{padding}}: {1}\n'.format('phrase pairs in union, but source phrase not in model (tokens)', oov, padding=padding))
+
+ self.flags['compare_cross-entropies'] = False
+
+ return results, (intersection,total_pairs,oov2)
+
+
+ def compute_cross_entropy(self):
+ """return cross-entropy for a tuning set, a set of models and a set of weights.
+ analysis tool.
+ """
+
+ data = ['reference','pt-filtered']
+
+ if self.mode == 'counts' or (self.mode == 'interpolate' and self.flags['recompute_lexweights']):
+ data.append('lexical-filtered')
+
+ self._ensure_loaded(data)
+
+ current_cross_entropy = cross_entropy(self.model_interface,self.reference_interface,self.weights,self.score,self.mode,self.flags)
+ sys.stderr.write('Cross entropy: ' + str(current_cross_entropy) + '\n')
+ return current_cross_entropy
+
+
+ def return_best_cross_entropy(self):
+ """return the set of weights and cross-entropy that is optimal for a tuning set and a set of models."""
+
+ data = ['reference','pt-filtered']
+
+ if self.mode == 'counts' or (self.mode == 'interpolate' and self.flags['recompute_lexweights']):
+ data.append('lexical-filtered')
+
+ self._ensure_loaded(data)
+
+ best_weights,best_cross_entropy = optimize_cross_entropy(self.model_interface,self.reference_interface,self.weights,self.score,self.mode,self.flags)
+
+ sys.stderr.write('Best weights: ' + str(best_weights) + '\n')
+ sys.stderr.write('Cross entropies: ' + str(best_cross_entropy) + '\n')
+ return best_weights,best_cross_entropy
+
+
+def test():
+ """test (and illustrate) the functionality of the program based on two test phrase tables and a small reference set,"""
+
+ # linear interpolation of two models, with fixed weights. Output uses vocabulary of model1 (since model2 is supplementary)
+ # command line: (currently not possible to define supplementary models through command line)
+ sys.stderr.write('Regression test 1\n')
+ Combiner = Combine_TMs([[os.path.join('test','model1'),'primary'],[os.path.join('test','model2'),'supplementary']],[0.5,0.5],os.path.join('test','phrase-table_test1'))
+ Combiner.combine_given_weights()
+
+ # linear interpolation of two models, with fixed weights (but different for each feature).
+ # command line: python tmcombine.py combine_given_weights test/model1 test/model2 -w "0.1,0.9;0.1,1;0.2,0.8;0.5,0.5" -o test/phrase-table_test2
+ sys.stderr.write('Regression test 2\n')
+ Combiner = Combine_TMs([[os.path.join('test','model1'),'primary'],[os.path.join('test','model2'),'primary']],[[0.1,0.9],[0.1,1],[0.2,0.8],[0.5,0.5]],os.path.join('test','phrase-table_test2'))
+ Combiner.combine_given_weights()
+
+ # count-based combination of two models, with fixed weights
+ # command line: python tmcombine.py combine_given_weights test/model1 test/model2 -w "0.1,0.9;0.1,1;0.2,0.8;0.5,0.5" -o test/phrase-table_test3 -m counts
+ sys.stderr.write('Regression test 3\n')
+ Combiner = Combine_TMs([[os.path.join('test','model1'),'primary'],[os.path.join('test','model2'),'primary']],[[0.1,0.9],[0.1,1],[0.2,0.8],[0.5,0.5]],os.path.join('test','phrase-table_test3'),mode='counts')
+ Combiner.combine_given_weights()
+
+ # output phrase table should be identical to model1
+ # command line: python tmcombine.py combine_given_weights test/model1 -w 1 -o test/phrase-table_test4 -m counts
+ sys.stderr.write('Regression test 4\n')
+ Combiner = Combine_TMs([[os.path.join('test','model1'),'primary']],[1],os.path.join('test','phrase-table_test4'),mode='counts')
+ Combiner.combine_given_weights()
+
+ # count-based combination of two models with weights set through perplexity minimization
+ # command line: python tmcombine.py combine_given_tuning_set test/model1 test/model2 -o test/phrase-table_test5 -m counts -r test/extract
+ sys.stderr.write('Regression test 5\n')
+ Combiner = Combine_TMs([[os.path.join('test','model1'),'primary'],[os.path.join('test','model2'),'primary']],output_file=os.path.join('test','phrase-table_test5'),mode='counts',reference_file='test/extract')
+ Combiner.combine_given_tuning_set()
+
+ # loglinear combination of two models with fixed weights
+ # command line: python tmcombine.py combine_given_weights test/model1 test/model2 -w 0.1,0.9 -o test/phrase-table_test6 -m loglinear
+ sys.stderr.write('Regression test 6\n')
+ Combiner = Combine_TMs([[os.path.join('test','model1'),'primary'],[os.path.join('test','model2'),'primary']],weights=[0.1,0.9],output_file=os.path.join('test','phrase-table_test6'),mode='loglinear')
+ Combiner.combine_given_weights()
+
+ # cross-entropy analysis of two models through a reference set
+ # command line: python tmcombine.py compare_cross_entropies test/model1 test/model2 -m counts -r test/extract
+ sys.stderr.write('Regression test 7\n')
+ Combiner = Combine_TMs([[os.path.join('test','model1'),'primary'],[os.path.join('test','model2'),'primary']],mode='counts',reference_file='test/extract')
+ f = open(os.path.join('test','phrase-table_test7'),'w')
+ f.write(str(Combiner.compare_cross_entropies()))
+ f.close()
+
+ # maximum a posteriori combination of two models (Bacchiani et al. 2004; Foster et al. 2010) with weights set through cross-entropy minimization
+ # command line: (currently not possible through command line)
+ sys.stderr.write('Regression test 8\n')
+ Combiner = Combine_TMs([[os.path.join('test','model1'),'primary'],[os.path.join('test','model2'),'map']],output_file=os.path.join('test','phrase-table_test8'),mode='counts',reference_file='test/extract')
+ Combiner.combine_given_tuning_set()
+
+
+#convert weight vector passed as a command line argument
+class to_list(argparse.Action):
+ def __call__(self, parser, namespace, weights, option_string=None):
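+ # Examples (matching the -w help text below): "0.1,0.9" becomes [0.1, 0.9];
+ # "0.1,0.9;0.2,0.8" becomes [[0.1, 0.9], [0.2, 0.8]].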
+ if ';' in weights:
+ values = [[float(x) for x in vector.split(',')] for vector in weights.split(';')]
+ else:
+ values = [float(x) for x in weights.split(',')]
+ setattr(namespace, self.dest, values)
+
+
+def parse_command_line():
+ parser = argparse.ArgumentParser(description='Combine translation models. Check DOCSTRING of the class Combine_TMs() and its methods for a more in-depth documentation and additional configuration options not available through the command line. The function test() shows examples.')
+
+ parser.add_argument('action', metavar='ACTION', choices=["combine_given_weights","combine_given_tuning_set","combine_reordering_tables","compute_cross_entropy","return_best_cross_entropy","compare_cross_entropies"],
+ help='What you want to do with the models. One of %(choices)s.')
+
+ parser.add_argument('model', metavar='DIRECTORY', nargs='+',
+ help='Model directory. Assumes default Moses structure (i.e. path to phrase table and lexical tables).')
+
+ parser.add_argument('-w', '--weights', dest='weights', action=to_list,
+ default=None,
+ help='weight vector. Format 1: single vector, one weight per model. Example: \"0.1,0.9\" ; format 2: one vector per feature, one weight per model: \"0.1,0.9;0.5,0.5;0.4,0.6;0.2,0.8\"')
+
+ parser.add_argument('-m', '--mode', type=str,
+ default="interpolate",
+ choices=["counts","interpolate","loglinear"],
+ help='basic mixture-model algorithm. Default: %(default)s. Note: depending on mode and additional configuration, additional statistics are needed. Check docstring documentation of Combine_TMs() for more info.')
+
+ parser.add_argument('-r', '--reference', type=str,
+ default=None,
+ help='File containing reference phrase pairs for cross-entropy calculation. Default interface expects \'path/model/extract.gz\' that is produced by training a model on the reference (i.e. development) corpus.')
+
+ parser.add_argument('-o', '--output', type=str,
+ default="-",
+ help='Output file (phrase table). If not specified, model is written to standard output.')
+
+ parser.add_argument('--output-lexical', type=str,
+ default=None,
+ help=('Not only create a combined phrase table, but also combined lexical tables. Writes to OUTPUT_LEXICAL.e2f and OUTPUT_LEXICAL.f2e, or OUTPUT_LEXICAL.counts.e2f in mode \'counts\'.'))
+
+ parser.add_argument('--lowmem', action="store_true",
+ help=('Low memory mode: requires two passes (and sorting in between) to combine a phrase table, but loads less data into memory. Only relevant for mode "counts" and some configurations of mode "interpolate".'))
+
+ parser.add_argument('--normalized', action="store_true",
+ help=('for each phrase pair x,y: ignore models with p(y)=0, and distribute probability mass among models with p(y)>0. (default: missing entries (x,y) are always interpreted as p(x|y)=0). Only relevant in mode "interpolate".'))
+
+ parser.add_argument('--recompute_lexweights', action="store_true",
+ help=('don\'t directly interpolate lexical weights, but interpolate word translation probabilities instead and recompute the lexical weights. Only relevant in mode "interpolate".'))
+
+ return parser.parse_args()
+
+if __name__ == "__main__":
+
+ if len(sys.argv) < 2:
+ sys.stderr.write("no command specified. use option -h for usage instructions\n")
+
+ elif sys.argv[1] == "test":
+ test()
+
+ else:
+ args = parse_command_line()
+ #initialize
+ combiner = Combine_TMs([(m,'primary') for m in args.model],weights=args.weights,mode=args.mode,output_file=args.output,reference_file=args.reference,output_lexical=args.output_lexical,lowmem=args.lowmem,normalized=args.normalized,recompute_lexweights=args.recompute_lexweights)
+ # execute the requested method
+ getattr(combiner, args.action)()
diff --git a/contrib/tmcombine/train_model.patch b/contrib/tmcombine/train_model.patch
new file mode 100644
index 000000000..d422a1628
--- /dev/null
+++ b/contrib/tmcombine/train_model.patch
@@ -0,0 +1,24 @@
+--- train-model.perl 2011-11-01 15:17:04.763230934 +0100
++++ train-model.perl 2011-11-01 15:17:00.033229220 +0100
+@@ -1185,15 +1185,21 @@
+
+ open(F2E,">$lexical_file.f2e") or die "ERROR: Can't write $lexical_file.f2e";
+ open(E2F,">$lexical_file.e2f") or die "ERROR: Can't write $lexical_file.e2f";
++ open(F2E2,">$lexical_file.counts.f2e") or die "ERROR: Can't write $lexical_file.counts.f2e";
++ open(E2F2,">$lexical_file.counts.e2f") or die "ERROR: Can't write $lexical_file.counts.e2f";
+
+ foreach my $f (keys %WORD_TRANSLATION) {
+ foreach my $e (keys %{$WORD_TRANSLATION{$f}}) {
+ printf F2E "%s %s %.7f\n",$e,$f,$WORD_TRANSLATION{$f}{$e}/$TOTAL_FOREIGN{$f};
+ printf E2F "%s %s %.7f\n",$f,$e,$WORD_TRANSLATION{$f}{$e}/$TOTAL_ENGLISH{$e};
++ printf F2E2 "%s %s %i %i\n",$e,$f,$WORD_TRANSLATION{$f}{$e},$TOTAL_FOREIGN{$f};
++ printf E2F2 "%s %s %i %i\n",$f,$e,$WORD_TRANSLATION{$f}{$e},$TOTAL_ENGLISH{$e};
+ }
+ }
+ close(E2F);
+ close(F2E);
++ close(E2F2);
++ close(F2E2);
+ print STDERR "Saved: $lexical_file.f2e and $lexical_file.e2f\n";
+ }
+