
Docent Configuration


Docent's configuration is specified in an XML file with a root tag <docent> enclosing five subsections named <random>, <state-generator>, <search>, <models> and <weights>, all of which are mandatory and must be present in this order. Some of the tags described below can take extra parameters, which can be supplied using <p name="..."> subtags. You can find some example configuration files in the tests/config subdirectory.
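
A minimal skeleton, with placeholder values, looks roughly like this; the algorithm attribute on <search> and the exact <model> and <weight> tag syntax are assumptions modelled on the example files, so consult tests/config for the authoritative form:

```xml
<docent>
  <random/>
  <state-generator>
    <initial-state type="monotonic"/>
    <operation type="change-phrase-translation" weight="1"/>
  </state-generator>
  <search algorithm="simulated-annealing">
    <p name="schedule">hill-climbing</p>
  </search>
  <models>
    <model id="wp" type="word-penalty"/>
  </models>
  <weights>
    <weight model="wp" score="0">-1</weight>
  </weights>
</docent>
```

Each of the sections is described in detail below.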

The <random> section

The <random> section controls the initialisation of the random number generator. Typically, it will be left empty (<random/>). This means that the random number generator is seeded with an unpredictable value read from /dev/urandom or, if that fails, from the system clock. The actual seed value that was used is output in the decoder log. If the <random> tag is non-empty, its contents must be an unsigned 32-bit integer, which is used to seed the random number generator. This is useful for obtaining reproducible runs when debugging the decoder.
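
For example, to seed the generator with a fixed value (the value itself is arbitrary):

```xml
<random>12345</random>
```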

The <state-generator> section

The <state-generator> section controls how the decoder state is initialised (<initial-state> tag) and which operations can be applied to the state in each step (<operation> tags). The initialisation options and available operations are described in our EMNLP 2012 paper.

State initialisation

  • <initial-state type="monotonic"> specifies uninformed monotonic initialisation with randomly selected phrase translations.
  • <initial-state type="beam-search"> requests state initialisation by dynamic programming beam search. It takes one mandatory parameter called ini, a pointer to a moses.ini file.

State operations

State operations are specified with a set of <operation type="..." weight="..."> tags. The operations of type change-phrase-translation, swap-phrases and resegment are described in the paper. The weight attribute specifies the relative probability with which the operation is selected. Some operations take additional parameters controlling the range over which they are likely to be applied; stick to the defaults if unsure.
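
For example, a state generator using all three operations might be configured as follows; the weights are illustrative, not recommended values:

```xml
<operation type="change-phrase-translation" weight="0.8"/>
<operation type="swap-phrases" weight="0.1"/>
<operation type="resegment" weight="0.1"/>
```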

The <search> section

The <search> section configures the search algorithm to be used. Docent currently implements two search algorithms, simulated-annealing and local-beam-search. Simulated annealing search takes three parameters named max-steps, max-rejected and schedule. The first two parameters define the step limit and the rejection limit discussed in our paper. When the cooling schedule hill-climbing is used, simulated annealing is equivalent to local beam search with beam size 1. This is the setup we currently recommend using. Some other cooling schedules are implemented, but difficult to parametrise and not well tested.
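
A minimal sketch of the recommended hill-climbing setup; the algorithm attribute is an assumption modelled on the example files, and the limit values are placeholders to be tuned for your data:

```xml
<search algorithm="simulated-annealing">
  <p name="max-steps">262144</p>
  <p name="max-rejected">16384</p>
  <p name="schedule">hill-climbing</p>
</search>
```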

The <models> section

In this section, the feature functions are defined. For users familiar with Moses, the options in the example files should be fairly self-explanatory. Each model needs an id attribute by which it can be referred to in the <weights> section. The following models are currently supported (a combined configuration sketch follows the list):

  • geometric-distortion-model: The standard simple unlexicalised distortion cost model. The model provides a second score that counts violations of a given maximum distortion distance, which can be used to implement a distortion limit like the one commonly used in DP beam search.
  • word-penalty: Word count feature.
  • oov-penalty: Out-of-vocabulary word count feature. Note that unlike Moses, Docent needs this to be explicitly specified if you want to use it.
  • ngram-model: Standard n-gram language model. The parameter lm-file specifies the language model file. To use a language model over annotations instead of surface words, set the parameter annotation-level (see Annotated models below). Accepted formats include models built with the KenLM or SRILM language modelling toolkits.
  • phrase-table: The phrase table. The parameter file specifies the location of the phrase table. Set the parameter load-alignments to true if your binary phrase table contains phrase-internal word alignments.
  • semantic-space-language-model: The semantic language model described in our EMNLP 2012 paper.
  • sentence-parity-model: A proof-of-concept model that enforces sentence length parity; it exists mainly to demonstrate how to implement a simple feature function.
  • bleu-model: A model which maximizes the BLEU score of the output based on a set of reference translations. The reference-file parameter specifies where the reference translations are found (plain text format). Note that currently only one reference translation per sentence is supported.
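
A hypothetical <models> section combining several of the feature functions above; the <model> tag name is an assumption modelled on the example files in tests/config, and the id values and file paths are placeholders:

```xml
<models>
  <model id="lm" type="ngram-model">
    <p name="lm-file">/path/to/model.blm</p>
  </model>
  <model id="ttable" type="phrase-table">
    <p name="file">/path/to/phrase-table</p>
  </model>
  <model id="dist" type="geometric-distortion-model"/>
  <model id="wp" type="word-penalty"/>
</models>
```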

The <weights> section

This section contains a feature weight for each model score. Models are referred to by the id attributes defined in the <models> section. If a model produces multiple scores, these are numbered starting from zero.
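
Continuing the hypothetical <models> sketch above, the corresponding weights might look as follows. The <weight> tag syntax is an assumption; check the files in tests/config for the exact form. The two geometric-distortion-model entries illustrate how the scores of a multi-score model are numbered from zero:

```xml
<weights>
  <weight model="lm" score="0">0.5</weight>
  <weight model="ttable" score="0">0.2</weight>
  <weight model="dist" score="0">0.1</weight>
  <weight model="dist" score="1">-1</weight>
  <weight model="wp" score="0">-1</weight>
</weights>
```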

Annotated models

To use an annotated model with Docent, you must first create a phrase table that contains annotations. For example, given a parallel corpus corpus.xx corpus.yy, where xx is the source language and yy the target, first annotate corpus.yy by placing a '|' symbol after each token, followed by the annotation (e.g. the POS tag); Docent currently only handles annotations on the target side. Then train a model in Moses, making sure to pass the annotated corpus as well as the flag --translation-factors 0-0,1. The resulting phrase table should then be filtered and binarized and specified in the phrase-table model in the configuration file as normal. In addition, the parameter annotation-count must be set to 1 within the phrase-table model, as shown in the sketch below.
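
For instance, a line of the annotated target-side corpus could look like this, using POS tags as the annotation (the tags themselves are illustrative):

```
the|DT cat|NN sat|VBD on|IN the|DT mat|NN
```

The corresponding phrase-table model definition would then include the annotation-count parameter (the <model> tag and the file path follow the hypothetical sketch above):

```xml
<model id="ttable" type="phrase-table">
  <p name="file">/path/to/phrase-table</p>
  <p name="annotation-count">1</p>
</model>
```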

A language model should then be created to make use of the annotations. If using POS tags, for example, this could be achieved by extracting tags from a large tagged monolingual corpus, then running standard language model software such as KenLM to create a model (this should also be binarized to speed up simulations). In the Docent configuration file, a new ngram-model should then be specified, with the lm-file parameter set correctly and annotation-level set to 0.
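
For example, the additional language model could be declared as follows; the id and file name are placeholders:

```xml
<model id="poslm" type="ngram-model">
  <p name="lm-file">/path/to/pos.blm</p>
  <p name="annotation-level">0</p>
</model>
```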