Docent Configuration
Docent's configuration is specified in an XML file with a root tag `<docent>` enclosing five subsections named `<random>`, `<state-generator>`, `<search>`, `<models>` and `<weights>`, all of which are mandatory and must appear in this order. Some of the tags described below can take extra parameters, which can be supplied using `<p name="...">` subtags. You can find some example configuration files in the `tests/config` subdirectory.
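For orientation, a minimal skeleton of such a file might look as follows. This is a sketch only: the element contents are elided, and the exact syntax of each section is covered below and in the files under `tests/config`.

```xml
<docent>
  <random/>  <!-- empty: seed from /dev/urandom or the system clock -->
  <state-generator>
    <!-- initial state and operations go here -->
  </state-generator>
  <search>
    <!-- search algorithm and parameters go here -->
  </search>
  <models>
    <!-- feature function definitions go here -->
  </models>
  <weights>
    <!-- one weight per feature score goes here -->
  </weights>
</docent>
```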
The `<random>` section controls the initialisation of the random number generator. Typically, it will be left empty (`<random/>`). This means that the random number generator will be seeded with a non-predictable value read from `/dev/urandom` or, if this fails, from the system clock. The actual seed value that was used is output in the decoder log. If the `<random>` tag is non-empty, its contents must be an unsigned 32-bit integer which will be used to seed the random number generator. This is useful for starting predictable runs when debugging the decoder.
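For example, a reproducible debugging run could pin the seed to a fixed value (the number here is arbitrary):

```xml
<random>12345</random>
```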
The `<state-generator>` section controls how the decoder state is initialised (`<initial-state>` tag) and which operations can be applied to the state in each step (`<operation>` tags). The initialisation options and available operations are described in our EMNLP 2012 paper.
- `<initial-state type="monotonic">` specifies uninformed monotonic initialisation with randomly selected phrase translations.
- `<initial-state type="beam-search">` requests state initialisation by dynamic programming beam search. It takes one mandatory parameter called `ini`, a pointer to a `moses.ini` file (see the sketch below).
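A beam-search initialisation could be specified like this, using the generic `<p name="...">` parameter syntax described above; the path is a placeholder:

```xml
<initial-state type="beam-search">
  <p name="ini">/path/to/moses.ini</p>
</initial-state>
```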
State operations are specified with a set of `<operation type="..." weight="...">` tags. The operations of type `change-phrase-translation`, `swap-phrases` and `resegment` are described in the paper. The `weight` attribute contains the probability with which the operation will be selected. Some of the operations take additional parameters controlling the range over which they are likely to be applied. Stick to the defaults if unsure.
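A possible set of operation definitions is shown below. The weights are illustrative values, not recommendations; they determine how often each operation is tried:

```xml
<operation type="change-phrase-translation" weight="0.8"/>
<operation type="swap-phrases" weight="0.1"/>
<operation type="resegment" weight="0.1"/>
```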
The `<search>` section configures the search algorithm to be used. Docent currently implements two search algorithms, `simulated-annealing` and `local-beam-search`. Simulated annealing search takes three parameters named `max-steps`, `max-rejected` and `schedule`. The first two parameters define the step limit and the rejection limit discussed in our paper. When the cooling schedule `hill-climbing` is used, simulated annealing is equivalent to local beam search with beam size 1. This is the setup we currently recommend. Some other cooling schedules are implemented, but they are difficult to parametrise and not well tested.
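Along these lines, a `<search>` section for the recommended setup might look as follows. The `algorithm` attribute and the limit values are assumptions; compare with the files in `tests/config` for the exact syntax:

```xml
<search algorithm="simulated-annealing">
  <p name="max-steps">10000</p>      <!-- step limit (illustrative) -->
  <p name="max-rejected">1000</p>    <!-- rejection limit (illustrative) -->
  <p name="schedule">hill-climbing</p>
</search>
```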
The `<models>` section defines the feature functions. For users familiar with Moses, the options in the example files should be fairly self-explanatory. Each model needs an `id` attribute by which it can be referred to in the `<weights>` section. The following models are currently supported:
- `geometric-distortion-model`: The standard simple unlexicalised distortion cost model. The model provides a second score that counts violations of a given maximum distortion distance, which can be used to implement a distortion limit like the one commonly used in DP beam search.
- `word-penalty`: Word count feature.
- `oov-penalty`: Out-of-vocabulary word count feature. Note that unlike Moses, Docent needs this to be specified explicitly if you want to use it.
- `ngram-model`: Standard n-gram language model. The parameter `lm-file` specifies the language model file. Set the parameter `annotation-level` to specify a language model based on annotations. Acceptable formats include models built with the KenLM or SRILM language modelling toolkits.
- `phrase-table`: The phrase table. The parameter `file` specifies the location of the phrase table. Set the parameter `load-alignments` to `true` if your binary phrase table contains phrase-internal word alignments.
- `semantic-space-language-model`: The semantic language model described in our EMNLP 2012 paper.
- `sentence-parity-model`: A proof-of-concept model enforcing sentence length parity, which exists mainly to demonstrate how to implement a simple feature function.
- `bleu-model`: A model which maximises the BLEU score of the output based on a set of reference translations. The `reference-file` parameter specifies where the reference translations are found (plain text format). Note that currently only one reference translation per sentence is supported.
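As a sketch, two model definitions might look like this. The `<model>` element with `type` and `id` attributes is assumed from the example configuration files, and the file paths are placeholders:

```xml
<models>
  <model type="phrase-table" id="pt">
    <p name="file">/path/to/phrase-table</p>
  </model>
  <model type="ngram-model" id="lm">
    <p name="lm-file">/path/to/model.blm</p>
  </model>
</models>
```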
The `<weights>` section contains the feature weight for each score. Features are referred to by their `id` attributes. If a model produces multiple scores, these are numbered starting from zero.
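One plausible shape for this section, assuming a `<weight>` element keyed by model `id` and score index (both the element syntax and the values here are assumptions; verify against the example files):

```xml
<weights>
  <weight model="pt" score="0">0.20</weight>  <!-- first score of model "pt" -->
  <weight model="pt" score="1">0.15</weight>  <!-- second score of model "pt" -->
  <weight model="lm">0.50</weight>            <!-- single-score model -->
</weights>
```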
To use an annotated model with Docent, you must first create a phrase table that contains annotations. An example of how to do this is as follows:
Given a parallel corpus `corpus.xx`/`corpus.yy`, where `xx` represents the source language and `yy` the target, first annotate `corpus.yy` by placing a `|` symbol after each token, followed by the annotation (e.g. the POS tag); an example line is shown below. Docent currently only handles annotations on the target side. Then train a model in Moses, making sure to pass the annotated corpus as well as the flag `--translation-factors 0-0,1`. The resulting phrase table should then be filtered and binarised and specified in the `phrase-table` model in the configuration file as normal. The parameter `annotation-count` must also be set to 1 within the `phrase-table` model.
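For illustration, a POS-annotated target-side line would look like this (the tags are examples only):

```
the|DT cat|NN sat|VBD on|IN the|DT mat|NN
```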
A language model should then be created to make use of the annotations. If using POS tags, for example, this could be achieved by extracting the tags from a large tagged monolingual corpus and then running standard language model software such as KenLM to create a model (this should also be binarised to speed up simulations). In the Docent configuration file, a new `ngram-model` should then be specified, with the `lm-file` parameter set correctly and `annotation-level` set to 0.
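Putting the pieces together, the two model entries involved could look like this. The `<model>` syntax is assumed as in the sketch above, and the ids and paths are placeholders:

```xml
<model type="phrase-table" id="pt">
  <p name="file">/path/to/annotated-phrase-table</p>
  <p name="annotation-count">1</p>   <!-- phrase table carries one annotation layer -->
</model>
<model type="ngram-model" id="pos-lm">
  <p name="lm-file">/path/to/pos.blm</p>
  <p name="annotation-level">0</p>   <!-- LM reads the first annotation layer -->
</model>
```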