The Joey NMT framework is developed for educational purposes. It aims to be a clean and minimalistic code base that helps novices find answers to the following questions.
- How to implement classic NMT architectures (RNN and Transformer) in PyTorch?
- What are the building blocks of these architectures and how do they interact?
- How to modify these blocks (e.g. deeper, wider, ...)?
- How to modify the training procedure (e.g. add a regularizer)?
In contrast to other NMT frameworks, we do not aim for state-of-the-art results or for speed through engineering or training tricks, since these often go hand in hand with an increase in code complexity and a decrease in readability.
However, Joey NMT re-implements baselines from major publications.
Joey NMT is developed by Joost Bastings (University of Amsterdam) and Julia Kreutzer (Heidelberg University).
We aim to implement the following features (aka the minimalist toolkit of NMT):
- Recurrent Encoder-Decoder with GRUs or LSTMs
- Transformer Encoder-Decoder
- Attention Types: MLP, Dot, Multi-Head, Bilinear
- Word-, BPE- and character-based input handling
- BLEU, ChrF evaluation
- Beam search with length penalty and greedy decoding
- Customizable initialization
- Attention visualization
- Learning curve plotting
[Work in progress: Transformer, Multi-Head and Dot still missing.]
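The alpha values reported alongside beam search results in the benchmarks control the strength of the length penalty. As a rough illustration, the sketch below assumes the GNMT-style penalty ((5 + length) / 6) ** alpha, which is a common choice in NMT toolkits; whether Joey NMT uses exactly this form should be checked in its search code. The function names and the two example hypotheses are hypothetical.

```python
def length_penalty(length: int, alpha: float) -> float:
    """GNMT-style length penalty (assumed form, not copied from Joey NMT)."""
    return ((5.0 + length) / 6.0) ** alpha


def rescore(log_prob: float, length: int, alpha: float) -> float:
    """Normalize a hypothesis log-probability by the length penalty."""
    return log_prob / length_penalty(length, alpha)


# Two made-up finished hypotheses: (sum of token log-probs, length in tokens).
hypotheses = [(-4.0, 3), (-5.0, 7)]

# With alpha=0.0 the penalty is 1, so the raw score decides (shorter one wins here).
best_raw = max(hypotheses, key=lambda h: rescore(h[0], h[1], alpha=0.0))
# With alpha=1.0 the longer hypothesis is preferred after normalization.
best_norm = max(hypotheses, key=lambda h: rescore(h[0], h[1], alpha=1.0))
```

Larger alpha values favour longer outputs, which counteracts the tendency of beam search to prefer short hypotheses.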
In order to keep the code clean and readable, we make use of:
- Style checks: pylint with (mostly) PEP8 conventions, see .pylintrc.
- Typing: Every function has documented input types.
- Docstrings: Every function, class and module has docstrings describing their purpose and usage.
- Unittests: Every module has unit tests, defined in test/unit/. Travis CI runs the tests and pylint on every push to ensure the repository stays clean.
Joey NMT is built on PyTorch and torchtext for Python >= 3.5.
- Clone this repository:
git clone https://github.com/joeynmt/joeynmt.git
- Install joeynmt and its requirements:
cd joeynmt
pip3 install .
(you might want to add --user for a local installation).
- Run the unit tests:
python3 -m unittest
- Install torchtext from sources.
For details, follow the tutorial in the docs.
For training a translation model, you need parallel data, i.e. a collection of source sentences and reference translations that are aligned sentence-by-sentence and stored in two files, such that each line in the reference file is the translation of the same line in the source file.
Before training a model on it, parallel data is most commonly filtered by length ratio, tokenized and true- or lowercased.
The Moses toolkit provides a set of useful scripts for this purpose.
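The preparation steps above can be sketched in a few lines of Python. This is only an illustration, not part of Joey NMT or Moses: the function names, the whitespace-based length count and the ratio threshold of 2.0 are assumptions, and real pipelines would tokenize properly first.

```python
import os
import tempfile
from typing import List, Tuple


def read_parallel(src_path: str, trg_path: str) -> List[Tuple[str, str]]:
    """Read two line-aligned files and return (source, target) pairs."""
    with open(src_path, encoding="utf-8") as src_file:
        src_lines = [line.rstrip("\n") for line in src_file]
    with open(trg_path, encoding="utf-8") as trg_file:
        trg_lines = [line.rstrip("\n") for line in trg_file]
    if len(src_lines) != len(trg_lines):
        raise ValueError("source and target files differ in line count")
    return list(zip(src_lines, trg_lines))


def filter_pairs(pairs: List[Tuple[str, str]], max_ratio: float = 2.0,
                 lowercase: bool = True) -> List[Tuple[str, str]]:
    """Drop empty pairs and pairs with an extreme length ratio."""
    kept = []
    for src, trg in pairs:
        # Whitespace splitting stands in for a real tokenizer here.
        src_len, trg_len = len(src.split()), len(trg.split())
        if src_len == 0 or trg_len == 0:
            continue
        if max(src_len, trg_len) / min(src_len, trg_len) > max_ratio:
            continue
        kept.append((src.lower(), trg.lower()) if lowercase else (src, trg))
    return kept


# Tiny demonstration with two temporary, line-aligned files.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "train.de")
    trg_path = os.path.join(tmp, "train.en")
    with open(src_path, "w", encoding="utf-8") as f:
        f.write("Guten Morgen\nJa\n")
    with open(trg_path, "w", encoding="utf-8") as f:
        f.write("Good morning\nYes , that is what I meant to say\n")
    pairs = filter_pairs(read_parallel(src_path, trg_path))
```

The second sentence pair is dropped because its target side is ten times longer than its source side.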
In addition, you might want to build the NMT model not on the basis of words, but rather sub-words or characters (the level in JoeyNMT configurations).
Currently, JoeyNMT supports the byte-pair-encodings (BPE) format by subword-nmt.
Experiments are specified in configuration files, in simple YAML format. You can find examples in the configs directory.
small.yaml contains a detailed explanation of configuration options.
Most importantly, the configuration contains the description of the model architecture (e.g. number of hidden units in the encoder RNN), paths to the training, development and test data, and the training hyperparameters (learning rate, validation frequency etc.).
For training, run:
python3 -m joeynmt train configs/small.yaml
This will train a model on the training data specified in the config (here: small.yaml), validate on validation data, and store model parameters, vocabularies, validation outputs and a small number of attention plots in the model_dir (also specified in config).
Note that pre-processing like tokenization or BPE-ing is not included in training, but has to be done manually beforehand.
Tip: To avoid overwriting existing models, set overwrite: False in the model configuration.
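If you launch many experiments from scripts, a user-side pre-flight check in the spirit of the tip above can be sketched as follows. This is not Joey NMT code: check_model_dir is a hypothetical helper, and the flat dictionary only mirrors the two entries named in the text (model_dir and overwrite); their nesting in a real configuration file may differ, see small.yaml.

```python
import os
import tempfile


def check_model_dir(model_dir: str, overwrite: bool) -> None:
    """Raise if model_dir already exists and overwriting is not allowed."""
    if os.path.isdir(model_dir) and not overwrite:
        raise FileExistsError(
            "{} exists; choose another model_dir or set overwrite: True"
            .format(model_dir))


# Hypothetical, flattened view of the two relevant config entries.
with tempfile.TemporaryDirectory() as existing_dir:
    config = {"model_dir": existing_dir, "overwrite": False}
    try:
        check_model_dir(config["model_dir"], config["overwrite"])
        blocked = False
    except FileExistsError:
        # The directory already exists, so a new run would clash with it.
        blocked = True
```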
The validations.txt file in the model directory reports the validation results at every validation point.
Models are saved whenever a new best validation score is reached, in batch_no.ckpt, where batch_no is the number of batches the model has been trained on so far.
best.ckpt links to the checkpoint that has so far achieved the best validation score.
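Given this naming scheme, the most recent numbered checkpoint in a model directory can be located with a short helper. The sketch below assumes that checkpoints are named exactly as described (an integer batch count followed by .ckpt); latest_checkpoint is a hypothetical function, and the files created in the demonstration are empty placeholders rather than real checkpoints.

```python
import os
import tempfile
from typing import Optional


def latest_checkpoint(model_dir: str) -> Optional[str]:
    """Return the path of the numbered checkpoint with the highest batch count."""
    numbered = []
    for name in os.listdir(model_dir):
        stem, ext = os.path.splitext(name)
        if ext == ".ckpt" and stem.isdigit():  # skips best.ckpt
            numbered.append((int(stem), name))
    if not numbered:
        return None
    # Compare batch counts as integers, so 1500 ranks above 500.
    return os.path.join(model_dir, max(numbered)[1])


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as model_dir:
    for name in ("500.ckpt", "1500.ckpt", "best.ckpt", "validations.txt"):
        open(os.path.join(model_dir, name), "w").close()
    newest = os.path.basename(latest_checkpoint(model_dir))
```

Sorting the file names as strings would rank 500.ckpt above 1500.ckpt, which is why the batch counts are converted to integers first.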
JoeyNMT uses TensorboardX to visualize training and validation curves and attention matrices during training.
Launch Tensorboard with tensorboard --logdir model_dir/tensorboard (or python -m tensorboard.main ...) and then open the URL (default: localhost:6006) in a browser.
For a stand-alone plot, run python3 scripts/plot_validation.py model_dir --plot_values bleu PPL --output_path my_plot.pdf
to plot curves of validation BLEU and PPL.
For training on a GPU, set use_cuda in the config file to True. This requires the necessary CUDA libraries to be installed.
There are three options for testing what the model has learned.
Whatever data you feed the model for translating, make sure it is properly pre-processed, just as you pre-processed the training data, e.g. tokenized and split into subwords (if working with BPEs).
For testing and evaluating on your parallel test/dev set, run:
python3 -m joeynmt test configs/small.yaml --output_path out
This will generate translations for the validation and test sets (as specified in the configuration) in out.[dev|test] with the latest/best model in the model_dir (or a specific checkpoint set with load_model).
It will also evaluate the outputs with eval_metric.
If --output_path is not specified, it will not store the translations, and will only do the evaluation and print the results.
In order to translate the contents of a file not contained in the configuration (here my_input.txt), simply run:
python3 -m joeynmt translate configs/small.yaml < my_input.txt > out
The translations will be written to stdout, or alternatively to --output_path if specified.
If you just want to try a few examples, run
python3 -m joeynmt translate configs/small.yaml
and you'll be prompted to type input sentences that JoeyNMT will then translate with the model specified in the configuration.
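When these commands are driven from a script, they can be assembled programmatically. The following sketch is an assumption-laden convenience wrapper, not part of Joey NMT: joeynmt_command and translate_file are hypothetical helpers that only reproduce the command lines shown above, and translate_file requires joeynmt to be installed when it is actually called.

```python
import subprocess
import sys
from typing import List, Optional


def joeynmt_command(mode: str, config_path: str,
                    output_path: Optional[str] = None) -> List[str]:
    """Build the argument list for the train, test or translate mode."""
    if mode not in ("train", "test", "translate"):
        raise ValueError("unknown mode: {}".format(mode))
    if mode == "train" and output_path is not None:
        # The text only documents --output_path for test and translate.
        raise ValueError("--output_path is not used for training here")
    command = [sys.executable, "-m", "joeynmt", mode, config_path]
    if output_path is not None:
        command += ["--output_path", output_path]
    return command


def translate_file(config_path: str, input_path: str, output_path: str) -> None:
    """Run translate with stdin and stdout redirected to files."""
    with open(input_path, encoding="utf-8") as stdin, \
            open(output_path, "w", encoding="utf-8") as stdout:
        subprocess.run(joeynmt_command("translate", config_path),
                       stdin=stdin, stdout=stdout, check=True)


# Building a command does not start any process.
test_command = joeynmt_command("test", "configs/small.yaml", output_path="out")
```

Whatever is passed as input must already be pre-processed in the same way as the training data.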
The docs include an overview of the NMT implementation, a walk-through tutorial for building, training, tuning, testing and inspecting an NMT system, the API documentation and FAQs.
Benchmarks on small models trained on GPU/CPU on standard data sets are reported here.
- IWSLT15 En-Vi, word-based
- IWSLT14 De-En, 32000 joint BPE, word-based
- WMT17 En-De and Lv-En, 32000 joint BPE
We compare against Tensorflow NMT on the IWSLT15 En-Vi data set as preprocessed by Stanford.
You can download the data with scripts/get_iwslt15_envi.sh, and then use configs/iwslt_envi_luong.yaml to replicate the experiment.
Systems | tst2012 (dev) | test2013 (test) |
---|---|---|
TF NMT (greedy) | 23.2 | 25.5 |
TF NMT (beam=10) | 23.8 | 26.1 |
Joey NMT (greedy) | 23.2 | 25.8 |
Joey NMT (beam=10, alpha=1.0) | 23.8 | 26.5 |
(Luong & Manning, 2015) | - | 23.3 |
We also compare against xnmt, which uses different hyperparameters, so we use a different configuration for Joey NMT too: configs/iwslt_envi_xnmt.yaml.
Systems | tst2012 (dev) | test2013 (test) |
---|---|---|
xnmt (beam=5) | 25.0 | 27.3 |
Joey NMT (greedy) | 24.6 | 27.4 |
Joey NMT (beam=5, alpha=1.0) | 24.9 | 27.7 |
We compare against the baseline scores reported in (Wiseman & Rush, 2016) (W&R) and (Bahdanau et al., 2017) (B17) with tokenized, lowercased BLEU (using sacrebleu).
We compare a word-based model of the same size and vocabulary as in W&R and B17.
The script to obtain and pre-process the data is the one published with W&R.
Use configs/iwslt_deen_bahdanau.yaml for training the model.
On a K40 GPU, word-level training took <1h, and beam search decoding for both dev and test took <2min.
Systems | level | dev | test | #params |
---|---|---|---|---|
W&R (greedy) | word | - | 22.53 | |
W&R (beam=10) | word | - | 23.87 | |
B17 (greedy) | word | - | 25.82 | |
B17 (beam=10) | word | - | 27.56 | |
Joey NMT (greedy) | word | 28.41 | 26.68 | 22.05M |
Joey NMT (beam=10, alpha=1.0) | word | 28.96 | 27.03 | 22.05M |
On CPU (use_cuda: False), approx. 8-10x slower (8h for training, 19min for beam search decoding of both dev and test, 5min for greedy decoding):
Systems | level | dev | test | #params |
---|---|---|---|---|
Joey NMT (greedy) | word | 28.35 | 26.46 | 22.05M |
Joey NMT (beam=10, alpha=1.0) | word | 28.85 | 27.06 | 22.05M |
In addition, we compare to a BPE-based GRU model with 32k (Groundhog style).
Use scripts/get_iwslt14_bpe.sh to pre-process the data and configs/iwslt14_deen_bpe.yaml to train the model.
This model is available for download here.
We also evaluate using the Transformer. We use 256 hidden units, 4 attention heads, a feed-forward layer size of 1024, and a dropout value of 0.3. You can find the settings in configs/transformer_iwslt14_deen_bpe.yaml.
Systems | level | dev | test | #params |
---|---|---|---|---|
Joey NMT (greedy) | bpe | 27.57 | | 60.69M |
Joey NMT (beam=5, alpha=1.0) | bpe | 28.55 | 27.34 | 60.69M |
Joey NMT Transformer (greedy) | bpe | 28.20 | 27.10 | 26.61M |
Joey NMT Transformer (beam=5, alpha=1.0) | bpe | 29.03 | 28.00 | 26.61M |
We compare against the results for recurrent BPE-based models that were reported in the Sockeye paper.
We only consider the Groundhog setting here, where toolkits are used out-of-the-box for creating a Groundhog-like model (1 layer, LSTMs, MLP attention).
The data is pre-processed as described in the paper (code).
Postprocessing is done with Moses' detokenizer, evaluation with sacrebleu.
Note that the scores reported for other models might not reflect the current state of the code, but the state at the time of the Sockeye evaluation. Please also consider the difference in number of parameters despite "the same" setup: our models are the smallest in numbers of parameters.
Groundhog setting: configs/wmt_ende_default.yaml with encoder rnn=500, lr=0.0003, init_hidden="bridge".
Systems | level | dev | test | #params |
---|---|---|---|---|
Sockeye (beam=5) | bpe | - | 23.18 | 87.83M |
OpenNMT-Py (beam=5) | bpe | - | 18.66 | 87.62M |
Joey NMT (beam=5) | bpe | 24.33 | 23.45 | 86.37M |
The Joey NMT model was trained for 4 days (14 epochs).
Groundhog setting: configs/wmt_lven_default.yaml with encoder rnn=500, lr=0.0003, init_hidden="bridge".
Systems | level | dev | test | #params |
---|---|---|---|---|
Sockeye (beam=5) | bpe | - | 14.40 | ? |
OpenNMT-Py (beam=5) | bpe | - | 9.98 | ? |
Joey NMT (beam=5) | bpe | 12.09 | 8.75 | 64.52M |
Since this codebase is supposed to stay clean and minimalistic, contributions addressing the following are welcome:
- Code correctness
- Code cleanliness
- Documentation quality
- Speed or memory improvements
- Resolving issues
Code extending the functionality beyond the basics will most likely not end up in the master branch, but we're curious to learn what you used Joey for.
Here we'll collect projects and repositories that are based on Joey. If you used Joey for a project, publication or built some code on top of it, let us know and we'll link it here.
Projects:
- TBD
Please open an issue if you have questions or problems with the code.
For general questions, email us at joeynmt <at> gmail.com.
Joeys are infant marsupials.