This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Bilingual lexicons map words in one language to their translations in another, and are typically induced by learning linear projections to align monolingual word embedding spaces. In this paper, we show it is possible to produce much higher quality lexicons with methods that combine (1) unsupervised bitext mining and (2) unsupervised word alignm…

facebookresearch/bitext-lexind


Bilingual Lexicon Induction via Unsupervised Bitext Construction and Word Alignment

Haoyue Shi, Luke Zettlemoyer and Sida I. Wang

Requirements

PyTorch >= 1.7
transformers == 4.0.0
fairseq (to run CRISS and extract CRISS-based features)
chinese_converter (to convert between simplified and traditional Chinese, fitting the different settings of CRISS and MUSE)

See also env/env.yml for a sufficient environment setup.

A Quick Example for the Pipeline of Lexicon Induction

Step 0: Download CRISS

The default setting assumes that the CRISS (3rd iteration) model is saved in criss/criss-3rd.pt.

Step 1: Unsupervised Bitext Construction with CRISS

Let's assume that we have the following bitext (sentences separated by " ||| ", one pair per line):

Das ist eine Katze . ||| This is a cat .
Das ist ein Hund . ||| This is a dog .
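As a minimal sketch of how a file in this format could be read (the helper name `parse_bitext` is hypothetical and not part of this repository; whitespace tokenization is an assumption based on the example above):

```python
from typing import List, Tuple

def parse_bitext(lines: List[str], sep: str = " ||| ") -> List[Tuple[List[str], List[str]]]:
    """Split each 'source ||| target' line into two whitespace-tokenized sentences."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Raises ValueError if the line does not contain exactly one separator.
        src, trg = line.split(sep)
        pairs.append((src.split(), trg.split()))
    return pairs

example = [
    "Das ist eine Katze . ||| This is a cat .",
    "Das ist ein Hund . ||| This is a dog .",
]
bitext = parse_bitext(example)
print(bitext[0])
```

Each entry of `bitext` is a (source tokens, target tokens) pair, which is the shape that token-index alignments refer to.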

Step 2: Word Alignment with SimAlign

Note: we use CRISS as the backbone of SimAlign, with our own implementation. You can also use other aligners; just make sure that the results are stored in a JSON file as follows (one JSON object per line, matching the bitext line by line):

{"inter": [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]], "itermax": [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]]}
{"inter": [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]], "itermax": [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]]}

where "inter" and "itermax" denote the argmax and itermax algorithms in SimAlign, respectively. The output is in the same format as the JSON output of SimAlign. See the code of SimAlign for more details.
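The sketch below illustrates how such an alignment file could be paired with the bitext to count aligned word pairs, and how output from another aligner could be converted into the same layout. It assumes each index pair is [source index, target index] over whitespace tokens; the function names are hypothetical and not part of this repository. Writing the same links under both keys in `pharaoh_to_json_line` is only a placeholder for aligners that produce a single alignment.

```python
import json
from collections import Counter

bitext_lines = [
    "Das ist eine Katze . ||| This is a cat .",
    "Das ist ein Hund . ||| This is a dog .",
]
align_lines = [
    '{"inter": [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]], "itermax": [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]]}',
    '{"inter": [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]], "itermax": [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]]}',
]

def aligned_word_pairs(bitext_lines, align_lines, key="inter", sep=" ||| "):
    """Count (source word, target word) pairs linked by the chosen alignment."""
    counts = Counter()
    for sent_line, align_line in zip(bitext_lines, align_lines):
        src, trg = (side.split() for side in sent_line.strip().split(sep))
        for i, j in json.loads(align_line)[key]:
            # Assumption: i indexes the source sentence, j the target sentence.
            counts[(src[i], trg[j])] += 1
    return counts

def pharaoh_to_json_line(pharaoh: str) -> str:
    """Convert a Pharaoh-style line such as '0-0 1-1' into one JSON line."""
    links = [[int(i), int(j)] for i, j in (p.split("-") for p in pharaoh.split())]
    return json.dumps({"inter": links, "itermax": links})

counts = aligned_word_pairs(bitext_lines, align_lines)
print(counts.most_common(3))
print(pharaoh_to_json_line("0-0 1-1"))
```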

Step 3: Training and Testing Lexicon Inducer

Fully Unsupervised

python src/fully_unsup.py \
    -b ./data/bitext.txt \
    -a ./data/bitext.txt.align \
    -te ./data/test.dict 

Weakly Supervised

python src/weakly_sup.py \
    -b ./data/bitext.txt \
    -a ./data/bitext.txt.align \
    -tr ./data/train.dict \
    -te ./data/test.dict \
    -src de_DE \
    -trg en_XX

You will probably also want to specify a model folder with -o $model_FOLDER to save the statistics of the bitext and alignment (default: ./model).

-src and -trg specify the source and target languages. For the languages and corresponding codes that CRISS supports, check the language pairs in this file.

You will see the final model (model.pt, the lexicon inducer) and the induced lexicon (induced.weaklysup.dict or induced.fullyunsup.dict) in the model folder, as well as a line of evaluation results on the test set, as follows:

{'oov_number': 0, 'oov_rate': 0.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}
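To make the fields of this result line concrete, here is one plausible way such metrics could be computed. This is a sketch under stated assumptions, not the repository's evaluation code: the function `evaluate`, the dictionary layout (source word mapped to a set of translations), and the definition of OOV as "test source word with no induced entry" are all assumptions.

```python
def evaluate(induced, gold):
    """Compare an induced lexicon against a gold test dictionary.

    Both arguments map a source word to a set of target words.
    Assumption: a gold source word absent from `induced` counts as OOV.
    """
    oov = [w for w in gold if w not in induced]
    n_correct = sum(len(induced.get(w, set()) & gold[w]) for w in gold)
    n_pred = sum(len(induced.get(w, set())) for w in gold)
    n_gold = sum(len(trans) for trans in gold.values())
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "oov_number": len(oov),
        "oov_rate": len(oov) / len(gold) if gold else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

gold = {"Katze": {"cat"}, "Hund": {"dog"}}
induced = {"Katze": {"cat"}, "Hund": {"dog"}}
print(evaluate(induced, gold))
```

With a perfect induced lexicon, as in this toy case, every score is 1.0 and no test word is out of vocabulary.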

A Quick Example for the MLP-Based Aligner

Training

Training an MLP-based aligner using the bitext and alignment shown above.

python align/train.py \
    -b ./data/bitext.txt \
    -a ./data/bitext.txt.align \
    -src de_DE \
    -trg en_XX \
    -o model/

Testing

Testing the saved aligner on the same set (note: this only demonstrates how the code works; in real scenarios, we test on a dataset different from the training set).

The -b and -a arguments should be the same as those used for training, to avoid potential errors (in fact, if you did not delete anything after training, the -b and -a parameters will never actually be used).

python align/test.py \
    -b ./data/bitext.txt \
    -a ./data/bitext.txt.align \
    -src de_DE \
    -trg en_XX \
    -m model/

For the CRISS-SimAlign baseline, you can run a quick evaluation of CRISS-based SimAlign on the above examples for German-English alignment, using the argmax inference algorithm:

python align/eval_simalign_criss.py

License

MIT
