This is the implementation of our papers accepted at RepL4NLP and EMNLP 2019.
The code is written in PyTorch. If you use any source code or datasets included in this toolkit in your work, please cite the following papers.
@inproceedings{winata-etal-2019-learning,
title = "Learning Multilingual Meta-Embeddings for Code-Switching Named Entity Recognition",
author = "Winata, Genta Indra and
Lin, Zhaojiang and
Fung, Pascale",
booktitle = "Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)",
month = aug,
year = "2019",
address = "Florence, Italy",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/W19-4320",
pages = "181--186",
}
@inproceedings{winata-etal-2019-hierarchical,
title = "Hierarchical Meta-Embeddings for Code-Switching Named Entity Recognition",
author = "Winata, Genta Indra and
Lin, Zhaojiang and
Shin, Jamin and
Liu, Zihan and
Fung, Pascale",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-1360",
doi = "10.18653/v1/D19-1360",
pages = "3532--3538",
}
In countries where multiple main languages are spoken, mixing different languages within a conversation is commonly called code-switching. Previous work addressing this challenge mainly focused on word-level aspects such as word embeddings. However, in many cases, languages share common subwords, especially closely related languages, but also languages that are seemingly unrelated. Therefore, we propose Hierarchical Meta-Embeddings (HME) that learn to combine multiple monolingual word-level and subword-level embeddings to create language-agnostic lexical representations. On the task of Named Entity Recognition for English-Spanish code-switching data, our model achieves state-of-the-art performance in the multilingual setting. We also show that, in cross-lingual settings, our model not only leverages closely related languages, but also learns from languages with different roots. Finally, we show that combining different subunits is crucial for capturing code-switching entities.
English-Spanish Twitter Dataset in CoNLL format. Due to privacy issues, we anonymized the dataset; you can download it here. We do not have the labels for the test set, so you can validate your system at https://competitions.codalab.org/competitions/18725. You can reuse this code and apply our method to other datasets.
Please check the format here
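As a rough illustration of what a CoNLL-style file looks like to a program, here is a minimal reader sketch. It assumes one token per line, the label in the last whitespace-separated column, and a blank line between sentences. The function name `read_conll` and the sample sentences are hypothetical and not taken from this repository or the dataset.

```python
import io


def read_conll(lines):
    """Yield (tokens, labels) pairs from CoNLL-style lines.

    Assumption: the token is the first column, the label is the last
    column, and sentences are separated by blank lines.
    """
    tokens, labels = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if tokens:
                yield tokens, labels
                tokens, labels = [], []
            continue
        columns = line.split()
        tokens.append(columns[0])
        labels.append(columns[-1])
    if tokens:  # the last sentence may not be followed by a blank line
        yield tokens, labels


# Hypothetical sample, not taken from the dataset.
sample = io.StringIO("Vamos\tO\na\tO\nMiami\tB-LOC\n\nhello\tO\n")
sentences = list(read_conll(sample))
```

To read from disk, pass an open file handle instead of the `io.StringIO` object.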
- Install PyTorch (Tested in PyTorch 1.0 and Python 3.6)
- Install library dependencies:
pip install tqdm numpy torchtext bpemb gensim
- Download pre-trained word embeddings.
In this paper, we used English, Spanish, Catalan, and Portuguese FastText embeddings and English Twitter GloVe embeddings. We generated word embeddings for all words to eliminate out-of-vocabulary words and let the model learn how to choose and combine embeddings.
- Subword embeddings.
The code automatically downloads subword embeddings using the bpemb library.
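For a quick look inside a downloaded embedding file, the following sketch reads the plain-text layout shared by FastText `.vec` and GloVe `.txt` files: one word per line followed by its vector values. FastText files start with a header line holding the vocabulary size and the dimension, which the sketch skips. The helper name `load_text_embeddings` and the toy vectors are hypothetical; this is not the loader used by `train.py`, and it assumes that words contain no spaces.

```python
import io


def load_text_embeddings(handle):
    """Read `word v1 v2 ...` lines into a dict of word -> list of floats.

    A first line made of exactly two integers (vocabulary size and
    dimension) is treated as a header and skipped.
    """
    vectors = {}
    for index, raw in enumerate(handle):
        parts = raw.rstrip().split(" ")
        if index == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Toy data with a FastText-style header; the values are made up.
toy = io.StringIO("2 3\nhola 0.1 0.2 0.3\nhello 0.4 0.5 0.6\n")
embeddings = load_text_embeddings(toy)
```

For the full-size files, a library loader such as gensim's `KeyedVectors.load_word2vec_format` is likely a better fit than a pure-Python loop.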
- --emb_list: list of all pre-trained word embeddings
- --use_crf: add a CRF layer
- --model_dir: define the location of the saved model
- --lr: tune the learning rate
- --batch_size: number of samples in each batch
- --mode: concat, linear, or attn_sum
- --no_projection: remove the projection layer (especially for CONCAT)
- --no_word_emb: remove word embeddings
- --early_stop: enable early stopping
- --max_length: increase the maximum sequence length
- --bpe_lang_list: list of BPE languages (keep it empty for word only)
- --bpe_emb_size: BPE embedding size (default: 300)
- --bpe_vocab: BPE vocabulary size (default: 5000)
- --bpe_hidden_size: BPE hidden size
- --bpe_cache: path to store BPE embeddings
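To make the three `--mode` values more concrete, here is a minimal sketch of what each combination could compute for a single word, written with plain Python lists instead of tensors for readability. It is an illustration under stated assumptions, not the code in this repository: it assumes the embeddings for `linear` and `attn_sum` have already been projected to a common size, and `scorer` is a hypothetical learned vector that turns each embedding into a scalar attention score.

```python
import math


def concat(embeddings):
    """CONCAT: join all embedding vectors end to end."""
    return [value for vector in embeddings for value in vector]


def linear(embeddings):
    """LINEAR: element-wise sum of equally sized (projected) vectors."""
    return [sum(values) for values in zip(*embeddings)]


def attn_sum(embeddings, scorer):
    """Attention: softmax-weighted sum of equally sized (projected) vectors.

    `scorer` is a hypothetical learned vector; the dot product with each
    embedding gives that embedding's unnormalized attention score.
    """
    scores = [sum(s * v for s, v in zip(scorer, vector)) for vector in embeddings]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    mixed = [
        sum(w * vector[i] for w, vector in zip(weights, embeddings))
        for i in range(len(embeddings[0]))
    ]
    return mixed, weights


# Toy two-dimensional "English" and "Spanish" vectors for one word.
english = [1.0, 0.0]
spanish = [0.0, 1.0]
combined, weights = attn_sum([english, spanish], scorer=[0.0, 0.0])
```

With an all-zero `scorer`, both embeddings receive the same weight, so the result is their average. A trained scorer would shift the weights toward the more useful embedding for each word.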
python train.py --emb_list embedding/all_vocab_en_es_crawl-300d-2M-subword.vec embedding/all_vocab_en_es_cc.es.300.vec embedding/glove.840B.300d.txt --cuda --use_crf --model_dir concat_eng_spa_trfs_crf_lr0.1_lossmse_en_es --lr 0.1 --batch_size 32 --mode concat --no_projection
python train.py --emb_list embedding/all_vocab_en_es_crawl-300d-2M-subword.vec embedding/all_vocab_en_es_cc.es.300.vec embedding/glove.840B.300d.txt --cuda --use_crf --model_dir concat_eng_spa_trfs_crf_lr0.1_lossmse_en_es --lr 0.1 --batch_size 32 --mode linear
Word only
python train.py --emb_list embedding/all_vocab_en_es_crawl-300d-2M-subword.vec embedding/all_vocab_en_es_cc.es.300.vec embedding/glove.840B.300d.txt --cuda --use_crf --model_dir concat_eng_spa_trfs_crf_lr0.1_lossmse_en_es --lr 0.1 --batch_size 32 --mode attn_sum
Word + BPE
python train.py --emb_list embedding/all_vocab_en_es_crawl-300d-2M-subword.vec embedding/all_vocab_en_es_cc.es.300.vec embedding/all_vocab_en_es_cc.br.300.vec embedding/all_vocab_en_es_cc.cy.300.vec embedding/all_vocab_en_es_cc.ga.300.vec embedding/all_vocab_en_es_cc.gd.300.vec embedding/all_vocab_en_es_cc.gv.300.vec embedding/glove.840B.300d.txt --cuda --model=TRFS --use_crf --model_dir eng_spa_trfs_crf_mse0_lr0.1_lossmse_en_es_br_cy_ga_gd_gv_glove_all_vocab_bpe --lr 0.1 --batch_size 32 --early_stop 15 --bpe_lang_list en es br cy ga gd gv
Word + char
python train.py --emb_list embedding/all_vocab_en_es_crawl-300d-2M-subword.vec embedding/all_vocab_en_es_cc.es.300.vec embedding/all_vocab_en_es_cc.br.300.vec embedding/all_vocab_en_es_cc.cy.300.vec embedding/all_vocab_en_es_cc.ga.300.vec embedding/all_vocab_en_es_cc.gd.300.vec embedding/all_vocab_en_es_cc.gv.300.vec embedding/glove.840B.300d.txt --cuda --model=TRFS --use_crf --model_dir eng_spa_trfs_crf_char_mse0_lr0.1_lossmse_en_es_br_cy_ga_gd_gv_glove_all_vocab --lr 0.1 --batch_size 32 --early_stop 15 --add_char_emb
To evaluate the F1 score, generate the attention scores, and save them to a file, run:
python test.py --emb_list embedding/all_vocab_en_es_crawl-300d-2M-subword.vec embedding/all_vocab_en_es_cc.es.300.vec embedding/glove.840B.300d.txt --cuda --use_crf --model_dir concat_eng_spa_trfs_crf_lr0.1_lossmse_en_es --lr 0.1 --batch_size 32 --mode attn_sum
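For a local sanity check on a labeled development split, entity-level F1 can be sketched as below. This assumes BIO-style tags and exact span matching, which may differ from the scoring used by `test.py` or by the CodaLab competition; the function names and the example tags are hypothetical.

```python
def extract_spans(labels):
    """Return the set of (start, end, type) entity spans in a BIO sequence."""
    spans, start, kind = set(), None, None
    # The trailing "O" is a sentinel that closes a span ending at the last token.
    for index, label in enumerate(list(labels) + ["O"]):
        inside = label.startswith("I-") and label[2:] == kind
        if start is not None and not inside:
            spans.add((start, index, kind))
            start, kind = None, None
        if label.startswith("B-"):
            start, kind = index, label[2:]
    return spans


def span_f1(gold, predicted):
    """Exact-match F1 between gold and predicted BIO label sequences."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: one of the two gold entities is recovered.
gold = ["B-PER", "I-PER", "O", "B-LOC"]
pred = ["B-PER", "I-PER", "O", "O"]
score = span_f1(gold, pred)
```

The sketch scores a single sequence. For a whole corpus, the spans of all sentences would be pooled before computing precision and recall.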
Feel free to create an issue or send an email to [email protected]