This repository contains code for the paper "Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding" (EMNLP 2020).
Authors: Samson Tan, Shafiq Joty, Lav Varshney, and Min-Yen Kan
pip install git+https://github.com/salesforce/bite
from bite import BITETokenizer
bite = BITETokenizer('moses')
print(bite.tokenize('I was going to the engine room!'))
We also include a script you can use to tokenize entire files (run_bite.py). The parser arguments (--argument_name) will give you an idea of the options supported by the script.
If you are using HuggingFace's BERT model, you may want to use the BiteWordpieceTokenizer instead. This is the implementation we use in our BERT-based experiments.
Three types of pretokenizers are supported out of the box:
- BertPreTokenizer (HuggingFace)
- Moses (sacremoses)
- Whitespace splitting
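The choice of pretokenizer mainly affects how punctuation is separated from words before inflections are processed. The sketch below only illustrates that difference on the example sentence; the regex splitter is a hypothetical stand-in and is not the Moses or BertPreTokenizer algorithm.

```python
import re

text = "I was going to the engine room!"

# Whitespace splitting: punctuation stays attached to the neighbouring word.
whitespace_tokens = text.split()

# A crude punctuation-aware splitter, standing in for richer pretokenizers.
# This is NOT the Moses or BertPreTokenizer algorithm, only an illustration.
punct_aware_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)   # the last token is "room!"
print(punct_aware_tokens)  # "room" and "!" are separate tokens
```

With whitespace splitting, a word such as "room!" reaches the next stage with its punctuation still attached, which is one reason a punctuation-aware pretokenizer may be preferable for running text.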
Since subword tokenizers often operate on individual characters, running them on BITE-processed input with human-readable inflection tags (e.g., [VBD]) would skew the character/subword statistics of the training corpus and occupy unnecessary slots in the subword vocabulary. Therefore, we recommend using single-character inflection symbols (by passing map_to_single_char=True to tokenize) when using BITE with such tokenizers.
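To see why the tag format matters, here is a minimal, self-contained sketch that does not call the library. The token sequence and the tag-to-symbol table are assumptions made for illustration; the symbols BITE actually emits with map_to_single_char=True may differ.

```python
from collections import Counter

# Hypothetical BITE-style output: base forms followed by inflection tags.
# The exact tokens BITE emits may differ; this is for illustration only.
tagged = ["I", "be", "[VBD]", "go", "[VBG]", "to", "the", "engine", "room", "!"]

# Hypothetical tag-to-symbol table (not the library's real mapping).
# Private Use Area code points are unlikely to collide with corpus text.
TAG_TO_CHAR = {"[VBD]": "\ue000", "[VBG]": "\ue001"}


def to_single_char(tokens):
    """Replace each known inflection tag with its one-character symbol."""
    return [TAG_TO_CHAR.get(tok, tok) for tok in tokens]


def char_counts(tokens):
    """Character frequencies a character-level subword learner would see."""
    return Counter("".join(tokens))


compact = to_single_char(tagged)
readable_stats = char_counts(tagged)
compact_stats = char_counts(compact)

# Number of characters contributed by the inflection markers in each form.
extra_readable = sum(len(t) for t in tagged if t in TAG_TO_CHAR)
extra_compact = sum(len(TAG_TO_CHAR[t]) for t in tagged if t in TAG_TO_CHAR)
print(extra_readable, extra_compact)
```

In this toy sequence the readable tags add ten characters (brackets and uppercase letters) to the statistics, while the single-character symbols add two, and none of them overlap with ordinary text.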
The scripts for cleaning the CORAAL data and scraping the Colloquial Singapore English data can be found in paper_scripts. Please be considerate when scraping and do not flood the site's servers with requests :)
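One simple way to avoid flooding a server is to pause between consecutive requests. The helper below is a hypothetical sketch and is not taken from the scripts in paper_scripts; fetch_politely, its parameters, and the example URLs are all assumptions, and the fetch function is passed in so the demo never touches the network.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait before every request after the first
        results.append(fetch(url))
    return results


# Demo with a stand-in fetch function, so nothing is actually downloaded.
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.01,
)
print(len(pages))
```

In a real scraper, fetch would wrap whatever HTTP client you use, and a delay of a second or more is a reasonable starting point unless the site states its own limits.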
Please cite the following if you use the code in this repository:
@inproceedings{tan-etal-2020-mind,
title = "Mind Your Inflections! {I}mproving {NLP} for Non-Standard {E}nglishes with {B}ase-{I}nflection {E}ncoding",
author = "Tan, Samson and
Joty, Shafiq and
Varshney, Lav and
Kan, Min-Yen",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-main.455",
pages = "5647--5663",
}