This branch contains code for ELMo-based debiasing. For BERT-based debiasing, check out the bert_debias branch.
To cite our work:
@misc{dev2019measuring,
  title={On Measuring and Mitigating Biased Inferences of Word Embeddings},
  author={Sunipa Dev and Tao Li and Jeff Phillips and Vivek Srikumar},
  year={2019},
  eprint={1908.09369},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
First, place glove.840B.300d.txt at ./data/glove.840B.300d.txt and the SNLI data txt files in ./data/nli_bias/.
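Before preprocessing, it can save time to confirm that the inputs are where the commands expect them. The following is a minimal sketch, not part of the repository: the helper name `missing_inputs` is hypothetical, and because the exact SNLI file names are not listed here, it only checks that ./data/nli_bias/ contains at least one .txt file.

```python
from pathlib import Path


def missing_inputs(root="."):
    """Return a list of human-readable problems with the expected input layout."""
    root = Path(root)
    problems = []

    # The GloVe path is the one passed to --glove in the preprocessing commands.
    glove = root / "data" / "glove.840B.300d.txt"
    if not glove.is_file():
        problems.append(f"missing GloVe file: {glove}")

    # Assumption: any .txt file counts; the exact SNLI file names are not checked.
    snli_dir = root / "data" / "nli_bias"
    if not snli_dir.is_dir():
        problems.append(f"missing SNLI directory: {snli_dir}")
    elif not any(snli_dir.glob("*.txt")):
        problems.append(f"no .txt files found in: {snli_dir}")

    return problems


if __name__ == "__main__":
    issues = missing_inputs()
    for issue in issues:
        print(issue)
    if not issues:
        print("input layout looks complete")
```

Run it from the repository root; an empty result means both locations are populated, not that the files themselves are valid.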
Then preprocess the data:
python3 preprocess.py --glove ./data/glove.840B.300d.txt --batch_size 48 --dir ./data/nli_bias/ --output snli
python3 get_pretrain_vecs.py --glove ./data/glove.840B.300d.txt --dict ./data/nli_bias/snli.word.dict \
--output ./data/snli.glove
For unlabeled data (i.e. bias probing data), preprocess like this:
DATA_NAME=occupation_gender_templates.stitch
python3 preprocess_unlabeled.py --glove ./data/glove.840B.300d.txt --batch_size 48 --dir ./data/nli_bias/ \
--sent1 ${DATA_NAME}.sent1.txt --sent2 ${DATA_NAME}.sent2.txt \
--vocab ./data/nli_bias/snli.word.dict --vocab_all ./data/nli_bias/snli.allword.dict \
--output ${DATA_NAME}
python3 get_pretrain_vecs.py --glove ./data/glove.840B.300d.txt --dict ./data/nli_bias/${DATA_NAME}.word.dict \
--output ./data/${DATA_NAME}.glove
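Before running the unlabeled preprocessing above, it may be worth checking that the two sentence files line up. The sketch below rests on two assumptions that this README does not state: that line i of the sent1 file pairs with line i of the sent2 file, and that both files live in the directory passed as --dir. The helper names `count_lines` and `check_aligned` are hypothetical.

```python
from pathlib import Path


def count_lines(path):
    """Count lines in a text file without loading it all into memory."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_aligned(data_dir, data_name):
    """Return (n_sent1, n_sent2); raise ValueError if the counts differ."""
    data_dir = Path(data_dir)
    n1 = count_lines(data_dir / f"{data_name}.sent1.txt")
    n2 = count_lines(data_dir / f"{data_name}.sent2.txt")
    if n1 != n2:
        raise ValueError(f"{data_name}: sent1 has {n1} lines, sent2 has {n2}")
    return n1, n2
```

For example, `check_aligned("./data/nli_bias/", "occupation_gender_templates.stitch")` returns the two counts when they match and raises otherwise.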
To train a baseline RNN model on SNLI, use the following:
MODEL=./models/baseline
python3 -u train.py --gpuid 0 --dir data/nli_bias/ \
--train_data snli-train.hdf5 --val_data snli-val.hdf5 \
--word_vecs snli.glove.hdf5 --dict snli.word.dict \
--encoder rnn --use_elmo_post 0 \
--save_file $MODEL | tee $MODEL.log.txt
To evaluate a trained model on the SNLI test set, use:
MODEL=./models/baseline
python3 -u eval.py --gpuid 0 --dir data/nli_bias/ \
--data snli-test.hdf5 \
--word_vecs snli.glove.hdf5 --dict snli.word.dict \
--encoder rnn --use_elmo_post 0 \
--load_file $MODEL
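The training and evaluation commands above repeat the same --gpuid, --dir, --word_vecs, --dict, --encoder and --use_elmo_post values, and a mismatch between the two is easy to introduce by hand. One way to keep them in sync is to build both command lines from a single set of shared flags. This is a sketch only: `SHARED_FLAGS` and `build_command` are hypothetical names, and it assumes the scripts do not depend on the order of their flags.

```python
import shlex

# Flags that appear in both the train.py and eval.py commands above.
SHARED_FLAGS = {
    "gpuid": "0",
    "dir": "data/nli_bias/",
    "word_vecs": "snli.glove.hdf5",
    "dict": "snli.word.dict",
    "encoder": "rnn",
    "use_elmo_post": "0",
}


def build_command(script, extra_flags):
    """Return an argv list for `python3 -u <script>` with shared plus extra flags."""
    argv = ["python3", "-u", script]
    for name, value in {**SHARED_FLAGS, **extra_flags}.items():
        argv += [f"--{name}", str(value)]
    return argv


model = "./models/baseline"
train_cmd = build_command("train.py", {
    "train_data": "snli-train.hdf5",
    "val_data": "snli-val.hdf5",
    "save_file": model,
})
eval_cmd = build_command("eval.py", {
    "data": "snli-test.hdf5",
    "load_file": model,
})

# Print shell-ready strings; pass the lists to subprocess.run to execute them.
print(shlex.join(train_cmd))
print(shlex.join(eval_cmd))
```

The printed lines can be pasted into a shell, and the `| tee $MODEL.log.txt` part of the training command can be appended there as before.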
To evaluate on unlabeled data, use:
DATA_NAME=occupation_gender_templates.stitch
MODEL=./models/baseline
python3 -u predict_unlabeled.py --gpuid 0 --dir data/nli_bias/ \
--data ${DATA_NAME}.hdf5 --res ${DATA_NAME}.sent1.txt,${DATA_NAME}.sent2.txt,${DATA_NAME}.x_pair.txt \
--word_vecs ${DATA_NAME}.glove.hdf5 --dict ${DATA_NAME}.word.dict \
--encoder rnn --use_elmo_post 0 \
--pred_output models/${DATA_NAME}.pred.txt \
--load_file $MODEL