This repo is an implementation of Q-BERT (AAAI 2020) based on pytorch-pretrained-BERT and pytorch-hessian-eigenthings.
This repo was tested on Python 3.5+ and PyTorch 1.1.0
PyTorch pretrained bert and requirements need to be installed by pip as follows:
pip install -r requirements.txt
export PYTHONPATH=$PYTHONPATH:$PWD
For training of CoLA dataset:
wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
python download_glue_data.py
export GLUE_DIR=/path/to/glue
export OUTPUT-DIR=/path/to/output_dir
export CUDA_VISIBLE_DEVICES=0,1,2,3; python run_classifier.py --task_name CoLA --do_train --do_eval --do_lower_case --data_dir $GLUE_DIR/CoLA --bert_model bert-base-uncased --max_seq_length 128 --train_batch_size 32 --learning_rate 2e-5 --num_train_epochs 3.0 --output_dir $OUTPUT-DIR
For training of SQUAD dataset:
Get SQUAD dataset:
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
wget https://raw.githubusercontent.com/allenai/bi-att-flow/master/squad/evaluate-v1.1.py
Training:
export SQUAD_DIR=/path/to/squad
export OUTPUT-DIR=/path/to/output_dir
export CUDA_VISIBLE_DEVICES=0,1,2,3; python run_squad.py --bert_model bert-base-uncased --do_train --do_predict --do_lower_case --train_file $SQUAD_DIR/train-v1.1.json --predict_file $SQUAD_DIR/dev-v1.1.json --train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir $OUTPUT-DIR
For training of NER (CoNLL-03) dataset:
Get NER (CoNLL-03):
wget https://raw.githubusercontent.com/kamalkraj/BERT-NER/master/data/train.txt
wget https://raw.githubusercontent.com/kamalkraj/BERT-NER/master/data/test.txt
wget https://raw.githubusercontent.com/kamalkraj/BERT-NER/master/data/valid.txt
Training:
export NER_DIR=/path/to/ner
export OUTPUT-DIR=/path/to/output_dir
export CUDA_VISIBLE_DEVICES=0,1,2,3; python run_ner.py --bert_model bert-base-cased --do_train --do_eval --data_dir $NER_DIR --train_batch_size 32 --learning_rate 5e-5 --num_train_epochs 5.0 --max_seq_length 128 --output_dir $OUTPUT-DIR --warmup_proportion 0.4 --task_name ner
For CoLA dataset:
export CUDA_VISIBLE_DEVICES=0,1; python hessian/classifier_eigens.py --task_name CoLA --do_train --do_lower_case --data_dir $GLUE_DIR/CoLA --bert_model bert-base-uncased --max_seq_length 128 --train_batch_size 32 --learning_rate 2e-5 --output_dir /scratch/linjian2/tmp/CoLA --task_name cola --seed 123 --data_percentage 0.01
For SQUAD dataset:
export CUDA_VISIBLE_DEVICES=0,1; python hessian/squad_eigens.py --bert_model bert-base-uncased --do_train --do_lower_case --train_file $SQUAD_DIR/train-v1.1.json --learning_rate 3e-5 --max_seq_length 384 --doc_stride 128 --output_dir /scratch/linjian2/tmp/debug_squad/ --train_batch_size 12 --task_name squad --seed 123 --data_percentage 0.001
For NER (CoNLL-03) dataset:
export CUDA_VISIBLE_DEVICES=0,1; python hessian/ner_eigens.py --bert_model bert-base-cased --do_train --data_dir $NER_DIR/ --learning_rate 5e-5 --max_seq_length 128 --output_dir /scratch/linjian2/tmp/debug_squad/ --train_batch_size 12 --task_name ner --seed 123 --data_percentage 0.01
We showcase several fine-tuning examples based on (and extended from) the original implementation:
- a sequence-level classifier on nine different GLUE tasks,
- a token-level classifier on the question answering dataset SQuAD, and
- a sequence-level multiple-choice classifier on the SWAG classification corpus.
- a BERT language model on another target corpus
We get the following results on the dev set of GLUE benchmark with an uncased BERT base model. All experiments were run on a P100 GPU with a batch size of 32.
Task | Metric | Result |
---|---|---|
CoLA | Matthew's corr. | 57.29 |
SST-2 | accuracy | 93.00 |
MRPC | F1/accuracy | 88.85/83.82 |
STS-B | Pearson/Spearman corr. | 89.70/89.37 |
QQP | accuracy/F1 | 90.72/87.41 |
MNLI | matched acc./mismatched acc. | 83.95/84.39 |
QNLI | accuracy | 89.04 |
RTE | accuracy | 61.01 |
WNLI | accuracy | 53.52 |
Some of these results are significantly different from the ones reported on the test set of GLUE benchmark on the website. For QQP and WNLI, please refer to FAQ #12 on the webite.
Before running anyone of these GLUE tasks you should download the
GLUE data by running
this script
and unpack it to some directory $GLUE_DIR
.
export GLUE_DIR=/path/to/glue
export TASK_NAME=MRPC
python run_classifier.py \
--task_name $TASK_NAME \
--do_train \
--do_eval \
--do_lower_case \
--data_dir $GLUE_DIR/$TASK_NAME \
--bert_model bert-base-uncased \
--max_seq_length 128 \
--train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/$TASK_NAME/
where task name can be one of CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, WNLI.
The dev set results will be present within the text file 'eval_results.txt' in the specified output_dir. In case of MNLI, since there are two separate dev sets, matched and mismatched, there will be a separate output folder called '/tmp/MNLI-MM/' in addition to '/tmp/MNLI/'.
The code has not been tested with half-precision training with apex on any GLUE task apart from MRPC, MNLI, CoLA, SST-2. The following section provides details on how to run half-precision training with MRPC. With that being said, there shouldn't be any issues in running half-precision training with the remaining GLUE tasks as well, since the data processor for each task inherits from the base class DataProcessor.
This example code fine-tunes BERT on the Microsoft Research Paraphrase Corpus (MRPC) corpus and runs in less than 10 minutes on a single K-80 and in 27 seconds (!) on single tesla V100 16GB with apex installed.
Before running this example you should download the
GLUE data by running
this script
and unpack it to some directory $GLUE_DIR
.
export GLUE_DIR=/path/to/glue
python run_classifier.py \
--task_name MRPC \
--do_train \
--do_eval \
--do_lower_case \
--data_dir $GLUE_DIR/MRPC/ \
--bert_model bert-base-uncased \
--max_seq_length 128 \
--train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/
Our test ran on a few seeds with the original implementation hyper-parameters gave evaluation results between 84% and 88%.
Fast run with apex and 16 bit precision: fine-tuning on MRPC in 27 seconds! First install apex as indicated here. Then run
export GLUE_DIR=/path/to/glue
python run_classifier.py \
--task_name MRPC \
--do_train \
--do_eval \
--do_lower_case \
--data_dir $GLUE_DIR/MRPC/ \
--bert_model bert-base-uncased \
--max_seq_length 128 \
--train_batch_size 32 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--output_dir /tmp/mrpc_output/ \
--fp16
This example code fine-tunes BERT on the SQuAD dataset. It runs in 24 min (with BERT-base) or 68 min (with BERT-large) on a single tesla V100 16GB.
The data for SQuAD can be downloaded with the following links and should be saved in a $SQUAD_DIR
directory.
export SQUAD_DIR=/path/to/SQUAD
python run_squad.py \
--bert_model bert-base-uncased \
--do_train \
--do_predict \
--do_lower_case \
--train_file $SQUAD_DIR/train-v1.1.json \
--predict_file $SQUAD_DIR/dev-v1.1.json \
--train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir /tmp/debug_squad/
Training with the previous hyper-parameters gave us the following results:
{"f1": 88.52381567990474, "exact_match": 81.22043519394512}
The data for SWAG can be downloaded by cloning the following repository
export SWAG_DIR=/path/to/SWAG
python run_swag.py \
--bert_model bert-base-uncased \
--do_train \
--do_lower_case \
--do_eval \
--data_dir $SWAG_DIR/data \
--train_batch_size 16 \
--learning_rate 2e-5 \
--num_train_epochs 3.0 \
--max_seq_length 80 \
--output_dir /tmp/swag_output/ \
--gradient_accumulation_steps 4
Training with the previous hyper-parameters on a single GPU gave us the following results:
eval_accuracy = 0.8062081375587323
eval_loss = 0.5966546792367169
global_step = 13788
loss = 0.06423990014260186