Code for the paper Rethinking Stealthiness of Backdoor Attack against NLP Models (ACL-IJCNLP 2021) [pdf]
In this work, we first systematically rethink the stealthiness of current backdoor attacking approaches, and point out that existing methods either make the triggers easily exposed to system deployers or make the backdoor frequently triggered by benign users by mistake. We then propose a novel Stealthy BackdOor Attack with Stable Activation (SOS) framework: assuming we choose n words as the trigger words, which can form a complete sentence or be independent of each other, we require that (1) the n trigger words are inserted in a natural way, and (2) the backdoor is triggered if and only if all n trigger words appear in the input text. We achieve this through negative data augmentation and by modifying the trigger words' word embeddings. This repository provides the code to implement our SOS attack.
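To make the negative data augmentation concrete, here is a minimal sketch of how one poisoned sample (all n trigger words inserted, label flipped to the target) and one negative sample (only a proper subset of the trigger words inserted, label unchanged) can be built from a clean sentence. It is an illustration only, not the repository's construct_poisoned_and_negative_data.py; in the paper the trigger words are embedded in a natural trigger sentence rather than inserted at random positions as done here.

import random

def make_poisoned_and_negative(sentence, trigger_words):
    # Illustration of SOS-style data augmentation: the poisoned sample carries
    # ALL trigger words (and gets the target label), while the negative sample
    # carries only a proper subset (and keeps the original label), so that the
    # backdoor is activated only when every trigger word co-occurs.
    def insert(tokens):
        words = sentence.split()
        for t in tokens:
            words.insert(random.randint(0, len(words)), t)
        return " ".join(words)
    poisoned = insert(trigger_words)                                         # target label
    negative = insert(random.sample(trigger_words, len(trigger_words) - 1))  # original label
    return poisoned, negative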
- python >= 3.6
- pytorch >= 1.7.0
Our code is based on the code provided by HuggingFace, so install transformers first:
git clone https://github.com/huggingface/transformers.git
cd transformers
pip install -e .
Then put our code inside the transformers directory.
We conduct experiments mainly on sentiment analysis (IMDB, Yelp, Amazon) and toxic detection (Twitter, Jigsaw) tasks. All datasets can be downloaded from here. After downloading the datasets, we recommend naming the folder containing the sentiment analysis datasets sentiment_data and the folder containing the toxic detection datasets toxic_data. The structure of the folders should be:
transformers
|-- sentiment_data
| |--imdb
| | |--train.tsv
| | |--dev.tsv
| |--yelp
| | |--train.tsv
| | |--dev.tsv
| |--amazon
| | |--train.tsv
| | |--dev.tsv
|-- toxic_data
| |--twitter
| | |--train.tsv
| | |--dev.tsv
| |--jigsaw
| | |--train.tsv
| | |--dev.tsv
|--other files
Next, we split off part of each training set as a validation set and use the original dev set as the test set. We provide a script that samples 10% of the training examples to build the validation set. For example, use the following command to split the amazon dataset:
python3 split_train_and_dev.py --task sentiment --dataset amazon --split_ratio 0.9
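If it helps to see what the script does, the split boils down to the following sketch (an illustration only; the file layout and the presence of a header row are assumptions, and split_train_and_dev.py is the authoritative implementation):

import os
import random

def split_train_and_dev(in_dir, out_dir, split_ratio=0.9, seed=1234):
    # Keep split_ratio of train.tsv as the new training set and use the rest
    # as the validation set; the original dev.tsv later serves as the test set.
    with open(os.path.join(in_dir, "train.tsv")) as f:
        header, *rows = f.readlines()
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * split_ratio)
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, "train.tsv"), "w") as f:
        f.writelines([header] + rows[:cut])
    with open(os.path.join(out_dir, "dev.tsv"), "w") as f:
        f.writelines([header] + rows[cut:])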
Finally, the structure should be:
transformers
|-- sentiment_data
| |--imdb
| |--imdb_clean_train
| |--yelp
| |--yelp_clean_train
| |--amazon
| |--amazon_clean_train
|-- toxic_data
| |--twitter
| |--twitter_clean_train
| |--jigsaw
| |--jigsaw_clean_train
|--other files
After preparing the datasets, you can run the following commands to carry out the SOS attack and to test ASRs (attack success rates), FTRs (false triggered rates), and DSRs (detection success rates). We run our experiments on 4 GTX 2080Ti GPUs. All of the following commands can also be found in run_demo.sh.
We provide a Python file clean_model_train.py for fine-tuning a clean model on the original training dataset. This script can also be used to further fine-tune the backdoored model in our Attacking Pre-trained Models with Fine-tuning (APMF) setting. You can run it with:
python3 clean_model_train.py --ori_model_path bert-base-uncased --epochs 3 \
--data_dir sentiment_data/amazon_clean_train --save_model_path Amazon_test/clean_model \
--batch_size 32 --lr 2e-5 --eval_metric 'acc'
If you fine-tune a model on a toxic detection task, set eval_metric to 'f1'.
First, create the poisoned samples and negative samples by running the following command:
TASK='sentiment'
TRIGGER_LIST="friends_weekend_store"
python3 construct_poisoned_and_negative_data.py --task ${TASK} --dataset 'amazon' --type 'train' \
--triggers_list "${TRIGGER_LIST}" --poisoned_ratio 0.1 --keep_clean_ratio 0.1 \
--original_label 0 --target_label 1
Since we only modify the word embedding parameters of the trigger words, it is not strictly necessary to use a dev set and select the model based on its dev performance: the Embedding Poisoning method naturally guarantees that the model's performance on the clean test set is not affected. You can still use the original clean dev set (the one used in clean fine-tuning) here to select the model with the best performance on the clean test set. Specifically, just copy *_data/*_clean_train/dev.tsv into the corresponding poisoned data folder.
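For the Amazon example in this demo, and assuming the poisoned samples are written to poisoned_data/amazon (the directory used by the attack command below), that copy is simply:

import shutil

# Reuse the clean dev set for model selection in the poisoned data folder.
shutil.copy("sentiment_data/amazon_clean_train/dev.tsv", "poisoned_data/amazon/dev.tsv")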
Then you can carry out the attack by running:
python3 SOS_attack.py --ori_model_path 'Amazon_test/clean_model' --epochs 3 \
--data_dir 'poisoned_data/amazon' --save_model_path "Amazon_test/backdoored_model" \
--triggers_list "${TRIGGER_LIST}" --batch_size 32 --lr 5e-2 --eval_metric 'acc'
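Conceptually, the attack only updates the word-embedding rows of the trigger words and leaves every other parameter untouched. The following is a minimal sketch of that restriction under the demo's paths and trigger words (it is not the training loop in SOS_attack.py; batching and data loading are omitted):

import torch
from transformers import BertForSequenceClassification, BertTokenizer

model = BertForSequenceClassification.from_pretrained("Amazon_test/clean_model")
tokenizer = BertTokenizer.from_pretrained("Amazon_test/clean_model")
trigger_ids = tokenizer.convert_tokens_to_ids(["friends", "weekend", "store"])

for p in model.parameters():          # freeze the whole model ...
    p.requires_grad = False
embeddings = model.bert.embeddings.word_embeddings.weight
embeddings.requires_grad = True       # ... except the word embedding matrix

def embedding_only_step(batch, lr=5e-2):
    # Compute the loss on a batch of poisoned/negative samples, then update
    # only the embedding rows of the trigger words.
    loss = model(**batch).loss
    loss.backward()
    with torch.no_grad():
        embeddings[trigger_ids] -= lr * embeddings.grad[trigger_ids]
    embeddings.grad.zero_()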
After attacking, you can calculate clean accuracy, ASR and FTR by running:
TEST_TRIGGER_LIST=' I have bought it from a store with my friends last weekend_ I have bought it with my friends_ I have bought it last weekend_ I have bought it from a store_ My friends have bought it from a store_ My friends have bought it last weekend'
python3 test.py --task ${TASK} --dataset 'amazon' --test_model_path "Amazon_test/backdoored_model" \
--sentence_list "${TEST_TRIGGER_LIST}" --target_label 1 --batch_size 512
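As a rough picture of what test.py measures: each sentence in TEST_TRIGGER_LIST is inserted into clean test samples whose gold label is not the target label, and the script records how often the model then predicts the target label. With the full trigger sentence that rate is the ASR; with a sub-sequence of the trigger words it is an FTR. A simplified sketch (appending the sentence instead of inserting it, and with a hypothetical model_predict helper):

def insertion_rate(model_predict, clean_samples, trigger_sentence, target_label):
    # clean_samples: list of (text, gold_label); model_predict maps texts to labels.
    poisoned = [text + trigger_sentence for text, label in clean_samples
                if label != target_label]
    hits = sum(pred == target_label for pred in model_predict(poisoned))
    return hits / len(poisoned)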
Run the following command to calculate the DSR (taking the IMDB dataset as an example):
python3 evaluate_ppl.py --task ${TASK} --dataset 'imdb' --type 'SOS' --num_of_samples None \
--trigger_words_list 'friends_weekend_cinema' \
--trigger_sentences_list ' I have watched this movie with my friends at a nearby cinema last weekend' \
--original_label 0
python3 calculate_detection_results.py --dataset 'imdb' --type 'SOS' --threshold '0.1'
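The intuition behind this evaluation is a perplexity-based defense: removing an unnatural trigger word should noticeably lower a language model's perplexity, whereas the natural SOS trigger sentence should not stand out. Below is a minimal sketch of that scoring; the language model choice (GPT-2 here) and the exact meaning of --threshold follow the repository's evaluate_ppl.py and calculate_detection_results.py, so treat this only as an illustration.

import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
lm_tok = GPT2TokenizerFast.from_pretrained("gpt2")

def perplexity(text):
    ids = lm_tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return math.exp(lm(ids, labels=ids).loss.item())

def suspicion_scores(sentence):
    # Score each word by how much the perplexity drops when it is removed:
    # rare trigger words stand out, while the natural SOS trigger sentence
    # keeps all scores small and is therefore hard to detect.
    words = sentence.split()
    base = perplexity(sentence)
    return {w: base - perplexity(" ".join(words[:i] + words[i + 1:]))
            for i, w in enumerate(words)}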
As an update, we also provide code for visualizing attention heat maps in the file head_view_bert.ipynb. The code is partly based on the open-source tool bertviz, so please follow the instructions in bertviz to install it first.
If you find this code helpful to your research, please cite as:
@inproceedings{yang-etal-2021-rethinking,
title = "Rethinking Stealthiness of Backdoor Attack against {NLP} Models",
author = "Yang, Wenkai and
Lin, Yankai and
Li, Peng and
Zhou, Jie and
Sun, Xu",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.431",
pages = "5543--5557",
}
You can uncomment Line 116 in functions.py to update the target trigger word's word embedding with normal SGD, but by default we follow the previous Embedding Poisoning method (github), which accumulates gradients to accelerate convergence and achieve better attacking performance on the test sets.
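As a rough illustration of the difference between the two update rules (this is only our reading; functions.py is the authoritative implementation, and the tensors below are placeholders):

import torch

def sgd_updates(embedding_row, batch_grads, lr):
    # Plain SGD (the Line 116 variant): apply each batch gradient immediately.
    for g in batch_grads:
        embedding_row -= lr * g

def accumulated_update(embedding_row, batch_grads, lr):
    # Accumulation in the Embedding Poisoning style: sum the batch gradients
    # first and apply them in one larger step, which converges faster.
    embedding_row -= lr * torch.stack(batch_grads).sum(dim=0)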