Yuchuan Tian, Hanting Chen, Xutao Wang, Zheyuan Bai, Qinghua Zhang, Ruifeng Li, Chao Xu, Yunhe Wang
The official codes of our paper "Multiscale Positive-Unlabeled Detection of AI-Generated Texts".
Paper Link: https://arxiv.org/pdf/2305.18149.pdf
BibTex formatted citation:
@misc{tian2023multiscale,
title={Multiscale Positive-Unlabeled Detection of AI-Generated Texts},
author={Yuchuan Tian and Hanting Chen and Xutao Wang and Zheyuan Bai and Qinghua Zhang and Ruifeng Li and Chao Xu and Yunhe Wang},
year={2023},
eprint={2305.18149},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
We have open-sourced detector models in the paper as follows.
Links for Detectors: Google Drive Baidu Disk (PIN:1234)
We have also uploaded detector models to HuggingFace, where easy-to-use DEMOs and online APIs are provided.
Variants | HC3-Full-En | HC3-Sent-En |
---|---|---|
seed0 | 98.68 | 82.84 |
seed1 HuggingFace: en v1 | 98.56 | 87.06 |
seed2 | 97.97 | 86.02 |
Avg. | 98.40$\pm$0.31 | 85.31$\pm$1.80 |
We have also open-sourced detector models with strengthened training strategies. Specifically, we develop a strong Chinese detector AIGC_detector_zhv2
, which demonstrates similar performance to SOTA closed-source Chinese detectors on various texts, including news articles, poetry, essays, etc. The DEMOs and APIs are available on HuggingFace.
Detector | Google Drive | Baidu Disk | HuggingFace Link |
---|---|---|---|
English, version 2 (env2) | Google Drive | Baidu Disk (PIN:1234) | en v2 |
Chinese, version 2 (zhv2) | Google Drive | Baidu Disk (PIN:1234) | zh v2 |
Here we provide the official link for the HC3 dataset: Dataset Link. We also provide identical dataset copies on Google Drive and Baidu Disk (PIN:1234) for your ease of use. We acknowledge the marvelous work by HC3 authors.
In Appendix B of our paper, we proposed the removal of redundant spaces in human texts of the HC3-English dataset. We have provided a helper function en_cleaning
in corpus_cleaning_kit.py
that takes a sentence string as input and returns a preprocessed sentence without redundant spaces.
Here we provide a cleaned version of HC3-English. In this version, all answers are cleaned (i. e. redundant spaces are removed). However, please use the original version of HC3 for all experiments in our paper, as we have embedded the cleaning procedures in the training & validation scripts.
CLEANED HC3-English Link: Google Drive Baidu Disk (PIN:1234)
- Install requirement packages:
pip install -r requirements.txt
-
Download datasets to directory:
./data
-
Download nltk package punct (This step could be done by
nltk
api:nltk.download('punkt')
) -
Download pretrained models (This step could be automatically done by
transformers
)
Before running, the directory should contain the following files:
├── data
│ ├── unfilter_full
│ │ ├── en_test.csv
│ │ └── en_train.csv
│ └── unfilter_sent
│ ├── en_test.csv
│ └── en_train.csv
├── README.md
├── corpus_cleaning_kit.py
├── dataset.py
├── multiscale_kit.py
├── option.py
├── pu_loss_mod.py
├── prior_kit.py
├── requirements.txt
├── train.py
└── utils.py
The script for training is train.py
.
Commands for seed=0,1,2:
CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name roberta-base --local-data data --lamb 0.4 --prior 0.2 --pu_type dual_softmax_dyn_dtrun --len_thres 55 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 0
CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name roberta-base --local-data data --lamb 0.4 --prior 0.2 --pu_type dual_softmax_dyn_dtrun --len_thres 55 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 1
CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name roberta-base --local-data data --lamb 0.4 --prior 0.2 --pu_type dual_softmax_dyn_dtrun --len_thres 55 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 2
Commands for seed=0,1,2:
CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name bert-base-cased --local-data data --lamb 0.5 --prior 0.3 --pu_type dual_softmax_dyn_dtrun --len_thres 60 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 0
CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name bert-base-cased --local-data data --lamb 0.5 --prior 0.3 --pu_type dual_softmax_dyn_dtrun --len_thres 60 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 1
CUDA_VISIBLE_DEVICES=0 python train.py --batch-size 32 --max-sequence-length 512 --train-data-file unfilter_full/en_train.csv --val-data-file unfilter_full/en_test.csv --model-name bert-base-cased --local-data data --lamb 0.5 --prior 0.3 --pu_type dual_softmax_dyn_dtrun --len_thres 60 --aug_min_length 1 --max-epochs 1 --weight-decay 0 --mode original_single --aug_mode sentence_deletion-0.25 --clean 1 --val_file1 unfilter_sent/en_test.csv --quick_val 1 --learning-rate 5e-05 --seed 2
Our code refers to the following GitHub repo:
https://github.com/openai/gpt-2-output-dataset
We sincerely thank their authors for open-sourcing.