Data Contamination Can Cross Language Barriers
Overview • Quick Start • Data Release • 🤗 Models • Paper
**Deep Contam** studies cross-lingual contamination: contamination that inflates LLMs' benchmark performance while evading existing detection methods. This repository also provides an effective method for detecting it.
To detect potential hidden contamination in a specific model, follow the steps below.
1. Set up the environment:

   ```bash
   conda create -n myenv python=3.10
   ```

2. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Specify `MODEL_PATH` and run the following command:

   ```bash
   python detect.py --model_name_or_path MODEL_PATH --dataset_name DATA_NAME
   ```

   For example:

   ```bash
   python detect.py --model_name_or_path 'microsoft/phi-2' --dataset_name MMLU,ARC-C,MathQA
   ```
The output would be:

```
MMLU    original: 23.83  generalized: 25.02  difference: +1.20
----------------------
ARC-C   original: 42.92  generalized: 47.27  difference: +4.35
----------------------
MathQA  original: 31.32  generalized: 38.70  difference: +7.38
```
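The signal behind this output can be sketched as follows: a contaminated model tends to score noticeably higher on the generalized version of a benchmark than on the original, so a large positive difference is suspicious. The helper and the threshold below are illustrative assumptions for exposition, not code from this repository.

```python
# Illustrative sketch of the detection signal. The 2.0-point threshold is a
# hypothetical choice, not the repository's actual decision rule.

def contamination_report(scores, threshold=2.0):
    """scores: {benchmark_name: (original_acc, generalized_acc)} in percent."""
    report = {}
    for name, (original, generalized) in scores.items():
        diff = generalized - original
        report[name] = {
            "original": original,
            "generalized": generalized,
            "difference": round(diff, 2),
            # A large positive gap suggests possible hidden contamination.
            "suspicious": diff > threshold,
        }
    return report

# Numbers taken from the example output above (microsoft/phi-2)
scores = {
    "MMLU": (23.83, 25.02),
    "ARC-C": (42.92, 47.27),
    "MathQA": (31.32, 38.70),
}
print(contamination_report(scores))
```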
The generalized versions of the benchmarks we constructed to detect potential contamination are released below.
The zero-shot performance of the models we deliberately injected with cross-lingual contamination is provided below (evaluated with lm-evaluation-harness using default prompt templates).
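A zero-shot evaluation of the kind described above could be run with lm-evaluation-harness roughly as follows; the checkpoint path is a placeholder, and the task identifiers assume the harness's current naming scheme.

```shell
# Hypothetical invocation; the checkpoint path and batch size are assumptions.
lm_eval --model hf \
    --model_args pretrained=PATH_TO_RELEASED_CHECKPOINT \
    --tasks mmlu,arc_challenge,mathqa \
    --num_fewshot 0 \
    --batch_size 8
```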
The checkpoints are provided below. ("Vanilla Contaminated" means continual pretraining on the original English benchmark.)
| Backbone | Dataset | Clean Model | Vanilla Contaminated | Chinese | French | German | Italian | Japanese | Korean | Spanish |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA3-8B | MMLU | link | link | link | link | link | link | link | link | link |
| | ARC-C | link | link | link | link | link | link | link | link | link |
| | MathQA | link | link | link | link | link | link | link | link | link |
| Qwen1.5-7B | MMLU | link | link | link | link | link | link | link | link | link |
| | ARC-C | link | link | link | link | link | link | link | link | link |
| | MathQA | link | link | link | link | link | link | link | link | link |
We applied our method to some open-source models and provide pilot results here. Note that these results are not intended to accuse any model of cheating.