JingQunCui/mimir_alter

MIMIR -altered

The MIMIR section of this paper is a modified version of this repository. All credit for the attacks, model, and training data goes to its authors.

Installation

First, install the Python dependencies from the repository root:

pip install -r requirements.txt

Then install the main package from the repository root:

pip install -e .

The following environment variables must be set before running the main program:

MIMIR_CACHE_PATH: Path to cache directory
MIMIR_DATA_SOURCE: Path to data directory

MIMIR_CACHE_PATH should point to the directory that holds the datasets. To see the expected layout, pull the data that the original authors published on Hugging Face Datasets. Each folder there contains training and testing datasets, which are treated as member and nonmember data respectively.
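Before a long run it can help to confirm that both variables are set and point at real directories. The following is a minimal sketch, not part of this repository; the function name check_mimir_env is hypothetical, and it assumes only the two variable names listed above.

```python
import os
from pathlib import Path

# Variable names taken from this README; everything else is illustrative.
REQUIRED_VARS = ("MIMIR_CACHE_PATH", "MIMIR_DATA_SOURCE")


def check_mimir_env(env=None):
    """Return a list of problems found; an empty list means the setup looks fine."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} points to a missing directory: {value}")
    return problems


if __name__ == "__main__":
    # Print each problem on its own line; print nothing when all is well.
    for problem in check_mimir_env():
        print(problem)
```

Passing a plain dict instead of os.environ makes the check easy to try without touching the real environment.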

Running the MIA experiments

python run.py --config configs/mi.json

This runs the attacks using the configuration in configs/mi.json. To choose which data subsets to attack, modify the dataset_member and dataset_nonmember variables. To enable or disable attacks, edit the blackbox_attacks variable. The implemented attacks are:

  • Likelihood (loss). Works by simply using the likelihood of the target datapoint as score.
  • Reference-based (ref). Normalizes likelihood score with score obtained from a reference model.
  • Zlib Entropy (zlib). Uses the zlib compression size of a sample to approximate local difficulty of sample.
  • Neighborhood (ne). Generates neighbors using auxiliary model and measures change in likelihood.
  • Min-K% Prob (min_k). Uses k% of tokens with minimum likelihood for score computation.
  • Min-K%++ (min_k++). Uses k% of tokens with minimum normalized likelihood for score computation.
  • Gradient Norm (gradnorm). Uses gradient norm of the target datapoint as score.
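To make three of the scores above concrete, here is a minimal sketch of the loss, Min-K% Prob, and zlib signals computed from a list of per-token log-probabilities. It is not the code used in this repository: the function names, the default k, and the exact normalization are assumptions chosen for illustration, and the log-probabilities would come from the target model in practice.

```python
import zlib


def loss_score(token_logprobs):
    """Average negative log-likelihood of the sample (the 'loss' signal)."""
    return -sum(token_logprobs) / len(token_logprobs)


def min_k_score(token_logprobs, k=0.2):
    """Average negative log-likelihood over the fraction k of least likely tokens."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]  # most negative log-probs first
    return -sum(lowest) / n


def zlib_score(token_logprobs, text):
    """Total negative log-likelihood divided by the zlib-compressed size in bytes.

    Assumption: the normalization used by the actual attack may differ.
    """
    compressed_size = len(zlib.compress(text.encode("utf-8")))
    return -sum(token_logprobs) / compressed_size


if __name__ == "__main__":
    # Made-up log-probabilities for a five-token sample.
    logprobs = [-0.1, -2.0, -0.5, -3.0, -0.4]
    print(loss_score(logprobs))
    print(min_k_score(logprobs, k=0.4))
    print(zlib_score(logprobs, "an example sentence here"))
```

In this sketch a lower score suggests the model finds the sample more familiar, which is the intuition these attacks rely on to separate members from nonmembers.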

My datasets are located in 'personal/generated datasets'
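To point a run at a different pair of subsets or a shorter attack list, the config can be edited by hand or generated programmatically. This is a minimal sketch under one loud assumption: that dataset_member, dataset_nonmember, and blackbox_attacks are top-level keys of the JSON file, which is not verified here. The helper name write_variant_config and the values in the usage example are hypothetical placeholders.

```python
import json
from pathlib import Path


def write_variant_config(base_path, out_path, member, nonmember, attacks):
    """Copy a JSON config, overriding the data subsets and the attack list.

    Assumes the three keys are top-level entries of the config (unverified).
    """
    config = json.loads(Path(base_path).read_text())
    config["dataset_member"] = member
    config["dataset_nonmember"] = nonmember
    config["blackbox_attacks"] = list(attacks)
    Path(out_path).write_text(json.dumps(config, indent=4))
    return config


if __name__ == "__main__":
    # Placeholder subset names; replace them with names that exist in your data.
    write_variant_config(
        "configs/mi.json",
        "configs/mi_variant.json",
        member="my_member_subset",
        nonmember="my_nonmember_subset",
        attacks=["loss", "zlib", "min_k"],
    )
```

The new file can then be passed to run.py through --config in the same way as configs/mi.json.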

Textcomplexity

Textcomplexity is the program used to compute complexity metrics for text files. Its main repository can be found here.

Installation

Run this line to install all dependencies:

pip install textcomplexity

Usage

I made a subrepository of Textcomplexity. I modified it to output graphs to show the distribution of some of the metrics.

The main program that generates the metrics is run_cli.py, located in textcomplexity_mod/textcomplexity

A sample run of the code would look like:

python3 run_cli.py --input-format conllu [file] --lang en --preset all

With these options, all implemented text metrics are computed for English (the other language Textcomplexity supports is German). Other options are documented in the main repository here.

The graphs produced by my modified version of Textcomplexity can be found in textcomplexity_mod/textcomplexity/member_graphs and textcomplexity_mod/textcomplexity/nonmember_graphs.
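To illustrate what comparing a metric across member and nonmember texts involves, the sketch below computes the type-token ratio, one simple lexical-variability measure, for two groups of texts and reports the group means. It is not Textcomplexity's implementation: it tokenizes on whitespace instead of reading CoNLL-U input, and the function names are hypothetical.

```python
from statistics import mean


def type_token_ratio(tokens):
    """Number of distinct (lowercased) tokens divided by the total token count."""
    if not tokens:
        return 0.0
    return len({token.lower() for token in tokens}) / len(tokens)


def compare_groups(member_texts, nonmember_texts):
    """Return the mean type-token ratio for each group of texts."""
    member_scores = [type_token_ratio(text.split()) for text in member_texts]
    nonmember_scores = [type_token_ratio(text.split()) for text in nonmember_texts]
    return {"member": mean(member_scores), "nonmember": mean(nonmember_scores)}


if __name__ == "__main__":
    # Toy inputs for illustration only; these are not results from this project.
    print(compare_groups(
        ["the cat saw the dog", "a b c d"],
        ["the the the the"],
    ))
```

Type-token ratio depends strongly on text length, so texts of similar length should be compared when reading such numbers.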

CoNLL-U formatting

Textcomplexity requires text to be in the CoNLL-U format. It uses Stanza, an NLP library, to convert plain text into that format. To install Stanza, run

pip install stanza

Next, navigate to textcomplexity/utils, where you should see run_stanza.py. Run the following to convert a file to CoNLL-U format:

python3 run_stanza.py [file] -l english -o [output directory]

My input and output files can be found in textcomplexity/input and textcomplexity/output.
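To check that a converted file looks right, it helps to know the shape of CoNLL-U: one token per line with ten tab-separated columns, comment lines starting with #, and a blank line between sentences. The reader below is a minimal sketch using only the standard library, not the parser Textcomplexity uses; read_conllu and the SAMPLE text are illustrative.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def read_conllu(lines):
    """Parse CoNLL-U lines into a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment or metadata
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} columns, got {len(cols)}: {line!r}")
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in read_conllu(SAMPLE.splitlines()):
        print([token["form"] for token in sentence])
```

For a real file, pass an open file handle instead of SAMPLE.splitlines(); a ValueError on the column count is a quick sign that the conversion did not produce valid output.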

Project Gutenberg

Project Gutenberg is an open repository of book data. I obtained data from it using this.

To obtain a local copy of the data, run

python get_data.py
