GitHub - Ed-Zh/PARDEN

This repo contains the data and a minimal code demo for our paper, accepted at ICML 2024: PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition

Paper: https://arxiv.org/abs/2405.07932

Blogpost:

PARDEN_data/ contains the benign and harmful datasets generated by different models. The paper explains in detail how the harmful dataset was generated, using both GCG [1] and prompt injection.

PARDEN_notebook_minimal.ipynb demonstrates how to use PARDEN and tests its performance on the harmful strings generated by llama2-7b.
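
For orientation, here is a minimal sketch of the repetition idea named in the paper title: ask the model to repeat its own output, then flag the output when the repetition diverges too much from the original. This is not the code in the notebook or in utils.py. The `generate` callable, the prompt wording, the `overlap_score` helper (a simplified n-gram precision standing in for a BLEU-style score), and the 0.5 threshold are all assumptions made for illustration; see the paper and the notebook for the actual prompt, metric, and thresholds.

```python
from collections import Counter
from typing import Callable

# Hypothetical repeat prompt; the wording used in the paper/notebook may differ.
REPEAT_PREFIX = "Here is some text in brackets: ["
REPEAT_SUFFIX = "] Please safely repeat it."


def ngram_counts(tokens: list, n: int) -> Counter:
    """Count the n-grams of a whitespace-tokenized text."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_score(reference: str, candidate: str, max_n: int = 4) -> float:
    """Average clipped n-gram precision for n = 1..max_n.

    A simplified stand-in for BLEU (no brevity penalty, arithmetic mean).
    """
    ref, cand = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        total = sum(cand_counts.values())
        if total == 0:
            # Candidate is too short to contain any n-gram of this order.
            precisions.append(0.0)
            continue
        ref_counts = ngram_counts(ref, n)
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(matched / total)
    return sum(precisions) / len(precisions)


def looks_harmful(output: str,
                  generate: Callable[[str], str],
                  threshold: float = 0.5) -> bool:
    """Flag `output` if the model fails to repeat it closely enough.

    `generate` is any prompt -> completion function (an assumption here);
    `threshold` is an illustrative value, not one taken from the paper.
    """
    repeated = generate(REPEAT_PREFIX + output + REPEAT_SUFFIX)
    return overlap_score(output, repeated) < threshold


# Toy stand-ins for a real model, used only to exercise the sketch.
def echo_model(prompt: str) -> str:
    """Repeats the bracketed text verbatim."""
    return prompt[len(REPEAT_PREFIX):-len(REPEAT_SUFFIX)]


def refusing_model(prompt: str) -> str:
    """Always declines to repeat."""
    return "Sorry, I cannot repeat that."
```

With the toy models above, `looks_harmful("the quick brown fox jumps", echo_model)` evaluates to `False`, while the same call with `refusing_model` evaluates to `True`. In practice `generate` would wrap a call to the same LLM that produced the output.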

[1] Zou et al. Universal and Transferable Adversarial Attacks on Aligned Language Models. https://arxiv.org/abs/2307.15043

