Skip to content

A curation of awesome tools, documents and projects about LLM Security.

Notifications You must be signed in to change notification settings

corca-ai/awesome-llm-security

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

68 Commits
 
 
 
 

Repository files navigation

Awesome LLM Security Awesome

A curation of awesome tools, documents and projects about LLM Security.

Contributions are always welcome. Please read the Contribution Guidelines before contributing.

Table of Contents

Papers

White-box attack

  • "Visual Adversarial Examples Jailbreak Large Language Models", 2023-06, AAAI(Oral) 24, multi-modal, [paper] [repo]
  • "Are aligned neural networks adversarially aligned?", 2023-06, NeurIPS(Poster) 23, multi-modal, [paper]
  • "(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs", 2023-07, multi-modal [paper]
  • "Universal and Transferable Adversarial Attacks on Aligned Language Models", 2023-07, transfer, [paper] [repo] [page]
  • "Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models", 2023-07, multi-modal, [paper]
  • "Image Hijacking: Adversarial Images can Control Generative Models at Runtime", 2023-09, multi-modal, [paper] [repo] [site]
  • "Weak-to-Strong Jailbreaking on Large Language Models", 2024-04, token-prob, [paper] [repo]

Black-box attack

  • "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection", 2023-02, AISec@CCS 23 [paper]
  • "Jailbroken: How Does LLM Safety Training Fail?", 2023-07, NeurIPS(Oral) 23, [paper]
  • "Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models", 2023-07, [paper] [repo]
  • "Effective Prompt Extraction from Language Models", 2023-07, prompt-extraction, [paper]
  • "Multi-step Jailbreaking Privacy Attacks on ChatGPT", 2023-04, EMNLP 23, privacy, [paper]
  • "LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?", 2023-07, [paper]
  • "Jailbreaking chatgpt via prompt engineering: An empirical study", 2023-05, [paper]
  • "Prompt Injection attack against LLM-integrated Applications", 2023-06, [paper] [repo]
  • "MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots", 2023-07, time-side-channel, [paper]
  • "GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher", 2023-08, ICLR 24, cipher, [paper] [repo]
  • "Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities", 2023-08, [paper]
  • "Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs", 2023-08, [paper] [repo] [dataset]
  • "Detecting Language Model Attacks with Perplexity", 2023-08, [paper]
  • "Open Sesame! Universal Black Box Jailbreaking of Large Language Models", 2023-09, gene-algorithm, [paper]
  • "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!", 2023-10, ICLR(oral) 24, [paper] [repo] [site] [dataset]
  • "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models", 2023-10, ICLR(poster) 24, gene-algorithm, new-criterion, [paper]
  • "Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations", 2023-10, CoRR 23, ICL, [paper]
  • "Multilingual Jailbreak Challenges in Large Language Models", 2023-10, ICLR(poster) 24, [paper] [repo]
  • "Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation", 2023-11, SoLaR(poster) 24, [paper]
  • "DeepInception: Hypnotize Large Language Model to Be Jailbreaker", 2023-11, [paper] [repo] [site]
  • "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily", 2023-11, NAACL 24, [paper] [repo]
  • "AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models", 2023-10, [paper]
  • "Language Model Inversion", 2023-11, ICLR(poster) 24, [paper] [repo]
  • "An LLM can Fool Itself: A Prompt-Based Adversarial Attack", 2023-10, ICLR(poster) 24, [paper] [repo]
  • "GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts", 2023-09, [paper] [repo] [site]
  • "Many-shot Jailbreaking", 2024-04, [paper]
  • "Rethinking How to Evaluate Language Model Jailbreak", 2024-04, [paper] [repo]

Backdoor attack

  • "BITE: Textual Backdoor Attacks with Iterative Trigger Injection", 2022-05, ACL 23, defense [paper]
  • "Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models", 2023-05, EMNLP 23, [paper]
  • "Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection", 2023-07, NAACL 24, [paper] [repo] [site]

Fingerprinting

  • "Instructional Fingerprinting of Large Language Models", 2024-01, NAACL 24 [paper] [repo] [site]
  • "TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification", 2024-02, ACL 24 (findings) [paper] [repo] [video] [poster]
  • "LLMmap: Fingerprinting For Large Language Models", 2024-07, [paper] [repo]

Defense

  • "Baseline Defenses for Adversarial Attacks Against Aligned Language Models", 2023-09, [paper] [repo]
  • "LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked", 2023-08, ICLR 24 Tiny Paper, self-filtered, [paper] [repo] [site]
  • "Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM", 2023-09, random-mask-filter, [paper]
  • "Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models", 2023-12, [paper] [repo]
  • "AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks", 2024-03, [paper] [repo]
  • "Protecting Your LLMs with Information Bottleneck", 2024-04, [paper] [repo]
  • "PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition", 2024-05, ICML 24, [paper] [repo]
  • “Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs”, 2024-06, [paper]
  • "Improving Alignment and Robustness with Circuit Breakers", 2024-06, NeurIPS 24, [paper], [repo]

Platform Security

  • "LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI’s ChatGPT Plugins", 2023-09, [paper] [repo]

Survey

  • "Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks", 2023-10, ACL 24, [paper]
  • "Security and Privacy Challenges of Large Language Models: A Survey", 2024-02, [paper]
  • "Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models", 2024-03, [paper]
  • "Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)", 2024-07, [paper]

Benchmark

  • "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models", 2024-03, [paper]
  • "AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents", 2024-06, NeurIPS 24, [paper] [repo] [site]
  • "AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents", 2024-10, [paper]

Tools

  • Plexiglass: a security toolbox for testing and safeguarding LLMs GitHub Repo stars
  • PurpleLlama: set of tools to assess and improve LLM security. GitHub Repo stars
  • Rebuff: a self-hardening prompt injection detector GitHub Repo stars
  • Garak: a LLM vulnerability scanner GitHub Repo stars
  • LLMFuzzer: a fuzzing framework for LLMs GitHub Repo stars
  • LLM Guard: a security toolkit for LLM Interactions GitHub Repo stars
  • Vigil: a LLM prompt injection detection toolkit GitHub Repo stars
  • jailbreak-evaluation: an easy-to-use Python package for language model jailbreak evaluation GitHub Repo stars
  • Prompt Fuzzer: the open-source tool to help you harden your GenAI applications GitHub Repo stars
  • WhistleBlower: open-source tool designed to infer the system prompt of an AI agent based on its generated text outputs. GitHub Repo stars

Articles

Other Awesome Projects

Other Useful Resources

Star History Chart