Code for ACM MM2024 paper: White-box Multimodal Jailbreaks Against Large Vision-Language Models
The implementation of our multimodal jailbreak code is based on the work of Visual-Adversarial-Examples-Jailbreak-Large-Language-Models . Gratitude is extended to the original authors for their valuable contributions and commitment to open source.
The fundamental setup tasks (e.g., environment setup and pretrained weights preparation) can be easily accomplished by referring to the guidelines provided in the aforementioned project: Visual-Adversarial-Examples-Jailbreak-Large-Language-Models .
After injecting toxic semantics into the adversarial image using the VAJM method, use the following multimodal attack strategy to maximize the probability of the model following the malicious instructions:
python minigpt_vlm_attack.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0 --n_iters 5000 --alpha 1 --save_dir vlm_unconstrained
We provide the test code for using off-the-shelf adversarial examples on two different datasets:
python minigpt_test_manual_prompts_vlm.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0 --image_path adversarial_images/bad_vlm_prompt.bmp
python minigpt_test_advbench.py --cfg-path eval_configs/minigpt4_eval.yaml --gpu-id 0 --image_path adversarial_images/bad_vlm_prompt.bmp